Qwen3.6-35B-A3B outperforms Claude Opus 4.7 on SVG illustration tasks in informal testing
A developer's informal comparison of two newly released models on SVG drawing prompts shows the smaller Qwen model producing better output on those specific creative tasks, though how the two compare on broader capabilities remains unclear.
1 source
- Simon Willison tested Qwen3.6-35B-A3B and Claude Opus 4.7 using his informal 'pelican riding a bicycle' benchmark, finding the Qwen model generated cleaner SVG illustrations.
- In a secondary test asking for a flamingo riding a unicycle, Qwen again produced output Willison rated as superior, with more creative SVG comments and visual flair.
- Willison cautioned against reading too much into the results: the pelican benchmark was designed as a joke, and he doubts the quantized 35B model is generally more capable than Anthropic's proprietary release.
- The comparison highlights how narrow task-specific performance can diverge from overall model utility, with Willison noting the long-standing correlation between pelican quality and general usefulness may have broken down.
Willison ran his established, if tongue-in-cheek, visual generation test on Alibaba's Qwen3.6-35B-A3B, which was released alongside Claude Opus 4.7. Running as a 20.9GB quantized version locally via LM Studio on an M5 MacBook Pro, the model produced an SVG drawing of a pelican on a bicycle that rendered the bicycle frame correctly and included contextual details such as clouds and a caption. Anthropic's new flagship model stumbled on the same task: its first attempt had a distorted bicycle frame, and it made a similar error even when run in its maximum reasoning mode.
Concerned that the labs might deliberately optimize for his quirky benchmark, a worry he has addressed before, Willison ran a previously undisclosed second test asking for an SVG of a flamingo on a unicycle. Again, Qwen3.6-35B-A3B delivered output with greater visual personality and technical finesse, including a well-placed SVG comment. Opus 4.7 produced a competent but unremarkable illustration with no decorative elements.
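For readers who want to repeat this kind of comparison, the mechanical part is small: pull the SVG out of each model's reply, confirm it parses, and look at what it contains. The following is a minimal sketch under stated assumptions, not Willison's tooling. The helper names `extract_svg` and `summarize_svg` are hypothetical, the sample reply is invented for illustration, and the element and comment counts are only rough signals that do not replace looking at the rendered image.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Matches the first <svg>...</svg> block, even if the reply wraps it in prose.
SVG_BLOCK = re.compile(r"<svg\b.*?</svg>", re.DOTALL | re.IGNORECASE)
# Matches XML comments such as <!-- wheel -->.
XML_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def extract_svg(reply: str) -> Optional[str]:
    """Return the first <svg>...</svg> block in a model reply, or None."""
    match = SVG_BLOCK.search(reply)
    return match.group(0) if match else None


def summarize_svg(svg: str) -> dict:
    """Parse the SVG and report a few rough stats about its contents."""
    root = ET.fromstring(svg)  # raises ParseError if the markup is malformed
    # Strip the "{namespace}" prefix ElementTree adds to each tag name.
    tags = [el.tag.split("}")[-1] for el in root.iter()]
    comments = [c.strip() for c in XML_COMMENT.findall(svg)]
    return {"elements": len(tags), "tags": sorted(set(tags)), "comments": comments}


# Hypothetical reply text, invented for illustration (not real model output).
reply = """Here is your drawing:
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <!-- wheel -->
  <circle cx="50" cy="80" r="15" fill="none" stroke="black"/>
  <!-- body -->
  <ellipse cx="50" cy="40" rx="12" ry="18" fill="pink"/>
</svg>
Hope you like it!"""

svg = extract_svg(reply)
summary = summarize_svg(svg)
print(summary)
```

Running the same two helpers over each model's reply to an identical prompt gives a side-by-side summary, but the judgment Willison describes (frame geometry, visual personality) still requires viewing the rendered SVG.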
Willison put the findings in the context of his benchmark's limitations. He emphasized that the pelican test originated as satire about the absurdity of model comparison itself, yet acknowledged that pelican quality has historically tracked improvements in general model utility. He explicitly rejected the inference that a quantized 35B model now outperforms Anthropic's latest proprietary system in practical capability, and said he doubts the correlation holds in this case.