Models · Apr 18, 2026

Qwen3.6-35B-A3B outperforms Claude Opus 4.7 on image generation tasks in informal testing

A developer's informal comparison of two newly released models using visual generation benchmarks shows the smaller Qwen model producing superior outputs in specific creative tasks, though broader capabilities remain unclear.

Trust54

HypeSome hype

1 source

ShareX LinkedIn Email

TL;DR

Simon Willison tested Qwen3.6-35B-A3B and Claude Opus 4.7 using his informal 'pelican riding a bicycle' benchmark, finding the Qwen model generated cleaner SVG illustrations.
In a secondary test asking for a flamingo riding a unicycle, Qwen again produced output Willison rated as superior, with more creative SVG comments and visual flair.
Willison cautioned against reading too much into the results, noting the pelican benchmark was designed as a joke and acknowledging he doubts the quantized 35B model is generally more capable than Anthropic's proprietary release.
The comparison highlights how narrow task-specific performance can diverge from overall model utility, with Willison noting the long-standing correlation between pelican quality and general usefulness may have broken down.

Alibaba's Qwen3.6-35B-A3B, released alongside Claude Opus 4.7, was subjected to Willison's established—if tongue-in-cheek—visual generation test. Running a 20.9GB quantized version locally via LM Studio on an M5 MacBook Pro, the model produced an SVG drawing of a pelican on a bicycle that correctly rendered the bicycle frame and included contextual details like clouds and a caption. Anthropic's new flagship model stumbled on the same task, generating a distorted bicycle frame in its initial attempt and producing a similar error even when invoked with maximum reasoning mode.

Concerned that the labs might deliberately optimize for his quirky benchmark—a worry he has addressed before—Willison deployed an undisclosed secondary test asking for an SVG of a flamingo on a unicycle. Again, Qwen3.6-35B-A3B delivered output with greater visual personality and technical finesse, including a well-placed SVG comment. Opus 4.7 produced a competent but unremarkable illustration lacking decorative elements.

Willison framed the findings within the proper context of his benchmark's limitations. He emphasized that the pelican test originated as satire about the absurdity of model comparison itself, yet acknowledged that historically, pelican quality has tracked with general model utility improvements over time. However, he explicitly rejected the inference that a quantized 35B model now outperforms Anthropic's latest proprietary system in practical capability, noting skepticism that the correlation holds in this case.

Sources

01Simon Willison — everything — Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

Also on Models

Qwen3.6-35B-A3B outperforms Claude Opus 4.7 on image generation tasks in informal testing

Claude Code confirmed using Bun’s Rust port in production

Moonshot AI releases Kimi K3 open source model, touting frontier-level performance

OpenAI CFO proposes scorecard to measure AI ROI