Google DeepMind releases DiffusionGemma, an experimental open model for 4x faster text generation
The 26B MoE model uses diffusion-based generation to deliver up to 4x faster inference on GPUs, trading some output quality for speed in interactive local workflows.
- Google DeepMind released DiffusionGemma, an experimental open model that uses diffusion-based text generation to achieve up to 4x faster inference on dedicated GPUs compared to autoregressive LLMs.
Google DeepMind today introduced DiffusionGemma, an experimental open model that explores text diffusion as a faster alternative to standard autoregressive large language models. Released under an Apache 2.0 license, the 26B Mixture of Experts (MoE) model departs from the token-by-token generation of typical LLMs by producing entire blocks of text simultaneously, enabling up to 4x faster text generation on GPUs.
The model is designed for researchers and developers focused on speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures. While autoregressive Gemma 4 models remain the standard for high-quality production outputs, DiffusionGemma targets scenarios where low latency is paramount, even at the cost of some output quality.
DiffusionGemma achieves its speed gains by shifting the decode bottleneck from memory bandwidth to compute, generating over 1,000 tokens per second on a single NVIDIA H100 and more than 700 tokens per second on an NVIDIA GeForce RTX 5090. Although it is a 26B MoE model, it activates only 3.8B parameters during inference and fits within an 18 GB VRAM footprint when quantized, making it accessible on high-end consumer GPUs.
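For a rough sense of how a 26B-parameter model can land under 18 GB, the back-of-the-envelope sketch below estimates weight memory at a few bit widths. The bit widths are assumptions for illustration, since the announcement does not say which quantization scheme produces the 18 GB figure, and the estimate covers weights only, not activations or runtime overhead.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9    # from the article: 26B total parameters
ACTIVE_PARAMS = 3.8e9  # from the article: 3.8B parameters active during inference

# Assumed bit widths; the article does not name the quantization format.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")

# Share of the weights that take part in each forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction: {active_fraction:.1%}")
```

Under these assumptions, 4-bit weights come to about 13 GB, which would leave headroom inside an 18 GB budget, while 8-bit weights alone would already exceed it. Only about 15% of the parameters are active per pass, which reduces compute per pass, but in a typical MoE deployment the full set of expert weights still has to be held in memory.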
A key architectural feature is bi-directional attention, which allows the model to generate 256 tokens in parallel per forward pass, with every token attending to all others. This enables advantages in non-linear domains such as in-line editing, code infilling, amino acid sequences, and mathematical graphs. The model also supports intelligent self-correction by iteratively refining its output in real time.
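To illustrate how parallel block generation differs from token-by-token decoding, here is a toy sketch of confidence-based unmasking, a schedule commonly used in masked diffusion language models. It is not DiffusionGemma's actual algorithm, which the announcement does not describe: `toy_denoiser` is a hypothetical stand-in that already knows the answer and attaches random confidences, and the sketch omits the self-correction step entirely.

```python
import random


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for one model forward pass.

    Returns a (token, confidence) proposal for every position at once.
    A real model would compute these with bidirectional attention over
    the whole block; here the tokens are copied from `target`.
    """
    return [(target[i], rng.random()) for i in range(len(block))]


def parallel_decode(target, steps=4, seed=0):
    """Fill a fully masked block in roughly `steps` forward passes."""
    rng = random.Random(seed)
    block = [None] * len(target)          # None marks a still-masked position
    per_step = -(-len(target) // steps)   # ceiling division
    passes = 0
    while None in block:
        proposals = toy_denoiser(block, target, rng)
        passes += 1
        masked = [i for i, tok in enumerate(block) if tok is None]
        # Commit the most confident masked positions, in any order,
        # rather than strictly left to right.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = proposals[i][0]
    return "".join(block), passes


text, passes = parallel_decode("in-line editing!", steps=4)
print(text, passes)
```

In this toy run the 16-character block is completed in four passes, where a strictly autoregressive loop would need one pass per token. The positions are filled in confidence order rather than left to right, which is the property that makes infilling and in-line editing a natural fit.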
Google DeepMind notes that DiffusionGemma is experimental and that its overall output quality is lower than that of standard Gemma 4. For applications requiring maximum quality, the company recommends deploying standard Gemma 4. Performance on specific tasks can be improved through fine-tuning: in one example, Unsloth fine-tuned DiffusionGemma to solve Sudoku, a task that autoregressive models typically struggle with because of its dependency chains.