Research · Jun 16, 2026

Google DeepMind releases DiffusionGemma, an experimental open model for 4x faster text generation

The 26B MoE model uses diffusion-based generation to deliver up to 4x faster inference on GPUs, trading some output quality for speed in interactive local workflows.

Trust: 79
Hype: Low

1 source · cross-referenced

TL;DR
  • Google DeepMind released DiffusionGemma, an experimental open model that uses diffusion-based text generation to achieve up to 4x faster inference on dedicated GPUs compared to autoregressive LLMs.

Google DeepMind today introduced DiffusionGemma, an experimental open model that explores text diffusion as a faster alternative to standard autoregressive large language models. Released under an Apache 2.0 license, the 26B Mixture of Experts (MoE) model departs from the token-by-token generation of typical LLMs by producing entire blocks of text simultaneously, enabling up to 4x faster text generation on GPUs.

The model is designed for researchers and developers focused on speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures. While autoregressive Gemma 4 models remain the standard for high-quality production outputs, DiffusionGemma targets scenarios where low latency is paramount, even at the cost of some output quality.

DiffusionGemma achieves its speed gains by shifting the decode bottleneck from memory bandwidth to compute: because each forward pass produces a whole block of tokens rather than a single one, the cost of reading the weights from memory is spread across the block. The model generates over 1,000 tokens per second on a single NVIDIA H100 and more than 700 tokens per second on an NVIDIA GeForce RTX 5090. As a 26B MoE model that activates only 3.8B parameters during inference, it fits within an 18GB VRAM footprint when quantized, making it accessible on high-end consumer GPUs.
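The footprint figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates the memory needed for the weights alone at a few bit-widths. The parameter counts come from the article; the bit-widths and the `weight_footprint_gb` helper are illustrative assumptions, since the announcement as summarized here does not say which quantization scheme produces the 18GB number.

```python
def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold the weights alone, in decimal GB."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9    # from the article: 26B total parameters
ACTIVE_PARAMS = 3.8e9  # from the article: 3.8B activated during inference

# Bit-widths are illustrative assumptions; the article does not state
# which quantization scheme yields the 18GB figure.
for bits in (16, 8, 4):
    gb = weight_footprint_gb(TOTAL_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gb:.1f} GB")

# Share of the parameters that take part in any single forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction: {active_fraction:.1%}")
```

Under these assumptions, only the 4-bit case (about 13 GB of weights) lands below 18GB, which would leave roughly 5 GB for activations and runtime overhead. That is an inference from the arithmetic, not a detail the source confirms. Note also that an MoE model typically has to keep all of its experts in memory, so the 3.8B active parameters reduce compute per pass rather than the storage requirement.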

A key architectural feature is bidirectional attention: the model generates 256 tokens in parallel per forward pass, with every token attending to all others. This gives it an advantage in non-linear domains such as in-line editing, code infilling, amino acid sequences, and mathematical graphs. The model also supports self-correction by iteratively refining its output in real time.
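To make the contrast with token-by-token decoding concrete, here is a toy sketch of block-parallel generation. It is a generic masked-diffusion-style loop, not DiffusionGemma's actual algorithm: `toy_denoiser` is a hypothetical stand-in that returns random proposals, the step count of 8 is an arbitrary assumption, and the loop omits the self-correction of already committed tokens described above. Only the block size of 256 comes from the article.

```python
import random

BLOCK_SIZE = 256  # from the article: tokens generated in parallel per pass
MASK = None       # marks a position that has not been committed yet


def toy_denoiser(block, rng):
    """Hypothetical stand-in for one forward pass.

    A real model would attend bidirectionally over the whole block; here
    we only propose a random (token, confidence) pair for every position.
    """
    return [(rng.randrange(1000), rng.random()) for _ in block]


def diffusion_decode(steps, rng):
    """Fill one block over `steps` passes, committing the most confident
    proposals among the still-masked positions at each pass."""
    block = [MASK] * BLOCK_SIZE
    passes = 0
    for step in range(steps):
        proposals = toy_denoiser(block, rng)
        passes += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        # Commit an even share of what is left; the last pass takes the rest.
        quota = -(-len(masked) // (steps - step))  # ceiling division
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:quota]:
            block[i] = proposals[i][0]
    return block, passes


block, passes = diffusion_decode(steps=8, rng=random.Random(0))
filled = sum(tok is not MASK for tok in block)
print(f"filled {filled} tokens in {passes} forward passes")
print(f"token-by-token decoding would need {BLOCK_SIZE} passes for the same block")
```

The point of the sketch is the pass count: the number of forward passes is set by the refinement schedule rather than by the number of tokens, which is where a speedup over one-token-per-pass decoding would come from. How many passes the real model needs per block is not stated in the source.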

Google DeepMind notes that DiffusionGemma is experimental and that its overall output quality is lower than that of standard Gemma 4. For applications requiring maximum quality, the company recommends deploying standard Gemma 4. Performance on specific tasks can be improved through fine-tuning: in one example, Unsloth fine-tuned DiffusionGemma to solve Sudoku, a task that autoregressive models typically struggle with because of its dependency chains.

Sources
  1. Google DeepMind Blog, "DiffusionGemma: 4x faster text generation"

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.
