Tools · Apr 28, 2026

Hugging Face releases Transformers v5.7.0 with new model support and bug fixes

The Transformers library v5.7.0 adds support for Laguna and DEIMv2 models alongside attention, tokenizer, and generation improvements. Released April 28, the update includes multiple kernel optimizations and bug fixes affecting models including T5Gemma2, Qwen3.5, and GraniteMoeHybrid.

TL;DR
  • Transformers v5.7.0 added Poolside's Laguna mixture-of-experts model and DEIMv2 real-time object detection model to the library.
  • The release fixed cross-attention cache issues in T5Gemma2 for long inputs, linear attention bugs in Qwen3.5, and crashes in GraniteMoeHybrid attention-only variants.
  • Continuous batching generation received corrections for KV deduplication and memory estimation on long sequences (16K+), with misleading warnings removed.
  • Kernel support was improved for FP8 checkpoints like Qwen3.5-35B-A3B-FP8, and custom expert kernels from the Hugging Face Hub can now be loaded properly.
  • A change that caused AutoTokenizer to initialize the wrong tokenizer class, producing regressions in models like DeepSeek R1, was reverted.

Hugging Face released version 5.7.0 of its Transformers library on April 28, 2026, introducing support for two new model families and addressing multiple critical bugs. The release added Laguna, Poolside's mixture-of-experts language model, and DEIMv2, a real-time object detection architecture spanning eight model sizes from X to Atto. Laguna implements a sigmoid mixture-of-experts router with per-layer head count flexibility, while DEIMv2 extends DETR capabilities with features adapted from DINOv3, achieving 57.8 AP with only 50.3M parameters in its X variant.
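
Both architectures should load through the library's standard pipeline API. A minimal sketch follows; the checkpoint IDs are placeholders, since the release notes do not specify exact Hub repository names:

```python
from transformers import pipeline

# Placeholder Hub IDs -- the actual Laguna and DEIMv2 checkpoint names
# may differ; substitute the repositories published by their authors.
LAGUNA_CKPT = "poolside/laguna"    # hypothetical MoE language model ID
DEIMV2_CKPT = "org/deimv2-x"       # hypothetical detector ID, X variant

# Text generation with the mixture-of-experts language model.
generator = pipeline("text-generation", model=LAGUNA_CKPT)
print(generator("Continuous batching lets servers", max_new_tokens=20))

# DETR-style object detection: returns boxes, labels, and scores.
detector = pipeline("object-detection", model=DEIMV2_CKPT)
print(detector("http://images.cocodataset.org/val2017/000000039769.jpg"))
```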

The update fixed several attention-mechanism bugs affecting production models. T5Gemma2 had a cross-attention cache layer-type error on long inputs, Qwen3.5's gated-delta-net linear attention produced incorrect results in cached forward passes, and GraniteMoeHybrid crashed when no Mamba layers were present. Attention function dispatch was also updated to align with the latest model implementations across the codebase.
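
The dispatch mechanism in question is exposed through the attn_implementation argument, which routes each layer's attention call to a chosen backend. A brief illustration, with an illustrative checkpoint ID:

```python
from transformers import AutoModelForCausalLM

# Select an attention backend at load time; Transformers dispatches every
# layer's attention call to the requested implementation.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-35B-A3B",        # illustrative ID; any causal LM works
    attn_implementation="sdpa",    # alternatives: "eager", "flash_attention_2"
)
```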

Generation improvements targeted continuous batching, which enables concurrent request handling in production deployments. Changes corrected KV deduplication and memory estimation for sequences exceeding 16,000 tokens, a critical use case for long-context applications. The release also removed stale warnings about num_return_sequences that fired even when the functionality worked correctly, reducing false alarms in production logs.
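
The warning in question fires on an ordinary generate() call. A sketch of the affected path, using a small public checkpoint for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "gpt2"  # illustrative; any causal LM exercises the same path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

inputs = tokenizer("Continuous batching enables", return_tensors="pt")

# Before the fix, this call could log a stale num_return_sequences
# warning even though all three sequences were returned correctly.
outputs = model.generate(
    **inputs,
    do_sample=True,
    num_return_sequences=3,
    max_new_tokens=20,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```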

Kernel handling was strengthened with fixes for FP8 checkpoint configuration reading and error handling, particularly for quantized models like Qwen3.5-35B-A3B-FP8. Custom expert kernels registered on the Hugging Face Hub can now be loaded properly, and a rotary kernel incompatibility affecting Gemma3n and Gemma4 was resolved. Finally, a change that caused AutoTokenizer to select the wrong tokenizer class, producing regressions in DeepSeek R1 and other models, was reverted.
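
One way to confirm the tokenizer revert locally is to load an affected checkpoint and inspect which concrete class AutoTokenizer resolves to:

```python
from transformers import AutoTokenizer

# After the revert, AutoTokenizer should again resolve to the tokenizer
# class declared in the model's configuration.
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")
print(type(tok).__name__)  # the model's intended tokenizer class
```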

Sources
  1. GitHub · huggingface/transformers releases: Release v5.7.0