Hugging Face releases Transformers v5.7.0 with new model support and bug fixes
The Transformers library v5.7.0 adds support for Laguna and DEIMv2 models alongside attention, tokenizer, and generation improvements. Released April 28, the update includes multiple kernel optimizations and bug fixes affecting models including T5Gemma2, Qwen3.5, and GraniteMoeHybrid.
- Transformers v5.7.0 added Poolside's Laguna mixture-of-experts model and DEIMv2 real-time object detection model to the library.
- The release fixed cross-attention cache issues in T5Gemma2 for long inputs, linear attention bugs in Qwen3.5, and crashes in GraniteMoeHybrid attention-only variants.
- Continuous batching generation received corrections for KV deduplication and memory estimation on long sequences (16K+), with misleading warnings removed.
- Kernel support was improved for FP8 checkpoints like Qwen3.5-35B-A3B-FP8, and custom expert kernels from the Hugging Face Hub can now be loaded properly.
- A change that caused AutoTokenizer to initialize the wrong tokenizer class, producing regressions in models like DeepSeek R1, was reverted.
Hugging Face released version 5.7.0 of its Transformers library on April 28, 2025, introducing support for two new model families and addressing multiple critical bugs. The release added Laguna, Poolside's mixture-of-experts language model, and DEIMv2, a real-time object detection architecture spanning eight model sizes from X down to Atto. Laguna implements a sigmoid mixture-of-experts router with per-layer head-count flexibility, while DEIMv2 extends DETR-style detection with features adapted from DINOv3, with the X variant achieving 57.8 AP with only 50.3M parameters.
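The release notes describe Laguna's sigmoid router only at a high level. As an illustration of the general technique, and not Poolside's actual implementation, a sigmoid router scores every expert independently (unlike a softmax router, which normalizes scores across experts) and weights the top-k experts by their raw sigmoid gates. All shapes and names below are hypothetical:

```python
import numpy as np

def sigmoid_moe_route(hidden, gate_weights, top_k=2):
    """Illustrative sigmoid MoE routing: each expert gets an independent
    sigmoid score, and the top-k experts weight the token's output.
    Unlike a softmax router, the scores are not normalized across experts."""
    logits = hidden @ gate_weights                     # (tokens, num_experts)
    scores = 1.0 / (1.0 + np.exp(-logits))             # independent per-expert gates
    top_idx = np.argsort(scores, axis=-1)[:, -top_k:]  # ids of the top-k experts
    top_scores = np.take_along_axis(scores, top_idx, axis=-1)
    return top_idx, top_scores

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 16))   # 4 tokens, hidden size 16 (toy values)
gate_w = rng.standard_normal((16, 8))   # gate projection onto 8 experts
idx, w = sigmoid_moe_route(hidden, gate_w)
assert idx.shape == (4, 2) and np.all((w > 0) & (w < 1))
```

Because sigmoid gates are unnormalized, each token's expert weights need not sum to one, which changes load-balancing behavior compared with softmax routing.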
The update fixed several attention-mechanism bugs affecting production models. T5Gemma2 used the wrong cache layer type for cross-attention on long inputs, Qwen3.5's gated-delta-net linear attention behaved incorrectly on cached forward passes, and GraniteMoeHybrid crashed on attention-only variants with no Mamba layers. Attention function dispatch was also updated to align with the latest model implementations across the codebase.
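The T5Gemma2 fix concerns cross-attention caching, which differs structurally from self-attention caching: keys and values are projected from the fixed encoder output, so they are computed once and reused at every decode step rather than growing with the sequence. The toy helper below sketches that distinction; it is not the Transformers Cache API:

```python
import numpy as np

class CrossAttentionCache:
    """Toy sketch (not the Transformers Cache API): cross-attention K/V come
    from the fixed encoder output, so they are projected once on the first
    decode step and reused unchanged afterwards, never growing per token."""
    def __init__(self):
        self.k = None
        self.v = None

    def get(self, encoder_states, wk, wv):
        if self.k is None:                 # first decode step: project once
            self.k = encoder_states @ wk
            self.v = encoder_states @ wv
        return self.k, self.v              # later steps: reuse, never grow

rng = np.random.default_rng(1)
enc = rng.standard_normal((10, 8))         # 10 encoder positions, dim 8
wk = rng.standard_normal((8, 8))
wv = rng.standard_normal((8, 8))
cache = CrossAttentionCache()
k1, v1 = cache.get(enc, wk, wv)            # step 1: computed and stored
k2, v2 = cache.get(enc, wk, wv)            # step 2: identical object reused
assert k1 is k2 and v1 is v2
```

A self-attention cache, by contrast, appends one new K/V entry per generated token, which is why mixing up the two layer types only surfaces on longer inputs.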
Generation improvements targeted continuous batching, which enables concurrent request handling in production deployments. Changes corrected KV deduplication and memory estimation for sequences exceeding 16,000 tokens, a critical case for long-context applications. The release also removed stale warnings about num_return_sequences that fired even when the functionality worked correctly, reducing false alarms in production logs.
Kernel handling was strengthened with fixes to FP8 checkpoint configuration reading and error handling, particularly for quantized models like Qwen3.5-35B-A3B-FP8. Custom expert kernels registered on the Hugging Face Hub can now be loaded properly, and a rotary kernel incompatibility affecting Gemma3n and Gemma4 was resolved. Finally, a change that caused AutoTokenizer to select the wrong tokenizer class, producing regressions in DeepSeek R1 and other models, was reverted.