Tools · Jul 3, 2026

Hugging Face Transformers v5.13.0 adds multimodal, speech, and MoE models

The open-source library adds support for Kimi 2.5–2.7, MiMo-V2-Flash, Nemotron 3.5 ASR, Qwen3 ASR, ZAYA1, and VideoPrism, alongside efficiency-focused updates.

TL;DR

Adds native multimodal agentic model support via Kimi 2.5–2.7 architecture
Introduces MiMo-V2-Flash, an MoE model trained on 27T tokens, with a 256K context window and reduced KV-cache storage
Adds two NVIDIA speech recognition models (Nemotron 3.5 ASR and Nemotron ASR Streaming)
Includes Qwen3 ASR with forced aligner and ZAYA1 MoE model from Zyphra
Adds VideoPrism for general-purpose video understanding

Hugging Face Transformers v5.13.0 integrates six new models spanning multimodal agents, speech recognition, mixture-of-experts architectures, and video understanding. The Kimi 2.5 architecture underpins three new releases (Kimi 2.5, 2.6, and 2.7), positioned as open-source multimodal agentic models focused on long-horizon coding, autonomous execution, and swarm-based task orchestration. According to the release notes, Kimi 2.5 generalizes across programming languages including Rust, Go, and Python, and can transform simple prompts and visual inputs into production-ready interfaces and lightweight full-stack workflows.

MiMo-V2-Flash, developed by Xiaomi’s MiMo team, is a Mixture-of-Experts language model trained on 27 trillion tokens with native 32K sequence length and support for an extended 256K context window. The model is designed to balance long-context modeling with inference efficiency by reducing KV-cache storage compared to standard global attention models.
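To see why KV-cache storage matters at these context lengths, the sketch below estimates the cache size of a standard global attention model at 32K and 256K tokens. The layer count, KV-head count, head dimension, and 2-byte precision are hypothetical placeholders chosen for illustration; they are not MiMo-V2-Flash's actual configuration, which the release notes summarized here do not specify.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence under global attention."""
    # Factor of 2: one cached tensor for keys and one for values, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical dimensions for illustration only (NOT MiMo-V2-Flash's real config).
HYPOTHETICAL_CONFIG = dict(num_layers=48, num_kv_heads=8, head_dim=128)

native = kv_cache_bytes(32 * 1024, **HYPOTHETICAL_CONFIG)     # 32K tokens
extended = kv_cache_bytes(256 * 1024, **HYPOTHETICAL_CONFIG)  # 256K tokens

print(f"32K context:  {native / 2**30:.1f} GiB per sequence")
print(f"256K context: {extended / 2**30:.1f} GiB per sequence")
```

Under global attention the cache grows linearly with sequence length, so the 256K figure is eight times the 32K one. That linear growth is the cost a reduced-KV-cache design aims to cut.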

NVIDIA contributes two speech recognition models: Nemotron 3.5 ASR and Nemotron ASR Streaming, both 600M-parameter models targeting low-latency streaming and high-throughput batch transcription. The streaming variants offer configurable chunk sizes (80 ms, 160 ms, 560 ms, 1120 ms) to trade off latency against accuracy, leveraging a cache-aware FastConformer-RNNT architecture that reuses cached encoder context to minimize redundant computation and end-to-end delay.
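A rough way to picture the chunk-size trade-off is to count how much audio the model waits for per step and how many encoder steps a clip needs. The sketch below does this arithmetic for the four chunk sizes listed above. The 16 kHz sample rate and the helper names are assumptions made for illustration, and none of this is the Transformers or NVIDIA API.

```python
SAMPLE_RATE_HZ = 16_000                # assumption: a common ASR rate, not stated in the source
CHUNK_SIZES_MS = (80, 160, 560, 1120)  # chunk sizes listed in the release notes


def samples_per_chunk(chunk_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of audio samples buffered before each encoder step."""
    return sample_rate_hz * chunk_ms // 1000


def num_chunks(audio_ms, chunk_ms):
    """Encoder steps needed to cover the audio (the final partial chunk still counts)."""
    return -(-audio_ms // chunk_ms)  # ceiling division


def streaming_plan(audio_ms):
    """Per-chunk-size summary: buffered samples and number of encoder steps."""
    return {
        chunk_ms: {
            "samples_per_chunk": samples_per_chunk(chunk_ms),
            "encoder_steps": num_chunks(audio_ms, chunk_ms),
        }
        for chunk_ms in CHUNK_SIZES_MS
    }


plan = streaming_plan(10_000)  # a 10-second clip
for chunk_ms, info in plan.items():
    print(f"{chunk_ms:>5} ms chunks: {info['samples_per_chunk']:>6} samples, "
          f"{info['encoder_steps']:>4} steps")
```

Smaller chunks mean less audio is buffered before each step, so text can appear sooner, but the encoder runs more often and sees less new audio each time. Larger chunks reverse that balance.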

Alibaba’s Qwen team adds Qwen3 ASR, an automatic speech recognition model combining a Whisper-style audio encoder with a Qwen3 language model decoder. It supports automatic language detection and multilingual transcription, and includes a forced aligner model for timestamping transcripts against audio. The release also adds ZAYA1, a 760M active / 8.4B total parameter MoE language model from Zyphra that incorporates Compressed Convolutional Attention, a nonlinear router, and residual scaling.
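The active/total split is the key number for an MoE model: all parameters must be stored, but only the active subset participates in computing each token. The short calculation below works out the ratio for ZAYA1's published figures. The 2-bytes-per-parameter storage estimate is an assumption (16-bit weights) made for illustration, not a figure from the release.

```python
ACTIVE_PARAMS = 760_000_000    # 760M active parameters (from the release notes)
TOTAL_PARAMS = 8_400_000_000   # 8.4B total parameters (from the release notes)
BYTES_PER_PARAM = 2            # assumption: 16-bit weights

# Share of the model's weights used for any single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Weights that must be held in memory versus weights touched per token.
total_weight_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
active_weight_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9

print(f"Active fraction: {active_fraction:.1%}")
print(f"Stored weights:  {total_weight_gb:.1f} GB")
print(f"Active weights:  {active_weight_gb:.2f} GB per token")
```

Roughly one parameter in eleven is active per token, so per-token compute resembles that of a much smaller dense model while memory requirements track the full 8.4B.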

VideoPrism, proposed by Google DeepMind, is added as a general-purpose video encoder for diverse video understanding tasks. The model is pretrained on a large-scale, heterogeneous video dataset and is intended to serve as a foundational visual encoder that handles these tasks as a single frozen model.

Sources

1. GitHub · huggingface/transformers releases, Release v5.13.0
