Tools · Apr 28, 2026

NVIDIA Nemotron 3 Nano Omni adds audio and video capabilities to multimodal AI model

A new open-weights multimodal model extends NVIDIA's Nemotron line with native audio processing, video understanding, and long-context support for document analysis and agentic tasks.

Trust score: 63 · Hype: some hype

1 source · cross-referenced

TL;DR
  • NVIDIA released Nemotron 3 Nano Omni, an open-weights multimodal model combining text, image, audio, and video understanding for document analysis, speech recognition, and agentic computer use.
  • The model uses a hybrid Mamba-Transformer-MoE backbone with 30B parameters and claims top accuracy on benchmarks including OCRBenchV2 (65.8), MMLongBench-Doc (57.5), and VoiceBench (89.4).
  • The model reportedly delivers up to 9x higher throughput and 2.9x faster single-stream reasoning speed compared to alternatives on multimodal tasks.
  • Checkpoints are available in BF16, FP8, and NVFP4 formats on Hugging Face for download.

NVIDIA has announced Nemotron 3 Nano Omni, a multimodal AI model designed to handle text, images, audio, and video in a single unified architecture. The model extends prior Nemotron releases by adding native audio understanding and video processing capabilities, moving beyond the vision-language focus of previous versions. The 30B-parameter model combines a hybrid Mamba-Transformer Mixture-of-Experts backbone with C-RADIOv4-H vision encoding and Parakeet-TDT audio encoding to support long-context reasoning across mixed modality inputs.

The model is positioned for five primary use cases: analyzing complex multi-page documents with layout and cross-page reasoning, automatic speech recognition across diverse audio conditions, joint audio-video understanding for screen recordings and meetings, agentic computer use through screenshot interpretation and GUI reasoning, and general multimodal reasoning on structured evidence. NVIDIA notes the architecture preserves fine visual detail through dynamic resolution processing and uses Conv3D temporal compression for video inputs.

According to NVIDIA's benchmarks, Nemotron 3 Nano Omni ranks at or near the top of several established leaderboards. On document understanding tasks, it scores 65.8 on OCRBenchV2-En and 57.5 on MMLongBench-Doc. For audio, it achieves 89.4 on VoiceBench. On video understanding, it scores 72.2 on Video-MME and 74.1 on DailyOmni. The model reportedly surpasses Nemotron Nano V2 VL and competes with Qwen3-Omni-30B-A3B across these benchmarks.

NVIDIA claims the model achieves higher system efficiency than competing open omni-modal models—7.4x higher throughput for multi-document workflows and 9.2x for video workloads at fixed per-user interactivity thresholds. The model was trained using staged multimodal alignment, context extension, preference optimization, and multimodal reinforcement learning. BF16, FP8, and NVFP4 checkpoint versions are available for download via Hugging Face.
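The three checkpoint formats differ mainly in bytes per weight, which translates directly into memory footprint. As rough napkin math (an illustration, not an NVIDIA figure; it ignores activations, KV cache, per-block quantization scales, and runtime overhead), the weight-only footprint of a 30B-parameter model can be estimated like this:

```python
def weight_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight-only memory footprint in gigabytes."""
    return num_params * bytes_per_param / 1e9

# Bytes per parameter for the three published checkpoint formats.
# NVFP4 is a 4-bit floating-point format, so roughly half a byte
# per weight before accounting for scale-factor overhead.
formats = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

for name, bpp in formats.items():
    print(f"{name}: ~{weight_footprint_gb(30e9, bpp):.0f} GB")
# BF16: ~60 GB, FP8: ~30 GB, NVFP4: ~15 GB
```

Each halving of precision roughly halves weight memory, which is why lower-precision checkpoints are typically published alongside BF16 for deployment on smaller GPUs.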

Sources
  1. Hugging Face / NVIDIA — Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.

Built by a person who cares about honest AI news.