NVIDIA NeMo AutoModel claims 3.4–3.7x higher training throughput and 29–32% less GPU memory for fine-tuning MoE models
NVIDIA’s new NeMo AutoModel integrates with Hugging Face Transformers v5 to accelerate fine-tuning of Mixture-of-Experts models through the same API, with reported speedups and memory reductions on single-node benchmarks and a multi-node run that native Transformers v5 could not fit in memory.
- NVIDIA released NeMo AutoModel, an open library that accelerates fine-tuning of Mixture-of-Experts (MoE) models within Hugging Face Transformers v5.
- Benchmarks report 3.4–3.7x higher training throughput and 29–32% less GPU memory compared to native Transformers v5 on MoE fine-tuning tasks.
- The library adds Expert Parallelism, DeepEP fused all-to-all dispatch, and TransformerEngine kernels while keeping the familiar from_pretrained() interface of AutoModelForCausalLM.
- Results are reported for NVIDIA Nemotron 3 Ultra 550B A55B (multi-node) and for Qwen3-30B-A3B and Nemotron 3 Nano 30B A3B (single-node).
NVIDIA’s NeMo AutoModel is an open library within the NeMo framework designed to accelerate fine-tuning of Mixture-of-Experts (MoE) models while maintaining compatibility with Hugging Face Transformers v5. The library introduces Expert Parallelism, DeepEP fused all-to-all dispatch, and TransformerEngine kernels, all accessible through the same from_pretrained() API used in Hugging Face Transformers.
According to the announcement, NeMo AutoModel delivers 3.4–3.7x higher training throughput and uses 29–32% less GPU memory than native Transformers v5 when fine-tuning MoE models. The only change required in existing code is importing NeMoAutoModelForCausalLM in place of the standard AutoModelForCausalLM.
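To put those ratios in more familiar terms, the short sketch below converts them into time and memory fractions. It assumes that wall-clock time for a fixed token budget scales inversely with throughput, which is a simplification; the function names are illustrative and not part of any library.

```python
def time_saved(speedup: float) -> float:
    """Fraction of wall-clock time saved for a fixed number of tokens,
    assuming time is inversely proportional to throughput."""
    return 1.0 - 1.0 / speedup


def memory_remaining(reduction_pct: float) -> float:
    """Fraction of the baseline memory footprint still in use."""
    return 1.0 - reduction_pct / 100.0


# Ranges quoted in the announcement.
for speedup in (3.4, 3.7):
    print(f"{speedup}x throughput -> {time_saved(speedup):.1%} less time")

for pct in (29, 32):
    print(f"{pct}% less memory -> {memory_remaining(pct):.2f}x of baseline")
```

Under that assumption, the reported throughput range works out to roughly 71 to 73 percent less training time for the same number of tokens.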
The performance improvements are demonstrated across multiple model families and hardware configurations. For a full fine-tune of the NVIDIA Nemotron 3 Ultra 550B A55B model across 16 H100 nodes (128 GPUs), NeMo AutoModel reports an average of 815 tokens per second and approximately 293 TFLOP/s per GPU, with a peak memory footprint of 58.2 GiB. According to the blog, Transformers v5 ran out of memory on this configuration.
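As a rough sanity check on the multi-node figures, the sketch below derives the cluster-wide token rate and a back-of-the-envelope compute estimate. Two assumptions are not from the announcement: that "A55B" denotes about 55 billion active parameters per token, and that training costs about 6 FLOPs per active parameter per token (a common rule of thumb that ignores attention). How the blog counts FLOPs is not described here, so the estimate is only expected to land in the same range as the reported number.

```python
# Figures quoted in the article for the Nemotron 3 Ultra 550B A55B run.
TOKENS_PER_SEC_PER_GPU = 815
NUM_GPUS = 128                  # 16 nodes x 8 H100 GPUs
REPORTED_TFLOPS_PER_GPU = 293

# Assumption: "A55B" means roughly 55e9 parameters are active per token.
ACTIVE_PARAMS = 55e9


def aggregate_tokens_per_sec(per_gpu: float, num_gpus: int) -> float:
    """Cluster-wide throughput if every GPU sustains the per-GPU average."""
    return per_gpu * num_gpus


def estimated_tflops_per_gpu(active_params: float, tokens_per_sec: float) -> float:
    """Rule-of-thumb estimate: 6 FLOPs per active parameter per token
    (forward plus backward), with attention FLOPs left out."""
    return 6 * active_params * tokens_per_sec / 1e12


cluster_tps = aggregate_tokens_per_sec(TOKENS_PER_SEC_PER_GPU, NUM_GPUS)
estimate = estimated_tflops_per_gpu(ACTIVE_PARAMS, TOKENS_PER_SEC_PER_GPU)

print(f"Cluster throughput: {cluster_tps:,.0f} tokens/s")
print(f"Estimated compute:  {estimate:.1f} TFLOP/s per GPU "
      f"(reported: ~{REPORTED_TFLOPS_PER_GPU})")
```

The estimate comes out somewhat below the reported figure, which is the direction one would expect from an approximation that omits attention.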
On single-node setups with 8x H100 80GB GPUs, NeMo AutoModel was benchmarked against Hugging Face Transformers v4 and v5 using models including Qwen3-30B-A3B and Nemotron 3 Nano 30B A3B. The announcement highlights that NeMo AutoModel’s optimizations, particularly its balanced routing gate and DeepEP dispatch, enable more efficient expert utilization and communication overlap, both of which are critical for MoE training performance.
The library’s design emphasizes API compatibility and scalability. It subclasses AutoModelForCausalLM, allowing existing code that works with Hugging Face models to function with NeMo AutoModel with minimal changes. For supported MoE architectures such as Qwen3, NVIDIA Nemotron, GPT-OSS, and DeepSeek V3, NeMo AutoModel provides hand-tuned implementations with TransformerEngine attention and fused linear layers, while falling back to optimizations like Liger kernel patching for other models.
NeMo AutoModel also integrates with PyTorch’s DeviceMesh, so multi-GPU and multi-node training is configured through a distributed mesh. The blog provides a code example that initializes a distributed setup with Expert Parallelism and FSDP2 on an 8-GPU node, reflecting the library’s focus on practical, scalable fine-tuning workflows.
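The blog's own example is not reproduced here. To illustrate what a two-dimensional mesh over an 8-GPU node means in practice, the following pure-Python sketch lays out ranks in row-major order, as PyTorch's DeviceMesh does, and lists which ranks would communicate along each dimension. The 2 x 4 shape and the dimension names `dp_shard` and `ep` are hypothetical choices for illustration, not the configuration used by NVIDIA, and the helper functions are not part of NeMo AutoModel or PyTorch.

```python
from itertools import product
from math import prod


def build_mesh(world_size: int, dims: dict[str, int]) -> dict[int, dict[str, int]]:
    """Map each rank to its coordinates in a row-major mesh.

    `dims` is ordered from the outermost to the innermost dimension.
    """
    if prod(dims.values()) != world_size:
        raise ValueError("mesh dimensions must multiply to the world size")
    coords = product(*(range(size) for size in dims.values()))
    return {rank: dict(zip(dims, coord)) for rank, coord in enumerate(coords)}


def groups_along(mesh: dict[int, dict[str, int]], dim: str) -> list[list[int]]:
    """Ranks that agree on every coordinate except `dim` form one group."""
    groups: dict[tuple, list[int]] = {}
    for rank, coord in mesh.items():
        key = tuple(value for name, value in coord.items() if name != dim)
        groups.setdefault(key, []).append(rank)
    return list(groups.values())


# Hypothetical layout: 2-way sharded data parallelism x 4-way expert parallelism.
mesh = build_mesh(8, {"dp_shard": 2, "ep": 4})
ep_groups = groups_along(mesh, "ep")          # ranks exchanging tokens between experts
fsdp_groups = groups_along(mesh, "dp_shard")  # ranks sharding the same parameters

print("expert-parallel groups:", ep_groups)
print("sharding groups:       ", fsdp_groups)
```

In a layout like this, all-to-all token dispatch would stay within each expert-parallel group, while parameter sharding collectives would run across the other dimension.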