NVIDIA releases Nemotron OCR v2, a multilingual text-recognition model trained on synthetic data
The updated model processes 34.7 pages per second on a single A100 GPU and improves accuracy across six languages by training on 12 million programmatically generated images.
- NVIDIA released Nemotron OCR v2, a multilingual optical character recognition model that processes 34.7 pages per second on a single A100 GPU
- The model was trained on 12 million synthetic images covering six languages (Japanese, Korean, Russian, Simplified Chinese, Traditional Chinese, and English), cutting normalized edit distance on the non-English languages from 0.56–0.92 to 0.035–0.069
- The synthetic data pipeline uses mOSCAR (a multilingual web corpus) for source text and a modified version of SynthDoG to generate pixel-precise annotations at word, line, and paragraph levels
- The architecture reuses convolutional feature maps across detection and recognition components to minimize computational overhead
- Both the model and the underlying dataset (nvidia/OCR-Synthetic-Multilingual-v1) are publicly available on Hugging Face
NVIDIA released Nemotron OCR v2, an optical character recognition model that processes 34.7 pages per second on a single A100 GPU while maintaining multilingual accuracy. The model addresses limitations in its predecessor, which struggled with non-English text because of insufficient training data and a limited character set.
The central change is training on 12 million synthetic images across six languages: English, Japanese, Korean, Russian, and both Simplified and Traditional Chinese. On the non-English languages, the predecessor model scored between 0.56 and 0.92 in normalized edit distance (NED), a character-level error measure where lower is better, indicating severe recognition errors. The new version lowers those scores to 0.035–0.069, a substantial improvement in character-level accuracy.
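To make the metric concrete, here is a minimal sketch of a normalized edit distance. It assumes a common definition (Levenshtein distance divided by the length of the longer string); the article does not say which normalization NVIDIA used, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Assumed normalization: divide by the longer string's length."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings match exactly
    return levenshtein(prediction, reference) / longest
```

Dividing by the longer string keeps the score between 0 and 1 regardless of text length, which is what allows scores to be compared across languages and documents.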
The synthetic data pipeline combines two components. mOSCAR, a multilingual web corpus spanning 163 language subsets, supplies realistic source text. A modified version of SynthDoG (Synthetic Document Generator) then renders that text onto programmatically generated layouts with pixel-level annotations. The pipeline produces hierarchical bounding boxes at word, line, and paragraph levels, plus relation graphs that encode document structure and reading order, so the model can handle complex layouts including multi-column text, tables, and vertical scripts.
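The hierarchy described above can be pictured with a small data structure. This is a hypothetical sketch, not the schema of the published dataset: the class names, the `(left, top, right, bottom)` box format, and the use of a plain index list for reading order are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (left, top, right, bottom) in pixels


def union(boxes: List[Box]) -> Box:
    """Smallest axis-aligned box enclosing every box in the list."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


@dataclass
class Word:
    text: str
    box: Box


@dataclass
class Line:
    words: List[Word]

    @property
    def box(self) -> Box:
        # A line's box is derived from the words it contains
        return union([w.box for w in self.words])


@dataclass
class Paragraph:
    lines: List[Line]

    @property
    def box(self) -> Box:
        # A paragraph's box is derived from its lines
        return union([line.box for line in self.lines])


@dataclass
class Page:
    paragraphs: List[Paragraph]
    reading_order: List[int]  # paragraph indices in the order a reader follows

    def paragraphs_in_reading_order(self) -> List[Paragraph]:
        return [self.paragraphs[i] for i in self.reading_order]
```

Because a synthetic renderer places every word itself, boxes at each level can be derived exactly from the level below, and reading order is known rather than inferred. In a two-column page, for instance, the stored paragraph order and the reading order can differ, which is the kind of structure a relation graph captures.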
The model's speed derives from its architecture, which unifies text detection and recognition into a single network with a shared convolutional backbone. A RegNetX-8GF backbone processes each input image once, producing feature maps reused by both the text recognizer and a compact relational model component. This feature reuse eliminates redundant computation across downstream components. The synthetic training approach allowed developers to control layouts, fonts, colors, and augmentations systematically, enabling models trained entirely on synthetic data to generalize to real-world documents.
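The feature-reuse idea can be illustrated with a toy sketch. Nothing here reflects the actual Nemotron OCR v2 code: the "backbone" is a stand-in that averages pixel rows, the three heads are placeholders, and all names are hypothetical. The point is only that a shared design runs the expensive step once per image.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


class CountingBackbone:
    """Stand-in for an expensive feature extractor; counts how often it runs."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, image: Image) -> List[float]:
        self.calls += 1
        # Toy "feature map": the mean value of each pixel row
        return [sum(row) / len(row) for row in image]


def detect(features: List[float]) -> List[int]:
    """Toy detector: indices of rows bright enough to count as text."""
    return [i for i, value in enumerate(features) if value > 0.5]


def recognize(features: List[float]) -> List[str]:
    """Toy recognizer: one placeholder label per detected row."""
    return [f"row-{i}" for i in detect(features)]


def relate(features: List[float]) -> List[Tuple[int, int]]:
    """Toy relational head: link each detected row to the next one."""
    rows = detect(features)
    return list(zip(rows, rows[1:]))


def run_shared(backbone: CountingBackbone, image: Image) -> Dict[str, object]:
    features = backbone(image)  # computed once, reused by every head
    return {
        "detection": detect(features),
        "recognition": recognize(features),
        "relations": relate(features),
    }


def run_separate(backbone: CountingBackbone, image: Image) -> Dict[str, object]:
    # Each stage recomputes features, as separate per-stage models would
    return {
        "detection": detect(backbone(image)),
        "recognition": recognize(backbone(image)),
        "relations": relate(backbone(image)),
    }
```

Both functions return the same results, but `run_shared` invokes the backbone once per image while `run_separate` invokes it three times. With a real convolutional backbone, that repeated pass is the redundant computation a unified network avoids.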