Tools · Apr 18, 2026

NVIDIA releases Nemotron OCR v2, a multilingual text-recognition model trained on synthetic data

The updated model reaches 34.7 pages per second on a single A100 GPU and sharply lowers error rates on non-English text after training on 12 million programmatically generated images covering six languages.

TL;DR
  • NVIDIA released Nemotron OCR v2, a multilingual optical character recognition model that processes 34.7 pages per second on a single A100 GPU
  • The model was trained on 12 million synthetic images covering six languages (Japanese, Korean, Russian, Simplified Chinese, Traditional Chinese, and English), lowering normalized edit distance (NED) on the non-English languages from 0.56–0.92 to 0.035–0.069
  • The synthetic data pipeline uses mOSCAR (a multilingual web corpus) for source text and a modified version of SynthDoG to generate pixel-precise annotations at word, line, and paragraph levels
  • The architecture reuses convolutional feature maps across detection and recognition components to minimize computational overhead
  • Both the model and the underlying dataset (nvidia/OCR-Synthetic-Multilingual-v1) are publicly available on Hugging Face

NVIDIA released Nemotron OCR v2, an optical character recognition model that processes 34.7 pages per second on a single A100 GPU while supporting six languages. The model and its training dataset are hosted on Hugging Face. It was developed to address limitations in its predecessor, which struggled with non-English text because of insufficient training data and a restricted character set.
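To put the throughput figure in more familiar units, here is a small back-of-the-envelope sketch. It assumes the reported 34.7 pages per second is sustained on pages similar to the benchmark ones, which the source does not guarantee; the helper name `seconds_for` is made up for illustration.

```python
PAGES_PER_SECOND = 34.7  # figure reported for a single A100 GPU

# Average time spent on one page, in milliseconds.
ms_per_page = 1000 / PAGES_PER_SECOND

# Pages processed in one hour at the same sustained rate (an assumption).
pages_per_hour = PAGES_PER_SECOND * 3600


def seconds_for(pages: int) -> float:
    """Estimated wall-clock seconds to process `pages` at the reported rate."""
    return pages / PAGES_PER_SECOND


print(f"{ms_per_page:.1f} ms per page")
print(f"{pages_per_hour:,.0f} pages per hour")
print(f"{seconds_for(1_000_000) / 3600:.1f} hours per million pages")
```

At that rate a page takes a little under 29 milliseconds, and a million pages would take roughly eight hours on one GPU.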

The core change is training on 12 million synthetic images across six languages: English, Japanese, Korean, Russian, and both Simplified and Traditional Chinese. On the non-English languages, the predecessor scored a normalized edit distance (NED) of 0.56 to 0.92, where 0 is a perfect match and higher values mean more character errors, so most of its output was wrong. The new version brings NED down to 0.035–0.069 on those languages.
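For readers unfamiliar with the metric, the sketch below shows one common way to compute a normalized edit distance: Levenshtein distance divided by the length of the longer string. This normalization is an assumption; the source does not say which convention the benchmark uses, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length (assumed convention)."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest


print(normalized_edit_distance("東京都渋谷区", "東京都渋谷区"))  # identical strings
print(normalized_edit_distance("kitten", "sitting"))            # 3 edits over 7 characters
```

Under this convention, a score near 0.9 means almost every character needs an edit, while a score near 0.04 means about four edits per hundred characters.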

The synthetic data pipeline has two main components. mOSCAR, a multilingual web corpus spanning 163 language subsets, supplies source text with realistic distributions. A modified version of SynthDoG (Synthetic Document Generator) renders that text onto programmatically generated layouts with pixel-precise annotations. The pipeline emits hierarchical bounding boxes at word, line, and paragraph levels, plus relation graphs that encode document structure and reading order, so the model can learn complex layouts such as multi-column text, tables, and vertical scripts.
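The word, line, and paragraph hierarchy can be pictured with a small data-structure sketch. Everything here is hypothetical: the class names, fields, and the flat `reading_order` list are simplifications for illustration, not the actual schema of the dataset or of SynthDoG. The point is that when a renderer places every word itself, the enclosing line and paragraph boxes follow exactly from the word boxes.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float


def union(boxes: Iterable[Box]) -> Box:
    """Smallest axis-aligned box that encloses all given boxes."""
    boxes = list(boxes)
    return Box(min(b.x0 for b in boxes), min(b.y0 for b in boxes),
               max(b.x1 for b in boxes), max(b.y1 for b in boxes))


@dataclass
class Word:
    text: str
    box: Box  # known exactly, because the renderer placed the word


@dataclass
class Line:
    words: list[Word]

    @property
    def box(self) -> Box:
        return union(w.box for w in self.words)

    @property
    def text(self) -> str:
        # Space-joining suits English; CJK text would be joined without spaces.
        return " ".join(w.text for w in self.words)


@dataclass
class Paragraph:
    lines: list[Line]

    @property
    def box(self) -> Box:
        return union(line.box for line in self.lines)


@dataclass
class Page:
    paragraphs: list[Paragraph]
    reading_order: list[int]  # paragraph indices; a stand-in for a relation graph

    def text(self) -> str:
        return "\n\n".join(
            "\n".join(line.text for line in self.paragraphs[i].lines)
            for i in self.reading_order
        )


# A two-column page whose paragraphs are stored right column first.
left = Paragraph([
    Line([Word("Hello", Box(10, 10, 50, 20)), Word("world", Box(55, 10, 100, 20))]),
    Line([Word("again", Box(10, 25, 40, 35))]),
])
right = Paragraph([Line([Word("Right", Box(120, 10, 170, 20))])])
page = Page(paragraphs=[right, left], reading_order=[1, 0])

print(left.lines[0].box)
print(left.box)
print(page.text())
```

The explicit reading order is what lets the page text come out left column first even though the paragraphs are stored in a different order.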

The model's speed comes from its architecture, which combines text detection and recognition in a single network with a shared convolutional backbone. A RegNetX-8GF backbone processes each input image once, and the resulting feature maps are reused by both the text recognizer and a compact relational model, so downstream components never recompute them. On the training side, the synthetic approach let developers control layouts, fonts, colors, and augmentations systematically, which enabled models trained entirely on synthetic data to generalize to real-world documents.
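The benefit of feature reuse can be shown with a toy sketch. None of this is the model's real code: the "backbone" below just sums pixel rows, and `detect`, `recognize`, and `relate` are placeholder functions invented for illustration. What the sketch does show is the structural difference between running an expensive feature extractor once and running it once per stage.

```python
class CountingBackbone:
    """Stand-in for an expensive convolutional backbone; counts its invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, image: list[list[int]]) -> list[int]:
        self.calls += 1
        # Toy "feature map": the amount of ink in each pixel row.
        return [sum(row) for row in image]


def detect(features: list[int]) -> list[int]:
    """Toy detector: rows that contain any ink count as text regions."""
    return [row for row, ink in enumerate(features) if ink > 0]


def recognize(features: list[int], regions: list[int]) -> list[str]:
    """Toy recognizer: emits a placeholder string per detected region."""
    return [f"<{features[row]} px of ink>" for row in regions]


def relate(features: list[int], regions: list[int]) -> list[int]:
    """Toy relational model: reading order is simply top to bottom."""
    return sorted(regions)


def ocr_shared(image, backbone):
    features = backbone(image)  # one backbone pass, reused by every stage
    regions = detect(features)
    return recognize(features, regions), relate(features, regions)


def ocr_separate(image, backbone):
    # Each stage recomputes the features, as separate models would.
    regions = detect(backbone(image))
    texts = recognize(backbone(image), regions)
    order = relate(backbone(image), regions)
    return texts, order


image = [[0, 0, 0], [1, 1, 0], [0, 0, 0], [1, 0, 1]]
shared, separate = CountingBackbone(), CountingBackbone()

# Both pipelines produce the same result; only the backbone cost differs.
assert ocr_shared(image, shared) == ocr_separate(image, separate)
print(shared.calls, separate.calls)  # 1 3
```

In a real network the backbone pass dominates the cost per image, so cutting three passes to one is where most of the saving would come from.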

Sources
  1. Hugging Face, "Building a Fast Multilingual OCR Model with Synthetic Data"


© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.
