Tools · Apr 18, 2026

NVIDIA releases Nemotron OCR v2, a multilingual text-recognition model trained on synthetic data

The updated model reaches 34.7 pages per second on a single A100 GPU and improves accuracy across six languages by training on 12 million programmatically generated images.

TL;DR

NVIDIA released Nemotron OCR v2, a multilingual optical character recognition model that processes 34.7 pages per second on a single A100 GPU
The model was trained on 12 million synthetic images covering six languages (Japanese, Korean, Russian, Simplified Chinese, Traditional Chinese, and English), cutting normalized edit distance on the non-English languages from 0.56–0.92 to 0.035–0.069
The synthetic data pipeline uses mOSCAR (a multilingual web corpus) for source text and a modified version of SynthDoG to generate pixel-precise annotations at word, line, and paragraph levels
The architecture reuses convolutional feature maps across detection and recognition components to minimize computational overhead
Both the model and the underlying dataset (nvidia/OCR-Synthetic-Multilingual-v1) are publicly available on Hugging Face

NVIDIA has released Nemotron OCR v2 on Hugging Face, an optical character recognition model that processes 34.7 pages per second on a single A100 GPU while maintaining multilingual accuracy. The model was developed to address limitations in its predecessor, which struggled with non-English text due to insufficient training data and character set constraints.

The central change is training on 12 million synthetic images spanning six languages: English, Japanese, Korean, Russian, and both Simplified and Traditional Chinese. On the non-English languages, the predecessor scored a normalized edit distance (NED) of 0.56 to 0.92, where lower is better, indicating severe recognition errors. The new version brings NED down to 0.035–0.069, a substantial improvement in character-level accuracy.
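For readers unfamiliar with the metric, NED can be sketched in a few lines of Python. This is an illustrative implementation, not NVIDIA's evaluation code: it assumes NED is the Levenshtein distance divided by the length of the longer string, and the function names are hypothetical. Other normalizations exist (for example, dividing by the reference length), so exact values may differ from the published ones.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means a perfect match.

    Assumption: normalize by the longer of the two strings.
    """
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(predicted, reference) / longest
```

Under this definition, normalized_edit_distance("abcd", "abxd") is 0.25, or one wrong character in four. Read the same way, a score near 0.9 means almost every character is wrong, while 0.035 corresponds to roughly 3 to 4 character edits per 100 characters.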

The synthetic data pipeline has two main components. mOSCAR, a multilingual web corpus spanning 163 language subsets, provides realistic source text distributions, and a modified version of SynthDoG (Synthetic Document Generator) renders that text onto programmatic layouts with precise pixel-level annotations. The pipeline generates hierarchical bounding boxes at word, line, and paragraph levels, plus relation graphs encoding document structure and reading order, enabling the model to handle complex layouts including multi-column text, tables, and vertical scripts.
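The annotation hierarchy described above can be pictured with a small data model. The sketch below is hypothetical: the class names, field names, and edge format are assumptions for illustration and do not reflect the actual schema of the released dataset. It shows how line and paragraph boxes can be derived from word boxes, and how a relation graph, rather than storage order, determines reading order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    x0: int
    y0: int
    x1: int
    y1: int


def union(boxes: List[Box]) -> Box:
    """Smallest axis-aligned box that contains every input box."""
    return Box(
        min(b.x0 for b in boxes),
        min(b.y0 for b in boxes),
        max(b.x1 for b in boxes),
        max(b.y1 for b in boxes),
    )


@dataclass
class Word:
    text: str
    box: Box


@dataclass
class Line:
    words: List[Word]

    @property
    def box(self) -> Box:
        return union([w.box for w in self.words])


@dataclass
class Paragraph:
    lines: List[Line]

    @property
    def box(self) -> Box:
        return union([ln.box for ln in self.lines])


@dataclass
class Page:
    paragraphs: List[Paragraph]
    # Assumed edge format: (i, j) means paragraph i is read before paragraph j.
    reading_order: List[Tuple[int, int]]

    def words_in_reading_order(self) -> List[str]:
        """Follow the relation edges as a simple chain (assumes no branching)."""
        successor = dict(self.reading_order)
        targets = set(successor.values())
        starts = [i for i in range(len(self.paragraphs)) if i not in targets]
        order: List[int] = []
        current = starts[0] if starts else None
        while current is not None and current not in order:
            order.append(current)
            current = successor.get(current)
        return [w.text for i in order
                for ln in self.paragraphs[i].lines for w in ln.words]


def build_example_page() -> Page:
    """Two-column page stored right column first, to show order comes from the graph."""
    left = Paragraph([
        Line([Word("Hello", Box(10, 10, 60, 30)), Word("world", Box(70, 10, 120, 30))]),
        Line([Word("again", Box(10, 40, 60, 60))]),
    ])
    right = Paragraph([Line([Word("Second", Box(200, 10, 270, 30))])])
    return Page(paragraphs=[right, left], reading_order=[(1, 0)])
```

Because the renderer places every glyph itself, boxes like these can be exact by construction, which is what makes fully synthetic labels at all three levels practical.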

The model's speed derives from its architecture, which unifies text detection and recognition into a single network with a shared convolutional backbone. A RegNetX-8GF backbone processes each input image once, producing feature maps reused by both the text recognizer and a compact relational model component. This feature reuse eliminates redundant computation across downstream components. The synthetic training approach allowed developers to control layouts, fonts, colors, and augmentations systematically, enabling models trained entirely on synthetic data to generalize to real-world documents.
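The effect of feature reuse can be shown with a toy pipeline. Everything in this sketch is a stand-in: the "backbone" just averages pixel rows, the three heads are trivial functions, and all names are hypothetical. It is not the Nemotron OCR v2 network. It only shows the structural point that a shared design runs the expensive stage once per image, while a pipeline of independent models would run it once per stage.

```python
from typing import Dict, List, Tuple

Image = List[List[float]]
Features = List[float]


class CountingBackbone:
    """Stand-in for an expensive convolutional backbone; counts forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, image: Image) -> Features:
        self.calls += 1
        # Toy "feature map": one mean intensity per image row.
        return [sum(row) / len(row) for row in image]


def detect(features: Features) -> List[int]:
    """Toy detector: rows whose mean intensity suggests text."""
    return [i for i, value in enumerate(features) if value > 0.5]


def recognize(features: Features, regions: List[int]) -> List[str]:
    """Toy recognizer: emits a label per detected region from the same features."""
    return [f"row{i}:{features[i]:.2f}" for i in regions]


def relate(regions: List[int]) -> List[Tuple[int, int]]:
    """Toy relational head: links consecutive regions as reading-order edges."""
    return list(zip(regions, regions[1:]))


def run_shared(backbone: CountingBackbone, image: Image) -> Dict[str, list]:
    features = backbone(image)  # single pass, reused by every head below
    regions = detect(features)
    return {
        "regions": regions,
        "text": recognize(features, regions),
        "relations": relate(regions),
    }


def run_separate(backbone: CountingBackbone, image: Image) -> Dict[str, list]:
    # Each stage recomputes features, as independent models would.
    regions = detect(backbone(image))
    text = recognize(backbone(image), regions)
    _ = backbone(image)  # a separate relational model would also re-encode the image
    return {"regions": regions, "text": text, "relations": relate(regions)}
```

Both functions return the same result for the same image; the shared version calls the backbone once, and the separate version calls it three times.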

Sources

1. Hugging Face, "Building a Fast Multilingual OCR Model with Synthetic Data"
