Hugging Face adds private datasets to Open ASR Leaderboard to combat benchmark gaming
The Open ASR Leaderboard now incorporates high-quality private speech datasets from Appen and DataoceanAI to reduce model overfitting to public test sets, while keeping the leaderboard's main ranking metric based on public data by default.
- Hugging Face added private English speech datasets from Appen Inc. and DataoceanAI to the Open ASR Leaderboard, covering multiple accents and speaking styles with over 29 hours of audio across scripted and conversational speech.
- The private datasets remain hidden from model developers to prevent benchmark-specific optimization ("benchmaxxing") and test-set contamination, a known vulnerability of open benchmarks.
- The leaderboard's primary Average WER metric continues to use only public datasets by default; private data can be toggled on separately to observe its impact on model rankings.
- The update includes targeted metrics for scripted versus conversational speech and US versus non-US accents, designed to surface performance gaps that single-metric leaderboards typically obscure.
The Open ASR Leaderboard, which has received over 710,000 visits since September 2023, now incorporates private evaluation datasets to reduce model overfitting to public benchmarks. Hugging Face collaborated with data providers Appen Inc. and DataoceanAI to curate high-quality speech datasets spanning multiple English accents and speaking styles.
The new private dataset includes approximately 29 hours of audio across Australian, Canadian, Indian, US, and British English accents. Each accent variant exists in both scripted and conversational forms, with datasets ranging from 1.02 to 8.82 hours. The datasets are intentionally withheld from model developers to prevent explicit optimization against the test sets.
The leaderboard implements a structural safeguard: the default Average WER metric excludes private dataset results, preventing private data from immediately affecting model rankings. Developers can optionally toggle private data visibility to observe downstream effects, but the primary leaderboard ranking remains based on public datasets. This design choice prioritizes transparency while protecting evaluation integrity.
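WER (word error rate), the metric the leaderboard ranks on, is the word-level edit distance between a reference transcript and a model's hypothesis, normalized by the reference length. A minimal illustrative implementation (not the leaderboard's actual evaluation code, which applies text normalization before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (Levenshtein over words)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,       # deletion
                d[i][j - 1] + 1,       # insertion
                d[i - 1][j - 1] + cost # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

A model that drops two words from a six-word reference, for example, scores a WER of 2/6 ≈ 0.33 on that utterance.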
The update introduces fine-grained performance metrics that break down results by speech style (scripted versus conversational) and accent region (US versus non-US). These targeted metrics are computed as macro-averages across data providers to prevent developers from gaming specific subsets. Hugging Face acknowledges that models trained on data from the private providers may retain some advantage, but argues this is mitigated by the use of multiple independent data sources and by its openness to adding new providers.
- May 6, 2026 · Hugging Face Blog