Tools · May 6, 2026

Hugging Face adds private datasets to Open ASR Leaderboard to combat benchmarking gaming

The Open ASR Leaderboard now incorporates high-quality private speech datasets from Appen and DataoceanAI to reduce model overfitting to public test sets, while keeping the leaderboard's main ranking metric based on public data by default.

Trust: 70 · Hype: low

1 source

TL;DR
  • Hugging Face introduced private English speech datasets covering multiple accents and styles to the Open ASR Leaderboard, sourced from Appen Inc. and DataoceanAI with over 29 hours of audio across scripted and conversational speech.
  • Private datasets remain hidden from model developers to prevent benchmarking optimization (benchmaxxing) and test-set contamination, addressing a known vulnerability in open benchmarks.
  • The leaderboard's primary Average WER (word error rate) metric continues to use only public datasets by default; private data can be toggled on separately to observe its impact on model rankings.
  • The update includes targeted metrics for scripted versus conversational speech and US versus non-US accents, designed to surface performance gaps that single-metric leaderboards typically obscure.

The Open ASR Leaderboard, which has received over 710,000 visits since September 2023, now incorporates private evaluation datasets to reduce model overfitting to public benchmarks. Hugging Face collaborated with data providers Appen Inc. and DataoceanAI to curate high-quality speech datasets spanning multiple English accents and speaking styles.

The new private dataset includes approximately 29 hours of audio across Australian, Canadian, Indian, US, and British English accents. Each accent variant exists in both scripted and conversational forms, with datasets ranging from 1.02 to 8.82 hours. The datasets are intentionally withheld from model developers to prevent explicit optimization against the test sets.

The leaderboard implements a structural safeguard: the default Average WER metric excludes private dataset results, preventing private data from immediately affecting model rankings. Developers can optionally toggle private data visibility to observe downstream effects, but the primary leaderboard ranking remains based on public datasets. This design choice prioritizes transparency while protecting evaluation integrity.
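The aggregation described above can be sketched in a few lines. This is an illustrative reconstruction, not the leaderboard's actual code: the function name, dataset names, and WER values are all made up for the example. The key behavior is that private results are excluded from the headline average unless the viewer opts in.

```python
# Hypothetical sketch of the leaderboard's default aggregation: the headline
# Average WER is a mean over public datasets only, and private results are
# folded in only when explicitly requested. All names and numbers below are
# illustrative, not real leaderboard data.

def average_wer(results: dict[str, float], private: set[str],
                include_private: bool = False) -> float:
    """Mean WER over public datasets; private sets join only on request."""
    selected = [
        wer for name, wer in results.items()
        if include_private or name not in private
    ]
    return sum(selected) / len(selected)

results = {                      # per-dataset WER (%), illustrative values
    "librispeech-clean": 2.1,
    "ami": 11.4,
    "appen-private": 9.8,        # hidden from the default metric
    "dataocean-private": 10.5,   # hidden from the default metric
}
private = {"appen-private", "dataocean-private"}

default_avg = average_wer(results, private)                        # public only
with_private = average_wer(results, private, include_private=True)
```

Because the default call ignores the private entries, a model tuned against the public sets keeps its headline rank, while the opt-in view can reveal whether that rank survives on unseen data.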

The update introduces fine-grained performance metrics that break down results by speech style (scripted versus conversational) and accent region (US versus non-US). These targeted metrics are computed as macroaverages across data providers to prevent developers from gaming specific subsets. Hugging Face acknowledges that models trained on data from the private providers may retain some advantage; it mitigates this by drawing on multiple independent data sources and remaining open to adding new providers.
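The macroaverage idea can be illustrated concretely. The sketch below is an assumption about how such a metric might be computed, with invented subset names and WER values: each provider's subsets are averaged first, then the per-provider means are averaged, so a provider contributing many subsets cannot dominate the targeted metric.

```python
# Illustrative macroaverage across data providers: average WER within each
# provider first, then average those per-provider means. Provider names and
# values are hypothetical, not real leaderboard data.

def macro_average(per_dataset_wer: dict[str, float],
                  provider_of: dict[str, str]) -> float:
    by_provider: dict[str, list[float]] = {}
    for dataset, wer in per_dataset_wer.items():
        by_provider.setdefault(provider_of[dataset], []).append(wer)
    provider_means = [sum(v) / len(v) for v in by_provider.values()]
    return sum(provider_means) / len(provider_means)

wers = {"appen-us-scripted": 6.0,
        "appen-us-conversational": 10.0,
        "dataocean-us-scripted": 5.0}
providers = {"appen-us-scripted": "appen",
             "appen-us-conversational": "appen",
             "dataocean-us-scripted": "dataocean"}

# Appen mean = 8.0, DataoceanAI mean = 5.0, so the macroaverage is 6.5,
# whereas a plain mean over all three datasets would be 7.0.
score = macro_average(wers, providers)
```

Weighting providers equally (rather than datasets) means overfitting to one provider's subsets moves the metric less, which is the anti-gaming property the article describes.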

Sources
  1. Hugging Face — Adding Benchmaxxer Repellant to the Open ASR Leaderboard

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.

Built by a person who cares about honest AI news.