Open ASR Leaderboard adds a private dataset to counter benchmark gaming: Adding Benchmaxxer Repellant to the Open ASR Leaderboard

Hugging Face Blog · huggingface.co · 2026/05/06 09:00

AI summary

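Hugging Face has introduced a private test set to the Open ASR Leaderboard, adding a mechanism to prevent benchmark over-optimization (benchmaxxing).
This makes it possible to measure a model's true generalization performance and to detect overfitting to public data.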

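A private test dataset has been added to the Open ASR Leaderboard, operated by Hugging Face, as a countermeasure against benchmark gaming (benchmaxxing). It is an important improvement for measuring the true generalization performance of speech recognition models.

The Open ASR Leaderboard is known as a representative evaluation platform that compares Word Error Rate (WER) across multiple public datasets such as LibriSpeech, Common Voice, and TED-LIUM. Public datasets, however, have a fundamental weakness: developers can tune against the contents of the test set or let it leak into training data, obtaining high scores that do not reproduce in real-world use. This is the problem known as "benchmaxxing."

The newly introduced private test set appears to be held only by the leaderboard maintainers, who evaluate submitted models internally. This makes it possible to distinguish models that have overfit to public benchmarks from models that generalize well. The article likely also shows cases where rankings change when multiple models are compared on public and private data.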



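As background, competition in ASR is intensifying on both the commercial and open fronts, with OpenAI's Whisper, NVIDIA's Canary and Parakeet, and AssemblyAI's Universal. In the LLM field, meanwhile, contamination of major benchmarks such as MMLU and HumanEval has long been debated, and the move toward private evaluation, as in Scale AI's SEAL Leaderboard, is spreading. The Open ASR Leaderboard's change of policy can be seen as bringing this trend to the speech domain.

For users and researchers, model selection becomes more reliable. Submitters, on the other hand, need to hand over model weights or inference code to the maintainers, so how closed commercial APIs are handled will likely be an operational issue going forward.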

Hugging Face has updated its Open ASR Leaderboard with a private, held-out test set, an effort framed as a defense against the increasingly common practice of optimizing models specifically for public benchmarks — sometimes called "benchmaxxing." The change is intended to give a clearer picture of how speech recognition systems actually generalize beyond the datasets that researchers and engineers have come to know intimately.

The Open ASR Leaderboard has, over the past few years, become one of the de facto reference points for comparing automatic speech recognition models. It aggregates Word Error Rate (WER) scores across a battery of public corpora including LibriSpeech, Common Voice, TED-LIUM, GigaSpeech, SPGISpeech, VoxPopuli, Earnings-22 and AMI. While that breadth helps, all of those datasets share a common vulnerability: their test splits are downloadable. Developers can inspect failure cases, tune decoding heuristics around known quirks, or — intentionally or not — let test audio leak into training pipelines. The result is leaderboard scores that do not always translate to production performance.
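For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and a model's hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition, not the leaderboard's own scoring code: the function name `word_error_rate` is hypothetical, and the sketch skips the text normalization (casing, punctuation) that real evaluation pipelines typically apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words (dynamic programming row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because the denominator is the reference length, WER can exceed 1.0 when a hypothesis contains many inserted words.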

The newly added private evaluation reportedly consists of test data held only by the leaderboard maintainers, with submitted models scored internally rather than against publicly accessible audio and transcripts. According to the post, comparing the same models on public versus private data surfaces meaningful re-orderings: systems that look state-of-the-art on familiar benchmarks do not necessarily retain their lead when faced with audio they could not have seen during development. That gap is precisely the signal a leaderboard is supposed to provide.
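One simple way to picture such a re-ordering is to rank the same models by WER on each track and look at how far each one moves. The sketch below does that under stated assumptions: the model names and WER values are invented purely for illustration (they are not results from the post), and the helper names `rank_by_wer` and `rank_shifts` are hypothetical.

```python
def rank_by_wer(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its rank, where rank 1 is the lowest WER."""
    ordered = sorted(scores, key=scores.get)
    return {model: rank for rank, model in enumerate(ordered, start=1)}


def rank_shifts(public: dict[str, float], private: dict[str, float]) -> dict[str, int]:
    """Private rank minus public rank; positive means the model fell."""
    if public.keys() != private.keys():
        raise ValueError("both tracks must score the same set of models")
    public_rank = rank_by_wer(public)
    private_rank = rank_by_wer(private)
    return {model: private_rank[model] - public_rank[model] for model in public}


# Hypothetical WER percentages, invented for illustration only.
public_wer = {"model_a": 5.1, "model_b": 5.4, "model_c": 6.0}
private_wer = {"model_a": 9.8, "model_b": 7.2, "model_c": 7.9}

shifts = rank_shifts(public_wer, private_wer)
# shifts == {"model_a": 2, "model_b": -1, "model_c": -1}
```

In this made-up case, model_a leads on public data but drops two places on the held-out track, which is the pattern a private set is meant to expose.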

The move mirrors a broader shift across machine learning evaluation. In the LLM world, contamination of benchmarks such as MMLU, GSM8K and HumanEval has been a recurring concern, and projects like Scale AI's SEAL Leaderboard, LMSYS's arena-style human preference evaluations, and various private-eval initiatives have emerged in response. Speech recognition has historically been somewhat insulated from this dynamic because acoustic data is harder to memorize incidentally than text, but as ASR models trend toward general-purpose audio foundation models trained on web-scale corpora, the same contamination risks apply. Whisper-style training recipes that scrape hundreds of thousands of hours of audio from the open internet can easily ingest portions of public test sets without the developers realizing it.
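A crude way to look for this kind of leakage, borrowed from text decontamination practice rather than described in the post, is to check how many word n-grams of a test transcript also appear in the training transcripts. The sketch below assumes transcripts are available as plain text; the function names and the default n-gram length of 8 are illustrative choices, and a text-only check cannot detect leaked audio whose transcript was produced differently.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a transcript."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(test_transcript: str, training_texts: list[str], n: int = 8) -> float:
    """Fraction of the test transcript's n-grams found in any training text."""
    test_grams = word_ngrams(test_transcript, n)
    if not test_grams:
        return 0.0  # transcript shorter than n words: nothing to compare
    train_grams: set[tuple[str, ...]] = set()
    for text in training_texts:
        train_grams |= word_ngrams(text, n)
    return len(test_grams & train_grams) / len(test_grams)


# Toy example with n=3: one of the three test trigrams appears in training.
fraction = overlap_fraction("a b c d e", ["x a b c y"], n=3)
```

A high fraction for many test utterances would suggest the test split overlaps the training corpus and is worth investigating, though a low fraction does not prove the data is clean.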

The competitive backdrop helps explain why the maintainers see this as urgent. The ASR field has grown crowded with strong entrants: OpenAI's Whisper family remains a baseline, NVIDIA has pushed Canary and Parakeet to the top of public rankings, AssemblyAI has promoted its Universal models, and a steady stream of open-weight releases from Mistral-affiliated efforts and Chinese research groups continues to claim incremental WER wins. When fractions of a percentage point determine bragging rights, the incentive to overfit to the leaderboard, even unconsciously, grows accordingly.

For practitioners choosing a model, a private evaluation track should make the rankings more trustworthy as a proxy for real-world deployment. For model providers, however, it introduces logistical friction. Submitting to a private benchmark typically requires handing over model weights or, at minimum, an inference container that the maintainers can run on held-out audio. That is straightforward for open-weight releases but awkward for closed commercial APIs, which may need bespoke arrangements or rate-limited endpoints. How the leaderboard handles proprietary systems — and whether vendors such as Deepgram, AssemblyAI, Google, or Microsoft will participate on equal footing — is likely to shape its long-term credibility.

There are also open methodological questions. A single private set can itself become stale or biased toward particular acoustic conditions, and rotating it periodically introduces its own comparability problems. The maintainers will need to balance freshness, domain coverage (telephony, broadcast, conversational, accented speech) and reproducibility. Even so, the direction of travel is clear: as foundation models absorb more of the open web, evaluation infrastructure has to move in the opposite direction, holding back data the models cannot have seen. The Open ASR Leaderboard's update brings speech recognition in line with that emerging norm, and may pressure other audio benchmarks to follow.