Tools · Jun 30, 2026

Hugging Face integrates Every Eval Ever results into Community Evals for standardized model benchmarking

New converter links EEE’s structured evaluation records to model cards and leaderboards, improving transparency and reproducibility of benchmark scores on the Hub.

1 source · cross-referenced

TL;DR

Hugging Face and the EvalEval Coalition launched interoperable evaluation reporting to unify how benchmark scores are shared on the Hub.
A new converter maps Every Eval Ever (EEE) JSON records into Hugging Face Community Evals YAML files, enabling cross-posting of scores with full provenance.
The EEE datastore now holds roughly 229,000 evaluation results across more than 22,000 models and 2,200 benchmarks, and cross-posted scores link back to their source records from model pages.
Contributors can submit results via pull requests, and model authors can review or hide entries; each score carries a badge indicating provenance.

Hugging Face and the EvalEval Coalition announced interoperability between Every Eval Ever (EEE) and Hugging Face Community Evals, enabling cross-posting and standardized interpretation of evaluation results. The integration links open model pages, leaderboards, and a unified metadata store, addressing gaps in how users, researchers, and policymakers trust and compare evaluations.

EEE, launched in February 2026 as part of the EvalEval Coalition, introduced a JSON schema to standardize evaluation reporting. The schema records who ran the evaluation, which model was tested, how it was accessed, generation settings, and the meaning of metrics, with optional per-sample outputs. The datastore has since grown to approximately 229,000 evaluation results across more than 22,000 models and 2,200 benchmarks, aggregated from 31 different reporting formats.
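To make those categories concrete, here is a minimal Python sketch of a record carrying the kinds of information the schema is described as capturing. The class name, field names, and example values are all hypothetical; this is not the actual EEE schema, only an illustration of its described contents.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvalRecord:
    """Illustrative stand-in for an EEE-style record (field names are assumptions)."""
    evaluator: str              # who ran the evaluation
    model: str                  # which model was tested
    access_method: str          # how the model was accessed
    generation_settings: dict   # e.g. temperature, token limits
    metric_name: str
    metric_description: str     # what the metric means
    score: float
    samples: Optional[list] = None  # optional per-sample outputs


def to_json(record: EvalRecord) -> str:
    """Serialize a record, omitting the optional per-sample field when absent."""
    data = {k: v for k, v in asdict(record).items() if v is not None}
    return json.dumps(data, sort_keys=True, indent=2)


# Hypothetical example values, not taken from any real evaluation.
example = EvalRecord(
    evaluator="example-lab",
    model="example-org/example-model",
    access_method="local weights",
    generation_settings={"temperature": 0.0},
    metric_name="accuracy",
    metric_description="fraction of questions answered correctly",
    score=0.5,
)
```

Calling `to_json(example)` yields a JSON object with one key per populated field; the `samples` key only appears when per-sample outputs are supplied.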

Hugging Face Community Evals, also launched in February 2026, decentralizes benchmark score reporting on the Hub. Benchmarks register via an eval.yaml file, and model scores are stored in .eval_results/*.yaml files within model repositories. Scores appear on model cards and feed into benchmark leaderboards, with badges indicating whether results are author-submitted, community-submitted, or independently verified.
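The file layout described above can be explored in a local checkout of a model repository. The sketch below only assumes the `.eval_results/*.yaml` path pattern from the article; the function name and the `Provenance` enum are hypothetical helpers, and the contents of the YAML files are deliberately left unparsed because their structure is not specified here.

```python
from enum import Enum
from pathlib import Path


class Provenance(Enum):
    """The three badge categories named in the article (enum itself is illustrative)."""
    AUTHOR = "author-submitted"
    COMMUNITY = "community-submitted"
    VERIFIED = "independently verified"


def find_score_files(repo_root: Path) -> list[Path]:
    """Return the .eval_results/*.yaml files in a local checkout of a model repo."""
    results_dir = repo_root / ".eval_results"
    if not results_dir.is_dir():
        return []  # repository has no community eval scores yet
    return sorted(results_dir.glob("*.yaml"))
```

A caller might pass `Path("path/to/local/model-repo")` and iterate over the returned paths; non-YAML files in the directory are ignored.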

A new converter bridges the two systems by mapping EEE JSON records into the YAML format required for Hugging Face Community Evals. The tool reads an EEE datastore collection, checks object hashes, and audits existing scores in a model’s repository. It then generates previews and opens pull requests only after explicit sign-off. It currently supports four official benchmarks: MMLU-Pro, GPQA, HLE, and GSM8K.
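The converter's workflow can be sketched as a pure decision function. This is an illustration under stated assumptions, not the actual tool: the function names, the record keys (`benchmark`, `score`), the SHA-256 hashing scheme, the preview layout, and the choice to skip a benchmark that already has a score are all hypothetical. Only the order of checks and the list of supported benchmarks come from the article.

```python
import hashlib
import json

# The four benchmarks the article says are currently supported.
SUPPORTED_BENCHMARKS = {"MMLU-Pro", "GPQA", "HLE", "GSM8K"}


def content_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding (the hashing scheme is an assumption)."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_conversion(record: dict, expected_hash: str,
                    existing_benchmarks: set, approved: bool) -> dict:
    """Decide what to do with one record; returns a plan and performs no I/O."""
    benchmark = record["benchmark"]
    if benchmark not in SUPPORTED_BENCHMARKS:
        return {"action": "skip", "reason": "unsupported benchmark"}
    # Integrity check: the record must match the hash stored alongside it.
    if content_hash(record) != expected_hash:
        return {"action": "skip", "reason": "hash mismatch"}
    # Audit step: assumed here to skip benchmarks the repo already reports.
    if benchmark in existing_benchmarks:
        return {"action": "skip", "reason": "score already present"}
    # Hypothetical preview text; the real YAML layout is not shown in the article.
    preview = f"benchmark: {benchmark}\nscore: {record['score']}\n"
    # A pull request is proposed only after explicit sign-off.
    action = "open_pull_request" if approved else "preview_only"
    return {"action": action, "preview": preview}
```

Keeping the plan separate from any network call means a preview can be inspected safely before anything is submitted, which mirrors the sign-off step described above.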

When a contributor submits a result to both EEE and Community Evals, the score appears on the model page and the benchmark leaderboard, with a source badge linking back to the full EEE record. That record includes the generation configuration, harness version, reproducibility notes, and instance-level data, making evaluations both visible and legible.

Sources

01 Hugging Face: Featuring Every Eval Ever Results on Hugging Face Model Pages
