Archive

01 · New benchmark on arXiv · This week
Researchers release CLIR-Bench to evaluate multimodal QA over irregular clinical time series
Introduces CLIR-Bench, a benchmark for multimodal question answering over irregular clinical time series.
Jul 14, 2026 · arXiv cs.CL · 2 min read
Trust: 79
Hype: Low
02 · Hamel Husain
Comparison finds automated evals correlate with human annotations in 100 traces
An applied AI engineer compared 100 human-annotated traces with automated eval systems to assess their reliability.
Jul 12, 2026 · Hamel Husain - applied AI engineering · 3 min read
Trust: 79
Hype: Low
03 · OpenAI analysis
OpenAI flags reliability issues in SWE-Bench Pro coding benchmark
OpenAI published an analysis identifying reliability and accuracy concerns in SWE-Bench Pro, a popular benchmark for evaluating AI coding performance.
Jul 9, 2026 · OpenAI - News · 2 min read
Trust: 79
Hype: Low
04 · Research
New benchmark ‘AgentLens’ evaluates interactive coding agents by full task trajectory, not just pass/fail
AgentLens introduces a benchmark that evaluates interactive coding agents by their full task trajectory rather than a binary pass/fail outcome.
Jul 9, 2026 · arXiv cs.AI · 2 min read
Trust: 79
Hype: Low
05 · Research announcement
New benchmark CSTutorBench evaluates small language models as tutors for block-based programming
A new benchmark called CSTutorBench evaluates how well small language models can act as tutors in K-12 computer science education using block-based programming in VEX VR.
Jul 8, 2026 · arXiv cs.AI · 2 min read
Trust: 79
Hype: Low
06 · Hugging Face Blog
ScarfBench released to evaluate AI agents on enterprise Java framework migration
ScarfBench introduces 34 enterprise Java applications, 204 migration tasks, and 1,331 expert-written tests to evaluate AI agents on framework modernization.
Jun 30, 2026 · Hugging Face Blog · 3 min read
Trust: 79
Hype: Low
07 · New benchmark on arXiv
New benchmark GPTNT reveals real-time collaboration gaps in multimodal agents
GPTNT is a new benchmark for evaluating real-time collaboration between multimodal agents using the cooperative video game 'Keep Talking and Nobody Explodes'.
Jun 30, 2026 · arXiv cs.AI · 3 min read
Trust: 79
Hype: Low
08 · Research preprint
Researchers propose broader evaluation dimensions for AI agents after benchmark accuracy saturates
When accuracy saturates on an AI benchmark, the benchmark is often retired or replaced, but that overlooks other evaluation dimensions such as construct validity, efficiency, and reliability.
Jun 26, 2026 · arXiv cs.AI · 3 min read
Trust: 79
Hype: Low
09 · Research paper
Contamination-aware benchmark finds selective abstention in instruction-tuned LLMs
A new benchmark introduces a contamination-aware, multi-zone protocol to evaluate when LLMs should answer or abstain.
Jun 26, 2026 · arXiv cs.CL · 3 min read
Trust: 79
Hype: Low
10 · Apple ML Research
Apple study finds LLM-as-a-judge panels provide roughly two independent votes’ worth of information despite nine judges
A nine-judge LLM-as-a-judge panel provides only about two independent votes’ worth of information due to correlated errors.
Jun 23, 2026 · Apple - Machine Learning Research · 3 min read
Trust: 84
Hype: Low
