Contamination-aware benchmark finds selective abstention in instruction-tuned LLMs
New multi-zone evaluation protocol separates answerability from guessing and contamination, with Qwen2.5-3B-Instruct leading reliability among tested models.
- A new benchmark introduces a contamination-aware, multi-zone protocol to evaluate when LLMs should answer or abstain.
Researchers propose Know2Guess, a contamination-aware benchmark designed to measure the transition from answerable knowledge to abstention-expected unknowns in large language models. The benchmark includes 1,200 items across five domains, each carrying an explicit abstention expectation and contamination-risk metadata. Model outputs are parsed twice: once by an official strict parser and once by a normalized robustness parser.
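To make the dual-parsing idea concrete, here is a minimal sketch of how a strict parser and a normalized parser might differ. It is not the benchmark's actual code. The `Item` fields, the `ANSWER:` / `ABSTAIN` output format, and the list of abstention phrases are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical abstention phrases; the benchmark's real list is not given in the source.
ABSTAIN_PHRASES = {"abstain", "i don't know", "i do not know", "unknown"}


@dataclass
class Item:
    """Illustrative item record; field names are assumptions, not the real schema."""
    question: str
    domain: str
    expects_abstention: bool
    contamination_risk: str  # e.g. "low" or "high"


def strict_parse(output: str):
    """Accept only the exact forms 'ABSTAIN' or 'ANSWER: <text>' on a single line."""
    text = output.strip()
    if text == "ABSTAIN":
        return ("abstain", None)
    match = re.fullmatch(r"ANSWER: (.+)", text)
    if match:
        return ("answer", match.group(1))
    return ("invalid", None)


def normalized_parse(output: str):
    """Tolerate case, extra whitespace, edge punctuation, and a missing prefix."""
    text = " ".join(output.strip().lower().split())
    text = text.strip(" .!\"'")
    text = re.sub(r"^answer\s*[:\-]\s*", "", text)
    if not text:
        return ("invalid", None)
    if text in ABSTAIN_PHRASES:
        return ("abstain", None)
    return ("answer", text)
```

Under these assumptions, an output such as `answer: paris.` is rejected by the strict parser but accepted by the normalized one. Comparing scores across the two shows how much of a model's measured performance depends on formatting compliance rather than on its answer-or-abstain decision.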
The authors evaluate FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct under locked answer-or-abstain prompts, answer-only controls, and prompt-template variants. They find that generic non-answer behavior does not solve the benchmark: FLAN baselines remain weak on productive abstention, while stronger instruction-tuned models show a selective but incomplete transition from answering to abstaining.
Qwen2.5-3B-Instruct achieves the best overall reliability among the tested models, but answer-expected zones remain difficult, calibration is poor, and benign-item refusal persists. Prompt and parser robustness analyses preserve the main ranking and qualitative conclusions.
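The quantities mentioned above can be illustrated with a small scoring sketch. The metric names, the three outcome labels, and the reading of "benign-item refusal" as a refusal on an item where an answer was expected are assumptions made here, not definitions taken from the paper.

```python
def summarize(records):
    """Summarize parsed model outputs.

    records: iterable of (expects_abstention, outcome, correct) tuples, where
    outcome is "answer", "abstain", or "refuse" (hypothetical labels).
    """
    records = list(records)
    answer_expected = [r for r in records if not r[0]]
    abstain_expected = [r for r in records if r[0]]

    def rate(rows, predicate):
        # Fraction of rows satisfying the predicate; NaN when there are no rows.
        if not rows:
            return float("nan")
        return sum(1 for r in rows if predicate(r)) / len(rows)

    return {
        # Correct answers on items where an answer was expected.
        "answer_accuracy": rate(answer_expected, lambda r: r[1] == "answer" and r[2]),
        # Abstentions on items where abstention was expected.
        "productive_abstention": rate(abstain_expected, lambda r: r[1] == "abstain"),
        # Refusals on items where an answer was expected (assumed meaning of "benign").
        "benign_refusal": rate(answer_expected, lambda r: r[1] == "refuse"),
    }
```

Reporting these rates separately, rather than as one accuracy figure, keeps a model that refuses everything from looking reliable: it would score well on abstention-expected items but poorly on the other two measures.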
The benchmark is intended to provide a reproducible protocol for auditing answerability, abstention, refusal, and contamination as distinct but interacting dimensions of LLM reliability. The dataset is publicly available on GitHub.