Evals · Jun 26, 2026

Contamination-aware benchmark finds selective abstention in instruction-tuned LLMs

New multi-zone evaluation protocol separates answerability from guessing and contamination, with Qwen2.5-3B-Instruct leading reliability among tested models.

Trust: 79

Hype: Low

1 source · cross-referenced


TL;DR

A new benchmark introduces a contamination-aware, multi-zone protocol to evaluate when LLMs should answer or abstain.

Researchers propose Know2Guess, a contamination-aware benchmark designed to measure the transition from answerable knowledge to abstention-expected unknowns in large language models. The benchmark includes 1,200 items across five domains, with explicit abstention expectations and contamination-risk metadata. It also features dual parsing via an official strict parser and a normalized robustness parser.
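To make the item metadata and dual parsing concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the benchmark's actual schema or parsers: the field names, the `ANSWER:` / `ABSTAIN` output format, and both parsing rules are hypothetical, since the summary does not specify them.

```python
import re
from dataclasses import dataclass
from typing import Optional

ABSTAIN = "ABSTAIN"  # hypothetical abstention token, not taken from the paper


@dataclass
class Item:
    """Hypothetical item record; field names are assumptions."""
    question: str
    domain: str
    expect_abstain: bool       # explicit abstention expectation
    contamination_risk: str    # e.g. "low" or "high" (labels assumed)
    gold: Optional[str] = None


def strict_parse(output: str) -> Optional[str]:
    """Accept only an exact 'ABSTAIN' or 'ANSWER: <text>'; otherwise None."""
    text = output.strip()
    if text == ABSTAIN:
        return ABSTAIN
    match = re.fullmatch(r"ANSWER: (.+)", text)
    return match.group(1) if match else None


def normalized_parse(output: str) -> Optional[str]:
    """Tolerate case, extra whitespace, and trailing '.' or '!'."""
    text = output.strip().rstrip(".!").strip()
    if text.upper() == ABSTAIN:
        return ABSTAIN
    match = re.fullmatch(r"answer\s*:\s*(.+)", text, flags=re.IGNORECASE)
    return match.group(1).strip() if match else None
```

In this sketch, an output such as `abstain.` is unparseable for the strict parser but counts as an abstention for the normalized one. Comparing scores under both parsers is one way to check whether a ranking depends on formatting compliance rather than on the underlying decision.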

The authors evaluate FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct under locked answer-or-abstain prompts, answer-only controls, and prompt-template variants. They find that generic non-answer behavior does not solve the benchmark: the FLAN-T5 baselines remain weak on productive abstention, while the stronger instruction-tuned models show a selective but incomplete transition from answering to abstaining.

Qwen2.5-3B-Instruct achieves the best overall reliability among the tested models, but answer-expected zones remain difficult, calibration is poor, and benign-item refusal persists. Prompt and parser robustness analyses preserve the main ranking and qualitative conclusions.
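The quantities named here can be illustrated with a short sketch. The summary does not give the paper's metric definitions, so everything below is an assumption: the `Result` fields, the metric names, the availability of a per-item confidence score, and the choice of equal-width-bin expected calibration error (ECE). The sketch also collapses refusal and abstention into a single `declined` flag, which simplifies the distinction the authors draw.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical per-item outcome; field names are assumptions."""
    expect_abstain: bool  # benchmark expects an abstention on this item
    declined: bool        # model abstained or refused instead of answering
    correct: bool         # answer matched the gold label (False if declined)
    confidence: float     # confidence in [0, 1], assumed to be available


def rate(flags):
    """Fraction of True values, or NaN for an empty group."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def summarize(results):
    answerable = [r for r in results if not r.expect_abstain]
    unknown = [r for r in results if r.expect_abstain]
    return {
        "answer_accuracy": rate(r.correct for r in answerable),
        "benign_refusal_rate": rate(r.declined for r in answerable),
        "productive_abstention_rate": rate(r.declined for r in unknown),
    }


def expected_calibration_error(results, n_bins=10):
    """Equal-width-bin ECE over the items the model actually answered."""
    answered = [r for r in results if not r.declined]
    if not answered:
        return float("nan")
    bins = [[] for _ in range(n_bins)]
    for r in answered:
        index = min(int(r.confidence * n_bins), n_bins - 1)
        bins[index].append(r)
    ece = 0.0
    for group in bins:
        if group:
            accuracy = sum(r.correct for r in group) / len(group)
            mean_conf = sum(r.confidence for r in group) / len(group)
            ece += len(group) / len(answered) * abs(accuracy - mean_conf)
    return ece
```

Reporting these numbers separately shows why a model that declines everything does not score well: its productive abstention rate is perfect, but its answer accuracy is zero and its benign refusal rate is maximal.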

The benchmark is intended to provide a reproducible protocol for auditing answerability, abstention, refusal, and contamination as distinct but interacting dimensions of LLM reliability. The dataset is publicly available on GitHub.

Sources

01 · arXiv cs.CL · Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models
