Evals · Jun 26, 2026

Contamination-aware benchmark finds selective abstention in instruction-tuned LLMs

A new multi-zone evaluation protocol separates answerability from guessing and contamination. Qwen2.5-3B-Instruct is the most reliable of the models tested.

Trust: 79
Hype: Low

1 source · cross-referenced

TL;DR
  • A new benchmark introduces a contamination-aware, multi-zone protocol to evaluate when LLMs should answer or abstain.

Researchers propose Know2Guess, a contamination-aware benchmark designed to measure the transition from answerable knowledge to abstention-expected unknowns in large language models. The benchmark includes 1,200 items across five domains, with explicit abstention expectations and contamination-risk metadata. It also features dual parsing via an official strict parser and a normalized robustness parser.
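To make the dual-parsing idea concrete, here is a minimal sketch of how a strict parser and a more forgiving normalized parser might disagree on the same response. The `ANSWER: ...` / `ABSTAIN` output format, the abstention phrase list, and both function names are assumptions made for illustration; the article does not describe the benchmark's actual parsers.

```python
import re
import string

# Hypothetical phrases the lenient parser treats as abstention.
ABSTAIN_PHRASES = ("abstain", "i don't know", "i do not know", "cannot answer")


def strict_parse(response: str):
    """Accept only 'ABSTAIN' or a single-line 'ANSWER: <text>'; otherwise None."""
    text = response.strip()
    if text == "ABSTAIN":
        return ("abstain", None)
    match = re.fullmatch(r"ANSWER: (.+)", text)
    if match:
        return ("answer", match.group(1).strip())
    return None  # malformed output counts as unparseable


def normalized_parse(response: str):
    """Lowercase, detect abstention phrases, strip answer prefixes and punctuation."""
    text = response.strip().lower()
    if any(phrase in text for phrase in ABSTAIN_PHRASES):
        return ("abstain", None)
    text = re.sub(r"^(answer|the answer is)\s*[:\-]?\s*", "", text)
    text = text.strip(string.punctuation + " ")
    if text:
        return ("answer", text)
    return None
```

Under these assumptions, a reply such as `The answer is Paris.` is rejected by `strict_parse` but recovered as `("answer", "paris")` by `normalized_parse`. Comparing scores under both parsers is one way to check that a ranking does not hinge on formatting compliance alone.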

The authors evaluate FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct under locked answer-or-abstain prompts, answer-only controls, and prompt-template variants. They find that generic non-answer behavior does not solve the benchmark: FLAN baselines remain weak on productive abstention, while stronger instruction-tuned models show a selective but incomplete transition from answering to abstaining.

Qwen2.5-3B-Instruct achieves the best overall reliability among the tested models, but answer-expected zones remain difficult, calibration is poor, and benign-item refusal persists. Prompt and parser robustness analyses preserve the main ranking and qualitative conclusions.
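The distinction between difficulty on answer-expected items, productive abstention, and benign-item refusal can be sketched with three simple rates. The `Item` fields, the metric names, and the `summarize` helper below are illustrative assumptions, not the paper's reliability metric, which the article does not define.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    expected: str              # "answer" or "abstain" (hypothetical labels)
    gold: Optional[str]        # reference answer; None when abstention is expected
    action: str                # parsed model action: "answer" or "abstain"
    prediction: Optional[str]  # parsed answer text; None when the model abstained


def summarize(items):
    """Compute illustrative rates for answer-expected and abstain-expected items."""
    answerable = [i for i in items if i.expected == "answer"]
    unanswerable = [i for i in items if i.expected == "abstain"]

    def rate(hits, total):
        return hits / total if total else float("nan")

    correct = sum(i.action == "answer" and i.prediction == i.gold for i in answerable)
    refused = sum(i.action == "abstain" for i in answerable)
    abstained = sum(i.action == "abstain" for i in unanswerable)
    return {
        "answer_accuracy": rate(correct, len(answerable)),
        "benign_refusal_rate": rate(refused, len(answerable)),
        "abstention_rate": rate(abstained, len(unanswerable)),
    }
```

A model that abstains on everything would score a perfect `abstention_rate` here but also a `benign_refusal_rate` of 1.0 and zero `answer_accuracy`, which is the kind of generic non-answer behavior the authors report does not solve the benchmark.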

The benchmark is intended to provide a reproducible protocol for auditing answerability, abstention, refusal, and contamination as distinct but interacting dimensions of LLM reliability. The dataset is publicly available on GitHub.

Sources
  1. arXiv cs.CL: Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.
