Research · Jun 23, 2026

Apple study finds annotation needs depend on the evaluation metric in NLI tasks

Fine-tuning on label distributions shows entropy correlation requires roughly 20–50 annotators to converge, while distributional match saturates at about 10 annotators.


TL;DR
  • Annotation budgets for natural language inference should be set by the target metric, not uniformly.

Apple’s Machine Learning Research group reports that the number of annotators needed to capture label disagreement in natural language inference (NLI) depends on the evaluation metric. In experiments fine-tuning NLI models on label distributions subsampled from the ChaosNLI dataset, the team found that entropy correlation, a measure of whether a model identifies the items that elicit human disagreement, requires roughly 20 to 50 annotators per item to converge. In contrast, distributional match, measured by KL divergence, saturates at about 10 annotators per item, by which point it has captured 87–95% of the total improvement across five model seeds.
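To make the setup concrete, here is a minimal Python sketch of subsampling annotators and measuring how far the resulting label distribution drifts from the full one. It is an illustration under stated assumptions, not the study's code. The vote counts, the three-class encoding, the direction of the KL divergence, and the helper names (`label_distribution`, `subsample_distribution`) are all hypothetical.

```python
import math
import random

def label_distribution(votes, num_classes=3):
    """Turn a list of class-index votes into an empirical probability vector."""
    counts = [0] * num_classes
    for v in votes:
        counts[v] += 1
    return [c / len(votes) for c in counts]

def entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q). q is floored at eps so an empty class in q gives a large
    but finite penalty. The direction is an assumption of this sketch."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def subsample_distribution(votes, k, rng, num_classes=3):
    """Label distribution from k annotators drawn without replacement."""
    return label_distribution(rng.sample(votes, k), num_classes)

# Hypothetical item: 60 entailment (0), 30 neutral (1), 10 contradiction (2).
votes = [0] * 60 + [1] * 30 + [2] * 10
full = label_distribution(votes)

rng = random.Random(0)
for k in (5, 10, 20, 50):
    kls = [kl_divergence(full, subsample_distribution(votes, k, rng))
           for _ in range(200)]
    print(f"k={k:>2}  mean KL to full distribution: {sum(kls) / len(kls):.4f}")
```

This only measures how noisy the training targets are at each annotator count. The saturation points reported in the study refer to models fine-tuned on such targets, which this sketch does not reproduce.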

The researchers attribute this metric-dependent saturation to the signal carried by soft labels, which they argue label smoothing cannot replicate. Across five smoothing intensities, entropy correlation under label smoothing stayed between approximately 0.45 and 0.49, while soft labels reached an entropy correlation of 0.643 (p < 0.001). A per-item analysis indicated that label smoothing fails to distinguish ambiguous items from clear ones, whereas soft labels retain item-specific signal.
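One way to see why label smoothing cannot carry item-specific signal is to compare the training targets directly. The sketch below uses the standard label smoothing formula and two hypothetical items, one clear and one ambiguous. The distributions, the `epsilon` value, and the helper name `smoothed_target` are illustrative assumptions, not values from the study.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

def smoothed_target(soft_label, epsilon=0.1):
    """Standard label smoothing applied to the majority class:
    (1 - epsilon) on the argmax plus epsilon spread uniformly over all classes."""
    n = len(soft_label)
    majority = max(range(n), key=lambda c: soft_label[c])
    return [(1 - epsilon) * (1.0 if c == majority else 0.0) + epsilon / n
            for c in range(n)]

# Hypothetical annotator distributions over (entailment, neutral, contradiction).
clear_item = [0.96, 0.03, 0.01]      # annotators almost unanimous
ambiguous_item = [0.45, 0.40, 0.15]  # annotators split

for name, soft in (("clear", clear_item), ("ambiguous", ambiguous_item)):
    smooth = smoothed_target(soft)
    print(f"{name:>9}: soft-label entropy={entropy(soft):.3f}  "
          f"smoothed-target entropy={entropy(smooth):.3f}")
```

Both items share the same majority class, so their smoothed targets are identical and have identical entropy, while the soft labels differ sharply. This covers the targets only. The entropy correlations quoted above were measured on trained models.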

The findings were replicated across two architectures—DeBERTa and RoBERTa—as well as a non-NLI-pretrained baseline. An exploratory evaluation in the content safety domain further supported the soft-label advantage. The authors conclude that annotation budgets should be informed by the target evaluation metric rather than set uniformly, potentially improving both efficiency and model alignment with human judgment variability.

Sources
  1. Apple Machine Learning Research, "Metric-Dependent Annotation Saturation for Learning from Label Distributions"
