Research · Jun 23, 2026

Apple study finds annotation needs depend on the evaluation metric in NLI tasks

Fine-tuning on label distributions shows entropy correlation requires roughly 20–50 annotators to converge, while distributional match saturates at about 10 annotators.

TL;DR

Annotation budgets for natural language inference should be set by the target metric, not uniformly.

Apple’s Machine Learning Research group reports that the number of annotators needed to capture label disagreement in natural language inference (NLI) depends on the evaluation metric. In experiments fine-tuning NLI models on label distributions subsampled from the ChaosNLI dataset, the team found that entropy correlation, a measure of whether a model identifies the items that elicit human disagreement, requires roughly 20 to 50 annotators per item to converge. In contrast, distributional match, measured by KL divergence, saturates at about 10 annotators, by which point it has achieved 87–95% of the total improvement across five model seeds.
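To make the subsampling idea concrete, here is a minimal sketch in Python. It is not the study's code: the three items are synthetic (not ChaosNLI data), the names `label_distribution`, `mean_kl_at_k`, and `make_votes` are hypothetical, and the additive pseudo-count `alpha` and the direction of the KL term are assumptions made for illustration. It only measures how far a k-annotator label distribution drifts from the distribution built from a full pool of 100 votes; it does not fine-tune a model, so it should not be expected to reproduce the reported saturation points.

```python
import math
import random

LABELS = ("entailment", "neutral", "contradiction")

def label_distribution(votes, alpha=0.5):
    """Turn annotator votes into a distribution over LABELS.

    alpha is an additive pseudo-count (an assumption of this sketch) that
    keeps every probability positive so the KL term is always finite.
    """
    counts = [votes.count(label) for label in LABELS]
    total = len(votes) + alpha * len(LABELS)
    return [(c + alpha) / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) in nats for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def make_votes(n_entail, n_neutral, n_contra):
    """Build a synthetic pool of votes for one item."""
    return (["entailment"] * n_entail
            + ["neutral"] * n_neutral
            + ["contradiction"] * n_contra)

def mean_kl_at_k(items, k, rng, trials=200):
    """Average KL between the full-pool distribution and a k-annotator subsample."""
    total = 0.0
    for _ in range(trials):
        for votes in items:
            full = label_distribution(votes)
            sub = label_distribution(rng.sample(votes, k))
            total += kl_divergence(full, sub)
    return total / (trials * len(items))

# Three synthetic items with 100 votes each: clear, contested, near-uniform.
items = [make_votes(90, 8, 2), make_votes(50, 40, 10), make_votes(34, 33, 33)]
rng = random.Random(0)

kl_by_k = {k: mean_kl_at_k(items, k, rng) for k in (5, 10, 20, 50, 100)}
for k, value in kl_by_k.items():
    print(f"k={k:>3}  mean KL to full pool = {value:.4f}")
```

In this toy setup the average divergence shrinks as k grows and is exactly zero at k=100, where the subsample is the full pool. How quickly a downstream metric stops improving is the empirical question the study addresses.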

The researchers attribute this metric-dependent saturation to the item-specific signal carried by soft labels, which they argue label smoothing cannot replicate. With label smoothing, entropy correlation stayed at approximately 0.45 to 0.49 across five smoothing intensities, while soft labels achieved an entropy correlation of 0.643 (p < 0.001). A per-item analysis indicated that label smoothing fails to distinguish ambiguous items from clear ones, whereas soft labels retain item-specific signal.
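One way to see why smoothing cannot stand in for soft labels is to compare the training targets themselves. The sketch below is an illustration under assumptions, not the paper's analysis: the four human distributions are made up, `smoothed_target` is a hypothetical helper implementing the standard uniform label-smoothing formula, and the smoothing intensity of 0.1 is arbitrary. It compares target entropies only and says nothing about what a fine-tuned model ends up predicting.

```python
import math

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def smoothed_target(majority_index, num_classes, epsilon):
    """Uniform label smoothing: (1 - epsilon) on the hard label,
    plus epsilon spread evenly over all classes."""
    return [
        (1 - epsilon) * (1.0 if i == majority_index else 0.0) + epsilon / num_classes
        for i in range(num_classes)
    ]

# Made-up human label distributions (entailment, neutral, contradiction).
human = [
    [0.95, 0.04, 0.01],  # clear item
    [0.40, 0.35, 0.25],  # contested item
    [0.05, 0.90, 0.05],  # clear item
    [0.34, 0.33, 0.33],  # near-uniform disagreement
]

# Soft labels keep each item's own distribution as the target.
soft_entropies = [entropy(p) for p in human]

# Label smoothing starts from the majority label and applies one fixed epsilon.
smooth_entropies = [
    entropy(smoothed_target(p.index(max(p)), len(p), epsilon=0.1)) for p in human
]

for p, h_soft, h_smooth in zip(human, soft_entropies, smooth_entropies):
    print(f"{p}  soft-label entropy={h_soft:.3f}  smoothed entropy={h_smooth:.3f}")
```

Every smoothed target has the same entropy regardless of how contested the item is, so the targets give a model no per-item cue about disagreement. The soft-label entropies, by contrast, vary from item to item.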

The findings were replicated across two architectures (DeBERTa and RoBERTa) as well as a non-NLI-pretrained baseline. An exploratory evaluation in the content safety domain further supported the soft-label advantage. The authors conclude that annotation budgets should be informed by the target evaluation metric rather than set uniformly, which could improve both annotation efficiency and model alignment with the variability of human judgment.

Sources

01 Apple Machine Learning Research: Metric-Dependent Annotation Saturation for Learning from Label Distributions
