Chain-of-Thought Reasoning Amplifies Position Bias in Multiple-Choice Questions
A new study challenges the assumption that extended reasoning reduces model bias, finding instead that longer reasoning trajectories correlate with increased susceptibility to answer-position preferences across DeepSeek-R1 and other reasoning-tuned models.
- Chain-of-thought reasoning in models like DeepSeek-R1 does not eliminate position bias in multiple-choice tasks; instead, longer reasoning chains correlate with stronger bias toward certain answer positions (correlation 0.11–0.41, all p<0.05).
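- Across thirteen configurations tested on MMLU, ARC-Challenge, and GPQA, twelve showed a statistically significant positive correlation between reasoning length and position bias, and all open-weight reasoning variants showed position bias rising monotonically across reasoning-length quartiles.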
- A truncation intervention confirmed causality: resuming reasoning from later points in a chain increased the likelihood of shifting toward position-preferred answers (16% to 32% for R1-Qwen-7B).
- DeepSeek-R1 at 671B shows lower aggregate position bias (0.019), but the length effect persists in its longest reasoning quartile, suggesting accuracy may mask rather than eliminate the underlying mechanism.
- The authors propose diagnostic tools for auditing position bias in reasoning models and argue that multi-choice evaluation pipelines should not assume reasoning-tuned models are order-robust.
Chain-of-thought reasoning has become a foundational technique for improving model performance on complex tasks, on the assumption that extended thinking reduces reliance on shallow heuristics and spurious correlations. A new preprint challenges this premise by examining how reasoning models handle position bias, a persistent tendency to favor answers in specific locations within multiple-choice sets. Across DeepSeek-R1 and its distilled variants, plus base models prompted with CoT instructions, the researchers found a counterintuitive pattern: the longer the reasoning chain, the more pronounced the position bias.
The study analyzed thirteen distinct reasoning configurations on three major benchmarks (MMLU, ARC-Challenge, GPQA) using a Position Bias Score (PBS) metric. Twelve of the thirteen configurations showed statistically significant positive correlations between reasoning trajectory length and position bias (r ranging 0.11–0.41, p<0.05), even after controlling for accuracy. All open-weight reasoning variants displayed monotonically increasing PBS across length quartiles, suggesting a systematic rather than random effect.
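The quartile analysis described above can be sketched roughly as follows. The article does not define how PBS is computed, so this example substitutes an assumed stand-in: the total variation distance between the distribution of chosen answer positions and a uniform distribution, which is only meaningful if correct answers are balanced across positions. The function names and the `(reasoning_length, chosen_position)` record layout are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def position_bias_score(chosen_positions, num_options):
    """Assumed stand-in for PBS (the article does not give the formula).

    Returns the total variation distance between the empirical distribution
    of chosen answer positions and the uniform distribution: 0.0 means no
    positional preference, larger values mean stronger preference.
    """
    if not chosen_positions:
        return 0.0
    counts = Counter(chosen_positions)
    total = len(chosen_positions)
    uniform = 1.0 / num_options
    return 0.5 * sum(
        abs(counts.get(pos, 0) / total - uniform) for pos in range(num_options)
    )


def pbs_by_length_quartile(records, num_options):
    """Group (reasoning_length, chosen_position) pairs into length quartiles
    and score each quartile separately, shortest quartile first."""
    lengths = [length for length, _ in records]
    q1, q2, q3 = quantiles(lengths, n=4)  # three cut points between quartiles
    buckets = [[], [], [], []]
    for length, position in records:
        # Count how many cut points the length exceeds to get its quartile.
        index = (length > q1) + (length > q2) + (length > q3)
        buckets[index].append(position)
    return [position_bias_score(bucket, num_options) for bucket in buckets]
```

A monotonic effect of the kind reported would show up as a non-decreasing list of four scores. A real audit would also need to control for accuracy, as the study did, which this sketch does not attempt.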
To establish causality beyond correlation, the researchers ran truncation interventions: they resumed partially completed reasoning chains from intermediate points and checked whether the model's final answer changed. Continuations started from later points in the trajectory shifted more often toward position-preferred answers (from 16% to 32% for R1-Qwen-7B, depending on position bucket), indicating that the reasoning process itself, not an artifact of model initialization, drives the bias.
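A minimal sketch of how such a truncation experiment might be scored is shown below. The article does not describe the exact protocol, so everything here is an assumption: the item schema (`steps`, `original_answer`, `preferred_position`), the cut fractions, the definition of a "shift" as a change from the original answer to the position-preferred option, and the `continue_from` callable, which stands in for resuming model generation from a truncated chain.

```python
def truncation_shift_rates(items, continue_from, cut_fractions=(0.25, 0.5, 0.75)):
    """Estimate how often resumed reasoning shifts to the preferred position.

    items: list of dicts with hypothetical keys
        'steps'              -- the original reasoning chain as a list of strings
        'original_answer'    -- answer position chosen by the full original chain
        'preferred_position' -- the position the model is biased toward
    continue_from: callable taking a list of prefix steps and returning an
        answer position; a placeholder for resuming generation with a model.
    Returns a dict mapping each cut fraction to the share of items whose
    resumed answer differs from the original and lands on the preferred position.
    """
    rates = {}
    for fraction in cut_fractions:
        shifted = 0
        for item in items:
            steps = item["steps"]
            cut = max(1, int(len(steps) * fraction))  # keep at least one step
            answer = continue_from(steps[:cut])
            if (
                answer != item["original_answer"]
                and answer == item["preferred_position"]
            ):
                shifted += 1
        rates[fraction] = shifted / len(items) if items else 0.0
    return rates
```

In practice `continue_from` would wrap a model call that is given the question plus the truncated chain and asked to keep reasoning. Shift rates that rise with the cut fraction would match the pattern the study reports.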
Notably, DeepSeek-R1 at 671B parameters showed substantially lower aggregate position bias (PBS 0.019) compared to smaller models, but the length effect did not disappear; it remained visible in the longest reasoning quartile (PBS 0.071). The authors interpret this as evidence that higher accuracy achieved by larger models may gate or suppress the expression of bias rather than eliminating the underlying mechanism. This distinction matters: a bias that only manifests at certain accuracy thresholds could still distort evaluations within particular task domains or model scales.
The research also identifies a separate direct-answer position bias, present when models skip reasoning and answer directly. Its pattern differs across architectures (strong in Llama-Instruct-direct, weak in Qwen-Instruct-direct) and is uncorrelated with trajectory length. CoT reasoning does not so much reduce this baseline bias as replace it with a length-accumulated form, which suggests the two biases may have independent mechanisms.