More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models
Researchers discovered that reasoning-capable AI models like DeepSeek-R1 exhibit increasing position bias as their reasoning chains grow longer, contradicting assumptions that extended thinking reduces heuristic biases. The effect persists across multiple model sizes and datasets, suggesting that longer reasoning trajectories actually accumulate bias rather than eliminate it, with critical implications for multiple-choice question evaluation.
The discovery that chain-of-thought reasoning amplifies rather than mitigates position bias represents a fundamental challenge to current AI evaluation methodologies. Researchers tested thirteen different reasoning configurations across major benchmarks and found that longer reasoning trajectories consistently correlate with stronger position preferences in multiple-choice answers, with correlations ranging from 0.11 to 0.41. This pattern held across both distilled and base models, suggesting a systematic mechanism rather than isolated cases.
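To make the length-bias relationship concrete, here is a minimal sketch of one way such a correlation could be computed, assuming per-question records of reasoning-token counts and a binary flag for whether the model chose its position-preferred slot. The arrays, the metric choice (Spearman rank correlation), and all numbers below are illustrative, not the paper's actual data or method.

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative per-question records for one model configuration:
# how many reasoning tokens the model emitted, and whether its final
# choice landed on the position-preferred slot (1) or not (0).
reasoning_tokens = np.array([120, 450, 980, 1500, 2200, 3100, 4000, 5200])
picked_preferred_slot = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Rank correlation between trajectory length and positional preference;
# a positive rho means longer chains coincide with stronger bias.
rho, p = spearmanr(reasoning_tokens, picked_preferred_slot)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```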
The research emerges during a period of rapid advancement in reasoning-focused models, where extended thinking has become a flagship feature promised to improve model reliability and accuracy. Organizations developing these systems have largely assumed that careful deliberation reduces biases, but this work demonstrates that the relationship is more complex. The truncation experiments provide causal evidence: when researchers resumed reasoning from midpoints of model trajectories, models shifted increasingly toward position-preferred answers as the resumption point moved later, indicating that the bias accumulates over the course of the reasoning process itself.
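A rough sketch of what such a truncation-and-resume probe could look like in an evaluation harness is below. Here `generate_fn` and `extract_answer_fn` are hypothetical stand-ins for whatever decoding and answer-parsing utilities a harness provides; the paper's exact resumption mechanics may differ.

```python
from typing import Callable, Dict, Sequence

def truncation_probe(
    prompt: str,
    full_trace: str,
    generate_fn: Callable[[str], str],        # resumes decoding from a partial trace
    extract_answer_fn: Callable[[str], str],  # parses the final choice (e.g., "A")
    fractions: Sequence[float] = (0.25, 0.5, 0.75),
) -> Dict[float, str]:
    """Resume reasoning from several midpoints of an existing trajectory
    and record which option the model commits to from each point.
    If bias accumulates along the trace, later resumption points should
    drift toward the position-preferred option more often."""
    answers = {}
    for frac in fractions:
        cut = int(len(full_trace) * frac)
        # Original question plus a truncated reasoning prefix; the model
        # continues "thinking" from the cut point.
        answers[frac] = extract_answer_fn(generate_fn(prompt + full_trace[:cut]))
    return answers
```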
For developers and evaluation teams, this creates practical vulnerabilities. Models that appear highly capable on benchmarks may behave inconsistently in high-stakes settings such as medical, legal, or scientific decision-making, where the position of the correct option is arbitrary. The research provides diagnostic tools for auditing this bias, though using them requires modifying current evaluation pipelines. That the effect persists even in the largest models suggests it reflects a fundamental property of how reasoning mechanisms encode positional information, not a problem that scale alone will solve.
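One common shape for such an audit, shown as a hedged sketch rather than the paper's released tooling, is to re-ask each question under every ordering of its options and tally how often each answer slot wins. The `answer_fn` callable is a hypothetical model wrapper that returns the chosen slot index.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Dict, Sequence

def position_bias_audit(
    question: str,
    options: Sequence[str],
    answer_fn: Callable[[str], int],  # returns the chosen option slot, 0-indexed
) -> Dict[int, float]:
    """Re-ask the same question with its options shuffled into every order
    and tally how often each slot is chosen. Across all permutations the
    content-correct answer occupies each slot equally often, so a
    content-driven model yields near-uniform slot frequencies; a skew
    toward one slot flags position bias."""
    slot_counts: Counter = Counter()
    orders = list(permutations(range(len(options))))
    for order in orders:
        shuffled = [options[i] for i in order]
        prompt = question + "\n" + "\n".join(
            f"{chr(65 + slot)}. {text}" for slot, text in enumerate(shuffled)
        )
        slot_counts[answer_fn(prompt)] += 1
    n = len(orders)
    return {slot: slot_counts[slot] / n for slot in range(len(options))}
```

For a four-option question this issues 24 prompts per item, which is tractable for an audit pass even if it is too expensive for routine benchmarking.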
Future work should focus on whether architectural changes or training approaches can decouple reasoning depth from position sensitivity without sacrificing model capabilities.
- Longer reasoning chains in AI models correlate with stronger position bias in multiple-choice questions, contradicting the assumption that extended thinking reduces heuristic biases
- The effect manifests across thirteen different model configurations including DeepSeek-R1, suggesting a systematic phenomenon rather than model-specific behavior
- Truncation experiments provide causal evidence that bias accumulates throughout the reasoning process as trajectories lengthen
- Current multiple-choice evaluation benchmarks may overestimate model robustness because they don't account for position bias effects in reasoning-capable systems
- Researchers provide diagnostic tools for auditing position bias, requiring updates to standard AI evaluation pipelines