Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
Researchers systematically evaluated how sampling temperature and prompting strategies affect extended reasoning performance in large language models, finding that zero-shot prompting peaks at moderate temperatures (T=0.4-0.7) while chain-of-thought performs better at extremes. The study reveals that extended reasoning benefits grow substantially with higher temperatures, suggesting that T=0 is suboptimal for reasoning tasks.
This research addresses a critical gap in LLM optimization by investigating the interplay between temperature settings and prompting strategies for extended reasoning models. The findings challenge the conventional wisdom that zero temperature, which makes decoding deterministic, is universally optimal for reasoning tasks. Using Grok-4.1 on IMO-level mathematics problems, the team found that moderate temperatures enable zero-shot prompting to reach 59% accuracy, while chain-of-thought methods thrive at the temperature extremes. The widening benefit gap, from 6x at T=0.0 to 14.3x at T=1.0, indicates that the sampling diversity afforded by higher temperatures becomes increasingly valuable when models have extended computation budgets.
This work reflects the broader evolution of LLM capabilities toward explicit reasoning phases. Traditional approaches treated temperature as a secondary tuning parameter, but extended reasoning fundamentally changes the calculus. When models allocate substantial tokens to intermediate reasoning steps, the stochastic exploration enabled by higher temperatures lets them discover more diverse solution paths. The research suggests that practitioners who rigidly adhere to T=0 conventions have likely been leaving performance on the table.
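To make the mechanism concrete, the sketch below shows how temperature reshapes the token distribution during sampling. This is a generic illustration of temperature-scaled softmax sampling, not code from the study; the function name and logit values are illustrative.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng=None):
    """Sample a token index from logits softened by temperature.

    T = 0 reduces to greedy argmax (deterministic decoding); higher T
    flattens the distribution, increasing the diversity of sampled
    continuations and, by extension, of explored reasoning paths.
    """
    if temperature == 0.0:
        # Greedy decoding: always pick the highest-logit token.
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = rng or random.Random()
    # Scale logits by 1/T, then apply a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Inverse-CDF sampling from the resulting distribution.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1
```

At T=0 the same token is returned every time, which is why repeated zero-temperature runs cannot explore alternative solution paths; raising T spreads probability mass across lower-ranked tokens.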
The implications extend across AI development and deployment. Teams building reasoning-focused applications should expect that temperature optimization requires joint tuning with prompting strategy rather than isolated parameter adjustment. For competitive AI benchmarks and production systems relying on mathematical reasoning, these insights offer immediate optimization pathways. The work also highlights that extended reasoning models operate under fundamentally different principles than standard LLMs, requiring fresh engineering approaches.
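A minimal sketch of the joint tuning the study implies: sweep temperature and prompting strategy together rather than fixing one while tuning the other. The `accuracy` function here is a toy surrogate that merely mirrors the reported trend (zero-shot peaking at moderate T, chain-of-thought favoring the extremes); in practice it would call the model and score responses against references. All names are illustrative.

```python
from itertools import product


def accuracy(strategy: str, temperature: float) -> float:
    """Toy surrogate for benchmark accuracy; replace with real evaluation."""
    if strategy == "zero_shot":
        # Peaks near T=0.55, echoing the reported 0.4-0.7 sweet spot.
        return 0.59 - abs(temperature - 0.55)
    # Chain-of-thought: modeled as stronger at the temperature extremes.
    return 0.30 + 0.3 * abs(temperature - 0.5)


def joint_tune(strategies, temperatures):
    """Grid-search strategy and temperature as a single joint space."""
    return max(product(strategies, temperatures),
               key=lambda cfg: accuracy(*cfg))


strategies = ["zero_shot", "chain_of_thought"]
temperatures = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
best_strategy, best_temp = joint_tune(strategies, temperatures)
```

The point of the joint search is that the best temperature depends on the strategy (and vice versa), so tuning either axis in isolation can miss the global optimum.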
- Zero-shot prompting achieves peak performance at moderate temperatures (0.4-0.7), contradicting the standard T=0 practice for reasoning tasks.
- Extended reasoning benefits scale dramatically with temperature, increasing 2.4x from T=0.0 to T=1.0 on challenging mathematical benchmarks.
- Temperature and prompting strategy must be optimized jointly rather than treated as independent hyperparameters.
- Chain-of-thought prompting performs better at temperature extremes while zero-shot excels at moderate settings.
- Conventional T=0 configurations may significantly underutilize extended reasoning capabilities in production systems.