Researchers have developed methods to predict, in real time, how far along a reasoning language model is in a long chain of thought, achieving a mean absolute error (MAE) of 0.161 on mathematical tasks. The work addresses the opacity of extended reasoning by training linear probes on hidden states and by fine-tuning models to generate percentage-based progress estimates. It also quantifies the inherent ambiguity of progress labels across different model sizes.
This research tackles a usability problem in reasoning language models: understanding what is happening during an extended reasoning process. As these models handle more complex tasks over longer time horizons, users face a black box in which internal progress stays invisible. The team uses two complementary methods. The first discretizes reasoning trajectories and probes hidden states for progress information. The second fine-tunes models to generate explicit 0-100% progress estimates during chain-of-thought reasoning. The reported 0.161 MAE indicates meaningful accuracy on mathematical reasoning, though the gap relative to position-based baselines suggests room for improvement.
The research also offers an insight about model scaling: larger models do not necessarily provide more stable progress labels. Qwen3-4B's superior performance in reducing rollout dispersion indicates that a model's consistency in remaining solution length matters more than raw scale, which challenges the common assumption that size equals capability. The quantification of inherent label ambiguity, measured as the variation across different continuations of the same partial solution, provides a useful framework for understanding the limits of progress prediction.
The work has implications for deploying reasoning models in production environments where users need oversight of, and confidence in, system behavior. Better progress transparency could improve trust in AI systems for complex problem-solving and enable human-in-the-loop verification at strategic decision points.
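
To make the MAE figure and the position-based comparison concrete, here is a minimal sketch of how progress labels and the error metric might be computed. It assumes progress at a step is the fraction of the full trajectory completed so far, and that a position-based baseline guesses progress from the step index and a fixed assumed trace length. Both definitions, and the names `progress_labels`, `position_baseline`, and `mean_absolute_error`, are illustrative assumptions rather than the researchers' exact formulation.

```python
def progress_labels(num_steps):
    """Ground-truth progress after each step, as a fraction in (0, 1].

    Assumption: progress is simply (steps completed) / (total steps).
    """
    return [(i + 1) / num_steps for i in range(num_steps)]


def position_baseline(num_steps, assumed_length):
    """Hypothetical baseline: guess progress from position alone,
    pretending every trace has `assumed_length` steps (capped at 1.0)."""
    return [min(1.0, (i + 1) / assumed_length) for i in range(num_steps)]


def mean_absolute_error(predicted, target):
    """Average absolute gap between predicted and true progress."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# A 4-step trace scored against a baseline that assumes 8 steps.
labels = progress_labels(4)          # [0.25, 0.5, 0.75, 1.0]
guesses = position_baseline(4, 8)    # [0.125, 0.25, 0.375, 0.5]
print(mean_absolute_error(guesses, labels))  # 0.3125
```

On this scale, an MAE of 0.161 would mean the estimate is off by about 16 percentage points on average, under the assumption that errors are measured on progress fractions between 0 and 1.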
- Real-time progress prediction in reasoning models is feasible, achieving 0.161 MAE on mathematical reasoning tasks through linear probes and fine-tuning.
- Smaller models like Qwen3-4B can yield more stable progress labels because the number of remaining solution steps varies less across rollouts, suggesting that scale alone doesn't guarantee more consistent progress labels.
- Hidden states in language models encode meaningful progress information that can be extracted and converted into human-readable percentage estimates.
- Inherent ambiguity in progress labeling stems from multiple valid solution paths, requiring evaluation of label variance across different model continuations.
- Progress transparency mechanisms could enhance real-time oversight and expectation management for long-horizon reasoning tasks in production deployments.
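
The label-ambiguity point can be illustrated with a small sketch. It assumes that the same partial solution is continued several times (rollouts), that each rollout implies its own progress label based on how many steps remain, and that ambiguity is summarized as the population standard deviation of those labels. The dispersion statistic and the function names are hypothetical choices for illustration; the researchers may measure dispersion differently.

```python
from statistics import pstdev


def progress_label(prefix_steps, remaining_steps):
    """Progress implied by one rollout: completed steps over total steps."""
    total = prefix_steps + remaining_steps
    if total <= 0:
        raise ValueError("a trajectory needs at least one step")
    return prefix_steps / total


def label_dispersion(prefix_steps, remaining_per_rollout):
    """Spread of progress labels across rollouts of the same prefix.

    Assumption: dispersion is the population standard deviation.
    """
    labels = [progress_label(prefix_steps, r) for r in remaining_per_rollout]
    return pstdev(labels)


# Same 10-step prefix, continued by rollouts of different lengths.
consistent = label_dispersion(10, [10, 10, 10])  # every rollout implies 50%
spread = label_dispersion(10, [10, 30])          # rollouts imply 50% vs 25%
print(consistent, spread)
```

A model whose continuations finish in a similar number of steps gives labels that agree, so its dispersion is near zero. A model whose continuations vary widely in length makes the "true" progress at that point ambiguous, which limits how accurate any predictor can be.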