When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
Researchers introduce a learnable approach to commitment depth (the number of primitive actions executed before replanning) in vision-language models for long-horizon reasoning. Their adaptive policy outperforms fixed-depth baselines and surpasses GPT-4.5 and Claude Sonnet on puzzle-solving tasks, achieving higher solve rates with fewer actions.
This research addresses a fundamental optimization problem in long-horizon AI reasoning: balancing the computational cost of frequent replanning against the compounding errors from executing actions without observation feedback. Traditional approaches fix commitment depth as a hyperparameter, treating it as a static design choice rather than a dynamic variable responsive to context. The proposed method reframes this as a learnable, state-conditioned component of the policy itself, allowing the system to adaptively decide when to pause and replan based on current conditions.
The work builds on recent advances in vision-language models and their application to sequential decision-making. By jointly predicting both actions and their execution duration, the approach integrates temporal abstraction directly into the model architecture rather than as a post-hoc scheduling mechanism. This represents a shift toward more sophisticated reasoning systems that can self-regulate their intervention frequency.
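The control loop this implies can be sketched in a few lines. This is a minimal toy, not the paper's interface: `LineWorld`, `policy`, and `rollout` are all invented for illustration, and the stand-in policy simply walks left while choosing a depth proportional to the remaining distance. The point is the structure: the policy returns both a plan and a commitment depth `k`, and the agent executes `k` primitive actions before pausing to re-observe and replan.

```python
class LineWorld:
    """Toy 1-D environment (invented for this sketch): reach position 0."""
    def __init__(self, start=10):
        self.start = start

    def reset(self):
        self.pos, self.done = self.start, False
        return {"distance_to_goal": self.pos}

    def step(self, action):
        self.pos += -1 if action == "left" else 1
        self.done = self.pos <= 0
        return {"distance_to_goal": max(self.pos, 0)}


def policy(observation):
    # Stand-in for the vision-language model's joint prediction head:
    # it emits a plan of primitive actions AND a commitment depth.
    plan = ["left"] * 8
    # Hypothetical heuristic: commit deeper when far from the goal,
    # replan more often as the goal gets close.
    depth = max(1, min(len(plan), observation["distance_to_goal"] // 2))
    return plan, depth


def rollout(env, max_steps=50):
    obs, steps = env.reset(), 0
    while not env.done and steps < max_steps:
        plan, k = policy(obs)
        # Commit: execute k primitive actions without replanning in between.
        for action in plan[:k]:
            obs = env.step(action)
            steps += 1
            if env.done:
                break
    return steps
```

Here the agent solves the 10-step task with only 5 replanning calls instead of 10, mirroring the paper's claim that adaptive depth cuts planner invocations without sacrificing success.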
The empirical results demonstrate substantial practical improvements. On Sliding Puzzle and Sokoban benchmarks, the adaptive policy achieves up to 12.5 percentage points higher success rates while reducing primitive action counts by approximately 25 percent. Notably, the method outperforms larger proprietary models (GPT-4.5, Claude Sonnet) despite using a 7B parameter backbone, suggesting that architectural innovations in commitment strategy can partially compensate for model scale disadvantages.
The theoretical analysis provides formal justification: state-conditioned commitment strictly dominates fixed-depth approaches when optimal depth varies across different states. This creates a foundation for future research into adaptive temporal abstraction in reinforcement learning and language-guided agent systems. The work suggests that treating previously hard-coded parameters as learnable policy components may unlock efficiency gains across other domains requiring long-horizon planning.
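The dominance argument can be made concrete with a small worked example (all numbers invented for illustration). Suppose two state types have different optimal depths: deep commitment is cheaper in open regions, shallow commitment is cheaper near obstacles. Any single fixed depth pays the suboptimal cost on at least one state type, while a state-conditioned policy pays the per-state minimum, so its expected cost is never worse and is strictly better whenever the optima differ.

```python
# Expected cost of reaching the goal from two state types,
# under commitment depths 1 and 4 (numbers are illustrative only).
cost = {
    "open_corridor": {1: 9.0, 4: 4.0},   # deep commitment pays off
    "near_obstacle": {1: 3.0, 4: 7.0},   # frequent replanning pays off
}
p = {"open_corridor": 0.5, "near_obstacle": 0.5}  # state distribution

# A fixed-depth policy must use one depth everywhere.
fixed = {d: sum(p[s] * cost[s][d] for s in cost) for d in (1, 4)}
best_fixed = min(fixed.values())

# A state-conditioned policy picks the best depth per state.
adaptive = sum(p[s] * min(cost[s].values()) for s in cost)
```

With these numbers the best fixed depth costs 5.5 in expectation, while the adaptive policy costs 3.5; the two coincide only when one depth is optimal in every state, which is exactly the condition in the dominance result.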
- Adaptive commitment depth improves solve rates by up to 12.5 percentage points and reduces primitive actions by ~25% compared to fixed-depth baselines
- A 7B vision-language model with learnable commitment outperforms GPT-4.5 and Claude Sonnet on complex reasoning tasks
- State-conditioned commitment theoretically dominates fixed-depth strategies when optimal depth varies across states
- Joint prediction of actions and their execution durations integrates temporal abstraction directly into the model architecture
- Open-weight vision-language baselines achieve 0% success on these tasks, highlighting the importance of architectural innovation over scale alone