RePoT: Recoverable Program-of-Thought via Checkpoint Repair
Researchers introduce RePoT (Recoverable Program-of-Thought), an enhanced AI reasoning method that fixes failed code generation by replaying execution to identify the first error point, then using a single LLM call to recover rather than restarting. The technique improves accuracy by 3-11 percentage points across multiple models and benchmarks, with particularly strong gains on smaller models like GPT-4 mini.
RePoT addresses a critical limitation in Program-of-Thought reasoning: when an LLM generates Python code to solve a problem, a single invalid action corrupts the rest of the trajectory and forces complete regeneration. The innovation lies in deterministic verification: the generated plan is replayed step by step through an environment to pinpoint the first point of failure. Rather than discarding the failed attempt, RePoT passes the verified correct prefix as context for a targeted recovery call, which avoids regenerating steps already known to be valid and improves success rates.
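The replay-and-recover loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `CounterEnv` toy environment, the `Checkpoint` record, and the `replay` and `recover` helpers are all hypothetical names, and `repair_fn` stands in for the single LLM recovery call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Checkpoint:
    """Result of replaying a plan (hypothetical structure, for illustration)."""
    verified_prefix: List[str]    # actions that replayed without error
    failed_action: Optional[str]  # first action the environment rejected, if any
    error: Optional[str]          # the environment's error message, if any


class CounterEnv:
    """Toy stand-in environment: a counter that must stay within [0, 3]."""

    def __init__(self) -> None:
        self.value = 0

    def step(self, action: str) -> None:
        delta = {"inc": 1, "dec": -1}.get(action)
        if delta is None:
            raise ValueError(f"unknown action: {action}")
        if not 0 <= self.value + delta <= 3:
            raise ValueError(f"{action} would leave range at value {self.value}")
        self.value += delta


def replay(plan: List[str], make_env: Callable[[], CounterEnv]) -> Checkpoint:
    """Deterministically replay a plan and stop at the first invalid action."""
    env = make_env()
    for i, action in enumerate(plan):
        try:
            env.step(action)
        except ValueError as exc:
            return Checkpoint(plan[:i], action, str(exc))
    return Checkpoint(list(plan), None, None)


def recover(
    plan: List[str],
    make_env: Callable[[], CounterEnv],
    repair_fn: Callable[[Checkpoint], List[str]],
) -> List[str]:
    """Keep the verified prefix and ask repair_fn once for a new suffix."""
    checkpoint = replay(plan, make_env)
    if checkpoint.failed_action is None:
        return list(plan)  # nothing to repair
    # Assumption: repair_fn wraps one LLM call that sees the checkpoint.
    return checkpoint.verified_prefix + repair_fn(checkpoint)
```

For the plan `["inc", "dec", "dec", "inc"]`, `replay` stops at the second `dec` and reports `["inc", "dec"]` as the verified prefix, so a repair function only has to propose a new suffix. A fuller version would presumably replay the repaired plan again before accepting it.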
This work emerges from the broader trend of optimizing LLM reasoning chains. As models scale, they generate increasingly coherent reasoning steps but still fail probabilistically on complex multi-step tasks. Previous approaches relied on retry mechanisms or end-to-end regeneration, both expensive at scale. RePoT's checkpoint-based recovery represents a more efficient paradigm: it transforms failure into recoverable information rather than total loss.
The empirical results demonstrate practical value. Gains range from 3 to 11 percentage points across closed-model configurations and are most pronounced on smaller models, a pattern that matters for cost-conscious deployments. The controlled benchmark (Derail-550) indicates that the checkpoint itself drives recovery success: knowing where execution failed proves far more valuable than error-only feedback, which makes the checkpoint the load-bearing signal.
An adaptive variant routes between suffix repair and fresh retries based on verified-prefix length, which points to room for further refinement. As LLM reasoning systems proliferate in production environments, from scientific problem-solving to robotic task planning, efficient recovery mechanisms become economically critical. This work establishes that failure points contain actionable information worth mining before a failed attempt is discarded.
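The routing idea can be illustrated with a simple threshold rule. The sketch below is an assumption-heavy illustration rather than the published method: the summary does not state the routing rule, so the direction (repair when the verified prefix is long, retry from scratch when it is short), the `min_prefix_len` cutoff, and the `route_recovery` name are all hypothetical.

```python
from typing import Callable, List


def route_recovery(
    verified_prefix: List[str],
    repair_suffix: Callable[[List[str]], List[str]],
    fresh_retry: Callable[[], List[str]],
    min_prefix_len: int = 2,  # hypothetical cutoff, not taken from the source
) -> List[str]:
    """Choose between suffix repair and a fresh retry by verified-prefix length."""
    if len(verified_prefix) >= min_prefix_len:
        # Assumed rule: enough verified work exists to be worth keeping.
        return list(verified_prefix) + repair_suffix(verified_prefix)
    # Assumed rule: too little verified work, so regenerate the whole plan.
    return fresh_retry()
```

Both callbacks stand in for LLM calls. In practice the cutoff would have to be tuned per task, since the point at which a fresh retry beats suffix repair is an empirical question.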
- RePoT improves reasoning accuracy by 3-11pp through checkpoint-based recovery rather than full regeneration
- Verified-prefix length determines whether suffix repair or fresh restart yields better recovery outcomes
- Checkpoint information drives 10-20x better recovery performance than error-only feedback on controlled benchmarks
- Gains are strongest on smaller models (Gemini, GPT-4 mini), making the technique cost-effective for inference
- Method replicates across diverse benchmarks and open-weights models, indicating broad applicability