ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning
ReFlect is a training-free harness system that wraps around LLMs to detect and recover from reasoning failures in complex, multi-step tasks. Testing across six models shows significant improvements in task success rates, with gains inversely correlated to baseline performance, though the approach reveals limitations in how smaller models handle structured reasoning.
ReFlect addresses a critical vulnerability in current LLM reasoning systems: the silent accumulation of errors across long-horizon tasks. Traditional approaches like chain-of-thought and ReAct assume each reasoning step builds reliably on previous ones, but this breaks down on complex multi-stage problems. The research reveals that prompt-level self-critique is largely ineffective, with LLMs accepting incorrect answers 76% of the time and failing to flag genuine errors in 90% of reflection blocks.
The harness operates as a deterministic wrapper at inference time, enabling error detection and recovery without model retraining. Results demonstrate substantial performance gains: Claude Sonnet 4.5 improved 29 percentage points over direct chain-of-thought, while even the smallest tested model (gpt-4o-mini) achieved a 41% success rate on benchmark tasks. Notably, ReFlect proves most beneficial for weaker baseline performers, gaining roughly 1.69 percentage points for each percentage point of baseline shortfall.
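The wrapper pattern described above can be sketched as a simple step-verify-retry loop. This is an illustrative sketch only: the names (`run_harness`, `propose`, `verify`) and the retry policy are hypothetical, since the paper's actual verifier and recovery logic are not detailed here; what it shows is the general shape of a deterministic inference-time harness that checks each reasoning step before it can contaminate later ones.

```python
# Hypothetical sketch of an inference-time reasoning harness.
# `propose` stands in for an LLM call; `verify` is a deterministic check.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HarnessResult:
    steps: List[str] = field(default_factory=list)  # verified reasoning steps
    retries: int = 0                                # failed attempts caught


def run_harness(propose: Callable[[List[str]], str],
                verify: Callable[[str], bool],
                n_steps: int,
                max_retries: int = 3) -> HarnessResult:
    """Run a multi-step loop where each proposed step must pass a
    deterministic verifier; failed steps are re-proposed instead of
    silently accumulating into downstream reasoning."""
    result = HarnessResult()
    for _ in range(n_steps):
        for _attempt in range(max_retries + 1):
            step = propose(result.steps)  # model sees verified history only
            if verify(step):
                result.steps.append(step)
                break
            result.retries += 1  # reject the step and try again
        else:
            raise RuntimeError("step could not be verified; aborting")
    return result


# Toy demo: the "model" gets an arithmetic step wrong on its first
# attempt; the verifier rejects it and forces a retry.
attempts = iter(["2+2=5", "2+2=4", "4+1=5"])
res = run_harness(
    propose=lambda history: next(attempts),
    verify=lambda s: eval(s.split("=")[0]) == int(s.split("=")[1]),
    n_steps=2,
)
# res.steps == ["2+2=4", "4+1=5"], res.retries == 1
```

The key design point the sketch captures is that verification happens outside the model: the harness, not the LLM's own self-critique, decides whether a step stands, which is what makes the approach robust to the self-critique failure rates reported above.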
However, the research identifies a scaling paradox: mid-size models (70B parameters) struggle with structured reasoning state management, achieving improvements of only 15-18.7%. This suggests the harness design assumes capabilities that only larger models reliably possess. For developers and AI system designers, ReFlect offers a practical, training-free improvement mechanism that works across model families. The technology's model-agnostic nature makes it broadly applicable, though effectiveness varies significantly by model scale and architecture. Future work should focus on understanding why structured state management fails at intermediate scales and whether this limitation affects production deployment.
- ReFlect improves LLM task success rates by 7-29 percentage points across six models without requiring model retraining or fine-tuning.
- Current self-critique methods fail to detect errors in 90% of cases and accept wrong answers in at least 76% of scenarios.
- Performance gains are inversely proportional to baseline success rates, making ReFlect most valuable for lower-performing models.
- Mid-size models (70B parameters) show disproportionate weakness in handling structured reasoning state, suggesting scale-dependent limitations.
- The inference-time harness approach is model-agnostic and immediately deployable across different LLM architectures.