Denoising Iterative Self-Correction: Structured Verification Loops for Reliable LLM Reasoning
Researchers introduce Denoising Iterative Self-Correction (DISC), a test-time procedure that improves large language model reasoning by treating verification outputs as noisy signals to progressively correct errors across multiple passes. The method demonstrates superior performance over existing correction approaches, achieving 81.6% accuracy on BIG-Bench Mistake with 13x better improvement-to-degradation ratios than Chain-of-Verification.
DISC addresses a fundamental challenge in large language model deployment: the paradox that naive self-correction mechanisms often degrade already-correct reasoning paths while attempting to fix errors. The research treats this as a signal-processing problem rather than a binary verification task, drawing parallels to traditional denoising algorithms. This conceptual shift enables the method to balance two competing objectives—maximizing error repair while minimizing false corrections—through a gating mechanism that blocks harmful rewrites.
The significance lies in how the research quantifies this trade-off through paired diagnostics: improvement-to-degradation ratio (precision) and repair rate (recall). This framework mirrors evaluation methodologies in signal processing and allows direct comparison with prior approaches. The substantial gaps on BIG-Bench Mistake using Claude Sonnet 4.5, a 13x better improvement-to-degradation ratio than Chain-of-Verification and 5x better than Self-Refine, suggest meaningful progress in inference reliability.
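The two diagnostics can be illustrated with a short calculation. The sketch below is an assumption about how they might be computed, not the paper's definition: it takes the improvement-to-degradation ratio to be the count of wrong-to-right flips divided by the count of right-to-wrong flips, and the repair rate to be the share of initially wrong answers that end up correct. The names `Outcome` and `paired_diagnostics` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    """Correctness of one example before and after a correction procedure."""
    correct_before: bool
    correct_after: bool


def paired_diagnostics(outcomes: List[Outcome]) -> Dict[str, float]:
    """Compute assumed versions of the two diagnostics named in the text.

    Assumed definitions (not confirmed by the source):
      improvement-to-degradation ratio = (wrong -> right) / (right -> wrong)
      repair rate                      = (wrong -> right) / (initially wrong)
    """
    improved = sum(1 for o in outcomes if not o.correct_before and o.correct_after)
    degraded = sum(1 for o in outcomes if o.correct_before and not o.correct_after)
    initially_wrong = sum(1 for o in outcomes if not o.correct_before)

    # With zero degradations the ratio is unbounded; report infinity.
    ratio = improved / degraded if degraded else float("inf")
    repair_rate = improved / initially_wrong if initially_wrong else 0.0
    accuracy_after = (
        sum(1 for o in outcomes if o.correct_after) / len(outcomes) if outcomes else 0.0
    )
    return {
        "improvement_to_degradation": ratio,
        "repair_rate": repair_rate,
        "accuracy_after": accuracy_after,
    }
```

Reporting both numbers together matters because either one can be gamed alone: a procedure that never rewrites anything has no degradations but also repairs nothing, while one that rewrites everything may repair many errors at the cost of breaking correct answers.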
Cross-model role allocation emerges as a secondary but important finding. By assigning verification and judgment to different models than the generator, the approach mitigates self-confirmation bias, a known failure mode where models reinforce their own errors. This heterogeneous architecture introduces computational overhead but appears necessary for robust correction pipelines.
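A minimal sketch of how a gated, multi-pass correction loop with separate roles might be wired together is shown below. It is an illustration under stated assumptions, not the DISC algorithm itself: the function name `gated_self_correction`, the majority-style vote used to smooth noisy verifier output, and the `votes`, `threshold`, and `max_passes` parameters are all hypothetical. The four roles are passed in as plain callables so that each can be backed by a different model.

```python
from typing import Callable


def gated_self_correction(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],      # True means "an error was flagged"
    revise: Callable[[str, str], str],
    judge: Callable[[str, str, str], bool],  # True means "the rewrite is better"
    max_passes: int = 3,
    votes: int = 5,
    threshold: float = 0.6,
) -> str:
    """Hypothetical gated correction loop (an assumption, not the paper's method).

    Each pass samples the verifier several times and treats the flags as a
    noisy signal. A rewrite is attempted only when the flagged fraction
    reaches `threshold`, and it is kept only if the judge accepts it.
    """
    answer = generate(question)
    for _ in range(max_passes):
        flags = [verify(question, answer) for _ in range(votes)]
        if sum(flags) / votes < threshold:
            break  # too little evidence of an error: leave the answer alone
        candidate = revise(question, answer)
        if judge(question, answer, candidate):
            answer = candidate  # gate open: accept the rewrite
        # gate closed: keep the current answer and re-verify on the next pass
    return answer
```

In this sketch the vote threshold protects already-correct answers from a single spurious flag, and the judge acts as the gate that blocks harmful rewrites. Passing `verify` and `judge` as callables backed by models other than the one behind `generate` is one way to express the cross-model role allocation described above.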
The identified capability floor on GPQA Diamond—where models recognize contradictory evidence but cannot act on that recognition—reveals fundamental limitations in current LLM reasoning. This distinction between detection and correction capacity has implications for designing safer deployment systems. For practitioners building mission-critical applications requiring reliable multi-step reasoning, DISC provides a methodologically sound approach to inference-time quality control.
- DISC achieves 81.6% accuracy on BIG-Bench Mistake with a 13x better improvement-to-degradation ratio than Chain-of-Verification.
- The method treats verification as noisy signal processing rather than binary judgment, enabling progressive error reduction across multiple passes.
- Cross-model role allocation, in which verification and judgment are assigned to different models than the generator, mitigates self-confirmation bias in reasoning tasks.
- A capability floor exists where language models recognize contradictory evidence but cannot translate that recognition into valid corrections.
- Paired diagnostics (precision and recall) provide a more nuanced evaluation framework for correction mechanisms than single-metric benchmarks.