y0news
🧠 AI · 🟢 Bullish · Importance 7/10

How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks

arXiv – CS AI | Johin Johny Arimbur
🤖 AI Summary

Researchers demonstrate that modern large language models can significantly improve code generation accuracy through iterative self-repair—feeding execution errors back to the model for correction—achieving 4.9-30.0 percentage point gains across benchmarks. The study reveals that instruction-tuned models succeed with prompting alone even at 8B scale, with Gemini 2.5 Flash reaching 96.3% pass rates on HumanEval, though logical errors remain substantially harder to fix than syntax errors.

Analysis

This research addresses a fundamental gap between how LLMs are evaluated and how they're actually used in practice. While benchmarks traditionally measure single-shot performance, developers iteratively refine model outputs in real applications. The study's findings across seven models—from Meta's Llama series to Google's Gemini—demonstrate that iterative self-repair has become a universal capability of modern instruction-tuned models, contradicting earlier findings that only larger or fine-tuned models could self-correct effectively.
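
The repair loop the paper studies can be sketched roughly as follows. This is a minimal illustration of the control flow, not the paper's actual harness; `generate` is a hypothetical stand-in for the LLM call, and `toy_model` exists only so the loop can run end to end.

```python
import traceback

def run_candidate(code: str, test: str):
    """Run candidate code plus its unit test; return the traceback on failure, None on success."""
    env: dict = {}
    try:
        exec(code + "\n" + test, env)
        return None
    except Exception:
        return traceback.format_exc()

def self_repair(generate, code: str, test: str, max_attempts: int = 3):
    """Iteratively feed the execution error back to the model until the test passes."""
    for attempt in range(max_attempts):
        error = run_candidate(code, test)
        if error is None:
            return code, attempt          # repaired (or correct on the first try)
        code = generate(code, error)      # model sees the failing code + error message
    return None, max_attempts             # attempt budget exhausted

# Toy stand-in for an LLM repair call (hypothetical):
def toy_model(code: str, error: str) -> str:
    return code.replace("a - b", "a + b")

fixed, attempts = self_repair(
    toy_model,
    "def add(a, b):\n    return a - b",
    "assert add(2, 3) == 5",
)
print(attempts)  # the toy bug is fixed on the first repair attempt
```

Capping `max_attempts` at two or three matches the finding that most gains concentrate in the first attempts.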

The research reveals important architectural insights: dense and mixture-of-experts (MoE) models both benefit from iterative repair, with most gains concentrated in the first two attempts. Notably, different error types show dramatically different repair success rates—syntax and naming errors reach near-perfect correction rates, while assertion errors (indicating logical mistakes in the algorithm itself) only repair successfully ~45% of the time. This ceiling suggests fundamental limits to self-correction without external validation or architectural changes.
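
The gap between error types can be made concrete with a small failure-mode classifier. The buckets below are illustrative, not the paper's taxonomy: syntax and naming errors are the near-perfectly repairable kinds, while an assertion failure means the code ran but computed the wrong answer, i.e. a logical mistake.

```python
def classify_failure(code: str, test: str) -> str:
    """Bucket a candidate's failure mode for a Python solution plus its unit test."""
    try:
        compile(code, "<candidate>", "exec")
    except SyntaxError:
        return "syntax"          # near-perfect repair rates
    env: dict = {}
    try:
        exec(code + "\n" + test, env)
    except NameError:
        return "naming"          # also highly repairable
    except AssertionError:
        return "assertion"       # logical error: hardest to self-repair
    except Exception:
        return "other"
    return "pass"

print(classify_failure("def f(x): return x", "assert f(1) == 2"))  # assertion
```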

For the AI development community, these findings validate iterative refinement as a practical deployment strategy rather than a theoretical curiosity. Chain-of-thought prompting added up to 5.5 percentage points of additional gain, indicating that how developers structure repair prompts significantly affects outcomes. This creates opportunities to optimize development workflows and suggests that LLM-powered code generation tools should implement iterative refinement by default rather than relying on single-shot outputs.
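
Prompt structure shows up most directly in the repair prompt itself. The template below is a hypothetical sketch of a minimal versus a chain-of-thought-style repair prompt, not the paper's exact wording.

```python
def repair_prompt(code: str, error: str, chain_of_thought: bool = True) -> str:
    """Assemble a repair prompt from failing code and its execution error.
    The chain-of-thought variant asks the model to reason before rewriting."""
    instruction = (
        "First explain, step by step, what the error means and which line "
        "causes it. Then write the corrected function."
        if chain_of_thought
        else "Write the corrected function."
    )
    return (
        "The following Python code fails its tests.\n\n"
        f"{code}\n\n"
        f"Execution error:\n{error}\n\n"
        f"{instruction}"
    )

print(repair_prompt("def add(a, b): return a - b", "AssertionError"))
```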

Key Takeaways
  • Modern instruction-tuned LLMs universally improve code generation through iterative self-repair, gaining 4.9-30 percentage points across benchmarks
  • Logical errors are substantially harder to fix than syntax errors, with ~45% repair success rate versus much higher rates for other error types
  • Chain-of-thought prompting adds up to 5.5 percentage points of additional self-repair gains compared to minimal prompting
  • Even 8B parameter models succeed at self-repair with prompting alone, eliminating the need for fine-tuning or larger models
  • The majority of the improvement occurs within the first two repair attempts, suggesting diminishing returns from further iteration
Mentioned in AI
Models: Gemini (Google), Llama (Meta)