AI Bullish · arXiv CS AI · 14h ago · 7/10
How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
Researchers demonstrate that modern large language models can significantly improve code-generation accuracy through iterative self-repair, in which execution errors are fed back to the model for correction, yielding gains of 4.9 to 30.0 percentage points across benchmarks. The study finds that instruction-tuned models succeed with prompting alone even at the 8B scale, with Gemini 2.5 Flash reaching a 96.3% pass rate on HumanEval, though logical errors remain substantially harder to fix than syntax errors.
Gemini · Llama
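The self-repair loop the summary describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `generate` callable stands in for an LLM call, and `toy_model`, `run_candidate`, and `self_repair` are hypothetical names chosen for the example.

```python
import traceback

def run_candidate(code: str, test: str):
    """Execute candidate code plus its test; return the error text on failure, None on success."""
    env: dict = {}
    try:
        exec(code, env)
        exec(test, env)
        return None
    except Exception:
        return traceback.format_exc(limit=1)

def self_repair(generate, test: str, max_tries: int = 3):
    """Iterative self-repair: request code, run it, and on failure feed the
    execution error back as context for the next attempt."""
    feedback = None
    for attempt in range(1, max_tries + 1):
        code = generate(feedback)           # stand-in for an LLM call (assumption)
        feedback = run_candidate(code, test)
        if feedback is None:
            return code, attempt            # passing solution found
    return None, max_tries                  # gave up after max_tries

# Toy stand-in for the model: the first draft has a bug; after seeing the
# error feedback, the "repaired" second draft is correct.
def toy_model(feedback):
    if feedback is None:
        return "def add(a, b):\n    return a - b"   # buggy first draft
    return "def add(a, b):\n    return a + b"       # fixed on retry

code, tries = self_repair(toy_model, "assert add(2, 3) == 5")
```

In this toy run the loop succeeds on the second attempt, mirroring the paper's setup where each extra try converts some failing generations into passes.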