From Empirical Evaluation to Context-Aware Enhancement: Repairing Regression Errors with LLMs
Researchers introduce RegressionBug4APR, a benchmark of 200 real-world Java and Python regression bugs, to evaluate automated program repair (APR) techniques. The study finds that traditional APR tools fail entirely on regression bugs, while LLM-based approaches show promise: supplying them with bug-inducing change context improves repair performance by 1.6x.
This research addresses a critical gap in software engineering by systematically evaluating how modern APR techniques handle regression bugs—defects that break previously working functionality. While LLM-based program repair has advanced rapidly for general bug fixing, its effectiveness on regression bugs specifically remained unknown until this empirical study. The introduction of RegressionBug4APR provides the research community with a structured, high-quality benchmark drawn from popular open-source repositories, enabling reproducible evaluation and future methodological improvements.
The findings reveal a stark divide in repair capabilities. Classical APR approaches, which rely on pattern matching and syntactic transformations, fail completely on regression bugs, which suggests that these bugs require a deeper semantic understanding of code behavior and state changes. LLM-based approaches, by contrast, show meaningful potential by reasoning about code intent and functionality. The most significant finding concerns context-aware enhancement: giving the model information about the bug-inducing change yields a 1.6x performance improvement. Knowing what changed to introduce the regression therefore appears to be crucial for finding a repair.
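To make the idea of context-aware enhancement concrete, the sketch below shows one way a repair prompt could be extended with the bug-inducing change. It is a minimal illustration under stated assumptions, not the method used in the study: the function names `bug_inducing_diff` and `build_repair_prompt`, the prompt wording, and the file name `module.py` are all hypothetical, and the call to an actual LLM is left out. Only Python's standard `difflib` module is used.

```python
import difflib
from typing import Optional


def bug_inducing_diff(before: str, after: str, path: str = "module.py") -> str:
    """Return a unified diff from the last working version to the regressed one.

    `path` is a hypothetical file name used only for the diff headers.
    """
    return "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )


def build_repair_prompt(
    buggy_code: str, failing_test: str, inducing_diff: Optional[str] = None
) -> str:
    """Assemble a repair prompt; the diff section is the optional extra context."""
    parts = [
        "The code below contains a regression: a test that used to pass now fails.",
        "### Current (buggy) code\n" + buggy_code,
        "### Failing test\n" + failing_test,
    ]
    if inducing_diff:
        # Assumed format: the change that introduced the regression, as a diff.
        parts.append("### Bug-inducing change (unified diff)\n" + inducing_diff)
    parts.append("Return a corrected version of the code.")
    return "\n\n".join(parts)


if __name__ == "__main__":
    working = "def total(a, b):\n    return a + b\n"
    regressed = "def total(a, b):\n    return a - b\n"
    diff = bug_inducing_diff(working, regressed)
    print(build_repair_prompt(regressed, "assert total(2, 3) == 5", diff))
```

The baseline prompt and the enhanced prompt differ only in the optional diff section, which mirrors the comparison described above: the same model, with and without knowledge of what changed.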
Results are consistent across Java and Python, which strengthens confidence that the findings generalize. For software development teams and organizations, the research indicates that LLM-powered repair tools merit investment and adoption, particularly when they are designed to use historical change context. Faster repair of regressions bears directly on development velocity and code quality. Going forward, the research community should focus on integrating version history and change semantics into APR pipelines, potentially combining multiple context sources to approach human-level repair performance on regression bugs.
- Classical APR tools achieve zero success on regression bugs, while LLM-based approaches show measurable effectiveness on this bug category.
- Incorporating bug-inducing change information improves LLM-based APR performance by 1.6x, highlighting the importance of historical context.
- The RegressionBug4APR benchmark provides 200 real-world Java and Python regression bugs for standardized evaluation of APR techniques.
- Results are consistent across programming languages, suggesting that context-aware enhancement generalizes beyond a single language.
- Development teams should prioritize LLM-based repair tools that integrate version history and change tracking when automating regression repair.