y0news

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

arXiv – CS AI | Kaifeng He, Xiaojun Zhang, Peiliang Cai, Mingwei Liu, Yanlin Wang, Chong Wang, Kaifeng Huang, Bihuan Chen, Xin Peng, Zibin Zheng

🤖 AI Summary

A systematic review of 114 studies reveals that code quality defects in large language models stem primarily from training data imperfections rather than model limitations alone. The research establishes a taxonomy linking 18 propagation mechanisms between data quality issues and generated code failures, while advocating for proactive data governance over reactive post-generation filtering.

Analysis

This research addresses a critical gap in understanding why LLMs generate buggy and vulnerable code despite their apparent sophistication. Rather than treating generation failures as inherent model weaknesses, the study traces root causes to training corpus quality, fundamentally reframing how the AI development community should approach code generation reliability. The systematic review of 114 papers provides empirical grounding for what practitioners have increasingly suspected: "garbage in, garbage out" applies to LLMs just as much as to traditional systems.

The establishment of a unified taxonomy across nine dimensions of code quality issues and categorization of training data problems into code and non-code attributes creates a shared vocabulary for developers and researchers. This systematization matters because it enables reproducible problem-solving rather than ad-hoc fixes. The formalized causal framework with 18 propagation mechanisms transforms anecdotal knowledge into actionable patterns that teams can audit against their own pipelines.
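To make the auditing idea concrete, here is a minimal sketch of how a team might encode such a taxonomy as a checkable artifact. The mechanism entries and category names below are illustrative placeholders, not the paper's actual taxonomy labels:

```python
from dataclasses import dataclass

# Hypothetical encoding of a propagation mechanism: a training-data
# quality issue, the code-quality dimension it affects, and how the
# defect propagates into generated code.
@dataclass(frozen=True)
class PropagationMechanism:
    data_issue: str        # e.g. "vulnerable code in corpus"
    code_dimension: str    # e.g. "security"
    description: str

# Illustrative entries only; the review catalogs 18 such mechanisms.
TAXONOMY = [
    PropagationMechanism(
        data_issue="vulnerable code in corpus",
        code_dimension="security",
        description="Model reproduces insecure API usage seen in training data.",
    ),
    PropagationMechanism(
        data_issue="outdated library examples",
        code_dimension="correctness",
        description="Generated code calls deprecated or removed APIs.",
    ),
    PropagationMechanism(
        data_issue="mismatched comment/code pairs",
        code_dimension="maintainability",
        description="Docstrings in generated code describe the wrong behavior.",
    ),
]

def issues_affecting(dimension: str) -> list[str]:
    """List data-quality issues that propagate into a given code dimension."""
    return [m.data_issue for m in TAXONOMY if m.code_dimension == dimension]

print(issues_affecting("security"))  # -> ['vulnerable code in corpus']
```

Encoding the taxonomy as data rather than prose is what makes "auditing against your own pipeline" mechanical: each mechanism becomes a row a team can tie to a concrete check.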

The industry implications are substantial. Development teams building code-generation tools now have scientific evidence supporting investment in data curation and governance infrastructure rather than purely architectural improvements. This validates the data-centric AI movement and suggests that competitive advantages in code generation will accrue to organizations with superior data quality practices. Security researchers also benefit, as understanding exactly how training data defects propagate into vulnerabilities enables better threat modeling and mitigation strategies.

The shift from reactive post-generation filtering to proactive data governance represents a maturing field recognizing that prevention costs less than remediation. Organizations deploying LLM code tools should scrutinize their training data provenance and implement continuous evaluation loops rather than assuming model outputs are inherently safe after fine-tuning.
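As a toy illustration of what a proactive gate looks like (assumed example, not a pipeline from the paper), the sketch below filters Python training samples that fail to parse before they ever reach fine-tuning, using only the standard-library `ast` module:

```python
import ast

def is_syntactically_valid(sample: str) -> bool:
    """Cheap first gate in a proactive data-governance pipeline:
    reject Python training samples that do not even parse."""
    try:
        ast.parse(sample)
        return True
    except SyntaxError:
        return False

def curate(samples: list[str]) -> list[str]:
    """Keep only samples passing the validity gate. A real pipeline
    would layer further checks: license, vulnerability scan, dedup."""
    return [s for s in samples if is_syntactically_valid(s)]

raw = [
    "def add(a, b):\n    return a + b\n",   # valid sample, kept
    "def broken(:\n    pass\n",             # syntax error, dropped
]
print(len(curate(raw)))  # -> 1
```

The point of the sketch is the placement of the check: defects are caught before training, where a single filter pass is cheap, rather than after generation, where every model output must be screened.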

Key Takeaways
  • Training data quality issues are the primary cause of code generation failures in LLMs, not inherent model limitations
  • A unified taxonomy identifies 18 specific mechanisms linking training data defects to code quality problems across nine dimensions
  • Industry is shifting from reactive post-generation filtering toward proactive, data-centric governance and closed-loop repair systems
  • Organizations building code-generation tools should prioritize data curation and continuous evaluation over architectural improvements alone
  • The research provides actionable patterns for auditing training pipelines and implementing data governance frameworks