66 articles tagged with #code-generation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠 Researchers propose Symbolic Equivalence Partitioning, a novel inference-time selection method for code generation that uses symbolic execution and SMT constraints to identify correct solutions without expensive external verifiers. The approach improves accuracy on HumanEval+ by 10.3% and on LiveCodeBench by 17.1% at N=10 without requiring additional LLM inference.
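The selection idea can be sketched in a few lines. The paper proves candidate equivalence symbolically with SMT constraints; the stand-in below substitutes concrete execution on shared inputs (a significant simplification), grouping candidates by observable behavior and returning a member of the largest behavioral class:

```python
from collections import defaultdict

def partition_and_select(candidates, test_inputs):
    """Group candidate solutions by behavior on shared inputs, then
    return one member of the largest behavioral class. Concrete
    execution here is a crude stand-in for the paper's symbolic
    equivalence checking."""
    classes = defaultdict(list)
    for fn in candidates:
        signature = []
        for x in test_inputs:
            try:
                signature.append(repr(fn(x)))
            except Exception as e:
                signature.append(f"error:{type(e).__name__}")
        classes[tuple(signature)].append(fn)
    # Majority vote over behavioral classes: the largest class is
    # most likely to contain a correct solution.
    largest = max(classes.values(), key=len)
    return largest[0]

# Three sampled "solutions" to abs(); the first two agree everywhere,
# the third is wrong on negatives, so the majority class wins.
cands = [lambda x: x if x >= 0 else -x,
         lambda x: max(x, -x),
         lambda x: x]
chosen = partition_and_select(cands, [-2, 0, 3])
```

Picking the largest equivalence class is what makes selection cheap: no external verifier or extra LLM calls are needed, only agreement among samples.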
AI Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠 Researchers have developed SecPI, a new fine-tuning pipeline that teaches reasoning language models to automatically generate secure code without requiring explicit security instructions. The approach improves secure code generation by 14 percentage points on security benchmarks while maintaining functional correctness.
AI Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠 Researchers propose using generative AI agents to create customized user plane processing blocks for 6G mobile networks based on text-based service requests. The study evaluates factors affecting AI code generation accuracy for network-specific tasks, finding that AI agents can successfully generate desired processing functions under suitable conditions.
AI Neutral · arXiv – CS AI · Apr 6 · 7/10
🧠 Researchers introduce IndustryCode, the first comprehensive benchmark for evaluating Large Language Models' code generation capabilities across multiple industrial domains and programming languages. The benchmark includes 579 sub-problems from 125 industrial challenges spanning finance, automation, aerospace, and remote sensing, with the top-performing model Claude 4.5 Opus achieving 68.1% accuracy on sub-problems.
🧠 Claude
AI Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers introduced WebCoderBench, the first comprehensive benchmark for evaluating web application generation by large language models, featuring 1,572 real-world user requirements and 24 evaluation metrics. The benchmark tests 12 representative LLMs and shows no single model dominates across all metrics, providing opportunities for targeted improvements.
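The "no single model dominates" finding is the kind of conclusion that falls out of a per-metric leaderboard. A minimal sketch (model names and scores below are invented, not from the benchmark):

```python
def metric_winners(scores):
    """For each metric, find the best-scoring model.

    `scores` maps model -> {metric: value}, higher is better. If the
    returned winners differ across metrics, no model dominates."""
    metrics = next(iter(scores.values())).keys()
    return {m: max(scores, key=lambda model: scores[model][m])
            for m in metrics}

scores = {
    "model_a": {"layout": 0.81, "accessibility": 0.64, "functionality": 0.72},
    "model_b": {"layout": 0.77, "accessibility": 0.70, "functionality": 0.69},
}
w = metric_winners(scores)
# model_a leads on layout and functionality, model_b on accessibility:
# neither dominates, mirroring the paper's aggregate finding.
```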
AI Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers introduced SAGE, a multi-agent framework that improves large language model reasoning through self-evolution using four specialized agents. The system achieved significant performance gains on coding and mathematics benchmarks without requiring large human-labeled datasets.
AI Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers introduced PriCoder, a new approach that improves Large Language Models' ability to generate code using private library APIs by over 20%. The method uses automatically synthesized training data through graph-based operators to teach LLMs private library usage, addressing a key limitation in current AI coding capabilities.
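To make the graph-based synthesis idea concrete, here is a toy sketch: walk a dependency graph of a hypothetical private library's API (everything below, including the API names, is invented for illustration; the paper's graph operators are considerably richer) to emit (prompt, code) training pairs:

```python
import random

# Hypothetical private-library API graph: an edge A -> B means
# "B can follow A in a valid call sequence".
API_GRAPH = {
    "client.connect": ["client.open_session"],
    "client.open_session": ["session.query", "session.upload"],
    "session.query": ["session.close"],
    "session.upload": ["session.close"],
    "session.close": [],
}

def synthesize_example(start="client.connect", rng=random):
    """Walk the API graph to produce one (prompt, code) training pair
    demonstrating correct private-API usage."""
    path = [start]
    while API_GRAPH[path[-1]]:
        path.append(rng.choice(API_GRAPH[path[-1]]))
    action = path[-2].split(".")[1]  # e.g. "query" or "upload"
    prompt = f"Use the private client library to {action} data."
    code = "\n".join(f"{name}()" for name in path)
    return prompt, code

random.seed(0)
prompt, code = synthesize_example()
```

Pairs like this can be generated in bulk for fine-tuning, which is how synthesized data can teach an LLM APIs it never saw during pretraining.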
AI Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠 Researchers introduce MiniAppBench, a new benchmark for evaluating Large Language Models' ability to generate interactive HTML applications rather than static text responses. The benchmark includes 500 real-world tasks and an agentic evaluation framework called MiniAppEval that uses browser automation for testing.
AI Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠 Researchers developed R1-Code-Interpreter, a large language model that uses multi-stage reinforcement learning to autonomously generate code for step-by-step reasoning across diverse tasks. The 14B parameter model achieves 72.4% accuracy on test tasks, outperforming GPT-4o variants and demonstrating emergent self-checking capabilities through code generation.
🏢 Hugging Face · 🧠 GPT-4
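The core loop of any code-interpreter setup is executing model-emitted code in isolation and feeding the result back into the context. A minimal sketch of that execution step (not the paper's actual harness), using a fresh interpreter per snippet:

```python
import subprocess
import sys

def run_snippet(code: str, timeout: float = 5.0) -> str:
    """Execute a model-emitted Python snippet in a fresh interpreter
    process and return its stdout, so the result can be appended to
    the model's context for the next reasoning step."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        return f"[error] {result.stderr.strip()}"
    return result.stdout.strip()

# A step the model might delegate to code instead of attempting
# the arithmetic in natural language:
out = run_snippet("print(sum(i*i for i in range(1, 11)))")  # -> "385"
```

Running each snippet in a subprocess keeps a crashing or hanging generation from taking down the harness; real deployments add sandboxing on top of this.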
AI Neutral · arXiv – CS AI · Mar 5 · 7/10
🧠 Researchers introduce SWE-CI, a new benchmark that evaluates AI agents' ability to maintain codebases over time through continuous integration processes. Unlike existing static bug-fixing benchmarks, SWE-CI tests agents across 100 long-term tasks spanning an average of 233 days and 71 commits each.
AI Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠 Researchers developed a multi-agent LLM system that translates legal statutes into executable software, using U.S. tax preparation as a test case. The system achieved a 45% success rate using GPT-4o-mini, significantly outperforming larger frontier models such as GPT-4o and Claude 3.5, which achieved only 9–15% success rates on complex tax code tasks.
🧠 GPT-4 · 🧠 Claude
AI Neutral · arXiv – CS AI · Mar 4 · 6/10
🧠 Researchers propose Human-Certified Module Repositories (HCMRs) as a new framework to ensure trustworthy software development in the AI era. The system combines human oversight with automated analysis to certify and curate reusable code modules, addressing growing security concerns as AI increasingly generates and assembles software components.
AI Bullish · arXiv – CS AI · Mar 4 · 6/10
🧠 Researchers have developed a Bayesian adversarial multi-agent framework for AI-driven scientific code generation, featuring three coordinated LLM agents that work together to improve reliability and reduce errors. The Low-code Platform (LCP) enables non-expert users to generate scientific code through natural language prompts, demonstrating superior performance in benchmark tests and Earth Science applications.
AI Neutral · arXiv – CS AI · Mar 4 · 6/10
🧠 Researchers introduce CUDABench, a comprehensive benchmark for evaluating Large Language Models' ability to generate CUDA code from text descriptions. The benchmark reveals significant challenges: generated code often compiles but fails functional tests, and models show a lack of domain-specific knowledge and poor GPU hardware utilization.
AI Neutral · arXiv – CS AI · Mar 3 · 7/10
🧠 Researchers introduce Interaction2Code, the first benchmark for evaluating Multimodal Large Language Models' ability to generate interactive webpage code from prototypes. The study identifies four critical limitations in current MLLMs and proposes enhancement strategies to improve their performance on dynamic web interactions.
AI Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠 Researchers released two open-source datasets, SwallowCode and SwallowMath, that significantly improve large language model performance in coding and mathematics through systematic data rewriting rather than filtering. The datasets boost Llama-3.1-8B performance by +17.0 on HumanEval for coding and +12.4 on GSM8K for math tasks.
AI Neutral · arXiv – CS AI · Mar 3 · 7/10
🧠 Researchers introduce InnoGym, the first benchmark designed to evaluate AI agents' innovation potential rather than just correctness. The framework measures both performance gains and methodological novelty across 18 real-world engineering and scientific tasks, revealing that while AI agents can generate novel approaches, they lack robustness for significant performance improvements.
AI Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠 Researchers introduce VALTEST, a framework that uses semantic entropy to automatically validate test cases generated by Large Language Models, addressing the problem of invalid or hallucinated tests that mislead AI programming agents. The system improves test validity by up to 29% and enhances code generation performance through better filtering of LLM-generated test cases.
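The intuition behind semantic entropy is that a valid test is sampled consistently, while a hallucinated one scatters across incompatible variants. A simplified sketch (clustering by parsed AST rather than the paper's semantic clustering):

```python
import ast
import math
from collections import Counter

def semantic_entropy(samples):
    """Entropy over semantically-clustered samples: two sampled test
    assertions count as equal if their ASTs match after parsing --
    a crude stand-in for the paper's clustering. Low entropy suggests
    a consistent, likely-valid test."""
    def canon(src):
        return ast.dump(ast.parse(src))
    counts = Counter(canon(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Five sampled tests for the same prompt: the consistent set collapses
# to one cluster (entropy 0), the scattered set spreads over five.
consistent = ["assert add(2, 3) == 5"] * 4 + ["assert add(2,3) == 5"]
scattered = ["assert add(2, 3) == 5", "assert add(2, 3) == 6",
             "assert add(2, 3) == 0", "assert add(0, 0) == 1",
             "assert add(1, 1) == 3"]
low, high = semantic_entropy(consistent), semantic_entropy(scattered)
```

Filtering out high-entropy tests before using them as a reward or acceptance signal is what protects the downstream coding agent from optimizing against hallucinated assertions.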
AI Neutral · arXiv – CS AI · Feb 27 · 7/10
🧠 Researchers identify a critical trade-off in AI model training where optimizing for Pass@k metrics (multiple attempts) degrades Pass@1 performance (single attempt). The study reveals this occurs due to gradient conflicts when the training process reweights toward low-success prompts, creating interference that hurts single-shot performance.
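For context, Pass@k is conventionally computed with the unbiased estimator popularized by the Codex evaluation work: draw k of n samples without replacement and ask whether at least one of the c correct ones is included.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k), the
    probability that k draws (without replacement) from n samples,
    c of them correct, include at least one correct sample."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 2 correct, extra attempts help enormously,
# which is why training can be pulled toward Pass@k at Pass@1's expense:
p1 = pass_at_k(10, 2, 1)  # 0.2
p5 = pass_at_k(10, 2, 5)  # ~0.778
```

The gap between p1 and p5 illustrates the tension the paper studies: a model can raise Pass@k by diversifying attempts on hard prompts even while its single-shot accuracy drops.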
AI Bullish · OpenAI News · Nov 19 · 7/10
🧠 OpenAI introduces GPT-5.1-Codex-Max, an advanced agentic coding model designed for large-scale, long-running development projects. The model features enhanced reasoning capabilities and improved token efficiency compared to previous versions.
AI Bullish · OpenAI News · May 24 · 7/10
🧠 OpenAI Codex is now powering 70 different applications across various use cases through the OpenAI API. This represents significant adoption of OpenAI's code generation technology across the developer ecosystem.
AI Bullish · OpenAI News · Aug 10 · 7/10
🧠 OpenAI has released an improved version of Codex, their AI system that converts natural language into code. The enhanced system is now available through their API in private beta, marking a significant advancement in AI-powered programming tools.
AI Bearish · arXiv – CS AI · Apr 10 · 6/10
🧠 A new empirical study reveals that eight major LLMs exhibit systematic biases in code generation, overusing popular libraries like NumPy in 45% of cases and defaulting to Python even when unsuitable, prioritizing familiarity over task-specific optimality. The findings highlight gaps in current LLM evaluation methodologies and underscore the need for targeted improvements in training data diversity and benchmarking standards.
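Measuring library-preference bias of the kind this study reports amounts to counting which modules generated code imports. A small sketch of such a measurement over a batch of generated snippets (the sample snippets are invented):

```python
import ast
from collections import Counter

def count_imports(snippets):
    """Count top-level imported modules across generated code snippets,
    to surface library-preference bias such as NumPy overuse."""
    counts = Counter()
    for src in snippets:
        try:
            tree = ast.parse(src)
        except SyntaxError:
            continue  # skip snippets that are not valid Python
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                counts.update(a.name.split(".")[0] for a in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                counts[node.module.split(".")[0]] += 1
    return counts

samples = [
    "import numpy as np\nprint(np.mean([1, 2, 3]))",
    "import numpy\nimport math",
    "from statistics import mean\nprint(mean([1, 2, 3]))",
]
c = count_imports(samples)  # numpy appears in 2 of 3 snippets
```

Comparing these frequencies against what each task actually requires is how a study can argue that a model reaches for a familiar library rather than the fitting one.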
AI Bullish · arXiv – CS AI · Apr 10 · 6/10
🧠 Researchers propose FLeX, a parameter-efficient fine-tuning approach combining LoRA, advanced optimizers, and Fourier-based regularization to enable cross-lingual code generation across programming languages. The method achieves 42.1% pass@1 on Java tasks compared to a 34.2% baseline, demonstrating significant improvements in multilingual transfer without full model retraining.
🧠 Llama
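Of FLeX's ingredients, the LoRA component is the easiest to show in isolation. A toy forward pass (toy dimensions, NumPy in place of a deep-learning framework): the pretrained weight W stays frozen, and only a low-rank update B·A is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and LoRA rank (toy values)

W = rng.normal(size=(d, d))           # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                  # trainable up-projection, zero-init

def lora_forward(x, alpha=16.0):
    """LoRA forward pass: y = W x + (alpha/r) * B A x.
    Only A and B (2*d*r parameters) are trained instead of d*d,
    which is the source of the parameter efficiency."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
base = W @ x
adapted = lora_forward(x)
# With B zero-initialized, the adapter starts as an exact no-op,
# so fine-tuning departs smoothly from the pretrained model.
```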
AI Bearish · arXiv – CS AI · Apr 10 · 6/10
🧠 Researchers introduce CLI-Tool-Bench, a new benchmark for evaluating large language models' ability to generate complete software from scratch. Testing seven state-of-the-art LLMs reveals that top models achieve under 43% success rates, exposing significant limitations in current AI-driven 0-to-1 software generation despite increased computational investment.