#code-generation News & Analysis

204 articles tagged with #code-generation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics

Researchers introduced WebCoderBench, the first comprehensive benchmark for evaluating web application generation by large language models, featuring 1,572 real-world user requirements and 24 evaluation metrics. The benchmark tests 12 representative LLMs and shows that no single model dominates across all metrics, which leaves room for targeted improvements.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

SAGE: Multi-Agent Self-Evolution for LLM Reasoning

Researchers introduced SAGE, a multi-agent framework that improves large language model reasoning through self-evolution using four specialized agents. The system achieved significant performance gains on coding and mathematics benchmarks without requiring large human-labeled datasets.

AI · Neutral · arXiv – CS AI · Mar 11 · 7/10

MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

Researchers introduce MiniAppBench, a new benchmark for evaluating Large Language Models' ability to generate interactive HTML applications rather than static text responses. The benchmark includes 500 real-world tasks and an agentic evaluation framework called MiniAppEval that uses browser automation for testing.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning

Researchers developed R1-Code-Interpreter, a large language model that uses multi-stage reinforcement learning to autonomously generate code for step-by-step reasoning across diverse tasks. The 14B parameter model achieves 72.4% accuracy on test tasks, outperforming GPT-4o variants and demonstrating emergent self-checking capabilities through code generation.

🏢 Hugging Face · 🧠 GPT-4

AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Researchers introduce SWE-CI, a new benchmark that evaluates AI agents' ability to maintain codebases over time through continuous integration processes. Unlike existing static bug-fixing benchmarks, SWE-CI tests agents across 100 long-term tasks spanning an average of 233 days and 71 commits each.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software

Researchers developed a multi-agent LLM system that translates legal statutes into executable software, using U.S. tax preparation as a test case. The system achieved a 45% success rate using GPT-4o-mini, significantly outperforming larger frontier models such as GPT-4o and Claude 3.5, which achieved only 9-15% success rates on complex tax code tasks.

🧠 GPT-4 · 🧠 Claude

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10

AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework

Researchers have developed a Bayesian adversarial multi-agent framework for AI-driven scientific code generation, featuring three coordinated LLM agents that work together to improve reliability and reduce errors. The Low-code Platform (LCP) enables non-expert users to generate scientific code through natural language prompts, demonstrating superior performance in benchmark tests and Earth Science applications.

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10

CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

Researchers introduce CUDABench, a comprehensive benchmark for evaluating Large Language Models' ability to generate CUDA code from text descriptions. The benchmark reveals significant challenges: generated code often compiles but has low functional correctness, models lack domain-specific knowledge, and GPU hardware utilization is poor.

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10

Human-Certified Module Repositories for the AI Age

Researchers propose Human-Certified Module Repositories (HCMRs) as a new framework to ensure trustworthy software development in the AI era. The system combines human oversight with automated analysis to certify and curate reusable code modules, addressing growing security concerns as AI increasingly generates and assembles software components.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Researchers released two open-source datasets, SwallowCode and SwallowMath, that significantly improve large language model performance in coding and mathematics through systematic data rewriting rather than filtering. The datasets raise Llama-3.1-8B scores by +17.0 on HumanEval for coding and +12.4 on GSM8K for math.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10

InnoGym: Benchmarking the Innovation Potential of AI Agents

Researchers introduce InnoGym, the first benchmark designed to evaluate AI agents' innovation potential rather than just correctness. The framework measures both performance gains and methodological novelty across 18 real-world engineering and scientific tasks, revealing that while AI agents can generate novel approaches, they lack robustness for significant performance improvements.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10

Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping

Researchers introduce Interaction2Code, the first benchmark for evaluating Multimodal Large Language Models' ability to generate interactive webpage code from prototypes. The study identifies four critical limitations in current MLLMs and proposes enhancement strategies to improve their performance on dynamic web interactions.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10

Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Researchers identify a critical trade-off in AI model training where optimizing for Pass@k metrics (multiple attempts) degrades Pass@1 performance (single attempt). The study reveals this occurs due to gradient conflicts when the training process reweights toward low-success prompts, creating interference that hurts single-shot performance.
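
For readers unfamiliar with the two metrics, the sketch below shows how Pass@k is commonly estimated from n sampled completions of which c pass the tests. It uses the widely cited unbiased estimator 1 - C(n-c, k) / C(n, k); whether the paper summarized above uses this exact estimator is an assumption, and the function name `pass_at_k` and the sample counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total completions sampled for a prompt
    c: number of those completions that passed the tests
    k: number of attempts allowed (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every draw of k includes a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical prompt: 10 samples drawn, 3 of them correct.
    print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")  # prints pass@1 = 0.300
    print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")  # prints pass@5 = 0.917
```

The example illustrates why the two metrics can pull apart: a prompt with a low per-sample success rate still scores well on Pass@5, so training that favors Pass@k can put more weight on such prompts than single-attempt accuracy would.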

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy

Researchers introduce VALTEST, a framework that uses semantic entropy to automatically validate test cases generated by Large Language Models, addressing the problem of invalid or hallucinated tests that mislead AI programming agents. The system improves test validity by up to 29% and enhances code generation performance through better filtering of LLM-generated test cases.

AI · Bullish · OpenAI News · Nov 19 · 7/10

Building more with GPT-5.1-Codex-Max

OpenAI introduces GPT-5.1-Codex-Max, an advanced agentic coding model designed for large-scale, long-running development projects. The model features enhanced reasoning capabilities and improved token efficiency compared to previous versions.

AI · Bullish · OpenAI News · May 24 · 7/10

Powering next generation applications with OpenAI Codex

OpenAI Codex is now powering 70 different applications across various use cases through the OpenAI API. This represents significant adoption of OpenAI's code generation technology across the developer ecosystem.

AI · Bullish · OpenAI News · Aug 10 · 7/10

OpenAI Codex

OpenAI has released an improved version of Codex, its AI system that converts natural language into code. The enhanced system is now available through the OpenAI API in private beta, marking a significant advancement in AI-powered programming tools.

AI · Bullish · Crypto Briefing · Jun 25 · 6/10

Z.AI’s GLM-5.2 (Max) climbs to second place on Code Arena frontend leaderboard

Z.AI's GLM-5.2 (Max) model has achieved second place on the Code Arena frontend leaderboard, signaling competitive performance in AI code generation tasks. The achievement underscores the model's capability to provide cost-effective, customizable AI solutions while reducing vendor dependency in enterprise AI deployment.

AI · Bearish · arXiv – CS AI · Jun 25 · 6/10

Evaluating LLMs on Real-World Software Performance Optimization

Researchers introduce SWE-Pro, a benchmark revealing that current Large Language Models perform poorly at real-world software performance optimization compared to expert engineers. The study shows LLMs achieve negligible runtime improvements and nearly zero memory optimizations, while human experts demonstrate 15.5x speedups and 171.3x peak memory reductions across the same tasks.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

SoK: AI Secure Code Generation: Progress, Pitfalls, and Paths Forward

A systematic analysis of AI code generation security reveals that while models understand secure coding principles theoretically, they frequently fail to implement them correctly in practice. The research identifies substantial gaps between knowledge and execution, offering a framework to measure progress and suggesting principle-guided approaches as a path forward.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models

Researchers introduce LibEvoBench, a benchmark testing how well AI code generation models handle multiple versions of Python libraries. The study reveals that state-of-the-art LLMs struggle with version-specific API knowledge, making anachronistic errors when libraries evolve, though documentation significantly improves performance.

AI · Bullish · TechCrunch – AI · Jun 24 · 6/10

Figma adds code layers, support for animations, more AI features in new update

Figma released a major update introducing code layers, motion and shader support, and AI-powered custom plugin creation capabilities. These features enhance the design platform's technical depth and automation potential, positioning Figma as a more comprehensive tool for developers and designers working with interactive and dynamic content.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Context-Aware Distillation and Ablation for Text2DSL

Researchers improved Text2DSL, a system that automatically generates domain-specific language code from natural language, by replacing prompt-based generation with context-aware distillation using structured inputs like BNF grammars and API specifications. The enhanced approach scaled verified training data from 4,204 to 10,073 examples while maintaining 99.7% runtime accuracy, and ablation studies confirmed that vocabulary context provides the strongest semantic improvements.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

When Do Intrinsic Rewards Work for Code Reasoning? A Comprehensive Study

Researchers conducted a systematic empirical study of intrinsic reward methods for code generation using reinforcement learning, finding that certainty-based approaches achieve early gains but inevitably collapse as models progressively shorten outputs and lose reasoning capability. The study reveals that pre-training with intrinsic rewards offers no significant improvement over training from scratch, challenging the transferability of these methods from mathematical reasoning to code generation tasks.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Text2DSL: LLM-Based Code Generation for Domain-Specific Languages

Researchers introduce Text2DSL, a framework for automatically generating domain-specific language (DSL) code from natural language using large language models, validated on 4,204 Polkit security policy rules. The study demonstrates that providing structured context like BNF grammar and API specifications dramatically improves code generation accuracy to 98.6-99.4% syntactic validity across different model scales without requiring fine-tuning.
