y0news

#code-generation News & Analysis

66 articles tagged with #code-generation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Inference-Time Code Selection via Symbolic Equivalence Partitioning

Researchers propose Symbolic Equivalence Partitioning, a novel inference-time selection method for code generation that uses symbolic execution and SMT constraints to identify correct solutions without expensive external verifiers. The approach improves accuracy on HumanEval+ by 10.3% and on LiveCodeBench by 17.1% at N=10 without requiring additional LLM inference.
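The selection idea can be sketched in miniature. The paper's method relies on symbolic execution and SMT constraints; the toy version below substitutes concrete probe inputs for symbolic reasoning, partitions sampled candidates into behavioral equivalence classes, and keeps a representative of the largest class. The entry-point name `solve` and the probe inputs are illustrative assumptions, not from the paper:

```python
from collections import defaultdict

def select_by_equivalence(candidates, probe_inputs, entry="solve"):
    """Partition candidate programs into behavioral equivalence classes
    (identical outputs on every probe input) and return a representative
    of the largest class -- a 'majority semantics' selection."""
    classes = defaultdict(list)
    for src in candidates:
        namespace = {}
        try:
            exec(src, namespace)
            fn = namespace[entry]
            key = tuple(repr(fn(x)) for x in probe_inputs)
        except Exception:
            key = ("error", src)  # crashing candidates form singleton classes
        classes[key].append(src)
    return max(classes.values(), key=len)[0]

candidates = [
    "def solve(n):\n    return n + n",
    "def solve(n):\n    return 2 * n",   # equivalent to the first
    "def solve(n):\n    return n * n",   # diverges on the probes below
]
best = select_by_equivalence(candidates, probe_inputs=[1, 3, 5])
```

Picking the largest class is a form of semantic majority voting; the paper's symbolic variant would make the equivalence check hold over all inputs rather than just the probes.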

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

SecPI: Secure Code Generation with Reasoning Models via Security Reasoning Internalization

Researchers have developed SecPI, a new fine-tuning pipeline that teaches reasoning language models to automatically generate secure code without requiring explicit security instructions. The approach improves secure code generation by 14 percentage points on security benchmarks while maintaining functional correctness.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

Customized User Plane Processing via Code Generating AI Agents for Next Generation Mobile Networks

Researchers propose using generative AI agents to create customized user plane processing blocks for 6G mobile networks based on text-based service requests. The study evaluates factors affecting AI code generation accuracy for network-specific tasks, finding that AI agents can successfully generate desired processing functions under suitable conditions.

AI · Neutral · arXiv – CS AI · Apr 6 · 7/10

IndustryCode: A Benchmark for Industry Code Generation

Researchers introduce IndustryCode, the first comprehensive benchmark for evaluating Large Language Models' code generation capabilities across multiple industrial domains and programming languages. The benchmark includes 579 sub-problems from 125 industrial challenges spanning finance, automation, aerospace, and remote sensing, with the top-performing model Claude 4.5 Opus achieving 68.1% accuracy on sub-problems.

🧠 Claude
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics

Researchers introduced WebCoderBench, the first comprehensive benchmark for evaluating web application generation by large language models, featuring 1,572 real-world user requirements and 24 evaluation metrics. The benchmark tests 12 representative LLMs and shows no single model dominates across all metrics, providing opportunities for targeted improvements.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

SAGE: Multi-Agent Self-Evolution for LLM Reasoning

Researchers introduced SAGE, a multi-agent framework that improves large language model reasoning through self-evolution using four specialized agents. The system achieved significant performance gains on coding and mathematics benchmarks without requiring large human-labeled datasets.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

To See is Not to Master: Teaching LLMs to Use Private Libraries for Code Generation

Researchers introduced PriCoder, a new approach that improves Large Language Models' ability to generate code using private library APIs by over 20%. The method uses automatically synthesized training data through graph-based operators to teach LLMs private library usage, addressing a key limitation in current AI coding capabilities.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning

Researchers developed R1-Code-Interpreter, a large language model that uses multi-stage reinforcement learning to autonomously generate code for step-by-step reasoning across diverse tasks. The 14B parameter model achieves 72.4% accuracy on test tasks, outperforming GPT-4o variants and demonstrating emergent self-checking capabilities through code generation.
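The agentic pattern behind such models is a simple loop: the model alternates between emitting code and reading its execution output until it commits to an answer. A minimal sketch follows, with a hard-coded stub standing in for the fine-tuned model; the `ANSWER:` convention and the stub's behavior are assumptions for illustration, not the paper's protocol:

```python
import contextlib
import io

def run_code(snippet):
    """Execute a model-proposed snippet and capture stdout as the observation."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(snippet, {})
    except Exception as e:
        return f"Error: {e}"
    return buf.getvalue().strip()

def interpreter_loop(generate, question, max_turns=4):
    """Alternate model turns and code execution until the model answers."""
    transcript = question
    for _ in range(max_turns):
        step = generate(transcript)          # model emits code or a final answer
        if step.startswith("ANSWER:"):
            return step[len("ANSWER:"):].strip()
        observation = run_code(step)
        transcript += f"\n[code]\n{step}\n[output]\n{observation}"
    return None

# Hypothetical stub standing in for the fine-tuned model:
def fake_model(transcript):
    if "[output]" not in transcript:
        return "print(sum(i * i for i in range(1, 11)))"
    return "ANSWER: " + transcript.rsplit("[output]\n", 1)[1].strip()

print(interpreter_loop(fake_model, "What is the sum of squares 1..10?"))  # 385
```

The self-checking behavior the paper reports would emerge inside `generate`: the trained model learns to write verification code for its own intermediate claims before answering.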

๐Ÿข Hugging Face๐Ÿง  GPT-4
AIBullisharXiv โ€“ CS AI ยท Mar 57/10
๐Ÿง 

An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software

Researchers developed a multi-agent LLM system that translates legal statutes into executable software, using U.S. tax preparation as a test case. The system achieved a 45% success rate using GPT-4o-mini, significantly outperforming larger frontier models such as GPT-4o and Claude 3.5, which achieved only 9-15% success rates on complex tax code tasks.

🧠 GPT-4 · 🧠 Claude
AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 5

Human-Certified Module Repositories for the AI Age

Researchers propose Human-Certified Module Repositories (HCMRs) as a new framework to ensure trustworthy software development in the AI era. The system combines human oversight with automated analysis to certify and curate reusable code modules, addressing growing security concerns as AI increasingly generates and assembles software components.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2

AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework

Researchers have developed a Bayesian adversarial multi-agent framework for AI-driven scientific code generation, featuring three coordinated LLM agents that work together to improve reliability and reduce errors. The Low-code Platform (LCP) enables non-expert users to generate scientific code through natural language prompts, demonstrating superior performance in benchmark tests and Earth Science applications.

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 4

CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

Researchers introduce CUDABench, a comprehensive benchmark for evaluating Large Language Models' ability to generate CUDA code from text descriptions. The benchmark reveals significant challenges: models achieve high compilation success rates but low functional correctness, lack domain-specific knowledge, and utilize GPU hardware poorly.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Researchers released two open-source datasets, SwallowCode and SwallowMath, that significantly improve large language model performance in coding and mathematics through systematic data rewriting rather than filtering. The datasets boost Llama-3.1-8B performance by +17.0 on HumanEval for coding and +12.4 on GSM8K for math tasks.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3

InnoGym: Benchmarking the Innovation Potential of AI Agents

Researchers introduce InnoGym, the first benchmark designed to evaluate AI agents' innovation potential rather than just correctness. The framework measures both performance gains and methodological novelty across 18 real-world engineering and scientific tasks, revealing that while AI agents can generate novel approaches, they lack robustness for significant performance improvements.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6

Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy

Researchers introduce VALTEST, a framework that uses semantic entropy to automatically validate test cases generated by Large Language Models, addressing the problem of invalid or hallucinated tests that mislead AI programming agents. The system improves test validity by up to 29% and enhances code generation performance through better filtering of LLM-generated test cases.
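Semantic entropy itself is easy to illustrate: cluster samples by meaning rather than surface form, then take the Shannon entropy of the cluster frequencies. The sketch below is a generic simplification, not VALTEST's actual pipeline; the whitespace-stripping key is an illustrative stand-in for real semantic clustering:

```python
from collections import Counter
from math import log

def semantic_entropy(samples, semantic_key):
    """Shannon entropy over semantic equivalence classes: samples that map
    to the same key count as one meaning. Low entropy suggests the model
    consistently produces the same test; high entropy flags unstable or
    hallucinated test cases worth filtering out."""
    counts = Counter(semantic_key(s) for s in samples)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

# Three LLM-sampled test cases; two agree once whitespace is ignored.
tests = [
    "assert add(2, 3) == 5",
    "assert add(2,3)==5",
    "assert add(2, 3) == 6",
]
h = semantic_entropy(tests, semantic_key=lambda t: t.replace(" ", ""))
```

A validity filter of this kind would keep only tests whose semantic class dominates (entropy near zero) and discard high-entropy outliers.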

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 6

Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Researchers identify a critical trade-off in AI model training where optimizing for Pass@k metrics (multiple attempts) degrades Pass@1 performance (single attempt). The study reveals this occurs due to gradient conflicts when the training process reweights toward low-success prompts, creating interference that hurts single-shot performance.
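For reference, the Pass@k metric under discussion is usually computed with the unbiased estimator popularized by the Codex evaluation: given n generations of which c pass the tests, estimate the probability that at least one of k samples drawn without replacement is correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), the probability
    that at least one of k samples drawn without replacement from
    n generations (c of them correct) passes."""
    if n - c < k:  # fewer failures than draws: a success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 of 10 generations correct, one attempt succeeds 30% of the time,
# while five attempts succeed about 91.7% of the time:
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

The trade-off the paper identifies is that post-training objectives which push p5 upward can simultaneously drag p1 down.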

AI · Bullish · OpenAI News · Nov 19 · 7/10 · 8

Building more with GPT-5.1-Codex-Max

OpenAI introduces GPT-5.1-Codex-Max, an advanced agentic coding model designed for large-scale, long-running development projects. The model features enhanced reasoning capabilities and improved token efficiency compared to previous versions.

AI · Bullish · OpenAI News · May 24 · 7/10 · 7

Powering next generation applications with OpenAI Codex

OpenAI Codex is now powering 70 different applications across various use cases through the OpenAI API. This represents significant adoption of OpenAI's code generation technology across the developer ecosystem.

AI · Bullish · OpenAI News · Aug 10 · 7/10 · 5

OpenAI Codex

OpenAI has released an improved version of Codex, their AI system that converts natural language into code. The enhanced system is now available through their API in private beta, marking a significant advancement in AI-powered programming tools.

AI · Bearish · arXiv – CS AI · Apr 10 · 6/10

A Study of LLMs' Preferences for Libraries and Programming Languages

A new empirical study reveals that eight major LLMs exhibit systematic biases in code generation: they overuse popular libraries like NumPy in 45% of cases and default to Python even when it is unsuitable, prioritizing familiarity over task-specific optimality. The findings highlight gaps in current LLM evaluation methodologies and underscore the need for greater training data diversity and better benchmarking standards.

AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

FLeX: Fourier-based Low-rank EXpansion for multilingual transfer

Researchers propose FLeX, a parameter-efficient fine-tuning approach combining LoRA, advanced optimizers, and Fourier-based regularization to enable cross-lingual code generation across programming languages. The method achieves 42.1% pass@1 on Java tasks compared to a 34.2% baseline, demonstrating significant improvements in multilingual transfer without full model retraining.

🧠 Llama
AI · Bearish · arXiv – CS AI · Apr 10 · 6/10

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

Researchers introduce CLI-Tool-Bench, a new benchmark for evaluating large language models' ability to generate complete software from scratch. Testing seven state-of-the-art LLMs reveals that top models achieve under 43% success rates, exposing significant limitations in current AI-driven 0-to-1 software generation despite increased computational investment.

Page 1 of 3