75 articles tagged with #code-generation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv · CS AI · 2d ago · 6/10
🧠 Researchers propose LatentRefusal, a safety mechanism for LLM-based text-to-SQL systems that detects unanswerable queries by analyzing intermediate hidden activations rather than relying on output-level instruction following. The approach achieves 88.5% F1 score across four benchmarks while adding minimal computational overhead, addressing a critical deployment challenge in AI systems that generate executable code.
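The abstract does not detail the detector itself, but the general idea of gating on hidden activations can be sketched as a linear probe over an intermediate layer's state. Everything below (dimension, weights, function names) is illustrative, not from the paper; in practice the probe would be fit on labeled answerable/unanswerable activations.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 16

# Stand-in probe parameters (in practice, learned, e.g. by logistic regression).
w = rng.normal(size=HIDDEN_DIM)
b = 0.0

def refusal_score(hidden_state: np.ndarray) -> float:
    """Probability-like score that the query is unanswerable."""
    logit = hidden_state @ w + b
    return float(1.0 / (1.0 + np.exp(-logit)))

def should_refuse(hidden_state: np.ndarray, threshold: float = 0.5) -> bool:
    # Gate SQL generation on the probe instead of output-level instructions.
    return bool(refusal_score(hidden_state) >= threshold)

h = rng.normal(size=HIDDEN_DIM)  # activation taken from an intermediate layer
print(should_refuse(h))
```

The point of the design is that refusal is decided before any SQL is emitted, so an unsafe query never reaches the generation step.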
AI · Neutral · arXiv · CS AI · 2d ago · 6/10
🧠 Researchers introduce CodeRQ-Bench, the first benchmark for evaluating LLM reasoning quality across coding tasks including generation, summarization, and classification. They propose VERA, a two-stage evaluator combining evidence-grounded verification with ambiguity-aware score correction, achieving significant performance improvements over existing methods.
AI · Neutral · arXiv · CS AI · 2d ago · 6/10
🧠 Researchers propose CoDe-R, a two-stage framework using Large Language Models to improve binary decompilation by reducing logical errors and semantic misalignment. A 1.3B model using this approach achieves state-of-the-art performance on the HumanEval-Decompile benchmark, becoming the first lightweight model to exceed 50% re-executability rates.
AI · Bullish · arXiv · CS AI · 3d ago · 6/10
🧠 Researchers introduce PoTable, a novel AI framework that enhances Large Language Models' ability to reason about tabular data through systematic, stage-oriented planning before execution. The approach mimics professional data analyst workflows by breaking complex table reasoning into distinct analytical stages with clear objectives, demonstrating improved accuracy and explainability across benchmark datasets.
AI · Neutral · arXiv · CS AI · 3d ago · 6/10
🧠 Doctoral research proposes a systematic framework for multi-agent LLM pair programming that improves code reliability and auditability through externalized intent and iterative validation. The study addresses critical gaps in how AI coding agents can produce trustworthy outputs aligned with developer objectives across testing, implementation, and maintenance workflows.
AI · Bullish · arXiv · CS AI · 4d ago · 6/10
🧠 Researchers present PETITE, a tutor-student multi-agent framework that enhances LLM problem-solving by assigning complementary roles to agents from the same model. Evaluated on coding benchmarks, the approach achieves comparable or superior accuracy to existing methods while consuming significantly fewer tokens, demonstrating that structured role-differentiated interactions can improve LLM performance more efficiently than larger models or heterogeneous ensembles.
AI · Bearish · arXiv · CS AI · Apr 10 · 6/10
🧠 Researchers introduce CLI-Tool-Bench, a new benchmark for evaluating large language models' ability to generate complete software from scratch. Testing seven state-of-the-art LLMs reveals that top models achieve under 43% success rates, exposing significant limitations in current AI-driven 0-to-1 software generation despite increased computational investment.
AI · Bullish · arXiv · CS AI · Apr 10 · 6/10
🧠 Researchers propose FLeX, a parameter-efficient fine-tuning approach combining LoRA, advanced optimizers, and Fourier-based regularization to enable cross-lingual code generation across programming languages. The method achieves 42.1% pass@1 on Java tasks compared to a 34.2% baseline, demonstrating significant improvements in multilingual transfer without full model retraining.
🧠 Llama
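pass@1 figures like those above are conventionally computed with the standard unbiased pass@k estimator (n sampled completions per task, c of them correct); a minimal sketch, assuming this entry follows that convention:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per task, c of them correct."""
    if n - c < k:
        return 1.0  # any k-subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the fraction of correct samples:
print(pass_at_k(10, 4, 1))  # 0.4
```

Per-task values are then averaged over the benchmark to produce a single percentage.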
AI · Bearish · arXiv · CS AI · Apr 10 · 6/10
🧠 A new empirical study reveals that eight major LLMs exhibit systematic biases in code generation, overusing popular libraries like NumPy in 45% of cases and defaulting to Python even when unsuitable, prioritizing familiarity over task-specific optimality. The findings highlight gaps in current LLM evaluation methodologies and underscore the need for targeted improvements in training data diversity and benchmarking standards.
AI · Bullish · arXiv · CS AI · Apr 7 · 6/10
🧠 Researchers developed AP-MAE, a vision transformer model that analyzes attention patterns in large language models at scale to improve interpretability. The system can predict code generation accuracy with 55-70% precision and enable targeted interventions that increase model accuracy by 13.6%.
AI · Neutral · arXiv · CS AI · Apr 7 · 6/10
🧠 Research study reveals that when Claude Opus 4.6 deobfuscates JavaScript code, poisoned identifier names from the original string table consistently survive in the reconstructed code, even when the AI demonstrates correct understanding of the code's semantics. Changing the task framing from 'deobfuscate' to 'write fresh implementation' significantly reduced this persistence while maintaining algorithmic accuracy.
🧠 Claude 🧠 Haiku 🧠 Opus
AI · Bullish · arXiv · CS AI · Apr 6 · 6/10
🧠 Researchers introduce InCoder-32B-Thinking, an AI model trained with an Error-driven Chain-of-Thought (ECoT) framework and an Industrial Code World Model (ICWM) for industrial software development. The model generates reasoning traces for hardware-constrained programming and achieves top-tier performance on 23 benchmarks, scoring 81.3% on LiveCodeBench v5 and 84.0% on CAD-Coder.
AI · Neutral · arXiv · CS AI · Apr 6 · 6/10
🧠 Researchers introduce StructEval, a comprehensive benchmark for evaluating Large Language Models' ability to generate structured outputs across 18 formats including JSON, HTML, and React. Even state-of-the-art models like o1-mini only achieve 75.58% average scores, with open-source models performing approximately 10 points lower.
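As a toy illustration of format-level scoring for the JSON case (the benchmark's actual rubric is not reproduced here; `json_format_score` and its key-coverage heuristic are assumptions for illustration):

```python
import json

def json_format_score(output: str, required_keys: set[str]) -> float:
    """Score a model's JSON output: 0 if unparseable, else fraction of
    required top-level keys present."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    present = required_keys & obj.keys()
    return len(present) / len(required_keys)

good = '{"name": "widget", "price": 3.5}'
bad = '{"name": "widget", "price": }'  # invalid JSON
print(json_format_score(good, {"name", "price"}))  # 1.0
print(json_format_score(bad, {"name", "price"}))   # 0.0
```

Even this crude check shows why structured-output scores separate models: a single trailing comma or unquoted key zeroes out an otherwise correct answer.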
AI · Bullish · arXiv · CS AI · Mar 27 · 6/10
🧠 CodeRefine is a new AI framework that automatically converts research paper methodologies into functional code using Large Language Models. The system creates knowledge graphs from papers and uses retrieval-augmented generation to produce more accurate code implementations than traditional zero-shot prompting methods.
AI · Bullish · arXiv · CS AI · Mar 26 · 6/10
🧠 Researchers have developed LLMLOOP, a framework that automatically refines LLM-generated code and test cases through five iterative loops addressing compilation errors, static analysis issues, test failures, and quality improvements. The tool was evaluated on the HUMANEVAL-X benchmark and demonstrated effectiveness in improving the quality of AI-generated code outputs.
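A generic check-and-repair loop of the kind described can be sketched as follows; `refine`, `compile_check`, and the toy patcher are illustrative stand-ins (a real system would call an LLM where `toy_patch` is used, and would add lint and test checks alongside compilation):

```python
from typing import Callable, Optional

Check = Callable[[str], Optional[str]]  # returns an error message, or None if clean

def compile_check(code: str) -> Optional[str]:
    try:
        compile(code, "<candidate>", "exec")
        return None
    except SyntaxError as e:
        return f"syntax error: {e.msg}"

def refine(code: str, checks: list[Check],
           patch: Callable[[str, str], str], max_rounds: int = 5) -> str:
    """Run each check; on failure, ask the patcher to repair, up to a budget."""
    for _ in range(max_rounds):
        errors = [msg for chk in checks if (msg := chk(code))]
        if not errors:
            break
        code = patch(code, errors[0])  # stand-in for an LLM repair call
    return code

# Toy patcher that fixes one known defect (an LLM would do this in practice).
def toy_patch(code: str, error: str) -> str:
    return code.replace("retur n", "return")

broken = "def add(a, b):\n    retur n a + b\n"
fixed = refine(broken, [compile_check], toy_patch)
print(compile_check(fixed) is None)  # True
```

The bounded `max_rounds` budget matters: without it, a patcher that cannot fix an error loops forever.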
AI · Bullish · arXiv · CS AI · Mar 26 · 6/10
🧠 Researchers developed a scalable multi-turn synthetic data generation pipeline using reinforcement learning to improve large language models' code generation capabilities. The approach uses teacher models to create structured difficulty progressions and curriculum-based training, showing consistent improvements in code generation across Llama3.1-8B and Qwen models.
🧠 Llama
AI · Bullish · arXiv · CS AI · Mar 17 · 6/10
🧠 Researchers propose a new framework that uses LLMs as code generators rather than per-instance evaluators for high-stakes decision-making, creating interpretable and reproducible AI systems. The approach generates executable decision logic once instead of querying LLMs for each prediction, demonstrated through venture capital founder screening with competitive performance while maintaining full transparency.
🧠 GPT-4
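The "generate decision logic once, run it many times" pattern can be sketched as below; the scoring rule, field names, and thresholds are invented for illustration and are not from the paper:

```python
# An LLM emits a scoring function ONCE as source code; it is then reviewed,
# versioned, and applied deterministically to every applicant, instead of
# issuing one opaque LLM query per prediction.
GENERATED_RULE = """
def screen(applicant):
    score = 0
    score += 2 if applicant["prior_exits"] > 0 else 0
    score += 1 if applicant["domain_years"] >= 5 else 0
    score += 1 if applicant["team_size"] >= 2 else 0
    return score >= 2
"""

namespace: dict = {}
exec(GENERATED_RULE, namespace)  # in production: sandbox and human review first
screen = namespace["screen"]

applicants = [
    {"prior_exits": 1, "domain_years": 3, "team_size": 1},
    {"prior_exits": 0, "domain_years": 2, "team_size": 1},
]
print([screen(a) for a in applicants])  # [True, False]
```

Because the logic is ordinary source code, every decision is reproducible and auditable line by line, which is the transparency claim the summary refers to.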
AI · Bearish · arXiv · CS AI · Mar 12 · 6/10
🧠 A research study analyzing 319 LLM-generated security patches found that only 24.8% achieve full correctness, with most failures due to semantic misunderstanding rather than syntax errors. LLMs preserve functionality well but struggle significantly with security fixes, with success rates varying dramatically by vulnerability type.
AI · Bullish · arXiv · CS AI · Mar 11 · 6/10
🧠 Researchers introduce SiliconMind-V1, a new multi-agent AI framework that generates Verilog hardware code with improved functional correctness. The system uses locally fine-tuned language models with integrated testing and debugging capabilities, outperforming existing methods while using fewer training resources.
AI · Bullish · arXiv · CS AI · Mar 11 · 6/10
🧠 Researchers have developed neural debuggers: AI models that can emulate traditional Python debuggers by stepping through code execution, setting breakpoints, and predicting both forward and backward program states. This breakthrough enables more interactive control over neural code interpretation compared to existing approaches that only execute programs linearly.
🏢 Meta
AI · Bullish · arXiv · CS AI · Mar 11 · 6/10
🧠 Researchers introduce RECODE, a new framework that improves visual reasoning in AI models by converting images into executable code for verification. The system generates multiple candidate programs to reproduce visuals, then selects and refines the most accurate reconstruction, significantly outperforming existing methods on visual reasoning benchmarks.
AI · Neutral · arXiv · CS AI · Mar 9 · 6/10
🧠 A research study involving 737 participants found that human guidance is crucial in 'vibe coding', i.e., using natural language to generate code through AI. The study shows hybrid systems perform best when humans provide high-level instructions while AI handles evaluation, and that AI-only instruction leads to performance collapse.
AI · Bullish · arXiv · CS AI · Mar 3 · 6/10
🧠 Researchers introduce SWE-Hub, a comprehensive system for generating scalable, executable software engineering tasks for training AI agents. The platform addresses current limitations in AI software development by providing unified environment automation, bug synthesis, and diverse task generation across multiple programming languages.
AI · Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers propose MIST-RL, a reinforcement learning framework that improves AI code generation by creating more efficient test suites. The method achieves 28.5% higher fault detection while using 19.3% fewer test cases, demonstrating significant improvements in AI code verification efficiency.
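For intuition on the fault-detection-versus-suite-size trade-off (this is the classic greedy set-cover baseline, not MIST-RL's RL method; test and fault names are hypothetical):

```python
def minimize(test_faults: dict[str, set[str]]) -> list[str]:
    """Greedily pick tests until every known fault is detected by some
    chosen test, preferring the test that covers the most uncovered faults."""
    uncovered = set().union(*test_faults.values())
    chosen: list[str] = []
    while uncovered:
        best = max(test_faults, key=lambda t: len(test_faults[t] & uncovered))
        gain = test_faults[best] & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen

# Which faults each test detects (e.g. from mutation testing).
suite = {
    "t1": {"f1", "f2"},
    "t2": {"f2"},
    "t3": {"f3"},
    "t4": {"f1", "f3"},
}
print(minimize(suite))  # two tests suffice to cover f1, f2, f3
```

A learned policy can beat this greedy baseline by also weighing test cost and the likelihood of catching faults not yet in the fault matrix.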
AI · Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers propose a new graph-based agent framework (referred to only as 'method') that tackles automated paper reproduction by recovering the tacit knowledge academic papers leave implicit. Across 40 recent papers, it narrows the performance gap to official implementations to 10.04%, improving over baselines by 24.68%.