AIBullisharXiv – CS AI · 3d ago7/10
🧠Researchers introduce Proactive Interactive Reasoning (PIR), a new paradigm that enables large language models to ask clarifying questions during problem-solving rather than operating blindly with incomplete information. The approach combines supervised fine-tuning and policy optimization to achieve significant improvements in mathematical reasoning, code generation, and document editing tasks while reducing computational overhead.
AIBullisharXiv – CS AI · 3d ago7/10
🧠Eureka is an LLM-driven framework that automates feature engineering for machine learning by treating feature design as a code generation problem. The system combines expert agents, chain-of-thought reasoning, and reinforcement learning to generate and refine features iteratively, demonstrating 16% improvement in cloud resource prediction at Alibaba Cloud.
AIBullisharXiv – CS AI · 4d ago7/10
🧠LACUNA is a new programming model that allows LLM agents to write code that shapes their own runtime environment while maintaining safety through type-checking and validation. The system rejects unsafe code before execution and uses compiler diagnostics to drive retries, achieving competitive performance on benchmark tests while preventing prompt injection and tool misuse attacks.
AIBearisharXiv – CS AI · 4d ago7/10
🧠Researchers have identified and measured Vertical Integration Bias (VIB) in LLMs, where AI models affiliated with specific providers generate code favoring their provider's ecosystem over comparable alternatives. The study found significant bias in direct code generation (up to +18.8 percentage points) that amplifies dramatically in agentic workflows (up to +39.2 pp), raising concerns about vendor lock-in and reduced developer autonomy.
AIBullisharXiv – CS AI · 5d ago7/10
🧠HTMLCure introduces a browser experience framework that improves how large language models generate functional HTML pages by testing them across multiple interactions and states rather than relying on static screenshots. The system automatically repairs broken pages through a closed-loop process, demonstrating significant performance improvements on HTML generation benchmarks.
🧠 GPT-5
AINeutralarXiv – CS AI · May 127/10
🧠Microsoft researchers released Delulu, a benchmark dataset containing 1,951 code generation samples across 7 programming languages designed to test how well large language models detect hallucinations in Fill-in-the-Middle tasks. Testing 11 open-weight models revealed fundamental limitations, with even the strongest achieving only 84.5% accuracy, indicating that code hallucination remains a persistent challenge across all model families.
AI × CryptoNeutralarXiv – CS AI · May 127/10
🤖Researchers introduce SmartEval, a comprehensive benchmark for evaluating Solidity smart contracts generated by LLMs from natural language specifications, comprising 9,000 contracts with expert validation and a five-dimensional evaluation framework. The study reveals characteristic failure modes in LLM-generated contracts and confirms that automated evaluation scores align closely with human expert judgment, establishing a reproducible foundation for assessing smart contract synthesis quality.
AIBullisharXiv – CS AI · May 127/10
🧠Researchers introduce a benchmark showing that AI coding agents achieve 95% compliance with product decisions when augmented with context retrieval systems versus 46% with codebase access alone, a 49-point improvement. The study reveals that product context—including design specs, customer signals, and competitive intelligence—is essential for AI agents to follow organizational decisions invisible in source code.
🧠 Claude
AIBearisharXiv – CS AI · May 97/10
🧠A comprehensive measurement study reveals that large language models frequently specify vulnerable and incompatible library versions in generated Python code, with 36.70%-55.70% of tasks containing known CVEs and 62.75%-74.51% rated as Critical or High severity. The research demonstrates this represents a systemic bias across all evaluated models rather than isolated errors, with most CVEs publicly disclosed before the models' knowledge cutoffs.
AIBullisharXiv – CS AI · May 77/10
🧠Researchers introduce Lossless Context Management (LCM), a deterministic architecture for LLM memory that outperforms Claude Code on long-context tasks up to 1M tokens. LCM combines recursive context compression with engine-managed task partitioning, representing an evolution of recursive language models that prioritizes reliability and state retrievability over flexibility.
🧠 Claude🧠 Opus
AIBullisharXiv – CS AI · May 77/10
🧠Researchers propose Stream of Revision, a new paradigm for LLM-based code generation that allows models to revise and correct their output during generation rather than producing code in a strictly linear fashion. By introducing special action tokens enabling backtracking and editing within a single forward pass, the approach significantly reduces security vulnerabilities in generated code with minimal computational overhead.
AINeutralarXiv – CS AI · May 47/10
🧠Researchers have identified severe social bias in code generated by large language models, with bias scores reaching 60.58% across four major models. They propose a Fairness Monitor Agent that reduces bias by 65.1% while improving code correctness, revealing that standard fairness interventions often amplify rather than mitigate demographic discrimination in AI-generated software.
AIBearisharXiv – CS AI · Apr 157/10
🧠Researchers empirically evaluated 450 LLM-generated Python scripts for construction safety and found alarming reliability gaps, including a 45% silent failure rate where code executes but produces mathematically incorrect safety outputs. The study demonstrates that current frontier LLMs lack the deterministic rigor required for autonomous safety-critical engineering applications, necessitating human oversight and governance frameworks.
🧠 GPT-4🧠 Claude🧠 Gemini
AIBullisharXiv – CS AI · Apr 157/10
🧠Researchers introduce JanusCoder, a foundational multimodal AI model that bridges visual and programmatic intelligence by processing both code and visual outputs. The team created JanusCode-800K, the largest multimodal code corpus, enabling their 7B-14B parameter models to match or exceed commercial AI performance on code generation tasks combining textual instructions and visual inputs.
AIBullisharXiv – CS AI · Apr 147/10
🧠Researchers introduce a framework that automatically learns context-sensitive constraints from LLM interactions, eliminating the need for manual specification while ensuring perfect constraint adherence during generation. The method enables even 1B-parameter models to outperform larger models and state-of-the-art reasoning systems in constraint-compliant generation.
AIBullisharXiv – CS AI · Apr 107/10
🧠Researchers propose Symbolic Equivalence Partitioning, a novel inference-time selection method for code generation that uses symbolic execution and SMT constraints to identify correct solutions without expensive external verifiers. The approach improves accuracy on HumanEval+ by 10.3% and on LiveCodeBench by 17.1% at N=10 without requiring additional LLM inference.
AIBullisharXiv – CS AI · Apr 77/10
🧠Researchers have developed SecPI, a new fine-tuning pipeline that teaches reasoning language models to automatically generate secure code without requiring explicit security instructions. The approach improves secure code generation by 14 percentage points on security benchmarks while maintaining functional correctness.
AIBullisharXiv – CS AI · Apr 77/10
🧠Researchers propose using generative AI agents to create customized user plane processing blocks for 6G mobile networks based on text-based service requests. The study evaluates factors affecting AI code generation accuracy for network-specific tasks, finding that AI agents can successfully generate desired processing functions under suitable conditions.
AINeutralarXiv – CS AI · Apr 67/10
🧠Researchers introduce IndustryCode, the first comprehensive benchmark for evaluating Large Language Models' code generation capabilities across multiple industrial domains and programming languages. The benchmark includes 579 sub-problems from 125 industrial challenges spanning finance, automation, aerospace, and remote sensing, with the top-performing model Claude 4.5 Opus achieving 68.1% accuracy on sub-problems.
🧠 Claude
AINeutralarXiv – CS AI · Mar 177/10
🧠Researchers introduced WebCoderBench, the first comprehensive benchmark for evaluating web application generation by large language models, featuring 1,572 real-world user requirements and 24 evaluation metrics. The benchmark tests 12 representative LLMs and shows no single model dominates across all metrics, providing opportunities for targeted improvements.
AIBullisharXiv – CS AI · Mar 177/10
🧠Researchers introduced PriCoder, a new approach that improves Large Language Models' ability to generate code using private library APIs by over 20%. The method uses automatically synthesized training data through graph-based operators to teach LLMs private library usage, addressing a key limitation in current AI coding capabilities.
AIBullisharXiv – CS AI · Mar 177/10
🧠Researchers introduced SAGE, a multi-agent framework that improves large language model reasoning through self-evolution using four specialized agents. The system achieved significant performance gains on coding and mathematics benchmarks without requiring large human-labeled datasets.
AINeutralarXiv – CS AI · Mar 117/10
🧠Researchers introduce MiniAppBench, a new benchmark for evaluating Large Language Models' ability to generate interactive HTML applications rather than static text responses. The benchmark includes 500 real-world tasks and an agentic evaluation framework called MiniAppEval that uses browser automation for testing.
AIBullisharXiv – CS AI · Mar 56/10
🧠Researchers developed R1-Code-Interpreter, a large language model that uses multi-stage reinforcement learning to autonomously generate code for step-by-step reasoning across diverse tasks. The 14B parameter model achieves 72.4% accuracy on test tasks, outperforming GPT-4o variants and demonstrating emergent self-checking capabilities through code generation.
🏢 Hugging Face🧠 GPT-4
AINeutralarXiv – CS AI · Mar 57/10
🧠Researchers introduce SWE-CI, a new benchmark that evaluates AI agents' ability to maintain codebases over time through continuous integration processes. Unlike existing static bug-fixing benchmarks, SWE-CI tests agents across 100 long-term tasks spanning an average of 233 days and 71 commits each.