36 articles tagged with #software-engineering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers demonstrate a methodology for translating a large production Rust codebase (648K LOC) into Python using LLM assistance, guided by benchmark performance as an objective function. The Python port of Codex CLI, an AI coding agent, achieves near-parity performance on real-world tasks while reducing code size by 15.9x and enabling 30 new features absent from the original Rust implementation.
AI · Bearish · Crypto Briefing · Apr 7 · 7/10
🧠Simon Willison warns that AI's rapid advancement in coding capabilities could lead to a major disaster without improved safety practices. The discussion highlights how AI is transforming software engineering productivity and reshaping traditional development roles.
AI · Bullish · arXiv – CS AI · Apr 6 · 7/10
🧠Researchers demonstrated AI-assisted automated unit test generation and code refactoring in a case study, generating nearly 16,000 lines of reliable unit tests in hours instead of weeks. The approach achieved up to 78% branch coverage in critical modules and significantly reduced regression risk during large-scale refactoring of legacy codebases.
AI · Bullish · arXiv – CS AI · Mar 27 · 7/10
🧠A paradigm shift is occurring in software engineering as AI systems like LLMs increasingly boost development productivity. The paper presents a vision for growing symbiotic partnerships between human developers and AI, identifying key research challenges the software engineering community must address.
AI · Bearish · Ars Technica – AI · Mar 10 · 7/10
🧠Amazon Web Services is implementing new oversight requirements for AI-assisted code changes after experiencing at least two outages linked to AI coding assistants. Senior engineers will now need to sign off on AI-generated code modifications to prevent future incidents.
AI · Neutral · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers introduce SWE-CI, a new benchmark that evaluates AI agents' ability to maintain codebases over time through continuous integration processes. Unlike existing static bug-fixing benchmarks, SWE-CI tests agents across 100 long-term tasks spanning an average of 233 days and 71 commits each.
AI · Bearish · arXiv – CS AI · Mar 4 · 7/10
🧠Researchers introduced ZeroDayBench, a new benchmark testing LLM agents' ability to find and patch 22 critical vulnerabilities in open-source code. Testing on frontier models GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 revealed that current LLMs cannot yet autonomously solve cybersecurity tasks, highlighting limitations in AI-powered code security.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers developed RepGen, an AI-powered tool that automatically reproduces deep learning bugs with an 80.19% success rate, significantly improving upon the current 3% manual reproduction rate. The system uses LLMs to generate reproduction code through an iterative process, reducing debugging time by 56.8% in developer studies.
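RepGen's actual implementation is not shown in the summary; as a purely illustrative sketch, the iterative process it describes (draft a reproduction script, run it, feed any failure output back into the next attempt) might be structured like this, with `ask_llm` as a stand-in for a real model call:

```python
# Illustrative generate-run-refine loop for bug reproduction (not RepGen's
# actual code): draft a candidate script, execute it, and refine the next
# attempt with the previous failure's error trace.
import os
import subprocess
import sys
import tempfile

def ask_llm(bug_report, feedback):
    # Stand-in for a real LLM call; a real system would prompt the model
    # with the bug report plus the previous attempt's stderr.
    return "raise SystemExit(0)"  # stub: a script that exits cleanly

def reproduce(bug_report, max_iters=5):
    feedback = ""
    for _ in range(max_iters):
        code = ask_llm(bug_report, feedback)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, text=True)
        finally:
            os.unlink(path)
        if result.returncode == 0:
            return code              # reproduction script ran successfully
        feedback = result.stderr     # carry the trace into the next attempt
    return None
```

The loop terminates either on a script that executes cleanly or after a fixed iteration budget, mirroring the bounded-retry shape such systems typically use.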
AI · Bullish · OpenAI News · May 16 · 7/10
🧠OpenAI has released Codex, a cloud-based coding agent powered by codex-1, which is an optimized version of OpenAI o3 specifically designed for software engineering tasks. The system was trained using reinforcement learning on real-world coding environments to generate human-like code that follows instructions precisely and iteratively tests until achieving passing results.
AI · Neutral · The Register – AI · 1d ago · 6/10
🧠Salesforce has introduced Headless 360, an AI-powered development platform designed to automate software development tasks and reduce manual coding work. The initiative reflects the broader enterprise software trend of leveraging AI to accelerate development cycles and lower engineering costs.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠A large-scale survey of 457 software engineering researchers reveals that generative AI adoption is widespread in academic research, concentrated primarily in writing and early-stage tasks. While researchers perceive significant productivity gains, persistent concerns about accuracy, bias, and lack of governance frameworks highlight the need for clearer guidelines on responsible AI integration in academic practice.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Doctoral research proposes a systematic framework for multi-agent LLM pair programming that improves code reliability and auditability through externalized intent and iterative validation. The study addresses critical gaps in how AI coding agents can produce trustworthy outputs aligned with developer objectives across testing, implementation, and maintenance workflows.
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠Researchers present the AI Codebase Maturity Model (ACMM), a 5-level framework for systematically evolving codebases from basic AI-assisted coding to self-sustaining systems. Validated through a 4-month case study of KubeStellar Console, the model demonstrates that AI system intelligence depends primarily on surrounding infrastructure—testing, metrics, and feedback loops—rather than the AI model itself.
🏢 Microsoft · 🧠 Claude · 🧠 Copilot
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠A study of 51 industry practitioners reveals that generative AI integration into software development has created a significant gap between university curricula and industry hiring expectations. The research identifies new required skills like prompting and output evaluation, while emphasizing that soft skills and traditional competencies remain critical for modern software engineers.
AI · Bullish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers propose fine-grained confidence calibration methods for large language models in automated code revision tasks, addressing the limitation of traditional global calibration approaches. By applying local Platt-scaling to task-specific confidence scores, the study demonstrates improved calibration accuracy across multiple code repair and refinement tasks, enabling developers to better trust LLM outputs.
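The paper's task-specific ("local") variant is not reproduced here, but plain Platt scaling, i.e. fitting a two-parameter logistic map from raw model confidence scores to empirical correctness, can be sketched as follows (function names and the toy setup are illustrative):

```python
# Minimal Platt-scaling sketch: learn p = sigmoid(a*s + b) mapping raw
# confidence scores s to calibrated correctness probabilities, by gradient
# descent on the logistic log-loss.
import math

def platt_fit(scores, labels, lr=0.5, steps=3000):
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n   # d(log-loss)/da
            gb += (p - y) / n       # d(log-loss)/db
        a -= lr * ga
        b -= lr * gb
    return a, b

def calibrate(a, b, s):
    # Map a raw confidence score to a calibrated probability.
    return 1.0 / (1.0 + math.exp(-(a * s + b)))
```

A "local" scheme in the paper's sense would presumably fit such parameters per task type (e.g. repair vs. refinement) rather than one global pair.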
AI · Neutral · arXiv – CS AI · Apr 6 · 6/10
🧠Researchers introduced GBQA, a new benchmark with 30 games and 124 verified bugs to test whether large language models can autonomously discover software bugs. The best-performing model, Claude-4.6-Opus, identified only 48.39% of the bugs, highlighting the significant challenges in autonomous bug detection.
🧠 Claude
AI · Bearish · arXiv – CS AI · Apr 6 · 6/10
🧠Researchers introduced ChomskyBench, a new benchmark for evaluating large language models' formal reasoning capabilities using the Chomsky Hierarchy framework. The study reveals that while larger models show improvements, current LLMs face severe efficiency barriers and are significantly less efficient than traditional algorithmic programs for formal reasoning tasks.
AI · Bearish · arXiv – CS AI · Mar 17 · 6/10
🧠A research study reveals that software engineers' cognitive engagement consistently declines when working with agentic AI coding assistants, raising concerns about over-reliance and reduced critical thinking. The study found that current AI assistants provide limited support for reflection and verification, identifying design opportunities to promote deeper thinking in AI-assisted programming.
AI · Bullish · arXiv – CS AI · Mar 6 · 6/10
🧠Researchers have developed OPENDEV, an open-source command-line AI coding agent that operates directly in terminal environments where developers manage source control and deployments. The system uses a compound AI architecture with dual-agent design, specialized model routing, and adaptive context management to provide autonomous coding assistance while maintaining safety controls.
AI · Neutral · arXiv – CS AI · Mar 5 · 5/10
🧠Researchers introduce CodeTaste, a benchmark testing whether AI coding agents can perform code refactoring at human-level quality. The study reveals frontier AI models struggle to identify appropriate refactorings when given general improvement areas, but perform better with detailed specifications.
AI · Neutral · arXiv – CS AI · Mar 5 · 5/10
🧠Researchers conducted a large-scale empirical study analyzing 401 open-source repositories to understand how developers use cursor rules: persistent, machine-readable directives that provide context to AI coding assistants. The study identified five key themes of project context that developers consider essential: Conventions, Guidelines, Project Information, LLM Directives, and Examples.
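For illustration only (not drawn from any repository in the study, and all file paths are hypothetical), a cursor-rules file touching each of the five themes might look like:

```text
# Conventions
- Use snake_case for Python modules; format with black.

# Guidelines
- Prefer small, pure functions; add type hints on public APIs.

# Project Information
- This repo is a Flask service; the entry point is app/main.py.

# LLM Directives
- Never modify files under migrations/ without asking first.

# Examples
- Follow the handler pattern in app/handlers/user.py for new endpoints.
```

The point is that such rules persist across sessions, so the assistant receives the same project context on every invocation rather than relying on ad hoc prompting.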
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers have developed a pattern language methodology to systematically identify and modularize crosscutting concerns in agentic AI systems, addressing issues like security, reliability, and cost management that contribute to high AI project failure rates. The approach uses goal models to discover reusable patterns and implements them through aspect-oriented programming in Rust.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers introduce SWE-Hub, a comprehensive system for generating scalable, executable software engineering tasks for training AI agents. The platform addresses current limitations in AI software development by providing unified environment automation, bug synthesis, and diverse task generation across multiple programming languages.
AI · Bearish · arXiv – CS AI · Mar 3 · 7/10
🧠Research reveals that Large Language Models (LLMs) systematically fail at code review tasks, frequently misclassifying correct code as defective when matching implementations to natural language requirements. The study found that more detailed prompts actually increase misjudgment rates, raising concerns about LLM reliability in automated development workflows.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers introduce Theory of Code Space (ToCS), a new benchmark that evaluates AI agents' ability to understand software architecture across multi-file codebases. The study reveals significant performance gaps between frontier LLM agents and rule-based baselines, with F1 scores ranging from 0.129 to 0.646.