#software-engineering News & Analysis

66 articles tagged with #software-engineering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 5d ago · 5/10

Improving Requirements Classification with SMOTE-Tomek Preprocessing

Researchers applied SMOTE-Tomek preprocessing to address class imbalance in requirements engineering classification, achieving 76.16% accuracy with logistic regression compared to a 58.31% baseline. The technique combines synthetic minority oversampling with Tomek link removal and stratified K-fold validation on the PROMISE dataset of 969 categorized requirements.
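The two resampling steps named in this summary can be illustrated with a minimal pure-Python sketch. It is not the authors' pipeline: the function names `smote`, `nearest`, and `remove_tomek_links`, the 2-D toy points, and the choice to drop both members of each Tomek link are all assumptions made for illustration, and the stratified K-fold step is left out.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Interpolate n_new synthetic points between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


def nearest(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def remove_tomek_links(points, labels):
    """Drop both members of each Tomek link: a pair of mutual nearest
    neighbours that carry different labels."""
    drop = set()
    for i in range(len(points)):
        j = nearest(i, points)
        if labels[i] != labels[j] and nearest(j, points) == i:
            drop.update((i, j))
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Hypothetical 2-D toy data: label 0 is the majority class, 1 the minority.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.0, 0.3), (0.3, 0.0)]
minority = [(1.0, 1.0), (1.2, 1.0), (1.0, 1.2)]

# Oversample the minority class up to the majority size, then clean the boundary.
synthetic = smote(minority, n_new=len(majority) - len(minority))
points = majority + minority + synthetic
labels = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
points, labels = remove_tomek_links(points, labels)
print(len(points), labels.count(0), labels.count(1))
```

Oversampling runs first so that the cleaning step can also remove synthetic points that land next to the other class.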

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

VISTA is a new benchmark for evaluating how well AI agents can generate functional web applications from visual specifications and text descriptions. The benchmark introduces five different testing conditions with varying levels of design detail and technology stack constraints, using manual annotations and multi-modal evaluation metrics to assess both visual fidelity and functional correctness.

AI · Bullish · OpenAI News · May 22 · 6/10

OpenAI named a Leader in enterprise coding agents by Gartner

OpenAI has been recognized as a Leader in Gartner's 2026 Magic Quadrant for Enterprise AI Coding Agents, with its Codex model praised for innovation and enterprise-scale deployment capabilities. This recognition validates OpenAI's position in the rapidly growing enterprise AI development tools market.

🏢 OpenAI

AI · Bullish · OpenAI News · May 20 · 6/10

How Ramp engineers accelerate code review with Codex

Ramp engineers leverage Codex with GPT-5.5 to accelerate code review processes, reducing feedback cycles from hours to minutes. This AI-assisted workflow demonstrates how large language models integrate into developer productivity pipelines, enabling faster iteration and shipping cycles in fintech engineering teams.

🧠 GPT-5

AI · Bullish · OpenAI News · May 14 · 6/10

Sea's View on the Future of Agentic Software Development with Codex

Sea Limited is deploying Codex, an AI development tool, across its engineering teams to accelerate AI-native software development in Asia. The company's Chief Product Officer explains the strategic rationale behind this move, signaling enterprise adoption of agentic AI tools in the region's tech sector.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

BoostAPR is a new AI framework that improves automated program repair by using dual reward models and reinforcement learning to identify which code edits actually fix bugs. The system achieves significant improvements on multiple benchmarks, including 40.7% on SWE-bench Verified, demonstrating that more granular feedback mechanisms can substantially enhance AI's ability to repair software bugs.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook

Researchers analyzed how autonomous AI agents discuss software engineering when interacting primarily with each other on MoltBook, an AI-only social network, revealing that AI discourse emphasizes security and trust (27.4%) while lacking the concrete runtime details, code artifacts, and environmental specifics common in human developer discussions on GitHub.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines

This research paper addresses the emerging challenge of designing safe AI agents for CI/CD pipelines by introducing a framework distinguishing between data-plane authority (localized interventions) and control-plane authority (configuration changes). The authors argue that current systems prioritize bounded autonomy with external governance rather than intrinsic safety guarantees, identifying control-plane safety and formalization of autonomy boundaries as critical research gaps.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study

Researchers conducted a controlled empirical study evaluating three LLMs (Claude Haiku, DeepSeek-Chat, Gemini 2.5 Flash) for qualitative coding of psychological safety in software engineering communities. Multi-shot prompting improved Claude Haiku's performance but not that of the other two models, and all three exhibited systematic biases in their coding predictions. The findings yield evidence-based guidelines for LLM-assisted qualitative research.

🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · May 7 · 6/10

Search-Based Software Engineering and AI Foundation Models: Current Landscape and Future Roadmap

This research roadmap examines the evolving relationship between search-based software engineering (SBSE) and AI foundation models like large language models, after 25 years of SBSE development. The paper identifies three core integration pathways: using FMs to enhance SBSE techniques, applying SBSE methods to improve FM development, and exploring synergies between both approaches for future software engineering challenges.

AI · Bullish · OpenAI News · May 7 · 6/10

Simplex rethinks software development with Codex

Simplex has integrated ChatGPT Enterprise and Codex to accelerate software development workflows, reducing time spent on design, build, and testing phases. The move reflects growing adoption of AI-driven development tools to improve productivity and scale engineering operations.

🧠 ChatGPT

AI · Bearish · arXiv – CS AI · May 1 · 6/10

Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

A comprehensive study comparing 12 large language models against 4 classical classifiers for automating evidence screening in software engineering systematic literature reviews reveals that LLMs exhibit significant performance variability and lack consistent superiority over traditional methods. The research emphasizes that abstract availability is critical for LLM performance, while title and keywords provide minimal additional value, suggesting LLM adoption should be driven by operational constraints rather than performance guarantees.

🏢 OpenAI · 🏢 Anthropic · 🧠 Gemini

AI · Neutral · arXiv – CS AI · May 1 · 6/10

ML Code Smells: From Specification to Detection

Researchers introduce SpecDetect4ML, a specification-driven tool that detects code smells in machine learning pipelines using Code Property Graphs. The tool identifies 22 types of recurring implementation patterns that compromise reproducibility, robustness, and maintainability, achieving 95.82% precision and 88.14% recall—significantly outperforming existing static analysis tools.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

The Semi-Executable Stack: Agentic Software Engineering and the Expanding Scope of SE

A research paper proposes that AI-driven software engineering doesn't threaten the field but rather expands its scope to include 'semi-executable' artifacts—combinations of natural language, tools, and workflows requiring human or probabilistic interpretation. The Semi-Executable Stack model provides a diagnostic framework across six layers to understand how software engineering practices evolve as AI agents handle routine tasks.

AI · Neutral · The Register – AI · Apr 15 · 6/10

Headless 360: Salesforce's latest pitch to let AI do the dev work

Salesforce has introduced Headless 360, an AI-powered development platform designed to automate software development tasks and reduce manual coding work. The initiative reflects the broader enterprise software trend of leveraging AI to accelerate development cycles and lower engineering costs.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

From Helpful to Trustworthy: LLM Agents for Pair Programming

Doctoral research proposes a systematic framework for multi-agent LLM pair programming that improves code reliability and auditability through externalized intent and iterative validation. The study addresses critical gaps in how AI coding agents can produce trustworthy outputs aligned with developer objectives across testing, implementation, and maintenance workflows.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape

A large-scale survey of 457 software engineering researchers reveals that generative AI adoption is widespread in academic research, concentrated primarily in writing and early-stage tasks. While researchers perceive significant productivity gains, persistent concerns about accuracy, bias, and lack of governance frameworks highlight the need for clearer guidelines on responsible AI integration in academic practice.

AI · Bullish · arXiv – CS AI · Apr 13 · 6/10

The AI Codebase Maturity Model: From Assisted Coding to Self-Sustaining Systems

Researchers present the AI Codebase Maturity Model (ACMM), a 5-level framework for systematically evolving codebases from basic AI-assisted coding to self-sustaining systems. Validated through a 4-month case study of KubeStellar Console, the model demonstrates that AI system intelligence depends primarily on surrounding infrastructure—testing, metrics, and feedback loops—rather than the AI model itself.

🏢 Microsoft · 🧠 Claude · 🧠 Copilot

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

"Don't Be Afraid, Just Learn": Insights from Industry Practitioners to Prepare Software Engineers in the Age of Generative AI

A study of 51 industry practitioners reveals that generative AI integration into software development has created a significant gap between university curricula and industry hiring expectations. The research identifies new required skills like prompting and output evaluation, while emphasizing that soft skills and traditional competencies remain critical for modern software engineers.

AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision

Researchers propose fine-grained confidence calibration methods for large language models in automated code revision tasks, addressing the limitation of traditional global calibration approaches. By applying local Platt-scaling to task-specific confidence scores, the study demonstrates improved calibration accuracy across multiple code repair and refinement tasks, enabling developers to better trust LLM outputs.
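As a rough illustration of what "local" Platt scaling could mean here, the sketch below fits one sigmoid per task rather than a single global one. It is an assumption-laden toy, not the paper's method: the task names `repair` and `refine`, the records, and the helpers `fit_platt`, `fit_local`, and `calibrate` are hypothetical, and the fit uses plain gradient descent on log loss rather than the optimizer and target smoothing of Platt's original formulation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, outcomes, lr=1.0, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, outcomes):
            err = sigmoid(a * s + b) - y  # gradient of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def fit_local(records):
    """Fit one (a, b) pair per task instead of a single global pair."""
    grouped = {}
    for task, score, correct in records:
        scores, outcomes = grouped.setdefault(task, ([], []))
        scores.append(score)
        outcomes.append(correct)
    return {task: fit_platt(s, y) for task, (s, y) in grouped.items()}


def calibrate(params, task, score):
    """Map a raw confidence score to a calibrated probability for one task."""
    a, b = params[task]
    return sigmoid(a * score + b)


# Hypothetical records: (task, raw confidence, 1 if the revision was correct).
records = (
    [("repair", 0.9, y) for y in (1, 1, 1, 0)]
    + [("repair", 0.6, y) for y in (1, 0, 0, 0)]
    + [("refine", 0.8, y) for y in (1, 1, 1, 0)]
    + [("refine", 0.4, y) for y in (1, 1, 0, 0)]
)
params = fit_local(records)
print(calibrate(params, "repair", 0.6), calibrate(params, "refine", 0.4))
```

In this toy data a raw score of 0.6 on the hypothetical repair task maps to roughly 0.25, because only one of four such revisions was correct. A single global fit would have to compromise between the two tasks' differing score-to-accuracy relationships.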

AI · Neutral · arXiv – CS AI · Apr 6 · 6/10

GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Researchers introduced GBQA, a new benchmark with 30 games and 124 verified bugs to test whether large language models can autonomously discover software bugs. The best-performing model, Claude-4.6-Opus, identified only 48.39% of the bugs, highlighting how difficult autonomous bug detection remains.

🧠 Claude

AI · Bearish · arXiv – CS AI · Apr 6 · 6/10

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

Researchers introduced ChomskyBench, a new benchmark for evaluating large language models' formal reasoning capabilities using the Chomsky Hierarchy framework. The study reveals that while larger models show improvements, current LLMs face severe efficiency barriers and are significantly less efficient than traditional algorithmic programs for formal reasoning tasks.

AI · Bearish · arXiv – CS AI · Mar 17 · 6/10

I'm Not Reading All of That: Understanding Software Engineers' Level of Cognitive Engagement with Agentic Coding Assistants

A research study reveals that software engineers' cognitive engagement consistently declines when working with agentic AI coding assistants, raising concerns about over-reliance and reduced critical thinking. The study found that current AI assistants provide limited support for reflection and verification, identifying design opportunities to promote deeper thinking in AI-assisted programming.

AI · Bullish · arXiv – CS AI · Mar 6 · 6/10

Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned

Researchers have developed OPENDEV, an open-source command-line AI coding agent that operates directly in terminal environments where developers manage source control and deployments. The system uses a compound AI architecture with dual-agent design, specialized model routing, and adaptive context management to provide autonomous coding assistance while maintaining safety controls.

AI · Neutral · arXiv – CS AI · Mar 5 · 5/10

CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

Researchers introduce CodeTaste, a benchmark testing whether AI coding agents can perform code refactoring at human-level quality. The study reveals frontier AI models struggle to identify appropriate refactorings when given general improvement areas, but perform better with detailed specifications.
