AI · Neutral · arXiv – CS AI · 5d ago · 5/10
🧠Researchers applied SMOTE-Tomek preprocessing to address class imbalance in requirements engineering classification, achieving 76.16% accuracy with logistic regression compared to a 58.31% baseline. The technique combines synthetic minority oversampling with Tomek link removal and stratified K-fold validation on the PROMISE dataset of 969 categorized requirements.
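As a rough illustration of the resampling step described above, the following standard-library sketch oversamples a minority class by interpolation (the SMOTE idea) and then removes Tomek links, meaning pairs of mutual nearest neighbours that carry different labels. It is a simplified toy, not the paper's pipeline or the imbalanced-learn implementation: the function names, the default `k`, and the choice to drop both members of each link are assumptions made for this example, and the stratified K-fold and logistic regression steps are not shown.

```python
import math
import random


def nearest(idx, points, candidates):
    """Index in `candidates` closest (Euclidean) to points[idx], excluding idx."""
    return min((j for j in candidates if j != idx),
               key=lambda j: math.dist(points[idx], points[j]))


def smote(X, y, minority, n_new, k=2, seed=0):
    """Append n_new synthetic minority points, each interpolated between a
    random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    idx = [i for i, label in enumerate(y) if label == minority]
    synth = []
    for _ in range(n_new):
        i = rng.choice(idx)
        neigh = sorted((j for j in idx if j != i),
                       key=lambda j: math.dist(X[i], X[j]))[:k]
        j = rng.choice(neigh)
        t = rng.random()  # interpolation weight in [0, 1)
        synth.append(tuple(a + t * (b - a) for a, b in zip(X[i], X[j])))
    return X + synth, y + [minority] * n_new


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link (assumption: some variants
    drop only the majority-class member)."""
    all_idx = range(len(X))
    nn = [nearest(i, X, all_idx) for i in all_idx]
    drop = {i for i in all_idx if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in all_idx if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]
```

In a cross-validated setup, resampling of this kind is normally applied to the training folds only, so that synthetic points never leak into the held-out fold.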
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠VISTA is a new benchmark for evaluating how well AI agents can generate functional web applications from visual specifications and text descriptions. The benchmark introduces five different testing conditions with varying levels of design detail and technology stack constraints, using manual annotations and multi-modal evaluation metrics to assess both visual fidelity and functional correctness.
AI · Bullish · OpenAI News · May 22 · 6/10
🧠OpenAI has been recognized as a Leader in Gartner's 2026 Magic Quadrant for Enterprise AI Coding Agents, with its Codex model praised for innovation and enterprise-scale deployment capabilities. This recognition validates OpenAI's position in the rapidly growing enterprise AI development tools market.
🏢 OpenAI
AI · Bullish · OpenAI News · May 20 · 6/10
🧠Ramp engineers leverage Codex with GPT-5.5 to accelerate code review processes, reducing feedback cycles from hours to minutes. This AI-assisted workflow demonstrates how large language models integrate into developer productivity pipelines, enabling faster iteration and shipping cycles in fintech engineering teams.
🧠 GPT-5
AI · Bullish · OpenAI News · May 14 · 6/10
🧠Sea Limited is deploying Codex, an AI development tool, across its engineering teams to accelerate AI-native software development in Asia. The company's Chief Product Officer explains the strategic rationale behind this move, signaling enterprise adoption of agentic AI tools in the region's tech sector.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠BoostAPR is a new AI framework that improves automated program repair by using dual reward models and reinforcement learning to identify which code edits actually fix bugs. The system achieves significant improvements on multiple benchmarks, including 40.7% on SWE-bench Verified, demonstrating that more granular feedback mechanisms can substantially enhance AI's ability to repair software vulnerabilities.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers analyzed how autonomous AI agents discuss software engineering when interacting primarily with each other on MoltBook, an AI-only social network, revealing that AI discourse emphasizes security and trust (27.4%) while lacking the concrete runtime details, code artifacts, and environmental specifics common in human developer discussions on GitHub.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠This research paper addresses the emerging challenge of designing safe AI agents for CI/CD pipelines by introducing a framework distinguishing between data-plane authority (localized interventions) and control-plane authority (configuration changes). The authors argue that current systems prioritize bounded autonomy with external governance rather than intrinsic safety guarantees, identifying control-plane safety and formalization of autonomy boundaries as critical research gaps.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers conducted a controlled empirical study evaluating three LLMs (Claude Haiku, DeepSeek-Chat, Gemini 2.5 Flash) for qualitative coding of psychological safety in software engineering communities. Multi-shot prompting improved Claude Haiku's performance but not the others, while all models exhibited systematic biases in coding predictions, providing evidence-based guidelines for LLM-assisted qualitative research.
🧠 Claude 🧠 Gemini
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠This research roadmap examines the evolving relationship between search-based software engineering (SBSE) and AI foundation models like large language models, after 25 years of SBSE development. The paper identifies three core integration pathways: using FMs to enhance SBSE techniques, applying SBSE methods to improve FM development, and exploring synergies between both approaches for future software engineering challenges.
AI · Bullish · OpenAI News · May 7 · 6/10
🧠Simplex has integrated ChatGPT Enterprise and Codex to accelerate software development workflows, reducing time spent on design, build, and testing phases. The move reflects growing adoption of AI-driven development tools to improve productivity and scale engineering operations.
🧠 ChatGPT
AI · Bearish · arXiv – CS AI · May 1 · 6/10
🧠A comprehensive study comparing 12 large language models against 4 classical classifiers for automating evidence screening in software engineering systematic literature reviews reveals that LLMs exhibit significant performance variability and lack consistent superiority over traditional methods. The research emphasizes that abstract availability is critical for LLM performance, while title and keywords provide minimal additional value, suggesting LLM adoption should be driven by operational constraints rather than performance guarantees.
🏢 OpenAI 🏢 Anthropic 🧠 Gemini
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce SpecDetect4ML, a specification-driven tool that detects code smells in machine learning pipelines using Code Property Graphs. The tool identifies 22 types of recurring implementation patterns that compromise reproducibility, robustness, and maintainability, achieving 95.82% precision and 88.14% recall—significantly outperforming existing static analysis tools.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠A research paper proposes that AI-driven software engineering doesn't threaten the field but rather expands its scope to include 'semi-executable' artifacts—combinations of natural language, tools, and workflows requiring human or probabilistic interpretation. The Semi-Executable Stack model provides a diagnostic framework across six layers to understand how software engineering practices evolve as AI agents handle routine tasks.
AI · Neutral · The Register – AI · Apr 15 · 6/10
🧠Salesforce has introduced Headless 360, an AI-powered development platform designed to automate software development tasks and reduce manual coding work. The initiative reflects the broader enterprise software trend of leveraging AI to accelerate development cycles and lower engineering costs.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Doctoral research proposes a systematic framework for multi-agent LLM pair programming that improves code reliability and auditability through externalized intent and iterative validation. The study addresses critical gaps in how AI coding agents can produce trustworthy outputs aligned with developer objectives across testing, implementation, and maintenance workflows.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠A large-scale survey of 457 software engineering researchers reveals that generative AI adoption is widespread in academic research, concentrated primarily in writing and early-stage tasks. While researchers perceive significant productivity gains, persistent concerns about accuracy, bias, and lack of governance frameworks highlight the need for clearer guidelines on responsible AI integration in academic practice.
AI · Bullish · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers present the AI Codebase Maturity Model (ACMM), a 5-level framework for systematically evolving codebases from basic AI-assisted coding to self-sustaining systems. Validated through a 4-month case study of KubeStellar Console, the model demonstrates that AI system intelligence depends primarily on surrounding infrastructure—testing, metrics, and feedback loops—rather than the AI model itself.
🏢 Microsoft 🧠 Claude 🧠 Copilot
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠A study of 51 industry practitioners reveals that generative AI integration into software development has created a significant gap between university curricula and industry hiring expectations. The research identifies new required skills like prompting and output evaluation, while emphasizing that soft skills and traditional competencies remain critical for modern software engineers.
AI · Bullish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers propose fine-grained confidence calibration methods for large language models in automated code revision tasks, addressing the limitation of traditional global calibration approaches. By applying local Platt-scaling to task-specific confidence scores, the study demonstrates improved calibration accuracy across multiple code repair and refinement tasks, enabling developers to better trust LLM outputs.
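To make the calibration idea above concrete, here is a minimal sketch of one plausible reading of "local" Platt scaling: a separate sigmoid p = 1 / (1 + exp(-(a*s + b))) is fitted per task group, mapping a raw confidence score s to the observed rate of correct outputs. The grouping key, function names, learning rate, and step count are all assumptions for illustration; the paper's actual procedure may differ.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a*s + b) to binary labels by plain gradient
    descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, t in zip(scores, labels):
            err = sigmoid(a * s + b) - t  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def fit_local(records):
    """records: (group, score, correct) triples, with correct in {0, 1}.
    Returns one (a, b) pair per group (hypothetical 'local' calibrators)."""
    by_group = defaultdict(lambda: ([], []))
    for group, score, correct in records:
        by_group[group][0].append(score)
        by_group[group][1].append(correct)
    return {g: fit_platt(s, t) for g, (s, t) in by_group.items()}


def calibrate(params, group, score):
    """Map a raw confidence score to a calibrated probability for its group."""
    a, b = params[group]
    return sigmoid(a * score + b)
```

With a group whose outputs are right only a quarter of the time despite a raw confidence of 0.9, the fitted calibrator for that group pulls 0.9 down toward 0.25, while other groups keep their own mapping. A single global sigmoid would average the groups together.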
AI · Neutral · arXiv – CS AI · Apr 6 · 6/10
🧠Researchers introduced GBQA, a new benchmark with 30 games and 124 verified bugs to test whether large language models can autonomously discover software bugs. The best-performing model, Claude-4.6-Opus, identified only 48.39% of the bugs, highlighting how difficult autonomous bug detection remains.
🧠 Claude
AI · Bearish · arXiv – CS AI · Apr 6 · 6/10
🧠Researchers introduced ChomskyBench, a new benchmark for evaluating large language models' formal reasoning capabilities using the Chomsky Hierarchy framework. The study reveals that while larger models show improvements, current LLMs face severe efficiency barriers and are significantly less efficient than traditional algorithmic programs for formal reasoning tasks.
AI · Bearish · arXiv – CS AI · Mar 17 · 6/10
🧠A research study reveals that software engineers' cognitive engagement consistently declines when working with agentic AI coding assistants, raising concerns about over-reliance and reduced critical thinking. The study found that current AI assistants provide limited support for reflection and verification, identifying design opportunities to promote deeper thinking in AI-assisted programming.
AI · Bullish · arXiv – CS AI · Mar 6 · 6/10
🧠Researchers have developed OPENDEV, an open-source command-line AI coding agent that operates directly in terminal environments where developers manage source control and deployments. The system uses a compound AI architecture with dual-agent design, specialized model routing, and adaptive context management to provide autonomous coding assistance while maintaining safety controls.
AI · Neutral · arXiv – CS AI · Mar 5 · 5/10
🧠Researchers introduce CodeTaste, a benchmark testing whether AI coding agents can perform code refactoring at human-level quality. The study reveals frontier AI models struggle to identify appropriate refactorings when given general improvement areas, but perform better with detailed specifications.