AI · Bearish · arXiv – CS AI · 6d ago · 7/10
🧠A comprehensive evaluation of 9 open-source coding LLMs across 2,707 LeetCode problems in 12 programming languages reveals significant performance gaps compared to human developers. The best model achieves only 23.64% correctness versus a 57.2% human baseline, with performance varying substantially across languages and problem types, indicating that aggregate benchmarks mask critical weaknesses in code generation systems.
AI · Bearish · arXiv – CS AI · Jun 2 · 7/10
🧠Researchers have developed a framework to measure and mitigate bias in code generated by large language models like GPT-4o and Gemini, using metrics called Code Bias Score and Attribute Change Ratio. The study finds that bias persists across protected attributes even after applying four mitigation strategies, indicating that more robust solutions are needed for AI-driven code generation systems.
🧠 GPT-4 · 🧠 Gemini
AI · Bearish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers have identified and systematically studied correctness bugs in PyTorch's compiler (torch.compile) that silently produce incorrect outputs without crashing or warning users. A new testing technique called AlignGuard has detected 23 previously unknown bugs, with over 60% classified as high-priority by the PyTorch team, highlighting a critical reliability gap in a core tool for AI infrastructure optimization.
AI · Bearish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers evaluated Cursor, an AI-powered IDE, on its ability to generate large-scale software projects and found it achieves 91% functional correctness but produces significant design issues including code duplication, complexity violations, and framework best-practice breaches that threaten long-term maintainability.
AI · Neutral · arXiv – CS AI · Mar 27 · 7/10
🧠Researchers introduced WebTestBench, a new benchmark for evaluating automated web testing using AI agents and large language models. The study reveals significant gaps between current AI capabilities and industrial deployment needs, with LLMs struggling with test completeness, defect detection, and long-term interaction reliability.
AI · Neutral · arXiv – CS AI · 6d ago · 5/10
🧠Researchers propose a closed-loop AI-enhanced architecture for continuous software quality intelligence that integrates requirement analysis, test prioritization, defect prediction, and production incident feedback. Across six release cycles on a semi-synthetic dataset, the approach cut test execution time by 35%, lowered defect leakage from 0.19 to 0.13, and raised detection effectiveness from 0.72 to 0.84.
AI · Bearish · TechCrunch – AI · May 29 · 6/10
🧠Developers increasingly rely on AI tools to write code faster, but research suggests this productivity gain comes at the cost of code quality. The trend poses long-term risks for software reliability and maintenance, potentially creating technical debt that could undermine the benefits of rapid development.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers present ARMeta, an LLM-based multi-agent tool that automates metamorphic testing for REST APIs by identifying test scenarios and generating executable tests without requiring explicit correct outputs. The approach addresses the test oracle problem in API validation and demonstrates complementary capabilities to traditional scenario-based testing methods.
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce DEFault++, an AI diagnostic system that automatically detects, categorizes, and identifies root causes of faults in transformer neural networks across 45 different failure mechanisms. The tool achieves over 96% accuracy in fault detection and demonstrates practical value in helping developers fix issues correctly 46% more often than without assistance.
AI · Bearish · arXiv – CS AI · Apr 10 · 6/10
🧠A new empirical study reveals that eight major LLMs exhibit systematic biases in code generation, overusing popular libraries like NumPy in 45% of cases and defaulting to Python even when unsuitable, prioritizing familiarity over task-specific optimality. The findings highlight gaps in current LLM evaluation methodologies and underscore the need for targeted improvements in training data diversity and benchmarking standards.
AI · Neutral · arXiv – CS AI · Mar 27 · 6/10
🧠A systematic literature review of 24 studies reveals that AI-generated code quality depends on multiple factors including prompt design, task specification, and developer expertise. The research shows variable outcomes for code correctness, security, and maintainability, indicating that AI-assisted development requires careful human oversight and validation.
AI · Bullish · arXiv – CS AI · Mar 26 · 6/10
🧠Researchers have developed LLMLOOP, a framework that automatically refines LLM-generated code and test cases through five iterative loops addressing compilation errors, static analysis issues, test failures, and quality improvements. Evaluated on the HUMANEVAL-X benchmark, the tool proved effective at improving the quality of AI-generated code.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers introduce OBsmith, an LLM-powered framework that tests JavaScript obfuscators for correctness bugs that can silently alter program functionality. The tool discovered 11 previously unknown bugs that existing JavaScript fuzzers failed to detect, highlighting critical gaps in obfuscation quality assurance.