AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠Researchers introduce VALTEST, a framework that uses semantic entropy to automatically validate test cases generated by Large Language Models, addressing the problem of invalid or hallucinated tests that mislead AI programming agents. The system improves test validity by up to 29% and enhances code generation performance through better filtering of LLM-generated test cases.
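The idea of filtering generated tests by entropy can be sketched roughly as follows. This is a minimal illustration, not VALTEST's actual pipeline: it assumes that the expected output of each candidate test has been sampled several times from a model, treats answers that normalize to the same string as semantically equivalent, and drops tests whose answers disagree too much. The names `semantic_entropy` and `filter_tests`, the normalization rule, and the 0.5-bit threshold are all hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(answers, normalize=lambda s: s.strip().lower()):
    """Shannon entropy (in bits) over clusters of equivalent answers.

    Assumption: two answers are "semantically equivalent" when they are
    equal after `normalize`; a real system would use a stronger check.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_tests(candidates, threshold=0.5):
    """Keep test names whose sampled expected outputs mostly agree.

    `candidates` maps a test name to a list of sampled expected outputs.
    The threshold is an illustrative value, not a tuned one.
    """
    return [
        name
        for name, samples in candidates.items()
        if semantic_entropy(samples) <= threshold
    ]


if __name__ == "__main__":
    candidates = {
        "test_add": ["4", "4 ", " 4"],        # consistent: entropy 0
        "test_parse": ["[1, 2]", "[1,2,3]"],  # inconsistent: entropy 1 bit
    }
    print(filter_tests(candidates))
```

A low entropy only indicates that the sampled answers agree with each other; it does not prove the test is correct, so a filter like this would complement execution-based checks, not replace them.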
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce GUITestScape, a new benchmark for evaluating AI agents' ability to autonomously test Android applications, along with GUIJudge, an evaluator that assesses both interaction and display defects beyond predefined annotations. The work addresses critical gaps in current GUI testing evaluation by enabling process-aware assessment of agent capabilities rather than just final outcomes.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers have developed an enhanced version of YOLOv5 that combines visual and textual data through cross-attention mechanisms to improve UI control detection in software screenshots. Tested on over 16,000 annotated images across 23 control classes, the multi-modal approach significantly outperforms pixel-only detection, with convolutional fusion showing the strongest results for semantically complex elements.
AI · Neutral · arXiv – CS AI · Apr 6 · 6/10
🧠Researchers introduced GBQA, a new benchmark with 30 games and 124 verified bugs that tests whether large language models can autonomously discover software bugs. The best-performing model, Claude-4.6-Opus, identified only 48.39% of the bugs, highlighting how difficult autonomous bug detection remains.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3
🧠Researchers developed LSPRAG, a new framework that uses Language Server Protocol backends to help Large Language Models generate unit tests across multiple programming languages in real time. The system achieved significant improvements in test coverage, with increases of up to 213% for Java, 174% for Go, and 31% for Python compared to existing methods.
AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 19
🧠Researchers developed Once4All, an LLM-assisted fuzzing framework for testing SMT solvers that addresses syntax validity issues and computational overhead. The system found 43 confirmed bugs in leading solvers Z3 and cvc5, with 40 already fixed by developers.
AI · Neutral · arXiv – CS AI · Mar 27 · 5/10
🧠A research paper introduces metamorphic testing as a solution for testing AI and LLM-integrated software systems. The approach addresses the challenge of unreliable LLM outputs and limited labeled ground truth by using relationships between multiple test executions as test oracles.
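As a minimal sketch of how a relationship between executions can act as a test oracle, the example below checks an implementation of sine without any labeled expected values. It relies only on the identity sin(x) = sin(π − x), so each check compares two runs of the same function against each other. The function names, the tolerance, and the input range are illustrative assumptions; the paper's own relations for LLM-integrated systems are not reproduced here.

```python
import math
import random


def check_sine_relation(f, x, tol=1e-9):
    """Metamorphic relation: f(x) should equal f(pi - x) for a sine function."""
    return abs(f(x) - f(math.pi - x)) <= tol


def run_metamorphic_suite(f, trials=100, seed=0):
    """Return the random inputs on which the relation is violated."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    failures = []
    for _ in range(trials):
        x = rng.uniform(-10.0, 10.0)
        if not check_sine_relation(f, x):
            failures.append(x)
    return failures


if __name__ == "__main__":
    # A correct implementation satisfies the relation on every input.
    print(len(run_metamorphic_suite(math.sin)))
    # A truncated Taylor series violates it, with no ground truth needed.
    print(len(run_metamorphic_suite(lambda x: x - x**3 / 6)))
```

The same pattern carries over to LLM-backed components in principle: choose an input transformation whose effect on the output is known (for example, a paraphrase that should not change a classification) and compare the two outputs instead of checking either one against a label.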
AI · Neutral · arXiv – CS AI · Mar 3 · 3/10 · 4
🧠Researchers conducted a comprehensive literature review of test case prioritization (TCP) techniques and developed a new framework of ensemble methods called approach combinators. The study analyzed 324 TCP-related studies and proposed new evaluation metrics; the proposed methods achieved up to a 2.7% reduction in regression testing time while performing comparably to state-of-the-art algorithms.