#software-testing News & Analysis

10 articles tagged with #software-testing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6

Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy

Researchers introduce VALTEST, a framework that uses semantic entropy to automatically validate test cases generated by Large Language Models, addressing the problem of invalid or hallucinated tests that mislead AI programming agents. The system improves test validity by up to 29% and enhances code generation performance through better filtering of LLM-generated test cases.
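The summary does not say how VALTEST turns semantic entropy into a validity signal, so the following is only a minimal sketch of the general idea: sample several outputs for the same item, group them by meaning, and take the Shannon entropy of the group sizes. The function name `semantic_entropy`, the `canonicalize` callback, and the sample data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def semantic_entropy(samples, canonicalize):
    """Shannon entropy (in bits) over meaning-clusters of sampled outputs.

    `canonicalize` maps each raw sample to a key such that samples with the
    same meaning share a key. This is an assumed stand-in for whatever
    equivalence check a real system would use.
    """
    clusters = Counter(canonicalize(s) for s in samples)
    total = sum(clusters.values())
    return sum((n / total) * math.log2(total / n) for n in clusters.values())

# Hypothetical data: five sampled "expected values" for one generated assertion.
consistent = ["4", "4", " 4", "4.0", "4"]   # all mean the same number
scattered = ["4", "5", "3", "4", "7"]       # the model disagrees with itself

as_number = lambda s: float(s.strip())

low = semantic_entropy(consistent, as_number)
high = semantic_entropy(scattered, as_number)
print(low < high)
```

Under this reading, a test whose sampled expectations collapse into one cluster has zero entropy, while a test whose expectations scatter has higher entropy and would be a candidate for filtering.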

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Governance Controls for AI-Generated Test Artifacts in Autonomous Software Testing

Researchers introduce the Governance-Aware Autonomous Testing Framework (GATF), which adds governance validation, compliance monitoring, and explainability controls to AI-powered software testing systems. The framework achieved an 89.6% reduction in governance-related risks and reported high accuracy across multiple performance metrics, addressing concerns about AI-generated test artifacts such as hallucinations and security vulnerabilities.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

Researchers introduce PBT-Bench, a benchmark testing AI agents' ability to derive semantic invariants from documentation and construct property-based testing strategies across 100 problems in Python libraries. Results show current LLMs achieve 42-83% bug recall with structured prompting, revealing significant performance gaps where different models fail on different problems.
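For readers unfamiliar with property-based testing, here is a minimal sketch of the technique the benchmark targets: state an invariant that must hold for every input, then check it against many randomly generated inputs. It is hand-rolled with the standard library rather than a dedicated library such as Hypothesis, and every name in it (`check_property`, `gen_int_list`, `sort_invariants`, `buggy_sort`) is hypothetical and unrelated to the benchmark's own problems.

```python
import random
from collections import Counter

def check_property(prop, gen, trials=200, seed=0):
    """Run `prop` on `trials` random inputs; return a counterexample or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        case = gen(rng)
        if not prop(case):
            return case
    return None

def gen_int_list(rng):
    """Random list of 0 to 20 small integers (duplicates are likely)."""
    return [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]

def sort_invariants(sort_fn):
    """Build a property: output is ordered and is a permutation of the input."""
    def prop(xs):
        out = sort_fn(list(xs))
        ordered = all(a <= b for a, b in zip(out, out[1:]))
        same_items = Counter(out) == Counter(xs)
        return ordered and same_items
    return prop

def buggy_sort(xs):
    """Deliberately wrong: silently drops duplicate elements."""
    return sorted(set(xs))

print(check_property(sort_invariants(sorted), gen_int_list))            # None
print(check_property(sort_invariants(buggy_sort), gen_int_list) is not None)
```

The "permutation of the input" invariant is what catches the bug here; an ordering check alone would pass `buggy_sort`, which illustrates why deriving the right invariants is the hard part of the task.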

AI · Neutral · arXiv – CS AI · May 29 · 6/10

GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing

Researchers introduce GUITestScape, a new benchmark for evaluating AI agents' ability to autonomously test Android applications, along with GUIJudge, an evaluator that assesses both interaction and display defects beyond predefined annotations. The work addresses critical gaps in current GUI testing evaluation by enabling process-aware assessment of agent capabilities rather than just final outcomes.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Multi-modal user interface control detection using cross-attention

Researchers have developed an enhanced version of YOLOv5 that combines visual and textual data through cross-attention mechanisms to improve UI control detection in software screenshots. Tested on over 16,000 annotated images across 23 control classes, the multi-modal approach significantly outperforms pixel-only detection, with convolutional fusion showing the strongest results for semantically complex elements.

AI · Neutral · arXiv – CS AI · Apr 6 · 6/10

GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Researchers introduced GBQA, a new benchmark with 30 games and 124 verified bugs that tests whether large language models can autonomously discover software bugs. The best-performing model, Claude-4.6-Opus, identified only 48.39% of the bugs, showing how difficult autonomous bug detection remains.

🧠 Claude

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3

LSPRAG: LSP-Guided RAG for Language-Agnostic Real-Time Unit Test Generation

Researchers developed LSPRAG, a new framework that uses Language Server Protocol backends to help Large Language Models generate unit tests across multiple programming languages in real time. The system improved test coverage by up to 213% for Java, 174% for Go, and 31% for Python compared to existing methods.

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 19

Once4All: Skeleton-Guided SMT Solver Fuzzing with LLM-Synthesized Generators

Researchers developed Once4All, an LLM-assisted fuzzing framework for testing SMT solvers that addresses syntax validity issues and computational overhead. The system found 43 confirmed bugs in leading solvers Z3 and cvc5, with 40 already fixed by developers.

AI · Neutral · arXiv – CS AI · Mar 27 · 5/10

From Untestable to Testable: Metamorphic Testing in the Age of LLMs

A research paper presents metamorphic testing as a way to test AI and LLM-integrated software systems. The approach addresses unreliable LLM outputs and scarce labeled ground truth by using relationships between multiple test executions as test oracles.
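To make "relationships between multiple test executions as test oracles" concrete, here is a minimal sketch of a metamorphic check. No expected label is needed: the test only asserts that a meaning-preserving change to the input leaves the output unchanged. The classifiers below are tiny hypothetical stand-ins for an LLM-backed component, and the names `robust_classify`, `brittle_classify`, `add_noise`, and `violations` are illustrative assumptions, not anything from the paper.

```python
def robust_classify(text):
    """Hypothetical stand-in for a classifier that normalizes its input."""
    return "positive" if "good" in text.lower().split() else "negative"

def brittle_classify(text):
    """Hypothetical stand-in that is sensitive to letter case."""
    return "positive" if "good" in text else "negative"

def add_noise(text):
    """Meaning-preserving transformation: change case and pad with spaces."""
    return "  " + text.upper() + "  "

def violations(system, transform, inputs):
    """Inputs for which the metamorphic relation
    system(x) == system(transform(x)) fails to hold."""
    return [x for x in inputs if system(x) != system(transform(x))]

inputs = ["good day", "bad day"]
print(violations(robust_classify, add_noise, inputs))   # []
print(violations(brittle_classify, add_noise, inputs))  # ['good day']
```

Neither run says which label is correct; the relation only exposes inconsistency between two executions, which is why the approach works without labeled ground truth.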

AI · Neutral · arXiv – CS AI · Mar 3 · 3/10 · 4

Test Case Prioritization: A Snowballing Literature Review and TCPFramework with Approach Combinators

Researchers conducted a snowballing literature review of test case prioritization (TCP) techniques and developed a new framework, TCPFramework, with ensemble methods called approach combinators. The study analyzed 324 TCP-related studies and proposed new evaluation metrics; its methods reduced regression testing time by up to 2.7% while performing comparably to state-of-the-art algorithms.