y0news

#automated-testing News & Analysis

13 articles tagged with #automated-testing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study

Researchers demonstrated AI-assisted automated unit test generation and code refactoring in a case study, generating nearly 16,000 lines of reliable unit tests in hours instead of weeks. The approach achieved up to 78% branch coverage in critical modules and significantly reduced regression risk during large-scale refactoring of legacy codebases.
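The paper's generated tests are not reproduced in this summary. As a toy illustration of what "branch coverage" measures, here is a hypothetical function with a single branch point and one generated-style test per branch; all names below are invented for illustration, not taken from the study.

```python
# Hypothetical example: one test per branch is what pushes branch
# coverage toward 100% for this function.

def apply_discount(price: float, is_member: bool) -> float:
    """Toy function with one branch point."""
    if is_member:
        return round(price * 0.9, 2)  # members get 10% off
    return price

# Generated-style tests: together they exercise both branches.
def test_member_discount():
    assert apply_discount(100.0, is_member=True) == 90.0

def test_non_member_pays_full_price():
    assert apply_discount(100.0, is_member=False) == 100.0

if __name__ == "__main__":
    test_member_discount()
    test_non_member_pays_full_price()
    print("both branches covered")
```

In practice a tool such as coverage.py would report the branch percentage; the paper's 78% figure refers to its own case-study modules.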

AI · Neutral · arXiv – CS AI · Apr 6 · 7/10

AgenticRed: Evolving Agentic Systems for Red-Teaming

AgenticRed introduces an automated red-teaming system that uses evolutionary algorithms and LLMs to autonomously design attack methods without human intervention. The system achieved near-perfect attack success rates across multiple AI models, including 100% success on GPT-5.1, DeepSeek-R1 and DeepSeek V3.2.

🧠 GPT-5 · 🧠 Llama
AI · Bearish · arXiv – CS AI · Mar 12 · 7/10

Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

Researchers developed a new framework for evaluating AI security risks in banking and financial services, introducing the Risk-Adjusted Harm Score (RAHS) to measure the severity of AI model failures. The study found that AI models become more vulnerable to security exploits during extended interactions, exposing critical weaknesses in current AI safety assessments for financial institutions.

AI · Neutral · arXiv – CS AI · Mar 9 · 7/10

AdAEM: An Adaptively and Automated Extensible Measurement of LLMs' Value Difference

Researchers introduce AdAEM, a new evaluation algorithm that automatically generates test questions to assess value differences and biases across Large Language Models. Unlike static benchmarks, AdAEM adaptively constructs controversial topics that more sharply distinguish LLMs' underlying values and cultural alignment.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Discovering Failure Modes in Vision-Language Models using RL

Researchers developed an AI framework using reinforcement learning to automatically discover failure modes in vision-language models without human intervention. The system trains a questioner agent that generates adaptive queries to expose weaknesses, successfully identifying 36 novel failure modes across various VLM combinations.
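The paper's questioner agent is not specified in this summary. One way to picture the loop is an epsilon-greedy bandit that earns reward whenever the model under test answers wrongly, so probing concentrates on weak skills over time. Everything below (the query templates, the stub "VLM" and its accuracy rates) is an invented stand-in, not the authors' setup.

```python
import random

# Bandit-style sketch of adversarial probing: reward = 1 when the
# stub model fails, so the questioner learns where failures cluster.

TEMPLATES = ["count objects", "read small text", "name colors"]

def stub_vlm_correct(template: str) -> bool:
    """Stand-in VLM: deliberately bad at reading small text."""
    rates = {"count objects": 0.9, "read small text": 0.2, "name colors": 0.8}
    return random.random() < rates[template]

def probe(steps=2000, eps=0.1, seed=0):
    random.seed(seed)
    reward_sum = {t: 0.0 for t in TEMPLATES}
    pulls = {t: 1 for t in TEMPLATES}  # start at 1 to avoid div-by-zero
    for _ in range(steps):
        if random.random() < eps:
            t = random.choice(TEMPLATES)  # explore
        else:  # exploit: template with highest observed failure rate
            t = max(TEMPLATES, key=lambda k: reward_sum[k] / pulls[k])
        pulls[t] += 1
        reward_sum[t] += 0.0 if stub_vlm_correct(t) else 1.0
    return max(TEMPLATES, key=lambda k: reward_sum[k] / pulls[k])

print(probe())  # converges on the stub's weakest skill
```

The paper's RL approach is far richer (adaptive natural-language queries, real VLMs), but the reward-driven search for failures is the shared idea.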

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

LLMORPH: Automated Metamorphic Testing of Large Language Models

Researchers have developed LLMORPH, an automated testing tool for Large Language Models that uses Metamorphic Testing to identify faulty behaviors without requiring human-labeled data. The tool was tested on GPT-4, LLAMA3, and HERMES 2 across four NLP benchmarks, generating over 561,000 test executions and successfully exposing model inconsistencies.

🧠 GPT-4
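Metamorphic Testing itself is a general technique: apply a semantics-preserving transformation to an input and check that the model's output is unchanged, which needs no labeled data. A minimal sketch with a stub classifier follows; LLMORPH's actual metamorphic relations and models are not shown in this summary, so the relation and model here are illustrative.

```python
# Metamorphic-testing sketch: extra whitespace is a semantics-preserving
# change, so the output should not flip. `toy_sentiment` stands in
# for a real LLM call.

def toy_sentiment(text: str) -> str:
    """Stub model: keyword-based sentiment label."""
    t = text.lower()
    if "great" in t or "good" in t:
        return "positive"
    return "negative"

def add_whitespace(text: str) -> str:
    """Metamorphic transformation: doubles the spaces between words."""
    return "  ".join(text.split())

def metamorphic_violations(model, inputs):
    """Return inputs where the invariance relation is broken."""
    return [x for x in inputs if model(x) != model(add_whitespace(x))]

inputs = ["a great movie", "a dull plot", "good acting, weak script"]
print(metamorphic_violations(toy_sentiment, inputs))  # prints []
```

A real LLM would sometimes flip its answer under such transformations; each violation is an automatically detected inconsistency of the kind the 561,000 test executions surfaced.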
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

FERRET: Framework for Expansion Reliant Red Teaming

Researchers introduce FERRET, a new automated red teaming framework designed to generate multi-modal adversarial conversations to test AI model vulnerabilities. The framework uses three types of expansions (horizontal, vertical, and meta) to create more effective attack strategies and demonstrates superior performance compared to existing red teaming approaches.

AI · Bearish · arXiv – CS AI · Mar 3 · 7/10

Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement

Research reveals that Large Language Models (LLMs) systematically fail at code review tasks, frequently misclassifying correct code as defective when matching implementations to natural language requirements. The study found that more detailed prompts actually increase misjudgment rates, raising concerns about LLM reliability in automated development workflows.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

AWE: Adaptive Agents for Dynamic Web Penetration Testing

Researchers introduced AWE, a memory-augmented multi-agent framework for autonomous web penetration testing that outperforms existing tools on injection vulnerabilities. AWE achieved 87% XSS success and 66.7% blind SQL injection success on benchmark tests, demonstrating superior accuracy and efficiency compared to general-purpose AI penetration testing tools.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Agentic Code Reasoning

Researchers introduce 'semi-formal reasoning' for LLM agents to analyze code semantics without execution, showing significant accuracy improvements across multiple tasks. The methodology achieves 88-93% accuracy on patch verification and 87% on code question answering, potentially enabling practical applications in automated code review and static analysis.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

OBsmith: LLM-Powered JavaScript Obfuscator Testing

Researchers introduce OBsmith, an LLM-powered framework that tests JavaScript obfuscators for correctness bugs that can silently alter program functionality. The tool discovered 11 previously unknown bugs that existing JavaScript fuzzers failed to detect, highlighting critical gaps in obfuscation quality assurance.
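The correctness oracle behind such a tool is essentially differential testing: the obfuscated program must match the original on every input, and any divergence is a bug in the obfuscator. A minimal sketch, with Python functions standing in for the original and a deliberately buggy "obfuscated" JavaScript program (OBsmith's actual harness is not shown in this summary):

```python
# Differential-testing sketch: run both versions on the same inputs
# and flag every output mismatch. The transformed version contains a
# planted bug to show how a silent behavior change is detected.

def original(n: int) -> int:
    return n * 2 + 1

def transformed_buggy(n: int) -> int:
    # A faulty "obfuscation" that silently changes behavior at n == 0.
    return n * 2 + 1 if n != 0 else 0

def find_mismatches(f, g, inputs):
    """Inputs where the two versions disagree (each is a bug report)."""
    return [n for n in inputs if f(n) != g(n)]

print(find_mismatches(original, transformed_buggy, range(-3, 4)))  # [0]
```

OBsmith's contribution is in generating the programs and inputs (via an LLM) so that such mismatches are actually reached, which is where the fuzzers it was compared against fell short.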

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10

Biases in the Blind Spot: Detecting What LLMs Fail to Mention

Researchers have developed an automated pipeline to detect hidden biases in Large Language Models that don't appear in their reasoning explanations. The system discovered previously unknown biases like Spanish fluency and writing formality across seven LLMs in hiring, loan approval, and university admission tasks.

AI · Neutral · arXiv – CS AI · Mar 2 · 4/10

QD-MAPPER: A Quality Diversity Framework to Automatically Evaluate Multi-Agent Path Finding Algorithms in Diverse Maps

Researchers developed QD-MAPPER, a framework using Quality Diversity algorithms and Neural Cellular Automata to automatically generate diverse maps for evaluating Multi-Agent Path Finding (MAPF) algorithms. This addresses the limitation of testing MAPF algorithms on fixed, human-designed maps that may not cover all scenarios and could lead to overfitting.
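The Neural Cellular Automata generator is beyond a short sketch, but the Quality Diversity bookkeeping (a MAP-Elites-style archive that keeps the best map per descriptor cell) can be illustrated with random grids. The descriptor (obstacle-density bucket) and the fitness (free border cells, a crude proxy for available agent start positions) below are assumptions for illustration, not QD-MAPPER's actual choices.

```python
import random

# MAP-Elites-style archive sketch: bin candidate maps by a behavior
# descriptor and keep only the fittest map in each bin, so the final
# archive spans diverse map types rather than one optimum.

def random_map(size=8, p_obstacle=None):
    p = random.random() if p_obstacle is None else p_obstacle
    return [[1 if random.random() < p else 0 for _ in range(size)]
            for _ in range(size)]

def density_bucket(grid, buckets=5):
    """Descriptor: which of 5 obstacle-density bands the map falls in."""
    cells = [c for row in grid for c in row]
    return min(int(sum(cells) / len(cells) * buckets), buckets - 1)

def border_free_cells(grid):
    """Fitness proxy: open cells on the border of the grid."""
    n = len(grid)
    border = [(i, j) for i in range(n) for j in range(n)
              if i in (0, n - 1) or j in (0, n - 1)]
    return sum(1 for i, j in border if grid[i][j] == 0)

def build_archive(iterations=500, seed=0):
    random.seed(seed)
    archive = {}  # bucket -> (fitness, grid)
    for _ in range(iterations):
        g = random_map()
        b, f = density_bucket(g), border_free_cells(g)
        if b not in archive or f > archive[b][0]:
            archive[b] = (f, g)
    return archive

archive = build_archive()
print(sorted(archive))  # typically all five density bands get filled
```

QD-MAPPER replaces the random generator here with Neural Cellular Automata and uses MAPF-specific descriptors, but the archive mechanics are the same: coverage of diverse map types, not a single hardest map.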