y0news

#automated-testing News & Analysis

13 articles tagged with #automated-testing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study

Researchers demonstrated AI-assisted automated unit test generation and code refactoring in a case study, generating nearly 16,000 lines of reliable unit tests in hours instead of weeks. The approach achieved up to 78% branch coverage in critical modules and significantly reduced regression risk during large-scale refactoring of legacy codebases.
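The paper's generated tests are not reproduced in this summary. As a toy illustration of what "branch coverage" measures, here is a hypothetical function with a single branch point and one generated-style test per branch; all names below are invented for illustration, not taken from the study.

```python
# Hypothetical example: one test per branch is what pushes branch
# coverage toward 100% for this function.

def apply_discount(price: float, is_member: bool) -> float:
    """Toy function with one branch point."""
    if is_member:
        return round(price * 0.9, 2)  # members get 10% off
    return price

# Generated-style tests: together they exercise both branches.
def test_member_discount():
    assert apply_discount(100.0, is_member=True) == 90.0

def test_non_member_pays_full_price():
    assert apply_discount(100.0, is_member=False) == 100.0

if __name__ == "__main__":
    test_member_discount()
    test_non_member_pays_full_price()
    print("both branches covered")
```

In practice a tool such as coverage.py would report the branch percentage; the paper's 78% figure refers to its own case-study modules.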

AI · Neutral · arXiv – CS AI · Apr 6 · 7/10

AgenticRed: Evolving Agentic Systems for Red-Teaming

AgenticRed introduces an automated red-teaming system that uses evolutionary algorithms and LLMs to autonomously design attack methods without human intervention. The system achieved near-perfect attack success rates across multiple AI models, including 100% success on GPT-5.1, DeepSeek-R1 and DeepSeek V3.2.

🧠 GPT-5 · 🧠 Llama
AI · Bearish · arXiv – CS AI · Mar 12 · 7/10

Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

Researchers developed a new framework for evaluating AI security risks in banking and financial services, introducing the Risk-Adjusted Harm Score (RAHS) to measure the severity of AI model failures. The study found that AI models become more vulnerable to security exploits during extended interactions, exposing critical weaknesses in current AI safety assessments for financial institutions.

AI · Neutral · arXiv – CS AI · Mar 9 · 7/10

AdAEM: An Adaptively and Automated Extensible Measurement of LLMs' Value Difference

Researchers introduce AdAEM, a new evaluation algorithm that automatically generates test questions to assess value differences and biases across Large Language Models. Unlike static benchmarks, AdAEM adaptively constructs controversial topics that more sharply distinguish LLMs' underlying values and cultural alignment.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Discovering Failure Modes in Vision-Language Models using RL

Researchers developed an AI framework using reinforcement learning to automatically discover failure modes in vision-language models without human intervention. The system trains a questioner agent that generates adaptive queries to expose weaknesses, successfully identifying 36 novel failure modes across various VLM combinations.
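The paper's questioner agent is not specified in this summary. One way to picture the loop is an epsilon-greedy bandit that earns reward whenever the model under test answers wrongly, so probing concentrates on weak skills over time. Everything below (the query templates, the stub "VLM" and its accuracy rates) is an invented stand-in, not the authors' setup.

```python
import random

# Bandit-style sketch of adversarial probing: reward = 1 when the
# stub model fails, so the questioner learns where failures cluster.

TEMPLATES = ["count objects", "read small text", "name colors"]

def stub_vlm_correct(template: str) -> bool:
    """Stand-in VLM: deliberately bad at reading small text."""
    rates = {"count objects": 0.9, "read small text": 0.2, "name colors": 0.8}
    return random.random() < rates[template]

def probe(steps=2000, eps=0.1, seed=0):
    random.seed(seed)
    reward_sum = {t: 0.0 for t in TEMPLATES}
    pulls = {t: 1 for t in TEMPLATES}  # start at 1 to avoid div-by-zero
    for _ in range(steps):
        if random.random() < eps:
            t = random.choice(TEMPLATES)  # explore
        else:  # exploit: template with highest observed failure rate
            t = max(TEMPLATES, key=lambda k: reward_sum[k] / pulls[k])
        pulls[t] += 1
        reward_sum[t] += 0.0 if stub_vlm_correct(t) else 1.0
    return max(TEMPLATES, key=lambda k: reward_sum[k] / pulls[k])

print(probe())  # converges on the stub's weakest skill
```

The paper's RL approach is far richer (adaptive natural-language queries, real VLMs), but the reward-driven search for failures is the shared idea.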

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

LLMORPH: Automated Metamorphic Testing of Large Language Models

Researchers have developed LLMORPH, an automated testing tool for Large Language Models that uses Metamorphic Testing to identify faulty behaviors without requiring human-labeled data. The tool was tested on GPT-4, LLAMA3, and HERMES 2 across four NLP benchmarks, generating over 561,000 test executions and successfully exposing model inconsistencies.

🧠 GPT-4
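Metamorphic Testing itself is a general technique: apply a semantics-preserving transformation to an input and check that the model's output is unchanged, which needs no labeled data. A minimal sketch with a stub classifier follows; LLMORPH's actual metamorphic relations and models are not shown in this summary, so the relation and model here are illustrative.

```python
# Metamorphic-testing sketch: extra whitespace is a semantics-preserving
# change, so the output should not flip. `toy_sentiment` stands in
# for a real LLM call.

def toy_sentiment(text: str) -> str:
    """Stub model: keyword-based sentiment label."""
    t = text.lower()
    if "great" in t or "good" in t:
        return "positive"
    return "negative"

def add_whitespace(text: str) -> str:
    """Metamorphic transformation: doubles the spaces between words."""
    return "  ".join(text.split())

def metamorphic_violations(model, inputs):
    """Return inputs where the invariance relation is broken."""
    return [x for x in inputs if model(x) != model(add_whitespace(x))]

inputs = ["a great movie", "a dull plot", "good acting, weak script"]
print(metamorphic_violations(toy_sentiment, inputs))  # prints []
```

A real LLM would sometimes flip its answer under such transformations; each violation is an automatically detected inconsistency of the kind the 561,000 test executions surfaced.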
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

FERRET: Framework for Expansion Reliant Red Teaming

Researchers introduce FERRET, a new automated red teaming framework designed to generate multi-modal adversarial conversations to test AI model vulnerabilities. The framework uses three types of expansions (horizontal, vertical, and meta) to create more effective attack strategies and demonstrates superior performance compared to existing red teaming approaches.

AI · Bearish · arXiv – CS AI · Mar 3 · 7/10

Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement

Research reveals that Large Language Models (LLMs) systematically fail at code review tasks, frequently misclassifying correct code as defective when matching implementations to natural language requirements. The study found that more detailed prompts actually increase misjudgment rates, raising concerns about LLM reliability in automated development workflows.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

AWE: Adaptive Agents for Dynamic Web Penetration Testing

Researchers introduced AWE, a memory-augmented multi-agent framework for autonomous web penetration testing that outperforms existing tools on injection vulnerabilities. AWE achieved 87% XSS success and 66.7% blind SQL injection success on benchmark tests, demonstrating superior accuracy and efficiency compared to general-purpose AI penetration testing tools.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Agentic Code Reasoning

Researchers introduce 'semi-formal reasoning' for LLM agents to analyze code semantics without execution, showing significant accuracy improvements across multiple tasks. The methodology achieves 88-93% accuracy on patch verification and 87% on code question answering, potentially enabling practical applications in automated code review and static analysis.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

OBsmith: LLM-Powered JavaScript Obfuscator Testing

Researchers introduce OBsmith, an LLM-powered framework that tests JavaScript obfuscators for correctness bugs that can silently alter program functionality. The tool discovered 11 previously unknown bugs that existing JavaScript fuzzers failed to detect, highlighting critical gaps in obfuscation quality assurance.
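The correctness oracle behind such a tool is essentially differential testing: the obfuscated program must match the original on every input, and any divergence is a bug in the obfuscator. A minimal sketch, with Python functions standing in for the original and a deliberately buggy "obfuscated" JavaScript program (OBsmith's actual harness is not shown in this summary):

```python
# Differential-testing sketch: run both versions on the same inputs
# and flag every output mismatch. The transformed version contains a
# planted bug to show how a silent behavior change is detected.

def original(n: int) -> int:
    return n * 2 + 1

def transformed_buggy(n: int) -> int:
    # A faulty "obfuscation" that silently changes behavior at n == 0.
    return n * 2 + 1 if n != 0 else 0

def find_mismatches(f, g, inputs):
    """Inputs where the two versions disagree (each is a bug report)."""
    return [n for n in inputs if f(n) != g(n)]

print(find_mismatches(original, transformed_buggy, range(-3, 4)))  # [0]
```

OBsmith's contribution is in generating the programs and inputs (via an LLM) so that such mismatches are actually reached, which is where the fuzzers it was compared against fell short.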

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10

Biases in the Blind Spot: Detecting What LLMs Fail to Mention

Researchers have developed an automated pipeline to detect hidden biases in Large Language Models that don't appear in their reasoning explanations. The system discovered previously unknown biases like Spanish fluency and writing formality across seven LLMs in hiring, loan approval, and university admission tasks.

AI · Neutral · arXiv – CS AI · Mar 2 · 4/10

QD-MAPPER: A Quality Diversity Framework to Automatically Evaluate Multi-Agent Path Finding Algorithms in Diverse Maps

Researchers developed QD-MAPPER, a framework using Quality Diversity algorithms and Neural Cellular Automata to automatically generate diverse maps for evaluating Multi-Agent Path Finding (MAPF) algorithms. This addresses the limitation of testing MAPF algorithms on fixed, human-designed maps that may not cover all scenarios and could lead to overfitting.
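The Neural Cellular Automata generator is beyond a short sketch, but the Quality Diversity bookkeeping (a MAP-Elites-style archive that keeps the best map per descriptor cell) can be illustrated with random grids. The descriptor (obstacle-density bucket) and the fitness (free border cells, a crude proxy for available agent start positions) below are assumptions for illustration, not QD-MAPPER's actual choices.

```python
import random

# MAP-Elites-style archive sketch: bin candidate maps by a behavior
# descriptor and keep only the fittest map in each bin, so the final
# archive spans diverse map types rather than one optimum.

def random_map(size=8, p_obstacle=None):
    p = random.random() if p_obstacle is None else p_obstacle
    return [[1 if random.random() < p else 0 for _ in range(size)]
            for _ in range(size)]

def density_bucket(grid, buckets=5):
    """Descriptor: which of 5 obstacle-density bands the map falls in."""
    cells = [c for row in grid for c in row]
    return min(int(sum(cells) / len(cells) * buckets), buckets - 1)

def border_free_cells(grid):
    """Fitness proxy: open cells on the border of the grid."""
    n = len(grid)
    border = [(i, j) for i in range(n) for j in range(n)
              if i in (0, n - 1) or j in (0, n - 1)]
    return sum(1 for i, j in border if grid[i][j] == 0)

def build_archive(iterations=500, seed=0):
    random.seed(seed)
    archive = {}  # bucket -> (fitness, grid)
    for _ in range(iterations):
        g = random_map()
        b, f = density_bucket(g), border_free_cells(g)
        if b not in archive or f > archive[b][0]:
            archive[b] = (f, g)
    return archive

archive = build_archive()
print(sorted(archive))  # typically all five density bands get filled
```

QD-MAPPER replaces the random generator here with Neural Cellular Automata and uses MAPF-specific descriptors, but the archive mechanics are the same: coverage of diverse map types, not a single hardest map.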