y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#metamorphic-testing News & Analysis

4 articles tagged with #metamorphic-testing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

4 articles
AINeutralarXiv โ€“ CS AI ยท Mar 167/10
๐Ÿง 

Semantic Invariance in Agentic AI

Researchers developed a testing framework to evaluate how reliably AI agents maintain consistent reasoning when inputs are semantically equivalent but differently phrased. Their study of seven foundation models across 19 reasoning problems found that larger models aren't necessarily more robust, with the smaller Qwen3-30B-A3B achieving the highest stability at 79.6% invariant responses.

AIBullisharXiv โ€“ CS AI ยท Mar 57/10
๐Ÿง 

An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software

Researchers developed a multi-agent LLM system that translates legal statutes into executable software, using U.S. tax preparation as a test case. The system achieved a 45% success rate using GPT-4o-mini, significantly outperforming larger frontier models like GPT-4o and Claude 3.5 which only achieved 9-15% success rates on complex tax code tasks.

๐Ÿง  GPT-4๐Ÿง  Claude
AINeutralarXiv โ€“ CS AI ยท Mar 266/10
๐Ÿง 

LLMORPH: Automated Metamorphic Testing of Large Language Models

Researchers have developed LLMORPH, an automated testing tool for Large Language Models that uses Metamorphic Testing to identify faulty behaviors without requiring human-labeled data. The tool was tested on GPT-4, LLAMA3, and HERMES 2 across four NLP benchmarks, generating over 561,000 test executions and successfully exposing model inconsistencies.

๐Ÿง  GPT-4
AINeutralarXiv โ€“ CS AI ยท Mar 275/10
๐Ÿง 

From Untestable to Testable: Metamorphic Testing in the Age of LLMs

A research paper introduces metamorphic testing as a solution for testing AI and LLM-integrated software systems. The approach addresses the challenge of unreliable LLM outputs and limited labeled ground truth by using relationships between multiple test executions as test oracles.