AI · Neutral · Importance 6/10
LLMORPH: Automated Metamorphic Testing of Large Language Models
AI Summary
Researchers have developed LLMORPH, an automated testing tool for Large Language Models that uses Metamorphic Testing to identify faulty behaviors without requiring human-labeled data. The tool was evaluated on GPT-4, LLAMA3, and HERMES 2 across four NLP benchmarks, generating over 561,000 test executions and successfully exposing model inconsistencies.
Key Takeaways
- LLMORPH addresses the challenge of testing LLMs without expensive human-labeled verification data.
- The tool uses Metamorphic Relations to generate follow-up inputs and detect output inconsistencies automatically.
- Testing across three major LLMs with 36 Metamorphic Relations produced over 561,000 test executions.
- The framework can be easily extended to any LLM, NLP task, and set of Metamorphic Relations.
- Results demonstrate the tool's effectiveness in automatically exposing model inconsistencies and faulty behaviors.
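The core idea behind the takeaways above can be sketched in a few lines: a Metamorphic Relation transforms a source input into a follow-up input whose expected output is known relative to the source output, so inconsistency between the two flags a fault with no human-labeled ground truth needed. The sketch below is illustrative only; `query_model` and the synonym-substitution relation are assumptions for demonstration, not LLMORPH's actual API or relations.

```python
# Minimal sketch of metamorphic testing for an LLM-style classifier.
# query_model and mr_synonym_substitution are hypothetical stand-ins,
# not LLMORPH's real interface.

def query_model(prompt: str) -> str:
    """Stand-in for a real LLM call; here a toy sentiment rule."""
    text = prompt.lower()
    return "positive" if ("great" in text or "excellent" in text) else "negative"

def mr_synonym_substitution(source: str) -> str:
    """Metamorphic Relation: swapping a word for a synonym
    should leave the predicted label unchanged."""
    return source.replace("great", "excellent")

def run_metamorphic_test(source_input: str) -> bool:
    """True if the relation holds (outputs agree), False if violated."""
    source_out = query_model(source_input)
    followup_out = query_model(mr_synonym_substitution(source_input))
    return source_out == followup_out

inputs = ["The movie was great.", "The plot was dull."]
violations = [s for s in inputs if not run_metamorphic_test(s)]
print(f"{len(violations)} MR violation(s) found")  # prints "0 MR violation(s) found"
```

Scaling this pattern to many relations and many source inputs is what yields test-execution counts in the hundreds of thousands: each (input, MR) pair is one automatic test, and any disagreement is a candidate faulty behavior.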
Models Mentioned
GPT-4 (OpenAI)
#llm-testing #automated-testing #metamorphic-testing #gpt-4 #llama3 #nlp #model-reliability #ai-research #llmorph
Read Original (via arXiv, cs.AI)