🧠 AI | ⚪ Neutral | Importance 7/10

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

arXiv – CS AI | Nishal Thomas, Noel Thomas | May 29, 2026 at 04:00 AM

🤖AI Summary

FormInv introduces a measurement protocol that audits mathematical reasoning benchmarks for semantic consistency, revealing that current evaluation methods mask significant ranking volatility across AI models. The study found 3.1% semantically incorrect paraphrases in MathCheck that altered model rankings and discovered that models achieving similar accuracy scores (86-96%) exhibit drastically different consistency rates (50-82%) when tested against semantically equivalent problem restatements.

Analysis

FormInv addresses a critical flaw in how AI reasoning capabilities are measured and compared. Researchers discovered that benchmark rankings depend heavily on which paraphrase families designers select, creating arbitrary outcomes where no model universally outperforms others across semantic variations. This "No-Free-Benchmark" principle reveals that published rankings may reflect benchmark construction choices rather than genuine model capability differences.
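To make the dependence on paraphrase families concrete, the sketch below ranks models separately for each family and checks whether any single model tops every ranking. It is only an illustration: the function names, family names, model names, and scores are all invented here and are not taken from the FormInv paper.

```python
from typing import Dict, List

# Hypothetical layout: scores[family][model] -> accuracy on that paraphrase family.
Scores = Dict[str, Dict[str, float]]


def rank_models(accuracy_by_model: Dict[str, float]) -> List[str]:
    """Sort model names from highest to lowest accuracy (ties broken by name)."""
    return sorted(accuracy_by_model, key=lambda m: (-accuracy_by_model[m], m))


def rankings_by_family(scores: Scores) -> Dict[str, List[str]]:
    """Compute one ranking per paraphrase family."""
    return {family: rank_models(per_model) for family, per_model in scores.items()}


def has_universal_winner(scores: Scores) -> bool:
    """True only if the same model ranks first in every family."""
    winners = {ranking[0] for ranking in rankings_by_family(scores).values()}
    return len(winners) == 1


# Made-up numbers for illustration only.
toy_scores: Scores = {
    "original": {"model_a": 0.92, "model_b": 0.90, "model_c": 0.88},
    "reworded": {"model_a": 0.85, "model_b": 0.89, "model_c": 0.84},
    "reordered": {"model_a": 0.80, "model_b": 0.82, "model_c": 0.86},
}

for family, ranking in rankings_by_family(toy_scores).items():
    print(family, ranking)
print("universal winner:", has_universal_winner(toy_scores))
```

In this toy data each family crowns a different model, so a benchmark built from only one family would report a different winner than a benchmark built from another.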

The finding that Claude Haiku reaches 86% accuracy but only a 50% semantic consistency rate exposes a measurement gap in AI evaluation. Standard benchmarks report aggregate accuracy without testing whether models answer identically when problems are restated in semantically equivalent ways. The gap is not limited to one model: nine frontier systems showed 32-point spreads between accuracy and consistency metrics, invisible to conventional evaluation.
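The difference between the two metrics can be sketched in a few lines. The summary does not give the paper's exact formula, so the definition below is an assumption: a problem counts as consistent only if the model is correct on every semantically equivalent variant. The data layout and all values are hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: results[problem_id] holds one boolean per semantically
# equivalent variant of the problem (original first), True if answered correctly.
Results = Dict[str, List[bool]]


def accuracy(results: Results) -> float:
    """Fraction of all (problem, variant) pairs answered correctly."""
    outcomes = [ok for variants in results.values() for ok in variants]
    return sum(outcomes) / len(outcomes)


def consistency_rate(results: Results) -> float:
    """Fraction of problems answered correctly on every variant.

    Assumed definition for illustration; the paper may define it differently.
    """
    return sum(all(variants) for variants in results.values()) / len(results)


# Made-up outcomes for one model: four problems, four variants each.
toy_results: Results = {
    "p1": [True, True, True, True],
    "p2": [True, True, False, True],
    "p3": [True, False, True, True],
    "p4": [True, True, True, True],
}

print(f"accuracy:    {accuracy(toy_results):.3f}")
print(f"consistency: {consistency_rate(toy_results):.3f}")
```

Two scattered misses barely move aggregate accuracy, yet each one breaks consistency for its whole problem, which is how a model can score high on one metric and low on the other.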

The practical implications are substantial. AI developers, researchers, and procurement teams currently rely on benchmark rankings to guide model selection and investment decisions. FormInv's audit protocol, which identified errors automatically using cross-model unanimity for under $10, provides an accessible tool for detecting problematic paraphrases. The protocol achieved 100% recall on external benchmarks, suggesting the approach may carry over to other benchmarks.
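One plausible reading of "cross-model unanimity" is sketched below. The rule is an assumption, not the paper's documented procedure: a paraphrase is flagged for review when every model that solved the original problem fails the paraphrase, on the reasoning that a unanimous failure points at the item rather than at any one model. The function name, the `min_models` threshold, and the data are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout:
# outcomes[item_id][model] = (correct_on_original, correct_on_paraphrase)
Outcomes = Dict[str, Dict[str, Tuple[bool, bool]]]


def flag_suspect_paraphrases(outcomes: Outcomes, min_models: int = 3) -> List[str]:
    """Flag paraphrases that every model solving the original gets wrong.

    Assumed rule for illustration only. Items where fewer than `min_models`
    models solved the original are skipped because the evidence is too thin.
    """
    flagged = []
    for item_id, per_model in outcomes.items():
        # Paraphrase outcomes, restricted to models that solved the original.
        paraphrase_ok = [para for orig, para in per_model.values() if orig]
        if len(paraphrase_ok) >= min_models and not any(paraphrase_ok):
            flagged.append(item_id)
    return flagged


# Made-up outcomes for three models.
toy_outcomes: Outcomes = {
    "q1": {"m1": (True, False), "m2": (True, False), "m3": (True, False)},
    "q2": {"m1": (True, True), "m2": (True, False), "m3": (True, True)},
    "q3": {"m1": (True, False), "m2": (True, False), "m3": (False, False)},
}

print(flag_suspect_paraphrases(toy_outcomes))
```

Flagged items are candidates for human review, not confirmed errors. A stricter or looser threshold trades recall against the number of items a reviewer has to read.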

This research matters because mathematical reasoning benchmarks directly influence which models get deployed in critical applications. If rankings are partially determined by benchmark construction rather than genuine capability, organizations face hidden technical debt. Moving forward, benchmark designers must adopt semantic invariance testing before publication, and model evaluators should report consistency metrics alongside accuracy scores to provide stakeholders with complete performance profiles.

Key Takeaways

→ Current mathematical reasoning benchmarks mask ranking volatility because paraphrase families are selected without semantic invariance testing.

→ Models show 32-point gaps between accuracy (86-96%) and semantic consistency rates (50-82%), which standard evaluations do not surface.

→ FormInv's audit protocol detected 3.1% semantically incorrect paraphrases in MathCheck; removing the flawed items moved GPT-4o's ranking from 2nd to 4th.

→ Cross-model unanimity enables automatic error detection for under $10, making semantic audit protocols economically feasible for all benchmark developers.

→ Benchmark designers implicitly choose which model wins by selecting paraphrase families, so published rankings may not reflect genuine capability differences.
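The ranking shift described in the takeaways can be reproduced in miniature by recomputing accuracy with flagged items excluded. This is a sketch under assumed data: the model names, item IDs, and outcomes are invented, and it does not reproduce the MathCheck result.

```python
from typing import AbstractSet, Dict, List

# Hypothetical layout: correct[model][item_id] -> True if the model solved the item.
Correct = Dict[str, Dict[str, bool]]


def ranking(correct: Correct, exclude: AbstractSet[str] = frozenset()) -> List[str]:
    """Rank models by accuracy over items not in `exclude` (ties broken by name)."""
    def acc(model: str) -> float:
        kept = [ok for item, ok in correct[model].items() if item not in exclude]
        return sum(kept) / len(kept)

    return sorted(correct, key=lambda m: (-acc(m), m))


# Made-up outcomes; suppose an audit flagged i5 and i6 as flawed paraphrases.
toy_correct: Correct = {
    "model_x": {"i1": True, "i2": True, "i3": True, "i4": False, "i5": True, "i6": True},
    "model_y": {"i1": True, "i2": True, "i3": True, "i4": True, "i5": False, "i6": False},
    "model_z": {"i1": True, "i2": True, "i3": False, "i4": False, "i5": True, "i6": False},
}
flawed = {"i5", "i6"}

print("all items:     ", ranking(toy_correct))
print("flawed removed:", ranking(toy_correct, exclude=flawed))
```

In this toy case the top two models swap places once the flagged items are dropped, the same kind of shift the summary reports for GPT-4o.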

Mentioned in AI

Models

GPT-4 (OpenAI)

Claude (Anthropic)

Haiku (Anthropic)

#benchmark-evaluation #semantic-invariance #ai-reasoning #measurement-protocol #model-ranking #mathematical-reasoning #llm-evaluation

