FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks
FormInv introduces a measurement protocol that audits mathematical reasoning benchmarks for semantic consistency, revealing that current evaluation methods mask significant ranking volatility across AI models. The study found that 3.1% of the paraphrases in MathCheck are semantically incorrect, enough to alter model rankings. It also found that models with similar accuracy scores (86-96%) exhibit drastically different consistency rates (50-82%) when tested against semantically equivalent restatements of the same problems.
FormInv addresses a critical flaw in how AI reasoning capabilities are measured and compared. Researchers discovered that benchmark rankings depend heavily on which paraphrase families designers select, creating arbitrary outcomes where no model universally outperforms others across semantic variations. This "No-Free-Benchmark" principle reveals that published rankings may reflect benchmark construction choices rather than genuine model capability differences.
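The dependence of rankings on paraphrase family can be illustrated with a minimal sketch. The family names, model names, and accuracy figures below are hypothetical placeholders, not values from the study; the point is only that the top-ranked model can change with the family a benchmark designer picks.

```python
from typing import Dict, List

# Hypothetical per-family accuracies (illustrative numbers, not from the study).
ACCURACY_BY_FAMILY: Dict[str, Dict[str, float]] = {
    "symbolic_rewrite": {"model_a": 0.91, "model_b": 0.88, "model_c": 0.85},
    "verbose_restatement": {"model_a": 0.84, "model_b": 0.90, "model_c": 0.87},
    "unit_change": {"model_a": 0.86, "model_b": 0.83, "model_c": 0.89},
}


def rank_models(scores: Dict[str, float]) -> List[str]:
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=lambda m: scores[m], reverse=True)


def winners_by_family(table: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each paraphrase family to its top-ranked model."""
    return {family: rank_models(scores)[0] for family, scores in table.items()}


def dominant_models(table: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the model that ranks first in every family, or [] if none does."""
    winners = set(winners_by_family(table).values())
    return sorted(winners) if len(winners) == 1 else []


if __name__ == "__main__":
    print(winners_by_family(ACCURACY_BY_FAMILY))
    print(dominant_models(ACCURACY_BY_FAMILY))
```

In this toy table each family crowns a different model, so `dominant_models` returns an empty list: a benchmark built from any single family would publish a different winner.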
The finding that Claude Haiku reaches 86% accuracy but only a 50% semantic consistency rate exposes a measurement crisis in AI evaluation. Standard benchmarks report aggregate accuracy without testing whether a model answers identically when a problem is restated in a semantically equivalent way. The pattern holds across the nine frontier systems tested: accuracy falls in a narrow 86-96% band while consistency ranges from 50% to 82%, a 32-point spread that conventional evaluation does not surface.
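To make the distinction between the two metrics concrete, here is a minimal sketch. It assumes a problem counts as consistent only when the model is correct on every restatement; the exact definition FormInv uses may differ, and the toy answers are invented for illustration.

```python
from typing import Dict, List, Tuple

# problem id -> (reference answer, model answers to each equivalent restatement)
Results = Dict[str, Tuple[str, List[str]]]


def accuracy(results: Results) -> float:
    """Fraction of individual restatements answered correctly."""
    total = sum(len(answers) for _, answers in results.values())
    correct = sum(a == ref for ref, answers in results.values() for a in answers)
    return correct / total


def consistency_rate(results: Results) -> float:
    """Fraction of problems answered correctly on every restatement (assumed definition)."""
    consistent = sum(
        all(a == ref for a in answers) for ref, answers in results.values()
    )
    return consistent / len(results)


# Hypothetical toy data: four problems, four restatements each.
TOY_RESULTS: Results = {
    "p1": ("12", ["12", "12", "12", "12"]),
    "p2": ("7", ["7", "7", "7", "5"]),
    "p3": ("30", ["30", "30", "30", "30"]),
    "p4": ("2", ["2", "3", "2", "2"]),
}

if __name__ == "__main__":
    print(accuracy(TOY_RESULTS))
    print(consistency_rate(TOY_RESULTS))
```

Here 14 of 16 restatements are answered correctly, yet only 2 of 4 problems survive every restatement, so a single wrong answer per problem barely moves accuracy while halving consistency.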
The practical implications are substantial. AI developers, researchers, and procurement teams currently rely on benchmark rankings to guide model selection and investment decisions. FormInv's audit protocol, which identified errors automatically using cross-model unanimity for under $10, provides an accessible tool for detecting problematic paraphrases. The protocol also achieved 100% recall on external benchmarks, suggesting that it scales beyond MathCheck.
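One plausible reading of cross-model unanimity can be sketched as follows. This is an assumption about the mechanism, not the published procedure: a paraphrase is flagged when every model gives the same answer and that answer disagrees with the reference, on the intuition that unanimous disagreement points at the item rather than at the models. The function name, identifiers, and data are hypothetical.

```python
from typing import Dict, List


def flag_suspect_paraphrases(
    reference: Dict[str, str],
    answers_by_model: Dict[str, Dict[str, str]],
) -> List[str]:
    """
    reference: paraphrase id -> reference answer
    answers_by_model: model name -> paraphrase id -> that model's answer

    Returns the ids where all models agree with each other but not with the
    reference. Flagged items are candidates for review, not confirmed errors.
    """
    flagged = []
    for pid, ref in reference.items():
        answers = {per_model[pid] for per_model in answers_by_model.values()}
        if len(answers) == 1 and ref not in answers:
            flagged.append(pid)
    return flagged


# Hypothetical toy data for three models and three paraphrased items.
REFERENCE = {"q1_para": "15", "q2_para": "8", "q3_para": "4"}
ANSWERS = {
    "model_a": {"q1_para": "15", "q2_para": "10", "q3_para": "4"},
    "model_b": {"q1_para": "15", "q2_para": "10", "q3_para": "4"},
    "model_c": {"q1_para": "15", "q2_para": "10", "q3_para": "6"},
}

if __name__ == "__main__":
    print(flag_suspect_paraphrases(REFERENCE, ANSWERS))
```

Only `q2_para` is flagged: all three models agree on an answer that contradicts the reference. `q3_para` is left alone because the models disagree among themselves, which looks like ordinary model error rather than a broken item.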
This research matters because mathematical reasoning benchmarks directly influence which models get deployed in critical applications. If rankings are partially determined by benchmark construction rather than genuine capability, organizations face hidden technical debt. Moving forward, benchmark designers must adopt semantic invariance testing before publication, and model evaluators should report consistency metrics alongside accuracy scores to provide stakeholders with complete performance profiles.
- Current mathematical reasoning benchmarks mask ranking volatility because their paraphrase families are chosen selectively and never tested for semantic invariance.
- Models with similar accuracy (86-96%) differ widely in semantic consistency (50-82%), a 32-point spread that standard evaluations do not measure.
- FormInv's audit protocol found that 3.1% of MathCheck paraphrases are semantically incorrect; removing the flawed items moved GPT-4o from 2nd to 4th place.
- Cross-model unanimity enables automatic error detection for under $10, making semantic audits economically feasible for all benchmark developers.
- By selecting paraphrase families, benchmark designers implicitly choose which model wins, so published rankings may not reflect genuine capability differences.