AINeutralarXiv – CS AI · 14h ago7/10
🧠
FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks
FormInv introduces a measurement protocol that audits mathematical reasoning benchmarks for semantic consistency, revealing that current evaluation methods mask significant ranking volatility across AI models. The study found 3.1% semantically incorrect paraphrases in MathCheck that altered model rankings and discovered that models achieving similar accuracy scores (86-96%) exhibit drastically different consistency rates (50-82%) when tested against semantically equivalent problem restatements.
🧠 GPT-4🧠 Claude🧠 Haiku