AI · Neutral · arXiv – CS AI · 9h ago · 7/10
GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
Researchers introduce GSM-SEM, a framework for generating semantically diverse variants of math benchmarks like GSM8K to combat memorization in LLM evaluations. Testing 14 state-of-the-art models reveals consistent performance drops averaging 28%, suggesting current leaderboard rankings may overstate true reasoning capabilities.