GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
Researchers introduce GSM-SEM, a framework for generating semantically diverse variants of math benchmarks like GSM8K to combat memorization in LLM evaluations. Testing 14 state-of-the-art models reveals consistent performance drops averaging 28%, suggesting current leaderboard rankings may overstate true reasoning capabilities.
The proliferation of standardized benchmarks in AI has created an unintended consequence: models optimize for fixed test sets rather than develop genuine reasoning capabilities. GSM-SEM addresses this by introducing stochastic perturbations that fundamentally alter problem semantics (modifying entities, attributes, and relationships) while preserving mathematical validity. This differs meaningfully from prior robustness variants that apply only surface-level changes such as paraphrasing or number swaps, which models can learn to handle without deeper understanding.
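To make the perturbation idea concrete, here is a minimal sketch of what such a generator could look like for a single GSM8K-style template: entities and attributes are resampled, numbers are redrawn, and the ground-truth answer is recomputed so every variant stays mathematically valid. The template, name and item pools, and sampling ranges are illustrative assumptions, not the paper's actual generator; relationship changes (e.g., swapping buyer and seller roles) would follow the same pattern with additional templates.

```python
import random

# Illustrative sketch of a GSM-SEM-style semantic perturbation.
# NOT the authors' implementation: template, pools, and ranges are assumptions.

TEMPLATE = (
    "{name} buys {n_items} {item}s at ${price} each and pays with a ${paid} bill. "
    "How much change does {name} get back?"
)

NAMES = ["Maya", "Omar", "Lena"]        # entity swaps
ITEMS = ["notebook", "mug", "candle"]   # attribute swaps

def generate_variant(seed: int) -> dict:
    """Sample one semantically distinct problem with its recomputed answer."""
    rng = random.Random(seed)
    while True:
        n_items = rng.randint(2, 9)
        price = rng.randint(2, 12)
        paid = rng.choice([20, 50, 100])
        answer = paid - n_items * price
        if answer >= 0:  # preserve mathematical validity: reject nonsensical samples
            break
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        n_items=n_items,
        item=rng.choice(ITEMS),
        price=price,
        paid=paid,
    )
    return {"question": question, "answer": answer}

if __name__ == "__main__":
    for s in range(3):
        print(generate_variant(s))
```

Because the answer is recomputed from the perturbed quantities rather than copied from the source problem, a model that memorized the original benchmark gains nothing from recognizing the template.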
The 28% average performance drop across leading models signals that current leaderboard positions may reflect dataset familiarity as much as reasoning ability. This mirrors a broader concern in AI evaluation: static benchmarks become optimization targets rather than measures of true capability. Because the framework is reusable and generates variants stochastically, it reduces reliance on fixed public datasets, and memorization bias shrinks over time as models continually face novel semantic variations.
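As a sketch of what such dynamic evaluation could look like in practice, the loop below draws fresh variants on every run (reusing the hypothetical generate_variant from the sketch above) and scores a model against them; ask_model is a placeholder for any LLM call. The article does not say whether the 28% figure is absolute or relative, so the helper computes the relative form as one plausible reading.

```python
from typing import Callable

# Dynamic evaluation sketch in the spirit of GSM-SEM: every run draws fresh
# semantic variants, so no fixed public test set exists to be memorized.
# Assumes the illustrative generate_variant() defined in the earlier sketch;
# `ask_model` stands in for any LLM call that returns an integer answer.

def evaluate(ask_model: Callable[[str], int], run_seed: int, n: int = 100) -> float:
    """Accuracy over n freshly generated variants for this run."""
    correct = 0
    for i in range(n):
        sample = generate_variant(run_seed * 100_000 + i)  # unseen problems each run
        if ask_model(sample["question"]) == sample["answer"]:
            correct += 1
    return correct / n

def relative_drop(acc_original: float, acc_variant: float) -> float:
    """Relative performance drop; one plausible reading of the reported 28%."""
    return (acc_original - acc_variant) / acc_original
```

Seeding each run differently is what makes the benchmark reusable: rerunning with a new run_seed yields a statistically comparable but unmemorizable test set.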
For the AI development community, GSM-SEM raises important questions about evaluation methodology and what constitutes genuine mathematical reasoning. The release of three fully human-validated datasets (GSM8K-SEM, GSM-Symbolic-SEM, GSM-Plus-SEM) provides immediate practical tools, while demonstrated applicability to BigBenchHard, LogicBench, and NLR-BIRD suggests the framework's potential for broader benchmark evolution.
Moving forward, the field must decide whether to embrace dynamic evaluation frameworks like GSM-SEM as standard practice. This shift would require rethinking how progress is measured and reported, potentially destabilizing existing leaderboards while establishing more robust evaluation baselines.
- GSM-SEM generates semantically diverse benchmark variants that alter underlying problem facts while preserving mathematical correctness
- 14 state-of-the-art LLMs show an average 28% performance drop under maximum semantic perturbation conditions
- The stochastic, reusable framework reduces reliance on static benchmarks and lowers memorization bias in model evaluation
- Three fully human-validated datasets (GSM8K-SEM, GSM-Symbolic-SEM, GSM-Plus-SEM) are publicly released for community use
- The framework successfully extends beyond math problems to LogicBench, BigBenchHard, and NLR-BIRD, indicating broad applicability