MINCE: Shrinking LLM Evaluation Datasets via Few-Model Monte Carlo Calibration
Researchers introduce MINCE, a method that cuts the computational cost of evaluating large language models by shrinking benchmark datasets. Using Monte Carlo simulation over a small set of calibration models, MINCE reduces dataset sizes by 54-89% while keeping accuracy drift below 2.62 percentage points, enabling 2.7-8.1x faster GPU evaluations.
MINCE addresses a critical pain point in modern AI development: the prohibitive cost of repeatedly evaluating LLM variants across large benchmarks. As organizations deploy quantized, fine-tuned, and hardware-specific model versions, the cumulative evaluation burden becomes unsustainable. Existing approaches require either massive calibration pools or additional learned prediction layers, adding complexity and overhead. MINCE's Monte Carlo-based approach sidesteps both requirements: it uses statistical simulation to determine the smallest subset size that still preserves evaluation fidelity.
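The idea of simulating subsets to find a minimum viable size can be illustrated with a minimal sketch. This is not MINCE's published procedure, only a generic Monte Carlo search under stated assumptions: each calibration model's per-item correctness is available as a 0/1 matrix, subsets are drawn uniformly at random, and a size is accepted when a high quantile of the worst-case drift stays under a chosen threshold. The function name `smallest_subset_size` and all of its parameters are hypothetical.

```python
import numpy as np


def smallest_subset_size(correct, max_drift_pp=2.0, trials=500,
                         quantile=0.95, candidate_sizes=None, seed=0):
    """Return the smallest subset size whose simulated drift stays in budget.

    correct: (n_models, n_items) array of 0/1 per-item correctness for a
    handful of calibration models (an assumed input format).
    """
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    n_models, n_items = correct.shape
    full_acc = correct.mean(axis=1)  # accuracy of each model on the full set

    if candidate_sizes is None:
        # Try 5%, 10%, ..., 100% of the benchmark, smallest first.
        fractions = np.linspace(0.05, 1.0, 20)
        candidate_sizes = sorted({max(1, int(n_items * f)) for f in fractions})

    for n in candidate_sizes:
        worst = np.empty(trials)
        for t in range(trials):
            # Draw one random subset and score every calibration model on it.
            idx = rng.choice(n_items, size=n, replace=False)
            sub_acc = correct[:, idx].mean(axis=1)
            # Worst-case drift across models, in percentage points.
            worst[t] = 100.0 * np.abs(sub_acc - full_acc).max()
        # Accept the first size whose high-quantile drift fits the budget.
        if np.quantile(worst, quantile) <= max_drift_pp:
            return n
    return n_items  # no candidate qualified; fall back to the full benchmark


if __name__ == "__main__":
    # Synthetic stand-in for three calibration models on a 2000-item benchmark.
    demo_rng = np.random.default_rng(1)
    demo = (demo_rng.random((3, 2000)) < np.array([[0.6], [0.7], [0.8]]))
    print(smallest_subset_size(demo, max_drift_pp=3.0, trials=200))
```

In this sketch the threshold, quantile, and trial count are the knobs that trade compression against fidelity; a tighter drift budget can only yield an equal or larger subset for the same seed.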
The method's efficiency gains are substantial and practically meaningful. Reducing MMLU by 89% while keeping accuracy drift below 2.62 percentage points suggests that existing benchmarks contain significant redundancy. The robustness across deployment targets, from BF16 models to edge NPUs, indicates broad applicability rather than narrow optimization. Critically, MINCE outperforms tinyBenchmarks, a comparable compression approach, achieving 12x lower drift on MMLU while using 57x fewer calibration models.
For the AI development ecosystem, this work has immediate implications. Faster evaluation cycles enable more rapid iteration on model improvements, quantization strategies, and deployment optimizations. This accelerates the practical deployment of efficient models on edge hardware, an increasingly important frontier as inference costs drive adoption decisions. The median 1.7-2.0x speedup on NPU evaluation is particularly significant for embedded AI applications where computational resources are constrained.
Looking forward, similar compression techniques could extend beyond evaluation to training validation and hyperparameter tuning. It remains an open question whether further compression is possible without sacrificing robustness, particularly for safety-critical applications where evaluation fidelity is paramount.
- MINCE reduces LLM evaluation dataset sizes by 54-89% while maintaining sub-2.62pp accuracy drift on production models
- The Monte Carlo-based method requires no prediction layers and achieves 2.7-8.1x GPU evaluation speedups and 1.7-2.0x NPU speedups
- MINCE outperforms tinyBenchmarks with 12x lower drift on MMLU while using 57x fewer calibration models
- The approach is robust to calibration pool size, making it practical for organizations with limited reference model access
- Faster evaluation cycles could accelerate deployment of efficient models on edge hardware and embedded AI applications