Learning More from Less: Unlocking Internal Representations for Benchmark Compression
RepCore, a new method for compressing LLM benchmarks, uses aligned hidden states from neural networks to identify representative test subsets rather than relying solely on correctness labels. The approach achieves accurate performance estimation with as few as ten source models, addressing the statistical instability that plagues existing coreset methods when evaluation data is limited.
RepCore targets a fundamental challenge in AI evaluation: the prohibitive computational cost of benchmarking large language models comprehensively. As LLMs grow more capable and more expensive to evaluate, estimating full-benchmark performance from a smaller subset becomes increasingly valuable. The research shows that relying exclusively on binary correctness signals discards rich information encoded in model hidden states, the internal numerical representations that drive model decisions.
The method's significance lies in its applicability to newly released benchmarks. Traditional coreset selection needs stable statistical estimates across many source models, which creates a chicken-and-egg problem for fresh benchmarks with little evaluation history. RepCore sidesteps this by drawing on the source models' aligned hidden states rather than correctness labels alone, achieving reliable extrapolation with just ten source models instead of hundreds. This shortens the evaluation cycle for new benchmark releases and lowers the computational barrier for smaller research organizations.
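To make the idea of a representation-based coreset concrete, here is a minimal sketch. It is not RepCore's published algorithm: it assumes each benchmark item already has a feature vector derived from aligned hidden states (the `item_features` matrix is hypothetical), and it swaps in plain k-means clustering to pick one representative item per cluster, weighted by cluster size. The function names are illustrative only.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct items (fancy indexing copies).
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(c):
        # Squared Euclidean distance from every item to every centroid.
        d = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)

    for _ in range(iters):
        labels = assign(centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, assign(centroids)

def select_coreset(item_features, k, seed=0):
    """Pick the item nearest each centroid; weight it by its cluster's share."""
    X = np.asarray(item_features, dtype=float)
    centroids, labels = kmeans(X, k, seed=seed)
    chosen, weights = [], []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue  # skip clusters that ended up empty
        dist = ((X[members] - centroids[j]) ** 2).sum(axis=1)
        chosen.append(members[dist.argmin()])
        weights.append(members.size)
    weights = np.array(weights, dtype=float)
    return np.array(chosen), weights / weights.sum()

def estimate_accuracy(correct_on_subset, weights):
    """Weighted mean of a new model's 0/1 correctness on the chosen items."""
    return float(np.dot(weights, correct_on_subset))

# Synthetic stand-in for item features (assumption: 500 items, 16 dimensions).
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 16))
subset, w = select_coreset(features, k=20)
new_model_correct = rng.integers(0, 2, size=len(subset)).astype(float)
print(len(subset), estimate_accuracy(new_model_correct, w))
```

A new model then only needs to be run on the items in `subset`. The cluster-size weights are one simple way to extrapolate to the full benchmark, and the actual method may weight or select items differently.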
For the AI research community, the practical payoff is cheaper evaluation per model. The approach's consistency across five benchmarks and 200+ models suggests that it generalizes well. The spectral analysis reveals separable components, broad response tendencies versus task-specific reasoning, which suggests that the aligned representations capture fundamental aspects of model behavior rather than statistical artifacts. This understanding could inform both benchmark design and model evaluation protocols going forward. Industry practitioners building evaluation infrastructure can adopt RepCore to reduce benchmarking costs, while researchers get faster feedback loops for model development.
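The kind of spectral analysis mentioned above can be illustrated with a singular value decomposition. The source does not specify which decomposition was used, so this is only a sketch under stated assumptions: a hypothetical model-by-item matrix `R` is built synthetically from one broad factor shared by all items plus one weaker factor confined to half of them, and the SVD shows how much variance each component carries.

```python
import numpy as np

def spectral_components(R):
    """SVD of a models-by-items matrix after removing per-item means.

    Returns (U, S, Vt, explained), where explained[i] is the share of total
    variance carried by the i-th singular component.
    """
    R = np.asarray(R, dtype=float)
    centered = R - R.mean(axis=0, keepdims=True)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    return U, S, Vt, explained

def synthetic_matrix(n_models=10, n_items=200, seed=0):
    """Hypothetical data: a strong broad factor plus a weaker task-specific one."""
    rng = np.random.default_rng(seed)
    # Broad tendency: one score per model that affects every item.
    broad = 5.0 * np.linspace(-1.0, 1.0, n_models)[:, None]
    loadings = 1.0 + 0.05 * rng.normal(size=(1, n_items))
    # Task-specific skill: only affects the first half of the items.
    skill = rng.normal(size=(n_models, 1))
    task_mask = np.zeros((1, n_items))
    task_mask[0, : n_items // 2] = 1.0
    noise = 0.1 * rng.normal(size=(n_models, n_items))
    return broad * loadings + skill * task_mask + noise

R = synthetic_matrix()
U, S, Vt, explained = spectral_components(R)
print(np.round(explained[:3], 3))
```

In this toy setup the leading component tracks the broad factor and the next one picks up the item-subset factor. Whether real aligned representations separate this cleanly is an empirical question that the sketch does not answer.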
- RepCore uses aligned hidden states to construct representative benchmark subsets, achieving accurate performance estimation with minimal source models
- The method reduces reliance on full benchmark evaluation cycles by 70-90% while maintaining correlation accuracy
- Newly released benchmarks can now be evaluated reliably with as few as ten source models instead of hundreds
- Spectral analysis confirms aligned representations separate broad model tendencies from task-specific reasoning patterns
- The approach generalizes across five diverse benchmarks and 200+ models, demonstrating practical applicability at scale