IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage
IRDS is a data selection method for reinforcement learning with verifiable rewards (RLVR) that uses sparse autoencoders to identify interpretable, high-value training instances. It reports accuracy gains of up to 4.0 percentage points on math reasoning benchmarks at roughly an order of magnitude lower computational cost than trajectory-based baselines.
IRDS addresses a fundamental challenge in modern LLM training: efficiently selecting which instances to use for reinforcement learning when verification signals are available. The method distinguishes itself by combining three typically conflicting objectives—subset-level coverage, verifier signal integration, and interpretability—into a single framework. By grounding data selection decisions in sparse autoencoder clusters, the approach makes selection auditable against recognizable problem patterns, enabling researchers to understand why specific instances were chosen rather than treating the process as a black box.
The research builds on the growing recognition that RLVR techniques substantially improve LLM reasoning capabilities, but that data inefficiency limits their practical deployment. Prior methods either overlooked coverage requirements, ignored verifier feedback, or produced opaque selection decisions. IRDS addresses all three gaps with a verifier-coupled coverage objective optimized by greedy log-determinant maximization, selecting instances that the model currently fails but can still learn from.
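To make the idea of greedy log-determinant maximization concrete, the following is a minimal sketch and not the authors' implementation. It assumes each instance has a feature vector (for example, sparse autoencoder activations) and an empirical verifier pass rate. The weight `p * (1 - p)`, the kernel construction, and the names `verifier_weight` and `greedy_logdet_select` are all illustrative assumptions, chosen only so that instances the model always or never solves receive zero weight.

```python
import numpy as np


def verifier_weight(pass_rate):
    """Hypothetical verifier coupling: zero when the model always or never
    passes, largest when it fails about half the time."""
    p = np.asarray(pass_rate, dtype=float)
    return p * (1.0 - p)


def greedy_logdet_select(features, pass_rate, k, ridge=1e-6):
    """Greedily pick k indices that maximize log det of a weighted kernel.

    Assumed kernel: L = diag(w) @ K @ diag(w), where K is cosine similarity
    between feature rows and w is the verifier weight above.
    """
    X = np.asarray(features, dtype=float)
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    X = X / norms
    w = verifier_weight(pass_rate)
    L = (w[:, None] * (X @ X.T)) * w[None, :]

    selected = []
    for _ in range(min(k, len(X))):
        best_i, best_val = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            idx = selected + [i]
            # Small ridge keeps the submatrix positive definite.
            sub = L[np.ix_(idx, idx)] + ridge * np.eye(len(idx))
            sign, logdet = np.linalg.slogdet(sub)
            if sign > 0 and logdet > best_val:
                best_i, best_val = i, logdet
        if best_i is None:
            break
        selected.append(best_i)
    return selected


# Toy data (made up): items 0 and 1 are near-duplicates, item 2 covers a
# different direction, item 3 is already solved (pass rate 1.0).
features = [[1.0, 0.0, 0.0],
            [0.99, 0.1, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
pass_rate = [0.5, 0.4, 0.3, 1.0]
selected = greedy_logdet_select(features, pass_rate, k=2)
print(selected)
```

On this toy input the sketch skips both the near-duplicate and the already-solved item, which shows how the determinant rewards coverage while the weight injects the verifier signal. The naive loop recomputes a determinant for every candidate, so a real implementation at scale would likely use incremental updates instead.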
The experimental results show gains across multiple model architectures and math reasoning benchmarks: 3.9-4.0 percentage points on Qwen models and 0.5 points on Llama-3.1-8B. IRDS reaches these results at roughly an order of magnitude lower computational cost than trajectory-based baselines, which makes the approach practically viable for scaling RLVR training to larger model families.
For the AI research community, this work represents incremental but important progress toward data-efficient, interpretable training methodologies. The interpretability component addresses growing concerns about opaque AI systems by ensuring training decisions remain auditable. Future applications could extend this framework to other verification domains beyond mathematics.
- IRDS combines data selection, verifier feedback, and interpretability using sparse autoencoders to identify high-value training instances
- Method achieves 3.9-4.0pp accuracy improvements on Qwen models and 0.5pp on Llama-3.1-8B across math reasoning benchmarks
- Computational efficiency is an order of magnitude better than trajectory-based baselines while improving performance
- Sparse autoencoder clusters enable auditable selection decisions grounded in recognizable problem patterns
- Approach addresses data inefficiency bottleneck in reinforcement learning with verifiable rewards for LLM reasoning