LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training
Researchers introduce LaRA, a framework for detecting data contamination in reinforcement learning post-trained large language models by analyzing layer-wise representations. The method identifies contamination through geometric deviations across neural network layers, outperforming existing detection approaches that rely on output-level signals unreliable for RL-trained models.
Data contamination in machine learning—where test data leaks into training sets—represents a critical threat to model evaluation integrity. Traditional detection methods measure output signals like token likelihood or entropy, but these metrics become ineffective for RL post-trained models since reinforcement learning optimizes trajectory-level rewards rather than token probabilities. LaRA addresses this gap by analyzing internal representation geometry across network layers, introducing three complementary metrics: perturbation sensitivity, directional collapse, and local representation rigidity. The framework detects contamination through controlled perturbations and identifies systematic patterns in how neural representations degrade when trained on contaminated data.
This research reflects growing maturity in LLM safety and evaluation practices. As reasoning models improve through RL post-training, benchmarking becomes increasingly difficult—subtle contamination can artificially inflate performance metrics and mislead the research community about actual capabilities. LaRA's layer-wise approach provides a more robust detection mechanism than surface-level output analysis, offering practical value for model developers validating their training pipelines.
For the AI development community, this work enables more trustworthy model evaluation and reduces the risk of publishing results based on contaminated benchmarks. Developers can now implement LaRA protocols during post-training to verify data integrity before public release. The methodology also advances understanding of how RL shapes internal model representations, with implications for interpretability research. Teams building frontier reasoning models should consider adopting representation-level analysis as standard practice, alongside traditional output metrics, to maintain evaluation credibility and ensure generalization to truly novel problems.
- →LaRA detects RL post-training data contamination through layer-wise representation analysis rather than unreliable output-level signals
- →Contaminated data produces measurable geometric deviations including amplified perturbation sensitivity and directional collapse across neural layers
- →The framework outperforms existing baselines and provides practical contamination detection for model developers
- →Strengthens evaluation reliability for reasoning models where RL optimization bypasses traditional likelihood-based metrics
- →Addresses critical gap in AI safety practices as RL post-training becomes standard in frontier LLM development