REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge
Researchers introduce REAL, a reinforcement learning framework for training LLMs used as automated evaluators. It accounts for the ordinal relationships in scoring tasks instead of treating each output as simply right or wrong. The method improves performance across model scales, with gains of up to +8.40 in Pearson correlation on Qwen3-32B over supervised fine-tuning baselines.
The paper addresses a fundamental limitation in how reinforcement learning trains language models for evaluation tasks. Standard RL approaches collapse scoring into binary rewards and so cannot distinguish near misses from far misses. That distinction is critical in regression problems: when the true score is 5, predicting 4 is a much smaller error than predicting 1, yet a binary reward penalizes both equally. This limitation has practical implications for any system relying on the LLM-as-a-Judge paradigm, from content moderation to output ranking in production systems.
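The near-miss versus far-miss distinction can be illustrated with a small sketch. The function names and the squared-error reward shape below are assumptions chosen for illustration; they are not the reward used in the REAL paper.

```python
def binary_reward(pred: int, truth: int) -> float:
    """Exact-match reward: every wrong score is treated the same."""
    return 1.0 if pred == truth else 0.0


def regression_aware_reward(pred: int, truth: int,
                            min_score: int = 1, max_score: int = 5) -> float:
    """Hypothetical ordinal reward: 1 minus the squared, range-normalized error.

    The squared-error form is an illustrative assumption, not REAL's objective.
    """
    span = max_score - min_score
    return 1.0 - ((pred - truth) / span) ** 2


truth = 5
for pred in (4, 1):
    print(pred, binary_reward(pred, truth), regression_aware_reward(pred, truth))
```

With a true score of 5, the binary reward gives the same signal for predictions of 4 and 1, while the distance-based reward ranks the prediction of 4 well above the prediction of 1.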
REAL's innovation lies in using generalized policy gradient estimation to decompose the optimization problem into exploration and regression-aware refinement. This addresses a key challenge: standard policy gradient methods assume the reward function is independent of the policy parameters, but regression objectives inherently depend on the policy itself. Because the framework can optimize correlation metrics and regression targets at the same time, it is positioned as a more principled solution than existing approaches limited to supervised fine-tuning.
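The policy-dependence issue can be made concrete with a short derivation. This is the generic product-rule identity for an objective whose reward depends on the policy parameters, not a reproduction of the paper's derivation. The symbols J, r, π_θ, x (the item being judged), and y (the judge's output) are notation assumed here, and reading the two terms as counterparts of REAL's exploration and refinement components is an interpretation rather than something the summary confirms.

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r(y, \theta) \big] \\
\nabla_\theta J(\theta)
  &= \underbrace{\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r(y, \theta)\, \nabla_\theta \log \pi_\theta(y \mid x) \big]}_{\text{score-function term}}
   + \underbrace{\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ \nabla_\theta r(y, \theta) \big]}_{\text{term from the policy-dependent reward}}
\end{align}
```

When the reward does not depend on θ, the second term vanishes and the expression reduces to the standard policy gradient estimator. A regression objective that depends on the policy keeps the second term, which is why the standard estimator alone is incomplete in that setting.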
The empirical results show consistent improvements across model scales from 8B to 32B parameters, with particularly strong out-of-domain generalization. This suggests the method learns more robust evaluation principles instead of memorizing training patterns. For developers deploying LLM-as-a-Judge systems, the research indicates that incorporating regression awareness into RL training pipelines yields measurably better evaluation quality. The work establishes a methodological foundation for training more discriminative evaluators, which in turn benefits any downstream application that relies on model ranking or selection.
- REAL framework enables RL to optimize ordinal scoring tasks by recognizing that prediction quality exists on a spectrum rather than binary categories
- Achieves +8.40 Pearson and +7.20 Spearman correlation improvements on Qwen3-32B versus SFT baselines through regression-aware optimization
- Generalizes better to out-of-domain benchmarks compared to both supervised and standard RL approaches, indicating more robust evaluation principles
- Solves the technical challenge of policy-dependent objectives through generalized policy gradient decomposition into exploration and refinement components
- Methodology applies broadly to any LLM evaluation system requiring nuanced scoring rather than binary pass-fail judgments
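For readers unfamiliar with the two metrics cited above, the sketch below shows how Pearson and Spearman correlations between a judge's scores and reference scores can be computed. It uses only the Python standard library; the function names and the sample scores are hypothetical and are not taken from the paper's evaluation code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores, for illustration only.
reference_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 4, 9, 16, 25]
print(pearson(reference_scores, judge_scores))
print(spearman(reference_scores, judge_scores))
```

In this example the judge orders every item correctly but on a nonlinear scale, so the Spearman correlation is perfect while the Pearson correlation is lower. Reporting both metrics separates ranking quality from agreement on the scores themselves.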