Signed Compression Progress on a Sealed Audit is Goodhart-Resistant
Researchers prove that compression-based intrinsic motivation for AI agents resists reward hacking when implemented as signed loss decrease on a sealed audit panel. The mathematical guarantee shows cumulative reward telescopes to true model improvement, with bounded deviation proportional to the model class complexity, and experiments validate the theory against various exploitation attempts.
This research addresses a fundamental problem in AI alignment: designing reward signals that genuinely incentivize learning rather than gaming. Compression progress—rewarding agents when their world models predict experience better—has long been proposed as intrinsically motivated learning, but lacked formal security guarantees. The paper bridges this gap by proving that if intrinsic reward equals the signed decrease in a fixed, sealed-audit loss function, cumulative reward mathematically telescopes to actual audit improvement. This means no policy can indefinitely increase reward while model performance stagnates or degrades, eliminating a major class of specification gaming attacks.
The work builds on decades of intrinsic motivation research and recent advances in mechanized theorem proving. The authors provide a Lean 4 formalization of core results, lending credibility to their claims. Critically, they identify failure modes: clipping progress, scoring on the agent's own data stream, or applying high-capacity models to reusable panels all circumvent the guarantee. Empirical validation on ARC-TGI transformations demonstrates that finite-audit deviation scales predictably (n^{-0.527}), and signed compression resists common attacks like stream leakage and curiosity-driven farming.
For AI safety practitioners and alignment researchers, this provides theoretical foundations for building intrinsically motivated systems with formal guarantees. The sealed-audit framework could inform future AI training methodologies where model improvement transparency is critical. However, the practical applicability depends on whether sealed audits remain truly sealed in deployed systems and whether the model class complexity remains tractable. The paper does not directly impact cryptocurrency markets but reinforces alignment research that underpins safe AI development—a growing concern for institutional crypto stakeholders.
- →Signed compression progress on sealed audits mathematically guarantees cumulative reward reflects genuine model improvement, eliminating infinite reward hacking on stagnant performance.
- →Formal Lean 4 mechanization of telescoping proofs and finite-audit bounds provides verifiable assurance against common reward gaming attacks.
- →Finite-audit deviation scales as n^{-0.527}, allowing practitioners to bound false-positive exploitation via uniform loss-class concentration.
- →Sealed-audit security breaks under clipping, agent-stream scoring, reusable panels with high-capacity models, or vacuous complexity classes.
- →Experimental validation on grid-transformation tasks confirms theory resists stream leakage, curiosity-driven gaming, and noisy-reward manipulation.