Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text
Researchers identify a critical flaw in machine-generated text detection: token-level likelihood signals vary inconsistently across a detector model's hidden space, inducing a Simpson's paradox that undermines existing detectors. They propose a learned local calibration method that substantially improves detection performance, with calibrated variants raising AUROC from 0.63 to 0.85 on GPT-5.4 text.
The paper addresses a fundamental challenge in AI safety: distinguishing human-written text from large language model outputs. Current detection methods rely on the likelihood hypothesis—the assumption that machine text appears more probable to detector models—but this approach suffers from a subtle statistical problem. Token scores carry strong discriminative signal, but that signal is non-uniformly distributed across the model's internal representation space. When detectors naively average these scores across regions with different statistical structure, they inadvertently create a Simpson's paradox: within each region machine text scores consistently higher, yet aggregation across regions can reverse the ordering and destroy what should be a strong overall signal.
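The aggregation failure can be illustrated with a toy numeric example (all numbers below are invented for illustration, not taken from the paper). Suppose hidden space splits into a "common-token" region and a "rare-token" region: machine tokens score higher than human tokens within each region, but because human text visits the high-scoring region more often, the naively pooled means invert.

```python
# Toy Simpson's-paradox demo (hypothetical numbers): machine tokens
# out-score human tokens inside every region, yet naive pooling across
# regions reverses the ordering because the two sources visit the
# regions at different rates.

# Mean token log-likelihood by (source, region).
scores = {
    ("human",   "common"): -1.0,
    ("machine", "common"): -0.8,   # machine > human within this region
    ("human",   "rare"):   -4.0,
    ("machine", "rare"):   -3.8,   # machine > human within this region
}

# Fraction of each source's tokens that fall in each region.
mix = {
    "human":   {"common": 0.9, "rare": 0.1},
    "machine": {"common": 0.5, "rare": 0.5},
}

def pooled_mean(source):
    """Naive aggregate: average token scores, ignoring region."""
    return sum(mix[source][r] * scores[(source, r)] for r in ("common", "rare"))

human_pooled = pooled_mean("human")      # 0.9*(-1.0) + 0.1*(-4.0) = -1.3
machine_pooled = pooled_mean("machine")  # 0.5*(-0.8) + 0.5*(-3.8) = -2.3

# Machine wins in every region, yet loses after naive pooling.
assert scores[("machine", "common")] > scores[("human", "common")]
assert scores[("machine", "rare")] > scores[("human", "rare")]
assert human_pooled > machine_pooled
print(human_pooled, machine_pooled)
```

A detector thresholding the pooled mean would label the human text as more "machine-like" here, even though the per-region signal points the other way.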
The researchers' solution introduces learned local calibration grounded in Bayesian decision theory. Rather than directly averaging raw likelihood scores, they train lightweight predictors to estimate score distributions conditioned on position in hidden space, then aggregate calibrated log-likelihood ratios. This modular approach works across different detector architectures and datasets, suggesting the underlying problem is systemic rather than detector-specific.
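A minimal sketch of this idea follows, assuming discrete region labels as a stand-in for position in hidden space (the paper's predictors and distributional models are richer; the Gaussian fits, region labels, and all numeric parameters here are invented for illustration). Score distributions are estimated per region and per class from training samples, and a document is scored by summing region-conditioned log-likelihood ratios rather than averaging raw scores.

```python
# Sketch of learned local calibration (NOT the paper's implementation):
# estimate the token-score distribution per hidden-space region for each
# class, then aggregate per-token log-likelihood ratios (machine vs human)
# instead of naively averaging raw scores.
import math
import random

random.seed(0)

def fit_gaussian(xs):
    """Lightweight per-region 'predictor': a fitted Gaussian."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs) + 1e-6
    return mu, math.sqrt(var)

def log_normal_pdf(x, mu, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)

# True (invented) score distributions: machine scores sit slightly above
# human scores in every region.
params = {  # (region, class) -> (mu, sd)
    ("common", "human"): (-1.0, 0.5), ("common", "machine"): (-0.6, 0.5),
    ("rare",   "human"): (-4.0, 0.5), ("rare",   "machine"): (-3.6, 0.5),
}

# "Training": fit a Gaussian per (region, class) from sampled scores.
fitted = {k: fit_gaussian([random.gauss(mu, sd) for _ in range(2000)])
          for k, (mu, sd) in params.items()}

def calibrated_score(tokens):
    """Sum of region-conditioned log-likelihood ratios over a document."""
    return sum(log_normal_pdf(s, *fitted[(r, "machine")])
               - log_normal_pdf(s, *fitted[(r, "human")])
               for r, s in tokens)

def naive_mean(tokens):
    """Baseline aggregate: mean raw token score, ignoring region."""
    return sum(s for _, s in tokens) / len(tokens)

# Documents = lists of (region, token score); sources visit the regions
# at different rates, recreating the aggregation trap.
machine_doc = ([("common", random.gauss(-0.6, 0.5)) for _ in range(50)]
               + [("rare", random.gauss(-3.6, 0.5)) for _ in range(50)])
human_doc = ([("common", random.gauss(-1.0, 0.5)) for _ in range(90)]
             + [("rare", random.gauss(-4.0, 0.5)) for _ in range(10)])

# Naive averaging ranks the human document as more machine-like;
# the calibrated LLR aggregate recovers the correct ordering.
print(naive_mean(human_doc) > naive_mean(machine_doc))
print(calibrated_score(machine_doc) > calibrated_score(human_doc))
```

The design point is that the per-region fits factor out region base rates: each token is compared only against the score distributions of its own region, so differing region occupancy between sources can no longer invert the aggregate.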
The practical implications are substantial. Detection accuracy improvements from 0.63 to 0.85 AUROC represent a meaningful step toward reliable AI-generated content identification—increasingly important as language models become more capable. The work has implications for content moderation, academic integrity, and misinformation detection. The authors emphasize their contribution isn't a new detector but a diagnostic framework and remedy that can enhance existing systems.
Future developments likely include richer distributional models, improved calibration strategies, and ensemble methods that incorporate hidden-space geometry. This foundational work establishes a clearer path for the detection research community to build upon.
- Simpson's paradox in token aggregation destroys detection signals despite strong local discriminative power in machine-generated text
- Learned local calibration improves detection AUROC from 0.63 to 0.85 on GPT-5.4 text across all tested detectors
- The proposed remedy is modular and compatible with existing detector architectures, requiring no fundamental redesign
- Non-uniform signal distribution across hidden space reveals a systemic flaw affecting the entire current detection paradigm
- Bayesian decision theory-grounded calibration provides a principled framework for future detection improvements