AI · Bearish · arXiv – CS AI · 18h ago · 7/10
Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics
Researchers demonstrate that generative perplexity (gen-PPL), the primary metric for evaluating non-autoregressive language models, is fundamentally flawed: it measures only how predictable generated text is to a frozen scorer model, not the quality of the text itself. They construct deliberately naive samplers that reach state-of-the-art gen-PPL scores while producing incoherent text, which exposes the metric's inadequacy, and they advocate distributional divergence metrics instead.
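The failure mode described above can be illustrated with a toy sketch. This is not the paper's sampler or its proposed metric: the bigram scorer, the tiny corpus, the looping "sampler", and the unigram Jensen-Shannon divergence are all assumptions chosen only to show how a frozen scorer can reward predictable but degenerate text while a distributional comparison penalizes it.

```python
import math
from collections import Counter

# Hypothetical reference corpus standing in for real text.
CORPUS = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat saw the dog . the dog saw the cat ."
).split()
VOCAB = sorted(set(CORPUS))


def train_bigram(tokens, vocab, alpha=0.1):
    """Build a frozen scorer: an add-alpha smoothed bigram model P(next | prev)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])

    def prob(prev, nxt):
        return (pair_counts[(prev, nxt)] + alpha) / (
            prev_counts[prev] + alpha * len(vocab)
        )

    return prob


def gen_ppl(sample, prob):
    """Perplexity of a sample under the frozen scorer (lower = more predictable)."""
    nll = -sum(math.log(prob(p, n)) for p, n in zip(sample, sample[1:]))
    return math.exp(nll / (len(sample) - 1))


def unigram_dist(tokens, vocab):
    """Relative frequency of each vocabulary item in the token list."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


scorer = train_bigram(CORPUS, VOCAB)

# A plausible sentence pair versus a naive sampler that loops a likely phrase.
natural = "the dog sat on the mat . the cat saw the dog .".split()
degenerate = "the cat sat on".split() * 5

reference = unigram_dist(CORPUS, VOCAB)
for name, sample in [("natural", natural), ("degenerate", degenerate)]:
    ppl = gen_ppl(sample, scorer)
    jsd = js_divergence(unigram_dist(sample, VOCAB), reference)
    print(f"{name}: gen-PPL={ppl:.3f}  unigram-JSD={jsd:.3f}")
```

In this toy setup the looping sample gets the lower (better) perplexity even though it is the less coherent text, while its unigram distribution sits farther from the reference corpus. Real evaluations would use a large pretrained scorer and a richer divergence measure; the sketch only shows the direction of the effect.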