🧠 AI | 🔴 Bearish | Importance 7/10

How reliable are LLMs when it comes to playing dice?

arXiv – CS AI|Luca Avena, Gianmarco Bet, Bernardo Busoni|June 8, 2026 at 04:00 AM

🤖 AI Summary

A comprehensive study of eight state-of-the-art language models reveals significant limitations in probabilistic reasoning, with accuracy dropping from 96% on standard problems to 59% on counterintuitive ones. The research shows that LLMs are vulnerable to token bias and prompt manipulation, suggesting they lack genuine probabilistic reasoning despite excelling at other mathematical tasks.

Analysis

The research exposes a gap between LLM performance on routine exercises and on problems that demand genuine probabilistic reasoning. The models achieve 96% accuracy on standard probability exercises but fall to 59% on counterintuitive ones, which suggests they rely on pattern matching rather than principled reasoning. This matters for AI development and deployment, particularly in domains that require robust decision-making under uncertainty.
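To make "counterintuitive" concrete, here is a minimal sketch of the kind of dice question where the intuitive answer and the correct answer diverge. The problem is a classic textbook example chosen for illustration, not one taken from the study, and the helper name `conditional_probability` is hypothetical. It enumerates all 36 rolls of two fair dice and computes exact conditional probabilities.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(event, condition, sides=6, dice=2):
    """Exact P(event | condition) for fair dice, by enumerating every roll."""
    outcomes = list(product(range(1, sides + 1), repeat=dice))
    conditioned = [roll for roll in outcomes if condition(roll)]
    if not conditioned:
        raise ValueError("condition never occurs")
    favorable = [roll for roll in conditioned if event(roll)]
    return Fraction(len(favorable), len(conditioned))


def both_six(roll):
    return all(die == 6 for die in roll)


def at_least_one_six(roll):
    return 6 in roll


def first_is_six(roll):
    return roll[0] == 6


# "At least one die shows a six": 11 rolls qualify, only one is a double six.
p_at_least = conditional_probability(both_six, at_least_one_six)

# "The first die shows a six": 6 rolls qualify, one is a double six.
p_first = conditional_probability(both_six, first_is_six)

print(p_at_least)  # 1/11
print(p_first)     # 1/6
```

The two conditions sound nearly identical, yet the answers are 1/11 and 1/6. A solver that matches on surface wording rather than on the sample space can easily give 1/6 for both.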

Token bias, in which performance drops by 20% or more when a problem is merely reformulated in a non-canonical way, suggests LLMs have learned surface-level associations rather than semantic understanding. More concerning, misleading prompts degrade performance by up to 34%, and no model tested is immune, even when the underlying mathematics is identical. This vulnerability is more than an academic concern: it affects any real-world application where LLMs inform decisions involving probabilities, risk assessment, or statistical analysis.
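This summary does not say whether the reported drops are measured in percentage points or as a share of baseline accuracy, and the two readings can differ noticeably. As a rough sketch, the hypothetical helper `accuracy_drop` below computes both for the one pair of figures given above (96% and 59%); it makes no claim about which convention the paper uses.

```python
from typing import Tuple


def accuracy_drop(baseline: float, degraded: float) -> Tuple[float, float]:
    """Return (drop in percentage points, drop relative to baseline in percent)."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    absolute = baseline - degraded
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Figures from the summary: 96% on standard problems, 59% on counterintuitive ones.
points, percent = accuracy_drop(96.0, 59.0)
print(f"{points:.1f} percentage points, {percent:.1f}% relative")
# 37.0 percentage points, 38.5% relative
```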

For the AI industry, this study lends empirical support to limitations that practitioners have suspected. The consistency of failure patterns across eight different state-of-the-art models suggests the problem lies in architecture or training data rather than in any single implementation. These findings will likely influence how enterprises integrate LLMs into decision-critical systems, especially in finance, healthcare, and insurance, where probabilistic reasoning directly affects outcomes.

The path forward requires either architectural innovations addressing true reasoning capabilities or careful guardrails limiting LLM deployment to contexts where pattern recognition suffices. The research demonstrates that current scaling approaches have not solved fundamental reasoning problems, challenging the assumption that larger models automatically become better reasoners.

Key Takeaways

→ LLMs achieve 96% accuracy on standard probability problems but only 59% on counterintuitive ones, revealing genuine reasoning gaps
→ Token bias causes performance drops of 20% or more when problems are reformulated without changing their meaning
→ Misleading prompts reduce accuracy by up to 34%, with no model showing immunity
→ Current LLMs appear to rely on pattern matching rather than principled probabilistic reasoning
→ Results have significant implications for deploying LLMs in risk-critical applications