Coherence Under Commitment: Probing Generalization and Vacuous Memorization in LLM Logical Reasoning
Researchers introduce Coherence Under Commitment (CUC), a new evaluation framework that exposes a critical flaw in LLM logical reasoning: models can achieve coherence by refusing to make decisions rather than reasoning correctly. Testing on small language models reveals a stark trade-off where more decisive models contradict themselves frequently, while conservative models abstain from answering.
The research identifies a fundamental blind spot in how large language models are evaluated. Current metrics for logical consistency reward models that sidestep commitment entirely: a model that responds "uncertain" to every query achieves perfect coherence while providing zero practical utility. This is a systemic weakness in how AI systems are benchmarked for knowledge-intensive tasks where decisiveness matters.
The CUC framework addresses this through dual-metric assessment combining coherence scores with commitment measurement. The commitment score c(φ) = p(φ) + p(¬φ) quantifies how much probability mass a model allocates to definitive answers rather than hedging. Testing across four open-weight models reveals troubling patterns: Qwen2.5-3B achieves near-perfect logical consistency (0.025 contradiction rate) but answers only 7.4% of questions, while TinyLlama-1.1B attempts 79.4% of questions but violates coherence on every example.
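To make the dual-metric idea concrete, the following is a minimal sketch, not the authors' implementation. It computes the commitment score c(φ) = p(φ) + p(¬φ) from the text and reports commitment, abstention, and contradiction rates over a set of items. The `Item` class, the threshold `tau`, and the rule that a contradiction means endorsing both φ and ¬φ above `tau` are illustrative assumptions; the paper's exact coherence criterion is not given here.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical container for one test item (not from the paper)."""
    p_phi: float      # probability mass the model puts on affirming phi
    p_not_phi: float  # probability mass the model puts on affirming not-phi


def commitment(item: Item) -> float:
    """Commitment score c(phi) = p(phi) + p(not phi), as defined in the text."""
    return item.p_phi + item.p_not_phi


def is_committed(item: Item, tau: float = 0.5) -> bool:
    """Assumption: an item counts as answered when c(phi) reaches tau."""
    return commitment(item) >= tau


def is_contradictory(item: Item, tau: float = 0.5) -> bool:
    """Assumption: a contradiction is endorsing both phi and not-phi above tau."""
    return item.p_phi > tau and item.p_not_phi > tau


def summarize(items: list[Item], tau: float = 0.5) -> dict[str, float]:
    """Report commitment, abstention, and contradiction rates over all items."""
    if not items:
        raise ValueError("need at least one item")
    n = len(items)
    commit_rate = sum(is_committed(i, tau) for i in items) / n
    contradiction_rate = sum(is_contradictory(i, tau) for i in items) / n
    return {
        "commit_rate": commit_rate,
        "abstention_rate": 1.0 - commit_rate,
        "contradiction_rate": contradiction_rate,
    }
```

Under these assumptions, a model that always answers "uncertain" (every item has `p_phi = p_not_phi = 0.0`) gets a contradiction rate of 0.0 and an abstention rate of 1.0, which is why the two numbers need to be read together rather than separately.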
This work carries implications for AI deployment in domains requiring reliable reasoning, such as legal analysis, medical diagnosis, and scientific research. Developers and researchers must recognize that abstention-based safety represents a hollow victory. The toolkit's generalization to LogiQA v2 (0.97 correlation) suggests these patterns hold across reasoning benchmarks, not just isolated test cases.
The research doesn't directly impact crypto markets but addresses underlying AI reliability concerns affecting enterprise adoption of language models. Organizations deploying LLMs for mission-critical tasks must now demand both coherence metrics and commitment evidence, potentially slowing adoption timelines until models demonstrate genuine reasoning capability rather than sophisticated hedging.
- LLMs can achieve logical coherence through systematic abstention rather than actual reasoning capability
- Qwen2.5-3B shows a 92.6% abstention rate (it answers only 7.4% of questions) despite near-zero contradictions, exposing evaluation metric limitations
- The CUC framework introduces commitment scoring to measure decisiveness alongside consistency in logical reasoning
- The coherence-commitment frontier generalizes across benchmarks, suggesting this is a systemic model behavior pattern
- Current AI evaluation protocols may misleadingly rank conservative, non-committal models above those attempting genuine reasoning