Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning
Researchers reveal that AI models can possess stable factual knowledge while failing dramatically at compositional reasoning—assembling facts into logical chains—a problem invisible to standard benchmark metrics. The study introduces a diagnostic protocol showing post-training improvements mask directional shifts in composition capability, with failures often rooted in generation-time constraints rather than fundamental model limitations.
This research exposes a critical blind spot in how the AI industry evaluates language model capabilities. Current benchmarks aggregate performance across multi-hop reasoning tasks, creating an illusion of uniform improvement when models actually exhibit inconsistent composition behavior. The study demonstrates that two models with statistically identical atomic knowledge can differ by over 40 percentage points in compositional reasoning—a gap entirely obscured by aggregate scoring. The composition collapse phenomenon reveals that post-training objectives optimize for benchmark performance without reliably improving the underlying ability to chain facts together, a fundamental requirement for reliable reasoning systems. The double-gate protocol proposed here decomposes performance into three independent channels: atomic stability, residual composition, and critical depth. This methodology enables researchers to isolate composition failures that remain invisible to traditional metrics. The finding that substantial composition failures reflect generation-time computational constraints rather than permanent model limitations suggests these issues may be addressable through inference optimization. For AI development, this work challenges prevailing assumptions about post-training efficacy and highlights the danger of metric-driven optimization without mechanistic understanding. Organizations deploying language models for multi-step reasoning tasks should recognize that benchmark improvements don't guarantee compositional reliability. The research implies future post-training recipes need compositional-aware objectives and controlled evaluation protocols to measure genuine multi-hop reasoning improvements rather than aggregate score inflation.
- →Models with identical atomic knowledge can exhibit 40+ percentage point gaps in compositional reasoning, invisible to aggregate benchmarks.
- →Post-training improvements often shift composition capability in unexpected directions rather than uniformly enhancing multi-hop reasoning.
- →Generation-time computational constraints account for a substantial portion of measured composition failures, suggesting optimization pathways.
- →Current benchmark metrics fundamentally misrepresent multi-hop reasoning capabilities by treating them as single aggregated skills.
- →Diagnostic protocols controlling for atomic knowledge access are essential for accurately evaluating compositional reasoning improvements.