Ontology-constrained multi-LLM scoring of hypothesis support in the predictive processing literature
Researchers developed a multi-LLM pipeline that uses ontology-constrained scoring to synthesize fragmented predictive coding neuroscience literature into quantifiable evidence spaces. The system scored 31 studies across ten language models using a 36-concept glossary, revealing structured disagreement patterns between experimental contexts and introducing 'hypothesis-space temperature' as a novel metric for measuring research dispersion.
This research addresses a critical challenge in interdisciplinary science: synthesizing heterogeneous literature when traditional meta-analysis frameworks fail. Predictive coding neuroscience exemplifies this fragmentation problem, spanning computational theory, electrophysiology, imaging, and behavioral studies with incompatible methodological approaches. The authors' solution leverages large language models as consensus-building tools, constrained by expert-validated ontologies rather than allowed to generate unchecked interpretations.
The multi-LLM council approach changes how literature synthesis is carried out. Ten local language models each score the evidence against predefined glossary terms, so the pipeline produces auditable measurements of where the models disagree, a transparency feature absent from conventional meta-analyses. The finding that agreement varies significantly between local and global oddball paradigms shows that the system is sensitive to differences in experimental context that human reviewers might conflate.
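As a rough illustration of how per-concept disagreement across a council could be quantified, the sketch below computes the spread of scores that several models assign to the same glossary concepts for one study. The summary does not describe the authors' scoring scale or aggregation rule, so everything here is an assumption: the model names, the concept names, the integer score range, and the choice of population standard deviation as the disagreement measure are all hypothetical.

```python
from statistics import pstdev


def concept_disagreement(scores):
    """Return the per-concept spread of scores across models.

    `scores` maps a model name to a dict of {concept: numeric score}.
    Only concepts scored by every model are compared. The spread is the
    population standard deviation (an assumed disagreement measure, not
    necessarily the one used in the study).
    """
    shared = set.intersection(*(set(s) for s in scores.values()))
    return {
        concept: pstdev([model_scores[concept] for model_scores in scores.values()])
        for concept in sorted(shared)
    }


# Hypothetical scores from three models for a single study.
example_scores = {
    "model_a": {"prediction_error": 2, "precision_weighting": 1},
    "model_b": {"prediction_error": 2, "precision_weighting": -1},
    "model_c": {"prediction_error": 2, "precision_weighting": 0},
}

disagreement = concept_disagreement(example_scores)
for concept, spread in disagreement.items():
    print(f"{concept}: {spread:.3f}")
```

A spread of zero means the models agree on that concept for that study. Larger values mark the concepts where the council splits, which is the kind of record an audit could inspect.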
The introduction of hypothesis-space temperature as a geometric dispersion metric extends beyond literature cataloging into quantitative mapping. Lower temperature in local contexts versus higher in global contexts suggests that experimental design fundamentally influences evidence clustering. This geometric framework transforms categorical agreement into continuous spatial relationships, enabling researchers to visualize research landscape topology.
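The summary does not give a formula for hypothesis-space temperature, so the sketch below is not the authors' definition. It uses a generic stand-in for geometric dispersion: the mean Euclidean distance of points from their centroid. The function name `dispersion`, the two-dimensional coordinates, and the sample points are hypothetical and serve only to show how a tightly clustered set of studies yields a lower value than a scattered one.

```python
from math import sqrt
from statistics import mean


def dispersion(points):
    """Mean Euclidean distance of each point from the centroid.

    This is an assumed stand-in for a dispersion metric, not the
    published definition of hypothesis-space temperature.
    """
    centroid = [mean(axis) for axis in zip(*points)]
    distances = [
        sqrt(sum((x - c) ** 2 for x, c in zip(point, centroid)))
        for point in points
    ]
    return mean(distances)


# Hypothetical study positions in a 2-D evidence space.
tight_cluster = [(1.0, 1.0), (1.0, 1.2), (1.2, 1.0)]
wide_cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]

print(dispersion(tight_cluster) < dispersion(wide_cluster))  # True
```

Under this stand-in, a "cooler" context is one whose studies sit close to a shared centroid, and a "hotter" context is one whose studies are spread across the space.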
For AI and computational neuroscience communities, this work suggests that LLM-assisted synthesis can serve as a knowledge integration tool when it is properly constrained. The authors claim that the framework could address synthesis problems in other domains that lack a common comparison space, which points to broader applications in meta-science infrastructure. Future adoption depends on whether domain experts consistently validate such systems' performance across diverse fields and whether regulatory or publication standards emerge around LLM-assisted evidence synthesis.
- Multi-LLM councils produce quantifiable, auditable disagreement measurements that reveal structured patterns conventional meta-analysis misses.
- Ontology-constrained prompting with expert validation prevents LLM hallucination while maintaining analytical flexibility.
- Hypothesis-space temperature metrics enable geometric visualization of research dispersion across experimental contexts.
- Evidence disagreement varies systematically between local and global oddball paradigms, suggesting methodological context fundamentally shapes findings.
- This framework potentially generalizes to cross-disciplinary literature synthesis where traditional meta-analysis lacks unified comparison spaces.