Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning
Researchers demonstrate that large language models fail to accurately predict gene expression changes in cellular perturbation experiments despite producing biologically plausible explanations. They introduce CORE, a contrastive evidence method that substantially improves prediction accuracy by organizing evidence from related perturbations rather than evaluating each perturbation in isolation.
The research reveals a critical gap between plausibility and accuracy in LLM-based biological prediction systems. While these models generate explanations that sound scientifically reasonable, they systematically overestimate differential expression and often underperform simple baseline models, indicating they rely on general gene response patterns rather than understanding perturbation-specific mechanisms. This distinction matters because it exposes a fundamental limitation in how knowledge-driven AI systems process biological evidence.
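To make the gap between general gene response patterns and perturbation-specific prediction concrete, here is a minimal sketch on invented toy data. It is not the paper's evaluation code; the records, the `gene_rate_scorer` helper, and the perturbation and gene names are all hypothetical. The scorer rates every perturbation-gene pair by how often that gene responds overall, ignoring which perturbation is applied.

```python
from collections import defaultdict

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def per_gene_auroc(records, score_fn):
    """Average AUROC computed separately within each gene, across perturbations."""
    by_gene = defaultdict(list)
    for pert, gene, label in records:
        by_gene[gene].append((label, score_fn(pert, gene)))
    values = []
    for pairs in by_gene.values():
        labels, scores = zip(*pairs)
        value = auroc(labels, scores)
        if value is not None:
            values.append(value)
    return sum(values) / len(values) if values else None

def gene_rate_scorer(records):
    """Hypothetical baseline: score a pair by its gene's overall response rate."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _, gene, label in records:
        hits[gene] += label
        totals[gene] += 1
    return lambda pert, gene: hits[gene] / totals[gene]

# Invented toy data: (perturbation, gene, label); label 1 = differentially expressed.
records = [
    ("pA", "G1", 1), ("pB", "G1", 1), ("pC", "G1", 0),
    ("pA", "G2", 0), ("pB", "G2", 0), ("pC", "G2", 1),
]

score = gene_rate_scorer(records)
pooled = auroc([y for _, _, y in records], [score(p, g) for p, g, _ in records])
per_gene = per_gene_auroc(records, score)
print(f"pooled AUROC:   {pooled:.3f}")
print(f"per-gene AUROC: {per_gene:.3f}")
```

On this toy data the pooled AUROC is about 0.67 while the per-gene AUROC is exactly 0.5. A scorer that only knows which genes tend to respond can look informative when all pairs are pooled, yet it cannot separate perturbations within a single gene, which is what a per-gene metric measures.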
The problem stems from methodology: existing approaches evaluate each perturbation-gene pair independently, preventing models from learning how similar perturbations produce different outcomes on the same gene. The CORE framework addresses this by framing prediction as a comparative task, using biomedical knowledge graphs to present both positive and negative examples from related experiments. Results show substantial improvements: gains of up to 28.6% on drug-perturbation data, and a rise in per-gene AUROC from chance level (0.5) to 0.703 across cell lines.
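The comparative framing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not CORE's actual implementation: the toy knowledge graph `KG`, the `OBSERVED` outcomes, the neighbor-overlap ranking, and the prompt wording are all hypothetical.

```python
# Hypothetical toy knowledge graph: perturbation -> annotated targets or pathways.
KG = {
    "drug_A": {"EGFR", "MAPK"},
    "drug_B": {"EGFR", "PI3K"},
    "drug_C": {"MAPK"},
    "drug_D": {"TUBB"},
}

# Hypothetical observed outcomes: (perturbation, gene) -> 1 if differentially expressed.
OBSERVED = {
    ("drug_B", "FOS"): 1,
    ("drug_C", "FOS"): 0,
    ("drug_D", "FOS"): 1,
}

def related(query, kg):
    """Other perturbations sharing at least one KG neighbor, most overlap first."""
    q = kg[query]
    scored = [(len(q & nbrs), name) for name, nbrs in kg.items() if name != query]
    ranked = sorted(scored, key=lambda t: (-t[0], t[1]))
    return [name for overlap, name in ranked if overlap > 0]

def contrastive_evidence(query, gene, kg, observed, k=2):
    """Collect up to k positive and k negative examples for the same gene."""
    pos, neg = [], []
    for name in related(query, kg):
        label = observed.get((name, gene))
        if label == 1 and len(pos) < k:
            pos.append(name)
        elif label == 0 and len(neg) < k:
            neg.append(name)
    return pos, neg

def build_prompt(query, gene, kg, observed):
    """Lay out both outcomes side by side so the model must compare, not just explain."""
    pos, neg = contrastive_evidence(query, gene, kg, observed)
    lines = [f"Question: does {query} change the expression of {gene}?"]
    lines += [f"Related perturbation {p} DID change {gene}." for p in pos]
    lines += [f"Related perturbation {n} did NOT change {gene}." for n in neg]
    lines.append(f"Compare {query} with both groups before answering yes or no.")
    return "\n".join(lines)

pos, neg = contrastive_evidence("drug_A", "FOS", KG, OBSERVED)
prompt = build_prompt("drug_A", "FOS", KG, OBSERVED)
print(prompt)
```

In this sketch `drug_D` is excluded because it shares no neighbor with the query, even though it has an observed outcome for the gene. The design point is that the prompt contains related cases with opposite outcomes for the same gene, so a generic "this gene usually responds" argument no longer fits all the evidence.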
This research has implications for computational biology and AI development broadly. In drug discovery and precision medicine, accurate perturbation prediction could reduce costly experimental validation. The findings also highlight how prompt design and evidence organization shape AI reasoning capabilities, a point that extends beyond biology to other domains requiring causal inference from sparse data. The work suggests that future LLM applications in scientific domains should organize evidence to support contrastive reasoning rather than analyze each case in isolation.
- LLMs produce biologically plausible but inaccurate perturbation predictions, conflating general gene response patterns with true mechanistic understanding.
- CORE's contrastive evidence approach improves prediction accuracy by up to 28.6% by organizing evidence from related perturbations rather than evaluating pairs in isolation.
- Current evaluation methods can mask model failures, because a biologically plausible explanation does not guarantee predictive accuracy on unobserved conditions.
- The research demonstrates that how evidence is organized critically influences LLM reasoning quality in scientific prediction tasks.
- Contrastive evidence frameworks could enhance LLM performance across domains requiring causal inference from limited experimental data.