🧠 AI | ⚪ Neutral | Importance 6/10

Predicting Causal Effects from Natural Language Queries using Structured Representations

arXiv – CS AI | Giuliano Martinelli, Piriyakorn Piriyatamwong, Abelardo Carlos Martinez Lorenzo, Jasmin Baier, Riccardo Orlando, Satvik Garg, Sharif Kazemi, Linxi Wang, Arianna Legovini, Samuel Fraiberger | May 29, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce Query2Effect, a benchmark of 72,000 natural language questions for evaluating how well LLMs predict causal effect sizes. A two-step framework that first generates a structured representation of the query and then applies supervised encoding reduces prediction error by 27-71% compared to prompted LLMs. The result indicates that separating semantic interpretation from numerical estimation improves both in-domain performance and out-of-domain generalization.

Analysis

This research addresses a fundamental challenge in applying machine learning to scientific inference: predicting causal effects without expensive randomized controlled trials. The Query2Effect benchmark represents a significant step toward automating literature synthesis and evidence evaluation, domains where researchers currently spend substantial time manually extracting and interpreting experimental results. By creating a large-scale dataset linking natural language questions to experimental outcomes, the authors enable systematic evaluation of how LLMs handle the nuanced task of causal estimation.

The two-step framework is particularly instructive. Rather than asking an LLM to predict a numerical effect size directly from text, the authors first generate a structured intermediate representation of the query, then pass it to a supervised model that produces the numerical estimate. This decomposition reduces absolute error by up to 71% and also improves generalization to out-of-domain queries, which suggests that semantic understanding and quantitative prediction are better handled separately than optimized end to end.
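The two-step decomposition can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the authors' implementation: the `StructuredQuery` fields, the keyword-matching `interpret` function (standing in for the LLM step), the lookup-based `EffectEstimator` (standing in for the supervised model), and all effect values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class StructuredQuery:
    """Hypothetical intermediate representation; the field names are assumptions."""
    intervention: str
    outcome: str


# Hypothetical vocabularies used by the toy interpreter below.
INTERVENTIONS = ["cash transfer", "tutoring"]
OUTCOMES = ["school enrollment", "test scores"]


def interpret(query: str) -> StructuredQuery:
    """Step 1 stand-in: keyword matching in place of an LLM call."""
    text = query.lower()
    intervention = next((i for i in INTERVENTIONS if i in text), "unknown")
    outcome = next((o for o in OUTCOMES if o in text), "unknown")
    return StructuredQuery(intervention, outcome)


class EffectEstimator:
    """Step 2 stand-in: predicts the mean training effect for a known structure."""

    def fit(self, examples):
        grouped = defaultdict(list)
        for structure, effect in examples:
            grouped[structure].append(effect)
        self.by_structure = {s: mean(v) for s, v in grouped.items()}
        # Structures never seen in training fall back to the global mean.
        self.fallback = mean(effect for _, effect in examples)
        return self

    def predict(self, structure: StructuredQuery) -> float:
        return self.by_structure.get(structure, self.fallback)


# Made-up training pairs (structure, effect size); not data from the benchmark.
train = [
    (StructuredQuery("cash transfer", "school enrollment"), 0.25),
    (StructuredQuery("cash transfer", "school enrollment"), 0.5),
    (StructuredQuery("tutoring", "test scores"), 1.0),
]
model = EffectEstimator().fit(train)

structured = interpret("Do cash transfer programs raise school enrollment?")
prediction = model.predict(structured)  # 0.375, the mean of 0.25 and 0.5
```

The point of the split is that the numerical model only ever sees the normalized structure, so differently worded questions about the same intervention and outcome map to the same prediction.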

The gap between finetuned and prompted baseline models points to a limitation of current LLMs. While recent work emphasizes prompt engineering, this research indicates that domain-specific supervised training remains essential for scientific applications requiring numerical precision. The benchmark also varies query specificity along three controlled dimensions (implicitness, abstraction, and ambiguity), reflecting the real-world complexity that such systems must handle.

For the broader AI ecosystem, this work strengthens the case for structured reasoning over pure language modeling. It suggests future research should focus on hybrid architectures that leverage LLMs for interpretation while employing specialized models for quantitative tasks. This approach may accelerate adoption in high-stakes domains like medicine and policy where prediction accuracy directly impacts decisions.

Key Takeaways

→ The Query2Effect benchmark, with 72,000 natural language questions, enables systematic evaluation of LLM-based causal effect prediction
→ A two-step framework separating semantic interpretation from numerical estimation reduces prediction error by 27-71% versus prompted LLMs
→ Models finetuned on the domain-specific task substantially outperform out-of-the-box LLMs at scientific effect estimation
→ Structured intermediate representations improve both in-domain accuracy and out-of-domain generalization
→ Decomposing complex reasoning into semantic and quantitative subtasks outperforms end-to-end approaches for scientific inference
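To make the 27-71% figure concrete, the sketch below shows how a relative reduction in error is typically computed. It assumes the percentages refer to relative reductions in mean absolute error, which the summary does not state explicitly; the function names and all numbers are illustrative, not results from the paper.

```python
def mean_absolute_error(predictions, targets):
    """Average absolute difference between predicted and true effect sizes."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error removed by the new method."""
    return (baseline_error - new_error) / baseline_error


# Made-up effect sizes and predictions, for illustration only.
truth = [0.5, 0.25, 1.0, 0.0]
prompted_preds = [1.0, 0.75, 0.5, 0.5]
two_step_preds = [0.75, 0.5, 0.75, 0.25]

baseline_mae = mean_absolute_error(prompted_preds, truth)  # 0.5
two_step_mae = mean_absolute_error(two_step_preds, truth)  # 0.25
reduction = relative_error_reduction(baseline_mae, two_step_mae)  # 0.5, i.e. 50%
```

Under this reading, a 71% reduction means the two-step model's error is 29% of the prompted baseline's error on the same queries.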

#large-language-models #causal-inference #scientific-ai #structured-reasoning #effect-estimation #benchmark #nlp #machine-learning

