#causal-reasoning News & Analysis

22 articles tagged with #causal-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

Researchers introduce CauSim, a framework that enables large language models to improve causal reasoning by constructing increasingly complex executable causal simulators. The approach transforms causal reasoning from a scarce-data problem into a scalable supervised learning task, allowing LLMs to generate synthetic training data and demonstrate improved performance across different representations.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

On Semantic Loss Fine-Tuning Approach for Preventing Model Collapse in Causal Reasoning

Researchers demonstrate that standard fine-tuning of transformer models on causal reasoning tasks causes catastrophic collapse, in which models learn trivial solutions while appearing accurate. They propose a semantic loss function with graph-based constraints that prevents collapse and yields stable, context-dependent causal reasoning, with a 42.7% improvement over baseline models.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

Researchers introduce METER, a benchmark that evaluates Large Language Models' ability to perform contextual causal reasoning across three hierarchical levels within unified settings. The study identifies critical failure modes in LLMs: susceptibility to causally irrelevant information and degraded context faithfulness at higher causal levels.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

Right for the Wrong Reasons: Epistemic Regret Minimization for Causal Rung Collapse in LLMs

Researchers identify a fundamental flaw in large language models called 'Rung Collapse' where AI systems achieve correct answers through flawed causal reasoning that fails under distribution shifts. They propose Epistemic Regret Minimization (ERM) as a solution that penalizes incorrect reasoning processes independently of task success, showing 53-59% recovery of reasoning errors in experiments across six frontier LLMs.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Mar 16 · 7/10

HCP-DCNet: A Hierarchical Causal Primitive Dynamic Composition Network for Self-Improving Causal Understanding

Researchers introduce HCP-DCNet, a new AI framework that combines physical dynamics with symbolic causal reasoning to enable AI systems to understand cause-and-effect relationships. The system uses hierarchical causal primitives and can self-improve through interventions, potentially addressing current limitations in AI's ability to handle distribution shifts and counterfactual reasoning.

AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

Generalization of RLVR Using Causal Reasoning as a Testbed

Researchers studied reinforcement learning with verifiable rewards (RLVR) for training large language models on causal reasoning tasks, finding it outperforms supervised fine-tuning but only when models have sufficient initial competence. The study used causal graphical models as a testbed and showed RLVR improves specific reasoning subskills like marginalization strategy and probability calculations.

AI · Neutral · arXiv – CS AI · 1d ago · 6/10

PropLLM: Propagation-Aware Scene Reconstruction for Network Fault Diagnosis

PropLLM is an AI system that diagnoses network faults by tracing propagation paths backward from symptomatic alerts, using large language models combined with knowledge graphs. The approach achieves a 3.9% improvement in fault diagnosis accuracy and reduces hallucinations by 50.8% compared to existing methods, with validation across Wi-Fi and 5G networks.

AI · Neutral · arXiv – CS AI · 1d ago · 6/10

Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

Researchers introduce Causal-Plan-Bench and Causal-Plan-1M to shift embodied AI systems from linguistic token prediction toward physically grounded causal reasoning. The work demonstrates that leading models like Gemini 3 Pro struggle with genuine physical planning, while their Causal Planner model achieves 36.3% relative performance gains through million-scale causal training data.

🧠 Gemini

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation

The BEAMS Initiative establishes benchmarks to evaluate AI tools for modeling and simulation, ensuring they complement human expertise rather than replace it. Testing reveals that current AI-enabled modeling tools excel at discussion and qualitative tasks but struggle with causal reasoning and quantitative error correction, with performance varying significantly across different LLM implementations.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

S-MARC: Causal Streaming Reasoning for Full-Duplex Conversational Behavior Modeling

Researchers introduce S-MARC, a streaming framework for modeling conversational behavior in full-duplex dialogue systems that predicts communicative functions and interaction behaviors while capturing their causal relationships. The system generates interpretable reasoning chains and establishes benchmarks for conversational AI reasoning, advancing natural human-computer interaction capabilities.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

Researchers present CODE, a novel approach to knowledge editing in large language models that replaces fact overwriting with causal reasoning. By embedding causal narratives and on-policy distillation into model parameters, CODE reduces self-refutation rates from 95.6% to 1.8%, enabling LLMs to evolve knowledge coherently rather than storing isolated facts.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling

Researchers introduce Recon, a method for improving user modeling that evaluates synthesized reasoning traces by action reconstruction rather than post-hoc rationalization. The approach achieves a 54.7% win rate over baseline methods and demonstrates that reasoning should naturally elicit predicted actions from context, advancing AI's ability to simulate human behavior.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models

Researchers introduce EconCausal, a benchmark dataset of 10,490 annotated economic causal relationships drawn from peer-reviewed studies. It reveals that large language models struggle to condition predictions on changing contexts: they achieve 88% accuracy in fixed scenarios but drop to 41.3% when context shifts require reversing causal directions.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

KARMA-MV: A Benchmark for Causal Question Answering on Music Videos

Researchers introduce KARMA-MV, a large-scale dataset of 37,737 multiple-choice questions derived from 2,682 YouTube music videos, designed to benchmark AI models' ability to reason about causal relationships between visual dynamics and musical structure. The dataset leverages LLM-based generation for scalability and proposes a causal knowledge graph approach to improve vision-language model performance on cross-modal audio-visual reasoning tasks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions

ReplaySCM introduces a 1,300-item benchmark for evaluating how well language models can infer causal mechanisms from limited intervention data. The benchmark tests whether AI systems can output executable Boolean causal models that generalize to unseen intervention scenarios, revealing that frontier LLMs struggle significantly when structural information is hidden.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

FactoryBench: Evaluating Industrial Machine Understanding

Researchers introduce FactoryBench, a comprehensive benchmark for evaluating machine learning models on industrial robot understanding using time-series data and LLMs. The benchmark reveals that current frontier models fail to exceed 50% accuracy on structured tasks and 18% on decision-making, exposing significant gaps in operational machine intelligence.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

NoisyCausal: A Benchmark for Evaluating Causal Reasoning Under Structured Noise

Researchers introduce NoisyCausal, a benchmark for testing how well large language models handle causal reasoning when presented with noisy, incomplete, or misleading information. The study proposes a modular framework combining LLMs with explicit causal graph structures, demonstrating significant improvements over standard prompting approaches and better generalization across external benchmarks.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

Towards Neuro-symbolic Causal Rule Synthesis, Verification, and Evaluation Grounded in Legal and Safety Principles

Researchers present a neuro-symbolic framework that combines first-order logic, causal models, and deep reinforcement learning to automatically synthesize, verify, and maintain safety-critical rule-based systems. The system uses LLMs to translate human-specified legal and safety principles into formal logical rules, with validation pipelines ensuring consistency and safety before deployment in autonomous systems.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Diagnosing and Mitigating Sycophancy and Skepticism in LLM Causal Judgment

Researchers demonstrate that large language models exhibit critical control failures in causal reasoning, where they produce sound logical arguments but abandon them under social pressure or authority hints. The study introduces CAUSALT3, a benchmark revealing three reproducible pathologies, and proposes Regulated Causal Anchoring (RCA), an inference-time mitigation technique that validates reasoning consistency without retraining.

AI · Neutral · arXiv – CS AI · Mar 16 · 6/10

Do LLMs Share Human-Like Biases? Causal Reasoning Under Prior Knowledge, Irrelevant Context, and Varying Compute Budgets

A research study comparing causal reasoning abilities of 20+ large language models against human baselines found that LLMs exhibit more rule-like reasoning strategies than humans, who account for unmentioned factors. While LLMs don't mirror typical human cognitive biases in causal judgment, their rigid reasoning may fail when uncertainty is intrinsic, suggesting they can complement human decision-making in specific contexts.

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10

Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Researchers developed Causal Concept Graphs (CCG), a new method for understanding how concepts interact during multi-step reasoning in language models by creating directed graphs of causal dependencies between interpretable features. Testing on GPT-2 Medium across reasoning tasks showed CCG significantly outperformed existing methods with a Causal Fidelity Score of 5.654, demonstrating more effective intervention targeting than random approaches.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Researchers propose ActMem, a novel memory framework for LLM agents that combines memory retrieval with active causal reasoning to handle complex decision-making scenarios. The framework transforms dialogue history into structured causal graphs and uses counterfactual reasoning to resolve conflicts between past states and current intentions, significantly outperforming existing baselines in memory-dependent tasks.