AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce CauSim, a framework that enables large language models to improve causal reasoning by constructing increasingly complex executable causal simulators. The approach transforms causal reasoning from a scarce-data problem into a scalable supervised learning task, allowing LLMs to generate synthetic training data and demonstrate improved performance across different representations.
AI · Neutral · arXiv – CS AI · May 9 · 7/10
🧠Researchers demonstrate that standard fine-tuning of transformer models on causal reasoning tasks causes catastrophic collapse where models learn trivial solutions while appearing accurate. They propose a semantic loss function with graph-based constraints that prevents collapse and achieves stable, context-dependent causal reasoning with 42.7% improvement over baseline models.
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce METER, a benchmark that evaluates Large Language Models' ability to perform contextual causal reasoning across three hierarchical levels within unified settings. The study identifies critical failure modes in LLMs: susceptibility to causally irrelevant information and degraded context faithfulness at higher causal levels.
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers identify a fundamental flaw in large language models called 'Rung Collapse' where AI systems achieve correct answers through flawed causal reasoning that fails under distribution shifts. They propose Epistemic Regret Minimization (ERM) as a solution that penalizes incorrect reasoning processes independently of task success, showing 53-59% recovery of reasoning errors in experiments across six frontier LLMs.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers introduce HCP-DCNet, a new AI framework that combines physical dynamics with symbolic causal reasoning to enable AI systems to understand cause-and-effect relationships. The system uses hierarchical causal primitives and can self-improve through interventions, potentially addressing current limitations in AI's ability to handle distribution shifts and counterfactual reasoning.
AI · Neutral · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers studied reinforcement learning with verifiable rewards (RLVR) for training large language models on causal reasoning tasks, finding it outperforms supervised fine-tuning but only when models have sufficient initial competence. The study used causal graphical models as a testbed and showed RLVR improves specific reasoning subskills like marginalization strategy and probability calculations.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠The BEAMS Initiative establishes benchmarks to evaluate AI tools for modeling and simulation, ensuring they complement human expertise rather than replace it. Testing reveals that current AI-enabled modeling tools excel at discussion and qualitative tasks but struggle with causal reasoning and quantitative error correction, with performance varying significantly across different LLM implementations.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers introduce S-MARC, a streaming framework for modeling conversational behavior in full-duplex dialogue systems that predicts communicative functions and interaction behaviors while capturing their causal relationships. The system generates interpretable reasoning chains and establishes benchmarks for conversational AI reasoning, advancing natural human-computer interaction capabilities.
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers present CODE, a novel approach to knowledge editing in large language models that replaces fact overwriting with causal reasoning. By embedding causal narratives and on-policy distillation into model parameters, CODE reduces self-refutation rates from 95.6% to 1.8%, enabling LLMs to evolve knowledge coherently rather than storing isolated facts.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduce Recon, a method for improving user modeling by evaluating synthesized reasoning traces through action reconstruction rather than post-hoc rationalization. The approach achieves 54.7% win rates over baseline methods and demonstrates that reasoning should naturally elicit predicted actions from context, advancing AI's ability to simulate human behavior.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduce EconCausal, a benchmark dataset of 10,490 annotated economic causal relationships drawn from peer-reviewed studies. It reveals that large language models struggle to condition predictions on changing contexts: they achieve 88% accuracy in fixed scenarios but drop to 41.3% when context shifts require reversing causal directions.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce KARMA-MV, a large-scale dataset of 37,737 multiple-choice questions derived from 2,682 YouTube music videos, designed to benchmark AI models' ability to reason about causal relationships between visual dynamics and musical structure. The dataset leverages LLM-based generation for scalability and proposes a causal knowledge graph approach to improve vision-language model performance on cross-modal audio-visual reasoning tasks.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠ReplaySCM introduces a 1,300-item benchmark for evaluating how well language models can infer causal mechanisms from limited intervention data. The benchmark tests whether AI systems can output executable Boolean causal models that generalize to unseen intervention scenarios, revealing that frontier LLMs struggle significantly when structural information is hidden.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce FactoryBench, a comprehensive benchmark for evaluating machine learning models on industrial robot understanding using time-series data and LLMs. The benchmark reveals that current frontier models fail to exceed 50% accuracy on structured tasks and 18% on decision-making, exposing significant gaps in operational machine intelligence.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers introduce NoisyCausal, a benchmark for testing how well large language models handle causal reasoning when presented with noisy, incomplete, or misleading information. The study proposes a modular framework combining LLMs with explicit causal graph structures, demonstrating significant improvements over standard prompting approaches and better generalization across external benchmarks.
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers present a neuro-symbolic framework that combines first-order logic, causal models, and deep reinforcement learning to automatically synthesize, verify, and maintain safety-critical rule-based systems. The system uses LLMs to translate human-specified legal and safety principles into formal logical rules, with validation pipelines ensuring consistency and safety before deployment in autonomous systems.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers demonstrate that large language models exhibit critical control failures in causal reasoning, where they produce sound logical arguments but abandon them under social pressure or authority hints. The study introduces CAUSALT3, a benchmark revealing three reproducible pathologies, and proposes Regulated Causal Anchoring (RCA), an inference-time mitigation technique that validates reasoning consistency without retraining.
AI · Neutral · arXiv – CS AI · Mar 16 · 6/10
🧠A research study comparing causal reasoning abilities of 20+ large language models against human baselines found that LLMs exhibit more rule-like reasoning strategies than humans, who account for unmentioned factors. While LLMs don't mirror typical human cognitive biases in causal judgment, their rigid reasoning may fail when uncertainty is intrinsic, suggesting they can complement human decision-making in specific contexts.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers developed Causal Concept Graphs (CCG), a new method for understanding how concepts interact during multi-step reasoning in language models by creating directed graphs of causal dependencies between interpretable features. Testing on GPT-2 Medium across reasoning tasks showed CCG significantly outperformed existing methods with a Causal Fidelity Score of 5.654, demonstrating more effective intervention targeting than random approaches.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers propose ActMem, a novel memory framework for LLM agents that combines memory retrieval with active causal reasoning to handle complex decision-making scenarios. The framework transforms dialogue history into structured causal graphs and uses counterfactual reasoning to resolve conflicts between past states and current intentions, significantly outperforming existing baselines in memory-dependent tasks.