#llm-reasoning News & Analysis

154 articles tagged with #llm-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

154 articles

AIBearisharXiv – CS AI · Jun 257/10

🧠

Failure Modes of Large Language Models on Research-Level Mathematics: A Taxonomy and an Empirical Characterisation

Researchers identify four specific failure modes in large language models attempting research-level mathematics: citation fabrication, premise smuggling, silent problem reformulation, and local-to-global compatibility gaps. Testing reveals that premise smuggling—where models assert unjustified claims as fundamental results—persists even when citations are accurate, suggesting retrieval-augmented generation alone cannot solve LLM reasoning failures.

🧠 Gemini

AIBullisharXiv – CS AI · Jun 237/10

🧠

AgentDSE: Reasoning-Augmented Architectural Design Space Exploration

AgentDSE introduces an LLM-driven methodology that automates architectural design space exploration by reasoning through physical constraints and performance dynamics, achieving competitive results with up to 100x fewer simulator evaluations than traditional methods. The approach eliminates the need for fine-tuning, precomputed databases, or domain-specific optimizers while producing interpretable decision traces.

AIBullisharXiv – CS AI · Jun 237/10

🧠

VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows

Researchers introduce VADAOrchestra, a neurosymbolic framework that combines Large Language Model-based orchestration with symbolic logic programming to execute complex, adaptive workflows. The system addresses key limitations of both traditional business process management and pure LLM-based agents by providing verifiable reasoning traces, improved scalability, and explainability while maintaining runtime adaptability.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Arguments that Alter Minds: LLM Rationales Sway Human (and LLM) Notions of Plausibility

Researchers found that LLM-generated arguments significantly influence both human and AI plausibility judgments on commonsense reasoning tasks, with supportive rationales increasing confidence and opposing ones decreasing it. This reveals both a novel tool for studying human cognition and a concerning vulnerability: AI systems can persuade people to doubt their own common sense reasoning.

AIBullisharXiv – CS AI · Jun 237/10

🧠

PEAR: Permutation-Equivariant Adaptive Routing Multi-Agent Debate

Researchers introduce PEAR, a new multi-agent debate protocol for large language models that dynamically reassigns agent roles across debate rounds to eliminate positional biases. By using permutation-equivariant routing, PEAR improves reasoning accuracy across multiple benchmarks while reducing the sensitivity of LLM outputs to arbitrary role assignments.

AIBullisharXiv – CS AI · Jun 237/10

🧠

SPARC: A Multi-Agent System for Electrical Circuit Question Answering

Researchers introduce SPARC, a multi-agent AI system that answers electrical circuit diagram questions by grounding reasoning in executable physics simulations rather than relying solely on language models. The system achieves 83% accuracy with up to 58% improvement over existing baselines, demonstrating how hybrid AI approaches combining LLMs with domain-specific simulation tools can enhance reasoning reliability.

AIBullisharXiv – CS AI · Jun 237/10

🧠

ARIA: A Causal-Aware Framework for Rescuing LLM Reasoning in Trustworthy Materials Discovery

Researchers introduce ARIA, a causal-aware framework that improves how Large Language Models reason about materials discovery by addressing 'contextual tunneling'—a bias where models over-rely on narrow retrieved evidence. ARIA uses a three-tier approach combining direct causal reasoning, physics-informed analogies, and parametric fallbacks, validated on a knowledge graph of 2,839 materials relations, enabling more trustworthy and auditable AI-assisted scientific discovery.

AINeutralarXiv – CS AI · Jun 237/10

🧠

In LLM Reasoning, there is Irrationality on top of Value Misalignment

Researchers identify 'rational value risk' in large language models, showing that even well-aligned LLMs fail to consistently maximize their intended values during reasoning tasks. The study across major models (Llama, GPT, DeepSeek) reveals that value alignment training alone cannot eliminate this reasoning gap, with performance highly dependent on inference-time strategies.

🧠 GPT-5🧠 Llama

AIBearisharXiv – CS AI · Jun 237/10

🧠

HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

Researchers introduce HOLMES, a new benchmark for evaluating higher-order logical reasoning in large language models, revealing that current LLMs struggle significantly with complex symbolic reasoning tasks that go beyond simple first-order logic. The benchmark demonstrates critical gaps in AI reliability, with the best-performing models achieving only 59.54% accuracy on tasks involving reasoning over rules, predicates, and constraints across legal and financial domains.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm

Researchers introduce Explore-Execute Chain (E²C), a structured reasoning framework that separates LLM planning from execution into distinct computational phases. The approach achieves 53.3% accuracy on AIME 2024 benchmarks with significantly fewer tokens than existing methods, while enabling efficient domain adaptation through exploration-focused fine-tuning.

AIBullisharXiv – CS AI · Jun 197/10

🧠

Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

Researchers propose MACR, a novel framework that resolves conflicts between large language models' internal knowledge and external context information using multi-agent reasoning. The approach moves beyond binary choice paradigms to actively reconcile inconsistencies, demonstrating significant performance improvements over existing methods while providing interpretable conflict resolution.

AIBullisharXiv – CS AI · Jun 117/10

🧠

GPO: Learning from Critical Steps to Improve LLM Reasoning

Researchers introduce GPO (Guided Pivotal Optimization), a novel fine-tuning strategy that improves LLM reasoning by identifying and learning from critical steps within reasoning trajectories rather than treating them as whole processes. The method uses advantage function estimation to locate pivotal moments and prioritizes learning on those segments, demonstrating consistent performance improvements across reasoning benchmarks.

AIBullisharXiv – CS AI · Jun 107/10

🧠

Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning

Researchers introduce Dep-LLM, a training-free framework that diagnoses depression from clinical interviews by decomposing dialogue into structured themes and using large language models without fine-tuning. The system outperforms supervised approaches and commercial LLMs while requiring no additional training, addressing critical gaps in mental health AI deployment.

AIBullisharXiv – CS AI · Jun 97/10

🧠

Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

Researchers introduce Collaboration Policy Tree (Co-pi-tree), a method that distills large language model reasoning into interpretable, executable policy trees for human-AI collaboration. The approach achieves 35% performance improvement while reducing LLM queries by 78% and latency by 97%, addressing key limitations of black-box reinforcement learning and costly real-time LLM querying.

AIBullisharXiv – CS AI · Jun 97/10

🧠

MAR:Multi-Agent Reflexion Improves Reasoning Abilities in LLMs

Researchers present Multi-Agent Reflexion (MAR), a technique that improves LLM reasoning by using multiple AI agents with distinct personas to debate and generate diverse reflections rather than having a single model reflect on itself. The approach achieves 47% accuracy on HotPotQA and 82.7% on HumanEval, outperforming traditional single-agent reflection methods that suffer from repetitive error patterns.

AIBullisharXiv – CS AI · Jun 97/10

🧠

Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations

Researchers present a neurosymbolic reasoning method that integrates large language models into formal logic systems using paraconsistent logic, enabling sound and complete reasoning while leveraging LLM knowledge. The approach improves factuality evaluation by 6 percentage points and successfully identifies logical contradictions in medical knowledge bases without causing logical explosion.

AINeutralarXiv – CS AI · Jun 87/10

🧠

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

Researchers conducted an empirical comparison of mathematical reasoning between humans and DeepSeek-R1, analyzing 10,247 reasoning steps across 30 AIME problems. The study reveals that while the AI model exhibits surface-level reasoning patterns, it engages in inefficient verification loops and lacks the structured deduction humans employ, suggesting current long-chain-of-thought models may be optimized for appearing to reason rather than reasoning effectively.

AIBearisharXiv – CS AI · Jun 87/10

🧠

How reliable are LLMs when it comes to playing dice?

A comprehensive study of 8 state-of-the-art language models reveals significant limitations in probabilistic reasoning, with accuracy dropping from 96% on standard problems to 59% on counterintuitive ones. The research demonstrates that LLMs are vulnerable to token bias and prompt manipulation, suggesting they lack genuine probability reasoning despite excelling at other mathematical tasks.

AIBullisharXiv – CS AI · Jun 57/10

🧠

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

SUPERNOVA introduces a framework for extending reinforcement learning with verifiable rewards (RLVR) beyond STEM fields by systematically curating data from natural instruction datasets. A 25K-instance dataset trained on smaller models achieves 64.4 percentage point gains on complex reasoning benchmarks, with improvements generalizing across model scales and families.

AINeutralarXiv – CS AI · Jun 57/10

🧠

The End of Software Engineering: How AI Agents Are Fundamentally Restructuring the Software Paradigm

A research paper argues that AI agents powered by large language models represent a fundamental paradigm shift in software development, moving beyond traditional static code toward dynamic, self-modifying systems. The analysis traces this evolution through licensing, SaaS, and proposes Agent-as-a-Service (AaaS) as the next frontier, supported by recent benchmarks demonstrating both transformative potential and current limitations.

AINeutralarXiv – CS AI · Jun 47/10

🧠

R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

Researchers introduce R-APS (Reflective Adversarial Pareto Search), a novel method that enhances large language model reasoning for constrained design tasks by decomposing reasoning modes into separate contexts and orchestrating them across multiple timescales. The approach delivers 3.5x tighter robustness guarantees and 46% faster convergence on mechanical design problems without requiring model fine-tuning.

AIBullisharXiv – CS AI · Jun 37/10

🧠

Inducing Reasoning Primitives from Agent Traces

Researchers introduce Reasoning Primitive Induction, a method that extracts reusable reasoning patterns from ReAct-style LLM agent traces and converts them into a compact library of pseudo-tools. The induced libraries consistently outperform the original agents by 22-44 percentage points across multiple reasoning tasks, suggesting a systematic path to improve LLM reasoning through learned decomposition.

AIBullisharXiv – CS AI · Jun 27/10

🧠

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

Researchers propose POPO (Group Prioritized Off-Policy Optimization), a new framework that improves reinforcement learning for large language model reasoning by efficiently reusing ineffective training samples without computational overhead. The method addresses a critical limitation in RLVR systems where many training samples yield zero-variance rewards, enabling faster model improvement across mathematics, planning, and visual reasoning tasks.

AIBullisharXiv – CS AI · Jun 27/10

🧠

Beyond One-shot: AI Agents for Learning in Field Experiments

Researchers demonstrated that tool-augmented AI agents can automatically learn from experimental data to design superior interventions, outperforming human-AI collaboration in a large-scale healthcare field study. The AI-generated messaging achieved 69.8% click-through rates, but results suggest domain-specific experimental data—not general reasoning ability—drives performance.

AIBullisharXiv – CS AI · Jun 27/10

🧠

Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge

Researchers demonstrate that the 'reversal curse' — an autoregressive language model's inability to deduce inverse relationships from forward training data — can be mitigated through a simple data regularization technique called Identity Bridge. By adding self-referential training examples (e.g., 'Alice's name is Alice'), a 1B parameter model achieves 50% success on reversal tasks compared to near-zero baseline performance, suggesting LLMs can learn higher-level logical rules rather than merely memorizing facts.

Page 1 of 7Next →