#llm-reasoning News & Analysis

154 articles tagged with #llm-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning

Researchers introduce KACE, a novel context engineering method that improves large language models' mathematical reasoning by separating knowledge storage from usage through difficulty- and domain-based organization. The approach achieves 62.2% accuracy on AIME 2025, significantly outperforming existing self-consistency methods while maintaining comparable computational efficiency.

AI · Neutral · arXiv – CS AI · Jun 2 · 7/10

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

Researchers establish a theoretical framework explaining why large language models optimized through outcome-based reinforcement learning develop brittle reasoning despite strong benchmark performance. The study introduces 'Reward-Induced Manifold Collapse' and demonstrates that process reward models can prevent this failure mode by enforcing information constraints on reasoning steps.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Beyond One-shot: AI Agents for Learning in Field Experiments

Researchers demonstrated that tool-augmented AI agents can automatically learn from experimental data to design superior interventions, outperforming human-AI collaboration in a large-scale healthcare field study. The AI-generated messaging achieved 69.8% click-through rates, but results suggest domain-specific experimental data—not general reasoning ability—drives performance.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge

Researchers demonstrate that the 'reversal curse' — an autoregressive language model's inability to deduce inverse relationships from forward training data — can be mitigated through a simple data regularization technique called Identity Bridge. By adding self-referential training examples (e.g., 'Alice's name is Alice'), a 1B parameter model achieves 50% success on reversal tasks compared to near-zero baseline performance, suggesting LLMs can learn higher-level logical rules rather than merely memorizing facts.

AI · Bearish · arXiv – CS AI · Jun 1 · 7/10

The Refutability Gap: Challenges in Validating Reasoning by Large Language Models

A new arXiv paper challenges recent claims about LLM capabilities by arguing they lack scientific rigor under Popper's falsifiability principle. The authors identify methodological flaws in AI reasoning research, including opaque training data, non-reproducibility, and selection bias, then propose transparency guidelines to improve scientific integrity in LLM evaluation.

🏢 Meta

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

Researchers introduce Hermes, an AI agent that combines informal reasoning with formally verified mathematical proofs in Lean, achieving up to 40% accuracy improvements on difficult math benchmarks while reducing computational costs by 80%. The system addresses a fundamental limitation in LLM reasoning by interleaving exploratory problem-solving with rigorous formal verification.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

DeepTool is a new AI framework that enhances large language models' ability to reason through tool use by implementing process-supervised reinforcement learning. The system dramatically improves performance on mathematical benchmarks like AIME24 (3.2% to 40.4%) while maintaining token efficiency through interleaved thinking and action.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs

Researchers extend the bounded attention prefix oracle (BAPO) model to establish theoretical lower bounds on chain-of-thought reasoning tokens required by LLMs, proving that canonical tasks require Ω(n) tokens as input size n grows. Experiments with frontier models confirm linear scaling behavior, revealing fundamental computational bottlenecks in inference-time scaling.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Researchers introduce BeliefTrack, a benchmark for evaluating how large language models manage contextual information over long interactions—deciding when to update beliefs, preserve state, or ignore noise. The study reveals vanilla LLMs fail significantly at this task, while reinforcement learning with belief-state rewards reduces failures by 71% on average.

AI · Neutral · arXiv – CS AI · May 28 · 7/10

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

Researchers challenge the GSM-Symbolic benchmark's conclusions about LLM reasoning capabilities, finding that under rigorous statistical testing only half of the tested models show significant performance degradation. The analysis uncovers a previously unacknowledged distributional shift in problem integers and identifies distinct, model-specific failure patterns rather than universal reasoning deficits.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

Researchers introduce GeoFaith, a framework for detecting and improving faithfulness in chain-of-thought reasoning by LLMs, addressing the problem of plausible-sounding but inaccurate explanations. The method combines geometric latent structures with entropy analysis and includes a reinforcement learning approach that achieves superior performance on faithfulness detection while maintaining accuracy.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · May 27 · 7/10

GraphMind: From Operational Traces to Self-Evolving Workflow Automation

GraphMind is an AI system that automates complex operational workflows by extracting structured action graphs from human resolution traces and using multi-agent reasoning to execute and adapt them. Deployed across cloud database services, it significantly improves incident mitigation while reducing hallucinations, showing how operational AI systems can learn and improve from execution feedback.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

Researchers introduced MathConstraint, an adaptive benchmark for testing large language models' combinatorial reasoning abilities using constraint satisfaction problems with automated verification. The benchmark reveals significant performance gaps between frontier models, with accuracy dropping from 72-87% on easier instances to 18-66% on harder ones, while tool access via Python solvers roughly doubles performance.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Sanity Checks for Long-Form Hallucination Detection

Researchers introduce a controlled-invariance methodology to distinguish whether hallucination detection in large language models actually evaluates reasoning quality or merely exploits surface-level answer cues. Their lightweight TRACT model demonstrates that effective detection relies primarily on lexical trajectory features rather than complex learned representations, suggesting current detection methods conflate endpoint artifacts with genuine reasoning validation.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal

Researchers discovered that large language models internally detect their own reasoning errors with 95% accuracy but verbally express unwarranted confidence in flawed outputs. Despite this hidden awareness, four intervention strategies failed to correct the errors, indicating the signal reflects computation quality rather than a mechanism that can be leveraged for improvement.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 12 · 7/10

EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling

Researchers introduce EXPO, an improved reinforcement learning algorithm for LLM mathematical reasoning that dynamically adjusts KL penalty coefficients and prioritizes moderately difficult problems during training. The method demonstrates significant performance improvements over existing GRPO approaches, achieving a 13.34-point absolute gain on AIME 2025 benchmarks.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem

A new empirical study evaluates how Large Language Models perform on the Equivalence Class Problem, a simple yet computationally demanding long-chain reasoning task. The research reveals that non-reasoning LLMs fail entirely at the task, while reasoning-capable models perform significantly better but still struggle with complete accuracy, with performance patterns differing based on problem complexity metrics.

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

Researchers developed a method to extract and analyze search trees from LLM reasoning traces, revealing that large language models use shallower, more myopic planning strategies compared to humans. While LLMs generate extended chain-of-thought reasoning, their actual decision-making is driven primarily by shallow search rather than deep lookahead, contrasting sharply with human expert planning.

AI · Neutral · arXiv – CS AI · May 11 · 7/10

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

Researchers introduce GSM-SEM, a framework for generating semantically diverse variants of math benchmarks like GSM8K to combat memorization in LLM evaluations. Testing 14 state-of-the-art models reveals consistent performance drops averaging 28%, suggesting current leaderboard rankings may overstate true reasoning capabilities.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators

Researchers propose CIKA, a framework using LLMs as interventional simulators to identify which mathematical concepts causally contribute to correct answers, distinguishing genuine causal relationships from spurious correlations. The method achieves 69.7% on Omni-MATH-Rule and 97.2% on GSM8K with a frozen 7B model, outperforming o1-mini on contamination-free benchmarks.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

Researchers introduce CASPO, a framework that improves reasoning reliability in large language models by aligning token-level confidence with step-wise logical correctness through preference optimization. The method achieves better performance than tree-search approaches without requiring separate reward models, while introducing CaT inference that dynamically prunes uncertain reasoning branches with minimal computational overhead.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing

Researchers introduce MAVEN, a multi-agent framework that enhances large language model reasoning through explicit role-separation and intermediate verification steps. The system outperforms existing approaches on multiple benchmarks by creating verifiable, modular deliberation trajectories rather than relying on implicit reasoning or post-hoc consensus mechanisms.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Logic-Regularized Verifier Elicits Reasoning from LLMs

Researchers introduce LOVER, an unsupervised verifier that uses logical constraints to improve LLM reasoning without requiring expensive labeled datasets. The method achieves performance comparable to supervised approaches by enforcing logical consistency rules across multiple reasoning paths.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Time Series Reasoning via Process-Verifiable Thinking Data Synthesis and Scheduling for Tailored LLM Reasoning

Researchers introduce VeriTime, a framework that enhances large language models for time series analysis through synthetic data generation, intelligent data scheduling, and specialized reinforcement learning. The approach enables smaller models (3B-4B parameters) to match or exceed the reasoning capabilities of larger proprietary LLMs on time series tasks.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Researchers introduce ScaleLogic, a synthetic reasoning framework that systematically studies how reinforcement learning improves LLM reasoning across varying task difficulty and logical complexity. The study reveals that RL training compute follows a power law with reasoning depth, with scaling efficiency improving when models train on more expressively complex logic, suggesting that training content quality matters as much as training volume.
