y0news

#llm-reasoning News & Analysis

113 articles tagged with #llm-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 3d ago · 7/10

R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

Researchers introduce R-APS (Reflective Adversarial Pareto Search), a novel method that enhances large language model reasoning for constrained design tasks by decomposing reasoning modes into separate contexts and orchestrating them across multiple timescales. The approach delivers 3.5x tighter robustness guarantees and 46% faster convergence on mechanical design problems without requiring model fine-tuning.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Inducing Reasoning Primitives from Agent Traces

Researchers introduce Reasoning Primitive Induction, a method that extracts reusable reasoning patterns from ReAct-style LLM agent traces and converts them into a compact library of pseudo-tools. The induced libraries consistently outperform the original agents by 22-44 percentage points across multiple reasoning tasks, suggesting a systematic path to improve LLM reasoning through learned decomposition.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion

Researchers introduce eMoT (evolving Memory-of-Thought), a framework that enhances LLM reasoning by treating reasoning processes as dynamic, evolving memories rather than static sequences. The system combines memory corrosion mechanisms, symbolic anchoring for deterministic computation, and consistency refinement to reduce hallucinations and improve multi-step reasoning accuracy, achieving 100% on Game of 24 and significant gains on mathematical benchmarks.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

Language-Native Materials Processing Design by Lightly Structured Text Database and Reasoning Large Language Model

Researchers have developed an AI framework that transforms materials synthesis procedures from unstructured narrative text into actionable, computable knowledge using large language models and structured databases. The system successfully optimized boron nitride nanosheet synthesis in three iterations, demonstrating AI's potential to accelerate complex materials discovery beyond traditional trial-and-error approaches.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning

Researchers introduce KACE, a novel context engineering method that improves large language models' mathematical reasoning by separating knowledge storage from usage through difficulty and domain-based organization. The approach achieves 62.2% accuracy on AIME 2025, significantly outperforming existing self-consistency methods while maintaining comparable computational efficiency.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

Researchers propose POPO (Group Prioritized Off-Policy Optimization), a new framework that improves reinforcement learning for large language model reasoning by efficiently reusing ineffective training samples without computational overhead. The method addresses a critical limitation in RLVR systems where many training samples yield zero-variance rewards, enabling faster model improvement across mathematics, planning, and visual reasoning tasks.
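The zero-variance problem mentioned above can be illustrated with a minimal sketch. It assumes GRPO-style group-normalized advantages (reward minus group mean, divided by group standard deviation); the summary does not say which estimator POPO builds on, and the function name `group_advantages` is hypothetical. The sketch shows only the failure mode, not POPO's remedy.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps).

    Assumption: a GRPO-style estimator, used here only for illustration.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with mixed outcomes yields non-zero advantages (a learning signal).
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# Groups where every sample gets the same verifiable reward yield all-zero
# advantages, so they contribute no gradient under this estimator.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_advantages([1.0, 1.0, 1.0, 1.0])
```

Under this assumption, prompts that the model always solves or always fails are the "ineffective samples" the title refers to.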

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

Beyond One-shot: AI Agents for Learning in Field Experiments

Researchers demonstrated that tool-augmented AI agents can automatically learn from experimental data to design superior interventions, outperforming human-AI collaboration in a large-scale healthcare field study. The AI-generated messaging achieved 69.8% click-through rates, but results suggest domain-specific experimental data—not general reasoning ability—drives performance.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge

Researchers demonstrate that the 'reversal curse' — an autoregressive language model's inability to deduce inverse relationships from forward training data — can be mitigated through a simple data regularization technique called Identity Bridge. By adding self-referential training examples (e.g., 'Alice's name is Alice'), a 1B parameter model achieves 50% success on reversal tasks compared to near-zero baseline performance, suggesting LLMs can learn higher-level logical rules rather than merely memorizing facts.
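As a rough sketch of the data augmentation described above: the function below appends one self-referential example per entity, following the 'Alice's name is Alice' pattern from the summary. The triple format, the sentence template for forward facts, and the choice to add identity examples for both subjects and objects are assumptions made for illustration, and `add_identity_bridge` is a hypothetical name rather than the paper's code.

```python
def add_identity_bridge(facts):
    """Turn (subject, relation, obj) triples into training strings and
    append one identity example per distinct entity.

    Assumption: identity examples are added for subjects and objects alike.
    """
    # Forward facts, e.g. "Alice's husband is Bob".
    examples = [f"{s}'s {r} is {o}" for s, r, o in facts]

    # Collect entities in first-seen order, without duplicates.
    entities = []
    for s, _, o in facts:
        for e in (s, o):
            if e not in entities:
                entities.append(e)

    # Self-referential examples, e.g. "Alice's name is Alice".
    examples += [f"{e}'s name is {e}" for e in entities]
    return examples

augmented = add_identity_bridge([("Alice", "husband", "Bob")])
```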

AI · Neutral · arXiv – CS AI · 5d ago · 7/10

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

Researchers establish a theoretical framework explaining why large language models optimized through outcome-based reinforcement learning develop brittle reasoning despite strong benchmark performance. The study introduces 'Reward-Induced Manifold Collapse' and demonstrates that process reward models can prevent this failure mode by enforcing information constraints on reasoning steps.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

Researchers introduce Hermes, an AI agent that combines informal reasoning with formally verified mathematical proofs in Lean, achieving up to 40% accuracy improvements on difficult math benchmarks while reducing computational costs by 80%. The system addresses a fundamental limitation in LLM reasoning by interleaving exploratory problem-solving with rigorous formal verification.

AI · Bearish · arXiv – CS AI · 6d ago · 7/10

The Refutability Gap: Challenges in Validating Reasoning by Large Language Models

A new arXiv paper challenges recent claims about LLM capabilities by arguing they lack scientific rigor under Popper's falsifiability principle. The authors identify methodological flaws in AI reasoning research, including opaque training data, non-reproducibility, and selection bias, then propose transparency guidelines to improve scientific integrity in LLM evaluation.

🏢 Meta
AI · Neutral · arXiv – CS AI · May 29 · 7/10

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Researchers introduce BeliefTrack, a benchmark for evaluating how large language models manage contextual information over long interactions—deciding when to update beliefs, preserve state, or ignore noise. The study reveals vanilla LLMs fail significantly at this task, while reinforcement learning with belief-state rewards reduces failures by 71% on average.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs

Researchers extend the bounded attention prefix oracle (BAPO) model to establish theoretical lower bounds on chain-of-thought reasoning tokens required by LLMs, proving that canonical tasks require Ω(n) tokens as input size n grows. Experiments with frontier models confirm linear scaling behavior, revealing fundamental computational bottlenecks in inference-time scaling.

AI · Neutral · arXiv – CS AI · May 28 · 7/10

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

Researchers challenge the GSM-Symbolic benchmark's conclusions about LLM reasoning capabilities, finding that under rigorous statistical testing only half of the tested models show significant performance degradation. The analysis uncovers a previously unacknowledged distributional shift in problem integers and identifies distinct, model-specific failure patterns rather than universal reasoning deficits.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

Researchers introduce GeoFaith, a framework for detecting and improving faithfulness in chain-of-thought reasoning by LLMs, addressing the problem of plausible-sounding but inaccurate explanations. The method combines geometric latent structures with entropy analysis and includes a reinforcement learning approach that achieves superior performance on faithfulness detection while maintaining accuracy.

🧠 GPT-5
AI · Bullish · arXiv – CS AI · May 27 · 7/10

GraphMind: From Operational Traces to Self-Evolving Workflow Automation

GraphMind is an AI system that automates complex operational workflows by extracting structured action graphs from human resolution traces and using multi-agent reasoning to execute and adapt them. Deployed across cloud database services, it shows significant improvements in incident mitigation with fewer hallucinations, and illustrates how operational AI systems can learn and improve from execution feedback.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

Researchers introduced MathConstraint, an adaptive benchmark for testing large language models' combinatorial reasoning abilities using constraint satisfaction problems with automated verification. The benchmark reveals significant performance gaps between frontier models, with accuracy dropping from 72-87% on easier instances to 18-66% on harder ones, while tool access via Python solvers roughly doubles performance.

🧠 GPT-5
AI · Bullish · arXiv – CS AI · May 12 · 7/10

EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling

Researchers introduce EXPO, an improved reinforcement learning algorithm for LLM mathematical reasoning that dynamically adjusts KL penalty coefficients and prioritizes moderately difficult problems during training. The method demonstrates significant performance improvements over existing GRPO approaches, achieving a 13.34-point absolute gain on AIME 2025 benchmarks.
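One way to picture "prioritizing moderately difficult problems" is a Gaussian weight over each problem's estimated pass rate, peaking at a mid-range value. The sketch below is an illustration under stated assumptions, not EXPO's actual procedure: the `center` and `width` values, the use of pass rate as the difficulty measure, and both function names are hypothetical. The adaptive KL coefficient is not modeled here.

```python
import math
import random

def gaussian_weights(pass_rates, center=0.5, width=0.2):
    """Weight each problem by a Gaussian of its pass rate.

    Assumption: difficulty is measured by pass rate in [0, 1]; center and
    width are illustrative values, not taken from the paper.
    """
    return [math.exp(-((p - center) ** 2) / (2 * width ** 2)) for p in pass_rates]

def sample_problems(problems, pass_rates, k, seed=0):
    """Draw k problems (with replacement) in proportion to their weights."""
    rng = random.Random(seed)
    return rng.choices(problems, weights=gaussian_weights(pass_rates), k=k)

# Problems the model almost always fails (0.0) or always solves (1.0)
# receive much lower weight than one it solves about half the time.
weights = gaussian_weights([0.0, 0.5, 1.0])
batch = sample_problems(["too_hard", "moderate", "too_easy"], [0.0, 0.5, 1.0], k=8)
```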

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal

Researchers discovered that large language models internally detect their own reasoning errors with 95% accuracy but verbally express unwarranted confidence in flawed outputs. Despite this hidden awareness, four intervention strategies failed to correct the errors, indicating the signal reflects computation quality rather than a mechanism that can be leveraged for improvement.

🧠 Llama
AI · Neutral · arXiv – CS AI · May 12 · 7/10

Sanity Checks for Long-Form Hallucination Detection

Researchers introduce a controlled-invariance methodology to distinguish whether hallucination detection in large language models actually evaluates reasoning quality or merely exploits surface-level answer cues. Their lightweight TRACT model demonstrates that effective detection relies primarily on lexical trajectory features rather than complex learned representations, suggesting current detection methods conflate endpoint artifacts with genuine reasoning validation.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators

Researchers propose CIKA, a framework using LLMs as interventional simulators to identify which mathematical concepts causally contribute to correct answers, distinguishing genuine causal relationships from spurious correlations. The method achieves 69.7% on Omni-MATH-Rule and 97.2% on GSM8K with a frozen 7B model, outperforming o1-mini on contamination-free benchmarks.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

Researchers introduce CASPO, a framework that improves reasoning reliability in large language models by aligning token-level confidence with step-wise logical correctness through preference optimization. The method achieves better performance than tree-search approaches without requiring separate reward models, while introducing CaT inference that dynamically prunes uncertain reasoning branches with minimal computational overhead.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing

Researchers introduce MAVEN, a multi-agent framework that enhances large language model reasoning through explicit role-separation and intermediate verification steps. The system outperforms existing approaches on multiple benchmarks by creating verifiable, modular deliberation trajectories rather than relying on implicit reasoning or post-hoc consensus mechanisms.

AI · Neutral · arXiv – CS AI · May 11 · 7/10

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

Researchers introduce GSM-SEM, a framework for generating semantically diverse variants of math benchmarks like GSM8K to combat memorization in LLM evaluations. Testing 14 state-of-the-art models reveals consistent performance drops averaging 28%, suggesting current leaderboard rankings may overstate true reasoning capabilities.
