#llm-reasoning News & Analysis

154 articles tagged with #llm-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

ReFlect introduces a training-free harness system that wraps around LLMs to detect and recover from reasoning failures in complex, multi-step tasks. Testing across six models shows significant improvements in task success rates, with gains inversely correlated with baseline performance, though the results also expose limitations in how smaller models handle structured reasoning.

🧠 GPT-4 · 🧠 Claude · 🧠 Sonnet

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Researchers introduce ScaleLogic, a synthetic reasoning framework that systematically studies how reinforcement learning improves LLM reasoning across varying task difficulty and logical complexity. The study reveals that RL training compute follows a power law with reasoning depth, with scaling efficiency improving when models train on more expressively complex logic, suggesting that training content quality matters as much as training volume.

AI · Neutral · arXiv – CS AI · May 7 · 7/10

The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning

Researchers identify the 'Reasoning Trap,' a fundamental information-theoretic limitation where multi-agent language model debates preserve answer accuracy while degrading reasoning quality. The study introduces the Supported Faithfulness Score metric and Evidence-Grounded Socratic Reasoning framework, demonstrating that closed-system reasoning protocols following standard multi-agent debate structures inevitably lose information fidelity according to the Data Processing Inequality.
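
For context, the Data Processing Inequality in its standard textbook form is given below. This is the general statement, not the paper's specific bound; reading X as the source evidence and Y, Z as successive debate rounds is an illustrative assumption.

```latex
% Standard Data Processing Inequality (general statement, not the paper's bound).
% Assumption for illustration: X = source evidence, Y and Z = successive debate rounds.
X \to Y \to Z \ \text{(Markov chain)} \quad\Longrightarrow\quad I(X;Z) \le I(X;Y)
```

In words: if each round sees only the output of the previous one, later rounds cannot carry more information about the evidence than earlier rounds did.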

AI · Neutral · arXiv – CS AI · May 7 · 7/10

Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction

Researchers introduce Oracle, a novel benchmark that evaluates LLM reasoning through black-box environment interaction, where models must deduce hidden functions by exploring unknown systems. Testing 19 models reveals that OpenAI's o3 leads in performance but struggles with complex tasks, exposing a universal weakness: LLMs lack strategic planning capabilities for efficient hypothesis refinement.

🏢 OpenAI

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

Researchers systematically investigated whether Large Language Models can decouple fundamental reasoning patterns from specific problem instances by introducing reasoning conflicts between parametric knowledge and contextual instructions. The study reveals that LLMs prioritize task-appropriate reasoning over compliance with conflicting instructions, though mechanistic interventions at the activation level can steer models toward better instruction following by up to 29%.

AI · Bearish · arXiv – CS AI · May 1 · 7/10

In-Context Examples Suppress Scientific Knowledge Recall in LLMs

Research shows that in-context examples in large language models suppress recall of scientific knowledge, causing models to shift from knowledge-driven reasoning to empirical pattern fitting even when examples are generated from the same formulas they should reinforce. This finding across 60 tasks and four models suggests practitioners deploying LLMs for scientific work should be cautious about using examples, as they may undermine rather than support domain expertise.

AI · Bearish · arXiv – CS AI · Apr 20 · 7/10

The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination

Researchers demonstrate that enhancing LLM reasoning capabilities through reinforcement learning paradoxically increases tool hallucination—where models incorrectly invoke non-existent or inappropriate tools. The study reveals a fundamental trade-off where stronger reasoning correlates with higher hallucination rates, suggesting current AI agent development approaches may inherently compromise reliability for capability.

🏢 OpenAI

AI · Neutral · arXiv – CS AI · Apr 15 · 7/10

Evaluating Relational Reasoning in LLMs with REL

Researchers introduce REL, a benchmark framework that evaluates relational reasoning in large language models by measuring Relational Complexity (RC)—the number of entities that must be simultaneously bound to apply a relation. The study reveals that frontier LLMs consistently degrade in performance as RC increases, exposing a fundamental limitation in higher-arity reasoning that persists even with increased compute and in-context learning.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation

A new study reveals that large language models fail at counterfactual reasoning when policy findings contradict intuitive expectations, despite performing well on obvious cases. The research demonstrates that chain-of-thought prompting paradoxically worsens performance on counter-intuitive scenarios, suggesting current LLMs engage in 'slow talking' rather than genuine deliberative reasoning.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning

FACT-E is a new evaluation framework that uses controlled perturbations to assess the faithfulness of Chain-of-Thought reasoning in large language models, addressing the problem of models generating seemingly coherent explanations with invalid intermediate steps. By measuring both internal chain consistency and answer alignment, FACT-E enables more reliable detection of flawed reasoning and selection of trustworthy reasoning trajectories for in-context learning.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

Can Large Language Models Infer Causal Relationships from Real-World Text?

Researchers developed the first real-world benchmark for evaluating whether large language models can infer causal relationships from complex academic texts. The study reveals that LLMs struggle significantly with this task, with the best models achieving an F1 score of only 0.535, highlighting a critical gap in the reasoning capabilities needed for progress toward AGI.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

Researchers challenge the assumption that longer reasoning chains always improve LLM performance, discovering that extended test-time compute leads to diminishing returns and 'overthinking' where models abandon correct answers. The study demonstrates that optimal compute allocation varies by problem difficulty, enabling significant efficiency gains without sacrificing accuracy.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

Your Model Diversity, Not Method, Determines Reasoning Strategy

Researchers demonstrate that a large language model's diversity profile—how probability mass spreads across different solution approaches—should determine whether reasoning strategies prioritize breadth or depth exploration. Testing on Qwen and Olmo model families reveals that lightweight refinement signals work well for low-diversity aligned models but offer limited value for high-diversity base models, suggesting optimal inference strategies must be model-specific rather than universal.

AI · Bearish · arXiv – CS AI · Apr 13 · 7/10

Robust Reasoning Benchmark

Researchers have developed a 14-technique perturbation pipeline to test the robustness of large language models' reasoning on mathematical problems. Testing reveals that while frontier models remain resilient, open-weight models suffer catastrophic accuracy collapses of up to 55%, and all tested models degrade when solving sequential problems in a single context window, suggesting fundamental architectural limitations in current reasoning systems.

🧠 Claude · 🧠 Opus

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention

Researchers have developed Zipage, a new high-concurrency inference engine for large language models that uses Compressed PagedAttention to relieve memory bottlenecks. The system achieves 95% of the performance of full KV inference engines while delivering a speedup of over 2.1x on mathematical reasoning tasks.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5

Expressive Power of Implicit Models: Rich Equilibria and Test-Time Scaling

Researchers provide mathematical proof that implicit models can achieve greater expressive power through increased test-time computation, explaining how these memory-efficient architectures can match larger explicit networks. The study validates this scaling property across image reconstruction, scientific computing, operations research, and LLM reasoning domains.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

Researchers identify 'cliff tokens'—specific points in LLM reasoning where a single token triggers failure in mathematical problem-solving. By deleting these tokens and resampling, models recover near-perfect accuracy, demonstrating that failures stem from precise decision points rather than diffuse errors. A taxonomy of cliff types enables targeted optimization that improves model reasoning by up to 6.6%.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Teaching LLMs String Matching, Backtracking, and Error Recovery to Deduce Bases and Truth Tables for the Combinatorially Exploding Bit Manipulation Puzzles

Researchers developed a novel approach to help Large Language Models solve bit manipulation puzzles by reframing the problem as string matching and base selection rather than arithmetic logic. Their method achieved 96% validation accuracy on the NVIDIA Nemotron Challenge, placing 7th overall by using backtracking search, error recovery mechanisms, and specialized tokenization to enable LLMs to deduce hidden logical rules from binary string transformations.

🏢 Nvidia

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Graph-Enhanced Large Language Models for Spatial Search

Researchers propose enhancing Large Language Models with graph-based spatial reasoning capabilities to address current limitations in understanding physical world questions. The work aims to enable search engines and LLMs to better answer complex spatial queries relevant to urban planning, engineering, and travel domains by integrating graph data structures.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

When Do Intrinsic Rewards Work for Code Reasoning? A Comprehensive Study

Researchers conducted a systematic empirical study of intrinsic reward methods for code generation using reinforcement learning, finding that certainty-based approaches achieve early gains but inevitably collapse as models progressively shorten outputs and lose reasoning capability. The study reveals that pre-training with intrinsic rewards offers no significant improvement over training from scratch, challenging the transferability of these methods from mathematical reasoning to code generation tasks.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Beyond Fixed Budgets: Characterizing the Inelasticity and Limitations of Tree-of-Thought Reasoning Strategies

Researchers evaluated two Tree-of-Thought (ToT) search strategies for improving LLM reasoning and found that both methods have fundamental limitations under different computational constraints. DPTS struggles with low-budget scenarios due to cold-start bottlenecks, while SSDP depletes its search frontier through aggressive pruning, suggesting adaptive strategies are necessary for effective reasoning across varying resource levels.

🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

Researchers introduce the Independent Combinatorial Tokens (ICT) framework to improve Large Language Model reasoning by addressing entropy collapse and explosion problems in reinforcement learning. Using Jensen-Shannon divergence to identify critical token branching points, ICT achieves a 4.58% average improvement in pass@4 scores across math, commonsense, and Olympiad benchmarks on Qwen models.
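
As a rough illustration of the divergence measure named in this summary, the sketch below computes the Jensen-Shannon divergence between two discrete distributions. It is not the ICT method: the summary does not say which distributions ICT compares, so the function names and the example next-token distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    # Mixture distribution m = (p + q) / 2; m_i > 0 wherever p_i or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical next-token distributions over a 3-token vocabulary.
before = [0.7, 0.2, 0.1]
after = [0.1, 0.2, 0.7]
print(js_divergence(before, after))
```

Identical distributions give a divergence of 0 and distributions with disjoint support give 1 bit, so a large value marks a position where the two distributions disagree sharply.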

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving

VERITAS introduces a zero-shot framework for formal theorem proving that leverages rich verifier feedback signals rather than binary pass/fail outcomes. Using a two-phase approach combining Best-of-N sampling with critic-guided Monte Carlo Tree Search, the system achieves 40.6% accuracy on miniF2F benchmarks and demonstrates particular strength in combinatorial problems where iterative lemma recovery is critical.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

Mind the Perspective: Let's Reason Recursively for Theory of Mind

Researchers introduce RecToM, a framework that improves Large Language Models' Theory of Mind reasoning by modeling nested beliefs through recursive perspective construction. The approach achieves state-of-the-art results on multiple benchmarks, including 100% accuracy on Hi-ToM, demonstrating significant advances in how AI systems infer agent beliefs and intentions.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

Researchers introduce MODF-SIR, a multi-agent framework using lightweight multimodal large language models enhanced with knowledge distillation for social intelligence reasoning. The system identifies long-tail events through explicit text formatting and integrates test-time adaptation with Chain-of-Thought prompting, achieving state-of-the-art results on multiple benchmarks with only 30% of standard training data.

🏢 Hugging Face
