#llm-reasoning News & Analysis

154 articles tagged with #llm-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories

Researchers demonstrate that Large Language Models improve their reasoning performance when search histories are explicitly structured with parent pointers (LinTree), rather than implicitly represented. The finding suggests that LLMs benefit from tree-aware representations during problem-solving, outperforming both implicit trace-based reasoning and traditional heuristic-guided search across multiple domains.
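As a rough illustration of what an explicitly structured search history could look like, the sketch below writes one line per explored node with a parent pointer, so any branch can be recovered by following pointers back to the root. The `Node` fields, the line format, and both helper functions are assumptions made for this example; the summary does not describe LinTree's actual serialization.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: int
    parent_id: Optional[int]  # None marks the root (hypothetical convention)
    state: str


def linearize(nodes: List[Node]) -> str:
    """Render a search history as one line per node with an explicit parent pointer."""
    return "\n".join(
        f"[{n.node_id}] parent={'-' if n.parent_id is None else n.parent_id} state={n.state}"
        for n in nodes
    )


def path_to_root(nodes: List[Node], node_id: int) -> List[str]:
    """Follow parent pointers from a node up to the root; return states root-first."""
    by_id = {n.node_id: n for n in nodes}
    path = []
    current: Optional[Node] = by_id[node_id]
    while current is not None:
        path.append(current.state)
        current = by_id[current.parent_id] if current.parent_id is not None else None
    return list(reversed(path))


# A toy history with two branches under the root.
history = [
    Node(0, None, "start"),
    Node(1, 0, "try A"),
    Node(2, 0, "try B"),
    Node(3, 2, "B then C"),
]

print(linearize(history))
print(path_to_root(history, 3))
```

Without the `parent=` field, a reader of the flat trace would have to infer that node 3 descends from node 2 rather than node 1; the pointer makes that relationship explicit.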

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate

Researchers demonstrate that large language models engaged in multi-agent debate can achieve superior truth-seeking performance by leveraging collective reasoning dynamics similar to human argumentative discourse. The study provides empirical evidence that distributed epistemic reasoning outperforms individual model performance and proposes a novel benchmarking methodology to measure intrinsic model properties like hallucination propensity.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

Symbolic Intermediaries as a Linguistic-Numerical Interface for LLM-Driven Geometric Reasoning

Researchers propose symbolic intermediaries—compact mathematical expressions derived from symbolic regression—to bridge the gap between Large Language Models and physics simulators by converting continuous numerical outputs into interpretable symbolic forms. LLM-based agents using this interface outperformed genetic algorithms by 19-53% on mechanism synthesis tasks, demonstrating that translating simulator behavior into symbolic language enables grounded geometric reasoning without model retraining.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

SEMA-RAG introduces a multi-agent framework that decouples medical reasoning into three specialized agents to improve retrieval-augmented generation for clinical question answering. The approach improves accuracy by 6.46 percentage points over existing baselines by addressing hallucinations and knowledge obsolescence through iterative, evidence-driven retrieval rather than single-round static lookups.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

ReasonOps: Operator Segmentation for LLM Reasoning Traces

Researchers introduced ReasonOps, an unsupervised method for analyzing chain-of-thought traces from large language models that identifies seven universal reasoning operators (backtracking, inferring, hypothesizing, etc.) appearing consistently across 12 different LLM families. The framework enables model identification, correctness prediction, and early quality estimation without manual annotation, revealing that each model family has a distinctive reasoning fingerprint.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation

OptSkills, a new AI system, advances automated optimization problem-solving by clustering problems by underlying mathematical archetypes rather than surface narratives, achieving 68.27% accuracy on diverse benchmarks and outperforming DeepSeek-V3.2-Thinking on large-scale problems. The system uses skill distillation and trajectory learning to improve generalization across both known and novel problem types.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

Researchers introduce RePoT (Recoverable Program-of-Thought), an enhanced AI reasoning method that fixes failed code generation by replaying execution to identify the first error point, then using a single LLM call to recover rather than restarting. The technique improves accuracy by 3-11 percentage points across multiple models and benchmarks, with particularly strong gains on smaller models like GPT-4 mini.

🧠 GPT-5 · 🧠 Claude · 🧠 Gemini
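The replay-and-repair idea in the RePoT summary can be sketched as follows: run a program step by step, snapshot the state before each step, stop at the first failing step, and repair only that step before resuming from the snapshot. This is a minimal sketch under stated assumptions, not the paper's implementation: the step-list representation, `replay`, `run_with_repair`, and the `repair` callback (standing in for the single LLM call) are all hypothetical names.

```python
import copy
from typing import Callable, Dict, List, Optional, Tuple


def replay(
    steps: List[str], env: Optional[Dict] = None, start: int = 0
) -> Tuple[Dict, Optional[int]]:
    """Execute steps in order from `start`.

    Returns (env, None) on success, or (checkpoint, i) where i is the first
    failing step and checkpoint is the state just before it ran.
    """
    env = {} if env is None else env
    for i in range(start, len(steps)):
        checkpoint = copy.deepcopy(env)  # snapshot before the step runs
        try:
            exec(steps[i], {}, env)  # assignments land in env
        except Exception:
            return checkpoint, i
    return env, None


def run_with_repair(steps: List[str], repair: Callable[[str, Dict], str]) -> Dict:
    """Replay once; on failure, repair the first failing step and resume from its checkpoint."""
    env, failed = replay(steps)
    if failed is None:
        return env
    # One repair call for the failing step (a stand-in for the LLM call).
    fixed = steps[:failed] + [repair(steps[failed], env)] + steps[failed + 1:]
    env, failed = replay(fixed, env, start=failed)
    if failed is not None:
        raise RuntimeError(f"step {failed} still fails after repair")
    return env


# Toy program whose second step divides by zero.
program = ["x = 10", "y = x / 0", "z = y + 1"]
result = run_with_repair(program, lambda step, env: "y = x / 2")
print(result)
```

Resuming from the checkpoint keeps the work done by earlier steps, which is the contrast the summary draws with restarting generation from scratch.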

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

Researchers propose a taxonomy of chain-of-thought (CoT) reasoning in LLM post-training, distinguishing between explicit, composed, and implicit reasoning formats. The study reveals that compressed reasoning data requires different training approaches, with composed CoT benefiting from data scaling while implicit CoT risks memorization, and that reinforcement learning can decompose compressed steps learned during supervised fine-tuning.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

Researchers present a systematic evaluation of large language models' reasoning capabilities on Boolean satisfiability problems, introducing a paired-formula protocol with an Accurate Differentiation Rate (ADR) metric. The results show that conventional accuracy metrics can be misleading, as models often succeed through heuristics rather than genuine reasoning.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

Researchers found that large language models' chain-of-thought reasoning remains remarkably consistent even when reaching opposite conclusions about conflicting information, suggesting CoT explanations don't faithfully reflect the underlying decision mechanism. While model confidence shows a weak but genuine predictive signal for decisions, internal reasoning tokens proved more decision-sensitive than user-facing explanations, indicating models may not transparently report how they actually choose between document claims and training knowledge.

🧠 GPT-4 · 🧠 Claude · 🧠 Sonnet

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning

Researchers examine how Large Language Models use anthropomorphic reflection markers like 'wait' and 'hmm' during reasoning tasks. The study finds these markers are not uniformly necessary for performance and can often be suppressed without degrading—or even while improving—task outcomes, suggesting they function as surface-level cues rather than indicators of genuine reflection mechanisms.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Skill-Conditioned Gated Self-Distillation for LLM Reasoning

Researchers propose Skill-Conditioned Gated Self-Distillation (SGSD), a novel method for improving large language model reasoning by leveraging an experience-derived skill bank rather than trusted reference answers. The approach validates skills through a multi-teacher framework and demonstrates consistent improvements over existing methods on mathematical reasoning benchmarks.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

Researchers present a novel framework analyzing how reinforcement learning (RL) and supervised fine-tuning (SFT) differently shape reasoning in large language models. The study reveals that RL compresses incorrect reasoning paths while SFT expands correct ones, explaining why the two-stage training approach produces superior reasoning capabilities across models of 1.5B to 14B parameters.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

Researchers demonstrate that knowledge graphs significantly outperform traditional document stores for LLM-based industrial asset operations, achieving 100% accuracy on 467 maintenance scenarios compared to 65% with flat data structures. The study reveals that data architecture, not LLM orchestration design, is the primary performance bottleneck in structured operational domains.

🏢 Hugging Face · 🧠 GPT-4

AI · Bullish · arXiv – CS AI · May 27 · 6/10

ReasonOps: A Unified Operational Paradigm for Trustworthy Verified LLM Reasoning

Researchers introduce ReasonOps, a unified operational framework that treats AI reasoning as a continuously monitored and verifiable process rather than isolated inference. The paradigm integrates formal verification, symbolic reasoning, and runtime assurance to address critical reliability gaps in LLM-based reasoning systems, particularly for safety-critical applications.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation

Researchers have developed a mechanistic interpretability framework that traces information flow in reverse through Chain-of-Thought prompting to understand how AI models reason. The study reveals CoT functions as a decoding space pruner that uses answer templates to guide outputs, with task-dependent neuron modulation that reduces activation in open-domain tasks but increases it in closed-domain scenarios.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

Researchers propose PTA-GRPO, a two-stage framework that enhances LLM reasoning by combining high-level planning with reinforcement learning. The method first guides models to summarize reasoning into compact guidance, then uses this guidance to optimize both final outputs and reasoning quality, demonstrating consistent improvements across ten benchmarks.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Vital Trace: Protocol-Constrained Patient-State Reasoning for Longitudinal Clinical Trajectories

Researchers present Vital Trace, a protocol-constrained multi-agent AI framework designed to improve clinical risk prediction in intensive care units by tracking patient trajectories over extended periods. The system uses compact patient-state memory and structured reasoning agents rather than unbounded text histories, demonstrating better temporal consistency and interpretability on MIMIC-IV and eICU datasets.