y0news

#llm-reasoning News & Analysis

109 articles tagged with #llm-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Formally Solving Answer-Construction Problems in Lean

Researchers introduce Enumerate-Conjecture-Prove (ECP), a neuro-symbolic framework that combines general LLMs and prover LLMs to formally solve mathematical answer-construction problems in Lean. The approach addresses a critical gap where current AI systems struggle with generating both candidate answers and rigorous formal proofs, achieving higher success rates than baseline LLM approaches on competition mathematics benchmarks.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

Researchers propose Credit-Attenuated Privileged Feedback (CAPF), a training mechanism that guides LLM search agents by providing verifier feedback during training to improve learning on difficult problems. The approach improves performance on open-domain QA benchmarks by leveraging information already available in reinforcement learning systems, increasing exact-match accuracy from 44.7% to 48.5% on Qwen3-4B.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

Researchers introduce LLM-WikiRace, a benchmark that tests large language models' planning and reasoning abilities by requiring them to navigate Wikipedia links from a source to target page. While frontier models like Gemini-3 achieve superhuman performance on easy tasks, success rates plummet to 23% on hard difficulty, revealing significant limitations in long-horizon planning and recovery from failures.

🧠 GPT-5 · 🧠 Claude · 🧠 Opus

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning

Researchers demonstrate that large language models fail to accurately predict gene expression changes in cellular perturbation experiments despite producing biologically plausible explanations. They introduce CORE, a contrastive learning method that significantly improves prediction accuracy by organizing evidence from related perturbations rather than evaluating them in isolation.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

Researchers introduce LC-ERD, a framework for improving Large Language Model reasoning by mining high-quality supervision signals through consistency-regulated reward decomposition. The method addresses critical challenges in self-aligned LLM training by reducing label noise, providing granular step-level guidance, and preventing distributional collapse, demonstrating potential improvements in reasoning quality and generalization.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories

Researchers demonstrate that Large Language Models improve their reasoning performance when search histories are explicitly structured with parent pointers (LinTree), rather than implicitly represented. The finding suggests that LLMs benefit from tree-aware representations during problem-solving, outperforming both implicit trace-based reasoning and traditional heuristic-guided search across multiple domains.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate

Researchers demonstrate that large language models engaged in multi-agent debate can achieve superior truth-seeking performance by leveraging collective reasoning dynamics similar to human argumentative discourse. The study provides empirical evidence that distributed epistemic reasoning outperforms individual model performance and proposes a novel benchmarking methodology to measure intrinsic model properties like hallucination propensity.

AI · Bullish · arXiv – CS AI · 6d ago · 6/10

Symbolic Intermediaries as a Linguistic-Numerical Interface for LLM-Driven Geometric Reasoning

Researchers propose symbolic intermediaries—compact mathematical expressions derived from symbolic regression—to bridge the gap between Large Language Models and physics simulators by converting continuous numerical outputs into interpretable symbolic forms. LLM-based agents using this interface outperformed genetic algorithms by 19-53% on mechanism synthesis tasks, demonstrating that translating simulator behavior into symbolic language enables grounded geometric reasoning without model retraining.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

SEMA-RAG introduces a multi-agent framework that decouples medical reasoning tasks into three specialized agents to improve retrieval-augmented generation for clinical question answering. The approach improves accuracy by 6.46 percentage points over existing baselines by addressing hallucinations and knowledge obsolescence through iterative, evidence-driven retrieval rather than single-round static lookups.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

ReasonOps: Operator Segmentation for LLM Reasoning Traces

Researchers introduced ReasonOps, an unsupervised method for analyzing chain-of-thought traces from large language models that identifies seven universal reasoning operators (backtracking, inferring, hypothesizing, etc.) appearing consistently across 12 different LLM families. The framework enables model identification, correctness prediction, and early quality estimation without manual annotation, revealing that each model family has a distinctive reasoning fingerprint.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation

OptSkills, a new AI system, advances automated optimization problem-solving by clustering problems by underlying mathematical archetypes rather than surface narratives, achieving 68.27% accuracy on diverse benchmarks and outperforming DeepSeek-V3.2-Thinking on large-scale problems. The system uses skill distillation and trajectory learning to improve generalization across both known and novel problem types.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

RePoT: Recoverable Program-of-Thought via Checkpoint Repair

Researchers introduce RePoT (Recoverable Program-of-Thought), an enhanced AI reasoning method that fixes failed code generation by replaying execution to identify the first error point, then using a single LLM call to recover rather than restarting. The technique improves accuracy by 3-11 percentage points across multiple models and benchmarks, with particularly strong gains on smaller models like GPT-4 mini.

🧠 GPT-5 · 🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

Researchers present a systematic evaluation of large language models' reasoning capabilities on Boolean satisfiability problems, introducing a paired-formula protocol with an Accurate Differentiation Rate (ADR) metric. The results show that conventional accuracy metrics can be misleading, as models often succeed through heuristics rather than genuine reasoning.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

Researchers found that large language models' chain-of-thought reasoning remains remarkably consistent even when reaching opposite conclusions about conflicting information, suggesting CoT explanations don't faithfully reflect the underlying decision mechanism. While model confidence shows weak but genuine predictive signal for decisions, internal reasoning tokens proved more decision-sensitive than user-facing explanations, indicating models may not transparently report how they actually choose between document claims and training knowledge.

🧠 GPT-4 · 🧠 Claude · 🧠 Sonnet

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning

Researchers examine how Large Language Models use anthropomorphic reflection markers like 'wait' and 'hmm' during reasoning tasks. The study finds these markers are not uniformly necessary for performance and can often be suppressed without degrading—or even while improving—task outcomes, suggesting they function as surface-level cues rather than indicators of genuine reflection mechanisms.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Skill-Conditioned Gated Self-Distillation for LLM Reasoning

Researchers propose Skill-Conditioned Gated Self-Distillation (SGSD), a novel method for improving large language model reasoning by leveraging an experience-derived skill bank rather than trusted reference answers. The approach validates skills through a multi-teacher framework and demonstrates consistent improvements over existing methods on mathematical reasoning benchmarks.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

Researchers present a novel framework analyzing how reinforcement learning (RL) and supervised fine-tuning (SFT) differently shape reasoning in large language models. The study reveals that RL compresses incorrect reasoning paths while SFT expands correct ones, explaining why the two-stage training approach produces superior reasoning capabilities across models of 1.5B to 14B parameters.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

Researchers propose a taxonomy of chain-of-thought (CoT) reasoning in LLM post-training, distinguishing between explicit, composed, and implicit reasoning formats. The study reveals that compressed reasoning data requires different training approaches, with composed CoT benefiting from data scaling while implicit CoT risks memorization, and that reinforcement learning can decompose compressed steps learned during supervised fine-tuning.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

Researchers demonstrate that knowledge graphs significantly outperform traditional document stores for LLM-based industrial asset operations, achieving 100% accuracy on 467 maintenance scenarios compared to 65% with flat data structures. The study reveals that data architecture, not LLM orchestration design, is the primary performance bottleneck in structured operational domains.

🏢 Hugging Face · 🧠 GPT-4

AI · Bullish · arXiv – CS AI · May 27 · 6/10

ReasonOps: A Unified Operational Paradigm for Trustworthy Verified LLM Reasoning

Researchers introduce ReasonOps, a unified operational framework that treats AI reasoning as a continuously monitored and verifiable process rather than isolated inference. The paradigm integrates formal verification, symbolic reasoning, and runtime assurance to address critical reliability gaps in LLM-based reasoning systems, particularly for safety-critical applications.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation

Researchers have developed a mechanistic interpretability framework that reverses information flow through Chain-of-Thought prompting to understand how AI models reason. The study reveals CoT functions as a decoding space pruner that uses answer templates to guide outputs, with task-dependent neuron modulation that reduces activation in open-domain tasks but increases it in closed-domain scenarios.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

Researchers propose PTA-GRPO, a two-stage framework that enhances LLM reasoning by combining high-level planning with reinforcement learning. The method first guides models to summarize reasoning into compact guidance, then uses this guidance to optimize both final outputs and reasoning quality, demonstrating consistent improvements across ten benchmarks.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Vital Trace: Protocol-Constrained Patient-State Reasoning for Longitudinal Clinical Trajectories

Researchers present Vital Trace, a protocol-constrained multi-agent AI framework designed to improve clinical risk prediction in intensive care units by tracking patient trajectories over extended periods. The system uses compact patient-state memory and structured reasoning agents rather than unbounded text histories, demonstrating better temporal consistency and interpretability on MIMIC-IV and eICU datasets.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization

Researchers introduce AgentPSO, a framework that evolves multi-agent reasoning skills in large language models using particle swarm optimization principles. Rather than relying on static agents or inference-time debate, the system enables agents to iteratively improve their reasoning capabilities through self-reflection and collective learning, demonstrating improved performance and cross-benchmark transferability without modifying underlying model parameters.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

Researchers propose IMAX, a framework that uses trainable prefix tuning to improve exploration in reinforcement learning with verifiable rewards (RLVR) for language model reasoning. The approach addresses entropy collapse by creating diverse reasoning trajectories, achieving gains of up to 11.60% in Pass@4 accuracy across multiple model scales.
