#reasoning-models News & Analysis

138 articles tagged with #reasoning-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

Researchers propose a consequence-aware compute allocation system for reasoning models that prioritizes high-impact tasks based on real-world failure costs rather than just predicted difficulty. Testing on software engineering benchmarks shows the method reduces cost-weighted loss by 22-33% compared to difficulty-based routing, with a practical predictor-driven variant retaining over 90% of theoretical gains.
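As a rough illustration of the idea in this summary, the sketch below contrasts difficulty-based routing with consequence-aware routing on a toy task set. Everything here is an assumption made for illustration: the task fields (`p_fail`, `cost`), the `reduction` factor for extra compute, and the numbers themselves are hypothetical and are not taken from the paper.

```python
def route(tasks, budget, key):
    """Pick the `budget` tasks that rank highest under `key` for extra compute."""
    ranked = sorted(tasks, key=key, reverse=True)
    return {t["id"] for t in ranked[:budget]}


def cost_weighted_loss(tasks, boosted, reduction=0.5):
    """Expected loss: failure probability times failure cost, summed over tasks.

    Assumption: extra compute multiplies a task's failure probability
    by `reduction` (a hypothetical constant).
    """
    total = 0.0
    for t in tasks:
        p_fail = t["p_fail"] * (reduction if t["id"] in boosted else 1.0)
        total += p_fail * t["cost"]
    return total


# Toy tasks: predicted failure probability and real-world cost of a failure.
tasks = [
    {"id": "A", "p_fail": 0.8, "cost": 1.0},   # hard but cheap to get wrong
    {"id": "B", "p_fail": 0.4, "cost": 10.0},  # moderate but expensive to get wrong
    {"id": "C", "p_fail": 0.2, "cost": 2.0},
]

# Difficulty-based routing ranks by predicted failure probability alone.
by_difficulty = route(tasks, budget=1, key=lambda t: t["p_fail"])
# Consequence-aware routing ranks by expected cost of failure.
by_consequence = route(tasks, budget=1, key=lambda t: t["p_fail"] * t["cost"])

loss_difficulty = cost_weighted_loss(tasks, by_difficulty)
loss_consequence = cost_weighted_loss(tasks, by_consequence)
```

In this toy setup the difficulty-based router spends its single slot on task A, while the consequence-aware router spends it on task B, which yields the lower cost-weighted loss.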

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots

Researchers propose PivotTrace, a data-efficient framework for training large reasoning models that selects unlabeled samples for annotation without prior supervision. The method achieves 29.3% annotation efficiency while converging 2.75x faster than standard supervised approaches by leveraging attention dynamics to quantify uncertainty.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

Researchers propose a rollout-level advantage-prioritized experience replay system for GRPO (Group Relative Policy Optimization) that improves sample efficiency in LLM post-training. By storing individual rollouts with age-based eviction and prioritizing high-advantage samples, the method achieves 4.35 percentage point gains on math benchmarks while maintaining on-policy data freshness.
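The buffer described in this summary might look roughly like the minimal sketch below. The class and field names (`Rollout`, `AdvantageReplayBuffer`, `max_age`) are hypothetical, and ranking by advantage magnitude is an assumption; the paper's actual priority rule and eviction policy may differ.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt_id: str
    tokens: list
    advantage: float  # group-relative advantage computed when the rollout was scored
    step: int         # training step at which the rollout was generated


class AdvantageReplayBuffer:
    """Stores individual rollouts, evicts stale ones, replays high-advantage ones."""

    def __init__(self, max_age: int):
        self.max_age = max_age
        self.items: list[Rollout] = []

    def add(self, rollout: Rollout) -> None:
        self.items.append(rollout)

    def evict(self, current_step: int) -> None:
        # Age-based eviction: drop rollouts older than `max_age` steps so that
        # replayed data stays close to the current policy.
        self.items = [r for r in self.items if current_step - r.step <= self.max_age]

    def sample(self, k: int, current_step: int) -> list[Rollout]:
        # Assumption: priority is the magnitude of the advantage.
        self.evict(current_step)
        return heapq.nlargest(k, self.items, key=lambda r: abs(r.advantage))


buf = AdvantageReplayBuffer(max_age=2)
buf.add(Rollout("a", [1], advantage=0.9, step=0))   # will be too old at step 4
buf.add(Rollout("b", [2], advantage=0.1, step=3))
buf.add(Rollout("c", [3], advantage=-0.5, step=4))
picked = buf.sample(k=2, current_step=4)
```

Here rollout "a" has the largest advantage but is evicted for age, so the replayed batch contains "c" followed by "b".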

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Learning When to Translate for Multilingual Reasoning

Researchers introduce Luar, a reinforcement learning framework that trains reasoning language models to selectively translate non-English inputs to English only when necessary for reliable reasoning. The approach achieves superior multilingual reasoning performance compared to standard baselines, particularly benefiting low-resource languages while avoiding unnecessary translation overhead.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Recognize Your Orchestrator: An Entropy Dynamics Perspective for LLM Multi-Agent Systems

Researchers propose a Mean-Field Entropy Dynamics framework to analyze failure modes in Large Language Model multi-agent systems, identifying a "Reasoning Trap" where sophisticated reasoning models paradoxically perform poorly as orchestrators due to context limitations. The study introduces Inverse Workflow Generation for benchmarking and provides physically interpretable parameters for predicting system stability.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models

Researchers introduce CodeGolf Bench, a new benchmark for evaluating Large Language Models' ability to generate concise code across 60 programming languages. The study reveals that reasoning-capable models significantly outperform standard LLMs, achieving 70.97% average percentile performance on code golf tasks, particularly excelling in languages with strict syntax requirements.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation

LARK introduces a learnability-grounded approach to trajectory selection for reasoning distillation, enabling student models to learn more efficiently from teacher-generated reasoning paths. The method uses a learnability factor to identify trajectories that maximize learning speed while maintaining distributional coverage, outperforming existing heuristic-based selection methods across multiple reasoning tasks.

AI · Neutral · arXiv – CS AI · Jun 1 · 5/10

Trust-Region Behavior Blending for On-Policy Distillation

Researchers propose Trust-Region Behavior Blending (TRB), a warmup technique that improves on-policy distillation by having student models learn from a teacher-aligned policy during early training stages rather than from weak student rollouts. The method anneals the constraint over time until training returns to the pure student policy, and it demonstrates stronger performance on math-reasoning tasks.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

Researchers identify harmful continuation in long chain-of-thought training data where LLMs continue reasoning after the answer is sufficiently supported, degrading fine-tuning performance. Using a delete-only editor, they remove post-conclusion continuations and demonstrate improved SFT outcomes, introducing Harmful Continuation Cut (HCC) as a lightweight solution to detect and eliminate this problematic pattern.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Rubric-Guided Process Reward for Stepwise Model Routing

Researchers introduce RoRo, a novel framework for stepwise model routing in Large Reasoning Models that uses process-based rewards rather than outcome-only rewards to evaluate intermediate routing decisions. The approach combines rubric-guided evaluation with reinforcement learning to improve efficiency and accuracy across multiple reasoning benchmarks.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

Researchers propose Reasoning-Conditioned Direct Preference Optimization (RC-DPO), a training method that reduces hallucinations in multimodal large reasoning models by treating chain-of-thought reasoning as a condition for answer generation rather than a monolithic output. The approach uses Monte Carlo Tree Search to generate better training data and demonstrates improved reliability across multiple benchmarks.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces

Researchers analyzed backtracking patterns in reasoning traces from the Qwen3-8B model, finding that correct reasoning typically shows early, isolated self-corrections while incorrect reasoning exhibits persistent, clustered revisions occurring late in traces. The study demonstrates that burst-aware filtering of reasoning traces can improve model reliability by identifying unstable reasoning patterns before completion.

AI · Neutral · arXiv – CS AI · May 28 · 5/10

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Researchers introduce REFT, a method that improves Reinforcement Learning with Verifiable Rewards (RLVR) by diversifying the first token generated after reasoning markers, addressing a previously overlooked bottleneck in rollout diversity. The technique achieves measurable improvements across multiple model sizes and difficulty levels without requiring changes to existing RLVR pipelines.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Researchers introduced HRBench, a unified evaluation framework for testing hybrid-reasoning LLMs that allow dynamic switching between fast and slow reasoning modes. The framework systematically compares 12+ prior methods across three switching strategy families and four training approaches, revealing that prompt-based methods offer better token-accuracy trade-offs while routing methods provide more stable cost reduction.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

ADWIN is a new framework for on-policy distillation that optimizes training efficiency by adaptively adjusting rollout lengths instead of requiring full completions for every update. The method reduces training costs by up to 4.1x while maintaining or improving accuracy on math and code reasoning tasks by identifying when shorter teacher-anchored sequences contain sufficient signal for learning.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning

Researchers introduce ECHO, a novel test-time reinforcement learning algorithm that addresses rollout collapse and noisy pseudo-labels through entropy-confidence hybrid optimization. The method improves sampling efficiency and training robustness across mathematical and visual reasoning benchmarks while performing better under limited computational budgets.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

A controlled study of 432 experiments across six LLMs challenges the assumption that higher-capability models require less structural guidance. The research reveals non-monotone harness sensitivity patterns, where frontier models like Gemini 2.5 Flash show performance degradation with increased harness complexity, while reasoning-focused models benefit from stricter constraints.

🧠 Gemini

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

Researchers conducted a controlled study on reinforcement learning with verifiable rewards (RLVR) for reasoning models, revealing that training data allocation across multiple reasoning dimensions—depth, environment complexity, and reasoning types—significantly impacts model performance. The study found that joint coverage of these dimensions outperforms single-axis training approaches, and that models exhibit systematic weaknesses in abductive reasoning regardless of training setup.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

Researchers demonstrate that reasoning-capable LLMs improve judgment accuracy significantly on complex tasks like math and coding, but offer minimal or negative benefits on simpler evaluations while consuming substantially more computational resources. They introduce RACER, an adaptive routing algorithm that dynamically selects between reasoning and non-reasoning judges under budget constraints while accounting for distribution shifts.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

Researchers introduce OPT-BENCH, a benchmark evaluating whether large language models can self-improve through iterative feedback in complex problem spaces. Testing 19 LLMs across machine learning and NP-hard problems reveals that while stronger models adapt better, even the most advanced systems remain constrained by their base capabilities and fall short of human expert performance.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Internalizing Safety Understanding in Large Reasoning Models via Verification

Researchers propose Safety Internal (SInternal), a framework that trains large reasoning models to verify the safety of their own outputs rather than relying on external compliance mechanisms. The approach demonstrates that models can internalize safety understanding through verification tasks, significantly improving robustness against adversarial jailbreaks and out-of-domain attacks.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Researchers introduce POISE, a reinforcement learning method that uses a language model's internal hidden states to estimate baseline values for policy optimization, eliminating the computational overhead of separate critic models. The approach demonstrates comparable performance to existing methods while requiring significantly less compute, enabling more efficient training of large reasoning models.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Researchers introduce Prune-OPD, a framework that optimizes on-policy distillation for AI reasoning models by detecting when student predictions diverge from teacher guidance and dynamically truncating unreliable training sequences. The method reduces training time by 37-68% on challenging math benchmarks while maintaining or improving performance.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

KL for a KL: On-Policy Distillation with Control Variate Baseline

Researchers propose vOPD (On-Policy Distillation with control variate baseline), a stabilization technique for training large language models that reduces gradient variance without adding computational overhead. The method leverages reinforcement learning principles to make on-policy distillation more reliable and efficient, matching expensive full-vocabulary baselines while maintaining lightweight single-sample estimation.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

Researchers introduce CoCoReviewBench, a new benchmark dataset of 3,900 papers from ICLR and NeurIPS designed to reliably evaluate AI review systems. The benchmark addresses critical gaps in current evaluation methods by prioritizing correctness over mere overlap with human reviews, revealing that existing AI reviewers struggle with hallucinations and reasoning accuracy.
