y0news

#reasoning-models News & Analysis

92 articles tagged with #reasoning-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Researchers introduce POISE, a reinforcement learning method that uses a language model's internal hidden states to estimate baseline values for policy optimization, eliminating the computational overhead of separate critic models. The approach demonstrates comparable performance to existing methods while requiring significantly less compute, enabling more efficient training of large reasoning models.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Researchers introduce Prune-OPD, a framework that optimizes on-policy distillation for AI reasoning models by detecting when student predictions diverge from teacher guidance and dynamically truncating unreliable training sequences. The method reduces training time by 37-68% on challenging math benchmarks while maintaining or improving performance.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

KL for a KL: On-Policy Distillation with Control Variate Baseline

Researchers propose vOPD (On-Policy Distillation with control variate baseline), a stabilization technique for training large language models that reduces gradient variance without adding computational overhead. The method leverages reinforcement learning principles to make on-policy distillation more reliable and efficient, matching expensive full-vocabulary baselines while maintaining lightweight single-sample estimation.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

Researchers introduce CoCoReviewBench, a new benchmark dataset of 3,900 papers from ICLR and NeurIPS designed to reliably evaluate AI review systems. The benchmark addresses critical gaps in current evaluation methods by prioritizing correctness over mere overlap with human reviews, revealing that existing AI reviewers struggle with hallucinations and reasoning accuracy.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis

Researchers developed a novel framework for synthesizing training data that enables reasoning models to generate high-quality mathematical and reasoning problems by explicitly planning problem directions and adapting difficulty to solver capabilities. The approach achieved a 3.4% cumulative improvement across 10 benchmarks, demonstrating scalable alternatives to manual dataset curation.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models

Researchers introduce Miner, a novel reinforcement learning method that leverages a model's intrinsic uncertainty as a self-supervised reward signal to improve training efficiency for large reasoning models. The approach achieves state-of-the-art results on reasoning benchmarks, with performance gains up to 4.58 points in Pass@1 metrics compared to existing methods, addressing a critical inefficiency in current critic-free RL training.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Researchers introduce ThinkSafe, a self-generated safety alignment framework that improves AI reasoning models' resistance to harmful prompts without relying on external teacher models. The approach leverages models' latent safety knowledge through lightweight refusal steering, achieving superior safety outcomes compared to existing methods while preserving reasoning capabilities and reducing computational costs.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Test-Time Compute Games

Researchers identify a market inefficiency in LLM-as-a-service pricing where providers are financially incentivized to increase test-time compute usage beyond what meaningfully improves output quality, inflating costs for users. They propose a reverse second-price auction mechanism where providers compete on both price and quality, with users paying only for marginal value created relative to alternatives.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 9 · 6/10

OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models

Researchers demonstrate that On-Policy Self-Distillation (OPSD) functions primarily as a compression mechanism rather than a correction tool for thinking-enabled mathematical reasoning models. They propose a revised training pipeline (SFT → RLVR → OPSD) that leverages OPSD's strengths in shortening responses while preserving accuracy on correct outputs.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Researchers propose Listwise Policy Optimization (LPO), a new framework for training large language models that improves upon existing reinforcement learning approaches by explicitly projecting policies toward target distributions on the response simplex. The method demonstrates consistent performance improvements across reasoning tasks while maintaining training stability and response diversity.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

Retrieval-Augmented Reasoning for Chartered Accountancy

Researchers introduce CA-ThinkFlow, a parameter-efficient AI framework combining retrieval-augmented generation with a 14B quantized reasoning model to address chartered accountancy tasks in India. The system achieves performance comparable to GPT-4o and Claude 3.5 Sonnet while operating efficiently on limited resources, though it still struggles with complex regulatory reasoning in areas like taxation.

🧠 GPT-4 · 🧠 Claude

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

Researchers introduce AtManRL, a method that combines differentiable attention manipulation with reinforcement learning to improve the faithfulness of chain-of-thought reasoning in large language models. By training attention masks to identify which tokens genuinely influence model predictions, the approach demonstrates that LLM reasoning traces can be made more interpretable and transparent.

🧠 Llama

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

LLMs for Text-Based Exploration and Navigation Under Partial Observability

Researchers evaluated whether large language models can function as text-only controllers for navigation and exploration in unknown environments under partial observability. Testing nine contemporary LLMs on ASCII gridworld tasks, they found reasoning-tuned models reliably complete navigation goals but remain inefficient compared to optimal paths, with few-shot prompting reducing invalid moves and improving path efficiency.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model

Researchers demonstrate that deliberative alignment—a method for improving LLM safety by distilling reasoning from stronger models—still allows unsafe behaviors from base models to persist despite learning safer reasoning patterns. They propose a Best-of-N sampling technique that reduces attack success rates by 28-35% across multiple benchmarks while maintaining utility.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

Researchers introduced MMR-AD, a large-scale multimodal dataset designed to benchmark general anomaly detection using Multimodal Large Language Models (MLLMs). The study reveals that current state-of-the-art MLLMs fall short of industrial requirements for anomaly detection, though a proposed baseline model called Anomaly-R1 demonstrates significant improvements through reasoning-based approaches enhanced by reinforcement learning.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?

Researchers identify that reasoning language models exhibit worse performance in low-resource languages due to failures in language understanding rather than reasoning capability itself. The study proposes Selective Translation, which strategically adds English translations only when understanding failures are detected, achieving near full-translation performance while translating just 20% of inputs.

AI · Bullish · arXiv – CS AI · Apr 13 · 6/10

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Researchers introduce Sequence-Level PPO (SPPO), a new algorithm that improves how large language models are trained for reasoning tasks by addressing stability and computational efficiency issues in standard reinforcement learning approaches. SPPO matches the performance of resource-heavy methods while significantly reducing memory and computational costs, potentially accelerating LLM alignment for complex problem-solving.

AI · Bullish · arXiv – CS AI · Apr 13 · 6/10

Chain-in-Tree: Back to Sequential Reasoning in LLM Tree Search

Researchers introduce Chain-in-Tree (CiT), a framework that optimizes large language model tree search by selectively branching only when necessary rather than at every step. The approach reduces computational overhead by 75-85% on math reasoning tasks with minimal accuracy loss, making inference-time scaling more practical for resource-constrained deployments.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Reasoning Fails Where Step Flow Breaks

Researchers introduce Step-Saliency, a diagnostic tool that reveals how large reasoning models fail during multi-step reasoning tasks by identifying two critical information-flow breakdowns: shallow layers that ignore context and deep layers that lose focus on reasoning. They propose StepFlow, a test-time intervention that repairs these flows and improves model accuracy without retraining.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

On the Step Length Confounding in LLM Reasoning Data Selection

Researchers identify a critical flaw in naturalness-based data selection methods for large language model reasoning datasets, where algorithms systematically favor longer reasoning steps rather than higher-quality reasoning. The study proposes two corrective methods (ASLEC-DROP and ASLEC-CASL) that successfully mitigate this 'step length confounding' bias across multiple LLM benchmarks.

AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

Rectifying LLM Thought from Lens of Optimization

Researchers introduce RePro, a novel post-training technique that optimizes large language models' reasoning processes by framing chain-of-thought as gradient descent and using process-level rewards to reduce overthinking. The method demonstrates consistent performance improvements across mathematics, science, and coding benchmarks while mitigating inefficient reasoning behaviors in LLMs.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features

Researchers challenge the assumption that multilingual AI reasoning should simply mimic English patterns, finding that the features of effective reasoning vary significantly across languages. The study analyzed Large Reasoning Models across 10 languages and found that English-derived reasoning approaches may not transfer effectively to other languages, suggesting a need for adaptive, language-specific training methods.

AI · Bullish · arXiv – CS AI · Mar 26 · 6/10

Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

Researchers introduce Generative Adversarial Reasoner, a new training framework that improves LLM mathematical reasoning through adversarial reinforcement learning between a reasoner model and a discriminator model. The method achieved significant gains on mathematical benchmarks, improving DeepSeek models by 7-10 percentage points on AIME24.

🧠 Llama

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Shorten After You're Right: Lazy Length Penalties for Reasoning RL

Researchers propose a method for shortening the reasoning paths of large AI models such as OpenAI o1 and DeepSeek R1 without additional training stages. The approach integrates reward designs directly into reinforcement learning, achieving 40% shorter responses on logic tasks alongside a 14% performance improvement, and a 33% reduction on math problems while maintaining accuracy.

🏢 OpenAI · 🧠 o1