y0news

#mathematical-reasoning News & Analysis

58 articles tagged with #mathematical-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

Researchers propose TokUR, a framework that enables large language models to estimate uncertainty at the token level during reasoning tasks, allowing LLMs to self-assess response quality and improve performance on mathematical problems. The approach uses low-rank random weight perturbation to generate predictive distributions, demonstrating strong correlation with answer correctness and potential for enhancing LLM reliability.
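
The general idea behind weight-perturbation uncertainty can be illustrated with a toy numpy sketch (not the paper's implementation; the linear "model", shapes, sample count, and noise scale are all assumptions): sample several low-rank weight perturbations, average the resulting per-token distributions, and use the entropy of that predictive distribution as the token-level uncertainty signal.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_noise(shape, rank=2, scale=0.01):
    # Low-rank perturbation: U @ V with U (d_out x rank) and V (rank x d_in)
    u = rng.normal(size=(shape[0], rank))
    v = rng.normal(size=(rank, shape[1]))
    return scale * (u @ v)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_uncertainty(W, x, n_samples=16):
    """Per-token predictive entropy under low-rank weight perturbation."""
    probs = []
    for _ in range(n_samples):
        W_pert = W + low_rank_noise(W.shape)
        probs.append(softmax(x @ W_pert.T))
    p = np.mean(probs, axis=0)                     # averaged predictive distribution
    return -(p * np.log(p + 1e-12)).sum(axis=-1)   # entropy per token position

# toy "model": a linear map from hidden states to vocab logits
W = rng.normal(size=(50, 8))   # vocab=50, hidden=8
x = rng.normal(size=(5, 8))    # 5 token positions
print(token_uncertainty(W, x).shape)  # (5,)
```

High-entropy positions flag tokens where the perturbed ensemble disagrees, which is the kind of signal the paper correlates with answer correctness.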

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models

Researchers systematically evaluated how sampling temperature and prompting strategies affect extended reasoning performance in large language models, finding that zero-shot prompting peaks at moderate temperatures (T=0.4-0.7) while chain-of-thought performs better at extremes. The study reveals that extended reasoning benefits grow substantially with higher temperatures, suggesting that T=0 is suboptimal for reasoning tasks.
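
As a reminder of what the temperature knob actually does (a generic sampling sketch, not tied to any model in the study; the toy logits are made up):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """T -> 0 approaches greedy argmax; higher T flattens the distribution."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))
    z = logits / temperature
    z -= z.max()              # numerical stability
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5]
print(sample_with_temperature(logits, 0, rng))  # 0 (greedy picks the max logit)
```

The study's claim that T=0 is suboptimal amounts to saying the greedy branch above discards diversity that extended reasoning can exploit.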

🧠 Grok
AI · Bullish · arXiv – CS AI · 6d ago · 6/10

S³: Stratified Scaling Search for Test-Time Scaling in Diffusion Language Models

Researchers introduce S³ (Stratified Scaling Search), a test-time scaling method for diffusion language models that improves output quality by reallocating compute during the denoising process rather than simple best-of-K sampling. The technique uses a lightweight verifier to evaluate and selectively resample candidate trajectories at each step, demonstrating consistent performance gains across mathematical reasoning and knowledge tasks without requiring model retraining.
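
The difference from best-of-K can be sketched generically (toy Python; `step_fn`, `verifier`, and the numeric "trajectories" are stand-ins, not the paper's components): instead of scoring only finished samples, a lightweight verifier prunes and respawns candidates at every step.

```python
import random

def verifier_guided_search(candidates, n_steps, step_fn, verifier,
                           keep_frac=0.5, seed=0):
    """At each step: advance all candidates, rank them with the verifier,
    keep the top fraction, and resample the rest from the survivors."""
    rng = random.Random(seed)
    cands = list(candidates)
    for _ in range(n_steps):
        cands = [step_fn(c, rng) for c in cands]
        cands.sort(key=verifier, reverse=True)
        n_keep = max(1, int(len(cands) * keep_frac))
        survivors = cands[:n_keep]
        cands = survivors + [rng.choice(survivors)
                             for _ in range(len(cands) - n_keep)]
    return max(cands, key=verifier)

# toy problem: random walk toward 10; verifier scores closeness to the target
target = 10
best = verifier_guided_search(
    candidates=[0.0] * 8,
    n_steps=12,
    step_fn=lambda c, rng: c + rng.uniform(0.0, 1.0),
    verifier=lambda c: -abs(c - target),
)
```

Best-of-K would run all eight walks to completion and score once at the end; reallocating compute mid-trajectory is what the stratified search buys.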

AI · Bearish · arXiv – CS AI · 6d ago · 6/10

Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?

Researchers found that large language models experience accuracy drops of 0.3% to 5.9% when math problems are presented in unfamiliar cultural contexts, even when the underlying mathematical logic remains identical. Testing 14 models across culturally adapted variants of the GSM8K benchmark reveals that LLM mathematical reasoning is not culturally neutral, with errors stemming from both reasoning failures and calculation mistakes.

🏢 OpenAI · 🏢 Anthropic · 🧠 Claude
AI · Bullish · arXiv – CS AI · Apr 7 · 6/10

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

Researchers introduce vocabulary dropout, a technique to prevent diversity collapse in co-evolutionary language model training where one model generates problems and another solves them. The method sustains proposer diversity and improves mathematical reasoning performance by +4.4 points on average in Qwen3 models.
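
The mechanism is simple to sketch (illustrative Python; the function name and token-level granularity are assumptions, not the paper's exact procedure): each proposer round masks out a random slice of the vocabulary, so generated problems cannot keep collapsing onto the same phrasings.

```python
import random

def dropout_vocab(vocab, p_drop, rng):
    """Return the tokens the proposer may use this round; a random
    fraction p_drop of the vocabulary is masked out."""
    kept = [tok for tok in vocab if rng.random() >= p_drop]
    return kept if kept else [rng.choice(vocab)]  # never return an empty vocab

rng = random.Random(0)
vocab = [f"tok{i}" for i in range(100)]
round_vocab = dropout_vocab(vocab, p_drop=0.3, rng=rng)
```

Resampling the mask every round forces the problem generator onto different regions of its output space, which is the diversity the solver then trains against.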

AI · Bullish · arXiv – CS AI · Apr 6 · 6/10

LLM Reasoning with Process Rewards for Outcome-Guided Steps

Researchers introduce PROGRS, a new framework that improves mathematical reasoning in large language models by using process reward models while maintaining focus on outcome correctness. The approach addresses issues with current reinforcement learning methods that can reward fluent but incorrect reasoning steps.

AI · Bearish · arXiv – CS AI · Apr 6 · 6/10

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

A new study reveals that large language models, despite excelling at benchmark math problems, struggle significantly with contextual mathematical reasoning where problems are embedded in real-world scenarios. The research shows performance drops of 13-34 points for open-source models and 13-20 points for proprietary models when abstract math problems are presented in contextual settings.

AI · Bullish · arXiv – CS AI · Mar 26 · 6/10

Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

Researchers introduce Generative Adversarial Reasoner, a new training framework that improves LLM mathematical reasoning by using adversarial reinforcement learning between a reasoner and discriminator model. The method achieved significant performance gains on mathematical benchmarks, improving DeepSeek models by 7-10 percentage points on AIME24 tests.

🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models

Researchers propose GRPO (Group Relative Policy Optimization) combined with reflection reward mechanisms to enhance mathematical reasoning in large language models. The four-stage framework encourages self-reflective capabilities during training and outperforms existing approaches such as supervised fine-tuning and LoRA.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs

Researchers introduce AdaAnchor, a new AI reasoning framework that performs silent computation in latent space rather than generating verbose step-by-step reasoning. The system adaptively determines when to stop refining its internal reasoning process, achieving up to 5% better accuracy while reducing token generation by 92-93% and cutting refinement steps by 48-60%.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

More Agents Improve Math Problem Solving but Adversarial Robustness Gap Persists

Research reveals that while increasing the number of LLM agents improves mathematical problem-solving accuracy, these multi-agent systems remain vulnerable to adversarial attacks. The study found that human-like typos pose the greatest threat to robustness, and the adversarial vulnerability gap persists regardless of agent count.

🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs

Researchers introduced VisioMath, a new benchmark with 1,800 K-12 math problems designed to test Large Multimodal Models' ability to distinguish between visually similar diagrams. The study reveals that current state-of-the-art models struggle with fine-grained visual reasoning, often relying on shallow positional heuristics rather than proper image-text alignment.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Learning Structured Reasoning via Tractable Trajectory Control

Researchers propose Ctrl-R, a new framework that improves large language models' reasoning abilities by systematically discovering and reinforcing diverse reasoning patterns through structured trajectory control. The method enables better exploration of complex reasoning behaviors and shows consistent improvements across mathematical reasoning tasks in both language and vision-language models.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Tool Verification for Test-Time Reinforcement Learning

Researchers introduce T³RL (Tool-Verification for Test-Time Reinforcement Learning), a new method that improves self-evolving AI reasoning models by using external tool verification to prevent incorrect learning from biased consensus. The approach shows significant improvements on mathematical problem-solving tasks, with larger gains on harder problems.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

One-Token Verification for Reasoning Correctness Estimation

Researchers introduce One-Token Verification (OTV), a new method that estimates reasoning correctness in large language models during a single forward pass, reducing computational overhead. OTV reduces token usage by up to 90% through early termination while improving accuracy on mathematical reasoning tasks compared to existing verification methods.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Provable and Practical In-Context Policy Optimization for Self-Improvement

Researchers introduce In-Context Policy Optimization (ICPO), a new method that allows AI models to improve their responses during inference through multi-round self-reflection without parameter updates. The practical ME-ICPO algorithm demonstrates competitive performance on mathematical reasoning tasks while maintaining affordable inference costs.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Researchers propose Likelihood-Free Policy Optimization (LFPO), a new framework for improving Diffusion Large Language Models by bypassing likelihood computation issues that plague existing methods. LFPO uses geometric velocity rectification to optimize denoising logits directly, achieving better performance on code and reasoning tasks while reducing inference time by 20%.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Re4: Scientific Computing Agent with Rewriting, Resolution, Review and Revision

Researchers have developed Re4, a multi-agent AI framework that uses three specialized LLMs (Consultant, Reviewer, and Programmer) working collaboratively to solve scientific computing problems. The system employs a rewriting-resolution-review-revision process that significantly improves bug-free code generation and reduces non-physical solutions in mathematical and scientific reasoning tasks.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

RL for Reasoning by Adaptively Revealing Rationales

Researchers introduce AdaBack, a new reinforcement learning algorithm that uses partial supervision to help AI models learn complex reasoning tasks. The method dynamically adjusts the amount of guidance provided to each training sample, enabling models to solve mathematical reasoning problems that traditional supervised learning and reinforcement learning methods cannot handle.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning

Researchers propose Quantile Advantage Estimation (QAE) to stabilize Reinforcement Learning with Verifiable Rewards (RLVR) for large language model reasoning. The method replaces mean baselines with group-wise K-quantile baselines to prevent entropy collapse and explosion, showing sustained improvements on mathematical reasoning tasks.
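
The baseline swap is nearly a one-liner (numpy sketch; the function name and toy reward group are illustrative, not the paper's code): advantages are computed against a group quantile instead of the group mean.

```python
import numpy as np

def quantile_advantages(rewards, q=0.5):
    """Advantage of each rollout in a group, using the q-quantile of the
    group's rewards as the baseline instead of the group mean."""
    r = np.asarray(rewards, dtype=float)
    return r - np.quantile(r, q)

# sparse-reward group: three failed rollouts, one success
group = [0.0, 0.0, 0.0, 1.0]
print(quantile_advantages(group, q=0.5))  # [0. 0. 0. 1.]  (median baseline is 0)
print(np.mean(group))                     # mean baseline would be 0.25
```

With the mean baseline, the three failed rollouts would each get advantage -0.25, penalizing them even when nothing was learned; the median baseline leaves them at zero, which is the kind of gradient pressure the method targets to avoid entropy collapse.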

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Reliable Fine-Grained Evaluation of Natural Language Math Proofs

Researchers have developed ProofGrader, a new AI system that can reliably evaluate natural language mathematical proofs generated by large language models on a fine-grained 0-7 scale. The system was trained using ProofBench, the first expert-annotated dataset of proof ratings covering 145 competition math problems and 435 LLM solutions, achieving significant improvements over basic evaluation methods.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards

Researchers introduce PSN-RLVR, a new reinforcement learning method that uses parameter-space noise to improve AI exploration and reasoning capabilities. The technique addresses limitations in existing approaches by enabling better discovery of new problem-solving strategies rather than just reweighting existing solutions.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10

Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

Researchers propose SCOPE, a new framework for Reinforcement Learning with Verifiable Rewards (RLVR) that improves AI reasoning by salvaging partially correct solutions rather than discarding them entirely. The method achieves 46.6% accuracy on math reasoning tasks and 53.4% on out-of-distribution problems by using step-wise correction to maintain exploration diversity.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10

A Minimal Agent for Automated Theorem Proving

Researchers propose a minimal baseline architecture for AI-based theorem proving that achieves competitive performance with state-of-the-art systems while using significantly simpler design. The open-source implementation demonstrates that iterative proof refinement approaches are more sample-efficient and cost-effective than single-shot generation methods.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10

Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

Researchers identified why AI mathematical reasoning guidance is inconsistent and developed Selective Strategy Retrieval (SSR), a framework that improves AI math performance by combining human and model strategies. The method showed significant improvements of up to 13 points on mathematical benchmarks by addressing the gap between strategy usage and executability.

Page 2 of 3