#mathematical-reasoning News & Analysis

136 articles tagged with #mathematical-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3

LEDOM: Reverse Language Model

Researchers have developed LEDOM, an open-source reverse autoregressive language model that trains right-to-left instead of the traditional left-to-right approach. The model demonstrates unique capabilities like abductive inference and question synthesis, and when combined with forward models through 'Reverse Reward' scoring, achieves significant performance gains of up to 15% on mathematical reasoning tasks.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 5

NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect

Researchers introduce NeuroProlog, a neurosymbolic framework that improves mathematical reasoning in Large Language Models by converting math problems into executable Prolog programs. The multi-task 'Cocktail' training approach shows significant accuracy improvements of 3-5% across different model sizes, with larger models demonstrating better error correction capabilities.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3

LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning

Researchers introduce LaDiR (Latent Diffusion Reasoner), a novel framework that combines continuous latent representation with iterative refinement capabilities to enhance Large Language Models' reasoning abilities. The system uses a Variational Autoencoder to encode reasoning steps and a latent diffusion model for parallel generation of diverse reasoning trajectories, showing improved accuracy and interpretability in mathematical reasoning benchmarks.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 5

DAG-Math: Graph-of-Thought Guided Mathematical Reasoning in LLMs

Researchers introduce DAG-Math, a new framework for evaluating mathematical reasoning in Large Language Models that models Chain-of-Thought as rule-based processes over directed acyclic graphs. The framework includes a 'logical closeness' metric that reveals significant differences in reasoning quality between LLM families, even when final answer accuracy appears comparable.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Researchers released two open-source datasets, SwallowCode and SwallowMath, that significantly improve large language model performance in coding and mathematics through systematic data rewriting rather than filtering. The datasets boost Llama-3.1-8B performance by +17.0 on HumanEval for coding and +12.4 on GSM8K for math tasks.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent

Researchers introduced AgentMath, a new AI framework that combines language models with code interpreters to solve complex mathematical problems more efficiently than current Large Reasoning Models. The system achieves state-of-the-art performance on mathematical competition benchmarks, with AgentMath-30B-A3B reaching 90.6% accuracy on AIME24 while remaining competitive with much larger models like OpenAI-o3.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 7

LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories)

Researchers introduced LeanCat, a benchmark of 100 category-theory tasks in Lean that tests AI formal theorem proving. State-of-the-art models achieved a success rate of only 12%, revealing significant limitations in abstract mathematical reasoning, while a new retrieval-augmented approach doubled that rate to 24%.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 6

Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Researchers identify a critical trade-off in AI model training where optimizing for Pass@k metrics (multiple attempts) degrades Pass@1 performance (single attempt). The study reveals this occurs due to gradient conflicts when the training process reweights toward low-success prompts, creating interference that hurts single-shot performance.
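To make the two metrics concrete, here is a minimal sketch of the widely used unbiased pass@k estimator, applied to two hypothetical prompts. It is not the paper's analysis; the function name and the sample counts are assumptions chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which were correct.

    This is the probability that at least one of k attempts, drawn without
    replacement from the n samples, is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets that contain no correct attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical prompts, 8 sampled attempts each.
easy_pass1 = pass_at_k(n=8, c=6, k=1)  # 0.75
easy_pass4 = pass_at_k(n=8, c=6, k=4)  # 1.0
hard_pass1 = pass_at_k(n=8, c=1, k=1)  # 0.125
hard_pass4 = pass_at_k(n=8, c=1, k=4)  # 0.5
```

In this toy case the low-success prompt gains far more from extra attempts than the high-success one, which gives some intuition for why a Pass@k objective can shift weight toward such prompts.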

AI · Bullish · OpenAI News · May 31 · 7/10 · 9

Improving mathematical reasoning with process supervision

Researchers have developed a new AI training method called 'process supervision' that rewards each correct reasoning step rather than just the final answer, achieving state-of-the-art performance in mathematical problem solving. This approach not only improves performance but also ensures the AI's reasoning process aligns with human-endorsed thinking patterns.

AI · Bullish · arXiv – CS AI · Jun 25 · 6/10

ExTra: Exploratory Trajectory Optimization for Language Model Reinforcement Learning

Researchers introduce ExTra, a reinforcement learning framework that improves language model reasoning by extracting exploration signals from model rollouts. The method combines novelty rewards for diverse solutions with entropy-guided trajectory regeneration, achieving 5-7 point improvements over baseline GRPO across mathematical reasoning benchmarks.
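For context on the GRPO baseline named above, the sketch below shows the group-relative advantage that GRPO is generally described as using: each rollout's reward is normalized against the other rollouts for the same prompt. It does not reproduce ExTra's novelty rewards or trajectory regeneration. The function name, the epsilon term, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by its group's mean and spread.

    `rewards` holds one scalar reward per rollout sampled for a single
    prompt. `eps` (an assumed stabilizer) avoids division by zero when
    every rollout in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When all rollouts in a group share one reward, every advantage is zero and the group carries no learning signal, which is one reason extra exploration signals are of interest.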

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Learning with a Single Rollout via Monte Carlo Pass@k Critic

Researchers propose SR-PPO, a reinforcement learning method that trains language models using single rollouts and Monte Carlo Pass@k critics for token-level credit assignment. The approach reduces computational costs while improving reasoning performance on mathematical benchmarks like HMMT26 and AIME24 by using reachability-based advantage estimation instead of repeated sampling.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

VeriEvol is a new framework for scaling multimodal mathematical reasoning in AI by treating data creation as a verifiable problem, combining evolved prompts with a multi-source verifier to ensure answer reliability. Testing shows the approach increases visual math accuracy from 35.42% to 54.73% when scaling from 10K to 250K samples, with reinforcement learning adding a further 3.88 percentage points.

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

Researchers introduce the Independent Combinatorial Tokens (ICT) framework to improve Large Language Model reasoning by addressing entropy collapse and explosion problems in reinforcement learning. Using Jensen-Shannon divergence to identify critical token branching points, ICT achieves 4.58% average improvement in pass@4 scores across math, commonsense, and Olympiad benchmarks on Qwen models.
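As background on the measure mentioned above, this is a minimal sketch of Jensen-Shannon divergence between two next-token distributions. How ICT actually uses it to pick branching points is not described here, so the function names and example distributions are assumptions for illustration only.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits, skipping zero terms of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and within [0, 1] when using log base 2."""
    # The mixture m is positive wherever p or q is, so the KL terms stay finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical distributions over a three-token vocabulary.
same = js_divergence([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])      # identical
disjoint = js_divergence([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # no overlap
```

Unlike plain KL divergence, this quantity is symmetric and stays finite when one distribution assigns zero probability to a token the other favors.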

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

Researchers introduce CombEval, a dynamic benchmark framework for evaluating how well large language models handle combinatorial counting problems. Testing 11 LLMs reveals significant brittleness in handling ordered objects, indistinguishable elements, and nested dependencies, with code-augmented approaches showing modest improvements over direct reasoning.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Too long; didn't solve

A new study of the mathematical benchmarks used to evaluate large language models finds that both prompt length and solution length correlate with higher model failure rates. The research, conducted on an adversarial dataset of expert-authored math problems, indicates that structural complexity is a significant factor in how difficult a problem is for models.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

Researchers introduce DiRL, a reinforcement learning framework that distinguishes between genuine reasoning and memorization in large language models by anchoring exploration to an internal reasoning-memorization direction. The method integrates with Group Relative Policy Optimization to improve performance on mathematical and reasoning benchmarks while suppressing exploration of memorized shortcuts.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Researchers introduce ComBench, a new benchmark containing 100 Olympiad-level combinatorics problems designed to evaluate large language models' mathematical reasoning capabilities. The benchmark reveals that even frontier models struggle with combinatorial problems, with the best performance reaching only 65.4%, and identifies that rigorous proof reasoning and constructive problem-solving are distinct capabilities that models handle unevenly.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Evaluating Research-Level Math Proofs via Strict Step-Level Verification

Researchers developed a step-level verification framework that improves Large Language Models' ability to evaluate complex mathematical proofs by maintaining detailed context for each deduction and constraining theorem sources, rather than relying on global evaluation. Testing on research-level proofs showed that unconstrained approaches miss subtle logical errors, and that the new method's remaining verification failures stem from implicit domain conventions rather than hallucinations.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

PAEC: Position-Aware Entropy Calibration for LLM Reasoning in RLVR

Researchers propose Position-Aware Entropy Calibration (PAEC), a novel technique that selectively manages entropy in reinforcement learning systems used to improve large language model reasoning. The method addresses policy-entropy collapse by applying targeted entropy penalties only at decision-critical token positions rather than uniformly across all tokens, demonstrating improved performance on mathematical reasoning benchmarks.
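The contrast between uniform and position-selective entropy terms can be sketched as follows. This is only an illustration under stated assumptions: the summary does not say how PAEC identifies decision-critical positions, so the boolean mask here is hypothetical, as are the function names.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def masked_entropy_term(step_dists, mask):
    """Average entropy over only the positions where mask is True.

    `step_dists` is a list of next-token distributions, one per position.
    `mask` is a hypothetical per-position flag marking the positions that
    should contribute; passing all True gives the uniform average instead.
    """
    if len(step_dists) != len(mask):
        raise ValueError("step_dists and mask must have the same length")
    selected = [token_entropy(d) for d, keep in zip(step_dists, mask) if keep]
    return sum(selected) / len(selected) if selected else 0.0


# Hypothetical three-position rollout over a two-token vocabulary.
dists = [[1.0, 0.0], [0.5, 0.5], [0.25, 0.75]]
uniform = masked_entropy_term(dists, [True, True, True])
targeted = masked_entropy_term(dists, [False, True, False])
```

A uniform term treats near-deterministic positions and genuinely uncertain ones alike, while the masked version concentrates on the flagged positions only.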

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

Researchers introduce ISPO (Intrinsic Signal Policy Optimization), a new reinforcement learning method that improves long-chain reasoning in large language models by densifying reward signals with intrinsic metrics derived from the model's own probabilities. The approach addresses critical failure modes in existing GRPO-based methods and shows consistent improvements across mathematical reasoning benchmarks.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

(Auto)formalization is supposed to be easy: Trellis process semantics for spelling out rigorous proofs

Researchers present Trellis, an autoformalization system that uses LLM agents within constrained workflows to convert natural language mathematical proofs into Lean formal code. The system achieves reliable formalization on modest computational budgets by enforcing incremental progress through iterative refinement, demonstrated by formalizing a recent Ramsey theory breakthrough.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

Researchers introduce CLPO, a curriculum learning framework that dynamically adapts training difficulty for large language models during reinforcement learning. The approach automatically identifies solved, medium, and hard problems, then strategically restructures tasks to match the model's evolving capabilities, achieving substantial improvements over existing methods on mathematical and reasoning benchmarks.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

Researchers introduce CrowdMath, a dataset of 164 expert-annotated collaborative mathematical problem-solving discussions from MIT PRIMES and Art of Problem Solving (2016-2025). While frontier AI models achieve 83-88% accuracy in predicting next posts, they struggle significantly with understanding the functional roles of contributions in mathematical reasoning, revealing a gap between solving isolated problems and comprehending collaborative research progress.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

Characterizing initial human-AI proof formalization workflows

Researchers conducted mixed-methods studies on how mathematicians use AI tools to formalize proofs, finding that users prefer AI assistance while maintaining high-level control over proof discovery. A controlled user study showed participants achieved higher formalization accuracy with AI access than without, despite current tool limitations.

AI · Neutral · arXiv – CS AI · Jun 3 · 6/10

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

Researchers introduced GTBench, a curriculum-based benchmark with 63 graph theory problems designed to evaluate LLMs as mathematical research assistants. Testing five frontier models revealed significant performance gaps, with GPT-5 substantially outperforming competitors on advanced proofs while all models struggled with graduate-level reasoning, raising concerns about AI reliability in technical education and research.

🧠 GPT-5 🧠 Claude 🧠 Sonnet
