y0news

#reasoning-benchmarks News & Analysis

12 articles tagged with #reasoning-benchmarks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

PuzzleClone: A DSL-Powered Framework for Synthesizing Verifiable Data

Researchers introduce PuzzleClone, a DSL-driven framework that automatically synthesizes large-scale, verifiable datasets for training LLMs on mathematical and logical reasoning tasks. The team generates PC-83K, a benchmark of 83,000+ diverse puzzles, and demonstrates that models fine-tuned on this dataset achieve substantial performance improvements across multiple logic and mathematical benchmarks.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

Researchers introduce OPT-BENCH, a framework for training LLMs on NP-hard optimization problems using quality-aware reinforcement learning. On Qwen2.5-7B, the approach achieves a 93.1% success rate and a 46.6% quality ratio, substantially outperforming GPT-4o, with demonstrated transfer benefits across mathematics, logic, and reasoning tasks.

🧠 GPT-4
AI · Bullish · arXiv – CS AI · May 12 · 7/10

Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning

Researchers propose Theorem-SFT, a novel supervised fine-tuning approach that teaches language models to apply mathematical rules explicitly rather than memorize surface-level correlations between problems and solutions. The method demonstrates significant performance improvements across benchmarks while revealing that feed-forward layers, not memorization itself, are the primary locus of reasoning capability.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

Researchers introduce rubric-grounded reinforcement learning, a framework that trains AI models using structured, multi-criterion rewards from an LLM judge rather than binary outcomes. Training Llama-3.1-8B on scientific documents achieved 71.7% normalized reward and demonstrated improved performance on multiple reasoning benchmarks, suggesting that document-grounded training signals can produce generalizable reasoning capabilities.

🧠 Llama
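
As a rough illustration of what a structured, multi-criterion reward might look like (as opposed to a binary outcome), the sketch below combines per-criterion judge scores into one normalized value. The function name `rubric_reward`, the criterion names, the weights, and the 0 to 5 scoring scale are all assumptions made for illustration; the summary does not specify how the paper aggregates its rubric.

```python
from typing import Mapping


def rubric_reward(
    scores: Mapping[str, float],
    weights: Mapping[str, float],
    max_score: float = 5.0,  # assumed judge scale of 0..max_score per criterion
) -> float:
    """Combine per-criterion judge scores into one reward in [0, 1].

    Hypothetical aggregation: a weighted mean of the scores,
    divided by the maximum score so the result is normalized.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * scores[name] for name in weights)
    return weighted / (total_weight * max_score)


# A binary reward would only say "right" or "wrong"; here partial credit
# on individual criteria still contributes to the training signal.
reward = rubric_reward(
    scores={"accuracy": 5.0, "clarity": 0.0},
    weights={"accuracy": 1.0, "clarity": 1.0},
)
```

With equal weights, a response that scores full marks on one criterion and zero on the other receives a reward of 0.5 rather than an all-or-nothing outcome.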
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Training Language Models via Neural Cellular Automata

Researchers developed a method using neural cellular automata (NCA) to generate synthetic data for pre-training language models, achieving up to 6% improvement in downstream performance with only 164M synthetic tokens. This approach outperformed traditional pre-training on 1.6B natural language tokens while being more computationally efficient and transferring well to reasoning benchmarks.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models

Researchers introduce Dynamic Infilling Anchors (DIA), a training-free method that improves how diffusion large language models generate structured outputs like JSON or reasoning templates. By dynamically adjusting generation length constraints, DIA achieves better format compliance and accuracy on mathematical reasoning benchmarks without requiring model retraining.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Researchers propose Bottom-up Policy Optimization (BuPO), a novel reinforcement learning approach that optimizes internal layers of language models rather than treating them as unified policies. The study reveals that LLMs contain distinct internal policy structures with different entropy patterns across layers, offering new insights into how transformer-based models process reasoning tasks.

🧠 Llama
AI · Neutral · arXiv – CS AI · May 28 · 6/10

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

Researchers propose SC-SDPO, a pass-rate weighted self-distillation method in which a large language model learns from its own feedback during training. By weighting training examples according to question difficulty, the method achieves 3-4% performance gains on reasoning benchmarks while maintaining stable training dynamics.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering and Reasoning

Researchers introduce EpiQAL, the first benchmark for evaluating large language models on epidemiological reasoning tasks. Tests on 15 models reveal significant performance gaps in multi-step inference and evidence synthesis, indicating that current LLMs struggle with population-level disease analysis despite their general capabilities.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

NoisyCoconut is an inference-time method that improves LLM reliability by injecting controlled noise into internal representations to generate diverse reasoning paths, enabling models to abstain when uncertain without requiring retraining. The technique reduces error rates from 40-70% to below 15% on mathematical reasoning tasks through unanimous agreement among noise-perturbed paths, offering practical reliability improvements compatible with existing models.
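
The unanimous-agreement rule in the NoisyCoconut summary can be sketched as a small aggregation step: collect the final answer from each noise-perturbed reasoning path, return it only if every path agrees, and abstain otherwise. This is a minimal sketch of that decision rule only. The latent-space noise injection itself is not shown, and the function name `unanimous_or_abstain` and the use of `None` to signal abstention are assumptions, not details from the paper.

```python
from typing import Hashable, Optional, Sequence


def unanimous_or_abstain(answers: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the shared answer if all paths agree, else None (abstain).

    `answers` holds the final answer from each noise-perturbed path;
    how those paths are produced is outside this sketch.
    """
    if not answers:
        return None  # nothing to agree on, so abstain
    first = answers[0]
    return first if all(a == first for a in answers) else None


# All perturbed paths agree, so the answer is accepted.
accepted = unanimous_or_abstain(["42", "42", "42"])

# One path disagrees, so the model abstains instead of guessing.
abstained = unanimous_or_abstain(["42", "41", "42"])
```

Requiring full agreement trades coverage for reliability: the stricter the consensus rule, the more questions go unanswered, but the fewer wrong answers are emitted.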

AI · Bullish · arXiv – CS AI · May 12 · 6/10

HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

Researchers introduce HTPO, a novel reinforcement learning algorithm that optimizes large language models by assigning different learning objectives to different tokens based on their functional roles in reasoning tasks. The method achieves significant performance improvements on challenging benchmarks like AIME, demonstrating that granular token-level control can better balance exploration and exploitation in AI training.