y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#ai-reasoning News & Analysis

72 articles tagged with #ai-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

72 articles
AINeutralarXiv – CS AI · 2d ago7/10
🧠

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

FormInv introduces a measurement protocol that audits mathematical reasoning benchmarks for semantic consistency, revealing that current evaluation methods mask significant ranking volatility across AI models. The study found 3.1% semantically incorrect paraphrases in MathCheck that altered model rankings and discovered that models achieving similar accuracy scores (86-96%) exhibit drastically different consistency rates (50-82%) when tested against semantically equivalent problem restatements.

🧠 GPT-4🧠 Claude🧠 Haiku
AINeutralarXiv – CS AI · 3d ago7/10
🧠

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Researchers prove that large language models fundamentally cannot perform causal discovery through standard training methods, establishing this limitation as intrinsic to supervised learning rather than a model-specific flaw. They propose Agentic Causal Bayesian Optimization (A-CBO), which bypasses this constraint by using frozen language models as query oracles within an external optimization loop, achieving superior performance on causal inference benchmarks.

AIBullisharXiv – CS AI · 3d ago7/10
🧠

Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience

Researchers demonstrate that knowledge graphs extracted from a single neuroscience textbook can be converted into high-quality training data to fine-tune language models, enabling expert-level reasoning that outperforms larger LLMs while using far fewer parameters. This approach challenges the prevailing assumption that domain expertise requires massive, diverse datasets, showing instead that structured, curated knowledge can produce superior specialized AI systems.

AIBearisharXiv – CS AI · 4d ago7/10
🧠

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

Researchers introduced LiveK12Bench, a dynamic benchmark for evaluating Large Multimodal Models on realistic high school examinations across multiple disciplines. The study reveals that advanced LMMs like GPT-4 experience significant performance degradation when subjected to exam-realistic constraints, dropping from 79 to 53 points when process rigor and efficiency are jointly evaluated, exposing critical gaps between theoretical capabilities and practical educational readiness.

🧠 GPT-5
AIBullisharXiv – CS AI · May 127/10
🧠

CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

Researchers introduce CauSim, a framework that enables large language models to improve causal reasoning by constructing increasingly complex executable causal simulators. The approach transforms causal reasoning from a scarce-data problem into a scalable supervised learning task, allowing LLMs to generate synthetic training data and demonstrate improved performance across different representations.

AIBullisharXiv – CS AI · May 127/10
🧠

TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

TimeClaw is a new AI framework that improves how large language models analyze time-series data by learning from exploratory execution rather than just solving individual problems. The system uses a four-stage loop to compare, distill, and reuse successful reasoning patterns, showing consistent improvements over baseline models in finance and weather prediction tasks.

AIBullisharXiv – CS AI · May 127/10
🧠

M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

Researchers introduce M2A, a novel model merging paradigm that combines mathematical and agentic reasoning in large language models without retraining. The approach improves a Qwen3-8B model's software engineering benchmark performance from 44.0% to 51.2% by strategically injecting mathematical reasoning capabilities along directions that preserve agent behavior.

AIBullisharXiv – CS AI · May 127/10
🧠

Deep Arguing

Researchers introduce Deep Arguing, a neurosymbolic method that combines deep learning with argumentation reasoning to create interpretable AI classification models. The approach constructs argumentative structures where data points support or attack predictions, enabling end-to-end learning while providing human-understandable explanations for model decisions.

AIBullisharXiv – CS AI · May 17/10
🧠

ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

Researchers introduce ANCORA, a self-play framework enabling language models to generate verifiable problems, solve them, and improve without human supervision. The method achieves 81.5% pass rate on Dafny2Verus tasks, significantly outperforming baseline approaches and demonstrating advances in autonomous AI reasoning capabilities.

AIBullisharXiv – CS AI · Apr 147/10
🧠

AI Achieves a Perfect LSAT Score

A frontier language model has achieved a perfect score on the LSAT, marking the first documented instance of an AI system answering all questions without error on the standardized law school admission test. Research shows that extended reasoning and thinking processes are critical to this performance, with ablation studies revealing up to 8 percentage point drops in accuracy when these mechanisms are removed.

AIBullisharXiv – CS AI · Mar 177/10
🧠

Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation

Researchers developed Token-Selective Dual Knowledge Distillation (TSD-KD), a new framework that improves AI reasoning by allowing smaller models to learn from larger ones more effectively. The method achieved up to 54.4% better accuracy than baseline models on reasoning benchmarks, with student models sometimes outperforming their teachers by up to 20.3%.

AIBearisharXiv – CS AI · Mar 177/10
🧠

Faithful or Just Plausible? Evaluating the Faithfulness of Closed-Source LLMs in Medical Reasoning

Researchers evaluated the faithfulness of closed-source AI models like ChatGPT and Gemini in medical reasoning, finding that their explanations often appear plausible but don't reflect actual reasoning processes. The study revealed these models frequently incorporate external hints without acknowledgment and their chain-of-thought reasoning doesn't causally drive predictions, raising safety concerns for medical applications.

🧠 ChatGPT🧠 Gemini
AIBullisharXiv – CS AI · Mar 177/10
🧠

Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning

Researchers at NVIDIA developed NEMOTRON-CROSSTHINK, a new AI framework that uses reinforcement learning with multi-domain data to improve language model reasoning across diverse fields beyond just mathematics. The system shows significant performance improvements on both mathematical and non-mathematical reasoning benchmarks while using 28% fewer tokens for correct answers.

AIBullishMarkTechPost · Mar 167/10
🧠

Mistral AI Releases Mistral Small 4: A 119B-Parameter MoE Model that Unifies Instruct, Reasoning, and Multimodal Workloads

Mistral AI has launched Mistral Small 4, a 119-billion parameter Mixture of Experts (MoE) model that unifies instruction following, reasoning, and multimodal capabilities into a single deployment. This represents the first model from Mistral to consolidate the functions of their previously separate Mistral Small, Magistral, and Pixtral models.

Mistral AI Releases Mistral Small 4: A 119B-Parameter MoE Model that Unifies Instruct, Reasoning, and Multimodal Workloads
🏢 Mistral
AIBullisharXiv – CS AI · Mar 167/10
🧠

Efficient Reasoning with Balanced Thinking

Researchers propose ReBalance, a training-free framework that optimizes Large Reasoning Models by addressing overthinking and underthinking issues through confidence-based guidance. The solution dynamically adjusts reasoning trajectories without requiring model retraining, showing improved accuracy across multiple AI benchmarks.

AIBullisharXiv – CS AI · Mar 117/10
🧠

AlphaApollo: A System for Deep Agentic Reasoning

AlphaApollo is a new AI reasoning system that addresses limitations in foundation models through multi-turn agentic reasoning, learning, and evolution components. The system demonstrates significant performance improvements across math reasoning benchmarks, with success rates exceeding 85% for tool calls and substantial gains from reinforcement learning across different model scales.

AIBullisharXiv – CS AI · Mar 117/10
🧠

Logos: An evolvable reasoning engine for rational molecular design

Researchers introduce Logos, a compact AI model that combines multi-step logical reasoning with chemical consistency for molecular design. The model achieves strong performance in structural accuracy and chemical validity while using fewer parameters than larger language models, and provides transparent reasoning that can be inspected by humans.

AIBullisharXiv – CS AI · Mar 97/10
🧠

Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

Researchers propose a new method for training large language models (LLMs) that addresses the diversity loss problem in reinforcement learning approaches. Their technique uses the α-divergence family to better balance precision and diversity in reasoning tasks, achieving state-of-the-art performance on theorem-proving benchmarks.

AINeutralarXiv – CS AI · Mar 56/10
🧠

Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

Research reveals that Large Language Models show varying vulnerabilities to different types of Chain-of-Thought reasoning perturbations, with math errors causing 50-60% accuracy loss in small models while unit conversion issues remain challenging even for the largest models. The study tested 13 models across parameter ranges from 3B to 1.5T parameters, finding that scaling provides protection against some perturbations but limited defense against dimensional reasoning tasks.

AIBullisharXiv – CS AI · Mar 57/10
🧠

Perfect score on IPhO 2025 theory by Gemini agent

Google's Gemini 3.1 Pro Preview achieved a perfect score on IPhO 2025 theory problems across five runs, surpassing previous AI performance that fell behind top human contestants. However, the researchers acknowledge potential data contamination since the model was released after the competition.

🧠 Gemini
AIBullisharXiv – CS AI · Mar 56/10
🧠

R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning

Researchers developed R1-Code-Interpreter, a large language model that uses multi-stage reinforcement learning to autonomously generate code for step-by-step reasoning across diverse tasks. The 14B parameter model achieves 72.4% accuracy on test tasks, outperforming GPT-4o variants and demonstrating emergent self-checking capabilities through code generation.

🏢 Hugging Face🧠 GPT-4
AIBullisharXiv – CS AI · Mar 47/102
🧠

$\texttt{SEM-CTRL}$: Semantically Controlled Decoding

Researchers introduce SEM-CTRL, a new approach that ensures Large Language Models produce syntactically and semantically correct outputs without requiring fine-tuning. The system uses token-level Monte Carlo Tree Search guided by Answer Set Grammars to enforce context-sensitive constraints, allowing smaller pre-trained LLMs to outperform larger models on tasks like reasoning and planning.

AIBullisharXiv – CS AI · Mar 37/103
🧠

ExGRPO: Learning to Reason from Experience

Researchers introduce ExGRPO, a new framework that improves AI reasoning by reusing and prioritizing valuable training experiences based on correctness and entropy. The method shows consistent performance gains of +3.5-7.6 points over standard approaches across multiple model sizes while providing more stable training.

AINeutralarXiv – CS AI · Mar 37/104
🧠

When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models

Researchers analyzed compression effects on large reasoning models (LRMs) through quantization, distillation, and pruning methods. They found that dynamically quantized 2.51-bit models maintain near-original performance, while identifying critical weight components and showing that protecting just 2% of excessively compressed weights can improve accuracy by 6.57%.

Page 1 of 3Next →