#reasoning-capabilities News & Analysis

11 articles tagged with #reasoning-capabilities. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs

A new study demonstrates that small language models (SLMs) have severely limited self-correction capabilities, improving accuracy by only 4.4% even when given correct answers and explicit hints. The research also finds that longer deliberation actually harms performance, challenging the assumption that larger compute budgets automatically improve reasoning in smaller models.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Researchers introduce General365, a benchmark revealing that leading LLMs achieve only 62.8% accuracy on general reasoning tasks despite excelling at domain-specific tasks. The findings highlight a critical gap: current AI models rely heavily on specialized knowledge rather than developing robust, transferable reasoning capabilities applicable to real-world scenarios.

AI · Bearish · arXiv – CS AI · Apr 13 · 7/10

On the Limits of Layer Pruning for Generative Reasoning in Large Language Models

Research demonstrates that layer pruning—a compression technique for large language models—effectively reduces model size while maintaining classification performance, but critically fails to preserve generative reasoning capabilities like arithmetic and code generation. Even with extensive post-training on 400B tokens, models cannot recover lost reasoning abilities, revealing fundamental limitations in current compression approaches.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

Research shows that large language models' performance on short tasks may underestimate their capabilities, as small improvements in single-step accuracy lead to exponential gains in handling longer tasks. The study reveals that larger models excel at execution over many steps, though they suffer from 'self-conditioning' where previous errors increase the likelihood of future mistakes, which can be mitigated through 'thinking' mechanisms.
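The compounding effect described above can be illustrated with a small calculation. This is a minimal sketch under a simplifying assumption that is not taken from the paper: every step succeeds independently with the same probability, so a task of n steps succeeds with probability p^n. The function name `horizon_length` and the 50% success threshold are hypothetical choices for illustration, and the model deliberately ignores the 'self-conditioning' effect mentioned in the summary.

```python
import math


def horizon_length(step_accuracy: float, target_success: float = 0.5) -> int:
    """Longest task length (in steps) that still succeeds with probability
    >= target_success, ASSUMING independent steps with equal accuracy."""
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < target_success < 1.0:
        raise ValueError("target_success must be strictly between 0 and 1")
    # Solve step_accuracy ** n >= target_success for the largest integer n.
    return math.floor(math.log(target_success) / math.log(step_accuracy))


# Small gains in per-step accuracy translate into much longer horizons.
for p in (0.90, 0.99, 0.999):
    print(f"step accuracy {p}: about {horizon_length(p)} steps")
```

Under these assumptions, moving from 90% to 99% per-step accuracy lengthens the achievable task from 6 to 68 steps, which is why short-task benchmarks can understate differences between models.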

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 3

Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Researchers introduce CFE-Bench, a new multimodal benchmark for evaluating AI reasoning across 20+ STEM domains using authentic university exam problems. The best-performing model, Gemini-3.1-pro-preview, achieved only 59.69% accuracy, highlighting significant gaps in AI reasoning capabilities, particularly in maintaining correct intermediate states through multi-step solutions.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3

LEDOM: Reverse Language Model

Researchers have developed LEDOM, an open-source reverse autoregressive language model that trains right-to-left instead of the traditional left-to-right approach. The model demonstrates unique capabilities like abductive inference and question synthesis, and when combined with forward models through 'Reverse Reward' scoring, achieves significant performance gains of up to 15% on mathematical reasoning tasks.
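The idea of combining forward and reverse models to score candidates can be sketched as a simple reranker. The summary does not give the actual 'Reverse Reward' formula, so everything here is an assumption for illustration: the `Candidate` fields, the linear weighting, and the `reverse_weight` parameter are hypothetical, and the log-probabilities are hand-written rather than produced by any model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    forward_logprob: float  # hypothetical score from a left-to-right model
    reverse_logprob: float  # hypothetical score from a right-to-left model


def rerank(candidates: list[Candidate], reverse_weight: float = 0.5) -> list[Candidate]:
    """Sort candidates by a weighted sum of forward and reverse scores.

    This linear mix is an assumed stand-in, not the paper's formula.
    """
    if not 0.0 <= reverse_weight <= 1.0:
        raise ValueError("reverse_weight must be in [0, 1]")

    def combined(c: Candidate) -> float:
        return (1.0 - reverse_weight) * c.forward_logprob + reverse_weight * c.reverse_logprob

    return sorted(candidates, key=combined, reverse=True)


answers = [
    Candidate("x = 4", forward_logprob=-1.0, reverse_logprob=-5.0),
    Candidate("x = 3", forward_logprob=-1.5, reverse_logprob=-2.0),
]
print(rerank(answers)[0].text)  # reverse score changes which answer ranks first
```

With `reverse_weight=0.0` the forward model alone decides and the first answer wins; with equal weighting the reverse score promotes the second answer instead.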

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

Researchers introduce QMFOL, an automated framework for generating controlled-complexity logical reasoning benchmarks to evaluate large language models. The resulting QMFOLBench dataset of 2,880 instances reveals that LLM reasoning performance degrades significantly with increased logical complexity, with models showing consistent bias toward true-labeled tasks over false or unknown ones.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution

MetaEvo is a new framework that enables large language model-based agents to continuously improve through task experience by focusing on learning mechanisms rather than just memory storage. The two-stage approach combines preference-based optimization with modular architecture to help AI agents develop abstract principles and enhance reasoning capabilities over time.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

Researchers introduce PlanningBench, a framework for generating scalable and verifiable planning datasets to evaluate and train large language models on complex task coordination. The system uses a constraint-driven synthesis pipeline with adaptive difficulty control and finds that current frontier LLMs struggle with coupled constraints, though reinforcement learning on verified data improves performance across planning and instruction-following tasks.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Why Did Apple Fall: Evaluating Curiosity in Large Language Models

Researchers have developed a comprehensive evaluation framework based on human curiosity scales to assess whether large language models exhibit curiosity-driven learning. The study finds that LLMs demonstrate stronger knowledge-seeking than humans but remain conservative in uncertain situations, and that curiosity correlates positively with improved reasoning and active learning capabilities.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 3

Understanding the Role of Training Data in Test-Time Scaling

This paper analyzes test-time scaling in large language models, finding that longer chains of thought (CoTs) can reduce training data requirements but may harm performance if the relevant skills are not present in the training data. The study provides a theoretical framework showing that diverse, relevant, and challenging training tasks optimize test-time scaling performance.