y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#reasoning-capabilities News & Analysis

7 articles tagged with #reasoning-capabilities. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

7 articles
AINeutralarXiv – CS AI Β· 4d ago7/10
🧠

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Researchers introduce General365, a benchmark revealing that leading LLMs achieve only 62.8% accuracy on general reasoning tasks despite excelling in domain-specific domains. The findings highlight a critical gap: current AI models rely heavily on specialized knowledge rather than developing robust, transferable reasoning capabilities applicable to real-world scenarios.

AIBearisharXiv – CS AI Β· 5d ago7/10
🧠

On the Limits of Layer Pruning for Generative Reasoning in Large Language Models

Research demonstrates that layer pruningβ€”a compression technique for large language modelsβ€”effectively reduces model size while maintaining classification performance, but critically fails to preserve generative reasoning capabilities like arithmetic and code generation. Even with extensive post-training on 400B tokens, models cannot recover lost reasoning abilities, revealing fundamental limitations in current compression approaches.

AIBullisharXiv – CS AI Β· Mar 167/10
🧠

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

Research shows that large language models' performance on short tasks may underestimate their capabilities, as small improvements in single-step accuracy lead to exponential gains in handling longer tasks. The study reveals that larger models excel at execution over many steps, though they suffer from 'self-conditioning' where previous errors increase the likelihood of future mistakes, which can be mitigated through 'thinking' mechanisms.

AINeutralarXiv – CS AI Β· Mar 46/103
🧠

Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Researchers introduce CFE-Bench, a new multimodal benchmark for evaluating AI reasoning across 20+ STEM domains using authentic university exam problems. The best performing model, Gemini-3.1-pro-preview, achieved only 59.69% accuracy, highlighting significant gaps in AI reasoning capabilities, particularly in maintaining correct intermediate states through multi-step solutions.

AIBullisharXiv – CS AI Β· Mar 47/103
🧠

LEDOM: Reverse Language Model

Researchers have developed LEDOM, an open-source reverse autoregressive language model that trains right-to-left instead of the traditional left-to-right approach. The model demonstrates unique capabilities like abductive inference and question synthesis, and when combined with forward models through 'Reverse Reward' scoring, achieves significant performance gains of up to 15% on mathematical reasoning tasks.

AINeutralarXiv – CS AI Β· 3d ago6/10
🧠

Why Did Apple Fall: Evaluating Curiosity in Large Language Models

Researchers have developed a comprehensive evaluation framework based on human curiosity scales to assess whether large language models exhibit curiosity-driven learning. The study finds that LLMs demonstrate stronger knowledge-seeking than humans but remain conservative in uncertain situations, with curiosity correlating positively to improved reasoning and active learning capabilities.

AINeutralarXiv – CS AI Β· Mar 36/103
🧠

Understanding the Role of Training Data in Test-Time Scaling

Research paper analyzes test-time scaling in large language models, revealing that longer reasoning chains (CoTs) can reduce training data requirements but may harm performance if relevant skills aren't present in training data. The study provides theoretical framework showing that diverse, relevant, and challenging training tasks optimize test-time scaling performance.