#reasoning-capability News & Analysis

8 articles tagged with #reasoning-capability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 8 · 7/10

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Researchers measured how well frontier AI models perform complex reasoning without explicit chain-of-thought (CoT) tokens, finding that no-CoT task-completion time horizons have doubled yearly over six years. GPT-5.5 now reaches over 3 minutes of reasoning complexity, with projections suggesting frontier models could exceed 7 minutes by 2028 and 25 minutes by 2030, raising concerns about the effectiveness of current AI safety monitoring approaches.

🧠 GPT-5
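The doubling claim in the summary above can be checked with simple arithmetic. The sketch below assumes a clean exponential with a one-year doubling time, as the summary states; the function name and parameters are illustrative and not taken from the paper, whose fitted trend may differ.

```python
def projected_horizon(h0_minutes: float, years_ahead: float,
                      doubling_time_years: float = 1.0) -> float:
    """Extrapolate a time horizon that doubles every `doubling_time_years`.

    Assumption: pure exponential growth, h(t) = h0 * 2 ** (t / T).
    """
    return h0_minutes * 2 ** (years_ahead / doubling_time_years)


# Six years of yearly doubling is a 64x increase overall.
growth_over_six_years = projected_horizon(1.0, 6)

# Consistency check between the two quoted projections:
# starting from 7 minutes in 2028, two more doublings give the 2030 value.
horizon_2030 = projected_horizon(7.0, 2030 - 2028)

print(growth_over_six_years)  # 64.0
print(horizon_2030)           # 28.0
```

Under this assumption, 7 minutes in 2028 grows to 28 minutes in 2030, which is consistent with the summary's "25 minutes by 2030" figure being a lower bound.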

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary

Researchers propose that AI agents should invoke external tools only when epistemically necessary—when internal reasoning cannot reliably complete a task. The Theory of Agent framework treats tool use as a decision under uncertainty rather than a simple action optimization problem, arguing that unnecessary delegation wastes resources and prevents development of internal reasoning capabilities.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

Researchers propose Lorem Perturbation for Exploration (LoPE), a training technique that addresses the zero-advantage problem in reinforcement learning for large language models by prepending random Latin-based text to prompts, enabling broader reasoning exploration across 1.7B to 7B parameter models.

🏢 Perplexity
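A rough illustration of the idea in the summary above, not the authors' implementation. It assumes the "zero-advantage problem" refers to group-normalized advantages, which vanish when every sampled response to a prompt receives the same reward. The word list, function names, and prefix length are hypothetical.

```python
import random
from statistics import mean, pstdev

# Hypothetical filler vocabulary; the paper's actual text source is not given here.
LOREM_WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def perturb_prompt(prompt, n_words=8, rng=None):
    """Prepend random Latin-like filler words to a prompt."""
    rng = rng or random.Random()
    prefix = " ".join(rng.choice(LOREM_WORDS) for _ in range(n_words))
    return f"{prefix}\n\n{prompt}"


# When every response fails (or every response succeeds), there is no signal.
print(group_advantages([0, 0, 0, 0]))  # [0.0, 0.0, 0.0, 0.0]

# A perturbed prompt keeps the original task text intact after the prefix.
print(perturb_prompt("Solve: 12 * 7 = ?", rng=random.Random(0)))
```

The intent, as described in the summary, is that sampling from perturbed prompts may produce more varied responses, so that at least some groups contain differing rewards and yield a nonzero learning signal.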

AI · Neutral · arXiv – CS AI · Apr 20 · 7/10

PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research

Researchers introduced PRL-Bench, a comprehensive benchmark measuring large language models' ability to conduct autonomous physics research across five subfields. Testing frontier AI models revealed performance below 50%, exposing a significant capability gap between current LLMs and the demands of real-world scientific discovery.

AI · Bearish · arXiv – CS AI · Jun 11 · 6/10

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

Researchers developed MentisOculi, a benchmark suite to test whether frontier multimodal AI models can use visual reasoning and mental imagery to solve complex problems. Testing shows that visual strategies—from latent tokens to generated images—fail to improve performance, revealing that despite their theoretical appeal, current models cannot effectively leverage visual thoughts for reasoning.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

Researchers introduce T1-Bench, a comprehensive benchmark for evaluating large language model-based agents across 25 domains with multi-step, multi-domain tasks that better reflect real-world complexity than existing benchmarks. The framework tests 12 models on structured reasoning, tool utilization, and conversational quality, with both automated and human evaluation methods.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

OneReason Technical Report

OneReason introduces a novel framework for improving reasoning capabilities in generative recommendation models by addressing perception and cognition limitations. The approach combines semantic grounding of item tokens with multi-level chain-of-thought sequences, demonstrating that effective reasoning requires both language understanding and coherent interest modeling rather than scaling alone.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

Researchers introduce CreativityBench, a benchmark with 4K entities and 150K+ affordance annotations to evaluate how well large language models can creatively repurpose tools by reasoning about their properties rather than canonical uses. Evaluations across 10 state-of-the-art LLMs reveal significant limitations: models struggle to identify correct parts, affordances, and physical mechanisms needed for non-obvious solutions, with performance gains from scaling and reasoning strategies like Chain-of-Thought proving limited.