y0news

#agent-capabilities News & Analysis

6 articles tagged with #agent-capabilities. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 9 · 7/10

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Researchers introduce WeaveBench, a benchmark for evaluating computer-use agents on tasks that combine GUI, CLI, and code operations. The best frontier models achieve a success rate of only 41.2% on its 114 real-world tasks, indicating that current AI agents struggle with complex multi-interface orchestration.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

Researchers introduce Retrospective Harness Optimization (RHO), a self-supervised method that lets AI agents improve using only historical trajectory data, with no external validation set. In a single optimization round, the approach raised the pass rate on SWE-Bench Pro from 59% to 78%, and the authors report gains across software engineering, technical work, and knowledge domains.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

A new study challenges claims that multimodal AI agents genuinely benefit from tool use, finding that 93-96% of problems solved with tools are also solvable without them. The research suggests these agents learn tool-calling patterns rather than actual tool-dependent capabilities, raising questions about how benchmark improvements are interpreted.
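
One way to read the 93-96% figure is as an overlap ratio: of the problems an agent solves with tools, the share it also solves without them. The sketch below assumes that reading. The function name `tool_free_overlap` and the problem IDs are hypothetical, and the study's exact metric definition may differ.

```python
def tool_free_overlap(solved_with_tools: set, solved_without_tools: set) -> float:
    """Fraction of tool-solved problems that were also solved without tools.

    Assumed definition for illustration; not taken from the paper.
    """
    if not solved_with_tools:
        raise ValueError("no problems were solved with tools")
    solved_both_ways = solved_with_tools & solved_without_tools
    return len(solved_both_ways) / len(solved_with_tools)


# Hypothetical problem IDs, for illustration only (not data from the study).
with_tools = {"p1", "p2", "p3", "p4"}
without_tools = {"p1", "p2", "p3", "p7"}

overlap = tool_free_overlap(with_tools, without_tools)
print(f"{overlap:.0%}")  # prints 75%
```

A ratio near 100% under this definition would mean that few problems are solved only when tools are available, which is the pattern the summary describes.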

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

Researchers introduce MacArena, a benchmark of 421 tasks across 50 macOS applications for evaluating computer-use agents on Apple's native platform. Results show a significant gap between Linux-based benchmarks and macOS environments: leading AI models degrade by over 26% on macOS-native tasks, suggesting that existing evaluations may overestimate cross-platform GUI competence.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

SkillGrad: Optimizing Agent Skills Like Gradient Descent

SkillGrad introduces a gradient-descent-inspired framework for automatically optimizing LLM agent skills, treating skill packages as parameters to be refined through task execution feedback and systematic diagnosis. The method outperforms existing training-based approaches by 6.7 percentage points on benchmark tasks, demonstrating measurable improvements in agent reliability and capability.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

Researchers introduce AgentEscapeBench, a benchmark that evaluates how well LLM-based agents reason through complex, multi-step tasks requiring external tool use and long-range dependency tracking. Testing 16 LLM agents on 270 escape-room-style problems showed that performance degrades as task complexity increases: the best models dropped from 90% to 60% success as dependency depth tripled, highlighting a critical limitation in current AI agent capabilities.