y0news

#long-horizon-tasks News & Analysis

4 articles tagged with #long-horizon-tasks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 4d ago · 7/10

The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

Researchers introduce HORIZON, a diagnostic benchmark for identifying and analyzing why large language model agents fail at long-horizon tasks requiring extended action sequences. By evaluating state-of-the-art models across multiple domains and proposing an LLM-as-a-Judge attribution pipeline, the study provides a systematic methodology for understanding agent limitations and improving reliability.

🧠 GPT-5 · 🧠 Claude
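The paper's attribution pipeline isn't detailed in the summary above, but the general LLM-as-a-Judge pattern it names can be sketched as follows. The category names, prompt wording, and `judge` callable here are illustrative assumptions, not HORIZON's actual taxonomy or implementation:

```python
# Minimal sketch of an LLM-as-a-Judge failure-attribution loop.
# Categories and prompt are hypothetical, for illustration only.

FAILURE_CATEGORIES = ["planning", "state_tracking", "tool_use", "recovery"]

def build_judge_prompt(trajectory: str) -> str:
    """Ask a judge model to attribute a failed trajectory to one category."""
    options = ", ".join(FAILURE_CATEGORIES)
    return (
        "You are auditing an agent's failed task trajectory.\n"
        f"Classify the primary failure as one of: {options}.\n"
        "Respond with the category name only.\n\n"
        f"Trajectory:\n{trajectory}"
    )

def attribute_failures(trajectories, judge):
    """Tally judge verdicts across failed runs. `judge` is any callable
    mapping a prompt string to a category string (e.g. an LLM API call)."""
    counts = {c: 0 for c in FAILURE_CATEGORIES}
    for t in trajectories:
        verdict = judge(build_judge_prompt(t)).strip().lower()
        if verdict in counts:
            counts[verdict] += 1
    return counts
```

In practice `judge` would wrap a model API call; here any stub with the same signature works, which also makes the aggregation step testable offline.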
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

Research shows that large language models' performance on short tasks may understate their capabilities: small improvements in single-step accuracy compound into exponential gains in the length of tasks a model can complete. The study also finds that larger models excel at executing plans over many steps but suffer from 'self-conditioning', where earlier errors raise the likelihood of later mistakes, an effect that can be mitigated through 'thinking' mechanisms.
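The compounding effect can be illustrated with a simple model: if each step succeeds independently with probability p, a task of H steps succeeds with probability p^H, so the longest task completable at a given success rate grows like 1/(1−p). This independence assumption is a simplification of the paper's setting (its self-conditioning result is precisely a deviation from it), but it shows why per-step gains are not diminishing returns:

```python
import math

def horizon_at_success(p_step: float, target: float = 0.5) -> float:
    """Longest task length (in steps) completable at `target` success rate,
    assuming independent per-step success probability p_step."""
    # Solve p_step ** H = target for H.
    return math.log(target) / math.log(p_step)

# A ~1-point gain in per-step accuracy buys roughly 10x longer tasks:
print(round(horizon_at_success(0.99)))   # -> 69
print(round(horizon_at_success(0.999)))  # -> 693
```

Under this toy model, moving single-step accuracy from 99% to 99.9% stretches the 50%-success horizon from about 69 steps to about 693.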

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL Problems

Researchers developed ELMUR, a new AI architecture that uses external memory to help robots make better decisions over extremely long time periods. The system achieved 100% success on tasks requiring memory of up to one million steps and nearly doubled performance on robotic manipulation tasks compared to existing methods.
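The core idea of an external layer memory with an update/rewrite rule can be sketched in a toy form. Slot count, read mechanism, and the nearest-slot overwrite rule below are illustrative assumptions, not ELMUR's published architecture:

```python
import numpy as np

class ExternalLayerMemory:
    """Toy sketch of a layer-external memory with an update/rewrite rule.
    Dimensions and gating here are simplifications for illustration."""

    def __init__(self, n_slots: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.slots = rng.standard_normal((n_slots, dim)) * 0.01

    def read(self, query: np.ndarray) -> np.ndarray:
        # Attention-style read: softmax over slot-query similarity.
        scores = self.slots @ query
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.slots

    def write(self, key: np.ndarray, value: np.ndarray, rate: float = 0.5):
        # Rewrite: blend the most similar slot toward the new value,
        # so a bounded memory persists across arbitrarily long horizons.
        idx = int(np.argmax(self.slots @ key))
        self.slots[idx] = (1 - rate) * self.slots[idx] + rate * value
```

The key property this sketch shares with the paper's motivation is that memory cost is fixed by the slot count rather than the episode length, which is what makes million-step horizons feasible in principle.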

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10

Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

Researchers have developed Hierarchy-of-Groups Policy Optimization (HGPO), a new reinforcement learning method that improves AI agents' performance on long-horizon tasks by addressing context inconsistency issues in stepwise advantage estimation. The method shows significant improvements over existing approaches when tested on challenging agentic tasks using Qwen2.5 models.
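Group-relative advantage estimation, the building block HGPO extends, can be sketched as follows. The two-level blend and its 0.5/0.5 weighting are illustrative assumptions, not the paper's actual hierarchy or hyperparameters:

```python
def group_relative_advantages(rewards):
    """GRPO-style baseline: each advantage is the reward minus its
    group's mean, avoiding a learned value function."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def hierarchical_advantages(groups):
    """Toy two-level scheme: combine within-group (step-level) advantages
    with across-group (trajectory-level) advantages. The equal weighting
    is an illustrative assumption."""
    outer_rewards = [sum(g) / len(g) for g in groups]
    outer_adv = group_relative_advantages(outer_rewards)
    result = []
    for g, oa in zip(groups, outer_adv):
        inner = group_relative_advantages(g)
        result.append([0.5 * ia + 0.5 * oa for ia in inner])
    return result
```

The point of the hierarchy is that a step is credited both relative to its siblings within a rollout and relative to how its whole rollout compared to others, rather than against a single flat baseline.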