#long-horizon-tasks News & Analysis

14 articles tagged with #long-horizon-tasks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · 2d ago · 7/10

ASH: Agents that Self-Hone via Embodied Learning

Researchers introduce ASH, an agentic system that learns embodied policies from unlabeled internet video without reward shaping or expert demonstrations. Through a self-improvement loop built on Inverse Dynamics Models, ASH achieves sustained progression on long-horizon tasks in Pokémon Emerald and The Legend of Zelda, significantly outperforming baseline approaches.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Periodic RoPE for Infinite Context LLMs

Researchers propose Periodic RoPE (P-RoPE), a novel positional encoding mechanism that combines sliding window attention for local dependencies with global attention layers lacking positional constraints, enabling language models to theoretically support infinite context windows without performance degradation. The approach addresses a fundamental limitation in current LLMs where model performance degrades when sequence length exceeds the pre-trained range of positional encodings like RoPE.

AI · Bearish · Decrypt – AI · May 27 · 7/10

Huawei's New Benchmark Gives AI Agents Months of Your Life—Then Watches Them Fail

Huawei has introduced Claw-Anything, a benchmark that tests AI agents' ability to handle complex digital tasks over extended simulated timeframes. GPT-5.5, currently the best-performing model, achieved only 34.5% on the benchmark, highlighting significant limitations in current AI agents' capacity to maintain performance during long-horizon tasks.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · May 12 · 7/10

The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

Researchers propose Agent Cybernetics, a theoretical framework applying mid-20th century control systems theory to modern LLM-based AI agents. The framework addresses critical gaps in how foundation agents are designed, offering scientific principles for reliability, continuous operation, and safe self-improvement across long-horizon tasks.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

Researchers propose DeMem, a decision-centric memory framework that optimizes agent memory allocation based on preserving distinctions needed for sound decision-making rather than descriptive accuracy. Using rate-distortion theory, the approach identifies what information can be safely forgotten under memory constraints and demonstrates performance gains on long-horizon language agent tasks.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

ReFlect introduces a training-free harness system that wraps around LLMs to detect and recover from reasoning failures in complex, multi-step tasks. Testing across six models shows significant improvements in task success rates, with gains inversely correlated to baseline performance, though the approach reveals limitations in how smaller models handle structured reasoning.

🧠 GPT-4 · 🧠 Claude · 🧠 Sonnet

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Milestone-Guided Policy Learning for Long-Horizon Language Agents

Researchers introduce BEACON, a milestone-guided policy learning framework that significantly improves training efficiency for long-horizon language agents by solving credit misattribution and sample inefficiency problems. The approach achieves 92.9% success rates on complex tasks—nearly double previous benchmarks—while improving sample utilization from 23.7% to 82.0%.

AI · Neutral · arXiv – CS AI · Apr 15 · 7/10

The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

Researchers introduce HORIZON, a diagnostic benchmark for identifying and analyzing why large language model agents fail at long-horizon tasks requiring extended action sequences. By evaluating state-of-the-art models across multiple domains and proposing an LLM-as-a-Judge attribution pipeline, the study provides systematic methodology for understanding agent limitations and improving reliability.

🧠 GPT-5 · 🧠 Claude

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

Research shows that large language models' performance on short tasks may underestimate their capabilities, as small improvements in single-step accuracy lead to exponential gains in handling longer tasks. The study reveals that larger models excel at execution over many steps, though they suffer from 'self-conditioning' where previous errors increase the likelihood of future mistakes, which can be mitigated through 'thinking' mechanisms.
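The compounding effect described in this summary can be illustrated with a short sketch. It assumes a simplified model that is not taken from the paper itself: every step succeeds independently with the same probability, so a task of H steps succeeds with probability p^H. The function name `horizon_length` and the 50% success threshold are hypothetical choices made for illustration.

```python
import math


def horizon_length(step_accuracy: float, target_success: float = 0.5) -> int:
    """Longest task, in steps, finished with probability >= target_success.

    Assumes independent steps with a constant per-step accuracy, so the
    success probability of an H-step task is step_accuracy ** H.
    """
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < target_success < 1.0:
        raise ValueError("target_success must be strictly between 0 and 1")
    # Solve step_accuracy ** H >= target_success for the largest integer H.
    return math.floor(math.log(target_success) / math.log(step_accuracy))


if __name__ == "__main__":
    for p in (0.90, 0.99, 0.999):
        print(f"per-step accuracy {p}: about {horizon_length(p)} steps")
```

Under these assumptions, raising per-step accuracy from 0.90 to 0.99 stretches the 50%-success horizon from 6 steps to 68, and 0.999 stretches it to 692. The self-conditioning effect mentioned in the summary breaks the independence assumption, so real horizons would likely be shorter than this idealized estimate.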

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL Problems

Researchers developed ELMUR, a new AI architecture that uses external memory to help robots make better decisions over extremely long time periods. The system achieved 100% success on tasks requiring memory of up to one million steps and nearly doubled performance on robotic manipulation tasks compared to existing methods.

AI · Bullish · arXiv – CS AI · 2d ago · 6/10

Learning Agent-Compatible Context Management for Long-Horizon Tasks

Researchers introduce Adaptive Context Management (AdaCoM), an external LLM-based system that optimizes how AI agents handle long-context tasks by learning agent-specific compression strategies through reinforcement learning. The approach improves performance on web search and research benchmarks while avoiding the need to retrain frozen agents, revealing that high-performing agents benefit from preserving context fidelity while weaker agents need more aggressive compression.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL

Researchers introduce CARL, a hierarchical reinforcement learning algorithm that discovers reusable skills by exploiting local dynamics regularity—the observation that similar action sequences solve similar local transitions across different contexts. When integrated with existing HRL methods like HIQL, CARL demonstrates improved performance on complex tasks and meaningful skill clustering in humanoid environments.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

AgentProg introduces a novel program-guided context management system for long-horizon GUI agents that addresses the critical bottleneck of expanding interaction history overhead. By reframing interaction history as structured programs with variables and control flow, the approach preserves semantic information while reducing context requirements, achieving state-of-the-art performance on AndroidWorld benchmarks while maintaining robustness on extended tasks.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10

Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

Researchers have developed Hierarchy-of-Groups Policy Optimization (HGPO), a new reinforcement learning method that improves AI agents' performance on long-horizon tasks by addressing context inconsistency issues in stepwise advantage estimation. The method shows significant improvements over existing approaches when tested on challenging agentic tasks using Qwen2.5 models.