#long-horizon-reasoning News & Analysis

14 articles tagged with #long-horizon-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 25 · 7/10

RAVEN: Long-Horizon Reasoning & Navigation with a Visuo-Spatio-Temporal Memory

Researchers introduce RAVEN, an agentic memory system that enables robots to perform long-horizon navigation and question-answering tasks by storing visual embeddings with spatial-temporal metadata in a vector database. The system achieves 10× lower retrieval costs than caption-based approaches while matching frontier vision-language models, and has been successfully deployed on physical robots for real-world navigation.
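To make the idea of storing visual embeddings alongside spatial-temporal metadata concrete, here is a minimal sketch. It is not RAVEN's implementation: the class names, the two-dimensional toy embeddings, and the `after` time filter are all assumptions chosen for illustration, and a plain Python list stands in for a real vector database.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    embedding: list      # visual embedding of one observation (toy values here)
    position: tuple      # (x, y) location where the observation was captured
    timestamp: float     # seconds since the start of the run


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SpatioTemporalMemory:
    """Hypothetical stand-in for a vector database with metadata filtering."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, position, timestamp):
        self.entries.append(MemoryEntry(embedding, position, timestamp))

    def query(self, embedding, k=1, after=None):
        # Filter on the temporal metadata first, then rank by similarity.
        candidates = [e for e in self.entries
                      if after is None or e.timestamp >= after]
        candidates.sort(key=lambda e: cosine(embedding, e.embedding),
                        reverse=True)
        return candidates[:k]


memory = SpatioTemporalMemory()
memory.add([1.0, 0.0], (0.0, 0.0), 0.0)
memory.add([0.0, 1.0], (5.0, 2.0), 10.0)
memory.add([0.9, 0.1], (8.0, 1.0), 20.0)

best = memory.query([1.0, 0.0], k=1)[0]                # closest match overall
recent = memory.query([1.0, 0.0], k=1, after=5.0)[0]   # closest match seen after t=5
```

Because each hit carries its position and timestamp, a navigation layer could move to where the matching observation was recorded without first converting images to captions.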

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

Researchers introduce Autopilot, an execution framework for long-horizon LLM agents that prevents false success claims through a verifiable finite-state machine architecture. Testing across 3,150 cases shows Autopilot reduces fabrication rates to 0.95% compared to 8.10% and 25.05% for competing systems, with dramatic improvements on complex software engineering benchmarks.
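The finite-state-machine idea in this summary can be sketched as follows. This is an illustrative toy, not the paper's architecture: the state names, the `GuardedRun` class, and the test-count verifier are assumptions. The one property it shows is that a success claim is only reachable through a verification transition.

```python
from enum import Enum, auto


class State(Enum):
    EXECUTING = auto()
    VERIFYING = auto()
    DONE = auto()
    FAILED = auto()


class GuardedRun:
    """Hypothetical run wrapper: DONE is only reachable via the verifier."""

    def __init__(self, verifier):
        self.state = State.EXECUTING
        self.verifier = verifier

    def submit(self, artifact):
        # Submitting is only legal while the agent is still executing.
        if self.state is not State.EXECUTING:
            raise RuntimeError(f"cannot submit from state {self.state.name}")
        self.state = State.VERIFYING
        # The verifier, not the agent, decides the terminal state.
        self.state = State.DONE if self.verifier(artifact) else State.FAILED
        return self.state

    def claim_success(self):
        # The agent's own assertion is ignored; only the state counts.
        return self.state is State.DONE


def all_tests_pass(artifact):
    """Assumed check: every test in a non-empty suite passed."""
    total = artifact.get("tests_total", 0)
    return total > 0 and artifact.get("tests_passed", 0) == total


honest = GuardedRun(all_tests_pass)
honest.submit({"tests_passed": 12, "tests_total": 12})

fabricated = GuardedRun(all_tests_pass)
fabricated.submit({"tests_passed": 9, "tests_total": 12, "agent_says": "success"})
```

In this sketch `honest.claim_success()` is true and `fabricated.claim_success()` is false, regardless of what the agent reports about itself.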

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

AdaMEM: Test-Time Adaptive Memory for Language Agents

Researchers introduce AdaMEM, a test-time adaptive memory framework that enables language agents to dynamically adjust behavior during inference without updating model parameters. The system combines persistent offline trajectory memory with dynamically generated on-the-fly strategy memory, demonstrating 11-13% performance improvements on complex reasoning and web interaction tasks.

AI · Neutral · arXiv – CS AI · Jun 4 · 7/10

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Researchers introduce AutoLab, a benchmark testing whether frontier AI models can solve complex, multi-step engineering tasks over extended time horizons. Testing 17 state-of-the-art models reveals that persistence and iterative refinement—not initial quality—predict success, with most models failing to sustain long-horizon optimization despite their capabilities.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety

Researchers introduce TRACE, a safety detection system for long-horizon LLM agents that compresses extended trajectories into compact evidence states to better identify distributed risk signals. The method improves on baselines by up to 12.6 percentage points across multiple safety benchmarks and remains stable as context length increases.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

MemPro: Agentic Memory Systems as Evolvable Programs

Researchers introduce MemPro, an evolution framework that treats autonomous agent memory systems as adaptable programs rather than static pipelines. By iteratively diagnosing failures and refining the entire memory-construction-retrieval pipeline, MemPro outperforms fixed baselines on multiple benchmarks while maintaining computational efficiency.

AI · Bearish · arXiv – CS AI · Jun 1 · 7/10

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Researchers introduce LongDS, a benchmark that exposes significant limitations in AI agents performing long-horizon data analysis. Across five state-of-the-art models, the best accuracy is only 48.45%, and performance degrades by 47 points as tasks progress, indicating that maintaining analytical context over extended interactions remains a critical unsolved problem.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Researchers introduce Metacognitive Memory Policy Optimization (MMPO), a novel training method that improves how AI language model agents manage memory across long-horizon tasks. The approach uses Belief Entropy—a self-supervised metric measuring uncertainty about task state—to provide fine-grained supervision during memory summarization, enabling agents to maintain 97.1% performance even with 1.75M-token contexts.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

Researchers propose a sleep-like mechanism for transformer language models that periodically consolidates context into persistent fast weights, reducing the computational burden of long sequences. The method shifts heavy computation offline while maintaining fast inference speeds, showing significant improvements on reasoning tasks that standard transformers struggle with.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

Researchers introduce a learnable approach to commitment depth—the number of primitive actions executed before replanning—in vision-language models for long-horizon reasoning. Their adaptive policy outperforms fixed-depth baselines and surpasses GPT-4.5 and Claude Sonnet on puzzle-solving tasks, achieving higher solve rates with fewer actions.

🧠 GPT-5 · 🧠 Claude
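Commitment depth, as defined in the summary above, is the number of primitive actions executed before replanning. The toy sketch below shows how that number trades off against replanning calls. It uses a fixed depth on a hypothetical one-dimensional world, whereas the paper learns the depth adaptively; the `plan` and `run` functions are illustrative assumptions only.

```python
def plan(position, goal):
    """Hypothetical planner: a list of +1/-1 steps leading to the goal."""
    step = 1 if goal > position else -1
    return [step] * abs(goal - position)


def run(start, goal, depth, max_steps=100):
    """Execute `depth` planned actions, then replan, until the goal is reached.

    Returns (final_position, actions_taken, planner_calls).
    """
    position, actions, replans = start, 0, 0
    while position != goal and actions < max_steps:
        committed = plan(position, goal)[:depth]  # commit to a prefix of the plan
        replans += 1
        for action in committed:
            position += action
            actions += 1
    return position, actions, replans


shallow = run(0, 6, depth=2)   # replans every two actions
deep = run(0, 6, depth=6)      # commits to the whole plan at once
```

In this deterministic world both settings reach the goal in six actions, but the shallow setting calls the planner three times and the deep setting once. An adaptive policy would pick a depth per situation, committing longer when the plan is reliable and replanning sooner when it is not.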

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

Researchers introduce Slipstream, a system that validates LLM agent trajectory compression by running compaction asynchronously alongside continued agent execution, enabling independent validation of summarized context. The approach improves task accuracy by up to 8.8 percentage points while reducing latency by 39.7% on long-horizon coding and web-browsing tasks.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

Researchers introduce LLM-WikiRace, a benchmark that tests large language models' planning and reasoning abilities by requiring them to navigate Wikipedia links from a source to target page. While frontier models like Gemini-3 achieve superhuman performance on easy tasks, success rates plummet to 23% on hard difficulty, revealing significant limitations in long-horizon planning and recovery from failures.

🧠 GPT-5 · 🧠 Claude · 🧠 Opus
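The navigation task described above, following links from a source page to a target page, is a path search over a link graph. As a point of reference, the sketch below runs breadth-first search on a small made-up graph. It is not part of the benchmark: the `toy_links` data and the function name are assumptions, and an LLM playing the game sees only the current page's links rather than the whole graph.

```python
from collections import deque


def shortest_link_path(links, source, target):
    """Breadth-first search for the shortest chain of links from source to target."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # target is unreachable from source


# Hypothetical toy link graph; page names are placeholders, not real data.
toy_links = {
    "Cat": ["Mammal", "Pet"],
    "Mammal": ["Animal"],
    "Pet": ["Dog"],
    "Dog": ["Mammal"],
    "Animal": ["Biology"],
}

path = shortest_link_path(toy_links, "Cat", "Biology")
```

With full graph access the search is straightforward. The benchmark's difficulty comes from planning with partial information and recovering after a wrong click, which this sketch does not model.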

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Belief-Aware VLM Model for Human-like Reasoning

Researchers propose a belief-aware Vision Language Model framework that enhances human-like reasoning by integrating retrieval-based memory and reinforcement learning. The approach addresses limitations in current VLMs and VLAs by approximating belief states through vector-based memory, demonstrating improved performance on vision-question-answering tasks compared to zero-shot baselines.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 9

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Researchers have developed MM-Mem, a new pyramidal multimodal memory architecture that enables AI systems to better understand long-horizon videos by mimicking human cognitive memory processes. The system addresses current limitations in multimodal large language models by creating a hierarchical memory structure that progressively distills detailed visual information into high-level semantic understanding.