AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7
🧠Molmo2 is a new open-source family of vision-language models that achieves state-of-the-art performance among open models, particularly excelling in video understanding and pixel-level grounding tasks. The research introduces 7 new video datasets and 2 multi-image datasets collected without using proprietary VLMs, along with an 8B-parameter model that outperforms existing open-weight models and even some proprietary models on specific tasks.
AI · Neutral · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers introduce SPG-LLM, an approach that uses large language models to speed up grounding in classical planning by identifying irrelevant objects and actions before computation. The method cuts grounding time, often by orders of magnitude, across seven challenging benchmarks while maintaining or improving plan quality.
AI · Neutral · arXiv – CS AI · Jun 4 · 6/10
🧠Researchers introduce NoRA, a visual reasoning benchmark that evaluates whether AI models can generate and justify appropriate actions in first-person video scenarios through explicit reasoning graphs. The benchmark reveals that current multimodal language models struggle to construct complete action spaces and properly ground decisions in visible evidence, highlighting a critical gap between selecting plausible actions and explaining them through verifiable reasoning.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers introduce ROVER, a lightweight plugin that enhances multimodal large language models' ability to reason across multiple images by intelligently routing visual evidence to specific objects. The approach achieves significant performance improvements on grounded reasoning benchmarks while reducing computational overhead compared to existing methods.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce TRACE, a benchmark dataset for evaluating tourism recommendation systems that combine multi-turn dialogue, verifiable review citations, and rejection recovery. The dataset reveals a significant gap in existing conversational recommender systems: LLMs excel at recall but cite weakly, while retrieval-based systems ground better but struggle with accuracy and adaptation.
AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠Researchers introduced GroundedKG-RAG, a new retrieval-augmented generation system that creates knowledge graphs directly grounded in source documents to improve long-document question answering. The system reduces resource consumption and hallucinations while maintaining accuracy comparable to state-of-the-art models at lower cost.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose a hierarchical planning framework to analyze why LLM-based web agents fail at complex navigation tasks. The study reveals that while structured PDDL plans outperform natural language plans, low-level execution and perceptual grounding remain the primary bottlenecks rather than high-level reasoning.
AI · Bullish · Google DeepMind Blog · Dec 17 · 6/10 · 3
🧠Researchers have introduced FACTS Grounding, a new benchmark designed to evaluate how accurately large language models ground their responses in source material and avoid hallucinations. The benchmark includes a comprehensive evaluation system and online leaderboard to measure LLM factuality performance.