AI · Bullish · arXiv – CS AI · May 4 · 7/10
🧠Researchers introduce Odysseus, an open framework for training vision-language models (VLMs) to handle 100+ turn decision-making tasks using reinforcement learning, demonstrated through Super Mario Land gameplay. The work achieves 3x better performance than existing models while maintaining general capabilities, advancing the frontier of embodied AI agents.
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠OmniDrive-R1 is a new Vision-Language Model framework that addresses critical reliability failures in autonomous driving by combining perception and reasoning through an interleaved multi-modal chain-of-thought mechanism, achieving significant accuracy improvements (37.81% to 73.62%) without requiring dense localization labels.
AI · Bullish · arXiv – CS AI · Mar 27 · 7/10
🧠Researchers introduce the Wireless World Model (WWM), a multi-modal AI framework for 6G networks that predicts wireless channel evolution by understanding electromagnetic wave propagation through 3D geometry. The model demonstrates superior performance across five downstream tasks and real-world measurements, outperforming existing foundation models.
AI · Bullish · arXiv – CS AI · 15h ago · 6/10
🧠Hi-SAM is a new hierarchical multi-modal recommendation framework that improves how AI systems process diverse data types (text, images) for personalized suggestions. The system addresses tokenization inefficiencies and architectural misalignments in existing approaches, achieving a 6.55% improvement in core metrics when deployed at scale.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce TAP (Two-Stage Adaptive Personalization), a novel federated learning framework that enables personalized fine-tuning of foundation models across clients with heterogeneous tasks and modalities. The method uses mismatched architectures to prevent cross-task interference and post-FL distillation to recover shared knowledge, advancing practical deployment of AI systems in distributed environments.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose a novel knowledge distillation method for multi-modal AI systems that transfers modality relationship information from teacher to student networks by having the student learn the teacher's Gram matrix. This approach goes beyond existing methods that focus only on the final output, enabling deeper knowledge transfer across different data modalities.
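As a rough illustration of the Gram-matrix idea in the summary above, the sketch below builds a modality-by-modality Gram matrix from one embedding vector per modality and penalizes the difference between the teacher's and the student's matrices. The function names, the toy embeddings, and the mean-squared-error penalty are all assumptions made for illustration; the summary does not specify the paper's actual loss.

```python
from typing import List

Vector = List[float]


def gram_matrix(embeddings: List[Vector]) -> List[List[float]]:
    """Pairwise dot products between modality embeddings (one vector per modality)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in embeddings] for u in embeddings]


def gram_distillation_loss(teacher: List[Vector], student: List[Vector]) -> float:
    """Mean squared difference between teacher and student Gram matrices.

    Assumption: MSE is used here only as a simple stand-in penalty.
    """
    g_t = gram_matrix(teacher)
    g_s = gram_matrix(student)
    n = len(g_t)
    return sum((g_t[i][j] - g_s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Hypothetical embeddings for two modalities (for example, text and image).
teacher = [[1.0, 0.0], [0.0, 1.0]]  # modalities are orthogonal in the teacher
student = [[1.0, 0.0], [1.0, 0.0]]  # modalities have collapsed together in the student

loss_same = gram_distillation_loss(teacher, teacher)
loss_diff = gram_distillation_loss(teacher, student)
```

Because the Gram matrix has one row and one column per modality, the teacher and student can use different embedding widths and still be compared, as long as both cover the same modalities.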
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce Spatial-Gym, a benchmarking environment that evaluates AI models on spatial reasoning tasks through step-by-step pathfinding in 2D grids rather than one-shot generation. Testing eight models reveals a significant performance gap, with the best model achieving only a 16% solve rate versus 98% for humans, exposing critical limitations in how AI systems scale reasoning effort and process spatial information.
AI · Bullish · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce VISOR, a new agentic visual retrieval-augmented generation system that improves how AI models reason over multi-page visual documents. By addressing key technical challenges in evidence gathering and context management, VISOR achieves state-of-the-art results on complex visual reasoning tasks.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers have developed an enhanced version of YOLOv5 that combines visual and textual data through cross-attention mechanisms to improve UI control detection in software screenshots. Tested on over 16,000 annotated images across 23 control classes, the multi-modal approach significantly outperforms pixel-only detection, with convolutional fusion showing the strongest results for semantically complex elements.