y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#embodied-ai News & Analysis

109 articles tagged with #embodied-ai. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

109 articles
AIBullisharXiv – CS AI · 2d ago7/10
🧠

Planning with the Views via Scene Self-Exploration

Researchers introduce ViewSuite, a benchmark revealing that Vision Language Models struggle to plan multi-step camera movements in 3D environments despite understanding individual view transformations. A self-exploration framework with view graph distillation dramatically improves planning capability, boosting Qwen2.5-VL-7B performance from 2.5% to 47.8% accuracy.

🧠 GPT-5🧠 Gemini
AIBullisharXiv – CS AI · 2d ago7/10
🧠

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

JAEGER is a new AI framework that extends audio-visual large language models from 2D to 3D space, enabling spatial grounding and reasoning in physical environments through RGB-D observations and multi-channel audio. The researchers introduce Neural Intensity Vector (Neural IV) for enhanced directional audio analysis and release SpatialSceneQA, a 61k-sample benchmark for training and evaluation.

AIBullisharXiv – CS AI · 2d ago7/10
🧠

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Alibaba's Qwen team released Qwen-VLA, a unified foundation model that combines vision, language, and action capabilities for robotics across multiple tasks and robot types. The model demonstrates strong performance on manipulation, navigation, and trajectory prediction benchmarks while generalizing well to out-of-distribution scenarios and real-world robot deployments.

AIBullisharXiv – CS AI · 2d ago7/10
🧠

Causal-JEPA: Learning World Models through Object-Level Latent Masking

Researchers introduce Causal-JEPA (C-JEPA), an object-centric world model that uses masked latent prediction to learn interaction-dependent dynamics more effectively. The approach demonstrates significant improvements in visual reasoning tasks and enables more efficient AI planning with substantially fewer input features than existing patch-based models.

AIBullisharXiv – CS AI · 3d ago7/10
🧠

HumanoidMimicGen: Data Generation for Loco-Manipulation via Whole-Body Planning

Researchers introduce HumanoidMimicGen, a method for automatically generating training data for humanoid robots performing complex locomotion and manipulation tasks. The approach enables imitation learning at scale without labor-intensive teleoperation, achieving 20% performance improvements over models trained solely on real-world demonstrations.

AIBullisharXiv – CS AI · 3d ago7/10
🧠

Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

Researchers introduce a hierarchical decomposition method to improve large language models' spatial reasoning capabilities, a persistent weakness limiting their real-world applications. The approach combines task decomposition with a novel MCTS-Guided Group Relative Policy Optimization algorithm to enhance LLM performance on navigation, planning, and strategic games.

AIBullisharXiv – CS AI · 4d ago7/10
🧠

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

Researchers propose VLA-Pruner, a novel token pruning method that accelerates Vision-Language-Action models for embodied AI by addressing the mismatch between semantic and action-critical visual processing. The method achieves up to 1.99x speedup while maintaining manipulation performance by considering both semantic context and temporal action relevance, unlike existing VLM pruning approaches.

AIBullisharXiv – CS AI · 4d ago7/10
🧠

Identifiable Token Correspondence for World Models

Researchers introduce Identifiable Token Correspondence (ITC), a decoding technique that improves token-based transformer world models for visual reinforcement learning by treating next-frame prediction as a structured assignment problem. The method addresses temporal inconsistency issues like object duplication and disappearance, achieving state-of-the-art results on multiple benchmarks including a significant performance jump on Craftax-classic.

AIBullisharXiv – CS AI · 4d ago7/10
🧠

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

Researchers introduce FineVLA, a framework that enhances Vision-Language-Action models for robotics by incorporating fine-grained instruction supervision beyond simple goal-level commands. The system combines 972,247 trajectories into a curated dataset of 47,159 fine-grained trajectories and demonstrates that mixing fine-grained and coarse instructions improves real-world robot manipulation success rates to 62.7% compared to 49.9% with goal-level instructions alone.

AIBullisharXiv – CS AI · 4d ago7/10
🧠

Neuro-Inspired Inverse Learning for Planning and Control

Researchers present Inverse Learning (IL), a neuro-inspired framework for embodied AI planning that outperforms offline reinforcement learning and diffusion-based planners on D4RL benchmarks by an average of 24.2% while requiring 1-2 orders of magnitude less inference compute. The approach optimizes entire action sequences through forward models rather than step-by-step decisions, enabling faster, smoother control policies applicable to robotics and quantum gate synthesis.

AIBullisharXiv – CS AI · May 127/10
🧠

Octopus Protocol: One-Shot Hardware Discovery and Control for AI Agents via Infrastructure-as-Prompts

Octopus Protocol automates hardware discovery and control for AI agents through a single command, eliminating the need for manual driver and SDK development. The system uses a five-stage pipeline to detect connected devices, generate typed tools via Model Context Protocol, and deploy live endpoints, reducing hardware onboarding from weeks to 10-15 minutes.

AIBullisharXiv – CS AI · May 127/10
🧠

Geometry Guided Self-Consistency for Physical AI

Researchers introduce KeyStone, an inference-time method that improves physical AI model performance by generating multiple candidate action trajectories in parallel and selecting the most physically coherent one using geometric clustering. The technique achieves up to 13.3% improvement in task success rates across vision-language-action and world-action models without additional latency or training costs.

AIBullisharXiv – CS AI · May 127/10
🧠

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

SimWorld Studio is an open-source platform that automatically generates diverse 3D environments for training embodied AI agents using an evolving coding agent called SimCoder. The system demonstrates significant performance improvements through self-evolution and co-evolution mechanisms, achieving 18-point success-rate gains in navigation tasks compared to fixed environments.

AIBullisharXiv – CS AI · May 127/10
🧠

RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models

Researchers introduce RePO-VLA, a policy optimization framework that improves Vision-Language-Action models' ability to recover from failures in complex manipulation tasks. The method increases adversarial robustness from 20% to 75% by learning from recovery trajectories rather than discarding failed attempts, with validation on both simulated and real-world robotic tasks.

AIBullisharXiv – CS AI · May 127/10
🧠

NEXUS: Continual Learning of Symbolic Constraints for Safe and Robust Embodied Planning

Researchers introduce NEXUS, a framework enabling embodied AI agents to learn symbolic constraints for safer decision-making in physical environments. The system addresses the gap between probabilistic language models and the deterministic safety requirements of robotics by decoupling physical feasibility from safety specifications, achieving improved task success while refusing unsafe instructions.

AIBullisharXiv – CS AI · May 127/10
🧠

LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations

Researchers introduce Least Action World Models (LaWM), a framework that applies physics principles to improve visual prediction in AI systems. By embedding the Principle of Least Action into learned latent spaces, LaWM enables longer, more physically consistent predictions for embodied AI and robotic planning without requiring external constraints or auxiliary losses.

AINeutralarXiv – CS AI · May 127/10
🧠

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

Researchers introduce EnactToM, a benchmark testing whether AI agents can understand and act on others' beliefs in multi-agent embodied environments. Current frontier models achieve 0% on functional theory of mind tasks, revealing a critical gap in AI reasoning capabilities despite performing better on direct belief questions.

AIBullisharXiv – CS AI · May 117/10
🧠

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

Researchers introduce OneWM-VLA, a new approach to vision-language-action models that compresses visual input to a single token per frame while maintaining or improving long-horizon task performance. The method achieves significant improvements on robotics benchmarks including 61.3% success on MetaWorld MT50 and 60% on real-world cloth folding tasks, demonstrating that excessive visual bandwidth in world models may be unnecessary.

AIBullisharXiv – CS AI · May 117/10
🧠

Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training

Researchers introduce Sword, a world model framework that improves Vision-Language-Action (VLA) models' ability to simulate environments for policy training. By addressing visual style sensitivity and error accumulation in long-horizon predictions, Sword demonstrates significant performance gains on the LIBERO benchmark, advancing the feasibility of training AI agents entirely within simulated environments.

AIBullisharXiv – CS AI · May 117/10
🧠

CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations

Researchers introduce Cached State Representation (CSR), a framework that reduces latency in deploying large language models for robotics by 26-fold through optimized token caching and asynchronous state management. The approach enables real-time robot control with massive language models while maintaining full contextual understanding over infinite operational horizons.

AIBullisharXiv – CS AI · May 117/10
🧠

ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

ForgeVLA introduces a federated learning framework that enables Vision-Language-Action models to train on distributed robot data without centralizing sensitive information or requiring manual language annotations. The system uses embodied instruction classifiers to automatically generate missing language labels and addresses vision-language feature collapse through contrastive learning and adaptive aggregation.

AIBullisharXiv – CS AI · May 97/10
🧠

Continually Evolving Skill Knowledge in Vision Language Action Model

Researchers introduce Stellar VLA, a continual learning framework for vision-language-action models that improves knowledge accumulation without adding network parameters. The approach uses knowledge-guided expert routing and hierarchical task structures, achieving strong performance on robotics benchmarks with minimal data replay and validated real-world transfer capabilities.

AIBearisharXiv – CS AI · May 97/10
🧠

How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

Researchers present ImmersedPrivacy, an evaluation framework that tests Vision-Language Models' ability to recognize and respect privacy in physical environments. Testing 12 state-of-the-art VLMs reveals significant deficiencies: all models struggle with cluttered scenes, none exceed 65% accuracy when social context changes, and even the best model only balances task completion with privacy preservation 51% of the time.

AINeutralarXiv – CS AI · May 77/10
🧠

iWorld-Bench: A Benchmark for Interactive World Models with a Unified Action Generation Framework

Researchers introduced iWorld-Bench, a comprehensive benchmark dataset and evaluation framework for training and testing interactive world models with 330k video clips and 4.9k test samples. The framework unifies evaluation across different model architectures through a standardized Action Generation Framework and assesses capabilities in visual generation, trajectory following, and memory tasks.

Page 1 of 5Next →