49 articles tagged with #embodied-ai. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv – CS AI · Mar 26 · 6/10
🧠Researchers introduce GameplayQA, a new benchmarking framework for evaluating multimodal large language models on 3D virtual agent perception and reasoning tasks. The framework uses densely annotated multiplayer gameplay videos with 2.4K diagnostic QA pairs, revealing substantial performance gaps between current frontier models and human-level understanding.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose OxyGen, a unified KV cache management system for Vision-Language-Action Models that enables efficient multi-task parallelism in embodied AI agents. The system achieves up to 3.7x speedup by sharing computational resources across tasks and eliminating redundant processing of shared observations.
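For a concrete picture of the sharing idea above, here is a minimal sketch of a KV cache keyed by observation content, so that parallel tasks looking at the same frame trigger only one prefill. The `SharedKVCache` class and the hashing scheme are illustrative assumptions, not OxyGen's actual design.

```python
import hashlib

class SharedKVCache:
    """Toy illustration: tasks that share an observation reuse one 'prefill' result
    instead of recomputing it per task. Not OxyGen's actual implementation."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn   # expensive prefill: observation -> KV entries
        self.cache = {}              # observation hash -> cached KV entries
        self.hits = 0
        self.misses = 0

    def get_kv(self, observation: bytes):
        key = hashlib.sha256(observation).hexdigest()
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[key] = self.encode_fn(observation)
        return self.cache[key]

# Two "tasks" (e.g. navigation and manipulation) observing the same camera frame
# trigger only one prefill; the second request is served from the shared cache.
if __name__ == "__main__":
    cache = SharedKVCache(encode_fn=lambda obs: f"kv({len(obs)} bytes)")
    frame = b"\x00" * 1024           # stand-in for a shared RGB observation
    for task in ("navigate", "manipulate"):
        print(task, cache.get_kv(frame))
    print("hits:", cache.hits, "misses:", cache.misses)   # hits: 1, misses: 1
```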
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce VLA-Thinker, a new AI framework that enhances Vision-Language-Action models by enabling dynamic visual reasoning during robotic tasks. The system achieved a 97.5% success rate on LIBERO benchmarks through a two-stage training pipeline combining supervised fine-tuning and reinforcement learning.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠The RoCo Challenge at AAAI 2026 introduces a new benchmark for robotic collaborative manipulation in industrial assembly tasks, featuring a planetary gearbox assembly challenge. Over 60 teams participated in both simulation and real-world rounds, with winning solutions demonstrating the effectiveness of dual-model frameworks and recovery-from-failure curriculum learning for long-horizon robotic tasks.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠EgoGrasp introduces the first method to reconstruct world-space hand-object interactions with open-vocabulary objects from egocentric videos. The multi-stage framework combines vision foundation models with body-guided diffusion models to achieve state-of-the-art performance in 3D scene reconstruction and hand pose estimation.
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠FALCON introduces a novel vision-language-action model that bridges the spatial reasoning gap by injecting 3D spatial tokens into action heads while preserving language reasoning capabilities. The system achieves state-of-the-art performance across simulation benchmarks and real-world tasks by leveraging spatial foundation models to provide geometric priors from RGB input alone.
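As a rough illustration of the injection idea (not FALCON's real architecture), the sketch below leaves the language backbone's output untouched and fuses spatial tokens only at the action-head input; all shapes (768-d VLM features, 16×128 spatial tokens, a 7-DoF action) are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def action_head(vlm_features: np.ndarray, spatial_tokens: np.ndarray) -> np.ndarray:
    """Toy action head: geometric tokens are fused with VLM features only here,
    so the language-reasoning path of the backbone is left unchanged."""
    fused = np.concatenate([vlm_features, spatial_tokens.mean(axis=0)])  # pool spatial tokens
    w = rng.normal(size=(7, fused.shape[0])) * 0.01                      # illustrative linear head
    return w @ fused                                                     # 7-DoF action (pose + gripper)

vlm_features = rng.normal(size=768)          # pooled VLA backbone output (assumed shape)
spatial_tokens = rng.normal(size=(16, 128))  # geometric priors from a spatial model (assumed shape)
print(action_head(vlm_features, spatial_tokens).shape)   # (7,)
```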
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers introduce EmCoop, a new benchmark framework for studying cooperation among LLM-based embodied multi-agent systems in dynamic environments. The framework separates cognitive coordination from physical interaction layers and provides process-level metrics to analyze collaboration quality beyond just task completion success.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers propose EfficientZero-Multitask (EZ-M), a multi-task model-based reinforcement learning algorithm that scales the number of tasks rather than samples per task for robotics training. The approach achieves state-of-the-art performance on HumanoidBench with significantly higher sample efficiency by leveraging shared world models across diverse tasks.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers developed PEPA, a three-layer cognitive architecture that enables robots to operate autonomously using personality traits to generate goals without external supervision. The system was successfully tested on a quadruped robot in a real-world office environment, demonstrating sustained autonomous behavior across five personality prototypes.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers introduce BrainNav, a bio-inspired navigation framework that mimics biological spatial cognition to enhance Vision-and-Language Navigation in mobile robots. The system addresses spatial hallucination issues when transferring from simulation to real-world environments, demonstrating superior performance in zero-shot real-world testing.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers propose HIMM, a new memory framework for embodied AI agents that separates episodic and semantic memory to improve long-term performance. The system achieves significant gains on benchmarks, with a 7.3% improvement in LLM-Match and 11.4% in LLM-Match×SPL, addressing key challenges in deploying multimodal language models as the brains of embodied agents.
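A minimal sketch of the episodic/semantic split described above, assuming a toy `AgentMemory` class; the summary doesn't detail HIMM's retrieval or consolidation mechanisms, so this only illustrates the general idea.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicEvent:
    step: int
    observation: str
    action: str

@dataclass
class AgentMemory:
    """Toy split memory: episodic entries keep raw, time-ordered experience;
    semantic entries keep distilled facts the agent can query later."""
    episodic: list = field(default_factory=list)    # what happened, in order
    semantic: dict = field(default_factory=dict)    # stable facts, e.g. object -> location

    def record(self, step, observation, action):
        self.episodic.append(EpisodicEvent(step, observation, action))

    def consolidate(self, fact_key, fact_value):
        self.semantic[fact_key] = fact_value

    def recall(self, query):
        # Semantic lookup first; fall back to recent episodic context.
        if query in self.semantic:
            return self.semantic[query]
        return [e for e in self.episodic[-5:] if query in e.observation]

memory = AgentMemory()
memory.record(1, "saw mug on kitchen counter", "move_forward")
memory.consolidate("mug", "kitchen counter")
print(memory.recall("mug"))          # 'kitchen counter'
```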
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10
🧠Researchers propose SAGE-LLM, a novel framework that combines Large Language Models with Control Barrier Functions for safe UAV autonomous decision-making. The system addresses LLM safety limitations through formal verification mechanisms and graph-based knowledge retrieval, demonstrating improved safety and generalization in drone control scenarios.
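The LLM-plus-Control-Barrier-Function pairing can be sketched as a safety filter that only passes actions satisfying a discrete CBF condition. The 2D keep-out zone, the backoff-by-scaling heuristic, and all constants below are assumptions for illustration; a real filter would typically solve a small QP.

```python
import numpy as np

def barrier(x: np.ndarray, obstacle=np.array([5.0, 5.0]), radius=2.0) -> float:
    """h(x) >= 0 means the UAV keeps a safe distance from a no-fly region."""
    return float(np.linalg.norm(x - obstacle) - radius)

def cbf_filter(x, v_llm, dt=0.1, alpha=1.0):
    """Accept the LLM-proposed velocity only if the discrete CBF condition
    h(x + v*dt) >= (1 - alpha*dt) * h(x) holds; otherwise scale it back."""
    for scale in (1.0, 0.5, 0.25, 0.0):
        v = scale * np.asarray(v_llm, dtype=float)
        if barrier(x + v * dt) >= (1.0 - alpha * dt) * barrier(x):
            return v
    return np.zeros_like(np.asarray(v_llm, dtype=float))

x = np.array([2.0, 2.0])                 # UAV approaching the keep-out zone at (5, 5)
print(cbf_filter(x, v_llm=[3.0, 3.0]))   # risky command toward the zone gets scaled down
print(cbf_filter(x, v_llm=[-1.0, -1.0])) # command moving away passes through unchanged
```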
AI · Neutral · arXiv – CS AI · Mar 2 · 7/10
🧠Researchers tested distributed AI inference across device, edge, and cloud tiers in a 5G network, finding that sub-second AI response times required for embodied AI are challenging to achieve. On-device execution took multiple seconds, while RAN-edge deployment with quantized models could meet 0.5-second deadlines, and cloud deployment achieved 100% success for 1-second deadlines.
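A toy version of the tier-selection logic implied above: pick the tier closest to the device whose estimated end-to-end latency still fits the task deadline. The latency numbers are placeholders, not measurements from the paper.

```python
# Each tier: (name, compute_latency_s, network_round_trip_s) - illustrative values only.
TIERS = [
    ("on-device",            2.50, 0.00),
    ("RAN edge (quantized)", 0.35, 0.04),
    ("cloud",                0.20, 0.60),
]

def pick_tier(deadline_s: float):
    feasible = [(name, c + n) for name, c, n in TIERS if c + n <= deadline_s]
    # Tiers are listed device-first, so feasible[0] is the closest tier that meets the deadline.
    return feasible[0] if feasible else None

print(pick_tier(0.5))   # the edge tier meets a 0.5 s deadline in this toy setup
print(pick_tier(1.0))   # on-device is still too slow; edge (and cloud) qualify
```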
AI · Neutral · arXiv – CS AI · Mar 2 · 7/10
🧠Researchers introduce SWITCH, a new benchmark for testing autonomous AI agents' ability to interact with physical interfaces like switches and appliance panels in real-world scenarios. The benchmark reveals significant gaps in current AI models' capabilities for long-horizon tasks requiring causal reasoning and verification.
AI · Bullish · arXiv – CS AI · Mar 2 · 7/10
🧠Researchers developed SocialNav, a foundation model for socially-aware robot navigation that uses a hierarchical architecture to understand social norms and generate compliant movement paths. The model was trained on 7 million samples and achieved 38% better success rates and 46% improved social compliance compared to existing methods.
AI · Neutral · arXiv – CS AI · Feb 27 · 5/10
🧠Researchers propose Contrastive World Models (CWM), a new approach for training AI agents to better distinguish physically feasible from infeasible actions in embodied environments. The method uses contrastive learning with hard negative examples to outperform traditional supervised fine-tuning, achieving a 6.76 percentage-point improvement in precision and better safety margins under stress conditions.
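The hard-negative contrastive objective can be illustrated with a small InfoNCE-style loss, assuming placeholder embeddings for the state and candidate actions; this is not CWM's actual training code.

```python
import numpy as np

def contrastive_feasibility_loss(state, positive_action, negative_actions, temperature=0.1):
    """InfoNCE-style toy loss: the embedding of the physically feasible action should
    match the state embedding better than the hard-negative infeasible ones."""
    def score(a):
        a = a / np.linalg.norm(a)
        s = state / np.linalg.norm(state)
        return float(s @ a) / temperature            # cosine similarity / temperature
    logits = np.array([score(positive_action)] + [score(a) for a in negative_actions])
    log_softmax = logits - (logits.max() + np.log(np.exp(logits - logits.max()).sum()))
    return -log_softmax[0]                           # minimized when the feasible action wins

rng = np.random.default_rng(0)
state = rng.normal(size=32)                                 # placeholder state embedding
feasible = state + 0.1 * rng.normal(size=32)                # close to the state -> should score highest
hard_negatives = [rng.normal(size=32) for _ in range(4)]    # plausible but infeasible actions
print(round(contrastive_feasibility_loss(state, feasible, hard_negatives), 4))
```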
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10
🧠Researchers have developed SignVLA, the first sign language-driven Vision-Language-Action framework for human-robot interaction that directly translates sign gestures into robotic commands without requiring intermediate gloss annotations. The system currently focuses on real-time alphabet-level finger-spelling for robotic control and is designed to support future expansion to word and sentence-level understanding.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10
🧠Researchers developed Hierarchical Co-Self-Play (HCSP), a reinforcement learning framework that enables teams of drones to learn complex 3v3 volleyball through a three-stage training process. The system achieved an 82.9% win rate against baselines and demonstrated emergent team behaviors like role switching and coordinated formations.
AI · Bullish · Hugging Face Blog · Jan 5 · 6/10
🧠NVIDIA's DGX Spark and Hugging Face's Reachy Mini are presented as new hardware for bringing AI agents to life with enhanced physical interaction capabilities, reflecting a broader push into embodied AI and robotics applications.
AI · Neutral · arXiv – CS AI · Mar 16 · 4/10
🧠Researchers introduce Steve-Evolving, a new AI framework for open-world embodied agents that uses fine-grained diagnosis and knowledge distillation to improve long-horizon task performance. The system organizes interaction experiences into structured tuples and continuously evolves without model parameter updates, showing improvements in Minecraft testing environments.
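A minimal sketch of the "evolve by accumulating structured experience, not by fine-tuning" idea, assuming a hypothetical tuple layout (task, plan, outcome, diagnosis) and naive keyword retrieval.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    task: str        # e.g. "craft stone pickaxe"
    plan: str        # the plan the agent attempted
    outcome: str     # "success" or a failure description
    diagnosis: str   # fine-grained reason for failure, distilled into reusable advice

class ExperienceLibrary:
    """Toy evolving memory: the agent improves by accumulating and retrieving
    structured tuples, with no model parameter updates."""
    def __init__(self):
        self.entries = []

    def add(self, exp: Experience):
        self.entries.append(exp)

    def advice_for(self, task: str):
        # Naive keyword overlap; a real system would use embedding retrieval.
        words = set(task.lower().split())
        return [e.diagnosis for e in self.entries if words & set(e.task.lower().split())]

library = ExperienceLibrary()
library.add(Experience("craft stone pickaxe", "mine stone first", "failure",
                       "needed a wooden pickaxe before mining stone"))
print(library.advice_for("craft stone sword"))   # reuses the distilled lesson
```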
AI · Neutral · arXiv – CS AI · Mar 11 · 5/10
🧠Researchers introduce MA-EgoQA, a benchmark for evaluating AI models' ability to understand multiple egocentric video streams from embodied agents simultaneously. The benchmark includes 1.7k questions across five categories and reveals current approaches struggle with multi-agent system-level understanding.
AI · Neutral · arXiv – CS AI · Mar 5 · 4/10
🧠Researchers have developed HAMLET, a hierarchical multi-agent AI framework that creates immersive, interactive theatrical experiences using large language models. The system generates narrative blueprints from simple topics and enables AI actors to perform with adaptive reasoning, emotional states, and physical interactions with scene props.
AI · Neutral · arXiv – CS AI · Mar 4 · 4/10
🧠Researchers introduce ConEQsA, an AI framework that enables embodied agents to handle multiple questions simultaneously in 3D environments with urgency-aware scheduling. The system uses shared memory to reduce redundant exploration and includes a new benchmark with 200 questions across 40 indoor scenes.
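The urgency-aware scheduling with shared exploration memory can be sketched with a priority queue over pending questions; the room sets and urgency scores below are invented for illustration.

```python
import heapq

# Toy urgency-aware scheduler: questions are answered in urgency order, and a shared
# set of already-explored rooms lets later questions skip redundant exploration.
questions = [
    # (urgency: lower = more urgent, question, rooms needed to answer it)
    (1, "Is the stove still on?", {"kitchen"}),
    (3, "Where did I leave the umbrella?", {"hallway", "kitchen"}),
    (2, "Is the bedroom window open?", {"bedroom"}),
]

heapq.heapify(questions)
explored = set()          # shared exploration memory across all pending questions

while questions:
    urgency, question, rooms_needed = heapq.heappop(questions)
    new_rooms = rooms_needed - explored        # explore only what earlier questions didn't cover
    explored |= new_rooms
    print(f"[urgency {urgency}] {question} -> explore {sorted(new_rooms) or 'nothing new'}")
```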
AI · Bullish · arXiv – CS AI · Feb 27 · 4/10
🧠Researchers introduced DICArt, a new AI framework for articulated object pose estimation that uses discrete diffusion processes instead of continuous space regression. The method incorporates kinematic constraints and hierarchical structure modeling to improve accuracy in estimating 6D poses of complex objects in embodied AI applications.