#scene-understanding News & Analysis

7 articles tagged with #scene-understanding. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5

Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy

Researchers propose Vid-LLM, a new video-based 3D multimodal large language model that processes video inputs without requiring external 3D data for scene understanding. The model uses a Cross-Task Adapter module and a Metric Depth Model to integrate geometric cues and maintain consistency across 3D tasks such as question answering and visual grounding.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Can Segmentation Models Understand the World? Towards Proactive Affordance Reasoning via Visual Chain-of-Thought

Researchers introduce SegWorld, a segmentation model that uses visual chain-of-thought reasoning to understand scenes and segment object parts based on high-level intent rather than explicit target descriptions. The model proactively observes scenes, infers affordances, and maps user instructions to specific physical interaction points, outperforming baselines on intent-level tasks while matching them on traditional target-referential instructions.

AI · Neutral · arXiv – CS AI · 3d ago · 5/10

Manboformer: Learning Gaussian Representations via Spatial-temporal Attention Mechanism

Researchers propose Manboformer, an improvement to GaussianFormer that enhances 3D semantic occupancy prediction for autonomous driving by incorporating spatial-temporal attention mechanisms. The method addresses performance limitations in the original Gaussian-based approach by leveraging temporal information, with evaluation ongoing on the nuScenes dataset.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Unified Panoramic Geometry Estimation via Multi-View Foundation Models

Researchers introduce PaGeR, a framework that adapts 3D foundation models trained on perspective images to work with panoramic imagery, enabling geometry estimation from 360-degree scenes. The unified model predicts depth, surface normals, and sky masks from both standard and panoramic images in a single pass, achieving state-of-the-art performance on indoor and outdoor scenes.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition

Researchers propose CroBo, a new visual state representation learning framework that helps robotic agents better understand dynamic environments by encoding both semantic identities and spatial locations of scene elements. The framework uses a global-to-local reconstruction method that compresses observations into compact tokens, achieving state-of-the-art performance on robot policy learning benchmarks.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 17

LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans

Researchers have developed LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic 3D virtual replicas suitable for AR/VR, gaming, robotics, and digital twins. The system features scene understanding, object retrieval, material painting, and physics integration to create graphics-ready environments that support object individuality and physically-based rendering.

AI · Neutral · arXiv – CS AI · Apr 7 · 4/10

TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding

TreeGaussian introduces a new framework for 3D scene understanding that uses tree-guided cascaded contrastive learning to better capture hierarchical semantic relationships in complex 3D environments. The method addresses limitations in existing 3D Gaussian Splatting approaches by implementing structured learning across object-part hierarchies and improving segmentation consistency.