AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers developed RieMind, a new AI framework that improves spatial reasoning in indoor scenes by 16-50% by separating visual perception from logical reasoning using explicit 3D scene graphs. The system grounds language models in structured geometric representations rather than processing videos end-to-end, achieving significantly better performance on spatial understanding benchmarks.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers propose CAPruner, a scene graph pruning method that enhances how large language models perform 3D spatial reasoning by preserving task-relevant relations rather than relying solely on spatial proximity. The approach combines fuzzy semantic relevance with spatial proximity to identify critical relations, addressing computational inefficiencies in 3D vision-language tasks.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers introduce PhysScene, the first scene graph dataset specifically designed for physics experiments, enabling AI systems to understand complex scientific setups through structured visual reasoning. The dataset prioritizes semantic accuracy and relational density over scale, addressing a gap in domain-specific AI training data for scientific applications.
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers introduce PSG-Nav, a novel navigation system that uses probabilistic scene graphs to help AI agents navigate complex environments while accounting for perception uncertainty. The system achieves state-of-the-art results on three major benchmarks by employing multiverse decision-making and an evidential calibrator to reduce false positives in open-vocabulary navigation tasks.
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce a computational method for pre-capture portrait photography planning that generates optimal human poses, camera angles, lighting, and exposure settings within 3D scenes before photos are taken. Rather than focusing on post-production editing, this approach uses a Photographic Scene Graph to represent scene affordances and lighting structure, enabling AI-guided planning that produces aesthetically superior portraits while maintaining physical feasibility.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Response-G1 introduces a novel framework for real-time video understanding that uses explicit scene graphs to align video evidence with query-specific response conditions, enabling Video-LLMs to make more accurate timing decisions during streaming video analysis without requiring fine-tuning.
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers introduce Spatial Atlas, a compute-grounded reasoning system that combines deterministic spatial computation with large language models to create spatially aware research agents. The framework demonstrates competitive performance on two benchmarks (FieldWorkArena for multimodal spatial question-answering and MLE-Bench for machine learning competitions) while improving interpretability by grounding reasoning in structured spatial scene graphs rather than relying on hallucinated outputs.
🏢 OpenAI · 🏢 Anthropic
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce 3D-VCD, an inference-time framework that reduces hallucinations in 3D-LLM embodied agents by contrasting predictions against distorted scene graphs. The method addresses failures specific to 3D spatial reasoning without requiring model retraining, advancing reliability in embodied AI systems.
AI · Bullish · arXiv – CS AI · Mar 27 · 6/10
🧠Researchers introduced Graph-of-Mark (GoM), a new visual prompting technique that overlays scene graphs onto images to improve spatial reasoning in multimodal language models. Testing across 3 open-source MLMs and 4 datasets showed GoM improved zero-shot visual question answering and localization accuracy by up to 11 percentage points compared to existing methods like Set-of-Mark.
AI · Neutral · arXiv – CS AI · Feb 27 · 6/10
🧠Researchers introduce PoSh, a new evaluation metric for detailed image descriptions that uses scene graphs to guide LLMs-as-a-Judge, achieving better correlation with human judgments than existing methods. They also present DOCENT, a challenging benchmark dataset featuring artwork with expert-written descriptions to evaluate vision-language models' performance on complex image analysis.