From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs
Researchers demonstrate that large language models can ground 3D scene objects to formal ontology classes without any training, achieving 90-96% accuracy on kitchen scenes when object names are descriptive. This zero-shot approach removes the reliance on brittle, manually curated dictionaries when constructing knowledge graphs for robotic task reasoning.
This research addresses a fundamental automation challenge in robotics and 3D scene understanding. Previously, mapping objects from Universal Scene Description (USD) files to semantic ontology classes required hand-crafted dictionaries that were inflexible and failed across different asset libraries. The key idea is to use LLMs' semantic reasoning to perform this grounding without task-specific training or fine-tuning, so the same approach can be applied to a new asset library or domain without first building a dictionary for it.
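To make the brittleness of dictionary-based grounding concrete, here is a minimal sketch of an exact-match lookup applied to two asset libraries. The dictionary contents, object names, and class labels are hypothetical illustrations, not data from the paper.

```python
from typing import List, Optional

# Hypothetical hand-curated mapping from asset names to ontology classes.
# These names and class labels are illustrative assumptions only.
NAME_TO_CLASS = {
    "Refrigerator": "Refrigerator",
    "CoffeeMug": "Cup",
    "KitchenSink": "Sink",
}


def ground_with_dictionary(prim_name: str) -> Optional[str]:
    """Return the ontology class for an exact name match, else None."""
    return NAME_TO_CLASS.get(prim_name)


# The same three objects as they might be named in two asset libraries.
library_a: List[str] = ["Refrigerator", "CoffeeMug", "KitchenSink"]
library_b: List[str] = ["fridge_01", "mug_ceramic", "SM_Sink_A"]

hits_a = [ground_with_dictionary(name) for name in library_a]
hits_b = [ground_with_dictionary(name) for name in library_b]

print(hits_a)
print(hits_b)
```

The dictionary covers the library it was written for and misses every object in the second one, so each new naming convention needs new entries. That is the maintenance cost the zero-shot approach aims to avoid.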
The results are impressive but nuanced. With descriptive object names, LLMs achieve high accuracy (90-96%), showing that they can reliably map names to ontology classes. Accuracy drops to 49-89% with abbreviated names, and under fully opaque naming conventions it reaches only 48% even when the prompt is augmented with scene context. The feature ablation study reveals the LLM's reasoning strategy: it primarily exploits semantic cues embedded in the scene graph itself, namely sibling object relationships and hierarchical parent paths, rather than geometric properties. This finding has important implications for system design. When only geometric information is available, accuracy plummets to 4-17%, indicating that semantic context is the critical factor.
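The feature ablation described above can be pictured as toggling which cues are included in the LLM prompt. The sketch below is an assumption about how such a prompt might be assembled, not the authors' implementation. `PrimInfo`, `build_prompt`, the flag names, and the example scene are all hypothetical, and no LLM is actually called.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PrimInfo:
    """Hypothetical summary of one scene object (not a USD API type)."""
    path: str                               # e.g. "/Kitchen/Counter/CoffeeMug"
    siblings: List[str]                     # names of objects under the same parent
    bbox_size: Tuple[float, float, float]   # bounding-box extents in scene units


def build_prompt(
    prim: PrimInfo,
    classes: Sequence[str],
    use_name: bool = True,
    use_parent: bool = True,
    use_siblings: bool = True,
    use_geometry: bool = False,
) -> str:
    """Assemble a zero-shot grounding prompt from the enabled features."""
    # Split "/Kitchen/Counter/CoffeeMug" into parent path and leaf name.
    parent_path, _, name = prim.path.rpartition("/")
    lines = [
        "Assign the object to exactly one ontology class.",
        "Candidate classes: " + ", ".join(classes),
    ]
    if use_name:
        lines.append(f"Object name: {name}")
    if use_parent:
        lines.append(f"Parent path: {parent_path or '/'}")
    if use_siblings:
        lines.append("Sibling objects: " + ", ".join(prim.siblings))
    if use_geometry:
        x, y, z = prim.bbox_size
        lines.append(f"Bounding box size: {x:.2f} x {y:.2f} x {z:.2f}")
    lines.append("Answer with the class name only.")
    return "\n".join(lines)


mug = PrimInfo(
    path="/Kitchen/Counter/CoffeeMug",
    siblings=["Kettle", "Toaster"],
    bbox_size=(0.08, 0.08, 0.10),
)
classes = ["Cup", "Kettle", "Refrigerator", "Sink"]

full_prompt = build_prompt(mug, classes)
geometry_only_prompt = build_prompt(
    mug, classes,
    use_name=False, use_parent=False, use_siblings=False, use_geometry=True,
)
print(full_prompt)
```

In the geometry-only variant the prompt contains nothing but the class list and a bounding box, which gives some intuition for why the reported accuracy in that setting is so low.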
For the robotics and AI communities, this work reduces the engineering overhead of deploying knowledge graph systems at scale. Rather than maintaining a separate ontology dictionary for each environment, practitioners can use readily available LLMs. However, the dependence on semantic scene structure means success remains contingent on how well scenes are named and organized. Organizations deploying such systems should prioritize descriptive naming conventions and clear hierarchical organization. Future work should examine performance on noisy or partially annotated scene data, and whether domain-specific prompting strategies can further improve accuracy under abbreviated or opaque naming.
- LLMs achieve 90-96% accuracy grounding USD scene objects to ontology classes without training when object names are descriptive, eliminating brittle manual dictionary maintenance.
- Accuracy drops to 49-89% with abbreviated names and further with opaque ones, revealing heavy dependence on semantic cues in names and scene graph structure rather than geometry.
- Semantic cues such as sibling relationships and hierarchical parent paths drive LLM reasoning, while geometric features alone yield only 4-17% accuracy.
- The zero-shot approach needs no per-library dictionary, reducing engineering overhead across asset libraries and domains compared to dictionary-based systems.
- Context-augmented prompting reaches up to 48% accuracy under fully opaque naming, leaving room for improvement through better prompting strategies.