Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation
Researchers introduce Physics Question Scene Graph (PQSG), a new evaluation framework that uses vision-language models to assess whether AI-generated videos obey physical laws. The framework evaluates videos from models like Sora 2 and Veo 3 through hierarchical question graphs, revealing that closed-source models outperform open-source alternatives in physical realism.
Video generation models have achieved impressive visual fidelity, yet a fundamental gap persists: their inability to consistently respect basic physical laws. PQSG addresses this critical evaluation challenge by introducing a structured, granular assessment method. Rather than relying on binary pass/fail metrics, the framework generates context-aware questions organized as a logical dependency graph, enabling precise identification of which specific physical constraints are violated and where.
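To make the dependency-graph idea concrete, here is a minimal Python sketch of how a hierarchy of yes/no questions could localize a violation. It is an illustration under stated assumptions, not the PQSG implementation. The `Question` class, the `evaluate` function, the example questions, and the rule that a question is skipped when any of its parents did not pass are all hypothetical. The `answers` dictionary stands in for whatever a vision-language model would return for a given video.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Question:
    """One yes/no check about a video (hypothetical structure, not from the paper)."""
    qid: str
    text: str
    parents: tuple = ()  # qids of questions that must pass before this one is meaningful


def evaluate(questions, answers):
    """Resolve questions in dependency order and return qid -> status.

    `answers` maps qid -> bool and stands in for a VLM's yes/no answers.
    Assumed gating rule: a question is "skipped" unless all its parents passed.
    """
    status = {}
    pending = {q.qid: q for q in questions}
    while pending:
        # A question is ready once every one of its parents has a status.
        ready = [q for q in pending.values() if all(p in status for p in q.parents)]
        if not ready:
            raise ValueError("cycle or unknown parent in question graph")
        for q in ready:
            if all(status[p] == "pass" for p in q.parents):
                status[q.qid] = "pass" if answers[q.qid] else "fail"
            else:
                status[q.qid] = "skipped"
            del pending[q.qid]
    return status


# Hypothetical prompt: "a hand drops a ball onto the floor".
questions = [
    Question("q1", "Is a ball visible in the video?"),
    Question("q2", "Is the ball released from the hand?", ("q1",)),
    Question("q3", "Does the ball move downward after release?", ("q2",)),
    Question("q4", "Does the ball rebound lower than its release height?", ("q3",)),
]
# Made-up answers for illustration only; not results from any model.
answers = {"q1": True, "q2": True, "q3": False, "q4": True}

status = evaluate(questions, answers)
violations = [qid for qid, s in status.items() if s == "fail"]
print(status)
print(violations)
```

In this toy run the failure is attributed to `q3` alone. `q4` is skipped rather than scored, because asking about the rebound is not meaningful once the fall itself is implausible. That is the kind of localization a single pass/fail score cannot provide.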
This work emerges as video generation technology rapidly advances toward production use. Current models like Sora 2, Veo 3, and open-source alternatives like Wan struggle with scenarios requiring consistent physical reasoning: objects that drift upward when they should fall, liquids that defy gravity, or impossible interactions between entities. Without reliable evaluation methods, developers cannot systematically debug these failures, and researchers lack quantitative benchmarks for improvement.
The FinePhyEval dataset is a significant research contribution in its own right, pairing physics-based prompts with human annotations across multiple models. The finding that closed-source models significantly outrank Wan 2.1 on physical realism metrics suggests that proprietary architectures or training procedures may confer advantages in satisfying physical constraints. This disparity may influence enterprise adoption decisions in areas such as simulation, education, and visual effects, where physical plausibility directly affects usefulness.
Looking forward, PQSG's hierarchical question framework could become a standard evaluation methodology across the video generation industry. The benchmark reveals that while VLMs excel at generating human-like questions, answering them accurately remains challenging, which points to needed improvements in multimodal reasoning. As video models move toward real-world applications, systematic evaluation of physical plausibility shifts from an academic interest to a practical necessity.
- PQSG enables fine-grained evaluation of physical law adherence in AI-generated videos through hierarchical question graphs.
- Closed-source models (Sora 2, Veo 3) demonstrate significantly higher physical realism than the open-source alternative Wan 2.1.
- The FinePhyEval dataset provides the first large-scale benchmark for physics-based video generation assessment with human annotations.
- Vision-language models can generate human-like evaluation questions but lag in answering them accurately, indicating reasoning gaps.
- The framework localizes specific physical constraint violations, enabling targeted model improvements beyond holistic quality scores.