🧠 AI | ⚪ Neutral | Importance 6/10

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

arXiv – CS AI|Atin Pothiraj, Jaemin Cho, Yue Zhang, Elias Stengel-Eskin, Mohit Bansal|June 25, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce Physics Question Scene Graph (PQSG), a new evaluation framework that uses vision-language models to assess whether AI-generated videos obey physical laws. The framework evaluates videos from models like Sora 2 and Veo 3 through hierarchical question graphs, revealing that closed-source models outperform open-source alternatives in physical realism.

Analysis

Video generation models have achieved impressive visual fidelity, yet a fundamental gap persists: they do not consistently respect basic physical laws. PQSG addresses this evaluation problem with a structured, granular assessment method. Rather than relying on binary pass/fail metrics, the framework generates context-aware questions organized as a logical dependency graph, which makes it possible to identify which specific physical constraints are violated and where.
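To make the dependency-graph idea concrete, here is a minimal sketch. The summary does not spell out PQSG's actual scoring rule, so the rule used here (a question is skipped when any question it depends on did not pass) is an assumption, and every name in the code (`Question`, `score_graph`, the example questions and answers) is hypothetical. In a real pipeline the answers would come from a vision-language model; here they are hard-coded.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Question:
    """One yes/no check about a video (hypothetical structure, not from the paper)."""
    qid: str
    text: str
    parents: list[str] = field(default_factory=list)  # questions that must pass first


def score_graph(questions: list[Question], answers: dict[str, bool]) -> dict[str, str]:
    """Label each question 'pass', 'fail', or 'skipped'.

    Assumed rule: a question is only evaluated when all of its parents passed.
    Assumes `questions` is in topological order (parents listed before children).
    """
    status: dict[str, str] = {}
    for q in questions:
        if any(status[p] != "pass" for p in q.parents):
            status[q.qid] = "skipped"
        else:
            status[q.qid] = "pass" if answers[q.qid] else "fail"
    return status


# Hypothetical question chain for a prompt such as "a ball is dropped onto the floor".
questions = [
    Question("q1", "Is there a ball in the video?"),
    Question("q2", "Is the ball released from a height?", parents=["q1"]),
    Question("q3", "Does the ball move downward after release?", parents=["q2"]),
    Question("q4", "Does the ball bounce lower on each rebound?", parents=["q3"]),
]

# Stand-in for VLM answers: the ball appears and is released, but does not fall.
answers = {"q1": True, "q2": True, "q3": False, "q4": True}

status = score_graph(questions, answers)
violations = [qid for qid, s in status.items() if s == "fail"]
print(status)
print(violations)
```

Under this assumed rule, the failure is attributed to the one question that broke (`q3`) rather than to every downstream question, which is the kind of localization a single pass/fail score cannot give.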

This work arrives as video generation technology advances rapidly toward production use. Current models such as Sora 2 and Veo 3, along with open-source alternatives such as Wan, struggle with scenarios that require consistent physical reasoning, producing objects that fall upward, liquids that defy gravity, or impossible interactions between entities. Without reliable evaluation methods, developers cannot systematically debug these failures, and researchers lack quantitative benchmarks for improvement.

The FinePhyEval dataset is a further research contribution, pairing physics-based prompts with human annotations across multiple models. The finding that closed-source models significantly outrank Wan 2.1 on physical realism metrics suggests that proprietary architectures or training procedures may confer advantages in constraint satisfaction. This disparity may influence enterprise adoption decisions, since physical plausibility directly affects applications in simulation, education, and visual effects.

Looking forward, PQSG's hierarchical question framework could become a standard evaluation methodology across the video generation industry. The benchmark reveals that while VLMs excel at generating human-like questions, answering them accurately remains challenging, which points to needed improvements in multimodal reasoning. As video models move toward real-world applications, systematic physical plausibility evaluation shifts from an academic interest to a practical necessity.

Key Takeaways

→PQSG enables fine-grained evaluation of physical law adherence in AI-generated videos through hierarchical question graphs.
→Closed-source models (Sora 2, Veo 3) demonstrate significantly higher physical realism than the open-source alternative Wan 2.1.
→The FinePhyEval dataset provides the first large-scale benchmark for physics-based video generation assessment with human annotations.
→Vision-language models can generate human-like evaluation questions but lag in accurately answering them, indicating reasoning gaps.
→The framework localizes specific physical constraint violations, enabling targeted model improvements beyond holistic quality scores.

#video-generation #physics-simulation #evaluation-metrics #vlm-benchmarking #ai-quality-assessment #multimodal-reasoning #open-source-models
