🧠 AI | 🔴 Bearish | Importance 7/10

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

arXiv – CS AI | Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu, Jitian Guo, Yujiu Yang, Pinjia He | June 2, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce PaSBench-Video, a 740-video benchmark that evaluates whether multimodal large language models (MLLMs) can issue timely safety warnings on streaming video. Across 13 MLLMs tested, no model exceeds 20% accuracy on the strict metrics. Models struggle to distinguish emerging hazards from routine activity, particularly in driving scenarios, where safe and dangerous scenes look visually similar.

Analysis

PaSBench-Video addresses a gap in AI safety evaluation by testing whether video-capable language models can act as real-time safety monitors, a capability distinct from static image analysis or post-hoc video understanding. The benchmark's design reflects genuine deployment constraints: models must process video causally, without access to future frames; their warnings must be timestamped precisely enough to fall inside the window in which intervention is still possible; and they must keep false-positive rates low on benign footage. The results suggest that popular MLLMs rely on shallow pattern-matching rather than causal reasoning about how risk emerges.
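The three constraints above (causal processing, timestamped warnings, and benign clips that should trigger nothing) can be illustrated with a minimal sketch. This is not the benchmark's actual harness: `Clip`, `first_warning_time`, `score`, the `should_warn` callback, and the outcome labels are all hypothetical names chosen for illustration, and the rule that any warning outside the window counts as mistimed is an assumption rather than the paper's definition.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Clip:
    """Hypothetical clip record; not the benchmark's real schema."""
    frames: Sequence[str]                   # stand-in for decoded frames
    fps: float
    window: Optional[Tuple[float, float]]   # (start_s, end_s) in which a warning is useful; None = benign clip


def first_warning_time(
    clip: Clip, should_warn: Callable[[Sequence[str]], bool]
) -> Optional[float]:
    """Feed frames in order and return the time of the first warning, if any.

    Causality is enforced by only ever passing frames[: i + 1] to the model,
    so it never sees anything after the current frame.
    """
    for i in range(len(clip.frames)):
        if should_warn(clip.frames[: i + 1]):
            return i / clip.fps
    return None


def score(clip: Clip, warn_time: Optional[float]) -> str:
    """Classify one clip's outcome (labels are assumptions for illustration)."""
    if clip.window is None:
        # Benign footage: any warning at all is a false positive.
        return "false_positive" if warn_time is not None else "true_negative"
    if warn_time is None:
        return "missed"
    start, end = clip.window
    return "timely" if start <= warn_time <= end else "mistimed"
```

As a usage example, a toy `should_warn` such as `lambda seen: seen[-1] == "hazard"` can be passed to `first_warning_time`, and the returned time handed to `score`. A strict metric in this spirit would credit a model only for `timely` and `true_negative` outcomes.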

This research reflects broader concerns about deploying foundation models in safety-critical applications. MLLMs excel at describing video content, but describing what is happening is a different task from predicting what is about to happen, and the gap remains stark. The sharp variance across domains (moderate performance on daily-life hazards versus indiscriminate triggering in driving) suggests that models exploit dataset biases rather than learning generalizable safety concepts. Detection rate and false-positive rate also rise together, which points to a fundamental tradeoff rather than a simple calibration issue.

For AI developers and safety teams, these findings underscore that benchmark diversity matters: standard video understanding metrics miss critical safety properties. Organizations considering MLLMs for monitoring applications should recognize that current models perform poorly at exactly the task they would be deployed for. The research signals that meaningful progress requires models that reason about physical causation and temporal dynamics rather than visual similarity. If safety monitoring remains a target use case, future MLLM development should prioritize temporal reasoning and causal understanding.

Key Takeaways

→ No tested MLLM exceeds 20% accuracy on the strict safety-warning metrics, indicating fundamental capability gaps in current models
→ Detection rate and false positives show a Pearson correlation of 0.64, suggesting an inherent tradeoff rather than a calibration issue
→ Performance splits sharply by domain: models reach moderate accuracy only where hazards are inherently anomalous, as in daily-life scenarios
→ Current models rely on visual scene cues rather than causal reasoning about emerging harm and temporal dynamics
→ Standard video understanding benchmarks poorly predict safety-monitoring performance, exposing a gap in evaluation methodology
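To make the 0.64 figure concrete, here is a minimal sketch of how a Pearson correlation between detection rate and false-positive rate could be computed. It assumes the statistic is taken over one pair of rates per evaluated model, which the summary does not state explicitly. The `pearson` helper is written for illustration, and the numbers in `detection_rate` and `false_positive_rate` are made-up placeholders, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model rates, NOT taken from the paper.
detection_rate = [0.10, 0.25, 0.40, 0.55, 0.70]
false_positive_rate = [0.05, 0.30, 0.20, 0.45, 0.50]

r = pearson(detection_rate, false_positive_rate)
```

A clearly positive `r` means that models which catch more hazards also tend to raise more false alarms. That pattern is what one would expect if warnings are triggered by scene appearance rather than by evidence of an emerging hazard.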