Jailbreaking Multimodal Large Language Models using Multi-Clip Video
Researchers have identified critical vulnerabilities in how multimodal large language models (MLLMs) process video inputs, demonstrating that safety mechanisms can be systematically bypassed using multi-clip videos with diverse contexts. The study finds that video inputs pose greater security risks than static images, with attack success rates increasing in proportion to the number of video clips used.
