AINeutralarXiv โ CS AI ยท 5d ago7/103
๐ง
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
Researchers have introduced WorldSense, the first benchmark for evaluating multimodal AI systems that process visual, audio, and text inputs simultaneously. The benchmark contains 1,662 synchronized audio-visual videos across 67 subcategories and 3,172 QA pairs, revealing that current state-of-the-art models achieve only 65.1% accuracy on real-world understanding tasks.