
#mllm News & Analysis

34 articles tagged with #mllm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

Researchers introduce SemHiTok, a unified image tokenizer that uses semantic-guided hierarchical codebooks to balance multimodal understanding and generation tasks. The system decouples semantic and pixel features through a novel architecture that builds pixel sub-codebooks on pretrained semantic codebooks, achieving superior performance in both image reconstruction and multimodal understanding.
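The core idea — a pixel sub-codebook nested under each code of a pretrained semantic codebook — can be sketched as a two-level nearest-neighbor quantization. Everything below (dimensions, codebook sizes, the `quantize` helper) is a hypothetical illustration, not SemHiTok's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- not taken from the paper.
D = 8          # feature dimension
K_SEM = 4      # semantic codebook size
K_PIX = 16     # pixel sub-codebook size per semantic code

# Pretrained semantic codebook and, built on top of each semantic
# code, a pixel-level sub-codebook for reconstruction detail.
semantic_codebook = rng.normal(size=(K_SEM, D))
pixel_subcodebooks = rng.normal(size=(K_SEM, K_PIX, D))

def quantize(feature: np.ndarray) -> tuple[int, int]:
    """Two-level lookup: nearest semantic code first, then the
    nearest pixel code inside that semantic code's sub-codebook."""
    sem_idx = int(np.argmin(np.linalg.norm(semantic_codebook - feature, axis=1)))
    sub = pixel_subcodebooks[sem_idx]
    pix_idx = int(np.argmin(np.linalg.norm(sub - feature, axis=1)))
    return sem_idx, pix_idx

feat = rng.normal(size=D)
sem_idx, pix_idx = quantize(feat)
print(sem_idx, pix_idx)
```

The two-level split mirrors the decoupling the summary describes: the coarse index carries semantics for understanding, while the fine index carries pixel detail for generation.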

AI · Neutral · arXiv – CS AI · Mar 3 · 5/10

Culture In a Frame: C³B as a Comic-Based Benchmark for Multimodal Cultural Awareness

Researchers introduce C³B (Comics Cross-Cultural Benchmark), a new benchmark to test cultural awareness capabilities in Multimodal Large Language Models using over 2,000 comic images and 18,000 QA pairs. Testing revealed significant performance gaps between current MLLMs and human performance, highlighting the need for improved cultural understanding in AI systems.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Researchers introduce Vision-DeepResearch Benchmark (VDR-Bench) with 2,000 VQA instances to better evaluate multimodal AI systems' visual and textual search capabilities. The benchmark addresses limitations in existing evaluations where answers could be inferred without proper visual search, and proposes a multi-round cropped-search workflow to improve model performance.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10

Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume

Researchers introduce UMPIRE, a new training-free framework for quantifying uncertainty in Multimodal Large Language Models (MLLMs) across various input and output modalities. The system measures incoherence-adjusted semantic volume of model responses to better detect errors and improve reliability without requiring external tools or additional computational overhead.
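One way to picture a "semantic volume" of responses is the log-determinant of the Gram matrix of their (unit-normalized) embeddings: near-identical answers span almost no volume, diverse answers span a large one. This is a minimal sketch of that intuition only — the function name, embedding sizes, and data are all hypothetical, not UMPIRE's actual incoherence adjustment:

```python
import numpy as np

def semantic_volume(embeddings: np.ndarray, eps: float = 1e-6) -> float:
    """Log-volume spanned by response embeddings: log-determinant of
    the Gram matrix of unit-normalized rows. A larger value means the
    sampled responses are more semantically dispersed (more uncertain)."""
    E = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + eps)
    gram = E @ E.T
    _sign, logdet = np.linalg.slogdet(gram + eps * np.eye(len(E)))
    return float(logdet)

rng = np.random.default_rng(0)
# Hypothetical embeddings of five sampled answers to the same prompt.
agreeing = rng.normal(size=(5, 32)) * 0.01 + rng.normal(size=32)  # near-identical
scattered = rng.normal(size=(5, 32))                              # diverse answers

print(semantic_volume(agreeing) < semantic_volume(scattered))  # prints True
```

Because the measure only needs embeddings of sampled responses, it matches the training-free, no-external-tools framing in the summary.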

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10

To Deceive is to Teach? Forging Perceptual Robustness via Adversarial Reinforcement Learning

Researchers introduce AOT (Adversarial Opponent Training), a self-play framework that improves Multimodal Large Language Models' robustness by having an AI attacker generate adversarial image manipulations to train a defender model. The method addresses perceptual fragility in MLLMs when processing visually complex scenes, reducing hallucinations through dynamic adversarial training.
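The attacker/defender loop can be caricatured with a toy linear classifier standing in for the defender and an FGSM-style sign perturbation standing in for the attacker's image manipulations. All of it (data, step sizes, the logistic model) is an illustrative stand-in, not the AOT method itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 2-D "perception" task standing in for the defender model.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)          # defender parameters
eps_atk, lr = 0.3, 0.1   # attacker budget, defender learning rate

for _ in range(300):
    p = sigmoid(X @ w)
    # Attacker: nudge each input in the direction that increases the
    # defender's loss (gradient of logistic loss w.r.t. the input).
    X_adv = X + eps_atk * np.sign((p - y)[:, None] * w)
    # Defender: train on the adversarially perturbed batch.
    p_adv = sigmoid(X_adv @ w)
    w -= lr * X_adv.T @ (p_adv - y) / len(y)

clean_acc = ((sigmoid(X @ w) > 0.5) == y).mean()
print(round(clean_acc, 2))
```

The point of the loop is the one the summary makes: because the attacker regenerates perturbations against the current defender every step, the defender is trained against a moving adversary rather than a fixed corruption set.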

AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

Social Norm Reasoning in Multimodal Language Models: An Evaluation

Researchers evaluated five Multimodal Large Language Models (MLLMs) on their ability to reason about social norms in both text and image scenarios. GPT-4o performed best overall, and all models reasoned about norms more accurately from text than from images.
