34 articles tagged with #mllm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 2
🧠 Researchers introduce SemHiTok, a unified image tokenizer that uses semantic-guided hierarchical codebooks to balance multimodal understanding and generation tasks. The system decouples semantic and pixel features through a novel architecture that builds pixel sub-codebooks on pretrained semantic codebooks, achieving superior performance in both image reconstruction and multimodal understanding.
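A minimal sketch of the hierarchical lookup this summary describes, assuming a frozen pretrained semantic codebook whose entry selects a per-entry pixel sub-codebook that quantizes the residual. Class names, codebook sizes, and the residual formulation are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class HierarchicalQuantizer(nn.Module):
    """Toy two-level quantizer: a pixel sub-codebook is chosen by the
    semantic code, decoupling semantic from pixel-level features."""
    def __init__(self, num_sem=512, num_pix=64, dim=256):
        super().__init__()
        # Top level: pretrained semantic codebook (frozen here, as an assumption).
        self.sem_codebook = nn.Embedding(num_sem, dim)
        self.sem_codebook.weight.requires_grad = False
        # Bottom level: one pixel sub-codebook per semantic entry.
        self.pix_codebooks = nn.Parameter(torch.randn(num_sem, num_pix, dim))

    def forward(self, z):                                  # z: (batch, dim)
        # 1) Nearest semantic code.
        d_sem = torch.cdist(z, self.sem_codebook.weight)   # (batch, num_sem)
        sem_idx = d_sem.argmin(dim=1)
        z_sem = self.sem_codebook(sem_idx)                 # (batch, dim)
        # 2) Quantize the residual with the sub-codebook picked by sem_idx.
        residual = z - z_sem
        sub = self.pix_codebooks[sem_idx]                  # (batch, num_pix, dim)
        d_pix = (residual.unsqueeze(1) - sub).pow(2).sum(-1)
        pix_idx = d_pix.argmin(dim=1)
        z_pix = sub[torch.arange(z.size(0)), pix_idx]
        # Reconstruction = semantic code + pixel-level refinement.
        return z_sem + z_pix, (sem_idx, pix_idx)

quant = HierarchicalQuantizer()
z_q, codes = quant(torch.randn(4, 256))
print(z_q.shape)   # torch.Size([4, 256])
```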
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠 Researchers propose MOON, the first generative multimodal large language model designed specifically for e-commerce product understanding. The model addresses key challenges in product representation learning through guided Mixture-of-Experts modules and semantic region detection, while introducing a new benchmark dataset for evaluation.
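One way to read "guided Mixture-of-Experts" is a router whose logits are biased by an external guidance embedding (e.g., a pooled product-region feature). The sketch below is a hypothetical layer under that assumption; the paper's actual module design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedMoE(nn.Module):
    """Illustrative guided MoE layer: routing logits combine token
    features with an external guidance signal."""
    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.guide_proj = nn.Linear(dim, num_experts)  # guidance -> routing bias

    def forward(self, x, guidance):
        # Router sees both the token features and the guidance cue.
        logits = self.router(x) + self.guide_proj(guidance)
        weights = F.softmax(logits, dim=-1)                       # (batch, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)

layer = GuidedMoE()
x = torch.randn(8, 256)   # token features
g = torch.randn(8, 256)   # e.g. pooled semantic-region embedding
print(layer(x, g).shape)  # torch.Size([8, 256])
```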
AI · Neutral · arXiv – CS AI · Mar 3 · 5/10 · 3
🧠 Researchers introduce C³B (Comics Cross-Cultural Benchmark), a new benchmark to test cultural awareness capabilities in Multimodal Large Language Models using over 2,000 comic images and 18,000 QA pairs. Testing revealed significant performance gaps between current MLLMs and human performance, highlighting the need for improved cultural understanding in AI systems.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠 Researchers introduce Vision-DeepResearch Benchmark (VDR-Bench) with 2,000 VQA instances to better evaluate multimodal AI systems' visual and textual search capabilities. The benchmark addresses limitations in existing evaluations where answers could be inferred without proper visual search, and proposes a multi-round cropped-search workflow to improve model performance.
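The multi-round cropped-search workflow reduces to a simple loop: query the model, and if it proposes a region of interest instead of an answer, crop that region and query again. Here `ask_model` and its reply schema are hypothetical stand-ins for an MLLM call, not an API from the paper.

```python
from PIL import Image

def ask_model(image, question):
    """Hypothetical MLLM call: returns {'type': 'answer', 'text': ...}
    or {'type': 'crop', 'box': (l, t, r, b)}. Stubbed for illustration."""
    raise NotImplementedError("replace with a real MLLM call")

def cropped_search(image_path, question, max_rounds=3):
    """Multi-round cropped search: each round, the model either answers
    or names a sub-region worth zooming into."""
    image = Image.open(image_path)
    for _ in range(max_rounds):
        reply = ask_model(image, question)
        if reply["type"] == "answer":
            return reply["text"]
        image = image.crop(reply["box"])   # zoom in and re-query
    # Force a direct answer once the round budget is exhausted.
    return ask_model(image, question + " Answer directly.")["text"]
```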
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 19
🧠 Researchers have developed EMO-R3, a new framework that enhances emotional reasoning capabilities in Multimodal Large Language Models through reflective reinforcement learning. The approach introduces structured emotional thinking and reflective rewards to improve interpretability and emotional intelligence in visual understanding tasks.
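A reflective reward of this kind can be approximated as a shaped reward that pays for structured reasoning tags plus answer accuracy. The tag names and weights below are assumptions for illustration, not the paper's actual reward design.

```python
import re

def reflective_reward(response: str, gold_emotion: str) -> float:
    """Toy RL reward: (i) format reward for structured emotional thinking
    and a reflection step, (ii) accuracy reward for the final emotion label.
    Tags <think>/<reflect>/<answer> are hypothetical."""
    has_think = bool(re.search(r"<think>.*?</think>", response, re.S))
    has_reflect = bool(re.search(r"<reflect>.*?</reflect>", response, re.S))
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    answer = m.group(1).strip().lower() if m else ""
    r_format = 0.2 * has_think + 0.2 * has_reflect
    r_acc = 1.0 if answer == gold_emotion.lower() else 0.0
    return r_format + r_acc

resp = "<think>tears, slumped posture</think><reflect>not anger</reflect><answer>sadness</answer>"
print(reflective_reward(resp, "sadness"))  # 1.4
```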
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 10
🧠 Researchers introduce UMPIRE, a new training-free framework for quantifying uncertainty in Multimodal Large Language Models (MLLMs) across various input and output modalities. The system measures incoherence-adjusted semantic volume of model responses to better detect errors and improve reliability without requiring external tools or additional computational overhead.
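One common proxy for the "semantic volume" of a set of sampled responses is the log-determinant of the Gram matrix of their embeddings: near-duplicate answers span a small volume, scattered answers a large one. This sketch uses that proxy under stated assumptions; it omits the paper's incoherence adjustment and uses random vectors in place of real sentence embeddings.

```python
import numpy as np

def semantic_volume(embeddings, eps=1e-6):
    """Log-volume spanned by N response embeddings, via the
    log-determinant of their (regularized) Gram matrix.
    Larger volume = more semantically scattered = higher uncertainty."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize rows
    gram = E @ E.T
    _, logdet = np.linalg.slogdet(gram + eps * np.eye(len(E)))
    return logdet

# Usage: embed k sampled answers to the same prompt and compare volumes.
tight = np.random.randn(1, 64) + 0.01 * np.random.randn(5, 64)  # near-duplicates
loose = np.random.randn(5, 64)                                  # diverse answers
print(semantic_volume(tight) < semantic_volume(loose))          # True (typically)
```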
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 5
🧠 Researchers introduce AOT (Adversarial Opponent Training), a self-play framework that improves Multimodal Large Language Models' robustness by having an AI attacker generate adversarial image manipulations to train a defender model. The method addresses perceptual fragility in MLLMs when processing visually complex scenes, reducing hallucinations through dynamic adversarial training.
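The attacker/defender loop can be sketched with a PGD-style inner maximization standing in for the learned attacker: the inner loop crafts a bounded image perturbation that maximizes the defender's loss, then the defender takes a gradient step on the perturbed batch. The loss, step sizes, and bound are illustrative, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def adversarial_round(defender, images, labels, opt, eps=0.03, steps=3):
    """One self-play round (sketch): attacker turn maximizes the
    defender's loss within an L-infinity ball, defender turn minimizes
    it on the perturbed images."""
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):                               # attacker turn
        loss = F.cross_entropy(defender(images + delta), labels)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + eps * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    opt.zero_grad()                                      # defender turn
    loss = F.cross_entropy(defender(images + delta.detach()), labels)
    loss.backward()
    opt.step()
    return loss.item()
```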
AI · Neutral · arXiv – CS AI · Mar 5 · 4/10
🧠 Researchers evaluated five Multimodal Large Language Models (MLLMs) on their ability to reason about social norms in both text and image scenarios. GPT-4o performed best overall, while all models showed superior performance with text-based norm reasoning compared to image-based scenarios.
AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 5
🧠 Researchers developed TAR-FAS, a new AI framework that uses external visual tools to improve face anti-spoofing detection across different domains. The system employs a Chain-of-Thought approach with visual tools to detect subtle spoofing patterns that traditional methods miss, achieving state-of-the-art performance.
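A tool-augmented chain of thought like this reduces to a dispatch loop: the model reasons over a growing transcript, optionally names a visual tool, and the tool's observation is appended before the next reasoning step. `ask_llm`, `run_tool`, and the reply format below are placeholders, not the paper's interface.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder MLLM call: returns either 'TOOL <name>' to request a
    visual tool or 'VERDICT live|spoof'. Stubbed for illustration."""
    raise NotImplementedError("replace with a real MLLM call")

def run_tool(name: str, image) -> str:
    """Placeholder visual tool (e.g. frequency analysis for screen moire,
    reflectance map for print/mask cues); returns a textual observation."""
    raise NotImplementedError("replace with a real tool implementation")

def anti_spoof_cot(image, max_steps=4) -> str:
    """Sketch of a tool-augmented Chain-of-Thought loop for face
    anti-spoofing: reason, optionally call a visual tool, repeat."""
    transcript = "Decide whether this face image is live or spoofed."
    for _ in range(max_steps):
        reply = ask_llm(transcript)
        if reply.startswith("VERDICT"):
            return reply.split()[1]        # 'live' or 'spoof'
        tool_name = reply.split()[1]       # e.g. 'TOOL fft' -> 'fft'
        # Feed the tool's observation back into the reasoning chain.
        transcript += f"\n[{tool_name}] {run_tool(tool_name, image)}"
    return "spoof"  # conservative default if no verdict is reached
```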