33 articles tagged with #multimodal-llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI Neutral · arXiv › CS AI · Mar 3 · 6/10 · 4
🧠 Researchers introduce EgoNight, the first comprehensive benchmark for nighttime egocentric vision understanding, featuring day-night aligned videos and visual question answering tasks. The benchmark reveals significant performance drops in state-of-the-art multimodal large language models when operating under low-light conditions.
AI Bullish · arXiv › CS AI · Mar 3 · 6/10 · 3
🧠 Researchers propose HIMM, a new memory framework for embodied AI agents that separates episodic and semantic memory to improve long-term performance. The system achieves significant gains on benchmarks, including a 7.3% improvement in LLM-Match and 11.4% in LLM MatchXSPL, addressing key challenges in deploying multimodal language models as embodied agent brains.
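The episodic/semantic split mentioned above can be sketched in a few lines. This is a hedged toy illustration only, not HIMM's actual design — the `AgentMemory` class, its methods, and the consolidation rule are all assumptions made for the example:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentMemory:
    """Toy separation of episodic vs. semantic memory (illustrative only)."""
    episodic: list = field(default_factory=list)   # time-ordered event records
    semantic: dict = field(default_factory=dict)   # distilled facts, keyed by topic

    def observe(self, timestep: int, event: str) -> None:
        # Episodic memory: append raw, timestamped experience.
        self.episodic.append((timestep, event))

    def consolidate(self, topic: str, fact: str) -> None:
        # Promote a recurring observation into a stable semantic fact.
        self.semantic[topic] = fact

    def recall(self, topic: str) -> Optional[str]:
        # Prefer a consolidated semantic fact; otherwise fall back to
        # the most recent episodic record mentioning the topic.
        if topic in self.semantic:
            return self.semantic[topic]
        for _, event in reversed(self.episodic):
            if topic in event:
                return event
        return None

mem = AgentMemory()
mem.observe(1, "saw keys on the kitchen table")
mem.consolidate("keys", "keys are usually kept on the kitchen table")
print(mem.recall("keys"))  # semantic fact takes priority over raw episodes
```

The point of the split is that episodic entries grow without bound while semantic facts stay compact, which is one plausible reason such a separation helps long-horizon agents.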
AI Bullish · arXiv › CS AI · Mar 2 · 6/10 · 18
🧠 Researchers developed RD-MLDG, a new framework that uses multimodal large language models with reasoning chains to improve domain generalization in deep learning. The approach addresses challenges in cross-domain visual recognition by leveraging reasoning capabilities rather than just visual feature invariance, achieving state-of-the-art performance on standard benchmarks.
AI Neutral · arXiv › CS AI · Mar 2 · 6/10 · 12
🧠 Researchers introduce Ref-Adv, a new benchmark for testing multimodal large language models' visual reasoning capabilities in referring expression tasks. The benchmark reveals that current MLLMs, despite performing well on standard datasets like RefCOCO, rely heavily on shortcuts and show significant gaps in genuine visual reasoning and grounding abilities.
AI Bullish · arXiv › CS AI · Mar 2 · 6/10 · 14
🧠 Researchers propose a data-efficient framework to convert generative multimodal large language models into universal embedding models without extensive pre-training. The method uses hierarchical embedding prompts and Self-aware Hard Negative Sampling to achieve competitive performance on embedding benchmarks with minimal training data.
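Hard-negative mining of the general kind this summary mentions can be sketched as follows. This is a generic cosine-similarity version, not the paper's Self-aware Hard Negative Sampling; the function name, parameters, and selection rule are assumptions for illustration:

```python
import numpy as np

def mine_hard_negatives(query, positive_ids, candidates, k=5):
    """Pick the k non-positive candidates most similar to the query embedding.

    Hard negatives (near-misses) give contrastive training a stronger signal
    than random negatives. Illustrative stand-in, not the paper's method.
    """
    # Cosine similarity between the query and every candidate row.
    sims = candidates @ query / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query) + 1e-9
    )
    # Rank candidates by similarity, then drop the known positives.
    order = np.argsort(-sims)
    return [int(i) for i in order if int(i) not in positive_ids][:k]

query = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
hard = mine_hard_negatives(query, positive_ids={0}, candidates=candidates, k=2)
print(hard)  # the two near-miss rows closest to the query
```

In a real pipeline the mined indices would feed a contrastive loss such as InfoNCE; the design choice worth noting is that negatives are ranked by the model's own similarity scores, so mining gets harder as the model improves.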
AI Bullish · arXiv › CS AI · Feb 27 · 6/10 · 8
🧠 Researchers have developed FactGuard, an AI framework that uses multimodal large language models and reinforcement learning to detect video misinformation. The system addresses limitations of existing models by implementing iterative reasoning processes and external tool integration to verify information across video content.
AI Neutral · arXiv › CS AI · Mar 4 · 4/10 · 2
🧠 Researchers developed new prompting-based approaches using multimodal large language models to generate real-time video commentary that considers both content relevance and timing. The study introduces dynamic interval-based decoding that adjusts prediction timing based on utterance duration, showing improved alignment with human commentary patterns without requiring model fine-tuning.
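The timing idea — delaying the next prediction by the estimated duration of the current utterance — can be sketched as below. The function name, speaking-rate constant, and minimum-gap floor are assumptions for the example, not values from the study:

```python
def next_comment_time(t_now, utterance_words, words_per_sec=2.5, min_gap=1.0):
    """Schedule the next commentary prediction after the current utterance
    would plausibly finish being spoken (illustrative sketch only).

    t_now           -- current video timestamp in seconds
    utterance_words -- tokenized words of the commentary just emitted
    words_per_sec   -- assumed speaking rate (hypothetical constant)
    min_gap         -- floor so very short utterances don't cause spamming
    """
    duration = len(utterance_words) / words_per_sec
    return t_now + max(duration, min_gap)

# A 5-word line at 2.5 words/sec occupies 2 s, so the next prediction
# is scheduled 2 s later rather than at a fixed interval.
t_next = next_comment_time(10.0, "what a great save there".split())
print(t_next)
```

The contrast with fixed-interval decoding is that long utterances push the next prediction further out, which is the kind of duration-aware spacing the summary describes.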
AI Neutral · arXiv › CS AI · Mar 3 · 4/10 · 3
🧠 Researchers introduced VisJudge-Bench, the first comprehensive benchmark for evaluating AI models' ability to assess visualization quality and aesthetics, revealing significant gaps between advanced models such as GPT-5 and human expert judgment. They also developed VisJudge, a specialized model that achieved 60.5% better correlation with human assessments than GPT-5.