y0news

#visual-qa News & Analysis

5 articles tagged with #visual-qa. AI-curated summaries with sentiment analysis and key takeaways, drawn from 50+ sources.

AI · Bearish · arXiv – CS AI · Mar 27 · 7/10

The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition

Research reveals that open-source large language models (LLMs) lack hierarchical knowledge of visual taxonomies, creating a bottleneck for vision LLMs on hierarchical visual recognition tasks. The study demonstrated this limitation using one million visual question answering tasks across six taxonomies, finding that even fine-tuning cannot overcome the underlying LLMs' knowledge gaps.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Researchers introduce ToolVQA, a large-scale multimodal dataset with 23K instances designed to improve AI models' ability to use external tools for visual question answering. The dataset features real-world contexts and multi-step reasoning tasks, and fine-tuned 7B models trained on it outperform GPT-3.5-turbo on various benchmarks.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models

Photon is a new framework that efficiently processes 3D medical imaging for AI visual question answering by using variable-length token sequences and adaptive compression. The system reduces computational costs while maintaining accuracy through instruction-conditioned token scheduling and custom gradient propagation techniques.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

Researchers introduce EgoNight, the first comprehensive benchmark for nighttime egocentric vision understanding, featuring day-night aligned videos and visual question answering tasks. The benchmark reveals significant performance drops in state-of-the-art multimodal large language models under low-light conditions.