PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
arXiv – CS AI | Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown
🤖AI Summary
Researchers introduce PoSh, a new evaluation metric for detailed image descriptions that uses scene graphs to guide LLMs-as-a-Judge, achieving better correlation with human judgments than existing methods. They also present DOCENT, a challenging benchmark dataset featuring artwork with expert-written descriptions to evaluate vision-language models' performance on complex image analysis.
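To make the idea of a scene graph acting as a rubric concrete, here is a minimal sketch. It is not PoSh's actual pipeline: this summary does not describe the paper's graph extraction, prompts, or scoring, so the `SceneGraph` and `Relation` types, the question templates, and the `coverage` score are all hypothetical. The `judge` argument stands in for an LLM call and is replaced by a trivial stub in the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A hypothetical (subject, predicate, object) edge in a scene graph."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Hypothetical scene graph: objects with attributes, plus relations."""
    objects: dict[str, list[str]] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)


def rubric_questions(graph: SceneGraph) -> list[str]:
    """Turn each node, attribute, and edge into one yes/no rubric question."""
    questions = []
    for name, attrs in graph.objects.items():
        questions.append(f"Does the description mention a {name}?")
        for attr in attrs:
            questions.append(f"Does the description say the {name} is {attr}?")
    for r in graph.relations:
        questions.append(
            f"Does the description say the {r.subject} is {r.predicate} the {r.obj}?"
        )
    return questions


def coverage(questions: list[str], judge) -> float:
    """Fraction of rubric questions the judge answers 'yes' to."""
    answers = [bool(judge(q)) for q in questions]
    return sum(answers) / len(answers) if answers else 0.0


# Reference graph for an imagined painting (illustrative data only).
graph = SceneGraph(
    objects={"woman": ["seated"], "lute": []},
    relations=[Relation("woman", "holding", "lute")],
)
questions = rubric_questions(graph)

# Stub judge: pretends the candidate description never mentions the lute.
stub_judge = lambda question: "lute" not in question
score = coverage(questions, stub_judge)
```

The point of the structure is that the judge answers many small, checkable questions instead of producing one holistic score, which makes it clear which details a description missed.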
Key Takeaways
- →The PoSh metric uses scene graphs as structured rubrics to guide LLM evaluation of detailed image descriptions, outperforming existing metrics, including GPT-4o used as a judge.
- →DOCENT benchmark contains artwork paired with expert-written references and human quality judgments from art history students.
- →PoSh's Spearman correlation with human judgments is 0.05 higher than that of the best open-weight alternatives.
- →Foundation models struggle with error-free coverage of images with rich scene dynamics, revealing limitations in current VLM capabilities.
- →The research enables advances in assistive text generation and establishes a demanding new task for measuring VLM progress.
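For readers unfamiliar with how a metric is compared against human judgments, the sketch below computes Spearman rank correlation using only the standard library. The scores are made-up illustrative numbers, not results from the paper, and `metric_a` and `metric_b` are hypothetical metrics.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Illustrative data: human quality ratings for five descriptions,
# and the scores two hypothetical metrics assigned to the same five.
human = [1, 2, 3, 4, 5]
metric_a = [0.10, 0.35, 0.30, 0.70, 0.90]
metric_b = [0.20, 0.60, 0.10, 0.50, 0.40]

rho_a = spearman(human, metric_a)  # close to 0.9: ordering mostly agrees
rho_b = spearman(human, metric_b)  # close to 0.1: ordering mostly disagrees
```

Because Spearman correlation depends only on rank order, a metric is rewarded for ordering descriptions the way humans do, regardless of the scale of its raw scores.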
#vision-language-models #llm-evaluation #scene-graphs #benchmark-dataset #image-description #vlm-performance #ai-evaluation