CANVAS: Captioning Art with Narrative Visual-Audio AI Systems
CANVAS is an automated AI system that generates rich, multi-sensory art descriptions and synchronized audio narration for museum collections and digital art, addressing accessibility gaps for blind and low-vision audiences. The system processes images through large language models and text-to-speech services via Zapier, producing detailed captions faster and cheaper than human alternatives while demonstrating superior lexical diversity compared to baseline alt-text.
The CANVAS system addresses a genuine accessibility crisis in cultural institutions. Museums and digital art platforms typically rely on minimal alt-text that fails to capture sensory depth, spatial relationships, or emotional resonance, which fundamentally limits how blind and low-vision audiences experience visual art. This research demonstrates that automation can scale solutions to a problem that has persisted partly because manual human captioning is slow and expensive.
The technical execution combines established AI components—large language models for narrative generation and text-to-speech systems—through workflow automation, making this a practical application rather than a novel algorithmic breakthrough. The quantitative validation across 50 artworks provides empirical evidence that AI-generated descriptions outperform baseline captions on meaningful metrics including lexical diversity and adjective density while maintaining comparable readability.
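To make the two metrics named above concrete, here is a minimal sketch of how lexical diversity and adjective density could be computed for a caption. It assumes lexical diversity is measured as a type-token ratio and that adjectives are identified from a caller-supplied word set; the source does not state which formulas or part-of-speech tagger the study used, so the function names and definitions here are illustrative only.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Lexical diversity as unique tokens divided by total tokens (assumed definition)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def adjective_density(text, adjectives):
    """Share of tokens found in `adjectives`, a hypothetical caller-supplied lexicon.

    A real evaluation would more likely use a part-of-speech tagger.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(token in adjectives for token in tokens) / len(tokens)


baseline = "A painting of a woman."
generated = "A luminous portrait of a seated woman in deep crimson velvet."
lexicon = {"luminous", "seated", "deep", "crimson"}

print(type_token_ratio(baseline), type_token_ratio(generated))
print(adjective_density(baseline, lexicon), adjective_density(generated, lexicon))
```

Type-token ratio is sensitive to text length, so comparing short alt-text against longer narrative descriptions with this measure alone can be misleading; length-normalized variants exist for that reason.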
The efficiency metrics are striking: at under 20 seconds per image and less than $0.05 per artwork, large-scale captioning becomes economically feasible for institutions. This enables rapid retrospective captioning of existing collections and real-time accessibility for new acquisitions. The implications extend beyond museums into educational platforms, e-commerce product imagery, and any visual-heavy digital environment where accessibility remains underprioritized.
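As a rough illustration of what these per-image figures imply at collection scale, the sketch below multiplies them out. It treats the reported 20 seconds and $0.05 as upper bounds and assumes images are processed one at a time; the function name and the 10,000-artwork collection size are hypothetical and not taken from the source.

```python
def estimate_batch(n_artworks, seconds_per_image=20.0, cost_per_image=0.05):
    """Upper-bound time and cost for captioning a collection sequentially.

    Defaults are the reported per-image ceilings; real runs may be faster or cheaper.
    """
    total_seconds = n_artworks * seconds_per_image
    return {
        "hours": total_seconds / 3600,
        "usd": n_artworks * cost_per_image,
    }


# Hypothetical collection of 10,000 artworks.
estimate = estimate_batch(10_000)
print(f"at most {estimate['hours']:.1f} hours and ${estimate['usd']:.2f}")
```

For the hypothetical 10,000-artwork collection this works out to no more than about 55.6 hours of processing and $500, before any human review of the generated descriptions.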
The research roadmap appropriately emphasizes user testing with actual blind and low-vision (BLV) participants as the next critical phase. Current evaluation focuses on computational metrics rather than comprehension or user preference, gaps that could reveal whether narrative richness improves accessibility or introduces unnecessary complexity. Market adoption depends on whether institutions perceive accessibility as mission-critical rather than as a compliance afterthought.
- AI-generated art descriptions achieve higher lexical diversity and narrative detail than human-written baseline captions while costing under $0.05 per image
- The automated pipeline reduces production time to under 20 seconds per artwork, enabling scalable accessibility for museum and digital art collections
- Current evaluation relies on computational metrics rather than user testing with blind and low-vision audiences, limiting insights into actual comprehension impact
- The system demonstrates that workflow automation through existing tools can address entrenched accessibility gaps in cultural institutions
- Future adoption depends on whether museums treat accessibility as a core mission rather than a compliance requirement