LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation
Researchers introduce LASA, a weak supervision method for open-vocabulary scene sketch semantic segmentation that aggregates attention maps from multiple Vision Transformer layers to capture complementary spatial cues. Without requiring pixel-level annotations, the approach improves on weakly supervised baselines by +3.43 to +15.74 mIoU across three benchmarks, a step forward for interpreting sparse line drawings.
LASA addresses a fundamental challenge in sketch understanding: the absence of the texture and color information that typically guides semantic segmentation in natural images. The research demonstrates that Vision Transformer layers contain hierarchically organized spatial information: shallow layers preserve global structural context, while deeper layers capture local details. By systematically aggregating these complementary representations, LASA obtains more robust structural cues than any single layer's features provide.
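The idea of fusing attention maps from several layers can be illustrated with a minimal sketch. This is not LASA's actual aggregation rule, which the summary does not specify. It assumes each layer's attention has already been reduced to a single 2D patch map, and it uses per-layer min-max normalization followed by a uniform weighted average as a stand-in. The function name `aggregate_layer_attention` and the 12-layer, 14x14 shapes are hypothetical.

```python
import numpy as np

def aggregate_layer_attention(attn_per_layer, weights=None):
    """Fuse per-layer attention maps into one map with values in [0, 1].

    attn_per_layer: array of shape (num_layers, H, W).
    weights: optional per-layer weights; uniform if omitted (an assumption,
             not the weighting used in the paper).
    """
    attn = np.asarray(attn_per_layer, dtype=np.float64)
    num_layers = attn.shape[0]
    flat = attn.reshape(num_layers, -1)

    # Min-max normalize each layer so no layer dominates purely by scale.
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    norm = (flat - lo) / np.maximum(hi - lo, 1e-12)

    if weights is None:
        weights = np.full(num_layers, 1.0 / num_layers)
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()

    # Weighted average across layers, then restore the spatial shape.
    fused = (weights[:, None] * norm).sum(axis=0)
    return fused.reshape(attn.shape[1:])

# Random stand-ins for attention maps from 12 layers over a 14x14 patch grid.
rng = np.random.default_rng(0)
layer_maps = rng.random((12, 14, 14))
fused = aggregate_layer_attention(layer_maps)
```

Passing non-uniform `weights` would let shallow or deep layers count for more, which is one simple way to explore how the complementary cues described above interact.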
The method builds on weak supervision principles, so training does not require expensive pixel-level annotations. This is practical progress toward scalable computer vision systems, particularly for applications that rely on sketch-based interfaces or rapid annotation workflows. The technical contribution, cross-layer attention aggregation, offers insights that may apply beyond sketch segmentation to other vision tasks where structural priors matter.
The experimental validation across three datasets (FS-COCO, SFSD, FrISS) shows consistent, substantial improvements: gains of +3.43 to +15.74 mIoU over weakly supervised baselines. These results suggest the approach generalizes across different sketch domains and difficulty levels. The commitment to open-source release enhances reproducibility and adoption potential within the computer vision community.
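For readers unfamiliar with the metric, mIoU (mean Intersection over Union) averages the per-class overlap between predicted and ground-truth label maps. The sketch below is a generic illustration, not the evaluation protocol of these benchmarks. The function name `mean_iou` and the toy label maps are hypothetical, and skipping classes that appear in neither map is one common convention assumed here.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes present in the prediction or the target."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x2 label maps with two classes.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
# Class 0 IoU = 1/2, class 1 IoU = 2/3, so the mean is 7/12.
```

Assuming the reported figures are percentage points, as is customary, a gain of +3.43 mIoU corresponds to raising this score by 0.0343 on the 0 to 1 scale.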
The research advances open-vocabulary segmentation, enabling systems to work with flexible category vocabularies at inference time without retraining. This flexibility addresses practical deployment scenarios where semantic categories may change. Although the work is primarily academic, such advances could support broader adoption of sketch-based interfaces in design tools, game development, and accessibility applications.
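A common way to get open-vocabulary behavior, sketched here under assumptions rather than as LASA's documented pipeline, is to compare visual embeddings against text embeddings of the category names and pick the most similar category. Changing the vocabulary then only means supplying different text embeddings. The function name `assign_labels` and the hand-written embeddings are hypothetical; in practice both sets would come from a vision-language encoder.

```python
import numpy as np

def assign_labels(patch_emb, text_emb):
    """Give each patch the index of its most similar category.

    patch_emb: (num_patches, dim) visual embeddings.
    text_emb:  (num_categories, dim) embeddings of category names.
    """
    # L2-normalize so the dot product equals cosine similarity.
    p = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    sim = p @ t.T  # shape (num_patches, num_categories)
    return sim.argmax(axis=-1)

# Hand-written stand-ins for embeddings of three category names.
categories = ["tree", "house", "sun"]
text_emb = np.eye(3)
patch_emb = np.array([[0.9, 0.1, 0.0],
                      [0.0, 0.2, 0.8]])
labels = assign_labels(patch_emb, text_emb)
names = [categories[i] for i in labels]  # ["tree", "sun"]
```

Because the category list enters only through `text_emb`, adding or replacing categories at inference time needs no retraining in this setup.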
- LASA aggregates multi-layer Vision Transformer attention to capture complementary spatial information for sketch segmentation without pixel-level annotations
- Cross-layer aggregation provides more robust structural priors than single-layer features, particularly important for texture-free sketch interpretation
- Experimental results show mIoU improvements of 3.43 to 15.74 across three sketch segmentation benchmarks compared to weakly supervised baselines
- The method enables open-vocabulary segmentation at inference time with flexible category vocabularies, improving practical deployment flexibility
- Publicly available source code facilitates adoption and reproducibility within the computer vision research community