AI | Neutral | Importance 6/10

LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

arXiv – CS AI | Liwen Yi, Xianlin Zhang, Yue Zhang, Yue Ming, Xueming Li
AI Summary

Researchers introduce LASA, a weakly supervised method for open-vocabulary scene sketch semantic segmentation that aggregates attention maps from multiple Vision Transformer layers to capture complementary spatial cues. Without requiring pixel-level annotations, it improves on weakly supervised baselines by +3.43 to +15.74 mIoU across three benchmarks.

Analysis

LASA addresses a fundamental challenge in sketch understanding: sketches lack the texture and color information that typically guides semantic segmentation in natural images. The research reports that Vision Transformer layers carry complementary spatial information: shallow layers preserve global structural context, while deeper layers capture local details. By aggregating these representations across layers, LASA obtains more robust spatial cues than any single layer provides.
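The idea of combining attention maps from several layers can be illustrated with a minimal sketch. It assumes each layer yields one 2D attention map of the same size and fuses them with a plain weighted average after min-max normalization. The function names, the weighting scheme, and the toy values are all hypothetical; the summary does not describe LASA's actual aggregation rule.

```python
from typing import List, Optional

Map = List[List[float]]  # one 2D attention map (rows x cols)


def normalize(attn: Map) -> Map:
    """Min-max scale a map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in attn]
    return [[(v - lo) / span for v in row] for row in attn]


def aggregate_layers(layers: List[Map],
                     weights: Optional[List[float]] = None) -> Map:
    """Weighted average of normalized per-layer maps.

    Assumption: a simple average stands in for whatever rule LASA uses.
    """
    if not layers:
        raise ValueError("need at least one layer map")
    if weights is None:
        weights = [1.0] * len(layers)
    if len(weights) != len(layers):
        raise ValueError("one weight per layer is required")
    total = sum(weights)
    normed = [normalize(m) for m in layers]
    rows, cols = len(normed[0]), len(normed[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, normed)) / total
         for c in range(cols)]
        for r in range(rows)
    ]


# Toy 2x2 maps standing in for a shallow and a deep layer.
shallow = [[0.0, 2.0], [4.0, 2.0]]
deep = [[1.0, 1.0], [3.0, 1.0]]
fused = aggregate_layers([shallow, deep])
print(fused)  # [[0.0, 0.25], [1.0, 0.25]]
```

Normalizing each map first keeps a layer with larger raw values from dominating the average, which is one reasonable design choice when layers operate at different scales.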

The method relies on weak supervision, so training does not require expensive pixel-level annotations. This is practical progress toward scalable computer vision systems, particularly for applications built on sketch-based interfaces or rapid annotation workflows. The technical contribution, cross-layer attention aggregation, may also apply to other vision tasks where structural priors matter.

Experiments on three datasets (FS-COCO, SFSD, FrISS) show consistent gains of +3.43 to +15.74 mIoU over weakly supervised baselines. These results suggest the approach generalizes across different sketch domains and difficulty levels. The commitment to open-source release enhances reproducibility and adoption potential within the computer vision community.
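For readers unfamiliar with the metric, mean Intersection over Union (mIoU) can be sketched as follows. This is a generic illustration on flattened label lists, not the paper's evaluation code; it assumes classes absent from both prediction and ground truth are skipped, and the exact protocol used in the experiments is not stated here. Assuming the reported gains are percentage points, +3.43 corresponds to 0.0343 on the 0 to 1 scale this function returns.

```python
from typing import List


def mean_iou(pred: List[int], target: List[int], num_classes: int) -> float:
    """Average per-class IoU over classes present in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither list
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
# Per-class IoU is 1/3, 2/3, and 1/2, so the mean is 0.5.
print(round(score, 4))  # 0.5
```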

The research advances open-vocabulary segmentation, in which the set of category names is supplied at inference time and can change without retraining. This flexibility addresses practical deployment scenarios where semantic categories may change. While primarily academic, such advances in annotation-efficient vision models support broader adoption of sketch-based interfaces in design tools, game development, and accessibility applications.
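A common way to realize an inference-time vocabulary is to compare a region's feature vector with one embedding per category name and pick the closest match. The sketch below shows only that matching step, using hand-written toy vectors. In practice the embeddings would come from a vision-language model; which encoder LASA uses is not stated here, and `assign_label` is a hypothetical helper.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_label(feature: List[float], vocab: Dict[str, List[float]]) -> str:
    """Return the category whose embedding is most similar to the feature."""
    return max(vocab, key=lambda name: cosine(feature, vocab[name]))


# Toy text embeddings (assumed values, for illustration only).
vocab = {"tree": [1.0, 0.0, 0.0], "house": [0.0, 1.0, 0.0]}
label_before = assign_label([0.1, 0.9, 0.2], vocab)

# Adding a category needs no retraining, only a new embedding.
vocab["cloud"] = [0.0, 0.0, 1.0]
label_after = assign_label([0.1, 0.2, 0.9], vocab)

print(label_before, label_after)  # house cloud
```

Because the vocabulary is just a dictionary of embeddings, changing the category set amounts to editing that dictionary.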

Key Takeaways
  • LASA aggregates multi-layer Vision Transformer attention to capture complementary spatial information for sketch segmentation without pixel-level annotations
  • Cross-layer aggregation provides more robust structural priors than single-layer features, particularly important for texture-free sketch interpretation
  • Experimental results show mIoU improvements of 3.43 to 15.74 points over weakly supervised baselines across three sketch segmentation benchmarks
  • The method supports open-vocabulary segmentation, so the category vocabulary can be chosen at inference time without retraining
  • Publicly available source code facilitates adoption and reproducibility within the computer vision research community