AIBullish · arXiv – CS AI · 9h ago · 7/10
🧠
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
Researchers introduce Qwen3-VL-Seg, an efficient vision-language model that converts bounding-box predictions into pixel-level segmentation masks for open-world referring segmentation. The framework adds only 17M parameters (a 0.4% overhead) while achieving strong performance on language-intensive visual grounding across both in-distribution and out-of-distribution benchmarks.
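A quick sanity check on the reported numbers, as a minimal sketch: the 0.4% figure implies a backbone of roughly 17M / 0.004 ≈ 4.25B parameters. The helper below also counts parameters for a hypothetical lightweight conv mask head (the layer widths are illustrative assumptions, not from the paper, which reports only the 17M / 0.4% totals).

```python
# Hedged sketch: parameter accounting for a lightweight box-to-mask head
# added to a frozen VLM backbone. All layer sizes here are hypothetical;
# the summary gives only the totals (17M added parameters, 0.4% overhead).

def conv_params(c_in, c_out, k=3):
    """Parameter count of a k x k convolution with bias."""
    return c_in * c_out * k * k + c_out

# Hypothetical head: a few convs that decode box-pooled features into a
# per-pixel mask. Channel widths chosen for illustration only.
head = (
    conv_params(1024, 512)
    + conv_params(512, 256)
    + conv_params(256, 128)
    + conv_params(128, 1, k=1)
)

added = 17_000_000            # reported added parameters
overhead = 0.004              # reported 0.4% overhead
backbone = added / overhead   # implied backbone size (assumption: overhead = added / backbone)

print(f"sketch head: {head / 1e6:.1f}M params")
print(f"implied backbone: {backbone / 1e9:.2f}B params")
```

The takeaway is the scale relationship: a head this small is negligible next to a multi-billion-parameter backbone, which is why such adapters can be trained cheaply while the VLM stays frozen.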