FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models
Researchers introduce FedMPT, a novel federated learning method for multi-label recognition in vision-language models that addresses overfitting to spurious label correlations in decentralized settings. The approach uses causal modeling, LLM-driven condition analysis, and optimal transport mechanisms to improve model robustness when adapting to clients with heterogeneous private data.
FedMPT addresses a gap in federated learning applied to vision-language models (VLMs): multi-label recognition across decentralized networks. When VLMs adapt to individual clients with private, heterogeneous datasets, they tend to overfit to spurious correlations between labels, so categories are activated inappropriately on new samples. This is a real obstacle for privacy-preserving AI deployment in enterprise and collaborative settings.
The technical contribution employs causal modeling with front-door adjustment to decouple the recognition process through intermediate variables that amplify oracle label co-occurrence patterns. This theoretical framework distinguishes FedMPT from standard federated approaches by explicitly addressing label dependency structures rather than treating them as statistical noise. The integration of large language models to decipher underlying label dependencies adds a linguistic understanding layer absent in purely vision-based methods.
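For reference, the textbook front-door adjustment from causal inference can be written as follows. Reading X as the input image, M as the intermediate variable, and Y as the label set is an assumption made here for illustration; the source does not give FedMPT's exact formulation.

```latex
P(Y \mid do(X = x)) = \sum_{m} P(M = m \mid X = x) \sum_{x'} P(Y \mid M = m, X = x') \, P(X = x')
```

The inner sum averages over all inputs x' rather than conditioning on the observed x alone, which is what removes the confounded path between X and Y in the general front-door setting.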
The implementation combines three mechanisms: LLM-driven condition discovery; optimal transport between condition-enriched prompts and image patches, which yields region-level semantics; and a gating mechanism that synthesizes predictions across conditions. Together they cover the pipeline from identifying conditions, through aligning them with image regions, to fusing the per-condition outputs.
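The second and third mechanisms can be illustrated with a minimal NumPy sketch. It is not FedMPT's implementation: the Sinkhorn solver, cosine-distance cost, uniform marginals, softmax gate, and all function and parameter names (`sinkhorn_plan`, `condition_scores`, `gated_prediction`, `eps`, `gate_logits`) are assumptions chosen for illustration, and the prompt and patch embeddings are random stand-ins for encoder outputs.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport plan via Sinkhorn iterations.

    Assumption: uniform mass on both sides (rows = label prompts,
    columns = image patches).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def condition_scores(prompts, patches, eps=0.1):
    """Score every label under every condition.

    prompts: (C, L, D) condition-enriched prompt embeddings
             for C conditions and L labels (hypothetical layout).
    patches: (P, D) image patch embeddings.
    Returns a (C, L) array of transport-weighted cosine similarities.
    """
    patches = l2_normalize(patches)
    n_conditions, n_labels, _ = prompts.shape
    scores = np.empty((n_conditions, n_labels))
    for c in range(n_conditions):
        sim = l2_normalize(prompts[c]) @ patches.T   # (L, P) cosine similarity
        plan = sinkhorn_plan(1.0 - sim, eps)         # cost = cosine distance
        # Each label's score is the similarity averaged over the patches
        # that the plan assigns to it (region-level evidence).
        scores[c] = (plan * sim).sum(axis=1) / plan.sum(axis=1)
    return scores


def gated_prediction(scores, gate_logits):
    """Fuse per-condition scores with a softmax gate (an assumed gate form).

    Returns per-label probabilities of shape (L,) and gate weights of shape (C,).
    """
    weights = np.exp(gate_logits - gate_logits.max())
    weights /= weights.sum()
    fused = weights @ scores
    probs = 1.0 / (1.0 + np.exp(-fused))  # independent sigmoid per label
    return probs, weights


# Usage with random stand-in embeddings: 3 conditions, 5 labels, 16 patches.
rng = np.random.default_rng(0)
demo_prompts = rng.normal(size=(3, 5, 8))
demo_patches = rng.normal(size=(16, 8))
demo_probs, demo_weights = gated_prediction(
    condition_scores(demo_prompts, demo_patches), gate_logits=np.zeros(3)
)
```

In a real system the gate logits would presumably depend on the image and be learned; here they are passed in directly to keep the sketch self-contained.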
From an industry perspective, this work matters for organizations deploying federated learning in domains requiring granular classification—medical imaging, autonomous systems, and content moderation. The method's competitive benchmark results suggest practical viability, though real-world federated deployment introduces additional complexities around communication efficiency and heterogeneity that warrant further investigation. The research advances foundational techniques for privacy-preserving, robust AI systems.
- FedMPT is the first federated learning method specifically designed for multi-label recognition in vision-language models
- The approach uses causal modeling to prevent overfitting to spurious label correlations in decentralized settings
- LLM-driven analysis uncovers underlying conditions governing label dependencies across heterogeneous client data
- Optimal transport mechanisms connect condition-enriched prompts with image patches to extract region-level semantics
- Benchmark results demonstrate competitive performance, advancing practical federated learning for complex recognition tasks