PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation
The PSK team placed second in SemEval-2026's multilingual polarization detection task by fine-tuning Gemma models with synthetic data augmentation across 22 languages. Their ensemble of LoRA-adapted 12B- and 27B-parameter models, trained partly on LLM-generated data, achieved a mean macro-F1 of 0.811, demonstrating the effectiveness of synthetic data strategies and per-language optimization for multilingual NLP tasks.
The PSK team's approach to multilingual polarization detection represents a practical advancement in how large language models can be adapted for specialized classification tasks across diverse linguistic contexts. Rather than pursuing novel architectures, the researchers focused on systematic optimization: fine-tuning separate models per language using LoRA, a parameter-efficient technique that reduces computational overhead while maintaining performance. This methodological choice reflects a broader industry trend toward pragmatic improvements over architectural innovation.
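The parameter-efficiency argument behind LoRA can be made concrete with a minimal numerical sketch (illustrative only, not the team's code; dimensions and scaling are assumptions). Instead of updating a full weight matrix W, LoRA trains two small low-rank factors A and B, and the effective weight becomes W + (alpha/r)·BA:

```python
import numpy as np

# Minimal LoRA sketch: only the low-rank factors A and B are trainable,
# so r * (d_in + d_out) parameters replace d_in * d_out.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 8, 16   # hypothetical sizes; r << d_in

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, init 0

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Forward pass with the low-rank update folded into the weight."""
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.normal(size=(4, d_in))
# With B initialised to zero the adapter is a no-op at step 0, so the
# adapted model starts exactly at the pretrained behaviour.
assert np.allclose(lora_forward(x), x @ W.T)

trainable = A.size + B.size   # 8,192 parameters
full = W.size                 # 262,144 parameters
```

Here a rank-8 adapter trains roughly 3% of the parameters a full fine-tune of this layer would touch, which is what makes maintaining 22 separate per-language models tractable.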
Synthetic data generation emerged as a critical differentiator in their system. By employing three distinct generation strategies through GPT-4o-mini and implementing rigorous quality filtering with embedding-based deduplication, the team addressed a fundamental challenge in multilingual NLP: data scarcity for underrepresented languages. This approach yields immediate practical value for organizations building polarization detection systems, as it demonstrates replicable methods for bootstrapping training datasets without extensive manual annotation.
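The embedding-based deduplication step can be sketched as a greedy cosine-similarity filter (the paper reports the technique but not the embedding model or cutoff; the threshold `tau` here is an assumption):

```python
import numpy as np

def dedup(embeddings: np.ndarray, tau: float = 0.9) -> list[int]:
    """Greedily keep an example only if its cosine similarity to every
    already-kept example stays below tau."""
    # L2-normalise so the dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    kept: list[int] = []
    for i, vec in enumerate(unit):
        if all(unit[j] @ vec < tau for j in kept):
            kept.append(i)
    return kept

# Toy check: the third vector nearly duplicates the first and is dropped.
embs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.999, 0.04],
])
kept = dedup(embs)   # -> [0, 1]
```

Filtering like this matters for synthetic data in particular, since LLM generators tend to produce many near-paraphrases that inflate dataset size without adding signal.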
The finding that decision-threshold tuning alone produced 2-4% F1 improvements, with no retraining, reveals an often-overlooked optimization lever in production systems. More concerning was the 30-50% test-set performance drop suffered by alternative architectures (XLM-RoBERTa and Qwen3) despite strong development performance, indicating that model selection for multilingual tasks requires validation beyond standard dev-set benchmarks. This generalization gap suggests that architectural choices carry hidden risks when deploying across many languages simultaneously.
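Threshold tuning of this kind amounts to sweeping the decision cutoff on a held-out dev set and keeping whichever value maximizes F1. A minimal sketch (the grid, toy data, and F1 implementation are illustrative, not the team's setup):

```python
import numpy as np

def f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Binary F1 from true/predicted 0-1 labels."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_threshold(probs: np.ndarray, y_true: np.ndarray) -> tuple[float, float]:
    """Sweep a grid of cutoffs on dev probabilities; return (best_t, best_f1)."""
    grid = np.linspace(0.05, 0.95, 19)
    scores = [(f1(y_true, (probs >= t).astype(int)), float(t)) for t in grid]
    best_f1, best_t = max(scores)
    return best_t, best_f1

# Toy dev set where the model is systematically under-confident on positives:
probs = np.array([0.36, 0.42, 0.48, 0.55, 0.20, 0.30])
y_true = np.array([1, 1, 1, 1, 0, 0])
best_t, best_score = tune_threshold(probs, y_true)
# Here a cutoff near 0.35 separates the classes perfectly, while the
# default 0.5 cutoff misses three of the four positives (F1 = 0.4).
```

Because the sweep only touches saved probabilities, it is essentially free at deployment time, which is why per-language thresholds are such a cheap win relative to retraining 22 models.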
For AI practitioners, the results validate Gemma's suitability for specialized multilingual classification while highlighting the importance of per-language customization. The team ranked first in three languages and top-three in eight others, yet still finished second overall: evidence that consistency across linguistic diversity remains the harder problem, and a critical consideration for organizations deploying multilingual content moderation or polarization detection systems at scale.
- Synthetic data augmentation with quality filtering proved essential for improving multilingual polarization detection across 22 languages.
- Per-language model fine-tuning with LoRA and threshold tuning delivered 2-4% F1 improvements without requiring complete retraining.
- Gemma models (12B and 27B) demonstrated superior generalization compared to XLM-RoBERTa and Qwen3, which suffered 30-50% performance drops on test data.
- Ensemble methods combining models of different sizes with language-specific strategy selection achieved competitive second-place results.
- Alternative architectures showing strong development performance can fail significantly on test sets, emphasizing validation rigor for multilingual deployments.
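The "language-specific strategy selection" in the ensemble can be pictured as picking, per language, whichever combination of the 12B and 27B models scores best on the dev split. The strategies and dev scores below are hypothetical illustrations, not figures from the paper:

```python
import numpy as np
from typing import Callable

# Candidate ensemble strategies over the two models' class probabilities.
Strategy = Callable[[np.ndarray, np.ndarray], np.ndarray]
STRATEGIES: dict[str, Strategy] = {
    "12b": lambda p12, p27: p12,              # 12B model alone
    "27b": lambda p12, p27: p27,              # 27B model alone
    "avg": lambda p12, p27: (p12 + p27) / 2,  # probability averaging
}

def pick_strategy(dev_scores: dict[str, float]) -> str:
    """Choose the strategy with the best dev macro-F1 for one language."""
    return max(dev_scores, key=dev_scores.get)

# Hypothetical dev macro-F1 per strategy for two languages:
dev = {
    "en": {"12b": 0.78, "27b": 0.81, "avg": 0.83},
    "tr": {"12b": 0.74, "27b": 0.72, "avg": 0.73},
}
chosen = {lang: pick_strategy(scores) for lang, scores in dev.items()}
# -> {'en': 'avg', 'tr': '12b'}: averaging wins for one language,
#    while the smaller model alone wins for another.
```

The design point is that a single global ensemble rule is rarely optimal across 22 languages; selecting per language costs only one extra dev-set evaluation per strategy.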