PSK@EEUCA 2026: Fine-Tuning Large Language Models with Synthetic Data Augmentation for Multi-Class Toxicity Detection in Gaming Chat
Researchers developed a toxicity detection system for gaming chat using a fine-tuned Llama 3.1 model with synthetic data augmentation, placing 4th in the EEUCA 2026 shared task. The system classifies messages into six toxicity categories, and the work reveals a critical "validation trap" phenomenon in which high validation performance does not translate into strong test-set generalization.
This research addresses a pressing challenge in online community moderation by advancing machine learning techniques for detecting toxic behavior across multiple severity levels. The team's approach combines an instruction-tuned large language model with LoRA fine-tuning and synthetic data augmentation, showing that augmenting the training set with roughly 5% synthetic examples can significantly improve performance without inducing overfitting. The F1-macro score of 0.6234 reflects the inherent difficulty of multi-class toxicity classification, where distinguishing closely related categories such as "Other Offensive" and "Insults/Flaming" remains challenging.
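A minimal sketch of this recipe, assuming a Hugging Face stack (`datasets`, `transformers`, `peft`) and framing the task as sequence classification; the file names, label handling, LoRA targets, and hyperparameters are illustrative assumptions rather than the authors' exact configuration, and the "5%" is read here as synthetic examples making up roughly 5% of the final training mix.

```python
# Sketch: mix ~5% synthetic examples into the training split, then LoRA-tune
# Llama 3.1 8B Instruct with a sequence-classification head over six labels.
# File paths, column names ("text", "label"), and hyperparameters are assumptions.
from datasets import concatenate_datasets, load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

NUM_LABELS = 6  # six toxicity categories in the shared task

real = load_dataset("json", data_files="train_real.jsonl", split="train")
synthetic = load_dataset("json", data_files="train_synthetic.jsonl", split="train")

# Keep synthetic data at ~5% of the final training mix (one reading of
# "augmentation at 5%"): synth = real * 0.05 / 0.95.
n_synth = int(len(real) * 0.05 / 0.95)
synthetic = synthetic.shuffle(seed=42).select(range(min(n_synth, len(synthetic))))
train = concatenate_datasets([real, synthetic]).shuffle(seed=42)

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train = train.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=NUM_LABELS
)
model.config.pad_token_id = tokenizer.pad_token_id

# LoRA adapters on the attention projections; rank and targets are assumptions.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="toxicity-lora",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        logging_steps=50,
    ),
    train_dataset=train,
    tokenizer=tokenizer,
)
trainer.train()
```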
The broader context is escalating toxicity in gaming communities, which platform operators struggle to moderate manually at scale. This research contributes to the automated moderation infrastructure that gaming platforms and esports organizations increasingly require. The "validation trap" finding is particularly important: it suggests that standard cross-validation approaches may mislead practitioners into selecting models that generalize poorly, which has consequences for how future toxicity detection systems are evaluated.
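The gap itself is straightforward to surface once predictions exist on both splits; the short sketch below uses scikit-learn's `f1_score` with macro averaging (the labels and predictions are placeholders, not the shared-task data) to show the comparison that exposes a validation trap.

```python
# Sketch: the "validation trap" check in metric form. A model selected on
# validation F1-macro is re-scored on the held-out test set; a large drop
# signals poor generalization. All labels/predictions below are placeholders.
from sklearn.metrics import f1_score

def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over the six toxicity categories."""
    return f1_score(y_true, y_pred, average="macro")

# Hypothetical label ids (0..5) for illustration only.
val_true, val_pred = [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]
test_true, test_pred = [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 0, 1]

val_f1 = f1_macro(val_true, val_pred)
test_f1 = f1_macro(test_true, test_pred)
print(f"validation F1-macro: {val_f1:.4f}")
print(f"test F1-macro:       {test_f1:.4f}")
print(f"generalization gap:  {val_f1 - test_f1:.4f}")
```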
For the gaming and platform moderation industry, improved toxicity detection enables better user experiences and community health management. The insight that synthetic data augmentation requires careful calibration challenges the assumption that more training data universally improves performance. This methodology could influence how content moderation AI is developed across gaming, social media, and online communities.
Future work should investigate why the validation-test performance gap arises and whether the same pattern emerges in other multi-class detection tasks beyond toxicity. The research also highlights the need for domain-specific evaluation metrics that better capture real-world moderation priorities.
- Llama 3.1 8B with 5% synthetic data augmentation achieved 4th place in multi-class toxicity detection with an F1-macro of 0.6234
- A critical "validation trap" phenomenon reveals that high validation performance doesn't guarantee strong test set generalization
- Six-category toxicity classification in gaming chat remains challenging due to subtle distinctions between offensive message types
- Careful calibration of synthetic data augmentation is essential to avoid overfitting and improve model robustness
- Findings have implications for deploying toxicity detection systems across gaming platforms and online communities