y0news

#toxicity-detection News & Analysis

7 articles tagged with #toxicity-detection. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

Researchers introduce Meow2X and TRNE, two novel frameworks that identify and suppress toxicity in large language models by localizing harmful content to specific neural layers and neurons, then neutralizing it through inference-time adjustments without retraining. The approach demonstrates consistent toxicity reduction across multiple models while preserving language quality, revealing that early MLP layers disproportionately encode toxic behavior.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement

Researchers introduce DCR (Discernment via Contrastive Refinement), a new method to reduce over-refusal in safety-aligned large language models. The approach helps LLMs better distinguish between genuinely toxic and seemingly toxic prompts, maintaining safety while improving helpfulness without degrading general capabilities.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Researchers introduce Opir, a family of efficient encoder-based safety classification models designed to detect toxic content, jailbreaks, and harmful prompts in LLM applications without requiring expensive large guardrail models. The models achieve competitive performance across 12 safety tasks against eight contemporary systems while maintaining significantly smaller deployment footprints, with edge variants containing fewer than 100M parameters.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Obfuscation Rules for Detecting and Detoxifying Korean Toxicity

Researchers introduce KOTOX, the first Korean-language dataset for detecting and neutralizing obfuscated toxic content in language models. The dataset addresses a critical gap by providing paired examples of normal, toxic, and obfuscated text, leveraging Korean's unique linguistic properties like agglutination and orthographic variation that enable easy toxicity disguise.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Cyberbullying Governance on Social Media: A Unified Framework from Content Identification to Intervention

Researchers propose a unified framework for cyberbullying governance on social media that moves beyond isolated content detection to integrated, continuous moderation across four interconnected stages: content identification, user behavior modeling, diffusion dynamics, and intervention strategies. The framework addresses critical gaps in existing approaches by accounting for user behavioral patterns, toxic event spread, and proactive mitigation rather than reactive detection alone.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

PSK@EEUCA 2026: Fine-Tuning Large Language Models with Synthetic Data Augmentation for Multi-Class Toxicity Detection in Gaming Chat

Researchers developed a toxicity detection system for gaming chat using fine-tuned Llama 3.1 with synthetic data augmentation, achieving 4th place in the EEUCA 2026 shared task. The system classifies messages into six toxicity categories and reveals a "validation trap": high validation performance did not translate into strong test-set generalization.

🧠 Llama

AI · Neutral · arXiv – CS AI · Mar 5 · 5/10

M-QUEST -- Meme Question-Understanding Evaluation on Semantics and Toxicity

Researchers developed M-QUEST, a new benchmark for evaluating AI models' ability to understand and detect toxicity in internet memes. The framework identifies 10 key dimensions for meme interpretation and tests 8 open-source language models, finding that instruction-tuned models perform better but still struggle with pragmatic inference.