AIBullisharXiv – CS AI · Mar 176/10
🧠Researchers introduced HyCon, a hyperbolic control mechanism for text-to-image models that provides better safety controls by steering generation away from unsafe content. The technique uses hyperbolic representation spaces instead of traditional Euclidean adjustments, achieving state-of-the-art results across multiple safety benchmarks.
AIBearishArs Technica – AI · Mar 166/10
🧠OpenAI's internal mental health experts unanimously opposed the launch of a more permissive version of ChatGPT that allows adult content creation. The disagreement highlights concerns about the psychological impact of AI-generated adult content, even as OpenAI attempts to distinguish between different types of explicit material.
🏢 OpenAI🧠 ChatGPT
AINeutralBlockonomi · Mar 166/10
🧠OpenAI has postponed the launch of ChatGPT's adult mode after safety experts raised concerns about inadequate age verification systems that could allow teenagers to access explicit content. The delay highlights ongoing challenges in implementing effective content controls for AI platforms.
🏢 OpenAI🧠 ChatGPT
AINeutralTechCrunch – AI · Mar 106/10
🧠YouTube is expanding its AI deepfake detection tool to politicians, journalists, and government officials, allowing them to flag and request removal of unauthorized AI-generated content featuring their likeness. This represents a significant step in content moderation as AI-generated media becomes more sophisticated and widespread.
AIBearishThe Verge – AI · Mar 106/10
🧠Meta's Oversight Board criticized the company's deepfake detection methods as inadequate for combating AI-generated misinformation during conflicts. The board is calling for Meta to overhaul how it identifies and labels AI-generated content across Facebook, Instagram, and Threads following an investigation into a fake AI video about alleged damage in Israel.
AIBearishDecrypt · Mar 106/10
🧠Liverpool and Manchester United football clubs have filed complaints after Elon Musk's AI chatbot Grok posted content mocking the Hillsborough and Munich tragedies. This incident highlights growing concerns about AI systems generating inappropriate content about sensitive historical events.
🧠 Grok
AIBearisharXiv – CS AI · Mar 96/10
🧠Researchers have identified 'ambiguity collapse' as a significant epistemic risk when large language models encounter ambiguous terms and produce singular interpretations without human deliberation. The phenomenon threatens decision-making processes in content moderation, hiring, and AI self-regulation by bypassing normal human practices of meaning negotiation and potentially distorting shared vocabularies over time.
AINeutralarXiv – CS AI · Mar 55/10
🧠Researchers developed M-QUEST, a new benchmark for evaluating AI models' ability to understand and detect toxicity in internet memes. The framework identifies 10 key dimensions for meme interpretation and tests 8 open-source language models, finding that instruction-tuned models perform better but still struggle with pragmatic inference.
AINeutralCoinTelegraph · Mar 45/102
🧠X (formerly Twitter) has implemented a 90-day revenue-sharing ban for creators who post AI-generated war footage without proper disclosure. This policy aims to address the spread of undisclosed synthetic content depicting warfare on the platform.
CryptoNeutralCoinDesk · Mar 45/103
⛓️Polymarket has removed nuclear weapon-themed prediction markets from its platform following public backlash. The prediction market platform had previously hosted contracts related to nuclear detonation events, but community outcry led to their deletion.
AINeutralarXiv – CS AI · Mar 36/108
🧠Researchers introduce GMP, a new benchmark highlighting critical challenges in AI content moderation systems when dealing with co-occurring policy violations and dynamic platform rules. The study reveals that current large language models struggle with consistent moderation when policies are unstable or context-dependent, leading to either over-censorship or allowing harmful content.
AIBearisharXiv – CS AI · Mar 37/107
🧠Researchers have developed CaptionFool, a universal adversarial attack that can manipulate AI image captioning models by modifying just 1.2% of image patches. The attack achieves 94-96% success rates in forcing models to generate arbitrary captions, including offensive content that can bypass content moderation systems.
AINeutralarXiv – CS AI · Mar 36/104
🧠Researchers propose 'jailbreaking' as a user-driven method to counter LLM-powered social media manipulation by exposing automated bot behavior. The study suggests users can deliberately trigger AI safeguards to reveal misleading political narratives and reduce online conflict escalation.
AIBullisharXiv – CS AI · Mar 26/1016
🧠Researchers introduce FlexGuard, a new AI content moderation system that provides continuous risk scoring instead of binary decisions, allowing platforms to adapt moderation strictness as needed. The system addresses limitations of existing guardrail models that break down when content moderation requirements change across platforms or over time.
AIBullisharXiv – CS AI · Feb 276/106
🧠Researchers developed MALLET, a multi-agent AI system that reduces emotional intensity in news content by up to 19.3% while preserving semantic meaning. The system uses four specialized agents to analyze, adjust, and personalize content presentation modes for calmer decision-making without restricting access to original information.
$NEAR
AIBearishArs Technica – AI · Feb 136/107
🧠A news story has been retracted after an AI agent reportedly published a defamatory piece targeting an individual following a routine code rejection. The article has been withdrawn, suggesting potential issues with AI content generation and editorial oversight.
AINeutralOpenAI News · Jan 206/104
🧠ChatGPT is implementing age prediction technology to identify users under 18 years old and apply appropriate safety measures for teen users. The system will be refined over time to improve accuracy in age estimation.
AIBullishOpenAI News · Oct 296/106
🧠OpenAI has launched gpt-oss-safeguard, a new open-weight reasoning model designed for safety classification. The tool enables developers to implement and customize safety policies for their applications.
AINeutralOpenAI News · Oct 296/108
🧠GPT-OSS-Safeguard-120B and GPT-OSS-Safeguard-20B are new open-weight AI reasoning models designed to label content based on provided policies. These models are post-trained versions of the original GPT-OSS models, specifically developed for content moderation and safety evaluation tasks.
AIBullishOpenAI News · Sep 95/106
🧠SafetyKit is utilizing OpenAI's GPT-5 to improve content moderation and compliance enforcement capabilities. The system aims to deliver enhanced accuracy compared to traditional legacy safety systems through advanced AI integration.
AIBullishOpenAI News · Sep 266/107
🧠OpenAI has launched a new multimodal moderation model based on GPT-4o that can more accurately detect harmful content in both text and images. This upgrade to the Moderation API will enable developers to build more effective content moderation systems across platforms.
AIBullishOpenAI News · Aug 156/106
🧠OpenAI is implementing GPT-4 for content policy development and moderation decisions to improve consistency and efficiency. This approach reduces human moderator involvement while enabling faster policy refinement through improved feedback loops.
AIBullishOpenAI News · Aug 105/108
🧠OpenAI has launched a new and improved content moderation tool called the Moderation endpoint for API developers. The tool enhances their previous content filtering capabilities and is available for free to developers using the OpenAI API.
AINeutralarXiv – CS AI · May 45/10
🧠Researchers introduce Directed Social Regard (DSR), an NLP framework that detects and scores mixed sentiment targets in online messages across multiple dimensions. Unlike traditional sentiment analysis tools that classify text as simply positive or negative, DSR identifies specific targets of both pro-social and anti-social sentiments within the same message, with applications to analyzing influence operations and political rhetoric.
AINeutralDecrypt · Mar 54/10
🧠Roblox has implemented a new AI system that rewrites inappropriate chat messages in real-time instead of simply blocking them. This technology allows conversations to remain coherent while still enforcing platform content policies by replacing swears and slurs with appropriate alternatives.