y0news

#red-teaming News & Analysis

20 articles tagged with #red-teaming. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

20 articles
AI · Bearish · arXiv – CS AI · 2d ago · 7/10

TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

Researchers introduce TEMPLATEFUZZ, a fuzzing framework that systematically exploits vulnerabilities in LLM chat templates, a previously overlooked attack surface. The method achieves 98.2% jailbreak success rates on open-source models and 90% on commercial LLMs, significantly outperforming existing prompt injection techniques while revealing critical security gaps in production AI systems.
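To make the attack surface concrete, template-level fuzzing can be sketched as mutating the control tokens that wrap a conversation rather than the prompt text itself. The template string and mutation set below are illustrative assumptions, not TEMPLATEFUZZ's actual corpus or mutators:

```python
import random

# Illustrative chat template (assumed; not TEMPLATEFUZZ's real target format).
BASE_TEMPLATE = "<|user|>\n{prompt}\n<|assistant|>\n"

# Fine-grained mutations act on the template markup, not the prompt text.
MUTATIONS = [
    lambda t: t.replace("<|user|>", "<|user|><|user|>"),      # duplicate role tag
    lambda t: t.replace("<|assistant|>", "\n<|assistant|>"),  # shift turn boundary
    lambda t: t.replace("\n", "\r\n"),                        # swap line endings
    lambda t: t.replace("<|", "<||"),                         # corrupt delimiter
]

def fuzz_template(template, rounds, seed=0):
    """Compose random template mutations, returning one variant per round."""
    rng = random.Random(seed)
    variants, current = [], template
    for _ in range(rounds):
        current = rng.choice(MUTATIONS)(current)
        variants.append(current)
    return variants

variants = fuzz_template(BASE_TEMPLATE, rounds=3)
for v in variants:
    print(repr(v.format(prompt="<query>")))
```

In a real harness each variant would be rendered around a candidate prompt and sent to the target model; the success-rate bookkeeping is omitted here.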

AI · Neutral · arXiv – CS AI · Apr 6 · 7/10

AgenticRed: Evolving Agentic Systems for Red-Teaming

AgenticRed introduces an automated red-teaming system that uses evolutionary algorithms and LLMs to autonomously design attack methods without human intervention. The system achieved near-perfect attack success rates across multiple AI models, including 100% success on GPT-5.1, DeepSeek-R1 and DeepSeek V3.2.

🧠 GPT-5 · 🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

Accelerating Suffix Jailbreak attacks with Prefix-Shared KV-cache

Researchers developed Prefix-Shared KV Cache (PSKV), a new technique that accelerates jailbreak attacks on Large Language Models by 40% while reducing memory usage by 50%. The method optimizes the red-teaming process by sharing cached prefixes across multiple attack attempts, enabling more efficient parallel inference without compromising attack success rates.
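The prefix-sharing idea can be sketched with a toy stand-in for the KV computation. The class below is a hypothetical illustration of the caching pattern, not the paper's PSKV implementation:

```python
def compute_kv(tokens):
    """Toy stand-in for the expensive forward pass that builds KV entries."""
    return [f"kv({t})" for t in tokens]

class PrefixSharedCache:
    """Compute the shared prompt prefix's KV entries once, reuse per suffix."""
    def __init__(self, prefix):
        self.prefix_kv = compute_kv(prefix)  # paid a single time
        self.suffix_passes = 0

    def score_suffix(self, suffix):
        # Only the candidate suffix needs a fresh pass.
        self.suffix_passes += 1
        return self.prefix_kv + compute_kv(suffix)

cache = PrefixSharedCache(("system", "prompt", "target"))
for sfx in [("adv1",), ("adv2",), ("adv3",)]:
    kv = cache.score_suffix(sfx)

print(cache.suffix_passes)  # 3 suffix passes, but only 1 prefix pass
```

With a real transformer the cached entries would be key/value tensors per layer, and the savings grow with prefix length and the number of parallel attack candidates.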

AI · Bearish · arXiv – CS AI · Mar 12 · 7/10

Amnesia: Adversarial Semantic Layer Specific Activation Steering in Large Language Models

Researchers have developed 'Amnesia,' a lightweight adversarial attack that bypasses safety mechanisms in open-weight Large Language Models by manipulating internal transformer states. The attack enables generation of harmful content without requiring fine-tuning or additional training, highlighting vulnerabilities in current LLM safety measures.
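Layer-specific activation steering of this kind can be sketched with plain lists standing in for hidden states. The layer index, steering vector, and scale below are illustrative assumptions, not the paper's values:

```python
def apply_steering(hidden_states, steering_vector, layer, alpha=1.0):
    """Add alpha * steering_vector to the activations of one layer only."""
    steered = []
    for i, layer_acts in enumerate(hidden_states):
        if i == layer:
            layer_acts = [a + alpha * s for a, s in zip(layer_acts, steering_vector)]
        steered.append(layer_acts)
    return steered

# Three "layers" of 4-dim activations (toy values).
acts = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
vec = [0.5, -0.5, 0.5, -0.5]
out = apply_steering(acts, vec, layer=1, alpha=2.0)
print(out[1])  # [2.0, 0.0, 2.0, 0.0]
```

The point of the sketch is that no weights change: only the forward-pass activations at one layer are shifted, which is why open-weight models are exposed to this class of attack.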

AI · Bearish · arXiv – CS AI · Mar 12 · 7/10

Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

Researchers developed a new framework for evaluating AI security risks specifically in banking and financial services, introducing the Risk-Adjusted Harm Score (RAHS) to measure severity of AI model failures. The study found that AI models become more vulnerable to security exploits during extended interactions, exposing critical weaknesses in current AI safety assessments for financial institutions.
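As a hedged sketch of risk-adjusted scoring: the category weights, escalation term, and formula below are hypothetical, intended only to show how a base harm score might be weighted for financial-domain severity and interaction depth. The paper's actual RAHS definition may differ:

```python
# Hypothetical severity multipliers for financial-services risk categories.
RISK_WEIGHTS = {
    "fraud": 3.0,
    "market_manipulation": 2.5,
    "privacy_leak": 2.0,
    "generic": 1.0,
}

def risk_adjusted_harm(base_harm, category, turn):
    """Scale a base harm score (0-1) by category weight and interaction depth."""
    weight = RISK_WEIGHTS.get(category, RISK_WEIGHTS["generic"])
    # Later turns are weighted up, reflecting the finding that models grow
    # more vulnerable over extended interactions (escalation rate is assumed).
    escalation = 1.0 + 0.25 * max(turn - 1, 0)
    return base_harm * weight * escalation

print(risk_adjusted_harm(0.5, "fraud", turn=1))  # 1.5
print(risk_adjusted_harm(0.5, "fraud", turn=5))  # 3.0
```

The same raw failure thus scores twice as high at turn 5 as at turn 1, matching the study's emphasis on extended-interaction vulnerability.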

AI · Bearish · arXiv – CS AI · Mar 17 · 6/10

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

A new study finds that the LLM judges used to evaluate model safety perform poorly when assessing adversarial attacks, often degrading to near-random accuracy. Analyzing 6,642 human-verified labels, the researchers found that many attacks inflate their reported success rates by exploiting judge weaknesses rather than producing genuinely harmful content.
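Checking a judge against human ground truth is straightforward to sketch; the labels below are toy data, not the study's 6,642-example set:

```python
def judge_accuracy(judge_labels, human_labels):
    """Fraction of examples where the judge agrees with human ground truth."""
    assert len(judge_labels) == len(human_labels)
    hits = sum(j == h for j, h in zip(judge_labels, human_labels))
    return hits / len(human_labels)

# Toy human-verified labels vs. an unreliable judge's verdicts.
human = ["harmful", "safe", "harmful", "safe", "harmful", "safe"]
judge = ["harmful", "harmful", "safe", "safe", "safe", "harmful"]

acc = judge_accuracy(judge, human)
print(acc)  # ~0.33, below the 0.5 a coin flip would average
```

Any attack whose "success" is scored by such a judge inherits this noise, which is the study's core concern about inflated attack success rates.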

AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

FERRET: Framework for Expansion Reliant Red Teaming

Researchers introduce FERRET, a new automated red teaming framework designed to generate multi-modal adversarial conversations to test AI model vulnerabilities. The framework uses three types of expansions (horizontal, vertical, and meta) to create more effective attack strategies and demonstrates superior performance compared to existing red teaming approaches.

AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Researchers developed ADVERSA, an automated red-teaming framework that measures how AI guardrails degrade over multiple conversation turns rather than single-prompt attacks. Testing on three frontier models revealed a 26.7% jailbreak rate, with successful attacks concentrated in early rounds rather than accumulating through sustained pressure.

🧠 GPT-5 · 🧠 Claude · 🧠 Opus
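The two metrics in the summary, overall jailbreak rate and where successes fall by round, can be sketched on toy transcript data. The records below are invented, not ADVERSA's:

```python
from collections import Counter

# Toy multi-turn results: each entry is (conversation_id, round, jailbroken).
results = [
    (1, 1, True),  (1, 2, False),
    (2, 1, False), (2, 2, True),
    (3, 1, False), (3, 2, False), (3, 3, False),
    (4, 1, True),
]

def jailbreak_rate(results):
    """Fraction of conversations with at least one successful round."""
    broken = {cid for cid, _, ok in results if ok}
    total = {cid for cid, _, _ in results}
    return len(broken) / len(total)

def successes_by_round(results):
    """Count successful attacks per conversation round."""
    return Counter(rnd for _, rnd, ok in results if ok)

print(jailbreak_rate(results))      # 0.75
print(successes_by_round(results))  # Counter({1: 2, 2: 1})
```

Grouping successes by round is what lets an evaluation distinguish "guardrails crack early" from "pressure accumulates over turns", the distinction the paper reports on.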
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution

Researchers introduce CEMMA, a co-evolutionary framework for improving AI safety alignment in multimodal large language models. The system uses evolving adversarial attacks and adaptive defenses to create more robust AI systems that better resist jailbreak attempts while maintaining functionality.

AI · Neutral · OpenAI News · Dec 22 · 6/10

Continuously hardening ChatGPT Atlas against prompt injection

OpenAI is implementing automated red teaming with reinforcement learning to protect ChatGPT Atlas from prompt injection attacks. This proactive security approach aims to discover and patch vulnerabilities early as AI systems become more autonomous and agentic.

AI · Neutral · OpenAI News · Feb 25 · 5/10

Deep research System Card

This report details safety measures implemented before releasing a deep research system, including external red teaming exercises and frontier risk evaluations. The work follows a structured Preparedness Framework and includes built-in mitigations to address identified key risk areas.

AI · Neutral · OpenAI News · Jan 31 · 6/10

OpenAI o3-mini System Card

OpenAI has released a system card detailing the safety work conducted for its new o3-mini model. The report covers safety evaluations, external red teaming exercises, and assessments under OpenAI's Preparedness Framework to ensure responsible deployment.

AI · Neutral · OpenAI News · Jan 23 · 6/10

Operator System Card

This document outlines a multi-layered AI safety framework based on OpenAI's established approaches, focusing on protections against prompt engineering, jailbreaks, privacy and security concerns. It details model and product mitigations, external red teaming efforts, safety evaluations, and ongoing refinement of safeguards.

AI · Neutral · OpenAI News · Dec 5 · 5/10

OpenAI o1 System Card

OpenAI has released a system card detailing the safety evaluation process for their o1 and o1-mini models. The report covers external red teaming exercises and frontier risk assessments conducted under their Preparedness Framework before the models' public release.

AI · Neutral · OpenAI News · Nov 21 · 5/10

Advancing red teaming with people and AI

The article discusses advances in red-teaming methodologies that pair human expertise with AI capabilities, a notable step for cybersecurity practice and AI safety testing.

AI · Neutral · OpenAI News · Aug 8 · 6/10

GPT-4o System Card

OpenAI released a system card detailing the comprehensive safety work conducted before launching GPT-4o, including external red team testing and frontier risk evaluations. The report covers safety mitigations built into the model to address key risk areas according to their Preparedness Framework.

AI · Neutral · OpenAI News · Sep 19 · 5/10

OpenAI Red Teaming Network

OpenAI has announced an open call for experts to join their Red Teaming Network, focusing on improving AI model safety. The initiative seeks domain experts to help identify vulnerabilities and enhance security measures for OpenAI's AI systems.

AI · Neutral · Hugging Face Blog · Feb 24 · 3/10

Red-Teaming Large Language Models

The article title suggests content about red-teaming large language models, which involves testing AI systems for vulnerabilities and potential risks. However, no article body content was provided for analysis.