#red-teaming News & Analysis

46 articles tagged with #red-teaming. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

46 articles

AIBearisharXiv – CS AI · Jun 237/10

🧠

TrojanGYM: A Detector-in-the-Loop LLM for Adaptive RTL Hardware Trojan Insertion

Researchers introduce TrojanGYM, an LLM-driven framework that automatically generates hardware Trojans to expose vulnerabilities in detection systems. The system demonstrates that existing detectors can be evaded at rates up to 83.33%, revealing critical gaps in hardware security testing methodologies.

🧠 GPT-4🧠 Gemini

AIBearisharXiv – CS AI · Jun 197/10

🧠

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

Researchers introduced NRT-Bench, a multi-turn red-teaming benchmark testing LLM agents in a simulated nuclear power plant control room. The study found that adaptive adversarial attacks succeeded in compromising critical safety functions in 8.7-12.1% of sessions across four frontier models, with vulnerabilities distributed unevenly across models rather than shared, raising concerns about LLM reliability in safety-critical deployments.

AINeutralarXiv – CS AI · Jun 97/10

🧠

Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis

Researchers demonstrate that direct translation of English LLM safety benchmarks into Asian languages significantly underestimates risks, with culturally-adapted prompts showing 9.3 percentage points higher attack success rates on average. The study reveals that translation-only approaches fail to capture cultural context, legal frameworks, and social norms critical for valid multilingual AI safety evaluation.

AIBearisharXiv – CS AI · Jun 97/10

🧠

PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

Researchers introduce PLAGUE, a framework for conducting multi-turn jailbreak attacks on Large Language Models through a three-phase approach (Primer, Planner, Finisher). The framework achieves unprecedented attack success rates of 81.4% on OpenAI's o3 and 67.3% on Claude's Opus 4.1, demonstrating significant vulnerabilities in models considered highly resistant to jailbreaking.

🏢 OpenAI🧠 Claude🧠 Opus

AIBearisharXiv – CS AI · Jun 87/10

🧠

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

Researchers demonstrate that AI agents using strategic attack selection—deciding when to initiate and abort attacks—significantly reduce the effectiveness of AI control safety evaluations. The study shows safety estimates drop by 20-28% at 1% audit budgets, suggesting current safety frameworks may overestimate protection against sophisticated attackers.

AIBearisharXiv – CS AI · Jun 87/10

🧠

EVA: Evolving Semantic Adversaries for Red-Teaming GUI Agents Against Environmental Injection Attacks

Researchers introduce EVA, an evolutionary framework that demonstrates GUI agents powered by multimodal language models are vulnerable to Environmental Injection Attacks through semantic deception rather than visual manipulation, achieving 85% attack success rates and revealing a critical security flaw in instruction-following alignment training.

AIBearisharXiv – CS AI · Jun 57/10

🧠

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

Researchers challenge the credibility of recent computer-using agent (CUA) red-teaming studies by reproducing published prompt-injection attacks against frontier models Claude Sonnet 4.6 and GPT-5.4, finding 0% success rates compared to reported 42-98% attack success rates in prior work. The analysis reveals that published high attack success rates depend on reinforcement-learning optimized injection text rather than fundamental attack categories, and that safety hardening is domain-specific to browser interfaces, not generalizable across CUA modalities.

🧠 GPT-5🧠 Claude🧠 Sonnet

AIBearisharXiv – CS AI · Jun 47/10

🧠

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

Researchers introduce MaskForge, a black-box attack method that exploits structural vulnerabilities in diffusion-based large language models (dLLMs) by leveraging their native masking capabilities. The technique achieves 79.3% average success rates across five models and transfers effectively to other benchmarks, demonstrating a significant security gap in an emerging class of language models distinct from standard autoregressive architectures.

AINeutralarXiv – CS AI · Jun 27/10

🧠

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

Researchers introduce AgentRedBench, a dynamic benchmark testing LLM agents against indirect prompt injection attacks through third-party SaaS integrations. The study reveals significant vulnerabilities across major AI models, with attack success rates up to 81%, while proposing AgentRedGuard, a specialized defense that reduces attacks to 2.4% with minimal false positives.

🏢 OpenAI🏢 Anthropic🧠 Claude

AIBearisharXiv – CS AI · Jun 27/10

🧠

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

Researchers developed a comprehensive red teaming framework to evaluate 11 major LLMs across 690 clinically grounded scenarios, revealing that aggregate accuracy scores mask critical safety failures in medical AI systems. The study found that high-performing models (scoring 0.97+) still exhibited complete failures in individual safety-critical cases, and equity-related tasks showed 10-20% error amplification with demographic modifications.

🧠 GPT-5🧠 Claude🧠 Opus

AIBullisharXiv – CS AI · Jun 27/10

🧠

Safety Alignment of LMs via Non-cooperative Games

Researchers introduce AdvGame, a new safety alignment method that frames language model defense as a non-zero-sum game between Attacker and Defender LMs trained jointly through reinforcement learning. The approach improves both safety and utility simultaneously by enabling continuous adversarial adaptation, with the resulting Attacker LM serving as a deployable red-teaming tool.

AIBearisharXiv – CS AI · May 297/10

🧠

SafeSearch: Automated Red-Teaming of LLM-Based Search Agents

Researchers introduce SafeSearch, an automated red-teaming framework that identifies critical vulnerabilities in LLM-based search agents by testing them against 300 adversarial cases spanning misinformation, prompt injection, and other risks. The study reveals that current search agents achieve attack success rates up to 90.5%, with common defenses like reminder prompting providing minimal protection.

🧠 GPT-4

AIBearisharXiv – CS AI · May 277/10

🧠

Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments

Researchers red-teamed ChatGPT and Claude Opus as TEE security advisors, finding both LLMs hallucinate mechanisms and overclaim guarantees in sensitive infrastructure guidance. The study demonstrates some failure patterns transfer across models (up to 12%) and proposes an 80.62% failure reduction through policy gating, retrieval grounding, and verification checks.

🧠 ChatGPT🧠 Claude

AIBullisharXiv – CS AI · May 127/10

🧠

The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

Researchers propose Anchored Bipolicy Self-Play, a new safety training method that addresses fundamental limitations in parameter-shared self-play red teaming by using distinct LoRA adapters for attacker and defender roles. The approach achieves 100x greater parameter efficiency and improved safety robustness across multiple language model scales without sacrificing reasoning ability.

AIBearisharXiv – CS AI · May 127/10

🧠

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Researchers present a comprehensive framework for systematically generating, categorizing, and evaluating jailbreak attacks against large language models, introducing a dataset of 114,000 adversarial prompts, automated generation methods, and a novel continuous evaluation metric (OPTIMUS) that surpasses binary success rate measurements.

🏢 Perplexity

AIBearisharXiv – CS AI · May 127/10

🧠

MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring

Researchers introduce MonitoringBench, a semi-automated red-teaming methodology that reveals significant gaps in AI agent monitoring systems. By decomposing attack generation into strategy, execution, and refinement stages, the team created 2,644 adversarial trajectories showing that frontier monitors claiming 94.9% catch rates actually perform at 60.3% against sophisticated attacks.

AIBearisharXiv – CS AI · May 97/10

🧠

LoopTrap: Termination Poisoning Attacks on LLM Agents

Researchers have identified a critical vulnerability in LLM agents called Termination Poisoning, where adversaries inject malicious prompts to trick agents into believing tasks are incomplete, causing unbounded computation. The LoopTrap framework demonstrates this attack across 8 mainstream LLM agents with up to 25x step amplification, revealing systematic behavioral patterns that enable scalable red-teaming.

AIBearisharXiv – CS AI · May 77/10

🧠

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Researchers introduce DecodingTrust-Agent Platform (DTap), a red-teaming framework designed to systematically test AI agent vulnerabilities across 14 real-world domains. The platform includes an autonomous red-teaming agent (DTap-Red) that discovers attack strategies and a benchmarking dataset, revealing critical security gaps in popular AI agents that could enable API key theft, unauthorized transactions, and data deletion.

AIBullisharXiv – CS AI · May 17/10

🧠

OpenAI o1 System Card

OpenAI released a system card detailing safety evaluations for its o1 model series, which uses reinforcement learning and chain-of-thought reasoning to improve model alignment and robustness. The report demonstrates state-of-the-art performance in resisting jailbreaks and unsafe outputs, while acknowledging that advanced reasoning capabilities introduce new safety challenges requiring rigorous stress-testing and risk management.

🏢 OpenAI🧠 o1

AIBearisharXiv – CS AI · Apr 207/10

🧠

When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models

Researchers introduce CREST-Search, a red-teaming framework that exposes vulnerabilities in web-augmented LLMs by crafting benign-seeming queries designed to trigger unsafe citations from the internet. The study reveals that integrating web search into language models creates new safety risks beyond traditional LLM harms, requiring specialized defensive strategies.

AIBearisharXiv – CS AI · Apr 157/10

🧠

TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

Researchers introduce TEMPLATEFUZZ, a fuzzing framework that systematically exploits vulnerabilities in LLM chat templates—a previously overlooked attack surface. The method achieves 98.2% jailbreak success rates on open-source models and 90% on commercial LLMs, significantly outperforming existing prompt injection techniques while revealing critical security gaps in production AI systems.

AINeutralarXiv – CS AI · Apr 67/10

🧠

AgenticRed: Evolving Agentic Systems for Red-Teaming

AgenticRed introduces an automated red-teaming system that uses evolutionary algorithms and LLMs to autonomously design attack methods without human intervention. The system achieved near-perfect attack success rates across multiple AI models, including 100% success on GPT-5.1, DeepSeek-R1 and DeepSeek V3.2.

🧠 GPT-5🧠 Llama

AINeutralarXiv – CS AI · Mar 177/10

🧠

Accelerating Suffix Jailbreak attacks with Prefix-Shared KV-cache

Researchers developed Prefix-Shared KV Cache (PSKV), a new technique that accelerates jailbreak attacks on Large Language Models by 40% while reducing memory usage by 50%. The method optimizes the red-teaming process by sharing cached prefixes across multiple attack attempts, enabling more efficient parallel inference without compromising attack success rates.

AIBearisharXiv – CS AI · Mar 127/10

🧠

Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

Researchers developed a new framework for evaluating AI security risks specifically in banking and financial services, introducing the Risk-Adjusted Harm Score (RAHS) to measure severity of AI model failures. The study found that AI models become more vulnerable to security exploits during extended interactions, exposing critical weaknesses in current AI safety assessments for financial institutions.

AIBearisharXiv – CS AI · Mar 127/10

🧠

Amnesia: Adversarial Semantic Layer Specific Activation Steering in Large Language Models

Researchers have developed 'Amnesia,' a lightweight adversarial attack that bypasses safety mechanisms in open-weight Large Language Models by manipulating internal transformer states. The attack enables generation of harmful content without requiring fine-tuning or additional training, highlighting vulnerabilities in current LLM safety measures.

Page 1 of 2Next →