20 articles tagged with #red-teaming. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AIBearisharXiv β CS AI Β· 2d ago7/10
π§ Researchers introduce TEMPLATEFUZZ, a fuzzing framework that systematically exploits vulnerabilities in LLM chat templatesβa previously overlooked attack surface. The method achieves 98.2% jailbreak success rates on open-source models and 90% on commercial LLMs, significantly outperforming existing prompt injection techniques while revealing critical security gaps in production AI systems.
AINeutralarXiv β CS AI Β· Apr 67/10
π§ AgenticRed introduces an automated red-teaming system that uses evolutionary algorithms and LLMs to autonomously design attack methods without human intervention. The system achieved near-perfect attack success rates across multiple AI models, including 100% success on GPT-5.1, DeepSeek-R1 and DeepSeek V3.2.
π§ GPT-5π§ Llama
AINeutralarXiv β CS AI Β· Mar 177/10
π§ Researchers developed Prefix-Shared KV Cache (PSKV), a new technique that accelerates jailbreak attacks on Large Language Models by 40% while reducing memory usage by 50%. The method optimizes the red-teaming process by sharing cached prefixes across multiple attack attempts, enabling more efficient parallel inference without compromising attack success rates.
AIBearisharXiv β CS AI Β· Mar 127/10
π§ Researchers have developed 'Amnesia,' a lightweight adversarial attack that bypasses safety mechanisms in open-weight Large Language Models by manipulating internal transformer states. The attack enables generation of harmful content without requiring fine-tuning or additional training, highlighting vulnerabilities in current LLM safety measures.
AIBearisharXiv β CS AI Β· Mar 127/10
π§ Researchers developed a new framework for evaluating AI security risks specifically in banking and financial services, introducing the Risk-Adjusted Harm Score (RAHS) to measure severity of AI model failures. The study found that AI models become more vulnerable to security exploits during extended interactions, exposing critical weaknesses in current AI safety assessments for financial institutions.
AIBearisharXiv β CS AI Β· Mar 176/10
π§ A new research study reveals that AI judges used to evaluate the safety of large language models perform poorly when assessing adversarial attacks, often degrading to near-random accuracy. The research analyzed 6,642 human-verified labels and found that many attacks artificially inflate their success rates by exploiting judge weaknesses rather than generating genuinely harmful content.
AIBullisharXiv β CS AI Β· Mar 166/10
π§ Researchers developed Q-DIG, a red-teaming method that uses Quality Diversity techniques to identify diverse language instruction failures in Vision-Language-Action models for robotics. The approach generates adversarial prompts that expose vulnerabilities in robot behavior and improves task success rates when used for fine-tuning.
AINeutralarXiv β CS AI Β· Mar 126/10
π§ Researchers introduce FERRET, a new automated red teaming framework designed to generate multi-modal adversarial conversations to test AI model vulnerabilities. The framework uses three types of expansions (horizontal, vertical, and meta) to create more effective attack strategies and demonstrates superior performance compared to existing red teaming approaches.
AINeutralarXiv β CS AI Β· Mar 126/10
π§ Researchers developed ADVERSA, an automated red-teaming framework that measures how AI guardrails degrade over multiple conversation turns rather than single-prompt attacks. Testing on three frontier models revealed a 26.7% jailbreak rate, with successful attacks concentrated in early rounds rather than accumulating through sustained pressure.
π§ GPT-5π§ Claudeπ§ Opus
AIBullisharXiv β CS AI Β· Mar 36/105
π§ Researchers introduce CEMMA, a co-evolutionary framework for improving AI safety alignment in multimodal large language models. The system uses evolving adversarial attacks and adaptive defenses to create more robust AI systems that better resist jailbreak attempts while maintaining functionality.
AINeutralOpenAI News Β· Dec 226/105
π§ OpenAI is implementing automated red teaming with reinforcement learning to protect ChatGPT Atlas from prompt injection attacks. This proactive security approach aims to discover and patch vulnerabilities early as AI systems become more autonomous and agentic.
AINeutralOpenAI News Β· Feb 255/106
π§ This report details safety measures implemented before releasing a deep research system, including external red teaming exercises and frontier risk evaluations. The work follows a structured Preparedness Framework and includes built-in mitigations to address identified key risk areas.
AINeutralOpenAI News Β· Jan 316/104
π§ OpenAI has released a system card detailing the safety work conducted for its new o3-mini model. The report covers safety evaluations, external red teaming exercises, and assessments under OpenAI's Preparedness Framework to ensure responsible deployment.
AINeutralOpenAI News Β· Jan 236/107
π§ This document outlines a multi-layered AI safety framework based on OpenAI's established approaches, focusing on protections against prompt engineering, jailbreaks, privacy and security concerns. It details model and product mitigations, external red teaming efforts, safety evaluations, and ongoing refinement of safeguards.
AINeutralOpenAI News Β· Dec 55/105
π§ OpenAI has released a system card detailing the safety evaluation process for their o1 and o1-mini models. The report covers external red teaming exercises and frontier risk assessments conducted under their Preparedness Framework before the models' public release.
AINeutralOpenAI News Β· Nov 215/102
π§ The article discusses advancements in red teaming methodologies that combine human expertise with artificial intelligence capabilities. This represents a significant development in cybersecurity practices and AI safety testing approaches.
AINeutralOpenAI News Β· Aug 86/103
π§ OpenAI released a system card detailing the comprehensive safety work conducted before launching GPT-4o, including external red team testing and frontier risk evaluations. The report covers safety mitigations built into the model to address key risk areas according to their Preparedness Framework.
AINeutralOpenAI News Β· Sep 195/104
π§ OpenAI has announced an open call for experts to join their Red Teaming Network, focusing on improving AI model safety. The initiative seeks domain experts to help identify vulnerabilities and enhance security measures for OpenAI's AI systems.
AINeutralHugging Face Blog Β· Feb 243/104
π§ The article title suggests content about red-teaming large language models, which involves testing AI systems for vulnerabilities and potential risks. However, no article body content was provided for analysis.
AINeutralHugging Face Blog Β· Feb 231/103
π§ The article title suggests the introduction of a Red-Teaming Resistance Leaderboard, but no article body content was provided for analysis.