AI · Bearish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers introduce SafeSearch, an automated red-teaming framework that identifies critical vulnerabilities in LLM-based search agents by testing them against 300 adversarial cases spanning misinformation, prompt injection, and other risks. The study reveals that current search agents achieve attack success rates up to 90.5%, with common defenses like reminder prompting providing minimal protection.
🧠 GPT-4
AI · Bearish · arXiv – CS AI · 5d ago · 7/10
🧠Researchers red-teamed ChatGPT and Claude Opus as TEE security advisors, finding that both LLMs hallucinate mechanisms and overclaim guarantees in sensitive infrastructure guidance. The study shows that some failure patterns transfer across models (up to 12%) and proposes policy gating, retrieval grounding, and verification checks, which together reduce failures by 80.62%.
🧠 ChatGPT · 🧠 Claude
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers propose Anchored Bipolicy Self-Play, a new safety training method that addresses fundamental limitations in parameter-shared self-play red teaming by using distinct LoRA adapters for attacker and defender roles. The approach achieves 100x greater parameter efficiency and improved safety robustness across multiple language model scales without sacrificing reasoning ability.
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers present a comprehensive framework for systematically generating, categorizing, and evaluating jailbreak attacks against large language models, introducing a dataset of 114,000 adversarial prompts, automated generation methods, and a novel continuous evaluation metric (OPTIMUS) that surpasses binary success rate measurements.
🏢 Perplexity
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce MonitoringBench, a semi-automated red-teaming methodology that reveals significant gaps in AI agent monitoring systems. By decomposing attack generation into strategy, execution, and refinement stages, the team created 2,644 adversarial trajectories showing that frontier monitors claiming 94.9% catch rates actually perform at 60.3% against sophisticated attacks.
AI · Bearish · arXiv – CS AI · May 9 · 7/10
🧠Researchers have identified a critical vulnerability in LLM agents called Termination Poisoning, where adversaries inject malicious prompts to trick agents into believing tasks are incomplete, causing unbounded computation. The LoopTrap framework demonstrates this attack across 8 mainstream LLM agents with up to 25x step amplification, revealing systematic behavioral patterns that enable scalable red-teaming.
AI · Bearish · arXiv – CS AI · May 7 · 7/10
🧠Researchers introduce DecodingTrust-Agent Platform (DTap), a red-teaming framework designed to systematically test AI agent vulnerabilities across 14 real-world domains. The platform includes an autonomous red-teaming agent (DTap-Red) that discovers attack strategies and a benchmarking dataset, revealing critical security gaps in popular AI agents that could enable API key theft, unauthorized transactions, and data deletion.
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠OpenAI released a system card detailing safety evaluations for its o1 model series, which uses reinforcement learning and chain-of-thought reasoning to improve model alignment and robustness. The report demonstrates state-of-the-art performance in resisting jailbreaks and unsafe outputs, while acknowledging that advanced reasoning capabilities introduce new safety challenges requiring rigorous stress-testing and risk management.
🏢 OpenAI · 🧠 o1
AI · Bearish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers introduce CREST-Search, a red-teaming framework that exposes vulnerabilities in web-augmented LLMs by crafting benign-seeming queries designed to trigger unsafe citations from the internet. The study reveals that integrating web search into language models creates new safety risks beyond traditional LLM harms, requiring specialized defensive strategies.
AI · Bearish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce TEMPLATEFUZZ, a fuzzing framework that systematically exploits vulnerabilities in LLM chat templates—a previously overlooked attack surface. The method achieves 98.2% jailbreak success rates on open-source models and 90% on commercial LLMs, significantly outperforming existing prompt injection techniques while revealing critical security gaps in production AI systems.
AI · Neutral · arXiv – CS AI · Apr 6 · 7/10
🧠AgenticRed introduces an automated red-teaming system that uses evolutionary algorithms and LLMs to autonomously design attack methods without human intervention. The system achieved near-perfect attack success rates across multiple AI models, including 100% success on GPT-5.1, DeepSeek-R1 and DeepSeek V3.2.
🧠 GPT-5 · 🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers developed Prefix-Shared KV Cache (PSKV), a new technique that accelerates jailbreak attacks on Large Language Models by 40% while reducing memory usage by 50%. The method optimizes the red-teaming process by sharing cached prefixes across multiple attack attempts, enabling more efficient parallel inference without compromising attack success rates.
AI · Bearish · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers have developed 'Amnesia,' a lightweight adversarial attack that bypasses safety mechanisms in open-weight Large Language Models by manipulating internal transformer states. The attack enables generation of harmful content without requiring fine-tuning or additional training, highlighting vulnerabilities in current LLM safety measures.
AI · Bearish · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers developed a new framework for evaluating AI security risks specifically in banking and financial services, introducing the Risk-Adjusted Harm Score (RAHS) to measure severity of AI model failures. The study found that AI models become more vulnerable to security exploits during extended interactions, exposing critical weaknesses in current AI safety assessments for financial institutions.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠PersonaTeaming introduces a persona-driven approach to red-teaming generative AI systems, combining automated adversarial prompt generation with human-in-the-loop collaboration. The method outperforms existing automated approaches while enabling security researchers to leverage diverse perspectives and backgrounds to uncover AI model vulnerabilities more effectively.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers introduce AI-Control Games, a formal mathematical framework for evaluating the safety of deploying untrusted AI systems through red-teaming exercises modeled as multi-objective stochastic games. The work demonstrates applications to language model deployment protocols, particularly Trusted Monitoring systems, offering improvements over existing empirical safety evaluation methods.
AI · Bearish · arXiv – CS AI · Mar 17 · 6/10
🧠A new research study reveals that AI judges used to evaluate the safety of large language models perform poorly when assessing adversarial attacks, often degrading to near-random accuracy. The research analyzed 6,642 human-verified labels and found that many attacks artificially inflate their success rates by exploiting judge weaknesses rather than generating genuinely harmful content.
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers developed Q-DIG, a red-teaming method that uses Quality Diversity techniques to identify diverse language instruction failures in Vision-Language-Action models for robotics. The approach generates adversarial prompts that expose vulnerabilities in robot behavior and improves task success rates when used for fine-tuning.
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers introduce FERRET, a new automated red teaming framework designed to generate multi-modal adversarial conversations to test AI model vulnerabilities. The framework uses three types of expansions (horizontal, vertical, and meta) to create more effective attack strategies and demonstrates superior performance compared to existing red teaming approaches.
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers developed ADVERSA, an automated red-teaming framework that measures how AI guardrails degrade over multiple conversation turns rather than under single-prompt attacks. Testing on three frontier models revealed a 26.7% jailbreak rate, with successful attacks concentrated in early rounds rather than accumulating through sustained pressure.
🧠 GPT-5 · 🧠 Claude · 🧠 Opus
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 5
🧠Researchers introduce CEMMA, a co-evolutionary framework for improving AI safety alignment in multimodal large language models. The system uses evolving adversarial attacks and adaptive defenses to create more robust AI systems that better resist jailbreak attempts while maintaining functionality.
AI · Neutral · OpenAI News · Dec 22 · 6/10 · 5
🧠OpenAI is implementing automated red teaming with reinforcement learning to protect ChatGPT Atlas from prompt injection attacks. This proactive security approach aims to discover and patch vulnerabilities early as AI systems become more autonomous and agentic.
AI · Neutral · OpenAI News · Feb 25 · 5/10 · 6
🧠This report details safety measures implemented before releasing a deep research system, including external red teaming exercises and frontier risk evaluations. The work follows a structured Preparedness Framework and includes built-in mitigations to address identified key risk areas.
AI · Neutral · OpenAI News · Jan 31 · 6/10 · 4
🧠OpenAI has released a system card detailing the safety work conducted for its new o3-mini model. The report covers safety evaluations, external red teaming exercises, and assessments under OpenAI's Preparedness Framework to ensure responsible deployment.
AI · Neutral · OpenAI News · Jan 23 · 6/10 · 7
🧠This document outlines a multi-layered AI safety framework based on OpenAI's established approaches, focusing on protections against prompt engineering and jailbreaks as well as privacy and security concerns. It details model and product mitigations, external red teaming efforts, safety evaluations, and ongoing refinement of safeguards.