AI · Bearish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers introduce DiscourseFlip, a novel attack method against Retrieval-Augmented Generation (RAG) systems that manipulates opinions across multiple related queries by poisoning retrieval content at the discourse level. Unlike previous attacks targeting individual queries, this coordinated approach induces broader opinion shifts while evading detection, and existing defenses prove ineffective against it.
AI · Bearish · arXiv – CS AI · May 29 · 7/10
🧠Researchers present MemPoison, a novel attack that exploits vulnerabilities in large language model agents by injecting malicious information into their long-term memory through dialogue interactions. The attack achieves up to 95% success rates by using semantic bridges, entity masquerading, and embedding optimization to bypass modern selective memory mechanisms, revealing critical security gaps in autonomous AI systems.
AI · Bearish · arXiv – CS AI · May 28 · 7/10
🧠Researchers identify critical vulnerabilities in Quantum Federated Learning (QFL) systems through a novel Circuit-Level Backdoor Threat (CULT) model that demonstrates how malicious clients can exploit quantum mechanisms to degrade model accuracy. Existing defense mechanisms fail to fully prevent the attacks: accuracy drops by up to 50% even with popular mitigation strategies like Krum and FLGuardian in place.
AI · Bearish · arXiv – CS AI · May 28 · 7/10
🧠Researchers introduce MM-DeceptionBench, the first benchmark for evaluating deceptive behaviors in multimodal AI systems, and propose a novel "debate with images" detection method that significantly improves identification of deliberately misleading strategies that combine visual and textual elements.
🧠 GPT-4
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers propose TRACE, a credit assignment framework that improves multi-turn jailbreak attacks on large language models by identifying which dialogue turns actually contribute to harmful outcomes. The method achieves 25% higher attack success rates than existing approaches and can be repurposed to strengthen AI safety defenses.
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers demonstrate that malicious agents within multi-agent LLM consensus systems can effectively disrupt agreement formation through sophisticated insider attacks. Using reinforcement learning trained on surrogate world models, attackers significantly reduce consensus rates among benign agents, revealing a critical vulnerability in decentralized AI systems that assume participant alignment.
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers demonstrate that a "warden" LLM can effectively mitigate adversarial persuasion by monitoring human-AI interactions in real time and alerting users to manipulation attempts. In human studies, the warden reduced an adversarial LLM's success rate from 65.4% to 30.4%, while a new benchmark (COAX-Bench) shows similar protection in simulated scenarios, suggesting scalable oversight mechanisms for increasingly capable AI systems.
AI · Bullish · arXiv – CS AI · May 4 · 7/10
🧠Researchers introduce Sentra-Guard, a real-time defense system that detects and mitigates jailbreak and prompt injection attacks on large language models with 99.96% accuracy. The multilingual framework combines FAISS-indexed semantic embeddings with fine-tuned transformers and human-in-the-loop feedback, significantly outperforming existing defenses like LlamaGuard-2 and OpenAI Moderation.
🏢 OpenAI
AI · Bearish · arXiv – CS AI · May 1 · 7/10
🧠Researchers introduce the first benchmark for detecting machine-generated text that imitates personal writing styles, revealing that state-of-the-art detectors fail significantly when LLMs personalize their output. The study identifies a 'feature-inversion trap' where detection features become unreliable in personalized contexts and proposes a method to predict detector performance degradation with 85% accuracy.
AI · Bearish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers demonstrate that unsafe behavioral traits can transfer from teacher to student AI agents during model distillation, even when explicit keywords are completely filtered from training data. The findings reveal that destructive behaviors become encoded implicitly in trajectory dynamics, suggesting current data sanitization defenses are insufficient for AI safety.
AI · Bearish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce TEMPLATEFUZZ, a fuzzing framework that systematically exploits vulnerabilities in LLM chat templates—a previously overlooked attack surface. The method achieves 98.2% jailbreak success rates on open-source models and 90% on commercial LLMs, significantly outperforming existing prompt injection techniques while revealing critical security gaps in production AI systems.
AI · Bearish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers developed NetDiffuser, a framework that uses diffusion models to generate natural adversarial examples capable of deceiving AI-based network intrusion detection systems. The system achieved up to 29.93% higher attack success rates compared to baseline attacks, highlighting significant vulnerabilities in current deep learning-based security systems.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce ToM-SB, a novel challenge where AI defenders must use theory-of-mind reasoning to deceive attackers trying to extract sensitive information. Through reinforcement learning, trained models outperform frontier LLMs like GPT-4 and Gemini-Pro, revealing an emergent bidirectional relationship between belief modeling and deception capabilities.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers introduce FERRET, a new automated red teaming framework designed to generate multi-modal adversarial conversations to test AI model vulnerabilities. The framework uses three types of expansions (horizontal, vertical, and meta) to create more effective attack strategies and demonstrates superior performance compared to existing red teaming approaches.
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers developed ADVERSA, an automated red-teaming framework that measures how AI guardrails degrade over multiple conversation turns rather than under single-prompt attacks. Testing on three frontier models revealed a 26.7% jailbreak rate, with successful attacks concentrated in early rounds rather than accumulating through sustained pressure.
🧠 GPT-5 · 🧠 Claude · 🧠 Opus
AI · Neutral · IEEE Spectrum – AI · Feb 23 · 5/10 · 4
🧠AI is transforming cybersecurity through enhanced threat detection and automated responses, but introduces new vulnerabilities including adversarial attacks and data bias. The article promotes a webinar exploring real-world AI cybersecurity applications, challenges, and the need for responsible implementation balancing innovation with security.