AI · Neutral · Lil'Log (Lilian Weng) · Oct 25 · 7/10
🧠Large language models like ChatGPT face security challenges from adversarial attacks and jailbreak prompts that can bypass safety measures implemented during alignment processes like RLHF. Unlike image-based attacks that operate in continuous space, text-based adversarial attacks are more challenging due to the discrete nature of language and lack of direct gradient signals.
🏢 OpenAI · 🧠 ChatGPT
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers demonstrate that Large Language Model-based multi-agent systems are vulnerable to coordinated attacks where malicious agents collaborate to spread misinformation more effectively than independent attackers. They propose STAR, a defense mechanism using sentence-level analysis that recovers 36.76% of lost performance by identifying and correcting misleading information in agent communications.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers present ChainCaps, a runtime safety framework that prevents tool-using AI agents from exploiting composed services through 'permission laundering'—where an agent passes intermediate results through multiple tools to achieve unauthorized outcomes. The system uses capability budgets that propagate through tool chains via intersection, reducing attack success rates from 25-68% to 0-4.8% while maintaining 96-100% benign task completion across frontier models.
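The intersection rule described in this summary can be illustrated with a minimal sketch. Everything named here (`Tool`, `run_chain`, the capability strings) is hypothetical and not taken from the paper. The sketch assumes only that each tool call narrows the remaining budget to its intersection with what the tool grants, and that a call is refused when the capabilities it requires are no longer in the budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool description (an assumption, not the paper's API)."""
    name: str
    requires: frozenset  # capabilities the incoming data must still carry
    grants: frozenset    # capabilities the tool's output may carry at most


def run_chain(initial_budget, tools):
    """Propagate a capability budget through a chain of tools.

    Returns (allowed, remaining_budget, blocked_tool_name_or_None).
    """
    budget = frozenset(initial_budget)
    for tool in tools:
        if not tool.requires <= budget:
            # The chain tries to use a capability the data no longer carries.
            return False, budget, tool.name
        # Budgets can only shrink: intersect with what this tool grants.
        budget = budget & tool.grants
    return True, budget, None


read_file = Tool("read_file", frozenset({"read"}), frozenset({"read", "summarize"}))
summarize = Tool("summarize", frozenset({"summarize"}), frozenset({"summarize", "send"}))
send_email = Tool("send_email", frozenset({"send"}), frozenset({"send"}))

# Laundering attempt: read private data, summarize it, then try to send it out.
result = run_chain({"read", "summarize"}, [read_file, summarize, send_email])
# The same chain without the final send step.
benign = run_chain({"read", "summarize"}, [read_file, summarize])
```

In this toy setup the `send` capability was never in the starting budget, so routing the data through `summarize` (which would grant `send` on its own) cannot reintroduce it, and the final step is refused.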
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers introduce Cordon-MAS, a new defense framework against poisoning attacks on retrieval-augmented generation (RAG) systems. The framework reduces attack success rates by 92.4% by enforcing information-flow control that prevents synthesis agents from directly accessing untrusted evidence, addressing a critical vulnerability in AI systems used for high-stakes applications.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers propose a dynamic defense mechanism for Multi-Agent Systems that identifies and isolates malicious agents by computing each agent's contribution to final outputs through backward propagation. The method addresses a critical vulnerability where adversarial agents can inject false information that spreads through agent networks, improving security for LLM-based multi-agent applications.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers introduce Shadow Unlearning, a privacy-preserving machine unlearning method that removes training data influence from LLMs without exposing sensitive information to attacks. The Neuro-Semantic Projector Unlearning (NSPU) framework achieves this while maintaining model performance and is 10x more computationally efficient than existing approaches.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce VulTriage, an LLM-based framework that enhances vulnerability detection in source code through triple-path context augmentation combining control flow analysis, vulnerability knowledge retrieval, and semantic summarization. The approach achieves state-of-the-art results on benchmark datasets and demonstrates strong generalization to low-resource scenarios.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers developed ARSM-Agent, a security-enhanced framework for medical decision-making AI systems that defends against adversarial attacks through multi-module validation. The system reduces attack success rates to 8.7% while maintaining 91% knowledge consistency, demonstrating significant improvements over existing baseline approaches.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers evaluated prompt-injection defenses for educational LLM tutors, revealing inherent trade-offs between security, usability, and speed. A multi-layer safeguard pipeline let 46.34% of attacks bypass it, but produced zero false positives at 2.50 ms latency, while competing systems like NeMo Guardrails eliminated bypasses at the cost of a 16.22% false positive rate and 1.3-second delays.
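The two rates traded off in this summary can be computed as in the short sketch below. The function name `guardrail_metrics` and the record format are assumptions for illustration, and the definitions used (share of attacks not blocked, share of benign prompts blocked) are the conventional ones rather than definitions confirmed by the paper.

```python
def guardrail_metrics(records):
    """Compute (bypass_rate, false_positive_rate).

    records: iterable of (is_attack, was_blocked) boolean pairs.
    """
    attacks = [blocked for is_attack, blocked in records if is_attack]
    benign = [blocked for is_attack, blocked in records if not is_attack]
    # Bypass rate: fraction of attack prompts that were NOT blocked.
    bypass_rate = attacks.count(False) / len(attacks) if attacks else 0.0
    # False positive rate: fraction of benign prompts that WERE blocked.
    false_positive_rate = benign.count(True) / len(benign) if benign else 0.0
    return bypass_rate, false_positive_rate


# Made-up toy data: 4 attack prompts (2 blocked), 4 benign prompts (1 blocked).
records = [
    (True, True), (True, False), (True, True), (True, False),
    (False, False), (False, False), (False, True), (False, False),
]
rates = guardrail_metrics(records)
```

A stricter filter typically lowers the first number and raises the second, which is the trade-off the study reports.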
AI · Bearish · arXiv – CS AI · May 11 · 6/10
🧠Researchers have successfully demonstrated methods to remove watermarks from large language model outputs through various text manipulation techniques including paraphrasing and machine translation. The study reveals that current watermarking schemes designed to prevent misuse of LLMs are vulnerable to attack, raising questions about their effectiveness as security measures.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers present CWE-BENCH-PYTHON, a large-scale benchmark demonstrating that poorly formulated prompts significantly increase the likelihood of LLMs generating insecure code. The study shows advanced prompting techniques like Chain-of-Thought can effectively mitigate these security risks, establishing prompt quality as a critical factor in AI-generated code safety.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers present MEMSAD, a defense mechanism against memory poisoning attacks on retrieval-augmented LLM agents, using gradient-coupled anomaly detection to identify adversarial perturbations while maintaining retrieval performance. The work formalizes security vulnerabilities in persistent external memory systems and demonstrates that while composite defenses achieve perfect detection rates, synonym-based attacks remain undetectable by embedding-based approaches.
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠Researchers introduce LOCA, a method for identifying why specific jailbreak attacks succeed against safety-trained LLMs by pinpointing minimal, causal changes in intermediate representations. The approach provides local explanations for individual jailbreak instances rather than global theories, achieving refusal induction with an average of six interpretable changes compared to prior methods requiring 20+.
🧠 Llama
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers propose trace rewriting techniques to protect language models from unauthorized knowledge distillation, a process where smaller models learn from larger ones without permission. The methods preserve model accuracy while degrading distillation usefulness and embedding detectable watermarks in student models.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce Critical-CoT, a defense framework that protects large language models against reasoning-level backdoor attacks by fine-tuning models to develop critical thinking behaviors. Unlike token-level backdoors, these attacks inject malicious reasoning steps into chain-of-thought processes, making them harder to detect; the proposed defense demonstrates strong robustness across multiple LLMs and datasets.
AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠Researchers developed a secure-by-design AI framework combining PromptShield and CIAF to automate cloud security and forensic investigations while protecting against prompt injection attacks. The system achieved over 93% accuracy in classification tasks and enhanced ransomware detection in AWS and Azure environments.
AI · Neutral · arXiv – CS AI · Apr 7 · 6/10
🧠A research study reveals that when Claude Opus 4.6 deobfuscates JavaScript code, poisoned identifier names from the original string table consistently survive in the reconstructed code, even when the AI demonstrates correct understanding of the code's semantics. Changing the task framing from 'deobfuscate' to 'write fresh implementation' significantly reduced this persistence while maintaining algorithmic accuracy.
🧠 Claude · 🧠 Haiku · 🧠 Opus
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers developed ADVERSA, an automated red-teaming framework that measures how AI guardrails degrade over multiple conversation turns rather than under single-prompt attacks. Testing on three frontier models revealed a 26.7% jailbreak rate, with successful attacks concentrated in early rounds rather than accumulating through sustained pressure.
🧠 GPT-5 · 🧠 Claude · 🧠 Opus
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8
🧠Researchers introduce DualSentinel, a lightweight framework for detecting targeted attacks on Large Language Models by identifying 'Entropy Lull' patterns - periods of abnormally low token probability entropy that indicate when LLMs are being coercively controlled. The system uses dual-check verification to accurately detect backdoor and prompt injection attacks with near-zero false positives while maintaining minimal computational overhead.
$NEAR
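The "Entropy Lull" idea in the DualSentinel summary can be sketched as follows. This is only an illustration of flagging runs of low next-token entropy: the function names, the threshold of 0.1 nats, and the minimum run length of 3 are all assumptions, and the dual-check verification step mentioned in the summary is not modeled.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def find_lulls(entropies, threshold=0.1, min_len=3):
    """Return (start, end) index pairs, end exclusive, for runs of at least
    min_len consecutive steps whose entropy is below threshold.

    threshold and min_len are illustrative values, not taken from the paper.
    """
    lulls, start = [], None
    for i, h in enumerate(entropies):
        if h < threshold:
            if start is None:
                start = i  # a low-entropy run begins here
        else:
            if start is not None and i - start >= min_len:
                lulls.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(entropies) - start >= min_len:
        lulls.append((start, len(entropies)))
    return lulls


uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain over 4 tokens
peaked = [1.0, 0.0, 0.0, 0.0]       # fully determined next token
steps = [uniform, peaked, peaked, peaked, uniform]
lulls = find_lulls([token_entropy(p) for p in steps])
```

Here the three fully determined steps in the middle form one flagged run, while the uncertain steps on either side do not.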
AI · Bearish · arXiv – CS AI · Feb 27 · 6/10 · 7
🧠Researchers evaluated prompt injection and jailbreak vulnerabilities across multiple open-source LLMs including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma. The study found significant behavioral variations across models and that lightweight defense mechanisms can be consistently bypassed by long, reasoning-heavy prompts.
AI · Neutral · Import AI (Jack Clark) · Jan 12 · 6/10 · 7
🧠Import AI newsletter issue 440 explores evolving AI systems that can attack other LLMs, AI regulation mechanisms, and automation concepts. The research from Japanese AI startup Sakana demonstrates how AI systems can be evolved to compete against each other in controlled environments.