AI · Bearish · arXiv – CS AI · May 29 · 7/10
🧠A comprehensive arXiv research review examines vulnerabilities in Large Language Models, particularly prompt injection and jailbreaking attacks, while analyzing existing defense mechanisms. The study identifies critical security gaps and proposes future research directions for safer LLM deployment across applications.
AI · Bearish · arXiv – CS AI · May 28 · 7/10
🧠Researchers introduce HARP, a methodology for measuring how harm propagates across multi-agent LLM systems when one component is compromised. Testing on a finance-oriented seven-agent system reveals that single-agent compromise creates the strongest amplification effects, while existing defenses struggle to balance security with utility costs.
AI · Bullish · arXiv – CS AI · May 28 · 7/10
🧠Researchers propose SPARD, a defense framework that protects large language models from harmful fine-tuning attacks by combining safety-constrained optimization with intelligent data selection. The method maintains task performance while significantly reducing the success of attacks that attempt to strip safety guardrails from AI systems.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers propose Latent Personality Alignment (LPA), a novel defense mechanism for large language models that achieves adversarial robustness by training on abstract personality traits rather than harmful examples. The method requires fewer than 100 training examples while matching the performance of traditional approaches using 150,000+ harmful prompts, and demonstrates superior generalization to unseen attack vectors.
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers evaluated six defense mechanisms against persistent memory attacks on LLM agents, finding that most input-level and retrieval-level defenses fail to prevent the execution of malicious instructions stored in agent memory. Only Memory Sandbox, a memory-layer tool-gating approach, effectively blocked attacks across eight of nine models while maintaining zero utility cost, though it paradoxically increased attack success in one reasoning model by forcing reliance on alternative execution pathways.
AI · Neutral · arXiv – CS AI · May 9 · 7/10
🧠Researchers have developed TurnGate, a defense system that detects multi-turn dialogue attacks where malicious intent is distributed across multiple conversation turns rather than exposed in a single prompt. The study introduces the Multi-Turn Intent Dataset (MTID) and demonstrates that the system outperforms existing baselines while maintaining low false-positive refusal rates.
AI · Neutral · arXiv – CS AI · May 9 · 7/10
🧠Researchers propose Safety Bottleneck Regularization (SBR), a defense mechanism against harmful fine-tuning attacks on large language models. The approach anchors a model's unsafe responses to safe outputs via the unembedding layer, reducing harmful capabilities while maintaining performance on legitimate tasks.
AI · Bearish · arXiv – CS AI · May 1 · 7/10
🧠A comprehensive academic survey examines security vulnerabilities and defense mechanisms across four operational layers of autonomous agent frameworks built on large language models. The research identifies how threats propagate across layers—from input manipulation through unsafe actions to ecosystem-level impacts—highlighting critical gaps in current security approaches as these systems become increasingly complex and integrated.
AI · Neutral · arXiv – CS AI · May 1 · 7/10
🧠Researchers propose a machine unlearning framework to detect and remove neural backdoors, hidden triggers inserted during AI training that can compromise system integrity. Using model inversion and statistical analysis, the approach identifies malicious patterns and automatically decouples the model's behavior from backdoor triggers, addressing a critical cybersecurity vulnerability in AI systems.
AI · Bearish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers have developed XFED, a novel model poisoning attack that compromises federated learning systems without requiring attackers to communicate or coordinate with each other. The attack successfully bypasses eight state-of-the-art defenses, revealing fundamental security vulnerabilities in FL deployments that were previously underestimated.
AI · Neutral · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers prove mathematically that no continuous input-preprocessing defense can simultaneously maintain utility, preserve model functionality, and guarantee safety against prompt injection attacks in language models with connected prompt spaces. The findings establish a fundamental trilemma showing that defenses must inevitably fail at some threshold inputs, with results verified in Lean 4 and validated empirically across three LLMs.
AI · Bearish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers conducted the first real-world safety evaluation of OpenClaw, a widely deployed AI agent with extensive system access, revealing significant security vulnerabilities. The study found that poisoning any single dimension of the agent's state increases attack success rates from 24.6% to 64-74%, with even the strongest defenses still vulnerable to 63.8% of attacks.
🧠 GPT-5 · 🧠 Claude · 🧠 Sonnet
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers have extended the RESTA defense mechanism to vision-language models (VLMs) to protect against jailbreaking attacks that can cause AI systems to produce harmful outputs. The study found that directional embedding noise significantly reduces attack success rates across the JailBreakV-28K benchmark, providing a lightweight security layer for AI agent systems.
AI · Bearish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers have developed new stealthy poisoning attacks that can bypass current defenses in regression models used across industrial and scientific applications. The study introduces BayesClean, a novel defense mechanism that better protects against these sophisticated attacks when poisoning attempts are significant.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers from UTS achieved second place in a psychological defense mechanism classification competition using a multi-agent AI system that identifies defense patterns through absence-based reasoning rather than presence detection. The system combines Gemini 2.5 agents with fine-tuned Qwen models to achieve an F1 score of 0.406, addressing critical biases in minority class prediction through structured ensemble methods.
🧠 Gemini
AI · Bearish · arXiv – CS AI · Feb 27 · 6/10 · 7
🧠Researchers evaluated prompt injection and jailbreak vulnerabilities across multiple open-source LLMs, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma. The study found that behavior varies significantly across models and that lightweight defense mechanisms can be consistently bypassed by long, reasoning-heavy prompts.
AI · Neutral · arXiv – CS AI · May 9 · 5/10
🧠Researchers propose UAT-MC, a new defense mechanism for multimodal recommender systems that addresses cross-modal gradient misalignment in evasion-based promotion attacks. The approach synchronizes visual and textual perturbations through coordinated adversarial training, improving robustness while maintaining recommendation quality.
AI · Neutral · OpenAI News · May 3 · 4/10 · 6
🧠The article discusses research on how adversarial robustness transfers between different types of perturbations in machine learning models. It examines whether defensive techniques developed for one type of attack also provide protection against other types of adversarial examples.