AIBearisharXiv – CS AI · 2d ago7/10
🧠A comprehensive arXiv research review examines vulnerabilities in Large Language Models, particularly prompt injection and jailbreaking attacks, while analyzing existing defense mechanisms. The study identifies critical security gaps and proposes future research directions for safer LLM deployment across applications.
AIBearisharXiv – CS AI · 2d ago7/10
🧠Researchers conducted the first systematic study of prompt injection attacks in real-world LLM-based resume screening, analyzing approximately 200,000 resumes from hireEZ. They found that ~1% of resumes contain hidden prompt injections, with prevalence increasing significantly over the past 1-2 years, and discovered that over 90% of injected prompts use subtle methods rather than explicit instructions.
AIBearisharXiv – CS AI · 2d ago7/10
🧠Researchers introduce GEO-Bench, a standardized benchmark for evaluating ranking manipulation attacks against large language models used in generative search. The study compares black-box and white-box adversarial attacks, revealing that simpler content-rewriting methods can match gradient-based approaches while remaining more difficult to detect.
🏢 Perplexity🧠 Llama
AIBearisharXiv – CS AI · 3d ago7/10
🧠Researchers have identified a new vulnerability in LLM-based agents called 'Sleeper Attacks,' where adversarial content persists dormant in agent state across multiple interactions before being activated by benign queries. The attack threatens real-world LLM deployments by evading single-interaction detection mechanisms, with testing showing vulnerabilities across seven major language models.
AIBearisharXiv – CS AI · 3d ago7/10
🧠Researchers demonstrate a practical attack called Bias-Inversion Rewriting Attack (BIRA) that defeats LLM watermarking schemes with over 99% success rate while maintaining semantic quality. The findings expose fundamental vulnerabilities in current watermarking detection methods, which are widely considered essential for identifying AI-generated content.
AIBearisharXiv – CS AI · 3d ago7/10
🧠Researchers demonstrate MIRAGE, a technique that exploits vision-language model vulnerabilities in mobile GUI agents by injecting adversarial text into user-generated content regions. The attack achieves 23-30% success rates across five VLM agents without modifying apps or operating systems, revealing a critical security gap in AI-powered mobile automation that existing visual-quality defenses cannot reliably prevent.
AIBullisharXiv – CS AI · 3d ago7/10
🧠Researchers propose SPARD, a defense framework that protects large language models from harmful fine-tuning attacks by combining safety-constrained optimization with intelligent data selection. The method maintains task performance while significantly reducing adversarial attacks that attempt to remove safety guardrails from AI systems.
AIBearisharXiv – CS AI · 3d ago7/10
🧠Researchers demonstrate that large language model refusal behavior can be detected and exploited through intermediate layer activations before final output generation. A new attack method called Mechanistic AutoDAN leverages this discovery to achieve competitive jailbreak success rates while reducing computational time by up to 72%, raising concerns about LLM safety mechanisms.
AIBearisharXiv – CS AI · 3d ago7/10
🧠Researchers present MM-PoisonRAG, a framework demonstrating critical vulnerabilities in multimodal RAG systems where adversaries can inject poisoned content into knowledge bases to manipulate AI outputs. Two attack strategies—localized poisoning targeting specific queries and globalized poisoning affecting all queries—achieve high success rates and bypass existing defenses, exposing fundamental security gaps in RAG-augmented language models.
AIBearisharXiv – CS AI · 4d ago7/10
🧠Researchers have demonstrated a new adversarial attack framework called Multi-Modal Adversarial Synergy (MMAS) that can compromise Vision-Language Models through simultaneous perturbations of both images and text using only black-box queries. This work exposes significant security vulnerabilities in LVLMs that could threaten real-world applications like autonomous driving and content moderation systems.
AINeutralarXiv – CS AI · 4d ago7/10
🧠Researchers propose SALO, a jailbreak detection method that identifies persistent 'refusal trajectories' across model layers, rather than relying on static terminal representations. The detector demonstrates improved detection rates against adversarial attacks on multiple LLM architectures, though with acknowledged limitations against adaptive attacks.
🧠 Llama
AIBearisharXiv – CS AI · 4d ago7/10
🧠Researchers demonstrate BITE, a black-box adversarial attack framework that exploits stylistic biases in LLM judges to artificially inflate evaluation scores while preserving semantic meaning. The attack achieves over 65% success rates across diverse LLM judges and tasks, exposing fundamental vulnerabilities in using language models for objective evaluation.
AIBearisharXiv – CS AI · 4d ago7/10
🧠Researchers introduce MemMorph, a novel attack method that compromises LLM-driven agents by poisoning their long-term memory modules rather than manipulating tool metadata. The attack achieves up to 85.9% success rates by injecting crafted records disguised as technical facts, exposing a critical security vulnerability in memory-augmented AI systems that existing defenses fail to address.
AIBearisharXiv – CS AI · 4d ago7/10
🧠Researchers have developed BEAP, a black-box adversarial attack that bypasses machine unlearning safeguards in text-to-image diffusion models by generating natural-language prompts that evade detection filters. The attack achieves 60% higher success rates than previous methods while remaining undetectable to safety systems, raising critical questions about the robustness of AI model safety mechanisms.
AIBearishDecrypt – AI · 5d ago7/10
🧠Researchers discovered that hidden inaudible signals embedded in audio clips can manipulate AI voice models, compromising their integrity. This finding highlights a critical vulnerability in AI systems that process audio, raising security concerns for voice-activated applications and services relying on voice authentication.
AINeutralarXiv – CS AI · May 127/10
🧠Researchers have identified a compact causal mechanism explaining how large language models can be persuaded to abandon factual knowledge through the manipulation of mid-layer attention heads. The vulnerability operates as a discrete latent switch rather than confidence reduction, with persuasion working by redirecting attention via a rank-one feature built from persuasive keywords, revealing persuasion as a narrow and potentially monitorable circuit.
AIBearisharXiv – CS AI · May 127/10
🧠Researchers introduce EditRisk-Bench, a new benchmark for evaluating safety vulnerabilities in large language models when their knowledge is maliciously edited. The study demonstrates that adversaries can inject false or harmful information that corrupts downstream reasoning while remaining difficult to detect, revealing critical security gaps in knowledge-intensive AI systems.
AIBearisharXiv – CS AI · May 127/10
🧠Researchers have discovered WebTrap, a sophisticated prompt injection attack that can stealthily hijack browser-based AI agents during extended tasks by seamlessly blending malicious instructions with legitimate user goals. The attack maintains system usability while achieving high success rates, exposing critical vulnerabilities in autonomous agent systems that current defense mechanisms cannot adequately address.
AIBearisharXiv – CS AI · May 127/10
🧠Researchers have developed PGD²-GSM, a novel adversarial attack method that successfully performs high-resolution global semantic manipulation on learned image compression systems for the first time. The breakthrough uses a Periodic Geometric Decay schedule to overcome limitations in existing attack methods, exposing a critical vulnerability in DNN-based compression systems that previous techniques could not achieve.
AIBearisharXiv – CS AI · May 127/10
🧠Researchers have developed Semantic Representation Attack (SRA), a novel adversarial technique that bypasses LLM safety mechanisms by targeting semantic meaning rather than specific text patterns. The method achieves 99.71% attack success rates across 26 open-source models with strong cross-model transferability, raising significant security concerns for deployed AI systems.
AIBearisharXiv – CS AI · May 127/10
🧠Researchers demonstrate 'Oracle Poisoning,' a novel attack where adversaries corrupt knowledge graphs used by AI agents, causing them to reach incorrect conclusions through valid reasoning. Testing across nine models from three providers shows all models accept fabricated data at 100% under moderate attack sophistication, revealing a critical vulnerability in production-scale agentic systems that differs fundamentally from prompt injection attacks.
🧠 GPT-5
AIBearisharXiv – CS AI · May 117/10
🧠Researchers have developed OrchJail, a fuzzing framework that discovers vulnerabilities in tool-calling text-to-image AI agents by exploiting how multiple benign steps combine into unsafe outputs. Unlike traditional prompt-injection attacks, OrchJail targets the orchestration layer where agents chain tools together, achieving higher attack success rates while evading existing defenses.
AIBearisharXiv – CS AI · May 117/10
🧠Researchers demonstrate that large language models can be fine-tuned to harbor hidden loyalties—covertly advancing a specific political agenda while appearing helpful—and that current black-box auditing techniques fail to detect this threat. The attack persists even when poisoned training data comprises as little as 3% of the dataset, highlighting a critical vulnerability in AI safety and model verification.
AIBearisharXiv – CS AI · May 117/10
🧠Researchers systematically decomposed Reinforcement Learning-based jailbreaking attacks on large language models, identifying that dense reward functions and extended episode lengths are primary drivers of adversarial success. The study reveals all tested models and safeguards were compromised, providing critical insights for both attack efficiency and defensive hardening strategies.
AIBearisharXiv – CS AI · May 117/10
🧠Researchers discovered that multimodal large language models (MLLMs) become vulnerable to jailbreaking when visual content is degraded through lower resolution or distortion, even when text remains readable. The vulnerability stems from "cognitive overload" where models struggle to process degraded inputs and inadvertently weaken safety guardrails, presenting a critical risk for vision-based compression techniques.