#llm-security News & Analysis

177 articles tagged with #llm-security. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

LLM-Agnostic Semantic Representation Attack

Researchers have developed the Semantic Representation Attack (SRA), a novel adversarial technique that bypasses LLM safety mechanisms by targeting semantic meaning rather than specific text patterns. The method achieves a 99.71% attack success rate across 26 open-source models with strong cross-model transferability, raising significant security concerns for deployed AI systems.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Researchers present a comprehensive framework for systematically generating, categorizing, and evaluating jailbreak attacks against large language models, introducing a dataset of 114,000 adversarial prompts, automated generation methods, and a novel continuous evaluation metric (OPTIMUS) that surpasses binary success rate measurements.

🏢 Perplexity

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Towards Security-Auditable LLM Agents: A Unified Graph Representation

Researchers propose Agent-BOM, a unified graph-based representation system for auditing the security of LLM-based autonomous agents. The framework addresses critical gaps in existing audit mechanisms by tracking both static capabilities and dynamic runtime states, enabling detection of complex attack chains across multi-agent systems.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

A Systematic Investigation of The RL-Jailbreaker in LLMs

Researchers systematically decomposed Reinforcement Learning-based jailbreaking attacks on large language models, identifying that dense reward functions and extended episode lengths are primary drivers of adversarial success. The study reveals all tested models and safeguards were compromised, providing critical insights for both attack efficiency and defensive hardening strategies.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

Searching for Privacy Risks in LLM Agents via Simulation

Researchers developed a search-based framework to identify privacy vulnerabilities in LLM-based agents through simulated multi-turn interactions. The study reveals that malicious agents employ sophisticated tactics like impersonation and consent forgery to extract sensitive information, while defenses evolve into robust identity-verification systems, with findings generalizing across diverse scenarios and models.

AI · Bearish · arXiv – CS AI · May 9 · 7/10

Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions

A comprehensive measurement study reveals that large language models frequently specify vulnerable and incompatible library versions in generated Python code, with 36.70%-55.70% of tasks containing known CVEs and 62.75%-74.51% rated as Critical or High severity. The research demonstrates this represents a systemic bias across all evaluated models rather than isolated errors, with most CVEs publicly disclosed before the models' knowledge cutoffs.

AI · Neutral · arXiv – CS AI · May 7 · 7/10

SoK: Robustness in Large Language Models against Jailbreak Attacks

Researchers introduce Security Cube, a comprehensive evaluation framework for assessing Large Language Model robustness against jailbreak attacks. The study systematically catalogs existing attack and defense methods while establishing benchmarks across 13 attack vectors and 5 defense mechanisms, revealing critical gaps in current LLM safety practices.

AI · Bearish · arXiv – CS AI · May 7 · 7/10

Syntax- and Compilation-Preserving Evasion of LLM Vulnerability Detectors

Researchers demonstrate that LLM-based vulnerability detectors, increasingly used in software security pipelines, can be evaded through syntax-preserving code transformations. The study reveals that models with 70%+ accuracy on clean code can fail to detect 87%+ of vulnerabilities when subjected to minor edits, with adversarial attacks achieving evasion rates of up to 92.5%. The results raise serious questions about the reliability of AI-driven security tools in production environments.

🧠 GPT-4

AI · Bearish · arXiv – CS AI · May 4 · 7/10

Exploring LLM biases to manipulate AI search overview

Researchers demonstrate that Large Language Models used in AI search overview systems are vulnerable to bias manipulation through reinforcement learning-optimized snippet rewriting. The study reveals that adversaries can exploit LLM biases to influence search result rankings and generate inaccurate or harmful information, posing significant security risks to AI-powered search applications.

AI · Bearish · arXiv – CS AI · May 4 · 7/10

Jailbroken Frontier Models Retain Their Capabilities

Researchers found that advanced jailbreaks against large language models impose minimal performance degradation on the most capable models, with frontier models like Claude Opus 4.6 losing only 7.7% of benchmark performance when compromised. This challenges the assumption that safety mechanisms inherently trade off capability, raising concerns that safety strategies relying on performance degradation are insufficient for protecting frontier AI systems.

🧠 Claude · 🧠 Haiku · 🧠 Opus

AI · Bullish · arXiv – CS AI · May 4 · 7/10

Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts

Researchers introduce Sentra-Guard, a real-time defense system that detects and mitigates jailbreak and prompt injection attacks on large language models with 99.96% accuracy. The multilingual framework combines FAISS-indexed semantic embeddings with fine-tuned transformers and human-in-the-loop feedback, significantly outperforming existing defenses like LlamaGuard-2 and OpenAI Moderation.

🏢 OpenAI

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

Researchers demonstrate that multi-turn prompt injection attacks leave detectable signatures in language model activation patterns, achieving 93.8% detection accuracy through analysis of residual stream trajectories. The approach reveals that adversarial attack sequences exhibit distinctive 'restlessness' patterns across model architectures, though detection effectiveness varies significantly when deployed on real-world data.

AI · Bearish · arXiv – CS AI · May 1 · 7/10

When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

Researchers introduce the first benchmark for detecting machine-generated text that imitates personal writing styles, revealing that state-of-the-art detectors fail significantly when LLMs personalize their output. The study identifies a 'feature-inversion trap' where detection features become unreliable in personalized contexts and proposes a method to predict detector performance degradation with 85% accuracy.

AI · Bearish · arXiv – CS AI · May 1 · 7/10

From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems

Researchers present the first comprehensive threat modeling of LLM-enabled robotic systems, mapping three categories of attacks (cyber, adversarial, and conversational) across the perception-planning-actuation pipeline. The analysis reveals critical architectural vulnerabilities where compromised inputs or unsafe model outputs can propagate to unsafe physical actions without proper validation boundaries.

AI · Bearish · arXiv – CS AI · May 1 · 7/10

Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study

A comprehensive academic survey examines security vulnerabilities and defense mechanisms across four operational layers of autonomous agent frameworks built on large language models. The research identifies how threats propagate across layers—from input manipulation through unsafe actions to ecosystem-level impacts—highlighting critical gaps in current security approaches as these systems become increasingly complex and integrated.

AI · Bullish · arXiv – CS AI · May 1 · 7/10

Toward Autonomous SOC Operations: End-to-End LLM Framework for Threat Detection, Query Generation, and Resolution in Security Operations

Researchers present an end-to-end LLM framework that automates Security Operations Center (SOC) workflows by combining ensemble-based threat detection, syntax-constrained query generation, and retrieval-augmented resolution support. The system reduces incident triage time from hours to under 10 minutes while achieving 82.8% detection accuracy and improving resolution prediction from 78.3% to 90.0%.

AI · Bearish · arXiv – CS AI · Apr 20 · 7/10

When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models

Researchers introduce CREST-Search, a red-teaming framework that exposes vulnerabilities in web-augmented LLMs by crafting benign-seeming queries designed to trigger unsafe citations from the internet. The study reveals that integrating web search into language models creates new safety risks beyond traditional LLM harms, requiring specialized defensive strategies.

AI · Bearish · arXiv – CS AI · Apr 15 · 7/10

TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

Researchers introduce TEMPLATEFUZZ, a fuzzing framework that systematically exploits vulnerabilities in LLM chat templates, a previously overlooked attack surface. The method achieves jailbreak success rates of 98.2% on open-source models and 90% on commercial LLMs, significantly outperforming existing prompt injection techniques while revealing critical security gaps in production AI systems.

AI · Bearish · arXiv – CS AI · Apr 15 · 7/10

Mobile GUI Agents under Real-world Threats: Are We There Yet?

Researchers have identified critical vulnerabilities in mobile GUI agents powered by large language models, revealing that third-party content in real-world apps causes these agents to fail significantly more often than benchmark tests suggest. Testing on 122 dynamic tasks and over 3,000 static scenarios shows that agents are misled at rates of 36-42%, raising serious concerns about deploying these agents in commercial settings.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Safe-FedLLM: Delving into the Safety of Federated Large Language Models

Researchers propose Safe-FedLLM, a defense framework addressing security vulnerabilities in federated large language model training by detecting malicious clients through analysis of LoRA update patterns. The lightweight classifier-based approach effectively mitigates attacks while maintaining model performance and training efficiency, representing a significant advancement in securing distributed LLM development.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Researchers have identified a novel jailbreaking vulnerability in LLMs called 'Salami Slicing Risk,' where attackers chain multiple low-risk inputs that individually bypass safety measures but cumulatively trigger harmful outputs. The Salami Attack framework demonstrates over 90% success rates against GPT-4o and Gemini, highlighting a critical gap in current multi-turn defense mechanisms that assume individual requests are adequately monitored.

🧠 GPT-4 · 🧠 Gemini
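
The gap described in this summary, where each request passes a per-message check while the conversation as a whole drifts into risky territory, can be illustrated with a small sketch. This is not the paper's Salami Attack framework or any defense it proposes. The risk scores, both thresholds, and the `ConversationRiskMonitor` name are hypothetical and chosen only for illustration; a real system would take scores from some classifier.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationRiskMonitor:
    """Toy monitor that tracks risk per turn and across the whole conversation."""

    per_turn_limit: float = 0.5    # hypothetical threshold for a single message
    cumulative_limit: float = 1.0  # hypothetical threshold for the running total
    total: float = field(default=0.0, init=False)

    def check(self, turn_risk: float) -> str:
        """Return "allow" or "block" for one turn with the given risk score."""
        # A per-message filter alone would stop here.
        if turn_risk > self.per_turn_limit:
            return "block"
        # Accumulate risk so that many small steps are eventually noticed.
        self.total += turn_risk
        if self.total > self.cumulative_limit:
            return "block"
        return "allow"


def run(scores):
    """Feed a sequence of per-turn risk scores through a fresh monitor."""
    monitor = ConversationRiskMonitor()
    return [monitor.check(score) for score in scores]


if __name__ == "__main__":
    # Four individually low-risk turns: only the running total trips the limit.
    print(run([0.3, 0.3, 0.3, 0.3]))
```

With these assumed thresholds, a filter that only looks at single messages would allow all four turns, while the cumulative check blocks the fourth.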

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

Researchers have discovered a critical vulnerability in Reinforcement Learning with Verifiable Rewards (RLVR), an emerging training paradigm that enhances LLM reasoning abilities. By injecting less than 2% poisoned data into training sets, attackers can implant backdoors that degrade safety performance by 73% when triggered, without modifying the reward verifier itself.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

Researchers introduce ClawGuard, a runtime security framework that protects tool-augmented LLM agents from indirect prompt injection attacks by enforcing user-confirmed rules at tool-call boundaries. The framework blocks malicious instructions embedded in tool responses without requiring model modifications, demonstrating robust protection across multiple state-of-the-art language models.
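
The general idea of enforcing user-confirmed rules at the tool-call boundary, independent of what the model was told by a tool response, might look roughly like the sketch below. It is not ClawGuard's actual design or API. The `Rule` type, the `is_allowed` helper, the tool names, and the prefix-matching policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A user-confirmed rule: which tool may be called, and with which arguments."""

    tool: str
    allowed_prefixes: tuple  # argument must start with one of these strings


def is_allowed(rules, tool, argument):
    """Check a proposed tool call against the rules before it is executed."""
    for rule in rules:
        # str.startswith accepts a tuple and matches any of its entries.
        if rule.tool == tool and argument.startswith(rule.allowed_prefixes):
            return True
    # Default deny: anything the user did not confirm is blocked.
    return False


if __name__ == "__main__":
    rules = [Rule(tool="http_get", allowed_prefixes=("https://docs.example.com/",))]
    # A call the user confirmed in advance.
    print(is_allowed(rules, "http_get", "https://docs.example.com/guide"))
    # A call that an injected instruction in a tool response might try to trigger.
    print(is_allowed(rules, "http_get", "https://attacker.example/exfil"))
```

Because the check runs outside the model, an instruction injected into a tool response can change what the agent asks for but not what the boundary permits.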

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

ADAM: A Systematic Data Extraction Attack on Agent Memory via Adaptive Querying

Researchers have developed ADAM, a novel privacy attack that exploits vulnerabilities in Large Language Model agents' memory systems through adaptive querying, achieving success rates of up to 100% in extracting sensitive information. The attack highlights critical security gaps in modern LLM-based systems that rely on memory modules and retrieval-augmented generation, underscoring the urgent need for privacy-preserving safeguards.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models

Researchers have developed Adaptive Stealing (AS), a novel watermark stealing algorithm that exploits vulnerabilities in LLM watermarking systems by dynamically selecting optimal attack strategies based on contextual token states. This advancement demonstrates that existing fixed-strategy watermark defenses are insufficient, highlighting critical security gaps in protecting proprietary LLM services and raising urgent questions about watermark robustness.