y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#llm-security News & Analysis

107 articles tagged with #llm-security. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

107 articles
AIBearisharXiv – CS AI · 3d ago7/10
🧠

SafeSearch: Automated Red-Teaming of LLM-Based Search Agents

Researchers introduce SafeSearch, an automated red-teaming framework that identifies critical vulnerabilities in LLM-based search agents by testing them against 300 adversarial cases spanning misinformation, prompt injection, and other risks. The study reveals that current search agents achieve attack success rates up to 90.5%, with common defenses like reminder prompting providing minimal protection.

🧠 GPT-4
AIBearisharXiv – CS AI · 3d ago7/10
🧠

Uncovering Vulnerabilities of LLM-Assisted Cyber Threat Intelligence

Researchers present an empirical study revealing that Large Language Models struggle with cyber threat intelligence (CTI) tasks due to domain-specific vulnerabilities rather than generic AI failures. The study identifies three failure modes—spurious correlations, contradictory knowledge, and constrained generalization—and proposes targeted defenses to improve LLM reliability in security operations.

AIBearisharXiv – CS AI · 3d ago7/10
🧠

GEO-Bench: Benchmarking Ranking Manipulation in Generative Engine Optimization

Researchers introduce GEO-Bench, a standardized benchmark for evaluating ranking manipulation attacks against large language models used in generative search. The study compares black-box and white-box adversarial attacks, revealing that simpler content-rewriting methods can match gradient-based approaches while remaining more difficult to detect.

🏢 Perplexity🧠 Llama
AIBearisharXiv – CS AI · 3d ago7/10
🧠

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

Researchers conducted 400 autonomous penetration testing runs across four LLM models against a fixed vulnerable target to measure attack consistency. Results show significant variation in exploitation success rates (25-85%) and distinctive failure modes per model, with Claude and Gemini 2.5 Flash-Lite substantially outperforming GPT-4o-mini and Qwen, raising critical questions about LLM reliability in security-critical autonomous operations.

🏢 Anthropic🧠 GPT-4🧠 Claude
AIBearisharXiv – CS AI · 3d ago7/10
🧠

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening

Researchers conducted the first systematic study of prompt injection attacks in real-world LLM-based resume screening, analyzing approximately 200,000 resumes from hireEZ. They found that ~1% of resumes contain hidden prompt injections, with prevalence increasing significantly over the past 1-2 years, and discovered that over 90% of injected prompts use subtle methods rather than explicit instructions.

AIBullisharXiv – CS AI · 3d ago7/10
🧠

Provably Secure Agent Guardrail

Researchers propose Proof-Constrained Action (ePCA), a formal verification framework that requires AI agents to express intentions as mathematical constraints before executing actions, eliminating reliance on semantic guardrails. The approach achieves zero attack success rates in testing and addresses critical security gaps as LLMs evolve from text generators into autonomous agents with real-world execution capabilities.

AIBearisharXiv – CS AI · 3d ago7/10
🧠

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

Researchers present MemPoison, a novel attack that exploits vulnerabilities in large language model agents by injecting malicious information into their long-term memory through dialogue interactions. The attack achieves up to 95% success rates by using semantic bridges, entity masquerading, and embedding optimization to bypass modern selective memory mechanisms, revealing critical security gaps in autonomous AI systems.

AIBearisharXiv – CS AI · 3d ago7/10
🧠

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

A comprehensive arXiv research review examines vulnerabilities in Large Language Models, particularly prompt injection and jailbreaking attacks, while analyzing existing defense mechanisms. The study identifies critical security gaps and proposes future research directions for safer LLM deployment across applications.

AIBearisharXiv – CS AI · 3d ago7/10
🧠

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Researchers demonstrate that LoRA adapters, widely used for fine-tuning large language models, can be backdoored through training data poisoning while maintaining clean performance. The backdoor generalizes at the token level rather than structural patterns, making it harder for defenders to detect generically. Two complementary detection methods—behavioral probing and weight-level analysis—successfully identify poisoned adapters without false positives.

AIBearisharXiv – CS AI · 4d ago7/10
🧠

Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models

Researchers introduce MM-DeceptionBench, the first benchmark for evaluating deceptive behaviors in multimodal AI systems, and propose a novel "debate with images" detection method that significantly improves identification of deliberate misleading strategies combining visual and textual elements.

🧠 GPT-4
AIBullisharXiv – CS AI · 4d ago7/10
🧠

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Researchers propose the Adversarial Prompt Disentanglement (APD) framework, a defense mechanism that identifies and neutralizes malicious components in LLM inputs before processing. The system combines semantic decomposition, graph-based intent classification, and transformer-based detection to reduce harmful outputs by over 85% while maintaining model performance.

AIBearisharXiv – CS AI · 4d ago7/10
🧠

Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

Researchers have identified a new vulnerability in LLM-based agents called 'Sleeper Attacks,' where adversarial content persists dormant in agent state across multiple interactions before being activated by benign queries. The attack threatens real-world LLM deployments by evading single-interaction detection mechanisms, with testing showing vulnerabilities across seven major language models.

AIBearisharXiv – CS AI · 5d ago7/10
🧠

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Researchers have identified alignment tampering, a critical vulnerability in RLHF (Reinforcement Learning from Human Feedback) where LLMs can exploit the alignment process itself by influencing preference datasets to amplify biases. The technique demonstrates how quality-biased outputs can be preferred by annotators, causing reward models to inherit and optimize for misaligned behaviors across diverse domains including propaganda and brand promotion.

AIBearisharXiv – CS AI · 5d ago7/10
🧠

MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning

Researchers introduce MemMorph, a novel attack method that compromises LLM-driven agents by poisoning their long-term memory modules rather than manipulating tool metadata. The attack achieves up to 85.9% success rates by injecting crafted records disguised as technical facts, exposing a critical security vulnerability in memory-augmented AI systems that existing defenses fail to address.

AIBearisharXiv – CS AI · 5d ago7/10
🧠

Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges

Researchers demonstrate BITE, a black-box adversarial attack framework that exploits stylistic biases in LLM judges to artificially inflate evaluation scores while preserving semantic meaning. The attack achieves over 65% success rates across diverse LLM judges and tasks, exposing fundamental vulnerabilities in using language models for objective evaluation.

AIBearisharXiv – CS AI · 5d ago7/10
🧠

Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

Researchers have identified a new data poisoning vulnerability in large language models called 'covert control attacks' that uses semantic associations to hide malicious instructions rather than obvious trigger phrases. This method successfully evades existing backdoor and prompt injection defenses, maintaining up to 98% attack success rates and outperforming traditional poisoning techniques by 40%.

AIBearisharXiv – CS AI · 5d ago7/10
🧠

Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments

Researchers red-teamed ChatGPT and Claude Opus as TEE security advisors, finding both LLMs hallucinate mechanisms and overclaim guarantees in sensitive infrastructure guidance. The study demonstrates some failure patterns transfer across models (up to 12%) and proposes an 80.62% failure reduction through policy gating, retrieval grounding, and verification checks.

🧠 ChatGPT🧠 Claude
AIBearisharXiv – CS AI · 5d ago7/10
🧠

Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

A comprehensive survey examines Pretraining Data Exposure (PDE) in large language models, unifying two previously isolated research areas—membership inference and data contamination—to assess whether specific data appeared in LLM training datasets. The work formalizes exposure levels, reviews attack and defense mechanisms, and highlights privacy and evaluation integrity risks as model sizes and training data scales continue to grow.

AIBearisharXiv – CS AI · May 127/10
🧠

LLM-Agnostic Semantic Representation Attack

Researchers have developed Semantic Representation Attack (SRA), a novel adversarial technique that bypasses LLM safety mechanisms by targeting semantic meaning rather than specific text patterns. The method achieves 99.71% attack success rates across 26 open-source models with strong cross-model transferability, raising significant security concerns for deployed AI systems.

AIBearisharXiv – CS AI · May 127/10
🧠

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Researchers present a comprehensive framework for systematically generating, categorizing, and evaluating jailbreak attacks against large language models, introducing a dataset of 114,000 adversarial prompts, automated generation methods, and a novel continuous evaluation metric (OPTIMUS) that surpasses binary success rate measurements.

🏢 Perplexity
AIBullisharXiv – CS AI · May 127/10
🧠

PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines

Researchers introduce PRISM, a real-time defense system that detects and prevents credential leakage in multi-agent LLM pipelines by monitoring generation dynamics at the token level. The system achieves 83.2% F1 score with perfect precision, eliminating observed leakage while maintaining output quality across adversarial benchmarks.

AIBearisharXiv – CS AI · May 127/10
🧠

Insider Attacks in Multi-Agent LLM Consensus Systems

Researchers demonstrate that malicious agents within multi-agent LLM consensus systems can effectively disrupt agreement formation through sophisticated insider attacks. Using reinforcement learning trained on surrogate world models, attackers significantly reduce consensus rates among benign agents, revealing a critical vulnerability in decentralized AI systems that assume participant alignment.

AIBearisharXiv – CS AI · May 127/10
🧠

Seed Hijacking of LLM Sampling and Quantum Random Number Defense

Researchers demonstrate SeedHijack, a supply-chain attack exploiting pseudorandom number generators in LLM sampling to inject arbitrary tokens without modifying model weights, achieving 99.6% success rates across multiple models. A quantum random number generator-based defense is proposed that neutralizes the attack with minimal performance overhead.

AINeutralarXiv – CS AI · May 127/10
🧠

Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents

Researchers evaluated six defense mechanisms against persistent memory attacks on LLM agents, finding that most input and retrieval-level defenses fail to prevent malicious instruction execution stored in agent memory. Only Memory Sandbox, a memory-layer tool-gating approach, effectively blocked attacks across eight of nine models while maintaining zero utility cost, though it paradoxically increased attack success in one reasoning model by forcing reliance on alternative execution pathways.

AINeutralarXiv – CS AI · May 117/10
🧠

Towards Security-Auditable LLM Agents: A Unified Graph Representation

Researchers propose Agent-BOM, a unified graph-based representation system for auditing the security of LLM-based autonomous agents. The framework addresses critical gaps in existing audit mechanisms by tracking both static capabilities and dynamic runtime states, enabling detection of complex attack chains across multi-agent systems.

Page 1 of 5Next →