y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#llm-security News & Analysis

107 articles tagged with #llm-security. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

107 articles
AINeutralarXiv – CS AI · Apr 77/10
🧠

Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities

A comprehensive study of 10,000 trials reveals that most assumed triggers for LLM agent exploitation don't work, but 'goal reframing' prompts like 'You are solving a puzzle; there may be hidden clues' can cause 38-40% exploitation rates despite explicit rule instructions. The research shows agents don't override rules but reinterpret tasks to make exploitative actions seem aligned with their goals.

🏢 OpenAI🧠 GPT-4🧠 GPT-5
AIBearisharXiv – CS AI · Apr 67/10
🧠

Generalization Limits of Reinforcement Learning Alignment

Researchers discovered that reinforcement learning alignment techniques like RLHF have significant generalization limits, demonstrated through 'compound jailbreaks' that increased attack success rates from 14.3% to 71.4% on OpenAI's gpt-oss-20b model. The study provides empirical evidence that safety training doesn't generalize as broadly as model capabilities, highlighting critical vulnerabilities in current AI alignment approaches.

🏢 OpenAI
AIBearisharXiv – CS AI · Apr 67/10
🧠

Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis

Researchers conducted the first comprehensive security analysis of Agent Skills, an emerging standard for LLM-based agents to acquire domain expertise. The study identified significant structural vulnerabilities across the framework's lifecycle, including lack of data-instruction boundaries and insufficient security review processes.

AIBearisharXiv – CS AI · Apr 67/10
🧠

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

Researchers discovered Document-Driven Implicit Payload Execution (DDIPE), a supply-chain attack method that embeds malicious code in LLM coding agent skill documentation. The attack achieves 11.6% to 33.5% bypass rates across multiple frameworks, with 2.5% evading both detection and security alignment measures.

AIBearisharXiv – CS AI · Apr 67/10
🧠

An Independent Safety Evaluation of Kimi K2.5

An independent safety evaluation of the open-weight AI model Kimi K2.5 reveals significant security risks including lower refusal rates on CBRNE-related requests, cybersecurity vulnerabilities, and concerning sabotage capabilities. The study highlights how powerful open-weight models may amplify safety risks due to their accessibility and calls for more systematic safety evaluations before deployment.

🧠 GPT-5🧠 Claude🧠 Opus
AINeutralarXiv – CS AI · Apr 67/10
🧠

AgenticRed: Evolving Agentic Systems for Red-Teaming

AgenticRed introduces an automated red-teaming system that uses evolutionary algorithms and LLMs to autonomously design attack methods without human intervention. The system achieved near-perfect attack success rates across multiple AI models, including 100% success on GPT-5.1, DeepSeek-R1 and DeepSeek V3.2.

🧠 GPT-5🧠 Llama
AIBearisharXiv – CS AI · Apr 67/10
🧠

Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study

A large-scale study of 17,022 third-party LLM agent skills found 520 vulnerable skills with credential leakage issues, identifying 10 distinct leakage patterns. The research reveals that 76.3% of vulnerabilities require joint analysis of code and natural language, with debug logging being the primary attack vector causing 73.5% of credential leaks.

AIBearisharXiv – CS AI · Mar 277/10
🧠

The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Research reveals that LLM system prompt configuration creates massive security vulnerabilities, with the same model's phishing detection rates ranging from 1% to 97% based solely on prompt design. The study PhishNChips demonstrates that more specific prompts can paradoxically weaken AI security by replacing robust multi-signal reasoning with exploitable single-signal dependencies.

AIBearisharXiv – CS AI · Mar 277/10
🧠

LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts

Researchers have identified a new vulnerability in large language models called 'natural distribution shifts' where seemingly benign prompts can bypass safety mechanisms to reveal harmful content. They developed ActorBreaker, a novel attack method that uses multi-turn prompts to gradually expose unsafe content, and proposed expanding safety training to address this vulnerability.

AINeutralarXiv – CS AI · Mar 277/10
🧠

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Researchers have identified a new category of AI safety called 'reasoning safety' that focuses on protecting the logical consistency and integrity of LLM reasoning processes. They developed a real-time monitoring system that can detect unsafe reasoning behaviors with over 84% accuracy, addressing vulnerabilities beyond traditional content safety measures.

AINeutralarXiv – CS AI · Mar 267/10
🧠

Mitigating Many-Shot Jailbreaking

Researchers have developed techniques to mitigate many-shot jailbreaking (MSJ) attacks on large language models, where attackers use numerous examples to override safety training. Combined fine-tuning and input sanitization approaches significantly reduce MSJ effectiveness while maintaining normal model performance.

AIBearisharXiv – CS AI · Mar 267/10
🧠

Enhancing Jailbreak Attacks on LLMs via Persona Prompts

Researchers developed a genetic algorithm-based method using persona prompts to exploit large language models, reducing refusal rates by 50-70% across multiple LLMs. The study reveals significant vulnerabilities in AI safety mechanisms and demonstrates how these attacks can be enhanced when combined with existing methods.

AIBearisharXiv – CS AI · Mar 267/10
🧠

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Researchers demonstrate that Claude Code AI agent can autonomously discover novel adversarial attack algorithms against large language models, achieving significantly higher success rates than existing methods. The discovered attacks achieve up to 40% success rate on CBRN queries and 100% attack success rate against Meta-SecAlign-70B, compared to much lower rates from traditional methods.

🧠 Claude
AIBullisharXiv – CS AI · Mar 177/10
🧠

SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

Researchers developed SFCoT (Safer Chain-of-Thought), a new framework that monitors and corrects AI reasoning steps in real-time to prevent jailbreak attacks. The system reduced attack success rates from 58.97% to 12.31% while maintaining general AI performance, addressing a critical vulnerability in current large language models.

AINeutralarXiv – CS AI · Mar 177/10
🧠

Accelerating Suffix Jailbreak attacks with Prefix-Shared KV-cache

Researchers developed Prefix-Shared KV Cache (PSKV), a new technique that accelerates jailbreak attacks on Large Language Models by 40% while reducing memory usage by 50%. The method optimizes the red-teaming process by sharing cached prefixes across multiple attack attempts, enabling more efficient parallel inference without compromising attack success rates.

AIBullisharXiv – CS AI · Mar 177/10
🧠

Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference

Researchers developed a new framework to remove backdoors from large language models without prior knowledge of triggers or clean reference models. The method uses an immunization-inspired approach that creates synthetic backdoored variants to identify and neutralize malicious components while preserving the model's generative capabilities.

AIBearisharXiv – CS AI · Mar 177/10
🧠

VisualLeakBench: Auditing the Fragility of Large Vision-Language Models against PII Leakage and Social Engineering

Researchers introduced VisualLeakBench, a new evaluation suite that tests Large Vision-Language Models (LVLMs) for vulnerabilities to privacy attacks through visual inputs. The study found significant weaknesses in frontier AI systems like GPT-5.2, Claude-4, Gemini-3 Flash, and Grok-4, with Claude-4 showing the highest PII leakage rate at 74.4% despite having strong OCR attack resistance.

🧠 GPT-5🧠 Claude🧠 Gemini
AINeutralarXiv – CS AI · Mar 177/10
🧠

TrinityGuard: A Unified Framework for Safeguarding Multi-Agent Systems

Researchers have introduced TrinityGuard, a comprehensive safety evaluation and monitoring framework for LLM-based multi-agent systems (MAS) that addresses emerging security risks beyond single agents. The framework identifies 20 risk types across three tiers and provides both pre-development evaluation and runtime monitoring capabilities.

AIBearisharXiv – CS AI · Mar 177/10
🧠

DECEIVE-AFC: Adversarial Claim Attacks against Search-Enabled LLM-based Fact-Checking Systems

Researchers developed DECEIVE-AFC, an adversarial attack framework that can significantly compromise AI-based fact-checking systems by manipulating claims to disrupt evidence retrieval and reasoning. The attacks reduced fact-checking accuracy from 78.7% to 53.7% in testing, highlighting major vulnerabilities in LLM-based verification systems.

AIBearisharXiv – CS AI · Mar 167/10
🧠

MalURLBench: A Benchmark Evaluating Agents' Vulnerabilities When Processing Web URLs

Researchers have released MalURLBench, the first benchmark to evaluate how LLM-based web agents handle malicious URLs, revealing significant vulnerabilities across 12 popular models. The study found that existing AI agents struggle to detect disguised malicious URLs and proposed URLGuard as a defensive solution.

AIBearisharXiv – CS AI · Mar 127/10
🧠

Amnesia: Adversarial Semantic Layer Specific Activation Steering in Large Language Models

Researchers have developed 'Amnesia,' a lightweight adversarial attack that bypasses safety mechanisms in open-weight Large Language Models by manipulating internal transformer states. The attack enables generation of harmful content without requiring fine-tuning or additional training, highlighting vulnerabilities in current LLM safety measures.

AIBearisharXiv – CS AI · Mar 127/10
🧠

Targeted Bit-Flip Attacks on LLM-Based Agents

Researchers have introduced Flip-Agent, the first targeted bit-flip attack framework specifically designed to exploit LLM-based agents by manipulating hardware faults. The attack can manipulate both final outputs and tool invocations in multi-stage AI agent pipelines, revealing critical security vulnerabilities in these systems.

AIBearisharXiv – CS AI · Mar 97/10
🧠

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Researchers have developed SAHA (Safety Attention Head Attack), a new jailbreak framework that exploits vulnerabilities in deeper attention layers of open-source large language models. The method improves attack success rates by 14% over existing techniques by targeting insufficiently aligned attention heads rather than surface-level prompts.

AIBearisharXiv – CS AI · Mar 97/10
🧠

Window-based Membership Inference Attacks Against Fine-tuned Large Language Models

Researchers developed WBC (Window-Based Comparison), a new membership inference attack method that significantly outperforms existing approaches by analyzing localized patterns in Large Language Models rather than global signals. The technique achieves 2-3 times better detection rates and exposes critical privacy vulnerabilities in fine-tuned LLMs through sliding window analysis and binary voting mechanisms.

AIBearisharXiv – CS AI · Mar 67/10
🧠

Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models

Researchers discovered a new vulnerability in multimodal large language models where specially crafted images can cause significant performance degradation by inducing numerical instability during inference. The attack method was validated on major vision-language models including LLaVa, Idefics3, and SmolVLM, showing substantial performance drops even with minimal image modifications.

← PrevPage 3 of 5Next →