y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#prompt-injection News & Analysis

64 articles tagged with #prompt-injection. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

64 articles
AIBearisharXiv – CS AI · Mar 166/10
🧠

Prompt Injection as Role Confusion

Researchers have identified 'role confusion' as the fundamental mechanism behind prompt injection attacks on language models, where models assign authority based on how text is written rather than its source. The study achieved 60-61% attack success rates across multiple models and found that internal role confusion strongly predicts attack success before generation begins.

AINeutralOpenAI News · Mar 116/10
🧠

Designing AI agents to resist prompt injection

The article discusses ChatGPT's defensive mechanisms against prompt injection attacks and social engineering attempts. It focuses on how the AI system constrains risky actions and protects sensitive data within agent workflows to maintain security and reliability.

🧠 ChatGPT
AIBullisharXiv – CS AI · Mar 36/108
🧠

Tracking Capabilities for Safer Agents

Researchers propose a new safety framework for AI agents using Scala 3 with capture checking to prevent information leakage and malicious behaviors. The system creates a 'safety harness' that tracks capabilities through static type checking, allowing fine-grained control over agent actions while maintaining task performance.

AIBearisharXiv – CS AI · Mar 37/107
🧠

Reverse CAPTCHA: Evaluating LLM Susceptibility to Invisible Unicode Instruction Injection

Researchers developed 'Reverse CAPTCHA,' a framework that tests how large language models respond to invisible Unicode-encoded instructions embedded in normal text. The study found that AI models can follow hidden instructions that humans cannot see, with tool use dramatically increasing compliance rates and different AI providers showing distinct preferences for encoding schemes.

AIBullisharXiv – CS AI · Mar 37/108
🧠

DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern

Researchers introduce DualSentinel, a lightweight framework for detecting targeted attacks on Large Language Models by identifying 'Entropy Lull' patterns - periods of abnormally low token probability entropy that indicate when LLMs are being coercively controlled. The system uses dual-check verification to accurately detect backdoor and prompt injection attacks with near-zero false positives while maintaining minimal computational overhead.

$NEAR
AIBearisharXiv – CS AI · Feb 276/107
🧠

Analysis of LLMs Against Prompt Injection and Jailbreak Attacks

Researchers evaluated prompt injection and jailbreak vulnerabilities across multiple open-source LLMs including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma. The study found significant behavioral variations across models and that lightweight defense mechanisms can be consistently bypassed by long, reasoning-heavy prompts.

AINeutralOpenAI News · Feb 136/103
🧠

Introducing Lockdown Mode and Elevated Risk labels in ChatGPT

OpenAI introduces new security features for ChatGPT including Lockdown Mode and Elevated Risk labels to help organizations protect against prompt injection attacks and AI-driven data exfiltration. These enterprise-focused security enhancements aim to address growing concerns about AI systems being exploited for malicious data access.

AINeutralOpenAI News · Jan 286/105
🧠

Keeping your data safe when an AI agent clicks a link

OpenAI has implemented safeguards to protect user data when AI agents interact with external links, addressing potential security vulnerabilities. The measures focus on preventing URL-based data exfiltration and prompt injection attacks that could compromise user information.

$LINK
AIBearishIEEE Spectrum – AI · Jan 216/105
🧠

Why AI Keeps Falling for Prompt Injection Attacks

Large language models (LLMs) remain highly vulnerable to prompt injection attacks where specific phrasing can override safety guardrails, causing AI systems to perform forbidden actions or reveal sensitive information. Unlike humans who use contextual judgment and layered defenses, current LLMs lack the ability to assess situational appropriateness and cannot universally prevent such attacks.

AINeutralOpenAI News · Dec 226/105
🧠

Continuously hardening ChatGPT Atlas against prompt injection

OpenAI is implementing automated red teaming with reinforcement learning to protect ChatGPT Atlas from prompt injection attacks. This proactive security approach aims to discover and patch vulnerabilities early as AI systems become more autonomous and agentic.

AINeutralOpenAI News · Dec 186/106
🧠

Addendum to GPT-5.2 System Card: GPT-5.2-Codex

OpenAI has released an addendum to their GPT-5.2 System Card specifically for GPT-5.2-Codex, detailing comprehensive safety measures for the code-generating AI model. The document outlines both model-level mitigations including specialized safety training and product-level protections like agent sandboxing and configurable network access.

AIBearishOpenAI News · Apr 196/105
🧠

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Large Language Models (LLMs) currently face significant security vulnerabilities from prompt injections and jailbreaks, where attackers can override the model's original instructions with malicious prompts. This highlights a critical weakness in current AI systems' ability to maintain instruction integrity and security.

← PrevPage 3 of 3