AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠Researchers developed a secure-by-design AI framework combining PromptShield and CIAF to automate cloud security and forensic investigations while protecting against prompt injection attacks. The system achieved over 93% accuracy in classification tasks and enhanced ransomware detection in AWS and Azure environments.
AI · Bearish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers have identified 'role confusion' as the fundamental mechanism behind prompt injection attacks on language models, where models assign authority based on how text is written rather than its source. The study achieved 60-61% attack success rates across multiple models and found that internal role confusion strongly predicts attack success before generation begins.
AI · Neutral · OpenAI News · Mar 11 · 6/10
🧠The article discusses ChatGPT's defensive mechanisms against prompt injection attacks and social engineering attempts. It focuses on how the AI system constrains risky actions and protects sensitive data within agent workflows to maintain security and reliability.
🧠 ChatGPT
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠Researchers propose a four-layer Layered Governance Architecture (LGA) framework to address security vulnerabilities in autonomous AI agents powered by large language models. The system intercepts 96% of malicious activities, including prompt injection and tool misuse, while adding only 980 ms of latency.
🧠 GPT-4 🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 8
🧠Researchers propose a new safety framework for AI agents using Scala 3 with capture checking to prevent information leakage and malicious behaviors. The system creates a 'safety harness' that tracks capabilities through static type checking, allowing fine-grained control over agent actions while maintaining task performance.
AI · Bearish · arXiv – CS AI · Mar 3 · 7/10 · 7
🧠Researchers developed 'Reverse CAPTCHA,' a framework that tests how large language models respond to invisible Unicode-encoded instructions embedded in normal text. The study found that AI models can follow hidden instructions that humans cannot see, with tool use dramatically increasing compliance rates and different AI providers showing distinct preferences for encoding schemes.
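To make the idea of invisible instructions concrete, here is a minimal Python sketch of how such characters could be found, stripped, or decoded before text reaches a model. It assumes the hidden payload uses zero-width characters or Unicode tag characters (U+E0000 to U+E007F); the summary does not say which encoding schemes the study tested, and the function names are illustrative.

```python
import unicodedata

# Assumption: an illustrative, non-exhaustive set of zero-width code points.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def is_invisible(ch: str) -> bool:
    """True for zero-width chars, Unicode tag chars, and other format (Cf) chars."""
    code = ord(ch)
    if ch in ZERO_WIDTH:
        return True
    if 0xE0000 <= code <= 0xE007F:  # Unicode "Tags" block
        return True
    return unicodedata.category(ch) == "Cf"


def find_invisible(text: str) -> list[tuple[int, str]]:
    """Return (index, code point label) for every invisible character in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if is_invisible(ch)]


def strip_invisible(text: str) -> str:
    """Remove every character that is_invisible() flags."""
    return "".join(ch for ch in text if not is_invisible(ch))


def decode_tag_payload(text: str) -> str:
    """Map tag characters U+E0020..U+E007E back to the ASCII they shadow."""
    return "".join(
        chr(ord(ch) - 0xE0000) for ch in text if 0xE0020 <= ord(ch) <= 0xE007E
    )
```

Stripping every format character is a blunt filter: it also removes zero-width joiners and non-joiners that emoji sequences and some scripts rely on, so a real pipeline would likely flag suspicious input for review rather than rewrite it silently.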
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8
🧠Researchers introduce DualSentinel, a lightweight framework for detecting targeted attacks on large language models by identifying 'Entropy Lull' patterns: periods of abnormally low token-probability entropy that indicate an LLM is being coercively controlled. The system uses dual-check verification to detect backdoor and prompt injection attacks with near-zero false positives and minimal computational overhead.
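As a rough illustration of what a period of abnormally low token-probability entropy could look like in code, the Python sketch below computes Shannon entropy per generation step and reports runs of consecutive low-entropy steps. This is not DualSentinel's implementation: the `threshold` and `min_len` values are hypothetical, and the dual-check verification step is not reproduced.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def find_entropy_lulls(
    step_probs: Sequence[Sequence[float]],
    threshold: float = 0.1,  # hypothetical cutoff, not taken from the paper
    min_len: int = 3,        # hypothetical minimum run length
) -> list[tuple[int, int]]:
    """Return (start, end) spans, end exclusive, where at least min_len
    consecutive steps have entropy below threshold."""
    lulls = []
    run_start = None
    for i, probs in enumerate(step_probs):
        low = shannon_entropy(probs) < threshold
        if low and run_start is None:
            run_start = i  # a low-entropy run begins here
        elif not low and run_start is not None:
            if i - run_start >= min_len:
                lulls.append((run_start, i))
            run_start = None
    # Close a run that lasts until the final step.
    if run_start is not None and len(step_probs) - run_start >= min_len:
        lulls.append((run_start, len(step_probs)))
    return lulls
```

In practice the per-step distributions would come from the model's softmax output. A low-entropy run alone is weak evidence, since confident but benign completions also produce one, which is presumably why the framework pairs it with a second check.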
AI · Bearish · arXiv – CS AI · Feb 27 · 6/10 · 7
🧠Researchers evaluated prompt injection and jailbreak vulnerabilities across multiple open-source LLMs including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma. The study found significant behavioral variations across models and that lightweight defense mechanisms can be consistently bypassed by long, reasoning-heavy prompts.
AI · Neutral · OpenAI News · Feb 13 · 6/10 · 3
🧠OpenAI introduces new security features for ChatGPT including Lockdown Mode and Elevated Risk labels to help organizations protect against prompt injection attacks and AI-driven data exfiltration. These enterprise-focused security enhancements aim to address growing concerns about AI systems being exploited for malicious data access.
AI · Neutral · OpenAI News · Jan 28 · 6/10 · 5
🧠OpenAI has implemented safeguards to protect user data when AI agents interact with external links, addressing potential security vulnerabilities. The measures focus on preventing URL-based data exfiltration and prompt injection attacks that could compromise user information.
AI · Bearish · IEEE Spectrum – AI · Jan 21 · 6/10 · 5
🧠Large language models (LLMs) remain highly vulnerable to prompt injection attacks, in which specific phrasing can override safety guardrails and cause AI systems to perform forbidden actions or reveal sensitive information. Unlike humans, who rely on contextual judgment and layered defenses, current LLMs cannot assess situational appropriateness and cannot universally prevent such attacks.
AI · Neutral · OpenAI News · Dec 22 · 6/10 · 5
🧠OpenAI is implementing automated red teaming with reinforcement learning to protect ChatGPT Atlas from prompt injection attacks. This proactive security approach aims to discover and patch vulnerabilities early as AI systems become more autonomous and agentic.
AI · Neutral · OpenAI News · Dec 18 · 6/10 · 6
🧠OpenAI has released an addendum to its GPT-5.2 System Card specifically for GPT-5.2-Codex, detailing safety measures for the code-generating AI model. The document outlines model-level mitigations, including specialized safety training, and product-level protections such as agent sandboxing and configurable network access.
AI · Bearish · OpenAI News · Apr 19 · 6/10 · 5
🧠Large Language Models (LLMs) currently face significant security vulnerabilities from prompt injections and jailbreaks, where attackers can override the model's original instructions with malicious prompts. This highlights a critical weakness in current AI systems' ability to maintain instruction integrity and security.