#adversarial-prompts News & Analysis

5 articles tagged with #adversarial-prompts. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Researchers propose the Adversarial Prompt Disentanglement (APD) framework, a defense mechanism that identifies and neutralizes malicious components in LLM inputs before processing. The system combines semantic decomposition, graph-based intent classification, and transformer-based detection to reduce harmful outputs by over 85% while maintaining model performance.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Researchers present a framework for systematically generating, categorizing, and evaluating jailbreak attacks against large language models. It introduces a dataset of 114,000 adversarial prompts, automated generation methods, and a continuous evaluation metric (OPTIMUS) intended to go beyond binary success-rate measurement.

🏢 Perplexity

AI · Bearish · Apple Machine Learning · Mar 3 · 7/10 · 5

On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment

The research argues that efficiently filtering adversarial prompts and unsafe outputs from large language models may be fundamentally impossible. It identifies theoretical limits on separating intelligence from judgment in AI systems, pointing to intractable problems for content-filtering approaches to alignment.

AI · Bearish · arXiv – CS AI · Feb 27 · 7/10 · 7

Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Researchers developed CC-BOS, a framework that uses classical Chinese text to mount more effective jailbreak attacks on large language models. The method exploits the conciseness and obscurity of classical Chinese to bypass safety constraints, and uses bio-inspired optimization to generate adversarial prompts automatically.

AI · Bearish · arXiv – CS AI · Jun 1 · 6/10

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

Researchers demonstrate that toxic language in prompts significantly degrades the factual accuracy of large language models, even when the semantic content remains identical. By analyzing internal model activations, they find that toxicity amplifies perturbation-sensitive nodes while leaving core reasoning pathways relatively stable, revealing a vulnerability in LLM reliability.