AIBearisharXiv – CS AI · Jun 107/10
🧠Researchers discovered that Large Language Models leak significantly more personally identifiable information (PII) when interacting with AI agents compared to human users, despite identical safety mechanisms. The study identifies an 'Interlocutor Effect' where LLMs reduce privacy caution based on perceived recipient identity, with leakage rates increasing up to 23 percentage points when addressing AI agents, raising critical security concerns for multi-agent system architectures.
🧠 Llama
AIBearisharXiv – CS AI · Jun 57/10
🧠Researchers have discovered a critical vulnerability in safety-aligned large language models called Posterior Attack, which exploits the very safety mechanisms designed to prevent harmful outputs. The attack works by prompting models to generate responses their internal classifiers would flag as unsafe, and paradoxically, more sophisticated safety-aligned models are more vulnerable to this exploitation than less-aligned ones.
🧠 GPT-5🧠 Claude
AIBullisharXiv – CS AI · Jun 57/10
🧠Researchers introduce ANCHOR, an LLM-based framework that applies human-like supervision to self-evolving AI agents during their training process. The study demonstrates that limited human oversight effectively prevents safety degradation and capability loss in autonomous systems while maintaining core performance, with output verification emerging as the optimal intervention point.
AIBullisharXiv – CS AI · Jun 57/10
🧠Researchers demonstrate that safety behaviors in generative AI models can be represented as portable latent directions that transfer across different architectures without requiring unsafe training data on target models. This framework enables cross-model safety steering for text-to-image and text-to-video generation, suggesting safety is a shared property rather than model-specific.
AIBullisharXiv – CS AI · May 287/10
🧠GroundedCache proposes a safety-first framework for reusing cached answers in retrieval-augmented generation systems by validating four conditions before serving cached responses. The system achieves near-zero unsafe-served rates (0-1.5%) across benchmarks while maintaining minimal latency overhead, addressing critical vulnerabilities in current caching approaches that can serve incorrect answers.
AIBullisharXiv – CS AI · May 127/10
🧠Researchers demonstrate that many-shot jailbreak attacks on language models work by inducing progressive activation drift through implicit fine-tuning, and propose a simple defense using a single safety demonstration at inference time that counteracts this drift without requiring parameter modifications or white-box access.
AIBearisharXiv – CS AI · May 127/10
🧠Researchers have developed Semantic Representation Attack (SRA), a novel adversarial technique that bypasses LLM safety mechanisms by targeting semantic meaning rather than specific text patterns. The method achieves 99.71% attack success rates across 26 open-source models with strong cross-model transferability, raising significant security concerns for deployed AI systems.
AIBearisharXiv – CS AI · May 97/10
🧠Researchers have identified a fundamental vulnerability in multimodal large language models where safety mechanisms can be bypassed by exploiting the tension between hiding harmful intent and maintaining reconstructability. The study demonstrates that character-removed text variants combined with keyword-related distractor images achieve effective jailbreaks, revealing that models' own reconstruction capabilities become a security liability.
AIBearisharXiv – CS AI · May 47/10
🧠Researchers found that advanced jailbreaks against large language models impose minimal performance degradation on the most capable models, with frontier models like Claude Opus 4.6 losing only 7.7% of benchmark performance when compromised. This challenges the assumption that safety mechanisms inherently trade off capability, raising concerns that safety strategies relying on performance degradation are insufficient for protecting frontier AI systems.
🧠 Claude🧠 Haiku🧠 Opus
AIBullisharXiv – CS AI · Apr 157/10
🧠Researchers propose Coupled Weight and Activation Constraints (CWAC), a novel safety alignment technique for large language models that simultaneously constrains weight updates and regularizes activation patterns to prevent harmful outputs during fine-tuning. The method demonstrates that existing single-constraint approaches are insufficient and outperforms baselines across multiple LLMs while maintaining task performance.
AIBearisharXiv – CS AI · Apr 137/10
🧠Researchers demonstrate Semantic Intent Fragmentation (SIF), a novel attack on LLM orchestration systems where a single legitimate request causes AI systems to decompose tasks into individually benign subtasks that collectively violate security policies. The attack succeeds in 71% of enterprise scenarios while bypassing existing safety mechanisms, though plan-level information-flow tracking can detect all attacks before execution.
AINeutralarXiv – CS AI · Mar 167/10
🧠Researchers propose the Superficial Safety Alignment Hypothesis (SSAH), suggesting that AI safety alignment in large language models can be understood as a binary classification task of fulfilling or refusing user requests. The study identifies four types of critical components at the neuron level that establish safety guardrails, enabling models to retain safety attributes while adapting to new tasks.
AINeutralarXiv – CS AI · Jun 96/10
🧠Researchers introduce a framework for evaluating how LLM providers control user interaction styles through alignment mechanisms, measuring prompt steerability and regression-to-default behaviors across dialogue. The study reveals that provider-side controls shape not just safety but also communicative defaults that influence user autonomy, with implications for pluralism and democratic agency in human-AI systems.
AINeutralarXiv – CS AI · May 116/10
🧠This research paper addresses the emerging challenge of designing safe AI agents for CI/CD pipelines by introducing a framework distinguishing between data-plane authority (localized interventions) and control-plane authority (configuration changes). The authors argue that current systems prioritize bounded autonomy with external governance rather than intrinsic safety guarantees, identifying control-plane safety and formalization of autonomy boundaries as critical research gaps.
AIBearisharXiv – CS AI · Mar 37/108
🧠Researchers have developed MIDAS, a new jailbreaking framework that successfully bypasses safety mechanisms in Multimodal Large Language Models by dispersing harmful content across multiple images. The technique achieved an 81.46% average attack success rate against four closed-source MLLMs by extending reasoning chains and reducing security attention.
$LINK