AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers demonstrate that malicious agents within multi-agent LLM consensus systems can effectively disrupt agreement formation through sophisticated insider attacks. Using reinforcement learning trained on surrogate world models, attackers significantly reduce consensus rates among benign agents, revealing a critical vulnerability in decentralized AI systems that assume participant alignment.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce PRISM, a real-time defense system that detects and prevents credential leakage in multi-agent LLM pipelines by monitoring generation dynamics at the token level. The system achieves 83.2% F1 score with perfect precision, eliminating observed leakage while maintaining output quality across adversarial benchmarks.
AI · Neutral · arXiv – CS AI · May 11 · 7/10
🧠Researchers propose Agent-BOM, a unified graph-based representation system for auditing the security of LLM-based autonomous agents. The framework addresses critical gaps in existing audit mechanisms by tracking both static capabilities and dynamic runtime states, enabling detection of complex attack chains across multi-agent systems.
AI · Bearish · arXiv – CS AI · May 11 · 7/10
🧠Researchers systematically decomposed Reinforcement Learning-based jailbreaking attacks on large language models, identifying that dense reward functions and extended episode lengths are primary drivers of adversarial success. The study reveals all tested models and safeguards were compromised, providing critical insights for both attack efficiency and defensive hardening strategies.
AI · Bearish · arXiv – CS AI · May 11 · 7/10
🧠Researchers developed a search-based framework to identify privacy vulnerabilities in LLM-based agents through simulated multi-turn interactions. The study reveals that malicious agents employ sophisticated tactics like impersonation and consent forgery to extract sensitive information, while defenses evolve into robust identity-verification systems, with findings generalizing across diverse scenarios and models.
AI · Bearish · arXiv – CS AI · May 9 · 7/10
🧠A comprehensive measurement study reveals that large language models frequently specify vulnerable and incompatible library versions in generated Python code, with 36.70%-55.70% of tasks containing known CVEs and 62.75%-74.51% rated as Critical or High severity. The research demonstrates this represents a systemic bias across all evaluated models rather than isolated errors, with most CVEs publicly disclosed before the models' knowledge cutoffs.
AI · Neutral · arXiv – CS AI · May 7 · 7/10
🧠Researchers introduce Security Cube, a comprehensive evaluation framework for assessing Large Language Model robustness against jailbreak attacks. The study systematically catalogs existing attack and defense methods while establishing benchmarks across 13 attack vectors and 5 defense mechanisms, revealing critical gaps in current LLM safety practices.
AI · Bearish · arXiv – CS AI · May 7 · 7/10
🧠Researchers demonstrate that LLM-based vulnerability detectors, increasingly used in software security pipelines, can be evaded through syntax-preserving code transformations. The study reveals that models with 70%+ accuracy on clean code can fail to detect 87%+ of vulnerabilities when subjected to minor edits, with adversarial attacks achieving up to 92.5% evasion rates. These results raise serious questions about the reliability of AI-driven security tools in production environments.
🧠 GPT-4
AI · Bearish · arXiv – CS AI · May 4 · 7/10
🧠Researchers demonstrate that Large Language Models used in AI search overview systems are vulnerable to bias manipulation through reinforcement learning-optimized snippet rewriting. The study reveals that adversaries can exploit LLM biases to influence search result rankings and generate inaccurate or harmful information, posing significant security risks to AI-powered search applications.
AI · Bearish · arXiv – CS AI · May 4 · 7/10
🧠Researchers found that advanced jailbreaks against large language models impose minimal performance degradation on the most capable models, with frontier models like Claude Opus 4.6 losing only 7.7% of benchmark performance when compromised. This challenges the assumption that safety mechanisms inherently trade off capability, raising concerns that safety strategies relying on performance degradation are insufficient for protecting frontier AI systems.
🧠 Claude · 🧠 Haiku · 🧠 Opus
AI · Bullish · arXiv – CS AI · May 4 · 7/10
🧠Researchers introduce Sentra-Guard, a real-time defense system that detects and mitigates jailbreak and prompt injection attacks on large language models with 99.96% accuracy. The multilingual framework combines FAISS-indexed semantic embeddings with fine-tuned transformers and human-in-the-loop feedback, significantly outperforming existing defenses like LlamaGuard-2 and OpenAI Moderation.
🏢 OpenAI
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠Researchers present an end-to-end LLM framework that automates Security Operations Center (SOC) workflows by combining ensemble-based threat detection, syntax-constrained query generation, and retrieval-augmented resolution support. The system reduces incident triage time from hours to under 10 minutes while achieving 82.8% detection accuracy and improving resolution prediction from 78.3% to 90.0%.
AI · Bearish · arXiv – CS AI · May 1 · 7/10
🧠Researchers present the first comprehensive threat modeling of LLM-enabled robotic systems, mapping three categories of attacks (cyber, adversarial, and conversational) across the perception-planning-actuation pipeline. The analysis reveals critical architectural vulnerabilities where compromised inputs or unsafe model outputs can propagate to unsafe physical actions without proper validation boundaries.
AI · Bearish · arXiv – CS AI · May 1 · 7/10
🧠A comprehensive academic survey examines security vulnerabilities and defense mechanisms across four operational layers of autonomous agent frameworks built on large language models. The research identifies how threats propagate across layers, from input manipulation through unsafe actions to ecosystem-level impacts, and highlights critical gaps in current security approaches as these systems become increasingly complex and integrated.
AI · Neutral · arXiv – CS AI · May 1 · 7/10
🧠Researchers demonstrate that multi-turn prompt injection attacks leave detectable signatures in language model activation patterns, achieving 93.8% detection accuracy through analysis of residual stream trajectories. The approach reveals that adversarial attack sequences exhibit distinctive 'restlessness' patterns across model architectures, though detection effectiveness varies significantly when deployed on real-world data.
AI · Bearish · arXiv – CS AI · May 1 · 7/10
🧠Researchers introduce the first benchmark for detecting machine-generated text that imitates personal writing styles, revealing that state-of-the-art detectors fail significantly when LLMs personalize their output. The study identifies a 'feature-inversion trap' where detection features become unreliable in personalized contexts and proposes a method to predict detector performance degradation with 85% accuracy.
AI · Bearish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers introduce CREST-Search, a red-teaming framework that exposes vulnerabilities in web-augmented LLMs by crafting benign-seeming queries designed to trigger unsafe citations from the internet. The study reveals that integrating web search into language models creates new safety risks beyond traditional LLM harms, requiring specialized defensive strategies.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers propose Safe-FedLLM, a defense framework addressing security vulnerabilities in federated large language model training by detecting malicious clients through analysis of LoRA update patterns. The lightweight classifier-based approach effectively mitigates attacks while maintaining model performance and training efficiency, representing a significant advancement in securing distributed LLM development.
AI · Bearish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce TEMPLATEFUZZ, a fuzzing framework that systematically exploits vulnerabilities in LLM chat templates, a previously overlooked attack surface. The method achieves 98.2% jailbreak success rates on open-source models and 90% on commercial LLMs, significantly outperforming existing prompt injection techniques while revealing critical security gaps in production AI systems.
AI · Bearish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers have identified critical vulnerabilities in mobile GUI agents powered by large language models, revealing that third-party content in real-world apps causes these agents to fail significantly more often than benchmark tests suggest. Testing on 122 dynamic tasks and over 3,000 static scenarios shows misleading rates of 36-42%, raising serious concerns about deploying these agents in commercial settings.
AI · Bearish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers have developed Adaptive Stealing (AS), a novel watermark stealing algorithm that exploits vulnerabilities in LLM watermarking systems by dynamically selecting optimal attack strategies based on contextual token states. This advancement demonstrates that existing fixed-strategy watermark defenses are insufficient, highlighting critical security gaps in protecting proprietary LLM services and raising urgent questions about watermark robustness.
AI · Bearish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers have identified a novel jailbreaking vulnerability in LLMs called 'Salami Slicing Risk,' where attackers chain multiple low-risk inputs that individually bypass safety measures but cumulatively trigger harmful outputs. The Salami Attack framework demonstrates over 90% success rates against GPT-4o and Gemini, highlighting a critical gap in current multi-turn defense mechanisms that assume individual requests are adequately monitored.
🧠 GPT-4 · 🧠 Gemini
AI · Bearish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers have discovered a critical vulnerability in Reinforcement Learning with Verifiable Rewards (RLVR), an emerging training paradigm that enhances LLM reasoning abilities. By injecting less than 2% poisoned data into training sets, attackers can implant backdoors that degrade safety performance by 73% when triggered, without modifying the reward verifier itself.
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce ClawGuard, a runtime security framework that protects tool-augmented LLM agents from indirect prompt injection attacks by enforcing user-confirmed rules at tool-call boundaries. The framework blocks malicious instructions embedded in tool responses without requiring model modifications, demonstrating robust protection across multiple state-of-the-art language models.
AI · Bearish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers have developed ADAM, a novel privacy attack that exploits vulnerabilities in Large Language Model agents' memory systems through adaptive querying, achieving up to 100% success rates in extracting sensitive information. The attack highlights critical security gaps in modern LLM-based systems that rely on memory modules and retrieval-augmented generation, underscoring the urgent need for privacy-preserving safeguards.