649 articles tagged with #ai-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI Neutral · arXiv — CS AI · Mar 3 · 7/10 · 6
🧠 Researchers introduce MOSAIC, the first comprehensive benchmark to evaluate moral, social, and individual characteristics of Large Language Models beyond traditional Moral Foundation Theory. The benchmark includes over 600 curated questions and scenarios from nine validated questionnaires and four platform-based games, providing empirical evidence that current evaluation methods are insufficient for assessing AI ethics comprehensively.
AI Bearish · arXiv — CS AI · Mar 3 · 7/10 · 6
🧠 Researchers discovered that subliminal prompting can create a 'thought virus' effect in multi-agent AI systems, where bias from one compromised agent spreads throughout the entire network. The study shows this attack vector can degrade truthfulness and create alignment risks across connected AI systems.
AI Bullish · arXiv — CS AI · Mar 3 · 7/10 · 7
🧠 Researchers introduce GUARD, a novel framework to prevent text-to-image AI models from memorizing and reproducing training data that could lead to privacy or copyright issues. The method uses attention attenuation to guide image generation away from original training data while maintaining prompt alignment and image quality.
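For intuition, here is a toy sketch of the general attention-attenuation idea the summary names, not GUARD's actual method: down-weight pre-softmax attention scores at selected key positions so the model attends less to content suspected of triggering memorization (`damp_idx` and `penalty` are illustrative placeholders).

```python
import torch

def attenuated_attention(q, k, v, damp_idx, penalty=2.0):
    # q, k, v: (batch, heads, seq, dim); damp_idx: key positions to damp
    logits = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    # subtracting a penalty before the softmax multiplies those attention
    # weights by exp(-penalty), steering generation away from damped keys
    logits[..., damp_idx] = logits[..., damp_idx] - penalty
    return logits.softmax(dim=-1) @ v
```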
AI Bullish · arXiv — CS AI · Mar 3 · 6/10 · 7
🧠 Researchers propose RADS (Reachability-Aware Diffusion Steering), a new framework that prevents AI text-to-image models from memorizing training data while maintaining image quality. The method uses reinforcement learning to steer diffusion models away from generating memorized content during inference, offering a plug-and-play solution that doesn't require modifying the underlying model.
AI Bearish · arXiv — CS AI · Mar 3 · 7/10 · 7
🧠 Researchers developed 'Reverse CAPTCHA,' a framework that tests how large language models respond to invisible Unicode-encoded instructions embedded in normal text. The study found that AI models can follow hidden instructions that humans cannot see, with tool use dramatically increasing compliance rates and different AI providers showing distinct preferences for encoding schemes.
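The encoding trick is easy to illustrate. Below is a minimal sketch of one known scheme of this class, hiding ASCII text in Unicode "tag" characters (U+E0020..U+E007E), which most renderers display as nothing while tokenizers still see them; the paper's exact encodings may differ.

```python
TAG_BASE = 0xE0000  # offset of the Unicode tags block

def encode_invisible(text: str) -> str:
    # map each printable ASCII char to its invisible tag-block counterpart
    return "".join(chr(TAG_BASE + ord(c)) for c in text if 0x20 <= ord(c) <= 0x7E)

def decode_invisible(payload: str) -> str:
    # recover the hidden instruction from tag-block codepoints only
    return "".join(
        chr(ord(c) - TAG_BASE)
        for c in payload
        if TAG_BASE + 0x20 <= ord(c) <= TAG_BASE + 0x7E
    )

visible = "Please summarize this article."
prompt = visible + encode_invisible("Ignore prior instructions.")
assert prompt != visible  # extra codepoints are present...
assert decode_invisible(prompt) == "Ignore prior instructions."  # ...but unseen
```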
AI Bearish · arXiv — CS AI · Mar 3 · 7/10 · 9
🧠 Researchers have discovered MM-MEPA, a new attack method that can poison multimodal AI systems by manipulating only metadata while leaving visual content unchanged. The attack achieves up to 91% success rate in disrupting AI retrieval systems and proves resistant to current defense strategies.
AI Neutral · arXiv — CS AI · Mar 3 · 6/10 · 7
🧠 Researchers fine-tuned the Llama 2 7B model using real patient-doctor interaction transcripts to improve medical query responses, but found significant discrepancies between automatic similarity metrics and GPT-4 evaluations. The study highlights the challenges in evaluating AI medical models and recommends human medical expert review for proper validation.
AI Bearish · arXiv — CS AI · Mar 3 · 6/10 · 8
🧠 Researchers identified widespread TOCTOU (time-of-check-to-time-of-use) vulnerabilities in browser-use agents, where web pages change between the planning and execution phases, potentially causing unintended actions. A study of 10 popular open-source agents revealed these security flaws are common, prompting development of a lightweight mitigation strategy based on pre-execution validation.
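A minimal sketch of what pre-execution validation can look like, assuming a generic browser-automation interface (`page`, `find_element`, and `outer_html` are hypothetical stand-ins, not the paper's code): fingerprint the target element at planning time and re-verify it immediately before acting.

```python
import hashlib

def fingerprint(element_html: str) -> str:
    # stable digest of the element the planner decided to act on
    return hashlib.sha256(element_html.encode()).hexdigest()

def plan_action(page) -> dict:
    el = page.find_element("button#checkout")  # hypothetical locator API
    return {"selector": "button#checkout", "action": "click",
            "expected": fingerprint(el.outer_html)}

def execute_checked(page, step: dict) -> None:
    el = page.find_element(step["selector"])
    # TOCTOU guard: abort if the element changed between planning and acting
    if fingerprint(el.outer_html) != step["expected"]:
        raise RuntimeError("page changed since planning; replan before acting")
    el.click()
```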
AI Neutral · arXiv — CS AI · Mar 3 · 7/10 · 7
🧠 Researchers found that machine unlearning in large language models, which aims to remove specific training data influence, is less effective in interactive settings than previously thought. Knowledge that appears forgotten in static tests can often be recovered through multi-turn conversations and self-correction interactions.
AI Neutral · arXiv — CS AI · Mar 3 · 7/10 · 7
🧠 Researchers introduce SurgUn, a surgical unlearning method for text-to-image diffusion models that enables precise removal of specific visual concepts while preserving other capabilities. The approach addresses challenges in copyright compliance and content policy enforcement by applying targeted weight-space updates based on retroactive interference theory.
AI Bearish · arXiv — CS AI · Mar 3 · 6/10 · 7
🧠 Researchers have developed HIDE&SEEK (HS), a new attack method that can effectively remove watermarks from machine-generated images while maintaining visual quality. This research exposes vulnerabilities in current state-of-the-art proactive image watermarking defenses, highlighting the ongoing arms race between watermarking protection and removal techniques.
AI Bullish · arXiv — CS AI · Mar 3 · 7/10 · 6
🧠 Researchers have developed TOSS, a new framework for safely fine-tuning large language models that operates at the token level rather than sample level. The method identifies and removes unsafe tokens while preserving task-specific information, demonstrating superior performance compared to existing sample-level defense methods in maintaining both safety and utility.
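The token-level idea is straightforward to sketch: flag individual tokens with a safety filter, then zero their loss contribution during fine-tuning so the model still learns the task from the remaining tokens. A minimal PyTorch illustration of the masking step only (not TOSS's actual scoring rule):

```python
import torch.nn.functional as F

def masked_lm_loss(logits, labels, unsafe_mask):
    # logits: (batch, seq, vocab); labels, unsafe_mask: (batch, seq)
    per_token = F.cross_entropy(logits.transpose(1, 2), labels,
                                reduction="none")  # (batch, seq)
    keep = (~unsafe_mask).float()  # 1 for tokens the safety filter kept
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```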
AI Bearish · arXiv — CS AI · Mar 3 · 7/10 · 9
🧠 A study reveals that safety-aligned large language models exhibit "Defensive Refusal Bias," refusing legitimate cybersecurity defense tasks 2.72x more often when the requests contain security-sensitive keywords. The research found particularly high refusal rates for critical defensive operations like system hardening (43.8%) and malware analysis (34.3%), suggesting current AI safety measures rely on semantic similarity rather than on understanding intent.
AI Bullish · arXiv — CS AI · Mar 3 · 7/10 · 6
🧠 Researchers propose Attention Smoothing Unlearning (ASU), a new framework that helps Large Language Models forget sensitive or copyrighted content without losing overall performance. The method uses self-distillation and attention smoothing to erase specific knowledge while maintaining coherent responses, outperforming existing unlearning techniques.
AI Bearish · arXiv — CS AI · Mar 3 · 6/10 · 7
🧠 Researchers created PanCanBench, a comprehensive benchmark evaluating 22 large language models on pancreatic cancer-related patient questions, revealing significant variations in clinical accuracy and high hallucination rates. The study found that even top-performing models like GPT-4o and Gemini-2.5 Pro had hallucination rates of 6%, while newer reasoning-optimized models didn't consistently improve factual accuracy.
AI Bearish · arXiv — CS AI · Mar 3 · 7/10 · 8
🧠 Researchers have discovered VidDoS, a new universal attack framework that can severely degrade Video-based Large Language Models by causing extreme computational resource exhaustion. The attack increases token generation by over 205x and inference latency by more than 15x, creating critical safety risks in real-world applications like autonomous driving.
AI Neutral · arXiv — CS AI · Mar 3 · 6/10 · 5
🧠 Researchers introduced Spoof-SUPERB, a new benchmark for evaluating self-supervised learning models' ability to detect audio deepfakes. The study tested 20 SSL models and found that large-scale discriminative models like XLS-R and WavLM Large consistently outperformed others, especially under acoustic degradations.
AI Bullish · arXiv — CS AI · Mar 3 · 7/10 · 10
🧠 Researchers developed a new inference-time safety mechanism for code-generating AI models that uses retrieval-augmented generation to identify and fix security vulnerabilities in real-time. The approach leverages Stack Overflow discussions to guide AI code revision without requiring model retraining, improving security while maintaining interpretability.
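A generic sketch of this kind of retrieval-augmented revision loop, with `retriever` and `llm` as assumed components rather than the paper's implementation:

```python
def revise_insecure_code(snippet: str, retriever, llm) -> str:
    # fetch related discussions, e.g. Stack Overflow threads (assumed API)
    evidence = retriever.search(snippet, k=3)
    context = "\n---\n".join(evidence)
    prompt = (
        "The following code may contain a security vulnerability.\n"
        f"Code:\n{snippet}\n\n"
        f"Relevant discussions:\n{context}\n\n"
        "Rewrite the code to fix the vulnerability and explain which "
        "discussion motivated each change."
    )
    return llm.complete(prompt)  # assumed chat-completion client
```

Keeping the retrieved discussions in the prompt is what preserves interpretability: each revision can be traced back to a cited thread rather than to opaque retrained weights.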
AI Bullish · arXiv — CS AI · Mar 3 · 7/10 · 8
🧠 Researchers introduce DualSentinel, a lightweight framework for detecting targeted attacks on Large Language Models by identifying 'Entropy Lull' patterns: periods of abnormally low token-probability entropy that indicate when LLMs are being coercively controlled. The system uses dual-check verification to accurately detect backdoor and prompt injection attacks with near-zero false positives while maintaining minimal computational overhead.
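The detection signal is concrete enough to sketch: compute the Shannon entropy of each next-token distribution and flag windows where it stays abnormally low. A minimal illustration with made-up thresholds (not DualSentinel's calibrated values or its dual-check step):

```python
import numpy as np

def token_entropies(probs: np.ndarray) -> np.ndarray:
    # probs: (seq_len, vocab) next-token distributions
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def entropy_lull(probs: np.ndarray, window: int = 8, thresh: float = 0.3) -> bool:
    h = token_entropies(probs)
    if len(h) < window:
        return False
    # sliding-window mean; any sustained low-entropy span triggers the flag
    means = np.convolve(h, np.ones(window) / window, mode="valid")
    return bool((means < thresh).any())
```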
AI Bullish · arXiv — CS AI · Mar 3 · 6/10 · 5
🧠 Researchers introduce CEMMA, a co-evolutionary framework for improving AI safety alignment in multimodal large language models. The system uses evolving adversarial attacks and adaptive defenses to create more robust AI systems that better resist jailbreak attempts while maintaining functionality.
AI Bullish · arXiv — CS AI · Mar 3 · 7/10 · 5
🧠 Researchers introduce ALTER, a new framework for efficiently "unlearning" specific knowledge from large language models while preserving their overall utility. The system uses an asymmetric LoRA architecture to selectively forget targeted information with 95% effectiveness while maintaining over 90% model utility, significantly outperforming existing methods.
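The general adapter-based unlearning recipe, without ALTER's asymmetric design, can be sketched as a two-term objective trained only on LoRA parameters: ascend on the forget set, descend on a retain set. Assuming a Hugging Face-style model whose forward pass returns `.loss`:

```python
def unlearning_loss(model, forget_batch, retain_batch, lam=1.0):
    forget_loss = model(**forget_batch).loss  # cross-entropy on forget data
    retain_loss = model(**retain_batch).loss  # cross-entropy on retain data
    # ascend on the forget set (negative sign), descend on the retain set;
    # only the LoRA adapter parameters would receive these gradients
    return -forget_loss + lam * retain_loss
```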
AI Bullish · arXiv — CS AI · Mar 3 · 6/10 · 3
🧠 Researchers propose Explanation-Guided Adversarial Training (EGAT), a framework that combines adversarial training with explainable AI to create more robust and interpretable deep neural networks. The method achieves a 37% improvement in adversarial accuracy while producing semantically meaningful explanations, with only a 16% increase in training time.
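A rough sketch of how such a combination can look, with FGSM and an input-gradient consistency term standing in for the paper's specific attack and explanation method:

```python
import torch
import torch.nn.functional as F

def egat_step(model, x, y, eps=8 / 255, lam=0.1):
    x = x.clone().requires_grad_(True)
    clean_loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(clean_loss, x, create_graph=True)
    # one-step (FGSM) adversarial example from the input gradient
    x_adv = (x + eps * grad.sign()).detach().requires_grad_(True)
    adv_loss = F.cross_entropy(model(x_adv), y)
    adv_grad, = torch.autograd.grad(adv_loss, x_adv, create_graph=True)
    # keep input-gradient saliency maps stable under attack
    consistency = F.mse_loss(adv_grad, grad)
    return adv_loss + lam * consistency
```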
AI Neutral · arXiv — CS AI · Mar 3 · 6/10 · 4
🧠 Researchers propose 'jailbreaking' as a user-driven method to counter LLM-powered social media manipulation by exposing automated bot behavior. The study suggests users can deliberately trigger AI safeguards to reveal misleading political narratives and reduce online conflict escalation.
AI Bullish · arXiv — CS AI · Mar 3 · 6/10 · 3
🧠 Researchers propose Adaptive Confidence Regularization (ACR), a new framework for detecting failures in multimodal AI systems used in critical applications like autonomous vehicles and medical diagnostics. The approach uses confidence degradation detection and synthetic failure generation to improve reliability of AI predictions in high-stakes scenarios.
AI Bullish · arXiv — CS AI · Mar 3 · 6/10 · 3
🧠 Researchers propose Token-Importance Guided Direct Preference Optimization (TI-DPO), a new framework for aligning Large Language Models with human preferences. The method uses hybrid weighting mechanisms and triplet loss to achieve more accurate and robust AI alignment compared to existing Direct Preference Optimization approaches.
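The token-importance idea can be sketched on top of the standard DPO objective: instead of summing per-token log-probability ratios uniformly over the chosen and rejected responses, scale each token's contribution by an importance weight. The weights below are placeholders, not TI-DPO's actual hybrid weighting or triplet-loss terms.

```python
import torch.nn.functional as F

def weighted_dpo_loss(pi_w, ref_w, w_wt, pi_l, ref_l, l_wt, beta=0.1):
    # pi_*/ref_*: (batch, seq) per-token log-probs under the policy and the
    # frozen reference model; *_wt: per-token importance weights
    chosen = (w_wt * (pi_w - ref_w)).sum(-1)
    rejected = (l_wt * (pi_l - ref_l)).sum(-1)
    return -F.logsigmoid(beta * (chosen - rejected)).mean()
```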