#machine-learning-safety News & Analysis

7 articles tagged with #machine-learning-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

Geometry over Density: Few-Shot Cross-Domain OOD Detection

Researchers introduce UFCOD, a novel framework that enables out-of-distribution detection across arbitrary domains using a single pre-trained diffusion model and minimal inference-time samples. The approach achieves 93.7% average AUROC on cross-domain benchmarks with approximately 500× better sample efficiency than existing methods, requiring only ~100 unlabeled samples rather than 50k-163k training samples.
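For readers unfamiliar with the AUROC figure quoted above, here is a minimal sketch of how the metric can be computed from detector scores. It is not taken from the UFCOD paper; the function name `auroc` and the convention that higher scores mean "more likely out-of-distribution" are assumptions made for illustration.

```python
def auroc(in_scores, ood_scores):
    """AUROC as the probability that a random OOD sample receives a higher
    score than a random in-distribution sample (ties count as half).

    Assumption for this sketch: higher score = more likely OOD.
    """
    wins = 0.0
    for o in ood_scores:
        for i in in_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(in_scores) * len(ood_scores))


# Three of the four (OOD, in-distribution) pairs are ranked correctly.
print(auroc([0.1, 0.4], [0.3, 0.5]))  # 0.75
```

An AUROC of 1.0 means every OOD sample outranks every in-distribution sample; 0.5 corresponds to chance-level ranking.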

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

Researchers introduce SafeAdapt, a novel framework for updating reinforcement learning policies while maintaining provable safety guarantees across changing environments. The approach uses a 'Rashomon set' to identify safe parameter regions and projects policy updates onto this certified space, addressing the critical challenge of deploying RL agents in safety-critical applications where dynamics and objectives evolve over time.
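The idea of projecting an update onto a certified region can be illustrated with a toy sketch. This is not SafeAdapt's algorithm: it assumes, purely for illustration, that the safe region is an axis-aligned box given by hypothetical per-parameter bounds `lower` and `upper`, in which case the Euclidean projection reduces to clipping.

```python
def project_onto_box(params, lower, upper):
    """Euclidean projection onto an axis-aligned box: clip each coordinate.

    Assumption: the certified-safe region is a box; a real certified set
    may have a very different shape.
    """
    return [min(max(p, lo), hi) for p, lo, hi in zip(params, lower, upper)]


def safe_update(params, grad, lr, lower, upper):
    """Take one gradient step, then project the result back into the box."""
    proposed = [p - lr * g for p, g in zip(params, grad)]
    return project_onto_box(proposed, lower, upper)


# The first coordinate would leave the box and is clipped to its bound.
print(safe_update([0.0, 0.0], [-10.0, 1.0], 0.1, [-0.5, -0.5], [0.5, 0.5]))
```

The pattern is "update, then project": the learning step is unconstrained, and the projection guarantees that the returned parameters lie inside the assumed safe region.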

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Safe-RULE: Safe Reinforcement UnLEarning

Researchers propose Safe-RULE, a new reinforcement unlearning framework designed to defend offline safe reinforcement learning systems against data poisoning attacks. The approach removes malicious data influence without requiring model retraining or access to original training environments, addressing a critical vulnerability in safety-critical applications like robotics.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Calibrating Uncertainty for Zero-Shot Adversarial CLIP

Researchers propose an adversarial fine-tuning method for CLIP that addresses a critical gap in zero-shot classification: while perturbations degrade accuracy, they also suppress uncertainty estimates, causing overconfidence. The approach reparameterizes CLIP outputs as Dirichlet distribution parameters to jointly optimize for robustness and calibrated uncertainty, achieving competitive results across benchmarks.
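To make "reparameterizing outputs as Dirichlet parameters" concrete, here is a minimal sketch in the style of standard evidential classification. It is not the paper's method: the softplus evidence function, the `alpha = evidence + 1` convention, and the uncertainty measure `K / sum(alpha)` are assumptions borrowed from the general evidential deep learning literature, and `dirichlet_from_logits` is a hypothetical name.

```python
import math


def softplus(z):
    """Numerically stable softplus: log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dirichlet_from_logits(logits):
    """Map class logits to Dirichlet concentration parameters.

    Assumptions for this sketch: evidence = softplus(logit) and
    alpha = evidence + 1. Returns the expected class probabilities and a
    scalar uncertainty K / sum(alpha), which lies in (0, 1].
    """
    alphas = [softplus(z) + 1.0 for z in logits]
    strength = sum(alphas)
    probs = [a / strength for a in alphas]
    uncertainty = len(alphas) / strength
    return probs, uncertainty


# Flat logits carry little evidence, so their uncertainty is higher.
_, u_flat = dirichlet_from_logits([0.0, 0.0, 0.0])
_, u_peaked = dirichlet_from_logits([10.0, 0.0, 0.0])
print(u_flat > u_peaked)  # True
```

Under this parameterization a prediction can be confident or uncertain independently of which class has the highest probability, which is the property a calibration-aware objective can exploit.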

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

SAEmnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders

Researchers introduce SAEmnesia, a supervised sparse autoencoder framework that enables efficient concept unlearning in diffusion models by binding concepts to individual neurons. The method reduces computational overhead by 96.67% compared to existing approaches and achieves 9.22% improvement on benchmark tests, with demonstrated robustness against adversarial attacks.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models

Researchers challenge the standard approach of using text embeddings as class prototypes in out-of-distribution detection with vision-language models, demonstrating a fundamental misalignment between text and visual feature spaces. They propose an online pseudo-supervised framework that learns visual prototypes directly from unlabeled test data, achieving state-of-the-art OOD detection performance.

AI · Bullish · arXiv – CS AI · Mar 3 · 5/10 · 6

Learning to Explore: Policy-Guided Outlier Synthesis for Graph Out-of-Distribution Detection

Researchers propose PGOS (Policy-Guided Outlier Synthesis), a new framework that uses reinforcement learning to improve Graph Neural Network safety by better detecting out-of-distribution graphs. The system replaces static sampling methods with a learned exploration strategy that navigates low-density regions to generate pseudo-OOD graphs for enhanced detector training.