y0news

#refusal-mechanisms News & Analysis

2 articles tagged with #refusal-mechanisms. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 15h ago · 7/10

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Researchers demonstrate that chain-of-thought (CoT) reasoning in large language models such as DeepSeek-R1 fundamentally changes how refusal mechanisms operate, so that steering refusal requires multi-stage interventions rather than simple activation steering. In traditional LLMs, refusal is encoded along a single direction in activation space; reasoning models instead encode refusal jointly across residual activations and reasoning chains, which makes them more robust to direct attacks but potentially vulnerable to CoT-level manipulations.
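For readers unfamiliar with the "single direction" picture that the summary contrasts against, here is a minimal sketch of the general idea, not the paper's code. It assumes synthetic activations, a difference-of-means estimate of the direction, and hypothetical helper names (`refusal_direction`, `ablate_direction`); a real experiment would use residual-stream activations captured from a model.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Estimate a unit 'refusal direction' as the difference of class means."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Remove the component of every activation along a unit direction."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in for activations (assumption: real ones come from a model).
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 4.0 * true_dir  # shifted along one axis

r = refusal_direction(harmful, harmless)
steered = ablate_direction(harmful, r)

# After ablation, no activation has any component left along r.
max_residual = float(np.abs(steered @ r).max())
print(f"alignment with planted direction: {r[0]:.3f}")
print(f"largest remaining projection: {max_residual:.2e}")
```

In this toy setting a single projection removes the signal entirely. The article's claim is that in reasoning models this kind of one-step edit is no longer sufficient, because refusal is also carried by the reasoning chain.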

AI · Neutral · arXiv – CS AI · 15h ago · 7/10

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

Researchers propose SALO, a jailbreak detection method that identifies persistent 'refusal trajectories' across model layers, rather than relying on static terminal representations. The detector demonstrates improved detection rates against adversarial attacks on multiple LLM architectures, though with acknowledged limitations against adaptive attacks.
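The difference between a terminal-only signal and a layer-wise trajectory can be illustrated with a toy sketch. This is not SALO and makes no claim about how SALO works: the per-layer directions, the hand-picked projection values, and the helper names (`refusal_trajectory`, `terminal_score`, `peak_score`) are all assumptions made for illustration.

```python
import numpy as np

def refusal_trajectory(layer_acts, layer_dirs):
    """Project each layer's activation onto that layer's unit-normalized direction.

    Both inputs have shape (n_layers, d); the result has shape (n_layers,).
    """
    unit = layer_dirs / np.linalg.norm(layer_dirs, axis=1, keepdims=True)
    return np.einsum("ld,ld->l", layer_acts, unit)

def terminal_score(traj):
    """Static baseline: look only at the last layer."""
    return float(traj[-1])

def peak_score(traj):
    """Trajectory-based score: strongest refusal signal at any layer."""
    return float(traj.max())

def acts_from_projections(projections, d=4):
    """Build toy activations whose first coordinate carries the given projections."""
    acts = np.zeros((len(projections), d))
    acts[:, 0] = projections
    return acts

n_layers, d = 6, 4
dirs = np.tile(np.array([2.0, 0.0, 0.0, 0.0]), (n_layers, 1))  # hypothetical directions

# Hand-picked toy values: the jailbreak prompt shows a mid-layer refusal signal
# that is suppressed by the final layer; the benign prompt never shows one.
jailbreak_traj = refusal_trajectory(
    acts_from_projections([0.1, 1.5, 2.5, 2.0, 0.5, 0.1]), dirs)
benign_traj = refusal_trajectory(
    acts_from_projections([0.1, 0.2, 0.1, 0.2, 0.1, 0.1]), dirs)

print("terminal:", terminal_score(jailbreak_traj), terminal_score(benign_traj))
print("peak:    ", peak_score(jailbreak_traj), peak_score(benign_traj))
```

In this constructed example the two prompts are indistinguishable at the final layer but clearly separated by the peak of the trajectory, which is the intuition behind looking across layers rather than at a single terminal representation.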

🧠 Llama