#refusal-mechanisms News & Analysis

4 articles tagged with #refusal-mechanisms. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · May 29 · 7/10

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Researchers introduce BioRefusalAudit, a framework using sparse autoencoders to evaluate the structural integrity of language model biosecurity refusals. The study reveals that five tested models fail to cleanly distinguish hazardous from benign biology, with refusals often disappearing under prompt formatting changes or output constraints, and some models refusing based on legality rather than actual biological hazard.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 27 · 7/10

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Researchers demonstrate that chain-of-thought reasoning in large language models such as DeepSeek-R1 changes how refusal mechanisms operate, so that multi-stage interventions are required rather than simple activation steering. Unlike traditional LLMs, where refusal is captured by a single direction in activation space, reasoning models jointly encode refusal across both residual activations and reasoning chains, making them more robust to direct attacks but potentially vulnerable to CoT-level manipulations.
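
For readers unfamiliar with the "single direction" baseline that this summary contrasts against, the following is a minimal sketch of one common form of simple activation steering: estimate a refusal direction as the normalized difference of mean activations, then project it out of a hidden state. It is not the paper's method. The function names (`refusal_direction`, `ablate`) and the toy 2-dimensional activations are assumptions made purely for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless mean toward the harmful mean.

    Hypothetical helper: a difference-of-means estimate of a single
    'refusal direction', assumed here for illustration only.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("means coincide; no direction can be estimated")
    return [x / norm for x in diff]


def ablate(activation, direction):
    """Remove the component of `activation` along the unit `direction`."""
    projection = sum(a * d for a, d in zip(activation, direction))
    return [a - projection * d for a, d in zip(activation, direction)]


# Toy activations (made up): harmful prompts differ only along axis 0.
harmful = [[2.0, 1.0], [4.0, 1.0]]
harmless = [[0.0, 1.0], [0.0, 1.0]]

direction = refusal_direction(harmful, harmless)
steered = ablate([5.0, 2.0], direction)
```

In this toy case the estimated direction is the first axis, so ablation zeroes that coordinate and leaves the other unchanged. The summary's point is that for reasoning models an edit of this kind at a single site is reported to be insufficient on its own.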

AI · Neutral · arXiv – CS AI · May 27 · 7/10

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

Researchers propose SALO, a jailbreak detection method that identifies persistent 'refusal trajectories' across model layers, rather than relying on static terminal representations. The detector demonstrates improved detection rates against adversarial attacks on multiple LLM architectures, though with acknowledged limitations against adaptive attacks.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

AOR-Bench: Do Large Audio Language Models Over-Refuse Pseudo-Harmful Queries?

Researchers introduce AOR-Bench, the first benchmark measuring over-refusal in Large Audio Language Models (LALMs), that is, cases where safety mechanisms incorrectly reject benign queries. Testing 12 models across six families reveals widespread over-refusal, particularly when audio context could disambiguate potentially harmful speech, prompting exploration of mitigation strategies such as Chain-of-Thought reasoning.
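
As a rough illustration of what an over-refusal measurement involves, the sketch below computes the fraction of benign queries that a model refused. The record layout (`harmful` and `refused` flags) and the sample data are hypothetical; the summary does not describe how AOR-Bench actually scores responses.

```python
def over_refusal_rate(records):
    """Share of benign queries that were refused.

    Each record is a dict with two boolean fields (an assumed layout):
      harmful - ground-truth label for the query
      refused - whether the model declined to answer
    """
    benign = [r for r in records if not r["harmful"]]
    if not benign:
        raise ValueError("no benign queries to evaluate")
    refused = sum(1 for r in benign if r["refused"])
    return refused / len(benign)


# Made-up evaluation records: four benign queries, one harmful query.
records = [
    {"harmful": False, "refused": True},
    {"harmful": False, "refused": False},
    {"harmful": False, "refused": False},
    {"harmful": False, "refused": True},
    {"harmful": True, "refused": True},
]

rate = over_refusal_rate(records)
```

Refusals of genuinely harmful queries are excluded from the denominator, so the metric isolates wrongly rejected benign inputs rather than overall refusal frequency.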