Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
Researchers document 'blind refusal,' a phenomenon in which safety-trained language models refuse to help users circumvent rules without evaluating whether those rules are legitimate, unjust, or subject to justified exceptions. The study shows that models refuse 75.4% of requests to break rules even when the rules are indefensible and breaking them poses no safety risk.