AI · Neutral · arXiv – CS AI · 3h ago · 7/10
CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders
Researchers propose CRaFT, a circuit-guided framework that identifies critical refusal features in large language models by analyzing relationships between features rather than isolated activation signals. The method raises the jailbreak attack success rate from 6.7% to 57.4% across benchmarks, advancing understanding of LLM safety mechanisms and highlighting vulnerabilities in model alignment.