🧠 AI | ⚪ Neutral | Importance 7/10

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

arXiv – CS AI | Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han | May 28, 2026 at 04:00 AM

🤖 AI Summary

Researchers propose CRaFT, a circuit-guided framework that identifies critical refusal features in large language models by analyzing relationships between features rather than isolated activation signals. The method raises jailbreak attack success rates from 6.7% to 57.4% across four benchmarks, which advances understanding of LLM safety mechanisms and exposes vulnerabilities in model alignment.

Analysis

CRaFT addresses a gap in AI safety research by shifting focus from surface-level feature activation to a causal, mechanistic account of how language models refuse harmful requests. Traditional feature selection methods pick neurons or features that activate strongly on dangerous prompts, but strong activation tends to capture superficial patterns such as keywords rather than the underlying logic that governs refusal decisions. The research reports that analyzing inter-feature relationships and information flow across model layers targets safety mechanisms more precisely.

This work builds on a growing body of mechanistic interpretability research that treats neural networks as interpretable circuit graphs rather than black boxes. By using cross-layer transcoders to visualize and quantify how features influence one another and contribute to final outputs, CRaFT maps the computational pathways responsible for alignment behavior. The reported increase in jailbreak success from 6.7% to 57.4%, a gain of 50.7 percentage points, suggests that the identified features causally govern safety decisions rather than merely correlating with them.
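The contrast between activation-based and circuit-guided selection can be illustrated with a toy example. The sketch below is not the paper's algorithm (this summary does not give CRaFT's scoring rule); it assumes a small, hypothetical attribution graph whose edge weights stand in for feature-to-feature influence, and it ranks features by their summed path influence on a refusal output node. All feature names, weights, and activations are invented for illustration.

```python
import numpy as np

def total_influence(adj: np.ndarray, target: int) -> np.ndarray:
    """Sum of path-weight products from every node to `target` in a DAG.

    adj[i, j] is the (hypothetical) direct influence of node i on node j.
    """
    n = adj.shape[0]
    # For an acyclic graph adj is nilpotent, so
    # sum_{k>=1} adj^k = (I - adj)^-1 - I, which counts every path once.
    paths = np.linalg.inv(np.eye(n) - adj) - np.eye(n)
    return paths[:, target]

def top_k(scores: np.ndarray, k: int, exclude=()) -> list:
    """Indices of the k largest-magnitude scores, skipping excluded nodes."""
    order = [int(i) for i in np.argsort(-np.abs(scores)) if int(i) not in exclude]
    return order[:k]

# Hypothetical toy graph: nodes 0-3 are features, node 4 is the refusal output.
names = ["keyword_detector", "harm_intent", "refusal_gate", "formatting", "refusal_logit"]
OUTPUT = 4
adj = np.zeros((5, 5))
adj[0, 2] = 0.1    # keyword feature weakly feeds the gate
adj[1, 2] = 0.9    # intent feature strongly feeds the gate
adj[2, 4] = 0.8    # gate drives the refusal output
adj[3, 4] = 0.05   # formatting feature barely matters

# Invented activation strengths on a harmful prompt.
activation = np.array([5.0, 1.2, 0.9, 3.0, 0.0])

# Baseline: pick the most strongly activated features.
by_activation = top_k(activation, 2, exclude={OUTPUT})
# Graph-based: pick the features with the largest influence on the output.
by_influence = top_k(total_influence(adj, OUTPUT), 2, exclude={OUTPUT})

print("activation-ranked:", [names[i] for i in by_activation])
print("influence-ranked: ", [names[i] for i in by_influence])
```

In this toy setup the activation ranking favors the keyword and formatting features, while the influence ranking favors the gate and intent features that actually sit on the path to the refusal output, which is the distinction the analysis above describes.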

For the AI safety community, CRaFT represents both opportunity and risk. On one hand, identifying true causal refusal mechanisms enables more robust safety techniques and better defenses against adversarial attacks. On the other hand, the framework's effectiveness at locating exploitable vulnerabilities accelerates potential jailbreak development. Organizations deploying large models must grapple with this dual-use research, as adversaries could apply similar techniques to systematically bypass safeguards. The research underscores that alignment remains fragile and dependent on specific, targetable mechanisms rather than fundamental model properties. Moving forward, developers should prioritize redundant safety mechanisms and research into making refusal behavior more distributed and difficult to isolate.

Key Takeaways

→ CRaFT identifies causal refusal mechanisms by analyzing inter-feature relationships rather than treating activation strength as the primary signal
→ Cross-layer transcoders map model computations into sparse circuit graphs that reveal how features influence alignment decisions
→ Jailbreak attack success rates improved from 6.7% to 57.4% across four benchmarks, demonstrating the framework's precision in targeting safety vulnerabilities
→ The research suggests that LLM alignment depends on specific, mechanistically isolable features rather than distributed or fundamental safety properties
→ Dual-use implications require balancing transparent safety research against the risk of accelerating adversarial jailbreak techniques