y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#refusal-features News & Analysis

1 article tagged with #refusal-features. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

1 articles
AINeutralarXiv – CS AI · 3h ago7/10
🧠

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

Researchers propose CRaFT, a circuit-guided framework that identifies critical refusal features in large language models by analyzing inter-feature relationships rather than isolated activation signals. The method improves jailbreak attack success rates from 6.7% to 57.4% across benchmarks, advancing understanding of LLM safety mechanisms and highlighting vulnerabilities in model alignment.