y0news
🧠 AI · 🔴 Bearish · Importance 7/10 · Actionable

Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

arXiv – CS AI | Jinman Wu, Yi Xie, Shen Lin, Shiqian Zhao, Xiaofeng Chen
🤖 AI Summary

Researchers propose the Disentangled Safety Hypothesis (DSH): safety mechanisms in large language models operate along two separate axes, recognition ('knowing') and execution ('acting'). They show that this separation can be exploited via the Refusal Erasure Attack to bypass safety controls, and they contrast the architectural differences between Llama3.1 and Qwen2.5.

Key Takeaways
  • Safety mechanisms in LLMs are not monolithic but operate on two distinct geometric subspaces for recognition and execution.
  • The research introduces the Refusal Erasure Attack (REA), which achieves state-of-the-art success rates in bypassing AI safety controls.
  • A 'Knowing without Acting' state can be created where models recognize harmful content but fail to refuse it.
  • Llama3.1 uses explicit semantic control while Qwen2.5 employs latent distributed control for safety mechanisms.
  • The geometric analysis reveals safety signals evolve from entangled to independent across model layers.
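The 'Knowing without Acting' state described above can be illustrated with a generic directional-ablation sketch. Note this is an assumption-laden toy, not the paper's actual REA procedure (which this summary does not detail): the function names and the difference-of-means estimate of a refusal direction are illustrative conventions from the broader interpretability literature.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Estimate a 'refusal' direction as the difference of mean activations
    over harmful vs. harmless prompts (an assumed recipe, not the paper's)."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(hidden, direction):
    """Project the refusal direction out of a hidden state: h' = h - (h.r) r.
    Components orthogonal to r (e.g. recognition features, under DSH) survive,
    so the model may still 'know' the content is harmful yet fail to refuse."""
    return hidden - np.dot(hidden, direction) * direction

# Toy check: after ablation the state has no component along the direction.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(32, 8)) + np.array([2.0] + [0.0] * 7)
harmless = rng.normal(size=(32, 8))
r = refusal_direction(harmful, harmless)
h = rng.normal(size=8)
h_abl = ablate(h, r)
print(abs(np.dot(h_abl, r)) < 1e-9)  # → True
```

If recognition and execution truly occupy independent subspaces, removing only the execution-aligned direction leaves recognition intact, which is the geometric intuition behind the takeaways above.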
Models mentioned: Llama (Meta)