y0news

Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

arXiv – CS AI | Cosimo Galeone, Anna Ettorre, Minsu Park, Giuseppe Ettorre, Daniele Ligorio
🤖AI Summary

Researchers discovered that language models can detect undesirable behaviors like hallucination with near-perfect accuracy, yet the neural directions enabling detection are nearly orthogonal (83 degrees apart) from those controlling the behavior. This fundamental geometric dissociation between knowing and steering persists across multiple models and scales, challenging a core assumption of mechanistic interpretability that detection should enable control.

Analysis

This research reveals a critical gap in mechanistic interpretability theory: detecting a behavior in neural activations does not guarantee the ability to steer or control it. The team found that while Gemma 2-2B-it achieves perfect linear separability for hallucinated entities (AUC = 1.000), the detection direction sits at approximately 83 degrees from the direction that actually produces refusals. An 83-degree angle corresponds to a cosine similarity of about 0.12, so the two directions are nearly orthogonal. This detection-intervention gap persists across four models spanning three families and scales from 1B to 9B parameters, and it remains unchanged before and after instruction tuning, suggesting the dissociation originates in pretraining rather than in fine-tuning artifacts.
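The two quantities in this paragraph, the angle between directions and the AUC of a linear readout, can be illustrated with a minimal sketch. The vectors and activations below are toy values invented for illustration, not data from the paper, and the names `detect_dir`, `refusal_dir`, `hallucinated`, and `known` are hypothetical. The sketch assumes AUC is computed by projecting activations onto the detection direction and comparing scores pairwise.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def angle_degrees(u, v):
    """Angle between two direction vectors, in degrees."""
    cos = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos))

def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy 2D directions (assumption): detection sits 83 degrees from refusal.
refusal_dir = [1.0, 0.0]
theta = math.radians(83.0)
detect_dir = [math.cos(theta), math.sin(theta)]

# Toy activations (assumption): two classes separated along detect_dir.
hallucinated = [[0.2, 1.5], [-0.4, 2.0], [0.9, 1.2]]
known = [[0.1, -1.0], [-0.8, -0.5], [0.5, -2.0]]

pos = [dot(x, detect_dir) for x in hallucinated]
neg = [dot(x, detect_dir) for x in known]

print("angle (deg):", round(angle_degrees(refusal_dir, detect_dir), 2))
print("cosine similarity:", round(dot(refusal_dir, detect_dir), 3))
print("detection AUC:", auc(pos, neg))
```

In this toy setup the readout along `detect_dir` separates the classes perfectly even though that direction shares little with `refusal_dir`, which is the shape of the dissociation the summary describes.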

The findings challenge a widespread assumption undergirding mechanistic interpretability work: that understanding where a behavior lives in activation space automatically provides a handle for modification. Instead, the research shows that detection operates on high-dimensional patterns while steerability depends on functional properties not readily apparent from static geometric relationships. A modest 15-degree rotation toward the refusal direction partially bridges the gap, achieving 73% and 60% refusal rates on held-out categories, but this limited success underscores that controlling model behavior requires more than identifying its latent representation.
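The summary does not say how the 15-degree rotation was carried out. One plausible reading, offered here only as an assumption, is a rotation of the detection direction within the plane it spans with the refusal direction. The sketch below implements that with a Gram-Schmidt step; `rotate_toward` and the example vectors are hypothetical and not taken from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def angle_degrees(u, v):
    cos = dot(normalize(u), normalize(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def rotate_toward(src, target, degrees):
    """Rotate unit(src) by `degrees` toward target, inside span{src, target}."""
    e1 = normalize(src)
    t = normalize(target)
    proj = dot(t, e1)
    # Component of target orthogonal to src (Gram-Schmidt step).
    residual = [ti - proj * ei for ti, ei in zip(t, e1)]
    e2 = normalize(residual)  # raises if src and target are parallel
    a = math.radians(degrees)
    return [math.cos(a) * x + math.sin(a) * y for x, y in zip(e1, e2)]

# Toy 3D directions (assumption): detection starts 83 degrees from refusal.
refusal_dir = [1.0, 0.0, 0.0]
s = math.sin(math.radians(83.0))
detect_dir = [math.cos(math.radians(83.0)), 0.6 * s, 0.8 * s]

rotated = rotate_toward(detect_dir, refusal_dir, 15.0)
print("before:", round(angle_degrees(detect_dir, refusal_dir), 2))
print("after: ", round(angle_degrees(rotated, refusal_dir), 2))
```

Under this construction the rotated vector stays unit length, moves exactly 15 degrees away from the original detection direction, and ends 68 degrees from the refusal direction. Whether the paper uses this parameterization is not stated in the summary.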

For AI safety and alignment researchers, this work suggests that achieving interpretability—the ability to read and detect unwanted behaviors—may not translate directly into controllability without additional intervention strategies. The persistence of the detection-steering gap across pretraining and fine-tuning suggests it reflects fundamental properties of how language models encode knowledge. Future research must distinguish between readable behavioral signatures and the actual causal mechanisms driving outputs, potentially redirecting mechanistic interpretability efforts toward functional rather than purely geometric approaches.

Key Takeaways
  • Language models detect hallucinations with near-perfect accuracy, yet the detection direction is nearly orthogonal (~83°) to the direction that controls refusal behavior
  • The detection-intervention gap persists across multiple model families and scales (1B-9B parameters) and appears to originate in pretraining rather than instruction tuning
  • Detection operates as a high-dimensional phenomenon while steerability depends on functional properties not predictable from static geometric angles alone
  • A 15-degree rotation toward the refusal direction partially improves control (73% and 60% refusal rates on held-out categories), but steering still requires more than locating the behavior in activation space
  • The findings challenge core assumptions in mechanistic interpretability that knowing where behaviors exist should enable controlling them