y0news
← Feed
Back to feed
🧠 AI NeutralImportance 6/10

Dissociating Direct Access from Inference in AI Introspection

arXiv – CS AI|Harvey Lederman, Kyle Mahowald|
🤖AI Summary

Researchers replicated and extended AI introspection studies, finding that large language models detect injected thoughts through two distinct mechanisms: probability-matching based on prompt anomalies and direct access to internal states. The direct access mechanism is content-agnostic, meaning models can detect anomalies but struggle to identify their semantic content, often confabulating high-frequency concepts.

Key Takeaways
  • AI models use two separable mechanisms for introspection: probability-matching and direct access to internal states.
  • The direct access mechanism is content-agnostic, detecting anomalies without reliably identifying semantic content.
  • Models tend to confabulate injected concepts that are high-frequency and concrete like 'apple'.
  • Correct identification of injected concepts typically requires significantly more computational tokens.
  • The findings align with established theories in philosophy and psychology about introspective mechanisms.
Read Original →via arXiv – CS AI
Act on this with AI
Stay ahead of the market.
Connect your wallet to an AI agent. It reads balances, proposes swaps and bridges across 15 chains — you keep full control of your keys.
Connect Wallet to AI →How it works
Related Articles