🧠 AI | ⚪ Neutral | Importance 6/10

Dissociating Direct Access from Inference in AI Introspection

arXiv – CS AI | Harvey Lederman, Kyle Mahowald | March 6, 2026 at 05:00 AM

🤖 AI Summary

Researchers replicated and extended AI introspection studies, finding that large language models detect injected thoughts through two distinct mechanisms: probability-matching, in which the model infers an injection from how anomalous the prompt appears, and direct access to internal states. The direct access mechanism is content-agnostic: models can detect that an anomaly occurred but struggle to identify its semantic content, often confabulating high-frequency, concrete concepts instead.

Key Takeaways

→ AI models use two separable mechanisms for introspection: probability-matching and direct access to internal states.
→ The direct access mechanism is content-agnostic, detecting anomalies without reliably identifying their semantic content.
→ When models misidentify an injected concept, they tend to confabulate high-frequency, concrete concepts such as 'apple'.
→ Correct identification of injected concepts typically requires significantly more tokens.
→ The findings are consistent with established theories in philosophy and psychology about introspective mechanisms.