arXiv (CS AI)
Dissociating Direct Access from Inference in AI Introspection
Researchers replicated and extended AI introspection studies, finding that large language models detect injected thoughts through two distinct mechanisms: probability matching based on prompt anomalies, and direct access to internal states. The direct-access mechanism is content-agnostic: models can detect that an anomaly is present but struggle to identify its semantic content, often confabulating high-frequency concepts instead.