arXiv · CS AI · 4d ago
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
Researchers used Meta-Llama-3.1-8B-Instruct to investigate whether large language models can introspect by detecting perturbations to their own internal states. They found that the binary detection methods used in prior work were flawed by methodological artifacts, but that models do show partial introspective ability: they localize sentence injections with 88% accuracy and discriminate injection strengths with 83% accuracy, though only for early-layer perturbations.