AINeutralarXiv โ CS AI ยท Feb 277/106
๐ง
Latent Introspection: Models Can Detect Prior Concept Injections
Researchers discovered that a Qwen 32B AI model can detect when concepts have been injected into its context, even though it denies this capability in its outputs. The introspection ability becomes dramatically stronger (0.3% to 39.9% sensitivity) when the model is given accurate information about AI introspection mechanisms.