AI · Bullish · arXiv – CS AI · 18h ago · 7/10
PRISM: Recovering Instruction Sets from Language Model Activations
Researchers introduce PRISM, a system that decodes a language model's hidden states to recover the full set of instructions currently guiding its behavior. The work targets a security gap in monitoring deployed LLM agents: detecting unintended objectives, prompt injections, and hidden constraints that a model may follow without any indication in its output.