AI · Bullish · arXiv – CS AI · 3h ago · 6/10
Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring
Researchers propose TELLME, a method that makes large language models easier to monitor by improving the transparency of their internal representations rather than relying solely on external monitoring tools. The technique shows consistent improvements on detoxification tasks across multimodal datasets and model architectures. It targets a basic limitation of current monitoring: chain-of-thought explanations do not accurately reflect how LLMs actually reach their decisions.