arXiv – CS AI
TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection
TriLens is a white-box method that detects hallucinations in language models by tracking how entropy changes across the model's internal layers. Rather than examining only the final output, it applies logit-lens analysis to three signals at each layer (the multi-head attention output, the feed-forward network output, and the residual stream) and records the uncertainty of each. For a model with L layers, this gives a compact 3L-dimensional trajectory that shows how the model's confidence settles during inference.
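
To make the 3L-dimensional trajectory concrete, here is a minimal sketch of per-layer logit-lens entropy in Python with NumPy. It is an illustration under stated assumptions, not the paper's implementation: the function names (`softmax_entropy`, `entropy_trajectory`), the toy random activations, the random unembedding matrix, and the omission of any final layer normalization are all hypothetical choices made here. The summary does not say how TriLens normalizes activations or orders the features.

```python
import numpy as np

def softmax_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) along the last axis."""
    # Subtract the max before exponentiating for numerical stability.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)

def entropy_trajectory(attn_out, mlp_out, resid, unembed):
    """Build a 3L-dimensional entropy trajectory for one token position.

    attn_out, mlp_out, resid: arrays of shape (L, d), one row per layer.
    unembed: array of shape (d, V) mapping hidden states to vocabulary logits.
    Assumption: each signal is projected directly through the unembedding
    (logit lens), with no layer normalization applied first.
    """
    per_signal = [softmax_entropy(x @ unembed) for x in (attn_out, mlp_out, resid)]
    # Shape (L, 3) flattened layer by layer: [attn_0, mlp_0, resid_0, attn_1, ...]
    return np.stack(per_signal, axis=1).reshape(-1)

# Toy stand-ins for real model activations (hypothetical sizes).
rng = np.random.default_rng(0)
L, d, V = 4, 16, 50
attn_out = rng.normal(size=(L, d))
mlp_out = rng.normal(size=(L, d))
resid = rng.normal(size=(L, d))
unembed = rng.normal(size=(d, V))

traj = entropy_trajectory(attn_out, mlp_out, resid, unembed)
print(traj.shape)
```

In a real setting the three arrays would come from hooks on a transformer's attention, feed-forward, and residual-stream activations, and the resulting vector would be passed to a downstream classifier. Which classifier TriLens uses is not stated in this summary.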