#internal-representations News & Analysis

3 articles tagged with #internal-representations. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

3 articles

AIBullisharXiv – CS AI · Jun 237/10

🧠

Peeking Inside LLMs: Leveraging Internal Artifacts of LLMs for Enhancing Reliability in Legal Classification

Researchers demonstrate that internal computational artifacts within Large Language Models can reliably detect when the model produces incorrect outputs in legal classification tasks. By analyzing these internal signals, downstream classifiers can identify hallucinated or erroneous predictions, potentially improving the reliability of LLM-based legal systems for high-stakes applications like bail decisions and statute violation predictions.

AINeutralarXiv – CS AI · Jun 27/10

🧠

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

Researchers introduce MENTIS, a framework for measuring internal geometric changes in language models during preference alignment training. The study reveals that alignment leaves selective, depth-localized signatures in model computations, with normative concepts showing larger internal reorganization than factual concepts across multiple model architectures.

AINeutralarXiv – CS AI · Apr 206/10

🧠

Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

Researchers propose a conformal prediction framework for large language models that uses internal neural representations rather than surface-level outputs to assess reliability and uncertainty. The Layer-Wise Information scoring method improves prediction validity under distribution shift while maintaining competitive performance, addressing a critical challenge in deploying LLMs where traditional uncertainty signals become unreliable.