#gradient-attribution News & Analysis

4 articles tagged with #gradient-attribution. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?

Researchers introduce Pando, a benchmark that evaluates mechanistic interpretability methods by controlling for the 'elicitation confounder'—where black-box prompting alone might explain model behavior without requiring white-box tools. Testing 720 models, they find gradient-based attribution and relevance patching improve accuracy by 3-5% when explanations are absent or misleading, but perform poorly when models provide faithful explanations, suggesting interpretability tools may provide limited value for alignment auditing.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

The Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLMs

Researchers introduce the Two-Stage Decision-Sampling Hypothesis to explain how reinforcement learning enables self-reflection capabilities in large language models, demonstrating that RL's superior performance stems from improved decision-making rather than generation quality. The theory shows that reward gradients distribute asymmetrically across policy components, explaining why RL succeeds where supervised fine-tuning fails.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

ABLE: Representing and Mapping LLMs via Attribution-Based Large-model Embedding

Researchers introduce ABLE, a framework that represents and compares large language models through gradient-based feature attributions rather than parameter analysis or output comparison. The training-free method achieves competitive performance on model comparison tasks across 239 open-source LLMs while providing theoretical stability guarantees.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Applied Explainability for Large Language Models: A Comparative Study

Researchers compare three explainability techniques—Integrated Gradients, Attention Rollout, and SHAP—for interpreting LLM decisions on sentiment classification tasks. The study reveals that gradient-based methods offer stability and interpretability, while attention-based approaches are faster but less predictive, highlighting critical trade-offs in choosing explanation methods for transformer models.