y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#gradient-attribution News & Analysis

3 articles tagged with #gradient-attribution. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

3 articles
AINeutralarXiv – CS AI · Apr 147/10
🧠

Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?

Researchers introduce Pando, a benchmark that evaluates mechanistic interpretability methods by controlling for the 'elicitation confounder'—where black-box prompting alone might explain model behavior without requiring white-box tools. Testing 720 models, they find gradient-based attribution and relevance patching improve accuracy by 3-5% when explanations are absent or misleading, but perform poorly when models provide faithful explanations, suggesting interpretability tools may provide limited value for alignment auditing.

AIBullisharXiv – CS AI · Apr 137/10
🧠

The Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLMs

Researchers introduce the Two-Stage Decision-Sampling Hypothesis to explain how reinforcement learning enables self-reflection capabilities in large language models, demonstrating that RL's superior performance stems from improved decision-making rather than generation quality. The theory shows that reward gradients distribute asymmetrically across policy components, explaining why RL succeeds where supervised fine-tuning fails.

AINeutralarXiv – CS AI · Apr 206/10
🧠

Applied Explainability for Large Language Models: A Comparative Study

Researchers compare three explainability techniques—Integrated Gradients, Attention Rollout, and SHAP—for interpreting LLM decisions on sentiment classification tasks. The study reveals that gradient-based methods offer stability and interpretability, while attention-based approaches are faster but less predictive, highlighting critical trade-offs in choosing explanation methods for transformer models.