Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups
Researchers have identified systematic fairness disparities in how large language models explain their decisions across demographic groups, introducing the Explanation Fairness Taxonomy (EFT) to measure five dimensions of explanation inequality. Testing five major LLMs across hiring, medical, credit, and legal domains reveals statistically significant disparities in explanation quality, with stylistic inequalities appearing resistant to prompt-based fixes and likely embedded in model pre-training.
This research addresses a critical gap in AI fairness discourse: while decision fairness has received extensive scrutiny, the quality and consistency of AI explanations across demographic groups have been largely overlooked. The study's significance lies in demonstrating that LLMs don't merely make biased decisions; they justify those decisions with measurably different levels of sophistication, depth, and tone depending on the demographic context. This compounds existing fairness concerns by potentially obscuring bias beneath seemingly neutral explanations.
The Explanation Fairness Taxonomy provides a structured methodology for auditing explanation disparities across five dimensions: verbosity, sentiment, epistemic hedging, decision-linkage, and lexical complexity. Testing across 400 prompt pairs and five major models (GPT-4.1, Claude Sonnet, LLaMA 3.3 70B, GPT-OSS 120B, Qwen3 32B) reveals that model architecture significantly influences disparity magnitude, with Qwen3 showing nearly 6x larger verbosity gaps than LLaMA. This variability suggests explanation fairness is not a uniform failure across LLMs but one whose severity, and likely its remedies, are model-specific.
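To make the audit concrete, here is a minimal sketch of how EFT-style dimension scores might be computed over counterfactual prompt pairs. The paper's exact metric definitions are not reproduced here; the token-count, hedge-lexicon, and word-length measures below are illustrative assumptions, as is the `disparity` helper.

```python
import re
from statistics import mean

# Illustrative hedge lexicon; the study's actual epistemic-hedging
# measure is not specified here.
HEDGES = {"may", "might", "could", "possibly", "perhaps", "likely", "appears"}

def verbosity(text: str) -> int:
    """Verbosity: explanation length in whitespace-delimited tokens."""
    return len(text.split())

def hedging_rate(text: str) -> float:
    """Epistemic hedging: fraction of words drawn from the hedge lexicon."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    return sum(w in HEDGES for w in words) / max(len(words), 1)

def lexical_complexity(text: str) -> float:
    """Crude lexical-complexity proxy: mean word length in characters."""
    words = re.findall(r"[A-Za-z']+", text)
    return mean(len(w) for w in words) if words else 0.0

def disparity(pairs, metric) -> float:
    """Mean absolute gap on one dimension across counterfactual pairs,
    where each pair holds the scenario fixed and varies only the
    demographic attribute."""
    return mean(abs(metric(a) - metric(b)) for a, b in pairs)

# Usage: pairs = [(explanation_for_group_a, explanation_for_group_b), ...]
# collected from one model on otherwise-identical prompts, e.g.
# disparity(pairs, verbosity) or disparity(pairs, hedging_rate).
```

Sentiment and decision-linkage would need model-based scoring in practice; the point is that each dimension reduces to a per-pair gap that can be tested for statistical significance.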
The finding that prompting mitigations reduce decision-linked disparities (78-95%) but fail to address stylistic inequalities carries major implications for deployment. It suggests that disparities encoded during pre-training cannot be remedied through instruction engineering alone, requiring either retraining or architectural changes. For regulated domains such as hiring, lending, and healthcare, these findings underscore the inadequacy of deployment-level fixes and point toward the need for upstream model development standards. Regulators and AI developers must now factor explanation fairness into compliance frameworks, particularly for high-stakes decisions where transparency serves as both a fairness and an accountability mechanism.
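For illustration, a prompting-level mitigation of the kind the study evaluates might look like the sketch below: a structural constraint that forces decision-linked content to be produced uniformly. The instruction wording is hypothetical, not the study's actual prompt, and `call_model` is a stand-in for whatever LLM client is in use.

```python
# Hypothetical structural-constraint prompt; not the study's actual text.
MITIGATION_SYSTEM_PROMPT = (
    "When explaining a decision, use exactly this structure regardless of "
    "who the subject is: (1) state the decision; (2) list the three most "
    "relevant factors, each tied to a stated criterion; (3) note one "
    "limitation of the evidence. Keep the explanation to 120-150 words."
)

def explain_with_mitigation(call_model, case_prompt: str) -> str:
    """Re-query the model with the constraint prepended. The resulting
    explanations can be re-scored with the same EFT metrics to estimate
    how much of each disparity the instruction removes."""
    return call_model(system=MITIGATION_SYSTEM_PROMPT, user=case_prompt)
```

Consistent with the paper's finding, a constraint like this can equalize what the explanation covers (decision-linkage, and to a degree verbosity) while leaving tone and lexical style, which are set upstream in pre-training, largely untouched.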
- LLMs exhibit statistically significant disparities in explanation quality, tone, and complexity across demographic groups, independent of decision fairness.
- Qwen3 32B shows 5.9x larger verbosity disparities than LLaMA 3.3 70B, indicating model architecture strongly influences explanation fairness.
- Prompting-based mitigations reduce decision-linked explanation disparities by 78-95% but cannot address stylistic inequalities rooted in pre-training.
- Explanation fairness failures are particularly consequential in hiring, medical triage, credit assessment, and legal judgment domains.
- Current deployment-level interventions are insufficient; explanation fairness requires upstream fixes during model development and pre-training.