Is Fairness Truly Fair? Towards Reliable Lipschitz Fairness in Multi-Task Learning via Fixed-δ Alignment
Researchers propose ReLiF, a framework that addresses fairness evaluation problems in multi-task machine learning by using fixed evaluation thresholds rather than model-dependent ones. The work shows how different algorithms can appear comparable under inconsistent fairness metrics even when they are not, and demonstrates that a proper auditing protocol reveals genuine utility-fairness trade-offs that conventional methods obscure.
This research tackles a fundamental problem in machine learning fairness: ensuring that similar inputs receive similar predictions across multiple tasks. The core issue is that existing fairness evaluation methods use thresholds derived from each model's own representation scales, creating an apples-to-oranges comparison problem where different algorithms operate under different semantic standards. This threshold confounding makes it impossible to reliably rank models by fairness.
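To make threshold confounding concrete, here is a minimal Python sketch. It assumes a pairwise violation metric that this summary does not spell out: a pair counts as similar when its representation distance is at most δ, and as a violation when its prediction gap exceeds a Lipschitz constant times that distance. The function names, the median-distance rule for the model-dependent threshold, and the synthetic data are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def upper_pairwise_dist(x):
    """Euclidean distances for all unordered pairs of rows in x."""
    diff = x[:, None, :] - x[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return dist[np.triu_indices(x.shape[0], k=1)]

def violation_rate(reps, preds, delta, lipschitz=1.0):
    """Assumed metric: share of similar pairs (rep distance <= delta)
    whose prediction gap exceeds lipschitz * rep distance."""
    d_rep = upper_pairwise_dist(reps)
    d_pred = upper_pairwise_dist(preds)
    similar = d_rep <= delta
    if not similar.any():
        return 0.0  # no pair qualifies as similar at this tolerance
    return float(np.mean(d_pred[similar] > lipschitz * d_rep[similar]))

def model_dependent_delta(reps, quantile=0.5):
    """Hypothetical self-derived threshold: a quantile of the model's
    own pairwise representation distances."""
    return float(np.quantile(upper_pairwise_dist(reps), quantile))

# Synthetic stand-ins for two models' representations and predictions.
rng = np.random.default_rng(0)
reps_a = rng.normal(size=(40, 4))
reps_b = 10.0 * reps_a            # same geometry, ten times the scale
head = rng.normal(size=(4, 2))
preds_a = reps_a @ head
preds_b = preds_a.copy()          # identical predictions, by construction

delta_a = model_dependent_delta(reps_a)
delta_b = model_dependent_delta(reps_b)
fixed_delta = 1.0                 # shared reference tolerance (arbitrary here)

print("self-derived deltas:", delta_a, delta_b)
print("rates, self-derived:",
      violation_rate(reps_a, preds_a, delta_a),
      violation_rate(reps_b, preds_b, delta_b))
print("rates, fixed delta: ",
      violation_rate(reps_a, preds_a, fixed_delta),
      violation_rate(reps_b, preds_b, fixed_delta))
```

Because `reps_b` is a rescaled copy of `reps_a`, its self-derived threshold is ten times larger, so the two models are audited at different tolerances. Passing the same `fixed_delta` to both removes that dependence on each model's own scale.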
The ReLiF framework addresses this by separating training from evaluation. During training, it applies controlled regularization through a violation-rate feedback controller that keeps fairness constraints active without overwhelming the primary learning objective. During evaluation, it uses a fixed reference tolerance shared across all models, enabling consistent semantic comparison. The research includes theoretical analysis of threshold drift patterns and conditions under which fairness rankings remain stable.
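The training-side idea can be sketched as a simple feedback loop. The summary does not give the controller's update rule, so the following is an assumed proportional controller: it raises the fairness regularization weight when the observed violation rate is above a target, lowers it when below, and clamps the weight so the penalty neither vanishes nor dominates the task loss. The class name, the default values, and the additive update are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ViolationRateController:
    """Hypothetical proportional controller for a fairness penalty weight."""
    target_rate: float = 0.05   # desired fraction of violating pairs
    gain: float = 0.5           # step size applied to the rate error
    weight: float = 0.1         # current regularization weight
    min_weight: float = 0.01    # floor: keeps the constraint active
    max_weight: float = 1.0     # ceiling: keeps it from dominating

    def update(self, observed_rate: float) -> float:
        # Positive error means too many violations, so push the weight up.
        error = observed_rate - self.target_rate
        proposed = self.weight + self.gain * error
        self.weight = min(self.max_weight, max(self.min_weight, proposed))
        return self.weight

# Usage inside a training loop (the loss terms are placeholders):
controller = ViolationRateController()
for observed_rate in [0.25, 0.10, 0.02]:
    lam = controller.update(observed_rate)
    # total_loss = task_loss + lam * fairness_penalty
    print(f"observed={observed_rate:.2f} -> weight={lam:.3f}")
```

The floor and ceiling are one way to express the two goals described above; other controller designs, such as multiplicative or integral updates, would serve the same purpose.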
The practical implications extend across healthcare and computer vision applications. Experiments on clinical time-series data and the NYU Depth V2 (NYUv2) dataset reveal that fixed-threshold auditing exposes real utility-fairness trade-offs that conventional methods mask. On NYUv2, ReLiF maintains competitive performance while substantially reducing bias under standardized thresholds. Critically, the research shows that task-balancing baselines sometimes outperform fairness-aware methods on bias metrics, a result that becomes visible only under a consistent evaluation protocol.
This work matters because fairness claims in machine learning increasingly affect real-world decisions in medicine, autonomous systems, and resource allocation. Without reliable evaluation standards, organizations cannot objectively assess whether fairness improvements are genuine or merely artifacts of inconsistent metrics. The framework provides a protocol for honest fairness assessment in multi-task scenarios.
- Fixed-threshold auditing protocols enable semantically consistent fairness evaluation across different multi-task learning algorithms
- Threshold confounding can artificially inflate fairness claims by using model-dependent instead of shared reference standards
- ReLiF's violation-rate feedback controller prevents fairness constraints from dominating training while maintaining evaluation rigor
- Proper fairness auditing reveals genuine utility-fairness trade-offs that conventional methods obscure in healthcare and vision tasks
- Task-balancing baselines can achieve lower bias than specialized fairness methods under consistent evaluation standards