FragileFlow: Spectral Control of Correct-but-Fragile Predictions for Foundation Model Robustness
FragileFlow introduces a theoretical framework and practical regularizer to detect and mitigate a hidden failure mode in large language models and vision-language models where predictions remain technically correct but confidence margins narrow dangerously. The research provides the first PAC-Bayes bounds for margin-aware error flow, addressing robustness gaps that standard accuracy metrics overlook.
Foundation models present a measurement paradox: aggregate accuracy metrics fail to capture structured instability where correct predictions teeter near decision boundaries. FragileFlow addresses this by formalizing "correct-but-fragile" predictions—outputs that remain accurate under clean conditions but become vulnerable to perturbations as probability mass drifts toward competing classes. This phenomenon represents a critical safety concern for deployed systems where marginal robustness failures could compound across tasks.
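The core idea of a "correct-but-fragile" prediction can be illustrated with a softmax-margin check. The sketch below is an assumed instantiation, not FragileFlow's actual detector: it flags predictions that are correct but whose top-1 minus runner-up probability margin falls below a hypothetical threshold `margin_tau`.

```python
import numpy as np

def fragile_correct_mask(logits, labels, margin_tau=0.1):
    """Flag predictions that are correct yet fragile: the softmax margin
    (top-1 probability minus runner-up probability) is below margin_tau.
    margin_tau is an illustrative threshold, not a value from the paper."""
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    preds = probs.argmax(axis=1)
    sorted_p = np.sort(probs, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]  # top-1 minus top-2
    return (preds == labels) & (margin < margin_tau)

logits = np.array([[4.0, 0.0, 0.0],   # confident and correct
                   [1.1, 1.0, 0.0],   # correct, but margin is tiny
                   [0.0, 3.0, 0.0]])  # incorrect
labels = np.array([0, 0, 0])
mask = fragile_correct_mask(logits, labels)
```

Only the second example is flagged: it is classified correctly, yet its probability mass is nearly split between the true class and a competitor, which is exactly the regime where small perturbations flip the prediction.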
The research emerges from growing recognition that average-case robustness benchmarks obscure worst-case performance degradation. Previous work emphasized consistency under perturbations without examining the spectral properties of probability distributions around decision boundaries. FragileFlow's margin-aware error-flow formulation directly targets this gap by constructing a vulnerable-risk matrix that tracks class-wise probability leakage patterns.
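One plausible way to realize a class-wise leakage tracker is a matrix whose entry (c, k) records how much probability mass examples of true class c leak toward class k even when classified correctly. This is a hedged sketch under that assumption; the paper's vulnerable-risk matrix may be constructed differently.

```python
import numpy as np

def leakage_matrix(probs, labels, num_classes):
    """Assumed sketch of a class-wise leakage matrix: entry (c, k) is the
    mean probability assigned to class k over correctly predicted examples
    whose true class is c. Off-diagonal mass is 'leakage' toward competing
    classes despite correct top-1 predictions."""
    V = np.zeros((num_classes, num_classes))
    preds = probs.argmax(axis=1)
    for c in range(num_classes):
        mask = (labels == c) & (preds == c)
        if mask.any():
            V[c] = probs[mask].mean(axis=0)
    return V

probs = np.array([[0.6, 0.4],
                  [0.9, 0.1],
                  [0.3, 0.7]])
labels = np.array([0, 0, 1])
V = leakage_matrix(probs, labels, num_classes=2)
```

Large off-diagonal entries identify class pairs where accuracy is maintained only narrowly, pinpointing where worst-class robustness is most likely to break first.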
The theoretical contribution—a PAC-Bayes upper bound with deterministic worst-class robustness guarantees under stability conditions—provides formal grounding often missing from empirical robustness work. Empirical validation across multiple-choice LLM benchmarks and few-shot CLIP adaptation demonstrates consistent improvements in the risk measures the theory targets, such as worst-class accuracy under perturbation, while maintaining clean accuracy, suggesting the approach doesn't trade performance for safety.
The implications extend beyond academic interest. As foundation models integrate into mission-critical applications, understanding fragile-correctness patterns becomes essential for risk assessment. The plug-in regularizer design enables practical deployment without architectural modification, lowering implementation barriers. However, the stability conditions required for theoretical guarantees may not hold universally across all deployment contexts.
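A generic plug-in margin regularizer can be sketched as a penalty added to the standard training loss, with no change to the model itself. This is not FragileFlow's exact objective; it assumes a simple hinge on the true-class logit margin with illustrative hyperparameters `lam` and `tau`.

```python
import numpy as np

def margin_regularized_loss(logits, labels, lam=0.5, tau=1.0):
    """Hedged sketch of a plug-in margin regularizer (not the paper's exact
    loss): cross-entropy plus a hinge penalty charging any example whose
    true-class logit margin over the best competing class falls below tau."""
    n = logits.shape[0]
    # Cross-entropy via a numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(n), labels].mean()
    # Logit margin: true-class logit minus the best competitor's logit.
    true_logit = logits[np.arange(n), labels]
    others = logits.copy()
    others[np.arange(n), labels] = -np.inf
    margin = true_logit - others.max(axis=1)
    penalty = np.maximum(0.0, tau - margin).mean()
    return ce + lam * penalty
```

Because the penalty only inspects logits, it composes with any classifier head, matching the "no architectural modification" property the summary describes.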
- FragileFlow detects correct-but-fragile predictions by identifying when probability mass flows toward wrong classes despite maintaining overall accuracy.
- The research provides the first PAC-Bayes theoretical bounds for margin-aware error-flow robustness in foundation models.
- The method works as a plug-in regularizer compatible with existing LLM and VLM architectures without requiring architectural modification.
- Experiments show consistent improvements in worst-class accuracy under perturbations while preserving clean performance.
- The framework reveals why standard accuracy metrics fail to capture structured failure modes in foundation model robustness.