Researchers propose a Human-Centered Benchmarking Framework that evaluates driver monitoring AI models on accuracy, explainability, efficiency, and robustness rather than on accuracy alone. Tests of four lightweight architectures on eye-state classification show that the models perform similarly on clean data but each excels in a different dimension. Critically, the top-ranked model fails under sensor noise by misclassifying closed eyes as open, a safety-critical vulnerability.
This research addresses a fundamental gap in how safety-critical AI systems are validated before deployment. Driver monitoring systems directly affect human safety, yet the industry standard of benchmarking solely on classification accuracy masks critical failure modes that emerge in real-world conditions. The study demonstrates that a model ranking first on aggregate human-centered metrics can retain less than 50% of its performance under sensor noise, a common deployment scenario, while committing the most dangerous error type: falsely identifying closed eyes as open.
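The two quantities behind that finding can be made concrete with a minimal sketch. It assumes that "retained performance" means the ratio of noisy-set accuracy to clean-set accuracy, and that labels are encoded as 0 for closed and 1 for open. Neither the definition nor the encoding is confirmed by the study, and the predictions below are made-up toy values, not the paper's results.

```python
# Hypothetical label encoding (an assumption, not taken from the study).
CLOSED, OPEN = 0, 1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def retention(clean_acc, noisy_acc):
    """Assumed definition: share of clean-set accuracy kept under noise."""
    return noisy_acc / clean_acc


def closed_as_open_rate(y_true, y_pred):
    """Share of truly closed eyes predicted as open (the dangerous error)."""
    preds_for_closed = [p for t, p in zip(y_true, y_pred) if t == CLOSED]
    if not preds_for_closed:
        return 0.0
    return sum(p == OPEN for p in preds_for_closed) / len(preds_for_closed)


# Toy data: four closed-eye samples followed by four open-eye samples.
y_true = [CLOSED] * 4 + [OPEN] * 4
clean_pred = [0, 0, 0, 0, 1, 1, 1, 0]  # one mistake on clean input
noisy_pred = [1, 1, 1, 0, 1, 1, 1, 1]  # noise pushes predictions toward "open"

clean_acc = accuracy(y_true, clean_pred)
noisy_acc = accuracy(y_true, noisy_pred)

print(f"clean accuracy:      {clean_acc:.3f}")
print(f"noisy accuracy:      {noisy_acc:.3f}")
print(f"retention:           {retention(clean_acc, noisy_acc):.3f}")
print(f"closed-as-open rate: {closed_as_open_rate(y_true, noisy_pred):.3f}")
```

Reporting the closed-as-open rate separately matters because overall accuracy treats both error directions alike, whereas only one of them leaves a drowsy driver undetected.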
The broader context reflects growing recognition across autonomous systems research that accuracy metrics are insufficient proxies for real-world safety. This benchmarking gap has persisted because establishing multi-dimensional evaluation frameworks requires domain expertise, operational data, and consensus on weighting trade-offs. The paper's contribution—formalizing four dimensions and demonstrating their independence—provides a replicable methodology applicable beyond driver monitoring to facial recognition, medical imaging, and other safety-critical vision tasks.
For developers and safety teams deploying driver monitoring, the findings carry immediate operational implications. Selecting models based on average performance could inadvertently introduce failure modes that safety testing might not catch if robustness evaluations use only synthetic noise rather than real sensor degradation profiles. The research suggests procurement and validation protocols must specify dimension-specific thresholds rather than relying on aggregate scores.
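The difference between an aggregate score and dimension-specific thresholds can be illustrated with a short sketch. Everything in it is hypothetical: the model names, the scores, the equal weighting, and the 0.6 threshold are invented for illustration and do not come from the study.

```python
DIMENSIONS = ("accuracy", "explainability", "efficiency", "robustness")


def aggregate_score(scores, weights=None):
    """Weighted mean over all dimensions (equal weights by default)."""
    if weights is None:
        weights = {d: 1 / len(DIMENSIONS) for d in DIMENSIONS}
    return sum(scores[d] * weights[d] for d in DIMENSIONS)


def failed_dimensions(scores, thresholds):
    """List every dimension whose score falls below its own minimum."""
    return [d for d in DIMENSIONS if scores[d] < thresholds[d]]


# Invented, normalized scores in [0, 1]; not results from the paper.
model_a = {"accuracy": 0.98, "explainability": 0.90,
           "efficiency": 0.95, "robustness": 0.45}
model_b = {"accuracy": 0.97, "explainability": 0.70,
           "efficiency": 0.75, "robustness": 0.80}

# Hypothetical per-dimension minimums a validation protocol might specify.
thresholds = {d: 0.6 for d in DIMENSIONS}

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    failures = failed_dimensions(scores, thresholds)
    verdict = "REJECT " + ", ".join(failures) if failures else "PASS"
    print(f"{name}: aggregate={aggregate_score(scores):.3f} -> {verdict}")
```

In this toy setup `model_a` wins on the aggregate score yet is rejected on robustness, which is the kind of masking that a single averaged number allows.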
Moving forward, stakeholders should monitor whether automotive and transportation safety standards incorporate multi-dimensional benchmarking. Standards such as ISO 21448 (SOTIF, Safety of the Intended Functionality) may need to formalize robustness and explainability requirements alongside accuracy, potentially creating new compliance obligations for AI model developers.
- Single-metric accuracy benchmarking fails to identify critical safety vulnerabilities in driver monitoring systems deployed in real-world conditions.
- The top-ranked model under aggregate human-centered scoring degrades catastrophically under sensor noise while misclassifying closed eyes as open.
- Vision models near-identical in clean-set accuracy diverge sharply across explainability, efficiency, and robustness dimensions, each dominating exactly one axis.
- Transformer-based models (DeiT-Tiny) maintain performance under noise better than optimized CNNs despite not ranking highest overall.
- Multi-dimensional evaluation frameworks are necessary for safety-critical AI deployment but remain absent from standard model comparison practices.