A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References
This research examines how the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), the metric used to train and evaluate speech separation models, breaks down when the reference signals in the training data contain noise, revealing a fundamental limitation of the current benchmark approach. The authors propose reference enhancement techniques to mitigate this issue, though results indicate that the processing introduces artifacts that limit overall quality improvements.
Speech separation technology powers voice communication systems across telecommunications, voice assistants, and hearing aids. The WSJ0-2Mix benchmark has become the de facto standard for evaluating these systems, yet this study exposes a critical flaw: SI-SDR optimization assumes clean references, but real-world training data often contains noise, creating a mismatch between metric and reality.
The research derives the mathematical implications of SI-SDR with noisy references, proving that noise either caps achievable SI-SDR scores or forces models to perpetuate noise in output. This explains why models optimized for SI-SDR improvements don't necessarily produce perceptually better speech. The authors tested reference enhancement with WHAM! dataset augmentation, reducing noise in separated speech but introducing processing artifacts that offset quality gains.
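The capping effect described above can be illustrated numerically. The following is a minimal sketch, not the authors' derivation: it uses the standard SI-SDR definition (project the estimate onto the reference, then compare the projected energy to the residual energy) and synthetic Gaussian signals as stand-ins for speech and noise. The function name `si_sdr`, the 20 dB reference SNR, and the signal length are all assumptions chosen for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-12):
    """Return the SI-SDR in dB of `estimate` measured against `reference`."""
    # Remove DC offsets so the projection only sees signal content.
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Scale factor that best aligns the reference with the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


rng = np.random.default_rng(0)
num_samples = 160_000  # assumed length, e.g. 10 s at 16 kHz

# Synthetic stand-ins (assumption): white noise instead of real speech.
clean = rng.standard_normal(num_samples)
noise = rng.standard_normal(num_samples)

# Scale the noise so the reference has an assumed SNR of 20 dB.
reference_snr_db = 20.0
gain = np.sqrt(np.dot(clean, clean) / np.dot(noise, noise))
noise = noise * gain * 10.0 ** (-reference_snr_db / 20.0)
noisy_reference = clean + noise

# A perfectly clean estimate is penalized for lacking the reference noise.
score_clean_estimate = si_sdr(clean, noisy_reference)
# An estimate that reproduces the noise matches the reference exactly.
score_noisy_estimate = si_sdr(noisy_reference, noisy_reference)

print(f"clean estimate vs noisy reference: {score_clean_estimate:.2f} dB")
print(f"noisy estimate vs noisy reference: {score_noisy_estimate:.2f} dB")
```

Under these assumptions, the perfectly clean estimate scores close to the SNR of the reference itself, while the estimate that copies the reference noise scores far higher. When speech and noise are uncorrelated, the score of a clean estimate works out to roughly the reference SNR, which is one way to read the claim that noise either caps the score or is rewarded in the output.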
This matters for engineers building production speech systems. They currently rely on SI-SDR as their optimization target, but this work demonstrates that metric improvements do not necessarily translate to a better user experience when references contain noise. The negative correlation found between SI-SDR and perceived noisiness across multiple test sets supports this disconnect, suggesting practitioners need multi-metric evaluation strategies rather than SI-SDR alone.
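One simple way to check for this kind of disconnect in a multi-metric evaluation is to correlate SI-SDR with a noisiness rating across systems. The sketch below is illustrative only: `pearson_r` is a hypothetical helper, the numbers are made-up placeholders rather than results from the study, and the rating is assumed to be a quality-style score where higher means less audible noise (the summary does not specify the scale).

```python
import numpy as np


def pearson_r(x, y):
    """Return the Pearson correlation coefficient between two sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    denom = np.sqrt(np.dot(x_centered, x_centered) * np.dot(y_centered, y_centered))
    return float(np.dot(x_centered, y_centered) / denom)


# Placeholder values (assumption): one entry per separation system.
si_sdr_db = [10.0, 12.0, 14.0, 16.0, 18.0]
# Assumed convention: higher rating means less perceived noise.
noisiness_rating = [4.0, 3.8, 3.5, 3.1, 2.9]

r = pearson_r(si_sdr_db, noisiness_rating)
print(f"Pearson r between SI-SDR and noisiness rating: {r:.3f}")
```

With this assumed convention, a negative coefficient would mean that systems with higher SI-SDR tend to sound noisier, which is the pattern a single-metric evaluation would miss. In practice the ratings would come from listening tests or a perceptual quality predictor rather than hand-entered values.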
Future development should focus on noise-aware training objectives and reference preprocessing techniques that don't introduce artifacts. The findings push the community toward more robust benchmarks that account for real-world training data conditions rather than idealized clean references.
- SI-SDR metric optimization fails to improve perceived speech quality when training references contain noise
- Current speech separation benchmarks use clean reference assumptions that don't match real-world training data
- Reference enhancement reduces noise but introduces artifacts that limit overall quality improvements
- Strong negative correlation exists between SI-SDR scores and actual perceived noisiness in separated speech
- Multi-metric evaluation strategies are necessary for developing speech separation models with practical performance