Performance and Complexity Trade-off Optimization of Speech Models During Training
Researchers propose a novel reparameterization technique using feature noise injection that enables joint optimization of speech model performance and computational complexity during training via gradient descent. Unlike post-hoc methods such as pruning or quantization, this approach optimizes model size dynamically, without heuristic weight-selection criteria. The authors demonstrate it on voice activity detection and audio anti-spoofing.
The paper addresses a fundamental inefficiency in neural network design: the separation between performance optimization and computational cost reduction. Traditional machine learning workflows train models to maximize task performance, then apply post-hoc techniques like quantization or pruning to reduce computational overhead. This two-stage approach can miss better performance-complexity trade-offs, because standard gradient descent cannot directly optimize non-differentiable quantities such as layer sizes and floating-point operation counts.
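To make the non-differentiability concrete, the sketch below counts multiply-accumulate operations (MACs) for a small two-layer dense network. The function name `dense_macs` and the layer sizes are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def dense_macs(n_in: int, n_hidden: int, n_out: int) -> int:
    """Multiply-accumulate count of a two-layer dense network (biases ignored)."""
    return n_in * n_hidden + n_hidden * n_out


# Hypothetical sizes: 40 input features, 2 output classes.
for hidden in (63, 64, 65):
    print(hidden, dense_macs(40, hidden, 2))
```

The hidden width is an integer, so the cost is defined only at whole-number widths and moves in fixed jumps (42 MACs per added unit here). There is no continuous parameter for gradient descent to follow, which is why width is usually set by hand or by a separate search.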
The proposed reparameterization technique bridges this gap by injecting controlled feature noise that enables gradient-based optimization of both objectives simultaneously. This represents a methodological advancement in model efficiency research, particularly relevant as speech models are increasingly deployed on edge devices with computational constraints. The approach differs fundamentally from existing methods by treating model compression as an integrated design problem rather than a post-training cleanup task.
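This summary does not give the paper's actual formulation, so the following is only a toy sketch of the general idea that a learnable noise level can act as a differentiable stand-in for size. Everything in it is an assumption made for illustration: the linear model, the closed-form expected loss, the use of Gaussian channel capacity as the complexity proxy, and the names `TRUE_COEF`, `bits`, and `train`.

```python
import math

# Toy target (assumed): y = 2*x0 - x1, so features x2 and x3 are uninformative.
# All features are taken to be independent, zero-mean, unit-variance.
TRUE_COEF = [2.0, -1.0, 0.0, 0.0]


def bits(log_sigma: float) -> float:
    """Complexity proxy (an assumption, not the paper's measure):
    capacity in bits of a Gaussian channel with SNR = 1 / sigma^2."""
    return 0.5 * math.log2(1.0 + math.exp(-2.0 * log_sigma))


def train(lam: float = 0.05, lr: float = 0.02, steps: int = 3000):
    """Jointly fit weights w and per-feature log noise scales s.

    Model: y_hat = sum_i w_i * (x_i + exp(s_i) * eps_i), eps_i ~ N(0, 1).
    For this toy setup the expected squared error has a closed form,
        sum_i (a_i - w_i)^2 + w_i^2 * exp(2 * s_i),
    so no noise is sampled. The objective adds lam * sum_i bits(s_i).
    """
    w = [0.0] * len(TRUE_COEF)
    s = [0.0] * len(TRUE_COEF)
    for _ in range(steps):
        for i, a in enumerate(TRUE_COEF):
            noise_var = math.exp(2.0 * s[i])
            # Gradient of the expected task loss with respect to w_i.
            grad_w = 2.0 * ((1.0 + noise_var) * w[i] - a)
            # Task term pushes noise down; complexity term pushes it up.
            grad_s = (2.0 * w[i] ** 2 * noise_var
                      - lam / (math.log(2.0) * (1.0 + noise_var)))
            w[i] -= lr * grad_w
            s[i] -= lr * grad_s
    return w, s


w, s = train()
for i, (wi, si) in enumerate(zip(w, s)):
    print(f"feature {i}: weight={wi:+.3f}  bits={bits(si):.2f}")
```

In this toy setting the two informative features keep a noise level low enough to carry a few bits each, while the noise on the two uninformative features keeps growing and their bit count falls toward zero, marking them as removable. The hypothetical weight `lam` sets the exchange rate between task error and complexity, and both are minimized by the same gradient steps.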
For practitioners developing speech models, this technique could reduce deployment friction by eliminating separate optimization phases. Machine learning teams currently spend considerable resources on manual architecture tuning and compression pipelines; integrated optimization promises faster iteration cycles and potentially superior final models. The method is demonstrated on synthetic examples and two real-world speech tasks, which suggests broad relevance for the speech processing community.
Future work should test whether this approach generalizes beyond speech models to computer vision and NLP tasks. The open-source code release enables rapid adoption and independent validation. The technique is most relevant to resource-constrained deployment scenarios, where traditional train-then-compress workflows prove inefficient.
- Novel reparameterization with feature noise injection enables simultaneous optimization of model performance and computational complexity during training.
- Removes the need for a separate post-hoc pruning or quantization stage by treating model compression as an integrated design objective.
- Demonstrated on two speech processing tasks: voice activity detection and audio anti-spoofing.
- Addresses the non-differentiability problem that prevents standard gradient descent from optimizing layer sizes and computational costs.
- Open-source implementation encourages broader adoption and validation across different neural network architectures and domains.