Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization
Researchers introduce Singularity-aware Adam (S-Adam), a novel optimizer addressing instability in deep learning with non-smooth components like ReLU activations. The method uses a Local Geometric Instability metric to dynamically adjust step sizes, demonstrating up to 6% accuracy improvements on benchmark datasets while mitigating gradient oscillations.
Modern deep learning architectures introduce non-smooth elements that violate the smoothness assumptions of traditional optimization theory, creating challenges for adaptive optimizers like Adam. Gradient chattering (violent oscillations caused by conflicting signals in the Clarke subdifferential) degrades convergence and generalization, particularly in quantization-aware training and small-batch scenarios. S-Adam addresses this limitation with a computationally efficient Local Geometric Instability (LGI) metric that estimates the diameter of the subdifferential from randomized directional derivatives, enabling real-time detection of unstable regions.
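To make the idea of probing the subdifferential with random directions concrete, here is a minimal sketch. It is not the paper's definition of LGI, which this summary does not give. It assumes, purely for illustration, that instability is measured as the largest symmetric finite difference over a handful of random unit directions. The function name `estimate_lgi` and the parameters `num_probes` and `eps` are hypothetical.

```python
import numpy as np

def estimate_lgi(f, x, num_probes=8, eps=1e-3, rng=None):
    """Illustrative instability score (assumed form, not the paper's formula)."""
    rng = np.random.default_rng() if rng is None else rng
    fx = f(x)
    widths = []
    for _ in range(num_probes):
        # Draw a random direction and normalize it to unit length.
        d = rng.standard_normal(x.shape)
        d /= np.linalg.norm(d)
        # Sum of the one-sided directional derivative estimates along d and -d.
        # At a kink this approximates the width of the subdifferential along d;
        # for a smooth function it shrinks toward zero as eps -> 0.
        width = (f(x + eps * d) + f(x - eps * d) - 2.0 * fx) / eps
        widths.append(abs(width))
    # The largest width over all probes serves as a proxy for the diameter.
    return max(widths)

rng = np.random.default_rng(0)
# |w| at w = 0 has Clarke subdifferential [-1, 1], so the score is close to 2.
kink_lgi = estimate_lgi(lambda w: np.abs(w).sum(), np.zeros(1), rng=rng)
# A smooth quadratic gives a score on the order of eps.
smooth_lgi = estimate_lgi(lambda w: (w ** 2).sum(), np.ones(3), rng=rng)
```

Each probe costs two extra function evaluations, so in this sketch the number of probes trades detection quality against overhead.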
The optimizer's adaptive damping mechanism exponentially decelerates updates in high-instability zones while maintaining acceleration in smooth regions of the loss landscape, balancing exploration and stability. This approach builds on differential inclusion theory, providing formal convergence guarantees to Clarke stationary points at the optimal O(1/√T) rate, which matches theoretical benchmarks for non-smooth optimization. The method proves particularly valuable for quantization-aware training, where discrete operations create inherent non-smoothness, and for distributed learning with small batch sizes, where noisy gradients amplify instability.
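The damping idea can be sketched as a standard Adam update whose step size is scaled down exponentially by an instability score. The exact rule used by S-Adam is not stated in this summary. The factor `exp(-lam * lgi)`, the function name `damped_adam_step`, and the parameter `lam` are assumptions made for illustration.

```python
import numpy as np

def damped_adam_step(param, grad, m, v, t, lgi,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=1.0):
    """One Adam step with an assumed exponential damping factor exp(-lam * lgi)."""
    # Standard Adam moment estimates with bias correction (t starts at 1).
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Hypothetical damping: equals 1 when lgi == 0, decays as lgi grows.
    damping = np.exp(-lam * lgi)
    param = param - lr * damping * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

p0 = np.zeros(2)
g = np.array([1.0, -2.0])
zeros = np.zeros(2)
# With lgi = 0 the update reduces to a plain Adam step.
p_smooth, _, _ = damped_adam_step(p0, g, zeros, zeros, t=1, lgi=0.0, lr=0.1)
# With lgi = 2 the same step is scaled by exp(-2).
p_kink, _, _ = damped_adam_step(p0, g, zeros, zeros, t=1, lgi=2.0, lr=0.1)
```

In this sketch a zero score leaves Adam unchanged, so any slowdown is confined to regions that the instability score flags.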
Empirical results show consistent improvements on the CIFAR-100 and TinyImageNet benchmarks, with accuracy gains of up to 6% and 3%, respectively, over AdamW and Prox-SGD. Beyond its academic significance, this advance matters for deploying quantized neural networks on edge devices and mobile platforms, where computational efficiency depends on stable training. The work addresses a growing gap between theoretical optimization assumptions and practical architectural realities, and is relevant to practitioners implementing state-of-the-art models with non-smooth activation functions and quantization operators.
- S-Adam introduces a Local Geometric Instability metric to detect and mitigate gradient chattering in non-smooth loss landscapes.
- It achieves accuracy improvements of up to 6% on CIFAR-100 and 3% on TinyImageNet compared to AdamW and Prox-SGD.
- It provides formal convergence guarantees to Clarke stationary points at the optimal O(1/√T) rate using differential inclusion theory.
- It is particularly beneficial for quantization-aware training and small-batch learning scenarios with high gradient noise.
- Its adaptive damping mechanism balances convergence speed in smooth regions with stability in geometrically unstable areas.