Researchers introduce FOGO, a new optimizer that addresses gradient interference during neural network training by orthogonalizing momentum updates and storing past directions in compressed memory. The method shows improvements over Adam and Muon across diverse tasks including continual learning, class-imbalanced classification, and large language model training.
FOGO reframes forgetting as a phenomenon present in all neural network training rather than a problem exclusive to continual learning. The core insight is that dominant gradient directions suppress valuable but infrequent updates, so some information is lost at every training step. This perspective unifies two traditionally separate research areas and suggests that optimization quality can improve within existing computational budgets.
The technical contribution leverages spectral orthogonalization combined with random projection-based memory to maintain historical gradient directions without significant storage overhead. By preventing gradient monopolization and resolving conflicts through lightweight corrections, FOGO maintains a more diverse optimization trajectory. This builds on established principles in optimization theory while introducing practical mechanisms for their implementation at scale.
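To make the idea of spectral orthogonalization concrete, here is a minimal NumPy sketch. It is not FOGO's actual update rule, which the summary above does not spell out. It assumes the momentum is a 2-D matrix and replaces it with its polar factor via an SVD, which sets every singular value to 1. The function name `orthogonalize` and the toy matrix are illustrative assumptions.

```python
import numpy as np


def orthogonalize(momentum: np.ndarray) -> np.ndarray:
    """Return the polar factor U @ V^T of `momentum` (all singular values become 1).

    Hypothetical helper for illustration; not taken from the FOGO paper.
    """
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    return u @ vt


# Build a toy 4x2 momentum matrix with one dominant direction (singular
# value 10) and one weak direction (singular value 0.1).
rng = np.random.default_rng(0)
q_left, _ = np.linalg.qr(rng.standard_normal((4, 2)))   # orthonormal columns
q_right, _ = np.linalg.qr(rng.standard_normal((2, 2)))  # orthogonal matrix
momentum = q_left @ np.diag([10.0, 0.1]) @ q_right.T

update = orthogonalize(momentum)

# Singular values before and after: the weak direction is no longer
# outweighed 100:1 by the dominant one.
sv_before = np.linalg.svd(momentum, compute_uv=False)
sv_after = np.linalg.svd(update, compute_uv=False)
print("before:", sv_before)
print("after: ", sv_after)
```

In the raw momentum the weak direction contributes about 1% as much as the dominant one. After orthogonalization both directions carry equal weight, which is one way to read "preventing gradient monopolization". Practical optimizers typically approximate this step with cheaper iterative schemes rather than a full SVD.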
For the machine learning community, the implications extend beyond academic interest. LLM fine-tuning and continual learning are increasingly critical for production systems, and convergence improvements translate directly into reduced computational costs and faster model deployment. The consistent outperformance across heterogeneous tasks, from image classification to pretraining, suggests FOGO's benefits generalize beyond niche applications. The method's minimal computational overhead makes adoption feasible without infrastructure changes.
Future research should examine FOGO's performance at larger model scales and its interaction with modern techniques like mixed-precision training and distributed optimization. Industry adoption hinges on integration with popular frameworks and empirical validation on production workloads. The theoretical guarantees around distance preservation in the codebook memory deserve closer study, particularly how tight they are in practice.
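The distance-preservation property of random projections can be checked empirically. The sketch below assumes a plain Gaussian random projection in the style of the Johnson-Lindenstrauss lemma. It does not reproduce FOGO's codebook construction, and the sizes (`dim`, `sketch_dim`, `n_dirs`) and helper names are hypothetical choices for illustration.

```python
import numpy as np


def make_projection(dim: int, sketch_dim: int, seed: int = 0) -> np.ndarray:
    """Gaussian random projection matrix, scaled so squared norms are preserved in expectation."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((sketch_dim, dim)) / np.sqrt(sketch_dim)


def pairwise_distances(x: np.ndarray) -> np.ndarray:
    """Euclidean distance between every pair of rows of `x`."""
    diffs = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diffs, axis=-1)


# Stand-ins for stored past update directions (random vectors here).
dim, sketch_dim, n_dirs = 4096, 512, 20
rng = np.random.default_rng(1)
directions = rng.standard_normal((n_dirs, dim))

# Compress each direction from `dim` to `sketch_dim` numbers.
proj = make_projection(dim, sketch_dim)
sketches = directions @ proj.T

# Compare distances between distinct pairs before and after compression.
iu = np.triu_indices(n_dirs, k=1)
ratios = pairwise_distances(sketches)[iu] / pairwise_distances(directions)[iu]
max_distortion = np.abs(ratios - 1.0).max()
storage_ratio = sketch_dim / dim

print(f"storage ratio: {storage_ratio:.3f}")
print(f"max relative distance distortion: {max_distortion:.3f}")
```

Shrinking `sketch_dim` saves memory but increases distortion, so rerunning the sketch with different values gives a rough feel for the trade-off that the theoretical bounds describe.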
- FOGO detects and resolves gradient interference across standard training and continual learning scenarios using spectral orthogonalization
- Compact codebook memory built on random projection preserves pairwise distances while minimizing storage requirements
- The optimizer demonstrates consistent improvements over Adam and Muon across class-imbalanced, domain-shifted, and LLM fine-tuning tasks
- Lightweight orthogonal correction and proximal steps add minimal computational overhead compared to existing optimizers
- Treating forgetting as a universal optimization phenomenon rather than a continual-learning-specific issue unifies two research areas