ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory-Efficient LLM Fine-Tuning
Researchers propose ESSAM, a novel training framework combining Evolution Strategies with Sharpness-Aware Minimization to fine-tune large language models for mathematical reasoning while dramatically reducing GPU memory requirements. The approach achieves accuracy comparable to reinforcement learning methods like PPO and GRPO while using 10-18× less memory, addressing a critical bottleneck in LLM development.
The research tackles a fundamental constraint in large language model development: the prohibitive computational cost of reinforcement learning-based fine-tuning. As LLMs grow larger and organizations seek to improve reasoning capabilities, GPU memory demands have become a practical barrier for many developers and research teams. ESSAM addresses this by combining zeroth-order optimization through Evolution Strategies with sharpness-aware techniques, enabling parameter updates without the gradient and optimizer-state overhead typical of standard RL approaches.
This work emerges from ongoing efforts to democratize advanced LLM training. While reinforcement learning has proven effective for improving mathematical reasoning on benchmarks like GSM8K, the computational requirements have confined this approach to well-resourced institutions. Evolution Strategies offer a memory-efficient alternative by evaluating multiple candidate solutions rather than maintaining large gradient buffers, fundamentally changing the resource calculus of LLM improvement.
The empirical results demonstrate genuinely competitive performance: ESSAM reaches 78.27% average accuracy, essentially matching GRPO's 78.34%, while consuming a fraction of the memory. The generalization experiments across multiple datasets suggest the approach produces models with stronger robustness rather than merely memorizing task-specific patterns. An accelerated variant achieves a nearly 2× speedup while maintaining memory efficiency, indicating further room for optimization.
For the AI development landscape, this research signals that memory-intensive RL fine-tuning may not remain the exclusive domain of large-scale labs. Smaller organizations and individual researchers could access previously unavailable training methodologies. The work establishes an important precedent that algorithmic innovation can overcome hardware limitations, potentially reshaping competitive dynamics in LLM development and deployment.
- ESSAM reduces GPU memory usage by 18× versus PPO and 10× versus GRPO while maintaining competitive accuracy on mathematical reasoning tasks
- The framework combines Evolution Strategies with Sharpness-Aware Minimization to achieve full-parameter fine-tuning without high memory overhead (a rough sketch of how such a combination could look follows this list)
- Models trained with ESSAM demonstrate superior generalization, achieving best performance on 5 of 6 tested datasets
- An accelerated variant achieves a 2× speedup while maintaining low memory usage and outperforming the PPO baseline
- The approach democratizes advanced LLM training by lowering the memory requirements that previously confined such training to large-scale infrastructure
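The article does not describe how the sharpness-aware component is wired into the ES loop. The sketch below is only one plausible reading, layering a SAM-style inner perturbation on top of a zeroth-order gradient estimate such as the one shown earlier; `es_gradient_estimate` and `rho` are assumed names, not an interface taken from the paper.

```python
import numpy as np

def sharpness_aware_es_step(theta, evaluate_reward, es_gradient_estimate,
                            rho=0.05, lr=0.01):
    """One sharpness-aware, zeroth-order update (a hypothetical combination, not the paper's algorithm).

    es_gradient_estimate(theta, evaluate_reward) -> zeroth-order estimate of the reward
    gradient, e.g. the antithetic ES estimator sketched earlier in this article.
    """
    # 1. Estimate the reward gradient at the current parameters.
    g = es_gradient_estimate(theta, evaluate_reward)
    # 2. SAM-style inner step: move to the lowest-reward point in a rho-ball around theta
    #    (reward maximisation mirrors loss minimisation with the sign flipped).
    theta_worst = theta - rho * g / (np.linalg.norm(g) + 1e-12)
    # 3. Re-estimate the gradient at that worst-case point and update the real parameters
    #    with it, which biases the search toward flat regions of the reward landscape.
    g_worst = es_gradient_estimate(theta_worst, evaluate_reward)
    return theta + lr * g_worst
```

Under this reading, the sharpness-aware step costs one extra gradient estimate per update but no additional persistent memory, which would be consistent with the memory figures quoted above and with the generalization behaviour the bullets describe.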