Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs
Researchers introduce Ghosted Layers, a training-free method for recovering the performance lost when layers are pruned from large language models. It frames recovery as an activation alignment problem and solves it with optimal linear operators. Using a small calibration set, the technique corrects the hidden-state mismatches that pruning introduces, preserving the efficiency gains while improving accuracy and perplexity across multiple LLM architectures.
Layer pruning reduces the computational cost of large language models by removing entire Transformer decoder blocks. This creates a distribution mismatch: the surviving layers were trained to receive the hidden states produced by the layers that originally preceded them, but in the pruned model they receive activations from an earlier point in the network. Ghosted Layers addresses this mismatch by deriving a closed-form optimal linear operator from a small amount of calibration data, which avoids computationally expensive retraining.
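To make the idea of a closed-form linear operator concrete, here is a minimal sketch. It assumes the alignment objective is an ordinary least-squares fit between the activations entering the pruned span and the activations the next surviving layer originally received; the source does not spell out the exact objective. The function name `fit_alignment_operator`, the small `ridge` term, and the synthetic data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def fit_alignment_operator(h_in, h_target, ridge=1e-6):
    """Return W minimizing ||h_in @ W - h_target||_F^2 + ridge * ||W||_F^2.

    h_in:     (n_tokens, d_model) calibration activations entering the pruned span
    h_target: (n_tokens, d_model) activations the next surviving layer originally saw
    ridge:    small regularizer (an assumption here) that keeps the solve well-conditioned
    """
    d_model = h_in.shape[1]
    gram = h_in.T @ h_in + ridge * np.eye(d_model)
    cross = h_in.T @ h_target
    # Solve the normal equations instead of forming an explicit inverse.
    return np.linalg.solve(gram, cross)


# Synthetic stand-in for calibration data: the removed layers are modeled
# as an unknown linear map plus a little noise (purely illustrative).
rng = np.random.default_rng(0)
n_tokens, d_model = 512, 16
h_in = rng.standard_normal((n_tokens, d_model))
true_map = rng.standard_normal((d_model, d_model))
h_target = h_in @ true_map + 0.01 * rng.standard_normal((n_tokens, d_model))

W = fit_alignment_operator(h_in, h_target)

# Mismatch if the pruned span is simply skipped, versus after alignment.
err_identity = np.linalg.norm(h_in - h_target)
err_aligned = np.linalg.norm(h_in @ W - h_target)
print(f"skip-only error: {err_identity:.2f}, aligned error: {err_aligned:.2f}")
```

In a real setting, `h_in` and `h_target` would be collected by running the calibration set through the unpruned model and recording hidden states at the two boundaries of the removed span. Because the fit is a single linear solve, no gradient-based training is involved.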
The work is motivated by the growing demand for model compression. As LLMs grow larger, practitioners look for ways to reduce inference latency and computational requirements. Previous training-free recovery methods restricted their solutions to limited operator subspaces, trading optimality for simplicity. This work instead solves the alignment problem without such constraints, so on the calibration objective its solution can do no worse than any constrained one. The research reports consistent improvements across multiple LLM backbones and pruning strategies, which suggests broad applicability.
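The advantage of an unconstrained solution can be stated in one line. The following is a sketch under assumptions that the source does not confirm: the alignment objective is taken to be a Frobenius-norm least-squares fit, $X$ and $Y$ denote the calibration activations before and after the pruned span (one token per row), $X^\top X$ is assumed invertible, and $\mathcal{S}$ stands for whatever restricted operator family a constrained method uses.

```latex
\[
W^\star \;=\; \arg\min_{W \in \mathbb{R}^{d \times d}} \lVert X W - Y \rVert_F^2
\;=\; (X^\top X)^{-1} X^\top Y ,
\]
\[
\min_{W \in \mathbb{R}^{d \times d}} \lVert X W - Y \rVert_F^2
\;\le\;
\min_{W \in \mathcal{S}} \lVert X W - Y \rVert_F^2
\qquad \text{for any } \mathcal{S} \subseteq \mathbb{R}^{d \times d} .
\]
```

The inequality holds simply because the unconstrained minimum is taken over a larger set. It is a statement about the calibration objective only and does not by itself guarantee better downstream accuracy.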
For developers and AI practitioners, Ghosted Layers offers practical value by enabling effective model compression without the resource overhead of fine-tuning. Organizations deploying LLMs at scale can achieve faster inference and lower computational costs while maintaining model quality. These gains compound at deployment scale, where even small reductions in latency and resource consumption translate into operational savings and a better user experience.
The technique's training-free nature makes it particularly valuable for proprietary models where fine-tuning data access is restricted. Future work may explore application to other neural network architectures beyond Transformers or combination with complementary pruning strategies to achieve even greater compression ratios.
- Ghosted Layers solves hidden state misalignment in pruned LLMs using closed-form optimal linear operators from minimal calibration data.
- The method achieves unconstrained optimization solutions, improving upon previous constrained approaches restricted to limited operator subspaces.
- Training-free recovery enables efficient layer pruning without retraining overhead, preserving computational gains while restoring model performance.
- Experiments show consistent accuracy and perplexity improvements across multiple LLM architectures and pruning strategies.
- The approach enables practical model compression for large-scale deployments with restricted fine-tuning access.