arXiv – CS AI
From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression
Researchers introduce SubFit, a post-training compression method for large language models (LLMs) that works at the granularity of individual submodules rather than whole layers, and report a better perplexity-accuracy trade-off as a result. SubFit selects non-contiguous Attention and FeedForward submodules and replaces each one with its own fitted residual bypass. At 25% sparsity it retains 84.6% of downstream accuracy, compared with 81.6% for existing methods.
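The summary does not say how SubFit fits its bypasses or ranks candidate submodules. The sketch below is only an illustration of submodule-level replacement under our own assumptions, not the paper's method: each submodule's output delta f(x) is approximated by a per-channel affine bypass a*x + b fitted by least squares on calibration activations, "sparsity" is read as the fraction of parameters removed, and submodules are chosen greedily by bypass fit error. The names `Submodule`, `fit_affine_bypass`, `bypass_error`, and `select_submodules` are hypothetical.

```python
"""Hypothetical sketch of submodule-level replacement with fitted bypasses.

Assumptions (not from the paper): a per-channel affine bypass a*x + b is
fitted by least squares to each submodule's output delta on calibration
inputs, and submodules are picked greedily by fit error until a target
fraction of parameters is removed.
"""
from dataclasses import dataclass


@dataclass
class Submodule:
    name: str      # hypothetical label, e.g. "L3.attn" or "L3.ffn"
    params: int    # parameters removed if this submodule is replaced
    inputs: list   # calibration inputs: list of vectors (list[float])
    deltas: list   # submodule outputs f(x) for those inputs, same shape


def fit_affine_bypass(xs, ys):
    """Per-channel least squares: ys[n][j] ~= a[j] * xs[n][j] + b[j]."""
    count, dim = len(xs), len(xs[0])
    a, b = [], []
    for j in range(dim):
        mx = sum(x[j] for x in xs) / count
        my = sum(y[j] for y in ys) / count
        var = sum((x[j] - mx) ** 2 for x in xs)
        cov = sum((x[j] - mx) * (y[j] - my) for x, y in zip(xs, ys))
        slope = cov / var if var > 0 else 0.0
        a.append(slope)
        b.append(my - slope * mx)
    return a, b


def bypass_error(sub):
    """Mean squared error of the fitted bypass against the true delta."""
    a, b = fit_affine_bypass(sub.inputs, sub.deltas)
    total = 0.0
    for x, y in zip(sub.inputs, sub.deltas):
        total += sum((a[j] * x[j] + b[j] - y[j]) ** 2 for j in range(len(x)))
    return total / (len(sub.inputs) * len(sub.inputs[0]))


def select_submodules(subs, total_params, sparsity):
    """Greedily replace the easiest-to-approximate submodules, wherever they
    sit in the network (so the chosen set may be non-contiguous), until at
    least `sparsity` of `total_params` is removed. Ignores the bypasses'
    own small parameter cost."""
    budget = sparsity * total_params
    chosen, removed = [], 0
    for sub in sorted(subs, key=bypass_error):
        if removed >= budget:
            break
        chosen.append(sub.name)
        removed += sub.params
    return chosen, removed
```

Because Attention and FeedForward submodules are ranked independently, this greedy pass can replace one half of a layer while keeping the other, and it can skip over submodules in between; a layer-level variant would rank whole Attention+FeedForward pairs instead. For example, with four 100-parameter submodules and `sparsity=0.5`, it picks the two with the lowest bypass error even if they are not adjacent.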