AI | Neutral | Importance 6/10

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

arXiv – CS AI | Stuart Whipp | June 19, 2026 at 04:00 AM

AI Summary

Researchers measured how linear transformer feed-forward network (FFN) blocks actually are across multiple language models, finding that linearity varies dramatically between adjacent blocks and is learned during training rather than determined by architecture. The result supports targeted compression of highly linear blocks and exposes a methodological pitfall in how linear baselines are fit when evaluating transformer models.

Analysis

Transformer feed-forward networks are typically assumed to function as nonlinear computational units, but this study provides the first systematic measurement of their actual linearity through linear recoverability (R²_lin). Each FFN block is decomposed into its optimal linear approximation plus a residual, which quantifies how much of the block's output variance a single linear map of its input explains.
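As a rough illustration of the idea, the sketch below fits the best affine map from a block's inputs to its outputs in closed form and reports the fraction of output variance it explains. The summary does not give the paper's exact formula, so the definition used here (1 minus residual sum of squares over total sum of squares, pooled across output dimensions) is an assumption, as are the function names `linear_recoverability` and `gelu`, the toy random weights, and the Gaussian inputs.

```python
import numpy as np

def linear_recoverability(X, Y):
    """R^2 of the best affine map X -> Y, fit by closed-form least squares.

    Assumed definition: 1 - SS_res / SS_tot, pooled over all output dimensions.
    """
    # Append a column of ones so the fit includes a bias term.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    residual = Y - X1 @ W
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

# Toy stand-in for one FFN block: random weights, Gaussian inputs (hypothetical data).
rng = np.random.default_rng(0)
n, d = 4096, 32
X = rng.standard_normal((n, d))
W_in = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
W_out = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)

Y_linear = X @ W_in @ W_out        # block with the nonlinearity removed
Y_ffn = gelu(X @ W_in) @ W_out     # block with the GELU nonlinearity

r2_linear = linear_recoverability(X, Y_linear)
r2_ffn = linear_recoverability(X, Y_ffn)
print(f"R2_lin without GELU: {r2_linear:.4f}")
print(f"R2_lin with GELU:    {r2_ffn:.4f}")
```

On a real model, `X` and `Y` would be the recorded inputs and outputs of a single FFN block over a sample of tokens; the random block here only shows the mechanics of the measurement.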

The findings demonstrate significant heterogeneity across transformer blocks, with linear recoverability ranging from near-perfect (>0.99) to highly nonlinear (<0.3) even within the same model. Critically, recoverability is not determined by architectural choices such as the activation function: models that share the same GELU-based design exhibit sharply different linearity profiles, indicating that linearity is a property learned during training rather than an inherent structural characteristic.

These findings bear directly on model compression. Blocks with high linear recoverability can be replaced with single-layer linear approximations at a substantial parameter reduction (an 8x reduction is demonstrated on GPT-2's early layers with minimal perplexity impact), while low recoverability flags blocks that are risky to compress. The analysis also exposes a methodological concern: linear baselines fit by iterative training often under-converge on transformer activations because the problem is ill-conditioned, so the exact closed-form least-squares solution is needed as ground truth.
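One way to see where a figure of about 8x can come from is to count parameters. The sketch below assumes the standard GPT-2 small dimensions (model width 768, FFN hidden width 3072) and assumes the replacement is a single square linear map with a bias. The summary does not state which GPT-2 size or which replacement shape the paper uses, so both are illustrative assumptions, and the helper names are hypothetical.

```python
def ffn_params(d_model, d_hidden, bias=True):
    """Parameters in a two-layer FFN: d_model -> d_hidden -> d_model."""
    count = d_model * d_hidden + d_hidden * d_model
    if bias:
        count += d_hidden + d_model
    return count

def linear_params(d_model, bias=True):
    """Parameters in a single d_model -> d_model linear map."""
    count = d_model * d_model
    if bias:
        count += d_model
    return count

d_model = 768            # GPT-2 small width (assumed model size)
d_hidden = 4 * d_model   # standard 4x FFN expansion

ffn = ffn_params(d_model, d_hidden)
lin = linear_params(d_model)
ratio = ffn / lin
print(ffn, lin, round(ratio, 3))
```

Ignoring biases, the FFN holds 2 x 4 x d² = 8d² weights against d² for the single map, which is where the factor of 8 comes from under these assumptions.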

For the broader AI field, this work bridges theory and practice by providing an optimizer-free diagnostic tool for understanding transformer internals. It suggests that transformer training implicitly learns which computations should be linear and which nonlinear, a finding that could inform future architecture designs and training procedures that optimize for efficiency and interpretability.

Key Takeaways

→ Transformer FFN linearity varies dramatically between blocks and is learned during training, not determined by architecture
→ Low-recoverability blocks resist compression while high-recoverability blocks can be replaced at 8x parameter reduction with minimal performance loss
→ The residual nonlinearity in low-recoverability blocks exhibits higher-order or distributed structure, not simple position-wise interactions
→ Linear baseline training often under-converges on transformer activations, necessitating closed-form least-squares solutions for accurate measurement
→ Block-level linearity patterns are heterogeneous and non-monotonic with depth, suggesting fine-grained learned optimization strategies
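The under-convergence point can be reproduced on synthetic data. The sketch below builds an exactly linear target over inputs whose feature scales span three orders of magnitude, then compares plain full-batch gradient descent with the closed-form least-squares fit. The data, the choice of optimizer, the step size, and the iteration count are all illustrative assumptions; the summary does not say how the paper's linear baselines were trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 16

# Ill-conditioned inputs: feature scales span three orders of magnitude.
scales = np.logspace(0, -3, d)
X = rng.standard_normal((n, d)) * scales

# Exactly linear target that relies equally on every feature,
# including the low-variance ones, so the true R^2 is 1.
W_true = rng.standard_normal((d, d))
Y = (X / scales) @ W_true

def r2(X, Y, W):
    ss_res = np.sum((Y - X @ W) ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

# Closed-form least squares: the optimizer-free reference.
W_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Full-batch gradient descent on mean squared error, step size 1/L,
# where L is the largest eigenvalue of the input covariance.
L = np.linalg.eigvalsh(X.T @ X / n).max()
W_gd = np.zeros((d, d))
for _ in range(500):
    grad = X.T @ (X @ W_gd - Y) / n
    W_gd -= grad / L

r2_ls = r2(X, Y, W_ls)
r2_gd = r2(X, Y, W_gd)
print(f"closed-form R2:      {r2_ls:.4f}")
print(f"gradient-descent R2: {r2_gd:.4f}")
```

Gradient descent learns the high-variance directions quickly but barely moves along the low-variance ones, so it understates the linearity of a map that is in fact perfectly linear. A baseline fit this way would make a block look more nonlinear than it is.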


#transformers #neural-networks #model-compression #feed-forward-blocks #linearity-measurement #gpt-2 #llama #optimization

