A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases
Researchers have developed a geometric framework for understanding how large language models process information across their layers, identifying three distinct phases in next-token prediction: Seeding Multiplexing, Hoisting Overriding, and Focal Convergence. The study reveals that model depth primarily increases capacity for candidate disambiguation rather than adding fundamentally new computational stages.
This research offers a detailed view of the internal mechanics of large language models by treating next-token prediction as a geometric process rather than a black-box phenomenon. Using representation lenses as diagnostic tools, researchers tracked how prediction capability evolves across model layers by measuring changes in effective rank and subspace geometry. The discovery of three distinct phases suggests that LLM computation follows a surprisingly consistent organizational principle across different model families and scales.
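As a toy illustration of the kind of measurement involved, the sketch below computes the entropy-based effective rank of synthetic "layer representations" whose singular-value spectra concentrate at different rates. Everything here is invented for illustration (the decay rates, matrix sizes, and the choice of the standard entropy-based definition of effective rank are assumptions, not details from the study):

```python
import numpy as np

def effective_rank(H: np.ndarray) -> float:
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular values of the centered representation matrix."""
    s = np.linalg.svd(H - H.mean(axis=0), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

# Synthetic "layers": random hidden states (256 tokens x 64 dims) whose
# spectra concentrate more sharply as the per-dimension decay gets faster.
rng = np.random.default_rng(0)
for decay in (0.99, 0.8, 0.3):
    scales = decay ** np.arange(64)          # faster decay -> flatter spectrum lost
    H = rng.normal(size=(256, 64)) * scales  # scale each embedding dimension
    print(f"decay={decay}: effective rank ~ {effective_rank(H):.1f}")
```

Tracking this single scalar layer by layer is one simple way to see the rank expansion and contraction that the three-phase description refers to.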
The implications extend beyond academic curiosity. Understanding that deeper models primarily refine candidate selection rather than introducing new computational stages challenges assumptions about scaling laws and model architecture. If disambiguation capacity scales linearly with depth while early and late phases grow slowly, this suggests current architectural designs may not optimally leverage increased model size. The observation that updates remain orthogonal to the residual stream throughout all phases indicates a fundamental constraint on how information flows through transformers.
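One way to quantify the orthogonality claim on real activations would be to measure the cosine similarity between each layer's additive update and the incoming residual state. A minimal sketch with synthetic vectors (the vectors, dimensions, and helper names are all invented; this is not the paper's measurement code):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
x = rng.normal(size=512)      # residual-stream state entering a layer
delta = rng.normal(size=512)  # layer's additive update (attention + MLP output)

# Remove the component of the update that lies along the residual direction.
delta_orth = delta - ((delta @ x) / (x @ x)) * x

print(f"cos(x, delta)      = {cosine(x, delta):.3f}")   # small in high dimensions
print(f"cos(x, delta_orth) = {cosine(x, delta_orth):.3f}")  # ~0 after projection
```

A near-zero cosine at every layer is what "updates remain orthogonal to the residual stream" would look like under this measurement: each layer writes new directions into the residual stream rather than rescaling what is already there.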
For practitioners building or deploying LLMs, these geometric insights could inform architecture design and training procedures. If the three-phase structure is universal across model families, it may represent an optimal decomposition that alternative architectures should either replicate or deliberately subvert. The finding that attention and feed-forward layers seed candidates in family-specific proportions suggests these components play distinct, complementary roles that could be exploited for efficiency gains.
Future work should investigate whether this geometric structure emerges necessarily from transformer constraints or represents learned organization amenable to modification. Understanding whether architectural changes can alter phase proportions or eliminate phases entirely could unlock more efficient scaling.
- LLMs organize next-token prediction into three geometric phases: Seeding Multiplexing, Hoisting Overriding, and Focal Convergence, with predictable effective rank evolution.
- Model depth primarily expands candidate disambiguation capacity rather than introducing fundamentally new computational mechanisms.
- Predictive updates remain orthogonal to residual streams throughout all layers, suggesting fundamental constraints on transformer information flow.
- The three-phase structure emerges consistently across eight models spanning 1B-32B parameters from different families, indicating universal organizational principles.
- Phase 2 expands linearly with depth while Phases 1 and 3 grow slowly, creating a scaling bottleneck in the middle layers.