Emergent Hierarchical Structure in Large Language Models: An Information-Theoretic Framework for Multi-Scale Representation
Researchers reveal that large language models develop distinct hierarchical processing stages (Local, Intermediate, Global) determined by architecture family rather than model size. Using information theory, they demonstrate that Llama and Qwen models show dramatically different brittleness patterns across layers, with architectural design, not scaling, as the primary driver of model behavior.
This research challenges the prevailing assumption that model scale, rather than architecture, dominates model behavior. By analyzing eight Transformer models ranging from 7B to 70B parameters, the researchers found that every model spontaneously organizes its layers into three functional segments, with boundary positions and robustness characteristics determined overwhelmingly by the architecture family (Llama vs. Qwen) the model belongs to. The stability of Llama's boundaries across a 10x parameter range, contrasted with Qwen's wide variation, indicates that architectural choices embed themselves deeply into model structure regardless of training scale.
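How these segment boundaries are identified is not detailed in this summary. As a minimal sketch, assuming boundaries show up as abrupt jumps in some per-layer statistic (for example, a compression or representation-similarity score), a simple change-point heuristic could locate them; the function and the heuristic below are illustrative assumptions, not the authors' method.

```python
import numpy as np

def segment_boundaries(layer_stats: np.ndarray, n_segments: int = 3) -> list[int]:
    """Split a per-layer statistic into contiguous segments by picking the
    largest jumps between adjacent layers.

    layer_stats: shape (n_layers,) array with one scalar per layer.
    Returns the index of the first layer of each new segment.
    """
    deltas = np.abs(np.diff(layer_stats))                 # change between adjacent layers
    cut_points = np.argsort(deltas)[-(n_segments - 1):]   # largest (n_segments - 1) jumps
    return sorted(int(i) + 1 for i in cut_points)         # boundary = layer after the jump

# Toy example: a per-layer metric with two visible regime changes.
stats = np.array([0.90, 0.88, 0.85, 0.50, 0.48, 0.46, 0.45, 0.10, 0.08])
print(segment_boundaries(stats))  # -> [3, 7]: Local / Intermediate / Global
```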
The Multi-Scale Probabilistic Generation Theory formalizes these observations through information-theoretic principles, modeling Transformers as Hierarchical Variational Information Bottlenecks. This framework generates testable predictions that hold across all eight models, including the discovery that local-segment brittleness spans nearly three orders of magnitude, a 493x ratio explained entirely by architecture family. This finding suggests that architectural design decisions propagate through model layers in mathematically predictable ways.
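The paper's exact objective is not reproduced here. As a hedged sketch of what a hierarchical variational information bottleneck could look like, each segment representation Z_s compresses the output of the previous segment while retaining information about the prediction target Y:

$$
\min_{Z_1, Z_2, Z_3} \; \sum_{s=1}^{3} \Big( \beta_s\, I(Z_s; Z_{s-1}) - I(Z_s; Y) \Big), \qquad Z_0 = X,
$$

where X is the input, Z_1, Z_2, Z_3 correspond to the Local, Intermediate, and Global segments, and the β_s terms set per-segment compression pressure. The specific form and the β_s notation are assumptions for illustration, not the paper's stated objective.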
For AI development and deployment, this research has practical implications. It indicates that model robustness and failure modes cannot be predicted from parameter count alone; instead, practitioners must understand how specific architectural choices affect information compression and processing hierarchies. Organizations comparing models should scrutinize architectural families rather than focusing solely on size metrics. The work also provides a theoretical framework for designing more robust models by understanding which architectural features contribute to layer-level stability.
- Layer-organization patterns in Llama and Qwen models are set by architecture family rather than model size, suggesting architecture determines model behavior more than scaling does.
- Local processing segments exhibit a 493x variation in brittleness across architecture families, dwarfing any within-family differences (a minimal measurement sketch follows this list).
- Multi-Scale Probabilistic Generation Theory provides falsifiable predictions that hold across all tested models, offering a formal framework for understanding LLM structure.
- Boundary positions in Llama models remain consistent across 10x parameter variations while Qwen boundary positions vary widely, revealing fundamental architectural differences.
- Information compression patterns embedded by architecture family appear to be primary drivers of model robustness and vulnerability patterns.
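The brittleness figures above are the paper's. The following is only a minimal sketch of one way a per-segment brittleness score could be measured, assuming a Hugging Face-style causal language model where model.model.layers holds the decoder blocks; the noise-injection hook and KL-divergence metric are illustrative assumptions, not the authors' protocol.

```python
import torch

def segment_brittleness(model, input_ids, layer_range, noise_scale=1e-2, n_trials=8):
    """Illustrative brittleness score: mean KL divergence of next-token predictions
    when small Gaussian noise is injected into the hidden states of one layer segment.
    Assumes a Hugging Face-style causal LM (model.model.layers = decoder blocks)."""
    with torch.no_grad():
        clean = torch.log_softmax(model(input_ids).logits[:, -1], dim=-1)

    def add_noise(_module, _inputs, output):
        # Decoder blocks typically return a tuple whose first element is the hidden state.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + noise_scale * torch.randn_like(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    scores = []
    for _ in range(n_trials):
        handles = [model.model.layers[i].register_forward_hook(add_noise)
                   for i in range(*layer_range)]
        with torch.no_grad():
            noisy = torch.log_softmax(model(input_ids).logits[:, -1], dim=-1)
        for handle in handles:
            handle.remove()
        scores.append(torch.nn.functional.kl_div(
            noisy, clean, log_target=True, reduction="batchmean"))
    return torch.stack(scores).mean().item()

# Hypothetical usage: score the Local segment (say, layers 0-7) of a loaded model.
# local_score = segment_brittleness(model, input_ids, layer_range=(0, 8))
```

Comparing this score for the Local segment of a Llama-family and a Qwen-family model, under the paper's claims, would be expected to reveal a gap far larger than any within-family difference.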