One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability
Researchers introduce WorldModelLens, an open-source interpretability framework that unifies analysis across diverse world model architectures (recurrent state-space models, token-based transformers, and joint-embedding systems) through a standardized capability-typed interface. The tool enables researchers to apply interpretability methods once rather than reimplementing them for each model architecture, addressing fragmentation in AI model analysis tooling.
WorldModelLens addresses a fragmentation problem in AI interpretability research. World models now span several computational architectures, from PlaNet's recurrent approach to the transformer-based IRIS and the joint-embedding I-JEPA, and interpretability researchers have repeatedly rebuilt the same analytical tools for each new one. This redundancy wastes effort and slows progress in understanding how these models function.
The framework's innovation lies in identifying shared structural primitives across seemingly different architectures. By defining a minimal typed interface with four required core methods (encode, transition, initial state, sample) and optional capability heads (decode, reward, continue, actor, critic), the authors create a unified abstraction layer. A model declares only the heads it actually has, so the design accommodates both reinforcement-learning and self-supervised models as first-class citizens rather than forcing one paradigm to imitate another.
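One way to picture a capability-typed interface is with Python's structural `Protocol` types. The sketch below is illustrative only and is not WorldModelLens's actual API: the method signatures, the `HasDecode`/`HasReward` protocols, the `capabilities` helper, and both toy models are assumptions made for this example. Only the method names come from the description above.

```python
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class WorldModel(Protocol):
    """Hypothetical core interface: the four required methods (signatures assumed)."""

    def encode(self, obs: Any) -> Any: ...
    def transition(self, state: Any, action: Any = None) -> Any: ...
    def initial_state(self) -> Any: ...
    def sample(self, state: Any) -> Any: ...


@runtime_checkable
class HasDecode(Protocol):
    """Optional capability head (assumed shape)."""

    def decode(self, state: Any) -> Any: ...


@runtime_checkable
class HasReward(Protocol):
    """Optional capability head (assumed shape)."""

    def reward(self, state: Any) -> float: ...


# Only two of the five optional heads are modelled, to keep the sketch short.
OPTIONAL_HEADS = {"decode": HasDecode, "reward": HasReward}


def capabilities(model: Any) -> List[str]:
    """Return the optional heads a conforming model exposes, sorted by name."""
    if not isinstance(model, WorldModel):
        raise TypeError("model lacks one of the four core methods")
    return sorted(
        name for name, proto in OPTIONAL_HEADS.items() if isinstance(model, proto)
    )


class ToyRLModel:
    """Toy stand-in for an RL-style world model with decode and reward heads."""

    def encode(self, obs): return float(obs)
    def transition(self, state, action=None): return state + (action or 0.0)
    def initial_state(self): return 0.0
    def sample(self, state): return state
    def decode(self, state): return state
    def reward(self, state): return -abs(state)


class ToySelfSupervisedModel:
    """Toy stand-in for a self-supervised model: core methods only, no heads."""

    def encode(self, obs): return float(obs)
    def transition(self, state, action=None): return 0.5 * state
    def initial_state(self): return 0.0
    def sample(self, state): return state


print(capabilities(ToyRLModel()))              # ['decode', 'reward']
print(capabilities(ToySelfSupervisedModel()))  # []
```

Under this reading, an analysis that needs a reward head can check for it up front and skip or reject models that lack one, while a model with no heads at all still conforms to the core interface.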
The technical impact extends to the interpretability methods themselves. Probing, activation patching, sparse autoencoders, and surprise analysis can be implemented once and applied across all conforming world models. The framework's single hook-and-cache layer handles time-indexed activations, imagination rollouts, and intervention replay. These capabilities are essential for analyzing generative and predictive models, and existing transformer-focused tooling typically overlooks them.
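To make the idea of a time-indexed hook-and-cache layer concrete, here is a minimal sketch of activation patching over an imagined rollout. It is not the framework's implementation: `run_with_cache`, the `("state", t)` cache keys, and the scalar `ToyModel` are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Key = Tuple[str, int]            # (activation name, timestep); assumed key scheme
Hook = Callable[[float], float]  # receives an activation, returns its replacement


class ToyModel:
    """Scalar toy dynamics, used only to keep the example runnable."""

    def encode(self, obs: float) -> float:
        return float(obs)

    def transition(self, state: float, action: float = 0.0) -> float:
        return 0.5 * state + action


def run_with_cache(
    model: ToyModel,
    first_obs: float,
    actions: List[float],
    hooks: Optional[Dict[Key, Hook]] = None,
) -> Tuple[float, Dict[Key, float]]:
    """Encode one observation, roll the model forward, and cache each state by timestep.

    A hook registered under ("state", t) may overwrite the state at step t
    before it is cached and passed on to step t + 1.
    """
    hooks = hooks or {}
    cache: Dict[Key, float] = {}
    state = model.encode(first_obs)
    for t, action in enumerate(actions):
        state = model.transition(state, action)
        key = ("state", t)
        if key in hooks:
            state = hooks[key](state)  # intervention point
        cache[key] = state
    return state, cache


model = ToyModel()
actions = [0.0, 0.0, 0.0]

# Clean run from observation 8.0: states are 4.0, 2.0, 1.0.
clean_final, clean_cache = run_with_cache(model, 8.0, actions)

# Corrupted run from observation 0.0: every state is 0.0.
corrupted_final, _ = run_with_cache(model, 0.0, actions)

# Patch the clean state at t=1 into the corrupted run and replay the rest.
patch = {("state", 1): lambda _state: clean_cache[("state", 1)]}
patched_final, _ = run_with_cache(model, 0.0, actions, hooks=patch)

print(clean_final, corrupted_final, patched_final)  # 1.0 0.0 1.0
```

Because the rollout function touches only `encode` and `transition`, the same patching code would work on any model that exposes those two methods, which is the write-once property the paragraph above describes.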
For the broader AI research community, this standardization reduces barriers to comparative analysis across model families. Researchers can now systematically study how different architectures learn representations and dynamics without reimplementing foundational tools. As world models become increasingly central to embodied AI and reinforcement learning, having shared interpretability infrastructure accelerates the field's understanding of these systems.
- WorldModelLens unifies interpretability analysis across diverse world model architectures through a standardized capability-typed interface.
- The framework requires only four core methods per model, reducing implementation burden and enabling code reuse across different systems.
- Interpretability techniques can now be written once and applied to multiple model families rather than reimplemented separately.
- The design treats reinforcement-learning and self-supervised models as equivalent first-class participants without forcing architectural imitation.
- Shared infrastructure accelerates comparative research and understanding of how different world model substrates learn and represent environment dynamics.