AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
Researchers introduce AnyMo, a unified framework for conditional human motion generation that supports arbitrary modality combinations (text, speech, music, trajectory). The work is enabled by OmniHuMo, a large-scale dataset of 5,000+ hours of motion with precisely aligned multimodal annotations, addressing the critical bottleneck of training data scarcity in multimodal synthesis.
The release of AnyMo and OmniHuMo represents a significant advance in multimodal AI, tackling a long-standing problem in computer vision and robotics: conditional motion generation has been held back by rigid architectures and limited data. Previous methods typically fixed their modality configuration at design time, forcing researchers to build a separate model for each combination of inputs. The new framework instead uses masked modeling, a technique already established in language and vision tasks, to handle arbitrary modality combinations within a single architecture.
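To make the idea of handling arbitrary modality combinations concrete, here is a minimal sketch of modality masking. It is not AnyMo's actual implementation: the function names, the `keep_prob` parameter, and the use of `None` as a stand-in for a learned mask embedding are all assumptions made for illustration. The point is only that one model can be shown every subset of conditions during training by randomly hiding some of them.

```python
import random

# The four condition types named in the article.
MODALITIES = ("text", "speech", "music", "trajectory")

# Hypothetical stand-in for a learned mask embedding.
MASK = None


def mask_conditions(conditions, keep_prob=0.5, rng=None):
    """Randomly hide modalities so a single model sees many input subsets."""
    if rng is None:
        rng = random.Random()
    masked = {}
    for name in MODALITIES:
        value = conditions.get(name)
        # A modality missing from the sample is always masked;
        # a present one is hidden with probability 1 - keep_prob.
        if value is None or rng.random() > keep_prob:
            masked[name] = MASK
        else:
            masked[name] = value
    return masked


def presence_flags(masked):
    """Return 1 where a modality is visible to the model, 0 where it is masked."""
    return [0 if masked[name] is MASK else 1 for name in MODALITIES]
```

At inference time, under the same assumptions, a caller would pass whichever conditions are available with `keep_prob=1.0`, so that nothing is hidden beyond what is genuinely absent.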
OmniHuMo addresses the primary obstacle to scaling in this domain: the lack of large-scale, high-quality datasets with aligned multimodal annotations. With 3.2 million sequences spanning text, speech, music, and trajectory data, the dataset provides the foundation needed to explore scaling laws in multimodal motion synthesis, a largely unexplored area despite its significance for animation, robotics, and virtual content creation.
The implications extend across multiple industries. Animation and game development studios could use more flexible motion generation for character control. Robotics teams could drive systems with diverse control signals (verbal commands, audio cues, or trajectory specifications) without redesigning the core model. The masked-modeling approach may also transfer to multimodal synthesis problems beyond motion generation.
Looking forward, the community should watch whether AnyMo's scaling behavior follows predictable patterns similar to those seen in language models, which could justify training larger models for better quality. The open-source availability of methodologies and datasets will likely accelerate adoption and inspire similar multimodal frameworks in adjacent domains.
- AnyMo enables flexible motion generation from arbitrary combinations of input modalities using a single unified model architecture
- The OmniHuMo dataset of 5,000+ hours provides previously unavailable large-scale multimodal training data with aligned annotations
- The masked modeling approach removes the need for modality-specific architectures, improving generalization across diverse control signals
- The framework addresses critical bottlenecks in multimodal synthesis, with applications in animation, robotics, and virtual content creation
- The research opens up the largely unexplored question of scaling laws in multimodal-conditioned synthesis, with implications for future foundation model development