MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation
Researchers introduce MUSE, a new benchmark for evaluating text-to-CAD generation that moves beyond simple geometry matching to assess the manufacturability, functionality, and assemblability of complex 3D assemblies. Current LLM-based CAD generation systems fall well short when evaluated against practical engineering requirements, revealing a gap between geometric generation and production-ready design.
MUSE addresses a fundamental limitation in how AI-generated CAD models are currently evaluated. While large language models have made impressive strides in 3D generation tasks, existing benchmarks rely on geometric similarity metrics that ignore real-world engineering constraints. This creates a false sense of progress: a model may generate geometrically correct shapes while producing designs that cannot be manufactured, assembled, or made to function as intended. MUSE's three-stage evaluation protocol (code check, geometric check, and design-intent alignment) sets a more rigorous standard, one closer to how industrial engineers assess designs.
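The three-stage protocol can be pictured as a gated pipeline. The Python sketch below is illustrative only and is not MUSE's actual harness: `StageResult`, `evaluate`, `build`, `is_valid_geometry`, and `judge` are hypothetical names, with the callables standing in for a CAD kernel and a rubric judge. It assumes that each stage runs only if the previous one passed, and that design-intent alignment is scored as the fraction of rubric items satisfied; the source does not confirm either detail.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class StageResult:
    code_ok: bool                   # stage 1: did the generated code run?
    geometry_ok: bool               # stage 2: is the resulting geometry valid?
    rubric_score: Optional[float]   # stage 3: None if an earlier stage failed


def evaluate(
    source: str,
    build: Callable[[str], object],               # hypothetical: runs CAD code, returns a model
    is_valid_geometry: Callable[[object], bool],  # hypothetical: geometric validity check
    rubric: Sequence[str],                        # design-specific rubric items
    judge: Callable[[object, str], bool],         # hypothetical: stand-in for a VLM judge
) -> StageResult:
    # Stage 1: code check. Any exception while building counts as a failure.
    try:
        model = build(source)
    except Exception:
        return StageResult(False, False, None)

    # Stage 2: geometric check. Stop here if the model is not valid.
    if not is_valid_geometry(model):
        return StageResult(True, False, None)

    # Stage 3: design-intent alignment, scored as the fraction of rubric items met.
    passed = sum(1 for item in rubric if judge(model, item))
    score = passed / len(rubric) if rubric else 0.0
    return StageResult(True, True, score)
```

Under this assumed gating, a sample that fails early never receives a rubric score, which is one way to make the drop-off between stages visible when results are aggregated.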
The benchmark emerges at a critical juncture in AI-assisted design. As companies increasingly explore automating CAD workflows to accelerate product development, the gap between academic benchmarks and industrial requirements has widened. MUSE's focus on boundary representation (B-Rep) assemblies rather than single parts reflects real product complexity, where multiple components must work together seamlessly. The finding that even top-performing models achieve limited success on engineering criteria suggests the field has been measuring progress against the wrong metrics.
For the CAD software and manufacturing industries, MUSE provides both opportunity and warning. It creates a clearer roadmap for developing production-ready AI tools, but it also demonstrates that current approaches fall short of replacing human engineers. Its rubric-based evaluation, which uses a vision-language model (VLM) as the judge and is validated against human annotation, offers a scalable way to assess designs and could accelerate the development of genuinely useful AI design assistants. Organizations investing in generative AI for product design should recognize that meeting academic benchmarks does not guarantee practical deployment success.
- MUSE introduces the first comprehensive benchmark evaluating text-to-CAD generation on manufacturability and assemblability rather than shape similarity alone.
- Existing LLM-based CAD systems show a 'failure cascade' from executable code through valid geometry to engineering-ready design quality.
- The benchmark uses design-specific rubrics and VLM-based evaluation validated against human annotation for reliable assessment.
- Current state-of-the-art models achieve limited success on fine-grained engineering criteria despite strong geometric performance.
- The benchmark dataset and leaderboard are publicly available, enabling transparent comparison of text-to-CAD approaches across the research community.
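
Validating a VLM judge against human annotation amounts to measuring agreement between two sets of labels for the same rubric items. The source does not say which agreement statistic MUSE reports; the sketch below uses Cohen's kappa as one common choice, and the function name `cohen_kappa` and the label lists are illustrative assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)

    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Expected agreement by chance, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    # Both raters used a single identical label throughout: treat as full agreement.
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail judgments on four rubric items (1 = satisfied).
human_labels = [1, 1, 1, 0]
vlm_labels = [1, 1, 0, 0]
kappa = cohen_kappa(human_labels, vlm_labels)
```

A kappa near 1 would indicate that the automated judge tracks human annotators closely, while a value near 0 would mean agreement no better than chance.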