Brick-Composer: Using MLLMs for Assembly with Diverse Bricks
Researchers introduce Brick-Composer, a learning framework that enhances multimodal large language models (MLLMs) with physical assembly capabilities through targeted training on brick construction tasks. The study reveals current MLLMs lack reliable spatial reasoning and fine-grained object recognition needed for real-world assembly, but demonstrates that structured learning approaches can improve performance significantly.
This research addresses a fundamental gap in AI capabilities: the transition from visual understanding to physical action in real-world environments. While MLLMs excel at language and image comprehension, their ability to translate visual designs into precise spatial operations remains underdeveloped. Brick-Composer tackles this by framing assembly as a two-stage decision process—brick selection and pose estimation—and introducing BC-Bench, the first standardized benchmark for measuring MLLM performance on diverse construction tasks.
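The two-stage framing can be sketched in code. The following is a minimal illustration, not the authors' implementation: the `Pose` and `AssemblyStep` types, the grid-and-rotation pose encoding, and the rule that a step counts as successful only when both the brick and its pose match the target are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Sequence


@dataclass(frozen=True)
class Pose:
    """Hypothetical discrete pose: a grid position plus a rotation in degrees."""
    x: int
    y: int
    z: int
    rotation_deg: int  # assumed to be one of 0, 90, 180, 270


@dataclass(frozen=True)
class AssemblyStep:
    """One assembly decision: which brick (stage 1) and where it goes (stage 2)."""
    brick_id: str
    pose: Pose


def selection_correct(pred: AssemblyStep, target: AssemblyStep) -> bool:
    # Stage 1 only: was the right brick chosen?
    return pred.brick_id == target.brick_id


def step_correct(pred: AssemblyStep, target: AssemblyStep) -> bool:
    # Assumed success rule: both stages must match the target exactly.
    return selection_correct(pred, target) and pred.pose == target.pose


def _rate(flags: Iterable[bool]) -> float:
    # Fraction of True values; 0.0 for an empty sequence.
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def evaluate(preds: Sequence[AssemblyStep],
             targets: Sequence[AssemblyStep]) -> Dict[str, float]:
    """Return brick selection accuracy and step-level success over paired steps."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    pairs = list(zip(preds, targets))
    return {
        "selection_accuracy": _rate(selection_correct(p, t) for p, t in pairs),
        "step_success": _rate(step_correct(p, t) for p, t in pairs),
    }
```

Under this assumed rule, step-level success can never exceed brick selection accuracy, because a wrong brick fails the step regardless of its pose. A real benchmark would likely also need pose tolerances and handling of symmetric bricks, which this sketch omits.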
The work builds on growing interest in embodied AI and robotics, where the challenge isn't just perception but grounding that perception in actionable physical outcomes. Recent advances in vision-language models raised expectations that these systems could handle real-world coordination tasks, yet the baseline results (less than 1% step-level success) expose how far general visual understanding still is from task-specific spatial reasoning.
Brick-Composer's training methodology proves instructive: combining human demonstrations, visual feedback loops, and synthetic data generation creates a pathway to scale learning beyond limited real-world examples. A more than 3x improvement in brick selection accuracy and a rise in step-level success to around 15% demonstrate measurable progress, though the authors acknowledge this remains far from production-ready assembly systems.
For the AI industry, this research supports a key point: general-purpose models need domain-specific grounding mechanisms to handle embodied tasks. The approach has implications beyond brick construction: manufacturing, robotics, and logistics could benefit from similar training frameworks that connect perception to physical consequences. It also suggests that bridging the perception-action gap is the next frontier in LLM capability development.
- Current state-of-the-art MLLMs struggle with precise spatial reasoning and fine-grained object selection for assembly tasks, with a baseline step-level success rate below 1%.
- Brick-Composer's three-signal training approach (human demonstrations, world feedback, synthetic experience) improves brick selection accuracy by over 3x and raises step-level success to around 15%.
- BC-Bench provides the first standardized benchmark for evaluating MLLM performance on diverse brick construction, enabling comparative research in embodied AI.
- Even small models with 3-8B parameters can acquire meaningful assembly skills through targeted physical grounding, suggesting efficient scaling pathways for embodied AI.
- The gap between visual perception and spatial action execution represents a critical frontier in AI development, with applications across the manufacturing, robotics, and logistics sectors.