HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation
HomeFlow is a data flywheel system for training large language model agents in smart home environments. It uses procedural generation and Monte Carlo tree search to create diverse, verifiable training trajectories. The approach reaches a task success rate of 87.03% on the new SmartHome-Bench benchmark, outperforming GPT-5.5 by 1.23 percentage points.
HomeFlow addresses a fundamental challenge in AI development: generating high-quality training data for embodied agents operating in complex, dynamic physical environments. Traditional approaches struggle with the ambiguity and multi-step reasoning required for smart home tasks, where user intent must be interpreted and executed across interconnected devices. The proposed system combines HomeEnv simulation, procedural home generation through HomeMaker, and MCTS-Flow trajectory synthesis to create a closed-loop training cycle that improves iteratively through authentic feedback.
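The closed loop described above (generate a home and task, synthesize candidate trajectories, keep only those the simulator can verify) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy on/off device model, the names `Task`, `make_task`, `rollout`, `verify`, and `collect`, and the use of random rollouts in place of MCTS. It is not the HomeEnv, HomeMaker, or MCTS-Flow interface, which this summary does not describe.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical task: an instruction plus a checkable goal state."""
    instruction: str
    initial: dict   # device name -> bool (on/off) at the start
    goal: dict      # device name -> bool that must hold at the end

def make_task(rng, devices):
    """Stand-in for procedural generation: random initial state and goal."""
    initial = {d: rng.random() < 0.5 for d in devices}
    targets = rng.sample(devices, k=2)
    goal = {d: not initial[d] for d in targets}
    instruction = ", ".join(
        f"turn {'on' if v else 'off'} the {d}" for d, v in goal.items()
    )
    return Task(instruction, initial, goal)

def rollout(task, rng, max_steps=4):
    """Stand-in for trajectory synthesis: random actions, NOT MCTS."""
    devices = list(task.initial)
    return [(rng.choice(devices), rng.random() < 0.5) for _ in range(max_steps)]

def verify(task, actions):
    """Replay the actions in the toy simulator and check the goal state."""
    state = dict(task.initial)
    for device, value in actions:
        state[device] = value
    return all(state[d] == v for d, v in task.goal.items())

def collect(num_tasks, attempts, seed=0):
    """One flywheel pass: keep only trajectories that pass verification."""
    rng = random.Random(seed)
    devices = ["lamp", "fan", "heater", "tv"]
    dataset = []
    for _ in range(num_tasks):
        task = make_task(rng, devices)
        for _ in range(attempts):
            actions = rollout(task, rng)
            if verify(task, actions):
                dataset.append((task, actions))
                break
    return dataset

if __name__ == "__main__":
    data = collect(num_tasks=20, attempts=50)
    print(f"kept {len(data)} verified trajectories out of 20 tasks")
```

In a full flywheel, the verified trajectories would be used to fine-tune the agent, and the improved agent would replace the random `rollout` policy in the next pass. The point of the sketch is only that success is decided by replaying actions against simulator state rather than by a model's judgment.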
This work reflects the broader AI industry shift toward embodied intelligence and multimodal reasoning. Smart homes represent an accessible but non-trivial domain for testing agent capabilities, one that requires natural language understanding, environment navigation, and multi-turn planning. The introduction of SmartHome-Bench provides a standardized evaluation framework, addressing fragmentation in how embodied AI agents are assessed across research groups.
The performance results warrant scrutiny. HomeFlow-RL-8B surpassing GPT-5.5 on this specific benchmark suggests that domain-specialized, smaller models with better training data may outcompete general-purpose systems on narrow tasks. This has implications for enterprise adoption: organizations may prefer fine-tuned, smaller models for smart home automation over costly API calls to frontier models. However, the comparison's validity depends on how SmartHome-Bench is designed and whether it truly captures real-world smart home complexity.
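To put the reported margin in perspective, the short calculation below derives what the two figures quoted in the summary (87.03% success and a 1.23-point lead) imply. The baseline score is inferred from those two numbers rather than quoted from the paper, and the variable names are illustrative.

```python
homeflow = 87.03   # reported task success rate, percent
margin_pp = 1.23   # reported lead over GPT-5.5, percentage points

# Implied GPT-5.5 score (an inference from the two reported figures).
baseline = homeflow - margin_pp

# The same gap expressed as a relative gain in successes
# and as a relative reduction in failures.
relative_gain = margin_pp / baseline * 100
failure_cut = margin_pp / (100 - baseline) * 100

print(f"implied baseline:      {baseline:.2f}%")       # 85.80%
print(f"relative success gain: {relative_gain:.2f}%")  # 1.43%
print(f"relative failure cut:  {failure_cut:.2f}%")    # 8.66%
```

Whether a gap of this size is meaningful depends on the number of benchmark tasks and the run-to-run variance, neither of which is stated in this summary.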
Looking ahead, the verifiable simulation approach could extend beyond smart homes to robotics, manufacturing, and autonomous systems. Key questions remain about sim-to-real transfer gaps and whether the procedurally generated trajectories capture genuine user behavior patterns.
- HomeFlow achieves 87.03% task success on SmartHome-Bench, exceeding GPT-5.5 by 1.23 percentage points on this benchmark
- The data flywheel combines procedural generation, Monte Carlo tree search, and reinforcement learning to create verifiable training trajectories
- SmartHome-Bench provides a standardized evaluation framework for embodied AI agents in domestic environments
- Smaller domain-specialized models may outperform general-purpose LLMs on narrow, well-defined tasks with sufficient training data
- Verifiable simulation environments enable iterative agent improvement through authentic feedback loops