From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs
Researchers demonstrate a two-stage methodology for deploying large language models end-to-end on energy-efficient spatial NPUs, progressing from human-guided optimization to fully autonomous agent deployment. The approach achieves significant performance improvements and successfully deploys eight additional LLM variants on AMD XDNA 2 NPUs with minimal human intervention, marking the first open-source deployments of these models on AMD hardware.
This research addresses a critical bottleneck in edge AI infrastructure: efficiently deploying LLMs on resource-constrained spatial neural processing units without extensive manual engineering. The methodology's progression from human guidance to autonomous agent control represents a meaningful shift in how AI systems can handle complex deployment tasks. The team's reference implementation of Llama-3.2-1B achieved substantial speedups (2.2x prefill, 4.0x decode), establishing a performance baseline that subsequent autonomous deployments could match or exceed.
The work builds on growing momentum in AI-assisted development, where agent systems augment human expertise rather than replace it. By systematizing the optimization knowledge gained from manual development into an eight-phase skill system, the researchers created a reusable framework applicable to previously unseen models. This approach contrasts with earlier single-kernel optimization studies, tackling the harder problem of end-to-end deployment on constrained hardware.
The practical impact extends across edge computing and embedded AI markets. Faster deployment cycles reduce time-to-market for edge applications and lower barriers for developers lacking deep hardware expertise. Successfully deploying models like Qwen and SmolLM variants on AMD NPUs through open-source tooling gives developers an alternative to proprietary solutions, potentially accelerating adoption of spatial computing architectures. Three of the eight autonomous deployments matched reference performance without model-specific tuning, which suggests that the agent skill system generalizes beyond the model it was developed on, though the remaining deployments leave room for further optimization.
Future developments may include extending this framework to larger models, different hardware architectures, and multi-agent coordination for complex optimization problems. The open-source compiler stack integration positions AMD's XDNA platform as increasingly accessible to the broader developer community.
- Autonomous agents successfully deployed eight additional LLMs on AMD XDNA 2 NPUs with minimal human guidance using open-source tools
- Reference Llama-3.2-1B implementation achieved 2.2x speedup on prefill and 4.0x on decode compared to hand-optimized baselines
- Agent skill system consisting of eight optimization phases enables functional generalization to previously unencountered model architectures
- Three of eight autonomous deployments matched or exceeded reference performance without additional model-specific engineering
- This marks the first documented open-source deployment of multiple LLM variants (Qwen, SmolLM) on AMD NPUs, expanding edge AI accessibility