Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation
Researchers introduce FLIGHT, a benchmark for training UAV agents to follow natural language instructions with precise, continuous flight control over long-horizon tasks. The accompanying FLIGHT VLA architecture decouples low-frequency high-level reasoning from high-frequency control, moving autonomous drone navigation beyond existing discrete-action systems.
The FLIGHT benchmark addresses a critical gap in autonomous UAV research by combining long-horizon semantic instructions with dense 6-degree-of-freedom trajectory annotations. Existing Vision-Language Navigation systems operate at discrete action levels inadequate for real-world drone flight, while current UAV Vision-Language-Action tasks focus on short, isolated maneuvers. This work bridges that divide, enabling agents to execute multi-stage instructions while producing smooth, physically feasible continuous commands.
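To make the pairing of a long instruction with dense 6-degree-of-freedom annotations concrete, here is a minimal Python sketch of what one benchmark sample might look like, along with a simple feasibility check on it. The `Pose6DoF` and `Episode` schemas, the field names, the units, and the example instruction are all assumptions made for illustration; the summary does not describe FLIGHT's actual data format.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Pose6DoF:
    """One annotated sample: timestamp, position, orientation (assumed layout)."""
    t: float      # seconds since episode start
    x: float      # metres
    y: float
    z: float
    roll: float   # radians
    pitch: float
    yaw: float


@dataclass
class Episode:
    """A long-horizon instruction with its dense trajectory (hypothetical schema)."""
    instruction: str
    trajectory: List[Pose6DoF]


def max_speed(episode: Episode) -> float:
    """Largest average translational speed (m/s) between consecutive poses."""
    fastest = 0.0
    for a, b in zip(episode.trajectory, episode.trajectory[1:]):
        dt = b.t - a.t
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        step = math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))
        fastest = max(fastest, step / dt)
    return fastest


# Illustrative data only; not taken from the benchmark.
episode = Episode(
    instruction="Take off, fly past the tower, then descend toward the rooftop.",
    trajectory=[
        Pose6DoF(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
        Pose6DoF(1.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.1),
        Pose6DoF(2.0, 3.0, 4.0, 2.0, 0.0, 0.0, 0.1),
    ],
)
print(max_speed(episode))
```

A check like `max_speed(episode) <= limit` is one simple way to flag trajectories that a real vehicle could not fly. A fuller check would also bound acceleration and angular rates.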
The FLIGHT VLA architecture is asynchronous by design. A low-frequency Streaming Pilot Vision-Language Model handles task-state reasoning and mission planning, while a separate high-frequency diffusion model manages real-time control, so the system never has to run full reasoning at control-loop speed. This decoupling matters for practical deployment, where a single reasoning cycle may take seconds while the control loop runs at 30-100 Hz. Explicit "Pilot Reasoning" supervision texts, which articulate the current flight state and anticipate the next subgoal, provide interpretable training signals.
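The decoupling described above can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions, not the FLIGHT VLA implementation: the `Subgoal` type, the `simulate` function, the "hover" fallback, and the one-second reasoning latency are hypothetical stand-ins. A real system would run the VLM and the diffusion controller as concurrent processes. The point shown is that the controller emits a command on every tick using the most recent completed subgoal and never waits for the reasoner.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Subgoal:
    label: str
    requested_at: int  # control tick at which this reasoning cycle started


def simulate(control_hz: int = 50,
             reasoning_latency_s: float = 1.0,
             duration_s: float = 3.0) -> List[Tuple[int, str]]:
    """Return (tick, subgoal label used by the controller) for every control tick."""
    latency_ticks = int(control_hz * reasoning_latency_s)
    total_ticks = int(control_hz * duration_s)

    active: Optional[Subgoal] = None   # last subgoal the reasoner finished
    pending: Optional[Subgoal] = None  # subgoal still being computed
    ready_at = 0
    log: List[Tuple[int, str]] = []

    for tick in range(total_ticks):
        # The slow reasoner publishes its result once its latency has elapsed.
        if pending is not None and tick >= ready_at:
            active, pending = pending, None

        # If the reasoner is idle, start the next reasoning cycle.
        if pending is None:
            pending = Subgoal(f"subgoal-{tick // latency_ticks}", tick)
            ready_at = tick + latency_ticks

        # The fast controller runs every tick with whatever subgoal is current,
        # falling back to a safe default until the first one arrives.
        log.append((tick, active.label if active else "hover"))

    return log


log = simulate()
print(len(log), log[0], log[50], log[149])
```

With the default settings, the controller issues 150 commands over three seconds while the reasoner publishes only two subgoals. The first second is flown under the fallback behavior because no subgoal has been produced yet.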
Benchmark results show substantial improvements over existing VLN and VLA baselines in multi-stage completion, subgoal adherence, and terminal control metrics. The trained Streaming Pilot Reasoning VLM also improves UAV video reasoning, which supports the framework's design and suggests the approach may generalize beyond this specific benchmark.
For the autonomous systems and robotics sector, this work signals that language-guided drone control is maturing. Real-world deployment of delivery, inspection, and surveillance UAVs depends on exactly this capability: turning natural language missions into smooth, safe flight trajectories. The architecture may also influence broader robotics systems that need to coordinate high-level reasoning with low-level control.
- FLIGHT benchmark introduces dense trajectory annotations for fine-grained UAV navigation under natural language instructions, filling a gap between discrete-action and atomic-task systems.
- FLIGHT VLA's asynchronous architecture decouples task reasoning from control, enabling real-time precision while maintaining interpretable mission planning.
- Explicit Pilot Reasoning supervision provides interpretable training signals that improve both navigation and video understanding tasks.
- Experimental results show consistent improvements in multi-stage task completion and subgoal adherence compared to existing VLN and VLA baselines.
- The framework demonstrates practical relevance for autonomous UAV applications requiring complex, safety-critical flight operations.