Towards Precise Intent-Aligned VLA Aerial Navigation via Expert-Guided GRPO

Tianyang Chen1,2,*, Wenjun Li1,2,*, Xin Zhou1,2, Yuze Wu1,2,†, Fei Gao1,2,†

1Zhejiang University    2Differential Robotics

*Equal contribution;    †Corresponding author.

Abstract

Vision-Language-Action (VLA) models offer a promising end-to-end paradigm for unmanned aerial vehicles (UAVs) to accomplish complex tasks specified by fine-grained instructions. However, standard supervised fine-tuning (SFT) suffers from data scarcity, limited generalization, and weak supervision for nuanced and complicated human intents. Reinforcement fine-tuning offers a natural way to mitigate these limitations and align policy behavior with human intents through designable feedback, but applying it to aerial navigation remains difficult because exploration is inefficient in expansive continuous spaces. To address this, we introduce an efficient reinforcement learning (RL) framework for VLA-based aerial navigation. At its core, we propose EG-GRPO (Expert-Guided Group Relative Policy Optimization), which augments online rollouts with few-shot expert data. Additionally, we design a heterogeneous pipeline that runs simulation and inference in parallel, reducing rollout time by 43.5%. Across multiple tasks specified by complex human intents, EG-GRPO improves the success rate to 2.13× that of the SFT baseline, while improving intent alignment performance by 60.9%. These results demonstrate that our framework can move aerial navigation toward precise intent-aligned flight. Our videos are available on Intent-Aligned_AerialNav; code will be released soon.

Simulation Demos

Fly a triangular path around the fountain ahead
Go below the table tennis table and fly through to the opposite wall
Land on the right side of the second lounge chair
Pass around the tabletop artwork on the narrow side

Real-World Demos

More demos coming soon...

Method Highlights

EG-GRPO Training Pipeline

Initializes the UAV navigation policy from SFT, then injects few-shot expert trajectories into GRPO groups to provide stable trajectory-level rewards for intent-aligned policy updates.
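
As a rough illustration of why adding expert trajectories to a GRPO group can help, the sketch below computes group-relative advantages (reward minus group mean, divided by group standard deviation) over a group that mixes online rollouts with expert trajectories. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `group_relative_advantages`, the `eps` constant, the scalar rewards, and the choice to simply append expert rewards to the group are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rollout_rewards, expert_rewards, eps=1e-6):
    """Normalize trajectory-level rewards within one mixed group.

    Assumption: expert trajectories are scored by the same reward function
    and appended to the group before normalization.
    """
    rewards = list(rollout_rewards) + list(expert_rewards)
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the whole group
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    n = len(rollout_rewards)
    # Split back into advantages for online rollouts and for expert data.
    return advantages[:n], advantages[n:]


# A group where every online rollout failed (reward 0.0).
no_expert, _ = group_relative_advantages([0.0, 0.0, 0.0], [])
with_expert, expert_adv = group_relative_advantages([0.0, 0.0, 0.0], [1.0])
```

In this toy case, the group without expert data has identical rewards, so every advantage is zero and the update carries no signal. With one successful expert trajectory in the group, the failed rollouts receive negative advantages and the expert trajectory a positive one.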

Heterogeneous Parallelization of Inference and Simulation

Decouples Isaac Lab rollout and VLA inference with double-buffered task groups across L20 and A100 hardware, reducing rollout idle time and accelerating online RL.
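
The double-buffering idea can be sketched with two task groups that alternate between an inference stage and a simulation stage, so that one group is being simulated while the other is in inference. This is a toy sketch under stated assumptions, not the authors' pipeline: the two locks stand in for the two devices, `time.sleep` stands in for the VLA forward pass and the simulator step, and the names `run_group` and `double_buffered_rollout` are hypothetical.

```python
import threading
import time


def run_group(name, steps, infer_lock, sim_lock, log):
    """Alternate one task group between inference and simulation."""
    obs = 0  # placeholder observation
    for t in range(steps):
        with infer_lock:      # inference device serves one group at a time
            time.sleep(0.01)  # stand-in for a VLA forward pass
            action = obs + 1
        with sim_lock:        # simulator serves one group at a time
            time.sleep(0.01)  # stand-in for a simulator step
            obs = action
        log.append((name, t))


def double_buffered_rollout(steps=5):
    """Run two task groups so each stage stays busy with the other group."""
    infer_lock, sim_lock = threading.Lock(), threading.Lock()
    log = []
    threads = [
        threading.Thread(target=run_group,
                         args=(g, steps, infer_lock, sim_lock, log))
        for g in ("A", "B")
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return log


log = double_buffered_rollout(steps=3)
```

With a single group, the simulator would sit idle during every inference call and vice versa. With two groups, the idle slot of each stage can be filled by the other group, which is the intuition behind the reduced rollout idle time.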