Accepted to ICLR 2026
This is the official repository for DriveAgent-R1. We introduce an autonomous driving agent that pioneers active perception and a hybrid-thinking framework for high-level behavioral planning.
At its core, DriveAgent-R1 is designed to mimic human-like cognitive patterns. Instead of passively processing a fixed set of visual inputs, it can proactively seek crucial visual evidence through a specialized Vision Toolkit when faced with uncertainty. Furthermore, its hybrid-thinking framework allows it to adaptively switch between efficient text-only reasoning for simple scenarios and robust, tool-augmented visual reasoning for complex ones.
Our 3B parameter model achieves performance competitive with top-tier systems like GPT-5 and human drivers, while remaining efficient and deployment-friendly.
1. Active Perception for Grounded Reasoning
In complex scenarios, DriveAgent-R1 proactively uses tools like RoI Inspection to clarify uncertainty. This grounds its decisions in verifiable visual evidence, enhancing reliability and interpretability.
The agent actively inspects a confusing scene to discover a minor collision, leading to a safe plan to stop.
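Conceptually, RoI Inspection crops and magnifies a region of the input frame so the model can re-examine it at higher effective resolution. A minimal NumPy sketch of that idea (the function name roi_inspect and the nearest-neighbor upsampling are illustrative assumptions, not the repository's actual tool implementation):

```python
import numpy as np

def roi_inspect(image: np.ndarray, box: tuple, zoom: int = 2) -> np.ndarray:
    """Crop a region of interest (x0, y0, x1, y1) and magnify it.

    Nearest-neighbor upsampling via np.kron stands in for whatever
    resampling the real tool uses; the point is that the agent gets
    a larger, more detailed view of an uncertain region.
    """
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]                      # H x W x C crop
    return np.kron(crop, np.ones((zoom, zoom, 1)))  # magnify each pixel

# A synthetic 64x64 RGB "frame"; inspect a 16x16 patch at 2x zoom.
frame = np.zeros((64, 64, 3), dtype=np.uint8)
patch = roi_inspect(frame, (8, 8, 24, 24), zoom=2)
print(patch.shape)  # (32, 32, 3)
```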
2. Hybrid-Thinking Framework
DriveAgent-R1 dynamically adapts its reasoning mode based on scene complexity, balancing computational efficiency with robust, in-depth analysis.
For simple cases, it uses text-based reasoning. For complex cases, it interleaves thoughts with tool calls to acquire new visual evidence.
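The switching behavior can be pictured as a dispatcher: below some complexity level the agent answers with text-only reasoning, otherwise it enters a tool-augmented loop. The toy sketch below illustrates only the control flow; the scalar score, threshold, and function name are assumptions for exposition, whereas the real model selects its mode end-to-end:

```python
def plan_mode(scene_complexity: float, threshold: float = 0.5) -> str:
    """Toy stand-in for hybrid thinking: a scalar score decides the mode.
    The actual model learns this switch rather than thresholding a score."""
    if scene_complexity < threshold:
        return "text-only reasoning"   # cheap, no tool calls
    return "tool-augmented reasoning"  # interleave thoughts with tool calls

print(plan_mode(0.2))  # text-only reasoning
print(plan_mode(0.9))  # tool-augmented reasoning
```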
We have released the training code for DriveAgent-R1:
- Training Code: The complete source code for all training stages, including Supervised Fine-Tuning (SFT) and Cascaded Reinforcement Learning (Cascaded RL), built on a multi-round tool-calling framework. See Installation and Project Structure for details.
We are committed to releasing the following assets:
- Evaluation Scripts: The full scripts to reproduce the benchmark results reported in our paper.
- Datasets (pending company approval)
  - Drive-Internal Dataset: The complete dataset, including training and test splits with all corresponding meta-action labels.
  - nuScenes Test Set: Our specific test split and the generated meta-action labels.
- Models (pending company approval)
  - Trained Model Weights: The model checkpoint of DriveAgent-R1, to allow for direct inference and replication of our results.
This project integrates DetAny3D for 3D object detection and Depth-Anything-V2 for depth estimation. Please follow the installation instructions below.
- Python 3.12
- CUDA 11.8+
- Ubuntu
Follow the official DetAny3D installation instructions to set up the environment. Download the required checkpoints and place them in the DetAny3D/checkpoints/ directory.
Important: Before training or inference, you need to start the 3D detection server in a separate terminal:
```bash
cd DetAny3D
bash start_server.sh
```

Follow the official Depth-Anything-V2 installation instructions to set up the environment. Download the model weight depth_anything_v2_vitl.pth and place it in the tool-rl/Depth-Anything-V2/checkpoints/ directory.
Note: The weight file used in this project is depth_anything_v2_vitl.pth (Depth-Anything-V2-Large model).
You can download the checkpoint from:
```bash
cd tool-rl
pip install -e .
```

This will install the training framework and all necessary dependencies.
The key components of the training codebase are organized as follows:
- Tool Library: tool-rl/src/r1-v/src/open_r1/tools - Contains all vision tools, including 3D object detection, depth estimation, RoI inspection, etc.
- Training Scripts: tool-rl/src/scripts
- Multi-round Tool Calling Training Framework: tool-rl/src/r1-v/src/open_r1/trainer/vllm_grpo_trainer_modified.py - The core training framework that supports multi-round tool calling with GRPO (Group Relative Policy Optimization)
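Conceptually, a multi-round tool-calling rollout alternates between generating reasoning and executing tool calls, feeding each tool's observation back into the context until the model emits a final answer. The sketch below illustrates that control flow under stated assumptions; the model and tools interfaces are hypothetical stand-ins, not the trainer's actual API:

```python
def rollout(model, tools, prompt, max_rounds=4):
    """Toy multi-round tool-calling loop.

    Assumes `model(context)` returns either ("tool", name, args) or
    ("answer", text), and `tools` maps a tool name to a callable that
    returns an observation string.
    """
    context = [prompt]
    for _ in range(max_rounds):
        kind, *rest = model(context)
        if kind == "answer":
            return rest[0], context
        name, args = rest
        observation = tools[name](*args)              # run the vision tool
        context.append(f"[{name} -> {observation}]")  # feed result back
    return "max rounds reached", context

# Minimal demo: a scripted "model" that calls one tool, then answers.
steps = iter([("tool", "roi_inspect", ((8, 8, 24, 24),)),
              ("answer", "Decelerate")])
model = lambda ctx: next(steps)
tools = {"roi_inspect": lambda box: f"magnified {box}"}
answer, trace = rollout(model, tools, "plan next action")
print(answer)  # Decelerate
```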
Here are some qualitative examples illustrating DriveAgent-R1's capabilities in diverse driving scenarios.
For common, low-complexity situations, DriveAgent-R1 defaults to its efficient text-only reasoning mode.
Case 1: Navigating a Toll Booth
In a routine scenario approaching a toll booth, DriveAgent-R1 recognizes the low complexity and lack of ambiguity. It employs its efficient text-only mode to formulate a safe plan: decelerate to pass through the gate, then maintain speed to proceed.
Case 2: Driving on an Open Road
With a straight road and an open view, the agent defaults to text-only reasoning due to the low scene complexity. It accurately assesses the simple conditions and plans to "Keep Speed, Continue Straight" without invoking its vision toolkit, demonstrating the efficiency of the hybrid framework.
In complex or uncertain environments, DriveAgent-R1 proactively invokes its Vision Toolkit to gather crucial evidence and ground its decisions.
Case 3: Navigating a Busy Intersection
Facing a busy intersection with heavy traffic, the initial view is insufficient to determine the traffic light's status. The agent uses RoI Inspection to confirm the green light and Retrieve View to assess the dense flow of crossing traffic. This evidence allows it to formulate a safe plan: stop and wait for pedestrians to clear before turning.
Case 4: Uncovering Hazards at Night
While navigating a narrow road at night, the initial perception fails to detect a pedestrian in the dark. By invoking the 3D Object Detection tool, the agent performs active perception and successfully identifies the overlooked pedestrian, prompting it to decelerate in response to the newly revealed risk.
Case 5: Passing a Barrier with Depth Awareness
When precise distance judgment is needed, such as passing a barrier, the Depth Estimation tool proves invaluable. The depth map confirms the gate's proximity while also revealing the open road beyond, allowing the agent to formulate a multi-stage plan: slow down for the obstacle, then accelerate once safely through.
Case 6: Assessing Distant Pedestrian Risk
Even on a seemingly clear road, the agent exhibits proactive caution by investigating distant pedestrians whose proximity to the road is uncertain. It deploys RoI Inspection to get a magnified view, revealing that the individuals are very close to the lane of travel. This insight elevates the potential risk, prompting a prudent decision to decelerate.
If you find this work useful, please consider citing:
@inproceedings{driveagentr1,
  title={DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking},
  author={Weicheng Zheng and Xiaofei Mao and Nanfei Ye and Pengxiang Li and Kun Zhan and XianPeng Lang and Hang Zhao},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}

This project is built upon R1-V, a powerful framework for reinforcement learning in vision-language models. We sincerely thank the following open-source projects and communities:
- R1-V: The foundation framework for our training infrastructure
- DetAny3D: 3D object detection capabilities
- Depth-Anything-V2: Depth estimation model
- Qwen-VL: Vision-language model backbone
We are grateful for their excellent work and contributions to the open-source community.