Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Zhenxin Li1, 2  Kailin Li3  Shihao Wang1, 4  Shiyi Lan1  Zhiding Yu1 Yishen Ji5
Zhiqi Li5  Ziyue Zhu6  Jan Kautz1  Zuxuan Wu2  Yu-Gang Jiang2  Jose M. Alvarez1
1NVIDIA  2Fudan University  3East China Normal University
4Beijing Institute of Technology  5Nanjing University  6Nankai University
Abstract

We propose Hydra-MDP, a novel paradigm employing multiple teachers in a teacher-student model. This approach uses knowledge distillation from both human and rule-based teachers to train the student model, which features a multi-head decoder to learn diverse trajectory candidates tailored to various evaluation metrics. With the knowledge of rule-based teachers, Hydra-MDP learns how the environment influences the planning in an end-to-end manner instead of resorting to non-differentiable post-processing. This method achieves 1st place in the Navsim challenge, demonstrating significant improvements in generalization across diverse driving environments and conditions. Code will be available at https://github.com/woxihuanjiangguo/Hydra-MDP

1 Introduction

End-to-end autonomous driving, which involves learning a neural planner with raw sensor inputs, is considered a promising direction to achieve full autonomy. Despite the promising progress in this field [11, 12], recent studies [8, 14, 4] have exposed multiple vulnerabilities and limitations of imitation learning (IL) methods, particularly the inherent issues in open-loop evaluation, such as dysfunctional metrics and implicit biases [14, 8]. This is critical because open-loop evaluation fails to guarantee safety, efficiency, comfort, and compliance with traffic rules. To address this main limitation, several works have proposed incorporating closed-loop metrics, which more effectively evaluate end-to-end autonomous driving by ensuring that the machine-learned planner meets essential criteria beyond merely mimicking human drivers.

Therefore, end-to-end planning is ideally a multi-target and multimodal task, where multi-target planning involves meeting various evaluation metrics from both open-loop and closed-loop settings. In this context, multimodal indicates the existence of multiple optimal solutions for each metric.

Existing end-to-end approaches [4, 12, 11] often try to consider closed-loop evaluation via post-processing, which is not streamlined and may result in the loss of additional information compared to a fully end-to-end pipeline. Meanwhile, rule-based planners [8, 18] struggle with imperfect perception inputs. These imperfect inputs degrade the performance of rule-based planning under both closed-loop and open-loop metrics, as they rely on predicted perception instead of ground truth (GT) labels.

Figure 1: Comparison between End-to-end Planning Paradigms.

To address the issues, we propose a novel end-to-end autonomous driving framework called Hydra-MDP (Multimodal Planning with Multi-target Hydra-distillation). Hydra-MDP is based on a novel teacher-student knowledge distillation (KD) architecture. The student model learns diverse trajectory candidates tailored to various evaluation metrics through KD from both human and rule-based teachers. We instantiate the multi-target Hydra-distillation with a multi-head decoder, thus effectively integrating the knowledge from specialized teachers. Hydra-MDP also features an extendable KD architecture, allowing for easy integration of additional teachers.

The student model uses environmental observations during training, while the teacher models use ground truth (GT) data. This setup allows the teacher models to generate better planning predictions, helping the student model to learn effectively. By training the student model with environmental observations, it becomes adept at handling realistic conditions where GT perception is not accessible during testing.

Our contributions are summarized as follows:

  1. We propose a universal framework of end-to-end multimodal planning via multi-target hydra-distillation, allowing the model to learn from both rule-based planners and human drivers in a scalable manner.

  2. Our approach achieves state-of-the-art performance under the simulation-based evaluation metrics on Navsim.

2 Solution

Figure 2: The Overall Architecture of Hydra-MDP.

2.1 Preliminaries

Let $O$ represent sensor observations, $\hat{P}$ and $P$ denote ground truth and predicted perceptions (e.g. 3D object detection, lane detection), $\hat{T}$ be the expert trajectory, and $T^{*}$ be the predicted trajectory. $\mathcal{L}_{im}$ represents the imitation loss. We first introduce the two prevailing paradigms and our proposed paradigm (Fig. 1) in this section:

A. Single-modal Planning + Single-target Learning. In this paradigm [11, 12, 14], the planning network directly regresses the planned trajectory from the sensor observations. Ground truth perceptions can be used as auxiliary supervision but do not influence the planning output. Perception losses are omitted from the formula for simplicity. The whole process can be formulated as:

$$\mathcal{L} = \mathcal{L}_{im}(T^{*}, \hat{T}), \tag{1}$$

where $\mathcal{L}_{im}$ is usually an L2 loss.

B. Multimodal Planning + Single-target Learning. This approach [4, 1] predicts multiple trajectories $\{T_i\}_{i=1}^{k}$, whose similarities to the expert trajectory are computed:

$$\mathcal{L} = \sum_{i} \mathcal{L}_{im}(T_i, \hat{T}), \tag{2}$$

where $\mathcal{L}_{im}$ can be KL-Divergence [4] or the max-margin loss [1]. Perception outputs $P$ are explicitly used to post-process suitable trajectories via a cost function $f(T_i, P)$. The trajectory with the lowest cost is selected:

$$T^{*} = \underset{T_i}{\arg\min}\, f(T_i, P), \tag{3}$$

which is a non-differentiable process based on imperfect perception $P$.

C. Multimodal Planning + Multi-target Learning. We propose this paradigm to simultaneously predict various costs (e.g., collision cost, drivable area compliance cost) via a neural network $\tilde{f}$. This is performed in a teacher-student distillation manner, where the teacher has access to ground truth perception $\hat{P}$ but the student relies only on sensor observations $O$. This paradigm can be formulated as:

$$\mathcal{L} = \sum_{i} \mathcal{L}_{im}(T_i, \hat{T}) + \mathcal{L}_{kd}(f(T_i, \hat{P}), \tilde{f}(T_i, O)). \tag{4}$$

Here, we only consider one cost function $f$ for clarity. The trajectory with the lowest predicted cost is selected:

$$T^{*} = \underset{T_i}{\arg\min}\, \tilde{f}(T_i, O). \tag{5}$$

We stress that this framework is not restricted by non-differentiable post-processing. It can be easily scaled in an end-to-end fashion by involving more cost functions or leveraging imitation similarity in our implementation (Sec. 2.4).

2.2 Overall Framework

As shown in Fig. 2, Hydra-MDP consists of two networks: a Perception Network and a Trajectory Decoder.

Perception Network. Our perception network builds upon the official challenge baseline Transfuser [5, 6], which consists of an image backbone, a LiDAR backbone, and perception heads for 3D object detection and BEV segmentation. Multiple transformer layers [19] connect features from stages of both backbones, extracting meaningful information from different modalities. The final output of the perception network comprises environmental tokens $F_{env}$, which encode abundant semantic information derived from both images and LiDAR point clouds.

Trajectory Decoder. Following Vadv2 [4], we construct a fixed planning vocabulary to discretize the continuous action space. To build the vocabulary, we first sample 700K trajectories randomly from the original nuPlan database [2]. Each trajectory $T_i$ $(i = 1, \dots, k)$ consists of 40 timestamps of $(x, y, heading)$, corresponding to the desired 10Hz frequency and a 4-second future horizon in the challenge. The planning vocabulary $\mathcal{V}_k$ is formed as K-means clustering centers of the 700K trajectories, where $k$ denotes the size of the vocabulary. $\mathcal{V}_k$ is then embedded as $k$ latent queries with an MLP, sent into layers of transformer encoders [19], and added to the ego status $E$:

$$\mathcal{V}'_k = \mathrm{Transformer}(Q, K, V = \mathrm{Mlp}(\mathcal{V}_k)) + E. \tag{6}$$
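As a rough illustration of the vocabulary construction described above, the following sketch runs plain Lloyd's K-means over flattened trajectories and keeps the cluster centers as the vocabulary. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`build_vocabulary`, `nearest_center`), the toy data, the iteration count, and the random initialization are all hypothetical. The paper itself samples 700K nuPlan trajectories and reports vocabulary sizes of 4096 and 8192.

```python
import random


def squared_distance(a, b):
    """Squared L2 distance between two flattened trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(traj, centers):
    """Index of the cluster center closest to `traj`."""
    return min(range(len(centers)), key=lambda j: squared_distance(traj, centers[j]))


def build_vocabulary(trajectories, k, iterations=10, seed=0):
    """Plain Lloyd's K-means; returns k cluster centers (the planning vocabulary).

    Assumption: initial centers are k randomly chosen trajectories.
    """
    rng = random.Random(seed)
    centers = [list(t) for t in rng.sample(trajectories, k)]
    for _ in range(iterations):
        # Assignment step: group every trajectory with its nearest center.
        clusters = [[] for _ in range(k)]
        for traj in trajectories:
            clusters[nearest_center(traj, centers)].append(traj)
        # Update step: move each center to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Toy data (hypothetical): 200 random trajectories, each with
# 40 timestamps of (x, y, heading) flattened into 120 values.
toy_rng = random.Random(1)
toy = [[toy_rng.uniform(-1.0, 1.0) for _ in range(40 * 3)] for _ in range(200)]
vocab = build_vocabulary(toy, k=8)
print(len(vocab), len(vocab[0]))  # 8 120
```

At the scale used in the paper, a vectorized or GPU-based K-means would be the practical choice; the pure-Python loop here only shows the logic.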

To incorporate environmental clues in $F_{env}$, transformer decoders are leveraged:

$$\mathcal{V}''_k = \mathrm{Transformer}(Q = \mathcal{V}'_k, K, V = F_{env}). \tag{7}$$

Using the log-replay trajectory $\hat{T}$, we implement a distance-based cross-entropy loss to imitate human drivers:

$$\mathcal{L}_{im} = -\sum_{i=1}^{k} y_i \log(\mathcal{S}^{im}_i), \tag{8}$$

where $\mathcal{S}^{im}_i$ is the $i$-th softmax score of $\mathcal{V}''_k$, and $y_i$ is the imitation target produced by L2 distances between log-replays and the vocabulary. Softmax is applied on L2 distances to produce a probability distribution:

$$y_i = \frac{e^{-(\hat{T} - T_i)^2}}{\sum_{j=1}^{k} e^{-(\hat{T} - T_j)^2}}. \tag{9}$$

The intuition behind this imitation target is to reward trajectory proposals that are close to human driving behaviors.
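The imitation target and loss in Eqs. 8 and 9 can be sketched in a few lines. This is a minimal illustration, assuming that trajectories are flattened into lists of floats and that the decoder produces one raw logit per vocabulary entry; the function names and toy values are hypothetical and not taken from the released code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def imitation_targets(expert, vocabulary):
    """Eq. 9: softmax over negative squared L2 distances to the expert trajectory."""
    neg_dists = [
        -sum((e - t) ** 2 for e, t in zip(expert, traj)) for traj in vocabulary
    ]
    return softmax(neg_dists)


def imitation_loss(targets, logits):
    """Eq. 8: cross-entropy between soft targets y and softmax scores S^im."""
    scores = softmax(logits)
    return -sum(y * math.log(s) for y, s in zip(targets, scores))


# Toy example: the first vocabulary entry coincides with the expert trajectory.
expert = [1.0, 0.0]
vocabulary = [[1.0, 0.0], [0.0, 0.0], [3.0, 0.0]]
y = imitation_targets(expert, vocabulary)

aligned = imitation_loss(y, [2.0, 0.0, -1.0])   # logits favor the closest entry
reversed_ = imitation_loss(y, [-1.0, 0.0, 2.0])  # logits favor the farthest entry
print(aligned < reversed_)  # True
```

Because the targets are soft rather than one-hot, vocabulary entries near the expert trajectory still receive some probability mass, which matches the stated intuition of rewarding proposals close to human driving.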

2.3 Multi-target Hydra-Distillation

Though the imitation target provides certain clues for the planner, it is insufficient for the model to associate the planning decision with the driving environment under the closed-loop setting, leading to failures such as collisions and leaving drivable areas [14]. Therefore, to boost the closed-loop performance of our end-to-end planner, we propose Multi-target Hydra-Distillation, a learning strategy that aligns the planner with simulation-based metrics in this challenge.

The distillation process expands the learning target through two steps: (1) running offline simulations [8] of the planning vocabulary $\mathcal{V}_k$ for the entire training dataset; (2) introducing supervision from simulation scores for each trajectory in $\mathcal{V}_k$ during the training process. For a given scenario, step 1 generates ground truth simulation scores $\{\hat{\mathcal{S}}^m_i \mid i = 1, \dots, k\}_{m=1}^{|M|}$ for each metric $m \in M$ and the $i$-th trajectory, where $M$ represents the set of closed-loop metrics used in the challenge. For score predictions, latent vectors $\mathcal{V}''_k$ are processed with a set of Hydra Prediction Heads, yielding predicted scores $\{\mathcal{S}^m_i \mid i = 1, \dots, k\}_{m=1}^{|M|}$. With a binary cross-entropy loss, we distill rule-based driving knowledge into the end-to-end planner:

$$\mathcal{L}_{kd} = -\sum_{m,i} \left[ \hat{\mathcal{S}}^m_i \log \mathcal{S}^m_i + (1 - \hat{\mathcal{S}}^m_i) \log(1 - \mathcal{S}^m_i) \right]. \tag{10}$$

For a trajectory $T_i$, its distillation loss of each sub-score acts as a learned cost value in Eq. 4, measuring the violation of particular traffic rules associated with that metric.
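The distillation loss in Eq. 10 can be illustrated with a short sketch. It assumes, as is common but not stated in the text, that each Hydra Prediction Head emits one raw logit per trajectory which is mapped to a score with a sigmoid; the dictionary layout, the function names, and the toy values are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def distillation_loss(teacher_scores, head_logits, eps=1e-7):
    """Eq. 10: binary cross-entropy summed over metrics m and trajectories i.

    Both arguments map a metric name to a list of k values
    (teacher scores in [0, 1] and student logits, respectively).
    """
    loss = 0.0
    for metric, targets in teacher_scores.items():
        for target, logit in zip(targets, head_logits[metric]):
            # Clamp the prediction so that log() never receives 0.
            pred = min(max(sigmoid(logit), eps), 1.0 - eps)
            loss -= target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred)
    return loss


# Toy example with k = 3 trajectories and two metrics.
teacher = {"NC": [1.0, 0.0, 1.0], "DAC": [1.0, 1.0, 0.0]}
good = {"NC": [3.0, -3.0, 3.0], "DAC": [3.0, 3.0, -3.0]}   # agrees with the teacher
bad = {"NC": [-3.0, 3.0, -3.0], "DAC": [-3.0, -3.0, 3.0]}  # contradicts the teacher

print(distillation_loss(teacher, good) < distillation_loss(teacher, bad))  # True
```

Keeping one head per metric means that a further rule-based teacher only adds another key to both dictionaries, which reflects the extendable design described earlier.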

2.4 Inference and Post-processing

2.4.1 Inference

Given the predicted imitation scores $\{\mathcal{S}^{im}_i \mid i = 1, \dots, k\}$ and metric sub-scores $\{\mathcal{S}^m_i \mid i = 1, \dots, k\}_{m=1}^{|M|}$, we calculate an assembled cost measuring the likelihood of each trajectory being selected in the given scenario as follows:

$$\tilde{f}(T_i, O) = -\left( w_1 \log \mathcal{S}^{im}_i + w_2 \log \mathcal{S}^{NC}_i + w_3 \log \mathcal{S}^{DAC}_i + w_4 \log\left(5 \mathcal{S}^{TTC}_i + 2 \mathcal{S}^{C}_i + 5 \mathcal{S}^{EP}_i\right) \right), \tag{11}$$

where $\{w_i\}_{i=1}^{4}$ represent confidence weighting parameters to mitigate the imperfect fitting of different teachers. The optimal combination of weights is obtained via grid search; the weights typically fall within the following ranges: $0.01 \leq w_1 \leq 0.1$, $0.1 \leq w_2, w_3 \leq 1$, $1 \leq w_4 \leq 10$, indicating the necessity to prioritize rule-based costs over imitation. Finally, the trajectory with the lowest overall cost is chosen.
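The selection rule in Eq. 11 can be sketched as follows. This is a minimal illustration under stated assumptions: the default weights are arbitrary values picked from inside the ranges reported above rather than the tuned values, and the candidate scores, dictionary keys, and function names are hypothetical.

```python
import math


def safe_log(value, eps=1e-6):
    """Logarithm with a lower clamp so that a zero score does not raise."""
    return math.log(max(value, eps))


def assembled_cost(scores, weights=(0.05, 0.5, 0.5, 5.0)):
    """Eq. 11 for one candidate trajectory.

    `scores` maps "im", "NC", "DAC", "TTC", "C", "EP" to predicted scores.
    The default weights are illustrative, not the paper's tuned values.
    """
    w1, w2, w3, w4 = weights
    return -(
        w1 * safe_log(scores["im"])
        + w2 * safe_log(scores["NC"])
        + w3 * safe_log(scores["DAC"])
        + w4 * safe_log(5 * scores["TTC"] + 2 * scores["C"] + 5 * scores["EP"])
    )


def select_trajectory(candidates, weights=(0.05, 0.5, 0.5, 5.0)):
    """Return the index of the candidate with the lowest assembled cost."""
    costs = [assembled_cost(c, weights) for c in candidates]
    return min(range(len(costs)), key=costs.__getitem__)


candidates = [
    # Most human-like, but likely to collide (low NC).
    {"im": 0.6, "NC": 0.10, "DAC": 0.90, "TTC": 0.5, "C": 1.0, "EP": 0.8},
    # Less human-like, but safe and compliant.
    {"im": 0.3, "NC": 0.95, "DAC": 0.95, "TTC": 0.9, "C": 1.0, "EP": 0.7},
    # Good progress, but likely to leave the drivable area (low DAC).
    {"im": 0.1, "NC": 0.90, "DAC": 0.20, "TTC": 0.9, "C": 1.0, "EP": 0.9},
]
best = select_trajectory(candidates)
print(best)  # 1
```

With these weights the safe candidate wins even though it has only half the imitation score of the first one, which illustrates why rule-based costs are weighted more heavily than imitation.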

2.4.2 Model Ensembling

We present two model ensembling techniques: Mixture of Encoders and Sub-score Ensembling. The former technique uses a linear layer to combine features from different vision encoders, while the latter calculates a weighted sum of sub-scores from independent models for trajectory selection.
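Sub-score Ensembling, as described above, can be pictured as a per-metric weighted sum over models. The sketch below is one plausible reading rather than the authors' implementation: normalizing by the total weight, the data layout, and all names and values are assumptions made for illustration.

```python
def ensemble_sub_scores(model_scores, model_weights):
    """Weighted combination of per-trajectory sub-scores from several models.

    model_scores: one dict per model, mapping a metric name to a list of k scores.
    model_weights: one non-negative weight per model (hypothetical values).
    Assumption: the weighted sum is normalized by the total weight.
    """
    total = sum(model_weights)
    combined = {}
    for metric, first in model_scores[0].items():
        combined[metric] = [
            sum(w * scores[metric][i] for w, scores in zip(model_weights, model_scores))
            / total
            for i in range(len(first))
        ]
    return combined


# Toy example: two models, two candidate trajectories, one metric.
model_a = {"NC": [0.8, 0.2]}
model_b = {"NC": [0.6, 0.4]}
combined = ensemble_sub_scores([model_a, model_b], [1.0, 1.0])
```

The combined sub-scores can then be used for trajectory selection in the same way as the sub-scores of a single model.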

3 Experiments

| Method | Inputs | NC | DAC | EP | TTC | C | Score |
|---|---|---|---|---|---|---|---|
| PDM-Closed [8] $\diamond$ | Perception GT | 94.6 | 99.8 | 89.9 | 86.9 | 99.9 | 89.1 |
| Transfuser [5] | LiDAR & Camera | 96.5 | 87.9 | 73.9 | 90.2 | 100 | 78.0 |
| Vadv2-$\mathcal{V}_{4096}$ [4]* | LiDAR & Camera | 97.1 | 88.8 | 74.9 | 91.4 | 100 | 79.7 |
| Vadv2-$\mathcal{V}_{4096}$ [4]*-PP | LiDAR & Camera | 97.0 | 89.1 | 75.0 | 91.2 | 100 | 79.9 |
| Vadv2-$\mathcal{V}_{8192}$ [4]* | LiDAR & Camera | 97.2 | 89.1 | 76.0 | 91.6 | 100 | 80.9 |
| Hydra-MDP-$\mathcal{V}_{4096}$ | LiDAR & Camera | 97.7 | 91.5 | 77.5 | 92.7 | 100 | 82.6 |
| Hydra-MDP-$\mathcal{V}_{8192}$ | LiDAR & Camera | 97.9 | 91.7 | 77.6 | 92.9 | 100 | 83.0 |
| Hydra-MDP-$\mathcal{V}_{8192}$-PDM | LiDAR & Camera | 97.5 | 88.9 | 74.8 | 92.5 | 100 | 80.2 |
| Hydra-MDP-$\mathcal{V}_{8192}$-W | LiDAR & Camera | 98.1 | 96.1 | 77.8 | 93.9 | 100 | 85.7 |
| Hydra-MDP-$\mathcal{V}_{8192}$-W-EP | LiDAR & Camera | 98.3 | 96.0 | 78.7 | 94.6 | 100 | 86.5 |

Table 1: Performance on the Navtest Split. $\diamond$ The official Navsim implementation of PDM-Closed is potentially prone to errors due to inconsistent braking maneuvers and offset formulation compared with the nuPlan implementation [8]. All end-to-end methods use the official Transfuser [5] as the perception network. * Our distance-based imitation loss is adopted for training. PP: Transfuser perception is used for post-processing. PDM: The learning target is the overall PDM score. W: Weighted confidence during inference. EP: The model is trained to fit the continuous EP (Ego Progress) metric.

| Method | Img. Resolution | Backbone | NC | DAC | EP | TTC | C | Score |
|---|---|---|---|---|---|---|---|---|
| PDM-Closed [8] $\diamond$ | - | - | 94.6 | 99.8 | 89.9 | 86.9 | 99.9 | 89.1 |
| Hydra-MDP-A | $256 \times 1024$ | ViT-L* | 98.4 | 97.7 | 85.0 | 94.5 | 100 | 89.9 |
| Hydra-MDP-B | $512 \times 2048$ | V2-99 | 98.4 | 97.8 | 86.5 | 93.9 | 100 | 90.3 |
| Hydra-MDP-C | $256 \times 1024$, $256 \times 1024$, $512 \times 2048$ | ViT-L*, ViT-L†, V2-99 | 98.7 | 98.2 | 86.5 | 95.0 | 100 | 91.0 |

Table 2: The Impact of Scaling Up on the Navtest Split. $\diamond$ The official Navsim implementation of PDM-Closed. * ViT-L is initialized from Depth Anything [20]. † ViT-L is EVA [9] pretrained on Objects365 [17] and COCO [15]. V2-99 [13] is initialized from DD3D [16].

3.1 Dataset and metrics

Dataset. The Navsim dataset builds on the existing OpenScene [7] dataset, a compact version of nuPlan [3] with only relevant annotations and sensor data sampled at 2 Hz. The dataset primarily focuses on scenarios involving changes in intention, where the ego vehicle’s historical data cannot be extrapolated into a future plan. The dataset provides annotated 2D high-definition maps with semantic categories and 3D bounding boxes for objects. The dataset is split into two parts: Navtrain and Navtest, which respectively contain 1192 and 136 scenarios for training/validation and testing.

Metrics. For this challenge, we evaluate our models based on the PDM score, which can be formulated as follows:

$$PDM_{score} = NC \times DAC \times DDC \times \frac{(5 \times TTC + 2 \times C + 5 \times EP)}{12}, \tag{12}$$

where sub-metrics $NC$, $DAC$, $TTC$, $C$, $EP$ correspond to the No at-fault Collisions, Drivable Area Compliance, Time to Collision, Comfort, and Ego Progress. For the distillation process and subsequent results, $DDC$ is neglected due to an implementation problem (see https://github.com/autonomousvision/navsim/issues/14).
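Eq. 12 is straightforward to evaluate for a single set of sub-scores. The sketch below assumes all sub-metrics are expressed in $[0, 1]$ and models the neglected $DDC$ term as a default of 1.0; the function name and the example values are hypothetical.

```python
def pdm_score(nc, dac, ttc, c, ep, ddc=1.0):
    """Eq. 12 for one set of sub-scores in [0, 1].

    `ddc` defaults to 1.0 to mirror the setting in which DDC is neglected.
    """
    return nc * dac * ddc * (5 * ttc + 2 * c + 5 * ep) / 12


# NC and DAC act as multiplicative gates: a zero in either one zeroes the score.
print(pdm_score(nc=0.0, dac=1.0, ttc=1.0, c=1.0, ep=1.0))  # 0.0
# TTC, C and EP enter through a weighted average with weights 5, 2 and 5.
print(pdm_score(nc=1.0, dac=1.0, ttc=1.0, c=1.0, ep=1.0))  # 1.0
```

The multiplicative structure explains why collision and drivable-area violations dominate the final score, while comfort contributes comparatively little.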

3.2 Implementation Details

We train our models on the Navtrain split using 8 NVIDIA A100 GPUs, with a total batch size of 256 across 20 epochs. The learning rate and weight decay are set to $1 \times 10^{-4}$ and 0.0 following the official baseline. LiDAR points from 4 frames are splatted onto the BEV plane to form a density BEV feature, which is encoded using ResNet34 [10]. For images, the front-view image is concatenated with the center-cropped front-left-view and front-right-view images, yielding an input resolution of $256 \times 1024$ by default. ResNet34 is also applied for feature extraction unless otherwise specified. No data or test-time augmentations are used.

3.3 Main Results

Our results, presented in Tab. 1, show a consistent advantage of Hydra-MDP over the baselines. In our exploration of different planning vocabularies [4], utilizing a larger vocabulary $\mathcal{V}_{8192}$ yields improvements across different methods. Furthermore, non-differentiable post-processing yields smaller performance gains than our framework, while weighted confidence enhances the performance comprehensively. To ablate the effect of different learning targets, the continuous metric EP (Ego Progress) is not considered in early experiments and we attempt the distillation of the overall PDM score. Nonetheless, the irregular distribution of the PDM score incurs performance degradation, which suggests the necessity of our multi-target learning paradigm. In the final version, Hydra-MDP-$\mathcal{V}_{8192}$-W-EP, the distillation of EP improves the corresponding metric.

3.4 Scaling Up and Model Ensembling

Previous literature [11] suggests larger backbones only lead to minor improvements in planning performance. Nevertheless, we further demonstrate the scalability of our model with larger backbones. Tab. 2 shows three best-performing versions of Hydra-MDP with ViT-L [20, 9] and V2-99 [13] as the image backbone. For the final submission, we use the ensembled sub-scores of these three models for inference.

References

  • Biswas et al. [2024] Sourav Biswas, Sergio Casas, Quinlan Sykora, Ben Agro, Abbas Sadat, and Raquel Urtasun. Quad: Query-based interpretable neural motion planning for autonomous driving. arXiv preprint arXiv:2404.01486, 2024.
  • Caesar et al. [2021a] Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021a.
  • Caesar et al. [2021b] Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021b.
  • Chen et al. [2024] Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243, 2024.
  • Chitta et al. [2022] Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • Contributors [2024] NAVSIM Contributors. Navsim: Data-driven non-reactive autonomous vehicle simulation. https://github.com/autonomousvision/navsim, 2024.
  • Contributors [2023] OpenScene Contributors. Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving. https://github.com/OpenDriveLab/OpenScene, 2023.
  • Dauner et al. [2023] Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with misconceptions about learning-based vehicle motion planning. In Conference on Robot Learning, pages 1268–1281. PMLR, 2023.
  • Fang et al. [2023] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hu et al. [2023] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
  • Jiang et al. [2023] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023.
  • Lee et al. [2019] Youngwan Lee, Joong-won Hwang, Sangrok Lee, Yuseok Bae, and Jongyoul Park. An energy and gpu-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.
  • Li et al. [2023] Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? arXiv preprint arXiv:2312.03031, 2023.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Park et al. [2021] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is pseudo-lidar needed for monocular 3d object detection? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3142–3152, 2021.
  • Shao et al. [2019] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019.
  • Treiber et al. [2000] Martin Treiber, Ansgar Hennecke, and Dirk Helbing. Congested traffic states in empirical observations and microscopic simulations. Physical review E, 62(2):1805, 2000.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Yang et al. [2024] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. arXiv preprint arXiv:2401.10891, 2024.