End-to-End Driving at Scale with Hydra-MDP

Building an autonomous system to navigate the complex physical world is extremely challenging. The system must perceive its environment and make quick, sensible decisions. Passenger experience also matters, and it depends on factors such as acceleration, curvature, smoothness, road adherence, and time-to-collision.

In this post, we introduce Hydra-MDP, a framework for end-to-end autonomous driving. Hydra-MDP uses a multi-teacher knowledge distillation architecture in which a student model learns from both human and rule-based planners. This enables the model to learn diverse trajectories, improving generalization across driving environments and conditions.

Diagram shows the various capabilities that are available with single-modal planning and single-target learning compared to multimodal planning and multi-target learning. — *Figure 1. End-to-end planning paradigm comparison*

Hydra-MDP provides a universal framework showing how machine learning-based planning can be enhanced by rule-based planners. This integration ensures the model not only mimics human driving behaviors but also adheres to traffic rules and safety standards, addressing traditional imitation learning limitations.

Hydra-MDP also scales well. Its performance improves with pretrained foundation models, more data, and more GPU hours, which indicates potential for continued improvement.

The NVIDIA model Hydra-MDP won first place and the innovation award in the E2E Driving at Scale Challenge at CVPR 2024, outperforming state-of-the-art planners on the nuPlan benchmark. It offers a promising roadmap for the application of ML-based planning systems in autonomous driving.

Video 1. End-to-end autonomous driving refers to a holistic approach where a system takes in raw sensor data from cameras, radar, and lidar, and directly outputs vehicle controls.

Enhancing multimodal planning through multi-target hydra-distillation

Developing Hydra-MDP taught us several critical lessons that shaped its architecture and success. Hydra-MDP combines human and rule-based knowledge distillation to create a robust and versatile autonomous driving model.

Here are the key lessons we learned:

Embrace the complexity of multimodal and multi-target planning
Embrace the power of multi-target hydra-distillation
Overcome the limitations of post-processing
Understand the importance of environmental context
Refine iteratively through simulation
Use effective model ensembling

Embrace the complexity of multimodal and multi-target planning

A foundational lesson was the necessity of embracing both multimodal and multi-target planning.

Traditional end-to-end autonomous driving systems often focus on single-modal and single-target objectives, limiting their real-world effectiveness. Hydra-MDP integrates diverse trajectories tailored to multiple metrics, including safety, efficiency, and comfort. This ensures that the model adapts to complex driving environments, not just mimicking human drivers.

Diagram shows three components: Perception Network, Trajectory Decoder, and Multi-Target Hydra-Distillation. — *Figure 2. Hydra-MDP architecture*

Embrace the power of multi-target hydra-distillation

Multi-target hydra-distillation, a multimodal teacher-student framework, was pivotal to our approach. The model learns from multiple specialized teachers, both human and rule-based, to predict trajectories that align with various simulation-based metrics. This technique improves the model’s generalization across diverse driving conditions.

We learned that incorporating rule-based planners provided a structured framework, while human teachers introduced adaptability and nuanced decision-making capabilities, essential for navigating unpredictable scenarios.
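To make the teacher-student setup more concrete, here is a minimal sketch in plain Python of what a multi-target distillation loss could look like. It assumes a fixed set of candidate trajectories, a human-teacher target built as a softmax over negative distances to the logged human trajectory, and one prediction head per rule-based metric trained with binary cross-entropy against simulation scores. The function names, the temperature, the metric name, and the toy numbers are all illustrative assumptions, not the released implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def l2_distance(traj_a, traj_b):
    """Euclidean distance between two trajectories given as lists of (x, y) waypoints."""
    return math.sqrt(sum((ax - bx) ** 2 + (ay - by) ** 2
                         for (ax, ay), (bx, by) in zip(traj_a, traj_b)))

def imitation_loss(logits, candidates, human_traj, temperature=1.0):
    """Cross-entropy against a soft target: candidates closer to the
    human trajectory receive more probability mass (assumed scheme)."""
    target = softmax([-l2_distance(c, human_traj) / temperature for c in candidates])
    pred = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, pred))

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one prediction, clamped to avoid log(0)."""
    pred = min(max(pred, eps), 1 - eps)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

def distillation_loss(logits, metric_preds, candidates, human_traj, sim_scores):
    """Human-teacher imitation loss plus one averaged BCE term per rule-based teacher."""
    loss = imitation_loss(logits, candidates, human_traj)
    for metric, preds in metric_preds.items():
        targets = sim_scores[metric]
        loss += sum(bce(p, t) for p, t in zip(preds, targets)) / len(preds)
    return loss

# Toy data: two candidate trajectories and one hypothetical metric head.
candidates = [
    [(0.0, 0.0), (1.0, 0.0)],  # drives straight
    [(0.0, 0.0), (1.0, 1.0)],  # veers to the side
]
human_traj = [(0.0, 0.0), (1.0, 0.1)]        # logged human drive, close to "straight"
sim_scores = {"no_collision": [1.0, 0.0]}    # simulated ground truth per candidate

# A student that agrees with both teachers...
good_loss = distillation_loss([2.0, 0.0], {"no_collision": [0.9, 0.2]},
                              candidates, human_traj, sim_scores)
# ...and one that disagrees with both.
bad_loss = distillation_loss([0.0, 2.0], {"no_collision": [0.2, 0.9]},
                             candidates, human_traj, sim_scores)
print(f"good: {good_loss:.3f}, bad: {bad_loss:.3f}")
```

In this toy setup, the student that ranks the human-like, collision-free candidate highest incurs the lower loss, which is the behavior the combined human and rule-based supervision is meant to encourage.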

Overcome the limitations of post-processing

Another insight concerned the inherent limitations of relying on post-processing for trajectory selection.

Traditional methods often lose valuable information by separating perception and planning into distinct, non-differentiable steps. Hydra-MDP’s end-to-end architecture integrates perception and planning in a seamless pipeline and maintains the richness of environmental data throughout the decision-making process. This integration enables more informed and accurate predictions.

Understand the importance of environmental context

Incorporating detailed environmental context is crucial for accurate planning.

Hydra-MDP’s perception network builds on the Transfuser baseline, combining features from LiDAR and camera inputs. This multimodal fusion helps the model better understand and react to complex driving environments.

Transformer layers connect these modalities, ensuring thorough encoding of environmental context and providing rich, actionable insights.

Refine iteratively through simulation

The iterative refinement process, facilitated by offline simulations, proved invaluable.

Running simulations on the entire training dataset generated ground truth simulation scores for various metrics. This data was then used to supervise the training process, enabling the model to learn from a wide range of simulated driving scenarios.
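The offline scoring step can be pictured with the following sketch. It is a simplified illustration in plain Python, not the actual simulator: `simulate` is a hypothetical toy stand-in that checks only proximity to obstacles and straight-line progress, and the metric names, the safety radius, and the per-scene progress normalization are assumptions made for the example.

```python
import math

def simulate(trajectory, obstacles, safety_radius=1.0):
    """Toy stand-in for an offline simulator (hypothetical).
    Returns a collision flag as a 0/1 score and the raw distance travelled."""
    collided = any(math.dist(p, o) < safety_radius
                   for p in trajectory for o in obstacles)
    progress = math.dist(trajectory[0], trajectory[-1])
    return {"no_collision": 0.0 if collided else 1.0, "progress": progress}

def build_score_table(scenes, candidates):
    """Score every candidate trajectory in every scene.
    The result maps scene id -> metric name -> list of scores in [0, 1],
    which could then serve as supervision targets during training."""
    table = {}
    for scene_id, obstacles in scenes.items():
        raw = [simulate(c, obstacles) for c in candidates]
        # Normalize progress by the best candidate in the scene (assumed scheme);
        # fall back to 1.0 if no candidate moves at all.
        max_progress = max(r["progress"] for r in raw) or 1.0
        table[scene_id] = {
            "no_collision": [r["no_collision"] for r in raw],
            "progress": [r["progress"] / max_progress for r in raw],
        }
    return table

# Toy data: three candidate trajectories and two scenes.
candidates = [
    [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)],  # straight ahead
    [(0.0, 0.0), (2.0, 1.0), (4.0, 2.0)],  # gentle swerve
    [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],  # stay put
]
scenes = {
    "scene_a": [(2.0, -0.5)],  # one obstacle near the straight path
    "scene_b": [],             # empty road
}
table = build_score_table(scenes, candidates)
print(table["scene_a"])
```

Because the scores are computed once and stored, training does not need to run the simulator in the loop. Each training sample simply looks up the targets for its scene.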

This step highlighted the importance of extensive simulation in bridging the gap between theoretical performance and real-world applicability.

Method	Image Resolution	Backbone	Pretraining	NC	DAC	EP	TTC	C	Score
Hydra-MDP-A	256 × 1024	ViT-L	Depth Anything	98.4	97.7	85.0	94.5	100	89.9
Hydra-MDP-B	512 × 2048	V2-99	DD3D	98.4	97.8	86.5	93.9	100	90.3
Hydra-MDP-C	256 × 1024, 256 × 1024, 512 × 2048	ViT-L, ViT-L, V2-99	Depth Anything, Objects365 + COCO, DD3D	98.7	98.2	86.5	95.0	100	91.0

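Table 1. Performance of Hydra-MDP as a function of input image resolution, pretraining, and backbone architecture. NC is no at-fault collisions, DAC is drivable area compliance, EP is ego progress, TTC is time-to-collision, and C is comfort. The winning solution, Hydra-MDP-C, combines multiple encoders to boost performance.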
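As background on how sub-metrics of this kind can roll up into a single score, the sketch below assumes the PDM score definition used by the NAVSIM benchmark, which this post does not state: NC and DAC act as multiplicative penalties on a weighted average of EP, TTC, and C with weights 5, 5, and 2. Under that assumption the score is computed per scenario and then averaged, so plugging the column averages from Table 1 into the formula would not be expected to reproduce the Score column exactly.

```python
def pdm_score(nc, dac, ep, ttc, comfort):
    """Per-scenario score under the assumed PDM definition.
    All inputs are in [0, 1]; nc and dac gate the weighted average."""
    weighted = (5 * ep + 5 * ttc + 2 * comfort) / 12
    return nc * dac * weighted

def mean_pdm_score(scenarios):
    """Average the per-scenario scores over a list of sub-metric dicts."""
    scores = [pdm_score(**s) for s in scenarios]
    return sum(scores) / len(scores)

# Toy scenarios (made-up values, not benchmark results).
scenarios = [
    {"nc": 1.0, "dac": 1.0, "ep": 1.0, "ttc": 1.0, "comfort": 1.0},  # flawless drive
    {"nc": 0.0, "dac": 1.0, "ep": 1.0, "ttc": 1.0, "comfort": 1.0},  # at-fault collision
]
print(mean_pdm_score(scenarios))
```

The multiplicative structure means a single at-fault collision or drivable-area violation zeroes out a scenario regardless of progress or comfort, which helps explain why high NC and DAC values matter so much for the final score.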

Use effective model ensembling

Effective model ensembling was critical to our success.

We used techniques like Mixture of Encoders and Sub-score Ensembling to combine model strengths. This improved Hydra-MDP’s robustness and ensured that the final model could handle a diverse array of driving scenarios with high accuracy.

Ensembling must balance performance against computational cost, which is crucial for real-time applications.
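As a rough illustration of sub-score ensembling, the sketch below averages the per-metric scores that several models assign to the same set of candidate trajectories, then picks the candidate with the best combined score. It is a simplified assumption of how such a scheme could work: the linear weighting, the metric names, the weights, and the toy scores are illustrative and not taken from the released system.

```python
def ensemble_subscores(model_outputs, model_weights):
    """Weighted average of per-metric scores across models.
    model_outputs: list of dicts mapping metric name -> list of scores,
    one score per candidate trajectory (same candidates for every model)."""
    total = sum(model_weights)
    combined = {}
    for metric in model_outputs[0]:
        n = len(model_outputs[0][metric])
        combined[metric] = [
            sum(w * out[metric][i] for w, out in zip(model_weights, model_outputs)) / total
            for i in range(n)
        ]
    return combined

def select_trajectory(combined, metric_weights):
    """Return the index of the candidate with the highest weighted sum of sub-scores."""
    n = len(next(iter(combined.values())))

    def total_score(i):
        return sum(metric_weights[m] * combined[m][i] for m in combined)

    return max(range(n), key=total_score)

# Toy predictions from two hypothetical models over three candidates.
model_a = {"safety": [0.9, 0.4, 0.8], "progress": [0.2, 0.9, 0.7]}
model_b = {"safety": [0.7, 0.6, 0.9], "progress": [0.4, 0.8, 0.7]}

combined = ensemble_subscores([model_a, model_b], model_weights=[1.0, 1.0])
best = select_trajectory(combined, metric_weights={"safety": 1.0, "progress": 1.0})
print(best)
```

Ensembling at the sub-score level, rather than averaging final trajectories, keeps each metric interpretable and lets the per-model and per-metric weights be tuned independently.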

Conclusion

Developing Hydra-MDP was a journey of innovation, experimentation, and continuous learning. By embracing multimodal and multi-target planning, leveraging multi-target hydra-distillation, and refining through extensive simulations, we created a model that significantly outperforms existing state-of-the-art methods. These lessons contributed to Hydra-MDP’s success and provided valuable insights for future advancements in autonomous driving.

For more information, see Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation. For related works, see AV Applied Research.