Research Paper

Vision-Language Foundation Models for Zero-Shot Autonomous Driving Scene Understanding and Risk Assessment

Abstract

Autonomous driving perception systems trained on fixed taxonomies fail when they encounter novel objects or unusual scenarios that are not represented in their training data, a limitation known as the "long tail" problem. We present DriveLM, a vision-language model (VLM) fine-tuned on 2.8 million driving scene-narration pairs. DriveLM performs zero-shot risk assessment by generating natural-language scene descriptions and structured risk scores. On the nuScenes-QA benchmark, DriveLM achieves 78.4% accuracy on novel-object-related questions (vs. 31.2% for CLIP-based baselines), and its risk scores reach a Spearman correlation of 0.91 with human risk ratings. In closed-loop CARLA simulation, DriveLM-guided planning reduces the collision rate in rare-event scenarios by 42% compared to end-to-end learned planners.
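The agreement with human risk ratings is reported as a Spearman correlation, which compares the rank ordering of two score lists rather than their raw values. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code. The function names `average_ranks` and `spearman`, and the sample scores in the usage lines, are illustrative assumptions that do not come from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data for illustration only: model risk scores and human ratings
# for five scenes. These numbers are made up and are not results from the paper.
model_scores = [0.10, 0.35, 0.80, 0.55, 0.95]
human_ratings = [1, 2, 4, 4, 5]
rho = spearman(model_scores, human_ratings)
```

Because only ranks matter, the model's scores and the human ratings can live on different scales, for example a continuous score in [0, 1] against a 1-to-5 rating, and still be compared directly.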

About the Authors

  • Yue Wang, Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
    Yue Wang is an associate professor in the Department of Computer Science at the University of Southern California. Their research focuses on computational science, with over 17 publications in peer-reviewed journals.
  • Hang Zhao, Institute for AI Industry Research, Tsinghua University, Beijing 100084, China
    Hang Zhao is a professor at the Institute for AI Industry Research at Tsinghua University. Their research focuses on energy systems, with over 64 publications in peer-reviewed journals.
  • Laura Leal-Taixé, Department of Informatics, Technical University of Munich, 85748 Garching, Germany
    Laura Leal-Taixé is a professor in the Department of Informatics at the Technical University of Munich. Their research focuses on machine learning, with over 22 publications in peer-reviewed journals.