Vision-Language Foundation Models for Zero-Shot Autonomous Driving Scene Understanding and Risk Assessment

Yue Wang; Hang Zhao; Laura Leal-Taixé

doi:10.55001/faids.v1i3.78

Research Article

Published: 2026-05-25; DOI: https://doi.org/10.55001/faids.v1i3.78; Vol. 1, No. 3 (2026)

Abstract

Autonomous driving perception systems trained on fixed taxonomies fail when they encounter novel objects or unusual scenarios that are absent from their training data, the so-called "long tail" problem. We present DriveLM, a vision-language model (VLM) fine-tuned on 2.8 million driving scene-narration pairs that performs zero-shot risk assessment by generating natural language scene descriptions and structured risk scores. On the nuScenes-QA benchmark, DriveLM achieves 78.4% accuracy on novel-object-related questions (vs. 31.2% for CLIP-based baselines) and a Spearman correlation of 0.91 with human risk ratings. In closed-loop CARLA simulation, DriveLM-guided planning reduces the collision rate by 42% in rare-event scenarios compared to end-to-end learned planners.
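For readers unfamiliar with the agreement metric quoted above, the following is a minimal sketch of how a Spearman rank correlation between model risk scores and human risk ratings can be computed. It is not the authors' evaluation code: the function names (`average_ranks`, `spearman`) and the sample values are hypothetical and chosen only for illustration, and the sketch assumes ties are handled by average ranking.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical example data (not taken from the paper).
model_scores = [0.10, 0.35, 0.80, 0.55, 0.95]  # assumed model risk scores
human_ratings = [1, 2, 4, 5, 5]                # assumed human ratings, 1 to 5
rho = spearman(model_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")
```

Because the metric depends only on rank order, it rewards a model that orders scenes by risk the same way human raters do, regardless of the scale of the raw scores.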

Author Biographies

Yue Wang, Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA

Yue Wang is an associate professor at the Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA. Their research focuses on computational science, with over 17 publications in peer-reviewed journals.

Hang Zhao, Institute for AI Industry Research, Tsinghua University, Beijing 100084, China

Hang Zhao is a professor at the Institute for AI Industry Research, Tsinghua University, Beijing 100084, China. Their research focuses on energy systems, with over 64 publications in peer-reviewed journals.

Laura Leal-Taixé, Department of Informatics, Technical University of Munich, 85748 Garching, Germany

Laura Leal-Taixé is a professor at the Department of Informatics, Technical University of Munich, 85748 Garching, Germany. Their research focuses on machine learning, with over 22 publications in peer-reviewed journals.

Vision-Language Foundation Models for Zero-Shot Autonomous Driving Scene Understanding and Risk Assessment. (2026). Frontiers of Artificial Intelligence and Data Science (人工智能与数据科学前沿), 1(3). https://doi.org/10.55001/faids.v1i3.78
