BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving

Overview: BEV-LLM is a lightweight 1B-parameter model for 3D scene captioning in autonomous driving. It fuses LiDAR and multi-view images using BEVFusion and introduces a novel absolute positional encoding for view-specific scene descriptions. Despite its small size, it performs competitively on the nuCaption benchmark. The generated captions are meant to enhance transparency, safety, and human-AI interaction.
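
To give a rough feel for what an absolute positional encoding over a BEV grid can look like, here is a minimal sketch in plain Python. It is not the encoding used by BEV-LLM: it assumes the standard Transformer-style sinusoidal scheme, and the function names `sinusoidal_encoding` and `bev_positional_encoding`, the grid size, and the channel layout are all hypothetical choices made for illustration only.

```python
import math


def sinusoidal_encoding(position, dim):
    """Transformer-style sinusoidal encoding of a single scalar position.

    Assumption: this is the generic sin/cos scheme, not necessarily
    the encoding used in BEV-LLM.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding = []
    for i in range(0, dim, 2):
        # Frequencies decrease geometrically across channel pairs.
        freq = 1.0 / (10000 ** (i / dim))
        encoding.append(math.sin(position * freq))
        encoding.append(math.cos(position * freq))
    return encoding


def bev_positional_encoding(height, width, dim):
    """Hypothetical helper: one absolute encoding vector per BEV grid cell.

    The first half of the channels encodes the row index and the second
    half encodes the column index, so every cell gets a distinct vector.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return [
        [
            sinusoidal_encoding(row, half) + sinusoidal_encoding(col, half)
            for col in range(width)
        ]
        for row in range(height)
    ]


if __name__ == "__main__":
    # Illustrative sizes only; real BEV maps are far larger.
    grid = bev_positional_encoding(height=4, width=4, dim=8)
    print(len(grid), len(grid[0]), len(grid[0][0]))
```

Because the encoding depends only on a cell's absolute grid position, features from, say, the front-left region of the BEV map always receive the same positional signal, which is one plausible way a model could tell views apart.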

© 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Folders and files:
bevllm
config
datasets
figures
training
.gitignore
LICENSE
README.md
main.py
validate_model.py
