more details about image-text pair data

Hi,
Thanks for your great work. 

I have several questions about the data and method:
1. I am curious about the pipeline for generating the text list for the HD map. Could you share more details on how the text for the multi-view and BEV images is obtained? Does this information come from a pretrained multi-modal model, or from rules based on the HD map?
2. Is the visual encoder the same for the multi-view images and the BEV point cloud images? The encoders in the paper seem to be different, but in the inference code (https://github.com/LLVM-AD/MAPLM/blob/main/baseline/evaluation/inference.py#L72C29-L72C44) the image processors are the same.

more details about image-text pair data #8
