Thanks for your awesome work!
I wonder whether patch selection and calibration are also performed during inference.
In other words, is inference consistent with training, so that we need to apply LAPS in the vision encoder, split the sentence into words, and then perform patch-word alignment rather than image-sentence alignment?