Thank you for your work!
Now I would like to pass an image directly to GPT-4V together with a prompt like “This is an image, now I need to do the visual grounding task where you generate the coordinates [x,y,h,w] of a bounding box based on a query.”
But I found that this doesn't work very well; the model even seems to output the coordinates randomly. Do I need to preprocess the image first? How should I go about this? Thank you!