Dear author,
I'm confused that these lines texts are from solutions but not completions so the reward calculating is depends on the ground truth not the model predict
texts = [item['text'] for item in solutions]
https://github.com/bio-mlhui/MedGround-R1/blob/37b210dd7d6ee71179b3013d7f8042af1de0d5d3/open-r1-multimodal/src/open_r1/grpo_rec.py#L265
inputs = processor(
text=texts,
images=image,
return_tensors="pt",
padding=True
)
https://github.com/bio-mlhui/MedGround-R1/blob/37b210dd7d6ee71179b3013d7f8042af1de0d5d3/open-r1-multimodal/src/open_r1/grpo_rec.py#L268
Yours,
Jerry