First of all, thank you very much for your contribution. 💯 💯 💯
In MathVerse, you have shown that most MLLMs solve problems by relying on "Text Redundancy".
I noticed that InternVL scales up its vision encoder to narrow the gap between visual and textual information, and it also achieved Top 1 on MathVista.
Could you provide the benchmark results of InternVL on the MathVerse dataset? I think they would add useful evidence for your hypothesis.
Reference papers:
https://arxiv.org/pdf/2312.14238.pdf