Hello. Using your code, the model I trained scores much lower than the results reported in the paper: GPT-2 reaches only 37% accuracy on GSM8K when trained on GSM8K-AUG alone and then evaluated. Are there any training tricks I am missing, or is it necessary to train on all of the datasets before evaluating on GSM8K to reach the reported 43.7% accuracy?
Additionally, since the open-source GSM8K-AUG-NL release only includes a training set, how should I perform evaluation?