Description
Thank you for providing the Python script for obtaining model responses. However, evaluating model performance also requires computing several metrics, such as:
- IF (Instruction Following) - The degree to which instructions are followed
- ED (Error Diagnosis) - The ability to diagnose errors
- SA (Solution Accuracy) - The accuracy of the solutions provided
- PQ (Problem Quality) - The quality of the problems
- ACC (Accuracy)
Are there any plans to open-source the code for calculating these metrics? If so, could you provide an estimated timeline? These metrics are crucial for further analysis and research.
Expectation
We hope to receive more information about open-sourcing this code, or guidance on how we might implement these calculations ourselves.
Thank you for your hard work and contributions!