Evaluation Metrics Code #1

#### Description

Thank you for providing the Python script for obtaining model responses. However, evaluating model performance also requires computing several metrics, such as:

1. **IF (Instruction Following)** - The degree to which instructions are followed
2. **ED (Error Diagnosis)** - The ability to diagnose errors
3. **SA (Solution Accuracy)** - The accuracy of the solutions provided
4. **PQ (Problem Quality)** - The quality of the problems
5. **ACC**

We would like to know whether there are plans to open-source the code for calculating these metrics and, if so, what the estimated timeline is. These metrics are crucial for further analysis and research.

#### Expectation

We hope to receive more information regarding the open sourcing of this code or guidance on how we might implement these calculations ourselves.
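
To make the request concrete, here is a minimal sketch of the kind of aggregation we have in mind. It rests entirely on our own assumptions: that each response has already been assigned a per-dimension score between 0 and 1 (for example by a judge model or annotator), and that ACC means exact-match accuracy against a reference answer. The record layout, the field names (`scores`, `prediction`, `reference`), and the helper `aggregate_scores` are hypothetical and are not taken from the released script.

```python
from statistics import mean

# Hypothetical metric keys; the 0-to-1 score scale is an assumption.
METRICS = ("IF", "ED", "SA", "PQ")


def aggregate_scores(records):
    """Average each per-sample score; ACC is the share of exact matches (assumed definition)."""
    if not records:
        raise ValueError("no records to aggregate")
    # Mean of each judged dimension across all samples.
    summary = {name: mean(r["scores"][name] for r in records) for name in METRICS}
    # Fraction of samples whose prediction equals the reference answer.
    summary["ACC"] = sum(r["prediction"] == r["reference"] for r in records) / len(records)
    return summary


# Made-up illustrative records, not real evaluation data.
records = [
    {"scores": {"IF": 1.0, "ED": 0.5, "SA": 1.0, "PQ": 0.5}, "prediction": "B", "reference": "B"},
    {"scores": {"IF": 0.5, "ED": 0.5, "SA": 0.0, "PQ": 1.0}, "prediction": "A", "reference": "C"},
]
print(aggregate_scores(records))
```

If the official definitions differ (for instance, a different score scale or a different notion of ACC), only the per-metric expressions would need to change.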

Thank you for your hard work and contributions!
