Evaluation Metrics Code #1

@hccngu

Description

Thank you for providing the Python script for obtaining model responses. Evaluating model performance, however, usually also requires computing several metrics, such as:

  1. IF (Instruction Following) - The degree to which instructions are followed
  2. ED (Error Diagnosis) - The ability to diagnose errors
  3. SA (Solution Accuracy) - The accuracy of the solutions provided
  4. PQ (Problem Quality) - The quality of the problems
  5. ACC

Are there any plans to open-source the code for calculating these metrics? If so, could you provide an estimated timeline? These metrics are crucial for further analysis and research.

Expectation

We hope to receive more information regarding the open sourcing of this code or guidance on how we might implement these calculations ourselves.
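To make the request concrete, here is a minimal sketch of the kind of aggregation step we have in mind. It is only an illustration under our own assumptions: that each evaluated sample already carries a numeric score per metric (how those scores are produced is exactly what we are asking about), and that the final number for each metric is a plain average. The function name `aggregate_scores`, the record layout, and the example values are hypothetical placeholders, not taken from the repository.

```python
from collections import defaultdict
from statistics import mean

# Metric abbreviations as listed in this issue.
METRICS = ("IF", "ED", "SA", "PQ", "ACC")


def aggregate_scores(records):
    """Average each metric over all records that report it.

    `records` is a list of dicts mapping a metric name to a numeric
    per-sample score. This layout is an assumption for illustration.
    """
    buckets = defaultdict(list)
    for record in records:
        for name in METRICS:
            if name in record:
                buckets[name].append(float(record[name]))
    # Metrics that no record reports are simply absent from the result.
    return {name: mean(values) for name, values in buckets.items()}


# Made-up placeholder scores, only to show the expected input shape.
records = [
    {"IF": 1.0, "ED": 0.5, "SA": 1.0, "PQ": 0.5, "ACC": 1},
    {"IF": 0.0, "ED": 0.5, "SA": 0.0, "PQ": 1.0, "ACC": 0},
]

print(aggregate_scores(records))
```

If the official metrics use a different scoring scale or a weighted aggregation, we would adapt this accordingly.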

Thank you for your hard work and contributions!
