chore: add llm evaluation tests #234
Thanks for the pull request, @Henrrypg!
/integration-test
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #234 +/- ##
=======================================
Coverage 95.32% 95.32%
=======================================
Files 69 69
Lines 8086 8086
Branches 432 432
=======================================
Hits 7708 7708
Misses 283 283
Partials 95 95
felipemontoya left a comment
This is a good start.
I left a bunch of comments inline. The most important is probably about what the question (or follow-up message) is for every test.
Other smaller concerns:
- Is there a lightweight library that could replace the ad hoc judge class? I like the class, but we would need to maintain it ourselves, as opposed to just keeping a library dependency up to date.
- There is no retry on transient errors. Anthropic 529/overloaded responses and rate limits will make live tests flaky. We could perhaps add a retry after 5 seconds, but this can be deferred until it proves to be a problem.
- Sub-field validation gap. ask() verifies that each question name is present, but not that its schema fields are. The test then does instruction_verdict['missed_requirements'], which raises a raw KeyError if a provider ignores strict mode. You're relying entirely on upstream strictness to keep that from blowing up with a confusing traceback.
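For illustration, the last two concerns could be addressed roughly as follows. This is only a sketch under assumptions, not the PR's code: `TransientJudgeError`, `call_with_retry`, and `validate_verdict` are hypothetical names, and a real implementation would catch the provider SDK's own overloaded and rate-limit exceptions instead of the stand-in class used here.

```python
import time


class TransientJudgeError(Exception):
    """Hypothetical stand-in for a provider's overloaded / rate-limit error."""


def call_with_retry(fn, retries=1, delay_seconds=5.0, sleep=time.sleep):
    """Call fn(); on a transient error, wait and retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TransientJudgeError:
            if attempt == retries:
                # Out of attempts: surface the original error to the test.
                raise
            sleep(delay_seconds)


def validate_verdict(name, verdict, required_fields):
    """Fail with a readable message if the judge omitted an expected sub-field."""
    if not isinstance(verdict, dict):
        raise AssertionError(
            f"Judge verdict for {name!r} is not an object: {verdict!r}"
        )
    missing = [field for field in required_fields if field not in verdict]
    if missing:
        raise AssertionError(
            f"Judge verdict for {name!r} is missing fields: {missing}"
        )
    return verdict
```

With something like this in place, a test would call `validate_verdict(...)` before reading `instruction_verdict['missed_requirements']`, so a provider that ignores strict mode produces an assertion naming the missing field rather than a bare KeyError.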
TONE = JudgeQuestion(
    name="tone",
    prompt=(
        "Is the RESPONSE's tone appropriate for an educational context (not "
Interesting, but it is difficult to define what an appropriate tone is, especially if we are not setting the tone in the instructions passed to the LLM.
This PR adds a set of tests and runs an evaluation of the response by a smarter model.