chore: add llm evaluation tests by Henrrypg · Pull Request #234 · openedx/openedx-ai-extensions

Henrrypg · 2026-06-18T19:41:10Z

This PR adds a set of tests and runs an evaluation of the response by a smarter model.

openedx-webhooks · 2026-06-18T19:41:19Z

Thanks for the pull request, @Henrrypg!

This repository is currently maintained by @felipemontoya.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
- This process (including the steps you'll need to take) is documented here.
If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

Dependencies

This PR must be merged before / after / at the same time as ...
Blockers

This PR is waiting for OEP-1234 to be accepted.
Timeline information

This PR must be merged by XX date because ...
Partner information

This is for a course on edx.org.
Supporting documentation
Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Details

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

The size and impact of the changes that it introduces
The need for product review
Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

Henrrypg · 2026-06-18T19:43:52Z

/integration-test

codecov · 2026-06-18T19:44:12Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.32%. Comparing base (98f555e) to head (9bb9e6c).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #234   +/-   ##
=======================================
  Coverage   95.32%   95.32%           
=======================================
  Files          69       69           
  Lines        8086     8086           
  Branches      432      432           
=======================================
  Hits         7708     7708           
  Misses        283      283           
  Partials       95       95

Flag	Coverage Δ
unittests	`95.32% <ø> (ø)`


felipemontoya

This is a good start.

I left a bunch of comments inline. Probably the most important is about what the question (or follow up message) is for every test.

Other smaller concerns:

- Is there a lightweight library that could replace the ad hoc judge class? I like the class, but we would need to maintain it ourselves, as opposed to just keeping a library up to date.
- There is no retry on transient errors. Anthropic 529/overloaded responses and rate limits will make the live tests flaky. We could perhaps add a retry after 5 seconds, but this can always be deferred until it proves to be a problem.
- Sub-field validation gap. ask() verifies that each question name is present but not that its schema fields are. The test then does instruction_verdict['missed_requirements'], which raises a raw KeyError if a provider ignores strict mode. You're relying entirely on upstream strictness for that not to blow up with a confusing traceback.
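
The last two concerns could be handled roughly as follows. This is only a sketch under assumptions, not the PR's code: `TransientJudgeError`, `call_with_retry`, `missing_fields` and the `instruction` question name are hypothetical, and the real provider exception for overloaded or rate-limited requests would need to be mapped onto the stand-in error.

```python
import time


class TransientJudgeError(Exception):
    """Hypothetical stand-in for a provider's overloaded or rate-limit error."""


def call_with_retry(call, retries=1, delay_seconds=5.0, sleep=time.sleep):
    """Run call(); on a transient error, wait and try again up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return call()
        except TransientJudgeError:
            if attempt == retries:
                raise  # Out of attempts: surface the original error.
            sleep(delay_seconds)


def missing_fields(verdicts, required_fields):
    """Return 'question' or 'question.field' paths absent from the parsed verdicts."""
    missing = []
    for question, fields in required_fields.items():
        verdict = verdicts.get(question)
        if not isinstance(verdict, dict):
            missing.append(question)
            continue
        missing.extend(
            f"{question}.{field}" for field in fields if field not in verdict
        )
    return missing
```

A test could then assert `missing_fields(verdicts, {"instruction": ["missed_requirements"]}) == []` before indexing into the verdict, so a provider that ignores strict mode produces a readable failure message instead of a KeyError.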

felipemontoya · 2026-06-19T22:50:51Z

+TONE = JudgeQuestion(
+    name="tone",
+    prompt=(
+        "Is the RESPONSE's tone appropriate for an educational context (not "


Interesting, but it is difficult to define what an appropriate tone is, especially if we are not setting the tone in the instructions passed to the LLM.
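
One possible way to make the tone check less subjective is to state the expected tone once and reuse it both in the instructions given to the model under test and in the judge question. The sketch below is illustrative only: the `JudgeQuestion` dataclass is a stand-in that mirrors just the `name` and `prompt` arguments visible in the diff, and `TONE_RUBRIC` and `SYSTEM_PROMPT` are assumed names and wording that do not come from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeQuestion:
    """Hypothetical stand-in; only `name` and `prompt` appear in the diff."""

    name: str
    prompt: str


# Assumed wording: the PR does not define the expected tone anywhere.
TONE_RUBRIC = "supportive, respectful and free of sarcasm, addressed to a learner"

# The instructions passed to the model under test state the tone explicitly.
SYSTEM_PROMPT = f"You are a course assistant. Keep your tone {TONE_RUBRIC}."

# The judge grades against the same rubric instead of its own idea of "appropriate".
TONE = JudgeQuestion(
    name="tone",
    prompt=f"Is the RESPONSE's tone {TONE_RUBRIC}? Quote any sentence that is not.",
)
```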

chore: add llm evaluation tests

9bb9e6c

openedx-webhooks added the open-source-contribution (PR author is not from Axim or 2U) and core contributor (PR author is a Core Contributor, who may or may not have write access to this repo) labels Jun 18, 2026

openedx-webhooks added this to Contributions Jun 18, 2026

github-project-automation Bot moved this to Needs Triage in Contributions Jun 18, 2026

Henrrypg requested a review from felipemontoya June 18, 2026 19:48

felipemontoya requested changes Jun 19, 2026


mphilbrick211 moved this from Needs Triage to In Eng Review in Contributions Jun 23, 2026

Henrrypg wants to merge 1 commit into openedx:main from eduNEXT:hpg/judge


4 participants
