chore: add llm evaluation tests #234

Open
Henrrypg wants to merge 1 commit into openedx:main from eduNEXT:hpg/judge

Conversation

@Henrrypg

Contributor

This PR adds a set of tests and runs an evaluation of the response by a smarter model.

@openedx-webhooks added the open-source-contribution (PR author is not from Axim or 2U) and core contributor (PR author is a Core Contributor, who may or may not have write access to this repo) labels Jun 18, 2026
@openedx-webhooks

Thanks for the pull request, @Henrrypg!

This repository is currently maintained by @felipemontoya.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

@Henrrypg

Contributor Author

/integration-test

@codecov

codecov Bot commented Jun 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.32%. Comparing base (98f555e) to head (9bb9e6c).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #234   +/-   ##
=======================================
  Coverage   95.32%   95.32%           
=======================================
  Files          69       69           
  Lines        8086     8086           
  Branches      432      432           
=======================================
  Hits         7708     7708           
  Misses        283      283           
  Partials       95       95           
Flag Coverage Δ
unittests 95.32% <ø> (ø)

@Henrrypg requested a review from felipemontoya June 18, 2026 19:48

@felipemontoya left a comment

Member

This is a good start.

I left a bunch of comments inline. The most important one is probably about what the question (or follow-up message) is for each test.

Other smaller concerns:

  1. Is there a lightweight library that could replace the ad hoc judge class? I like the class, but we would need to maintain it ourselves, as opposed to just keeping a library up to date.
  2. There is no retry on transient errors. Anthropic 529/overloaded and rate limits will flake live tests. We could perhaps add a retry after 5 seconds, but this can always be deferred until it proves to be a problem.
  3. Sub-field validation gap. ask() verifies each question name is present but not that its schema fields are. The test then does instruction_verdict['missed_requirements'], which raises a raw KeyError if a provider ignores strict mode. You're relying entirely on upstream strictness for that not to blow up with a confusing traceback.
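
To make points 2 and 3 concrete, here is a minimal sketch of what a single delayed retry and an explicit sub-field check could look like. It is not the code in this PR: `TransientError`, `JudgeSchemaError`, `call_with_retry`, and `validate_verdict` are hypothetical names, and `TransientError` stands in for whatever overloaded or rate-limit exception the provider client actually raises.

```python
import time
from typing import Any, Callable, Iterable, Mapping


class TransientError(Exception):
    """Hypothetical stand-in for a provider's overloaded / rate-limit error."""


class JudgeSchemaError(ValueError):
    """Raised when a judge verdict is missing fields the test expects."""


def call_with_retry(
    fn: Callable[[], Any],
    retries: int = 1,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fn, retrying on TransientError up to `retries` extra times."""
    attempt = 0
    while True:
        try:
            return fn()
        except TransientError:
            if attempt >= retries:
                raise  # out of retries: surface the original error
            attempt += 1
            sleep(delay_seconds)


def validate_verdict(
    name: str, verdict: Mapping[str, Any], required_fields: Iterable[str]
) -> None:
    """Fail with a readable message instead of a raw KeyError later on."""
    missing = [field for field in required_fields if field not in verdict]
    if missing:
        raise JudgeSchemaError(
            f"Judge verdict for question {name!r} is missing fields: {missing}"
        )


# Example usage with a fake judge call (hypothetical data).
verdict = call_with_retry(lambda: {"missed_requirements": []})
validate_verdict("instruction", verdict, ["missed_requirements"])
```

The `sleep` parameter is injectable only so that tests can skip the real 5-second wait. The validation step would sit right after the judge response is parsed, so a provider that ignores strict mode fails with a message naming the question and the missing fields.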

TONE = JudgeQuestion(
name="tone",
prompt=(
"Is the RESPONSE's tone appropriate for an educational context (not "

Member

Interesting, but it is difficult to define what an appropriate tone is, especially if we are not setting the tone in the instructions passed to the LLM.

Comment thread backend/tests/integration/judge.py
Comment thread backend/tests/integration/test_semantic_quality.py
Comment thread backend/tests/integration/test_semantic_quality.py
Comment thread backend/tests/integration/test_semantic_quality.py
Comment thread backend/tests/integration/test_semantic_quality.py
Comment thread backend/tests/integration/judge.py
@mphilbrick211 moved this from Needs Triage to In Eng Review in Contributions Jun 23, 2026