feat(python-sdk): contract test scaffold and conventionality contract test #39

Merged
czi-fsisenda merged 10 commits into fsisenda/sdk_python_basic_conventionality from fsisenda/sdk_python_contract_tests
May 14, 2026
@czi-fsisenda
Contributor

Summary

Jira:

Contract tests for evaluators in the Python SDK

  • introduces capture.py, which captures LLM inputs, outputs, and info from eval notebooks
  • adds capture to the conventionality notebook
  • generates the contract artifact for conventionality from data captured in the notebook
  • introduces make commands for building and validating contract artifacts
  • adds a contract test for conventionality
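
As a rough illustration of the capture step listed above, the sketch below records LLM calls in order and prints them as a TOML block. It is not the PR's capture.py: the names `CapturedCall`, `CaptureLog`, `record`, and `to_toml`, and the `[[prompt_steps]]` table layout, are assumptions made for this example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CapturedCall:
    """One LLM call observed in a notebook (hypothetical shape)."""
    step: str
    prompt: str
    temperature: float
    llm_response: str


@dataclass
class CaptureLog:
    """Collects calls in order so they can be dumped as a TOML block."""
    calls: list[CapturedCall] = field(default_factory=list)

    def record(self, step: str, prompt: str, temperature: float, llm_response: str) -> None:
        self.calls.append(CapturedCall(step, prompt, temperature, llm_response))

    def to_toml(self) -> str:
        # json.dumps produces a double-quoted, escaped string, which is also
        # a valid TOML basic string for ordinary text (an assumption that
        # holds for typical prompts, not for every possible character).
        def quote(value: str) -> str:
            return json.dumps(value, ensure_ascii=False)

        lines: list[str] = []
        for call in self.calls:
            lines.append("[[prompt_steps]]")  # hypothetical table name
            lines.append(f"step = {quote(call.step)}")
            lines.append(f"prompt = {quote(call.prompt)}")
            lines.append(f"temperature = {float(call.temperature)!r}")
            lines.append(f"llm_response = {quote(call.llm_response)}")
            lines.append("")
        return "\n".join(lines)
```

In a notebook, one would call `record(...)` next to each LLM invocation and print `to_toml()` at the end so the block can be pasted into a contracts.toml file.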

Test Plan

  • Wrote automated tests
  • Manually tested my changes, and here are the details:

Copilot AI left a comment

Pull request overview

Adds contract-test infrastructure to the Python SDK and seeds it with an initial Conventionality evaluator contract artifact + test, ensuring evaluator behavior matches the reference notebook and that bundled artifacts stay synced with canonical settings.

Changes:

  • Introduces contracts.toml artifacts for the Conventionality evaluator (canonical under sdks/settings/ plus bundled copy under the Python package).
  • Adds a contract-test loader + harness and a Conventionality contract test that asserts prompt fidelity and result mapping.
  • Adds Makefile targets and a sync-guard test to keep bundled contract artifacts byte-identical to the canonical source.
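
The sync guard mentioned in the last bullet can be pictured as a byte-for-byte comparison between the canonical and bundled trees. The following is a minimal sketch under that assumption; the function name `find_out_of_sync` and its parameters are hypothetical and do not come from the PR.

```python
from pathlib import Path


def find_out_of_sync(
    canonical_root: Path, bundled_root: Path, name: str = "contracts.toml"
) -> list[str]:
    """Return relative paths whose bundled copy is missing or not byte-identical."""
    problems: list[str] = []
    for canonical in sorted(canonical_root.rglob(name)):
        rel = canonical.relative_to(canonical_root)
        bundled = bundled_root / rel
        # Compare raw bytes so line-ending or encoding drift is also caught.
        if not bundled.is_file() or bundled.read_bytes() != canonical.read_bytes():
            problems.append(rel.as_posix())
    return problems
```

A test could then assert that the returned list is empty, with the list itself serving as the failure message.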

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| sdks/settings/conventionality/contracts.toml | Adds canonical Conventionality contract artifact captured from the notebook. |
| sdks/python/src/learning_commons_evaluators/settings/conventionality/contracts.toml | Adds bundled package copy of the Conventionality contract artifact for installed-package testing. |
| sdks/python/tests/settings/test_load_settings.py | Adds bundled-artifact presence check and a canonical-vs-bundled sync guard. |
| sdks/python/tests/contract_tests/loader.py | Adds TOML-backed contract case model + loader resolving via the package settings root. |
| sdks/python/tests/contract_tests/harness.py | Adds provider-mocking harness that captures prompt requests and asserts contract fidelity. |
| sdks/python/tests/contract_tests/conventionality.py | Adds Conventionality case loader and notebook→SDK expected-result mapper. |
| sdks/python/tests/contract_tests/test_conventionality.py | Adds the initial Conventionality contract test for the “turnip” case. |
| sdks/python/tests/contract_tests/__init__.py | Defines the contract-tests package and documents the contract-test approach. |
| sdks/python/Makefile | Adds build/check-build and contract-test/sync targets for artifact maintenance. |
| evals/conventionality_evaluator.ipynb | Updates the notebook to capture LLM calls and print a contracts.toml block. |
| evals/capture.py | Adds notebook utilities for capturing prompt/response snapshots and emitting TOML artifacts. |

Comment thread sdks/python/scripts/capture.py Outdated
Comment thread sdks/python/scripts/capture.py
Comment thread sdks/python/scripts/capture.py
Comment thread sdks/python/tests/contract_tests/test_conventionality.py Outdated
Comment thread sdks/python/tests/contract_tests/loader.py Outdated
Comment thread sdks/python/tests/contract_tests/conventionality.py Outdated
Comment thread sdks/python/Makefile
    temperature: float
    llm_response: str

    def is_populated(self) -> bool:
Contributor

P1 - Currently unused. Is this used in a downstream PR and in tests?

Contributor Author

Updated. It is now used to check whether the test artifact still has placeholders.
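
For readers without the diff at hand, a placeholder check of this kind might look like the sketch below. Only `temperature`, `llm_response`, and the `is_populated` signature appear in the thread; the class name `PromptStep`, the `prompt` field, and the `PLACEHOLDER` marker are assumptions for illustration, and the PR's actual logic may differ.

```python
from dataclasses import dataclass

# Hypothetical marker; the real artifact may use a different placeholder.
PLACEHOLDER = "<TODO>"


@dataclass(frozen=True)
class PromptStep:
    prompt: str  # assumed field, not shown in the thread
    temperature: float
    llm_response: str

    def is_populated(self) -> bool:
        """True when no text field is blank or still carries the placeholder."""
        return all(
            bool(value.strip()) and PLACEHOLDER not in value
            for value in (self.prompt, self.llm_response)
        )
```

A contract test could call `is_populated()` on every loaded step and fail early if an artifact was committed before the notebook output was pasted in.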

Comment thread sdks/python/scripts/capture.py
Comment thread evals/conventionality_evaluator.ipynb
Contributor

@adnanrhussain adnanrhussain left a comment

lgtm


    def __exit__(self, *args: Any) -> None:
        if self._patch is not None:
            self._patch.stop()
Contributor

P1 - In case the test author forgets to call assert_prompt_step, perhaps __exit__ could compare the prompt_steps and _captured counts as an exit validation.
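
One possible shape for that exit validation is sketched here. It is a toy stand-in, not the PR's harness: provider patching is omitted, and the `capture` method and `_asserted` counter are assumptions added so the example is self-contained. Only `prompt_steps`, `_captured`, and `assert_prompt_step` are names taken from the comment.

```python
from typing import Any


class ContractHarness:
    """Toy harness that fails on exit if a prompt step went unchecked."""

    def __init__(self, prompt_steps: list[str]) -> None:
        self.prompt_steps = prompt_steps
        self._captured: list[str] = []
        self._asserted = 0  # hypothetical counter of assert_prompt_step calls

    def __enter__(self) -> "ContractHarness":
        return self

    def capture(self, prompt: str) -> None:
        # Stand-in for what the mocked provider would record.
        self._captured.append(prompt)

    def assert_prompt_step(self, index: int) -> None:
        assert self._captured[index] == self.prompt_steps[index]
        self._asserted += 1

    def __exit__(self, exc_type: Any, exc: Any, tb: Any) -> None:
        # Validate only on a clean exit so an earlier failure is not masked.
        if exc_type is not None:
            return
        expected = len(self.prompt_steps)
        if len(self._captured) != expected or self._asserted != expected:
            raise AssertionError(
                f"expected {expected} prompt steps, captured "
                f"{len(self._captured)}, asserted {self._asserted}"
            )
```

Counting assertions in addition to captures goes slightly beyond the suggestion, since matching capture counts alone would not reveal a skipped assert.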

Contributor Author

Great idea!

Contributor Author

These tests can get better. In an earlier iteration, I missed an assert. It's a decent start that we can build on.

* feat: vocabulary evaluator

* chore: update vocabulary settings to use  instead of  for prompt settings

* chore: fix capture and contract tests

* chore: vocabulary settings are required

* feat: eval instance settings overrides

* chore: addressing PR comments

* chore: restore vocabulary notebook

* feat: base eval support for json normalizers

* chore: cleaner implementation of vocab

* chore: same step name as typescript sdk + edge case unit test
@czi-fsisenda czi-fsisenda merged commit f34050e into fsisenda/sdk_python_basic_conventionality May 14, 2026
4 checks passed
@czi-fsisenda czi-fsisenda deleted the fsisenda/sdk_python_contract_tests branch May 14, 2026 09:19
czi-fsisenda added a commit that referenced this pull request May 14, 2026
* feat(python-sdk): python SDK scaffold
* feat(python-sdk): conventionality evaluator (#38)
* feat(python-sdk): contract test scaffold and conventionality contract test (#39)
* feat(python-sdk): vocabulary evaluator (#36)
3 participants