feat(python-sdk): vocabulary evaluator#36

Merged
czi-fsisenda merged 12 commits into fsisenda/sdk_python_contract_tests from fsisenda/sdk_python_vocabulary
May 14, 2026
Conversation

@czi-fsisenda
Contributor

Summary

Jira:

Implements the vocabulary evaluator in the Python SDK:

  • adds a notebook cell that captures LLM inputs and outputs for vocabulary
  • creates a settings file for vocabulary
  • generates the vocabulary contract artifact from data captured in the notebook
  • generates Python settings from the settings file
  • implements the vocabulary evaluator
  • adds unit tests
  • adds contract tests

Test Plan

  • Wrote automated tests
  • Manually tested my changes, and here are the details:


Copilot AI left a comment

Pull request overview

Adds a new Vocabulary Complexity evaluator to the Python SDK, including shared TOML settings, a generated settings module, an evaluator implementation with a two-step LLM flow (background knowledge, then vocabulary complexity), and accompanying unit and contract tests (with notebook capture support).

Changes:

  • Introduces vocabulary evaluator settings (settings.toml) and contract snapshots (contracts.toml) in both shared sdks/settings/ and bundled Python package settings.
  • Implements VocabularyEvaluator + Pydantic schemas and exports them from the package API.
  • Adds unit tests and contract-test harness integration for vocabulary, and updates settings sync/tests to include the new evaluator.
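As a rough illustration of the two-step flow described above, the sketch below chains a background-knowledge step into a vocabulary-complexity step and picks the model for the second step by grade. It is not the SDK's implementation: `StepSettings`, `select_vocab_settings`, `evaluate_vocabulary`, the `"google"` provider label, and the grade set are all hypothetical. Only the two model names come from the diff in this PR, and which grades take which path is not stated here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StepSettings:
    """Hypothetical stand-in for the SDK's per-step prompt settings."""
    provider: str
    model: str
    temperature: float = 0.0


# Model names appear in this PR's diff. The grade set is a placeholder:
# the PR does not say which grades use which path.
HYPOTHETICAL_SPECIAL_GRADES = frozenset({3, 4})
SPECIAL_GRADES_SETTINGS = StepSettings("google", "gemini-2.5-pro")
OTHER_GRADES_SETTINGS = StepSettings("openai", "gpt-4.1")

# An LLM call takes step settings and a prompt, and returns the raw reply.
LlmCall = Callable[[StepSettings, str], str]


def select_vocab_settings(grade: int) -> StepSettings:
    """Pick the settings for the vocabulary step based on grade."""
    if grade in HYPOTHETICAL_SPECIAL_GRADES:
        return SPECIAL_GRADES_SETTINGS
    return OTHER_GRADES_SETTINGS


def evaluate_vocabulary(text: str, grade: int, call_llm: LlmCall) -> dict:
    """Run the two steps in order, feeding step 1's output into step 2."""
    # Step 1: background knowledge (settings choice here is an assumption).
    background = call_llm(
        OTHER_GRADES_SETTINGS,
        f"List background knowledge a grade {grade} reader has:\n{text}",
    )
    # Step 2: vocabulary complexity, using the grade-specific settings.
    settings = select_vocab_settings(grade)
    answer = call_llm(
        settings,
        f"Background: {background}\nRate the vocabulary complexity:\n{text}",
    )
    return {"model": settings.model, "background": background, "answer": answer}
```

Passing the LLM call in as a parameter keeps the sketch testable with a fake callable, which is roughly what a unit test with mocked providers would do.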

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| sdks/settings/vocabulary/settings.toml | New shared TOML spec for vocabulary evaluator metadata, prompts, and step model settings. |
| sdks/settings/vocabulary/contracts.toml | New shared contract snapshots for two vocabulary cases (grade 3 and grade 7). |
| sdks/python/tests/settings/test_load_settings.py | Ensures bundled vocabulary contracts exist and match canonical settings. |
| sdks/python/tests/evaluators/test_vocabulary.py | Unit tests for evaluator logic, mappings, and input validation behavior. |
| sdks/python/tests/contract_tests/vocabulary.py | Case loaders + mapping helpers for vocabulary contract tests. |
| sdks/python/tests/contract_tests/test_vocabulary.py | Contract tests asserting prompt fidelity + result fidelity vs notebook captures. |
| sdks/python/src/learning_commons_evaluators/settings/vocabulary/contracts.toml | Bundled (package) copy of vocabulary contract snapshots. |
| sdks/python/src/learning_commons_evaluators/settings/_generated_vocabulary_settings.py | Auto-generated Python settings module from the TOML spec. |
| sdks/python/src/learning_commons_evaluators/schemas/vocabulary.py | New settings + output schemas for the vocabulary evaluator. |
| sdks/python/src/learning_commons_evaluators/evaluators/vocabulary.py | New evaluator implementation with grade-specific prompt/model paths. |
| sdks/python/src/learning_commons_evaluators/evaluators/__init__.py | Exports vocabulary evaluator types from the evaluators package. |
| sdks/python/src/learning_commons_evaluators/__init__.py | Exposes vocabulary evaluator and schemas at the top-level SDK API. |
| evals/vocabulary_evaluator.ipynb | Adds capture utilities usage to generate contract TOML from notebook runs. |

Comment thread sdks/settings/vocabulary/settings.toml
Comment thread sdks/settings/vocabulary/settings.toml
Comment thread sdks/settings/vocabulary/contracts.toml Outdated
Comment thread sdks/python/src/learning_commons_evaluators/settings/vocabulary/contracts.toml Outdated
Comment thread sdks/python/tests/evaluators/test_vocabulary.py Outdated
Comment thread sdks/python/src/learning_commons_evaluators/evaluators/vocabulary.py Outdated
@czi-fsisenda czi-fsisenda changed the title feat: vocabulary evaluator feat(python-sdk): vocabulary evaluator Apr 30, 2026
Contributor

@adnanrhussain adnanrhussain left a comment

1 P0 and other low pri feedback. lgtm
pre-approving to unblock

model='gemini-2.5-pro',
temperature=0.0,
),
prompt_settings_step_vocab_other_grades=PromptSettings(provider_type=LlmProvider.OPENAI, model='gpt-4.1', temperature=0.0),
Contributor

P0 - Can you please double-check this. I think we need to pin it to a specific snapshot

Contributor Author

@adnanrhussain This comes from the notebook: https://github.com/learning-commons-org/evaluators/blob/df5def4fc6a48958245c65ec5e296232c5741c29/evals/vocabulary_evaluator.ipynb
It ran successfully both in the notebook and with this implementation in the SDK.
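One way to make the pinning concern checkable is a small guard over the configured model names. The sketch below assumes a pinned snapshot ends in a `YYYY-MM-DD` date, as in `gpt-4.1-2025-04-14`. That convention does not hold for every provider, so treat the regex and the helper names (`is_pinned_snapshot`, `unpinned_models`) as hypothetical, not as part of the SDK.

```python
import re

# Assumption: a pinned snapshot name ends in a date such as "-2025-04-14".
# Providers that version models differently would need their own rule.
_DATED_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def is_pinned_snapshot(model: str) -> bool:
    """Return True if the model name carries a trailing date suffix."""
    return bool(_DATED_SUFFIX.search(model))


def unpinned_models(models: list[str]) -> list[str]:
    """Return the model names that look like floating aliases."""
    return [m for m in models if not is_pinned_snapshot(m)]


print(unpinned_models(["gpt-4.1", "gpt-4.1-2025-04-14"]))
```

A check like this could run in a settings test so that a floating alias is flagged at review time instead of surfacing later as drifting contract results.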

).partial(format_instructions=parser.get_format_instructions())

output = self.execute_prompt_chain_step(
    step_name="vocab_complexity",
Contributor

P1 - In TS SDK, I think we use complexity_evaluation as the step_name

value = result["answer"]
if isinstance(value, str) and value.isdigit():
    value = int(value)
result["complexity_score"] = mapping.get(value, str(value))
Contributor

P1 - Unknown ints may pass through and fail silently

Contributor Author

They wouldn't fail silently. TextComplexityAnswer has a strict definition. We use TextComplexityAnswer.from_score to map the final result. An unknown int would pass through the first mapper and get caught in from_score or possibly earlier in the JsonParser validate. I'll add a test.

class TextComplexityAnswer(Enum):
    """
    Allowed text complexity answers. Each member's value is an EvaluationAnswer;
    use .label and .score for the human label and score string.
    """

    SLIGHTLY_COMPLEX = EvaluationAnswer(score="slightly_complex", label="Slightly complex")
    MODERATELY_COMPLEX = EvaluationAnswer(score="moderately_complex", label="Moderately complex")
    VERY_COMPLEX = EvaluationAnswer(score="very_complex", label="Very complex")
    EXCEEDINGLY_COMPLEX = EvaluationAnswer(score="exceedingly_complex", label="Exceedingly complex")

    @property
    def score(self) -> str:
        return self.value.score

    @property
    def label(self) -> str:
        return self.value.label

    @classmethod
    def from_score(cls, score: str) -> "TextComplexityAnswer":
        for member in cls:
            if member.value.score == score:
                return member
        raise ValueError(f"Unknown text complexity score: {score!r}")
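To make the argument above concrete, the sketch below runs an unknown integer through both stages: the lenient first mapper lets it through as a string, and `from_score` then rejects it. `EvaluationAnswer` is a simplified stand-in, and `MAPPING`, `normalize`, and `to_answer` are hypothetical names. The SDK's actual integer-to-score mapping is not shown in this thread.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class EvaluationAnswer:
    """Simplified stand-in for the SDK's EvaluationAnswer."""
    score: str
    label: str


class TextComplexityAnswer(Enum):
    SLIGHTLY_COMPLEX = EvaluationAnswer("slightly_complex", "Slightly complex")
    MODERATELY_COMPLEX = EvaluationAnswer("moderately_complex", "Moderately complex")
    VERY_COMPLEX = EvaluationAnswer("very_complex", "Very complex")
    EXCEEDINGLY_COMPLEX = EvaluationAnswer("exceedingly_complex", "Exceedingly complex")

    @classmethod
    def from_score(cls, score: str) -> "TextComplexityAnswer":
        for member in cls:
            if member.value.score == score:
                return member
        raise ValueError(f"Unknown text complexity score: {score!r}")


# Hypothetical mapping from the LLM's numeric answer to a score string.
MAPPING = {
    1: "slightly_complex",
    2: "moderately_complex",
    3: "very_complex",
    4: "exceedingly_complex",
}


def normalize(result: dict) -> dict:
    """First stage: lenient. Unknown values pass through as strings."""
    value = result["answer"]
    if isinstance(value, str) and value.isdigit():
        value = int(value)
    result["complexity_score"] = MAPPING.get(value, str(value))
    return result


def to_answer(result: dict) -> TextComplexityAnswer:
    """Second stage: strict. Unknown scores raise ValueError."""
    return TextComplexityAnswer.from_score(normalize(result)["complexity_score"])


print(to_answer({"answer": "2"}))  # a known value maps to an enum member
try:
    to_answer({"answer": 9})       # unknown int survives stage one as "9"
except ValueError as exc:
    print(exc)
```

Under these assumptions the failure is loud, not silent, but it surfaces one stage later than the mapping that let the value through. That is the behavior the promised edge-case test would pin down.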

Comment on lines +54 to +56
for key in ("tier_2_words", "tier_3_words", "archaic_words", "other_complex_words"):
    if key not in result or result[key] is None:
        result[key] = ""
Contributor

P1 - Handles missing objects, great, but this will mask LLM response inconsistencies

Contributor Author

The notebook evaluation doesn't fail if these are missing. The prettify method in the notebook assumes these can be missing and fills them in with N/A.
The SDK is strict, so this allows results that would be valid in the notebook to be valid in the SDK too.
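Both positions in this thread can be held at once: keep the notebook-compatible defaults, but report which keys were filled so callers can log or count LLM inconsistencies. The sketch below is one possible shape for that, not what the PR implements. `fill_missing_word_lists` and its return value are hypothetical.

```python
OPTIONAL_WORD_KEYS = (
    "tier_2_words",
    "tier_3_words",
    "archaic_words",
    "other_complex_words",
)


def fill_missing_word_lists(result: dict) -> list[str]:
    """Default missing or None word lists to "" and return the filled keys."""
    filled = []
    for key in OPTIONAL_WORD_KEYS:
        # dict.get returns None for both an absent key and an explicit None.
        if result.get(key) is None:
            result[key] = ""
            filled.append(key)
    return filled


llm_result = {"tier_2_words": "analyze, contrast", "archaic_words": None}
filled_keys = fill_missing_word_lists(llm_result)
print(filled_keys)
```

The returned list leaves the decision to the caller: ignore it for parity with the notebook, or emit a warning when it is non-empty.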

@czi-fsisenda czi-fsisenda merged commit 5d1089c into fsisenda/sdk_python_contract_tests May 14, 2026
4 checks passed
@czi-fsisenda czi-fsisenda deleted the fsisenda/sdk_python_vocabulary branch May 14, 2026 09:17
czi-fsisenda added a commit that referenced this pull request May 14, 2026
… test (#39)

* feat: contract test scaffold and conventionality contract test

* chore: fix build issues

* ci: fixing build

* chore: moved capture script to scripts folder within python sdk

* Align conventionality_evaluator notebook with main

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore: addressing PR comments

* feat(python-sdk): vocabulary evaluator (#36)

* feat: vocabulary evaluator

* chore: update vocabulary settings to use  instead of  for prompt settings

* chore: fix capture and contract tests

* chore: vocabulary settings are required

* feat: eval instance settings overrides

* chore: addressing PR comments

* chore: restore vocabulary notebook

* feat: base eval support for json normalizers

* chore: cleaner implementation of vocab

* chore: same step name as typescript sdk + edge case unit test

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
czi-fsisenda added a commit that referenced this pull request May 14, 2026
* feat(python-sdk): python SDK scaffold
* feat(python-sdk): conventionality evaluator (#38)
* feat(python-sdk): contract test scaffold and conventionality contract test (#39)
* feat(python-sdk): vocabulary evaluator (#36)

3 participants