feat(python-sdk): vocabulary evaluator #36
Pull request overview
Adds a new Vocabulary Complexity evaluator to the Python SDK, including shared TOML settings, a generated settings module, an evaluator implementation with a two-step LLM flow (background knowledge, then vocabulary complexity), and accompanying unit and contract tests (with notebook capture support).
Changes:
- Introduces vocabulary evaluator settings (`settings.toml`) and contract snapshots (`contracts.toml`) in both the shared `sdks/settings/` directory and the bundled Python package settings.
- Implements `VocabularyEvaluator` and its Pydantic schemas, and exports them from the package API.
- Adds unit tests and contract-test harness integration for vocabulary, and updates settings sync/tests to include the new evaluator.
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `sdks/settings/vocabulary/settings.toml` | New shared TOML spec for vocabulary evaluator metadata, prompts, and step model settings. |
| `sdks/settings/vocabulary/contracts.toml` | New shared contract snapshots for two vocabulary cases (grade 3 and grade 7). |
| `sdks/python/tests/settings/test_load_settings.py` | Ensures bundled vocabulary contracts exist and match canonical settings. |
| `sdks/python/tests/evaluators/test_vocabulary.py` | Unit tests for evaluator logic, mappings, and input validation behavior. |
| `sdks/python/tests/contract_tests/vocabulary.py` | Case loaders + mapping helpers for vocabulary contract tests. |
| `sdks/python/tests/contract_tests/test_vocabulary.py` | Contract tests asserting prompt fidelity + result fidelity vs notebook captures. |
| `sdks/python/src/learning_commons_evaluators/settings/vocabulary/contracts.toml` | Bundled (package) copy of vocabulary contract snapshots. |
| `sdks/python/src/learning_commons_evaluators/settings/_generated_vocabulary_settings.py` | Auto-generated Python settings module from the TOML spec. |
| `sdks/python/src/learning_commons_evaluators/schemas/vocabulary.py` | New settings + output schemas for the vocabulary evaluator. |
| `sdks/python/src/learning_commons_evaluators/evaluators/vocabulary.py` | New evaluator implementation with grade-specific prompt/model paths. |
| `sdks/python/src/learning_commons_evaluators/evaluators/__init__.py` | Exports vocabulary evaluator types from the evaluators package. |
| `sdks/python/src/learning_commons_evaluators/__init__.py` | Exposes vocabulary evaluator and schemas at the top-level SDK API. |
| `evals/vocabulary_evaluator.ipynb` | Adds capture utilities usage to generate contract TOML from notebook runs. |
adnanrhussain left a comment
1 P0 and other low pri feedback. lgtm
pre-approving to unblock
            model='gemini-2.5-pro',
            temperature=0.0,
        ),
        prompt_settings_step_vocab_other_grades=PromptSettings(provider_type=LlmProvider.OPENAI, model='gpt-4.1', temperature=0.0),
P0 - Can you please double-check this. I think we need to pin it to a specific snapshot
@adnanrhussain This comes from the notebook: https://github.com/learning-commons-org/evaluators/blob/df5def4fc6a48958245c65ec5e296232c5741c29/evals/vocabulary_evaluator.ipynb
It ran successfully both in the notebook and with this implementation in the SDK.
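For the pinning concern above, a small check like the following could flag unpinned model names in the settings. This is only a sketch: `is_pinned_snapshot` is a hypothetical helper that is not part of the SDK, and it assumes dated snapshot names of the form `<model>-YYYY-MM-DD`, which does not cover every provider's naming scheme.

```python
import re

# Assumption: a "pinned" model name ends in a -YYYY-MM-DD date suffix.
_DATED_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def is_pinned_snapshot(model: str) -> bool:
    """Hypothetical helper: True if the model name ends with a dated suffix."""
    return bool(_DATED_SUFFIX.search(model))


# 'gpt-4.1' is the value under review; the dated name is illustrative only.
alias_is_pinned = is_pinned_snapshot("gpt-4.1")
dated_is_pinned = is_pinned_snapshot("gpt-4.1-2025-04-14")
```

A check of this kind could run in the settings tests, so that switching back to a floating alias has to be a deliberate change.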
    ).partial(format_instructions=parser.get_format_instructions())

    output = self.execute_prompt_chain_step(
        step_name="vocab_complexity",
P1 - In the TS SDK, I think we use complexity_evaluation as the step_name
    value = result["answer"]
    if isinstance(value, str) and value.isdigit():
        value = int(value)
    result["complexity_score"] = mapping.get(value, str(value))
P1 - Unknown ints may pass through and fail silently
They wouldn't fail silently. TextComplexityAnswer has a strict definition. We use TextComplexityAnswer.from_score to map the final result. An unknown int would pass through the first mapper and get caught in from_score or possibly earlier in the JsonParser validate. I'll add a test.
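The behavior described in this reply can be sketched roughly as follows. The real `TextComplexityAnswer`, its `from_score`, and the SDK's mapping table are not shown in this thread, so `ComplexityAnswer`, its labels, and `MAPPING` below are hypothetical stand-ins. The enum lookup is just one way a strict `from_score` might reject unknown values.

```python
from enum import Enum


class ComplexityAnswer(Enum):
    """Hypothetical stand-in for TextComplexityAnswer (labels are assumed)."""

    SLIGHTLY_COMPLEX = "slightly complex"
    MODERATELY_COMPLEX = "moderately complex"
    VERY_COMPLEX = "very complex"

    @classmethod
    def from_score(cls, score: str) -> "ComplexityAnswer":
        # Lookup by value raises ValueError for any label not defined above.
        return cls(score)


# Hypothetical int -> label table; the SDK's actual mapping is not shown here.
MAPPING = {1: "slightly complex", 2: "moderately complex", 3: "very complex"}


def normalize_answer(result: dict) -> dict:
    """Lenient first mapper, mirroring the snippet under review."""
    value = result["answer"]
    if isinstance(value, str) and value.isdigit():
        value = int(value)
    out = dict(result)
    # Unknown values fall through as strings and are left for the strict step.
    out["complexity_score"] = MAPPING.get(value, str(value))
    return out


known = normalize_answer({"answer": "2"})
unknown = normalize_answer({"answer": 9})

try:
    ComplexityAnswer.from_score(unknown["complexity_score"])
    unknown_rejected = False
except ValueError:
    unknown_rejected = True
```

In this sketch the unknown score survives the first mapper as `"9"`, but the strict lookup raises, so the failure is not silent.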
    for key in ("tier_2_words", "tier_3_words", "archaic_words", "other_complex_words"):
        if key not in result or result[key] is None:
            result[key] = ""
P1 - Handles missing objects, great, but this will mask LLM response inconsistencies
The notebook evaluation doesn't fail if these are missing. The prettify method in the notebook assumes these can be missing and fills them in with N/A.
The SDK is strict, so this allows results that would be valid in the notebook to be valid in the SDK too.
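One possible middle ground between the two positions in this thread, sketched under assumptions: keep the lenient fill so notebook-valid results stay valid, but also return which keys were filled so callers can log or count inconsistencies. `fill_missing_word_lists` is a hypothetical helper and is not what this PR implements.

```python
OPTIONAL_WORD_KEYS = (
    "tier_2_words",
    "tier_3_words",
    "archaic_words",
    "other_complex_words",
)


def fill_missing_word_lists(result: dict) -> tuple[dict, list[str]]:
    """Return a copy with missing/None keys set to "" and the list of filled keys."""
    out = dict(result)
    filled = []
    for key in OPTIONAL_WORD_KEYS:
        if out.get(key) is None:  # covers both a missing key and an explicit None
            out[key] = ""
            filled.append(key)
    return out, filled


normalized, filled_keys = fill_missing_word_lists(
    {"tier_2_words": "analyze, contrast", "archaic_words": None}
)
```

The returned `filled_keys` list could be attached to debug output or asserted empty in contract tests, without changing what the evaluator accepts.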
Commit 5d1089c merged into fsisenda/sdk_python_contract_tests
… test (#39)

* feat: contract test scaffold and conventionality contract test
* chore: fix build issues
* ci: fixing build
* chore: moved capture script to scripts folder within python sdk
* Align conventionality_evaluator notebook with main

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore: addressing PR comments
* feat(python-sdk): vocabulary evaluator (#36)
* feat: vocabulary evaluator
* chore: update vocabulary settings to use instead of for prompt settings
* chore: fix capture and contract tests
* chore: vocabulary settings are required
* feat: eval instance settings overrides
* chore: addressing PR comments
* chore: restore vocabulary notebook
* feat: base eval support for json normalizers
* chore: cleaner implementation of vocab
* chore: same step name as typescript sdk + edge case unit test

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Jira:
Implementation of the vocabulary evaluator in the Python SDK.
Test Plan