feat(python-sdk): vocabulary evaluator by czi-fsisenda · Pull Request #36 · learning-commons-org/evaluators

czi-fsisenda · 2026-04-30T14:07:15Z

Summary

Jira:

Implements the vocabulary evaluator in the Python SDK:

Adds a cell to the notebook to capture LLM inputs and outputs for vocabulary
Creates a settings file for vocabulary
Generates the contract artifact for vocabulary from data captured from the notebook
Generates Python settings from the settings file
Implements the vocabulary evaluator
Adds unit tests
Adds contract tests

Test Plan

Wrote automated tests
Manually tested my changes, and here are the details:

Copilot

Pull request overview

Adds a new Vocabulary Complexity evaluator to the Python SDK, including shared TOML settings, a generated settings module, an evaluator implementation with a two-step LLM flow (background knowledge + vocab complexity), and accompanying unit and contract tests (with notebook capture support).

Changes:

Introduces vocabulary evaluator settings (settings.toml) and contract snapshots (contracts.toml) in both shared sdks/settings/ and bundled Python package settings.
Implements VocabularyEvaluator + Pydantic schemas and exports them from the package API.
Adds unit tests and contract-test harness integration for vocabulary, and updates settings sync/tests to include the new evaluator.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `sdks/settings/vocabulary/settings.toml` | New shared TOML spec for vocabulary evaluator metadata, prompts, and step model settings. |
| `sdks/settings/vocabulary/contracts.toml` | New shared contract snapshots for two vocabulary cases (grade 3 and grade 7). |
| `sdks/python/tests/settings/test_load_settings.py` | Ensures bundled vocabulary contracts exist and match canonical settings. |
| `sdks/python/tests/evaluators/test_vocabulary.py` | Unit tests for evaluator logic, mappings, and input validation behavior. |
| `sdks/python/tests/contract_tests/vocabulary.py` | Case loaders + mapping helpers for vocabulary contract tests. |
| `sdks/python/tests/contract_tests/test_vocabulary.py` | Contract tests asserting prompt fidelity + result fidelity vs notebook captures. |
| `sdks/python/src/learning_commons_evaluators/settings/vocabulary/contracts.toml` | Bundled (package) copy of vocabulary contract snapshots. |
| `sdks/python/src/learning_commons_evaluators/settings/_generated_vocabulary_settings.py` | Auto-generated Python settings module from the TOML spec. |
| `sdks/python/src/learning_commons_evaluators/schemas/vocabulary.py` | New settings + output schemas for the vocabulary evaluator. |
| `sdks/python/src/learning_commons_evaluators/evaluators/vocabulary.py` | New evaluator implementation with grade-specific prompt/model paths. |
| `sdks/python/src/learning_commons_evaluators/evaluators/__init__.py` | Exports vocabulary evaluator types from the evaluators package. |
| `sdks/python/src/learning_commons_evaluators/__init__.py` | Exposes vocabulary evaluator and schemas at the top-level SDK API. |
| `evals/vocabulary_evaluator.ipynb` | Adds capture utilities usage to generate contract TOML from notebook runs. |

adnanrhussain

1 P0 and other low pri feedback. lgtm
pre-approving to unblock

adnanrhussain · 2026-05-13T18:31:24Z

+        model='gemini-2.5-pro',
+        temperature=0.0,
+    ),
+    prompt_settings_step_vocab_other_grades=PromptSettings(provider_type=LlmProvider.OPENAI, model='gpt-4.1', temperature=0.0),


P0 - Can you please double-check this. I think we need to pin it to a specific snapshot

czi-fsisenda · 2026-05-13

@adnanrhussain This comes from the notebook: https://github.com/learning-commons-org/evaluators/blob/df5def4fc6a48958245c65ec5e296232c5741c29/evals/vocabulary_evaluator.ipynb
It ran successfully both in the notebook and with this implementation in the SDK.
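
For the pinning question, a check like the one below could flag model ids that are aliases rather than dated snapshots. This is only a sketch: it assumes the OpenAI-style convention of a trailing `-YYYY-MM-DD` suffix, which does not apply to every provider (Gemini ids, for instance, follow a different scheme). The helper names and the dated id in the usage note are illustrative and should be verified against the provider's model list.

```python
import re

# Assumption: a "pinned" model id ends in a -YYYY-MM-DD date suffix.
# This matches OpenAI-style snapshot ids only; other providers differ.
_DATED_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def is_pinned_snapshot(model: str) -> bool:
    """Return True when the model id ends with a -YYYY-MM-DD suffix."""
    return bool(_DATED_SUFFIX.search(model))


def unpinned_models(models: dict[str, str]) -> list[str]:
    """Return the step names whose model id lacks a dated suffix."""
    return [step for step, model in models.items() if not is_pinned_snapshot(model)]
```

Called with a hypothetical mapping such as `{"vocab_other_grades": "gpt-4.1"}`, `unpinned_models` would report that step, while an id like `gpt-4.1-2025-04-14` would pass the check.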

adnanrhussain · 2026-05-13T18:32:10Z

+        ).partial(format_instructions=parser.get_format_instructions())
+
+        output = self.execute_prompt_chain_step(
+            step_name="vocab_complexity",


P1 - In TS SDK, I think we use complexity_evaluation as the step_name

adnanrhussain · 2026-05-13T18:33:11Z

+        value = result["answer"]
+        if isinstance(value, str) and value.isdigit():
+            value = int(value)
+        result["complexity_score"] = mapping.get(value, str(value))


P1 - Unknown ints may pass through and fail silently

czi-fsisenda · 2026-05-14

They wouldn't fail silently. TextComplexityAnswer has a strict definition, and we use TextComplexityAnswer.from_score to map the final result. An unknown int would pass through the first mapper and then get caught in from_score, or possibly earlier in the JsonParser validate. I'll add a test.

evaluators/sdks/python/src/learning_commons_evaluators/schemas/text_complexity.py

Lines 18 to 42 in 0b449c8

class TextComplexityAnswer(Enum):
    """
    Allowed text complexity answers. Each member's value is an EvaluationAnswer;
    use .label and .score for the human label and score string.
    """

    SLIGHTLY_COMPLEX = EvaluationAnswer(score="slightly_complex", label="Slightly complex")
    MODERATELY_COMPLEX = EvaluationAnswer(score="moderately_complex", label="Moderately complex")
    VERY_COMPLEX = EvaluationAnswer(score="very_complex", label="Very complex")
    EXCEEDINGLY_COMPLEX = EvaluationAnswer(score="exceedingly_complex", label="Exceedingly complex")

    @property
    def score(self) -> str:
        return self.value.score

    @property
    def label(self) -> str:
        return self.value.label

    @classmethod
    def from_score(cls, score: str) -> "TextComplexityAnswer":
        for member in cls:
            if member.value.score == score:
                return member
        raise ValueError(f"Unknown text complexity score: {score!r}")
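
The pass-through behavior discussed in this thread can be sketched roughly as follows. The enum and its score strings follow the excerpt above; `EvaluationAnswer` is a stand-in dataclass, and `SCORE_MAPPING` (1 to 4) and `normalize_answer` are hypothetical, since the real mapping and normalizer are not shown here.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class EvaluationAnswer:
    # Stand-in for the SDK type; only the two fields used here are assumed.
    score: str
    label: str


class TextComplexityAnswer(Enum):
    SLIGHTLY_COMPLEX = EvaluationAnswer(score="slightly_complex", label="Slightly complex")
    MODERATELY_COMPLEX = EvaluationAnswer(score="moderately_complex", label="Moderately complex")
    VERY_COMPLEX = EvaluationAnswer(score="very_complex", label="Very complex")
    EXCEEDINGLY_COMPLEX = EvaluationAnswer(score="exceedingly_complex", label="Exceedingly complex")

    @classmethod
    def from_score(cls, score: str) -> "TextComplexityAnswer":
        for member in cls:
            if member.value.score == score:
                return member
        raise ValueError(f"Unknown text complexity score: {score!r}")


# Hypothetical int-to-score mapping; the SDK's actual mapping may differ.
SCORE_MAPPING = {
    1: "slightly_complex",
    2: "moderately_complex",
    3: "very_complex",
    4: "exceedingly_complex",
}


def normalize_answer(result: dict) -> dict:
    """Mirror the reviewed snippet: map known ints, stringify anything else."""
    value = result["answer"]
    if isinstance(value, str) and value.isdigit():
        value = int(value)
    result["complexity_score"] = SCORE_MAPPING.get(value, str(value))
    return result


if __name__ == "__main__":
    known = normalize_answer({"answer": "2"})
    print(TextComplexityAnswer.from_score(known["complexity_score"]))

    unknown = normalize_answer({"answer": 7})  # passes through as "7"
    try:
        TextComplexityAnswer.from_score(unknown["complexity_score"])
    except ValueError as exc:
        print(exc)  # the strict enum lookup rejects the unknown score
```

Under these assumptions the unknown value survives the first mapper as a string but cannot reach the caller as a valid answer, because the enum lookup raises.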

adnanrhussain · 2026-05-13T18:35:16Z

+    for key in ("tier_2_words", "tier_3_words", "archaic_words", "other_complex_words"):
+        if key not in result or result[key] is None:
+            result[key] = ""


P1 - Handles missing objects, great, but this will mask LLM response inconsistencies

czi-fsisenda · 2026-05-14

The notebook evaluation doesn't fail if these are missing. The prettify method in the notebook assumes they can be missing and fills them in with N/A.
The SDK is strict, so this lets results that are valid in the notebook be valid in the SDK too.
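
One way to keep this leniency while still surfacing inconsistent LLM responses might be to report which fields were defaulted. The sketch below is a variation on the reviewed loop, not what the PR implements; `fill_missing_word_lists` and its return shape are hypothetical.

```python
WORD_LIST_KEYS = ("tier_2_words", "tier_3_words", "archaic_words", "other_complex_words")


def fill_missing_word_lists(result: dict) -> tuple[dict, list[str]]:
    """Default absent or None word-list fields to "" and report which were filled."""
    normalized = dict(result)  # copy so the caller's dict is left untouched
    filled = []
    for key in WORD_LIST_KEYS:
        if normalized.get(key) is None:  # covers both a missing key and an explicit None
            normalized[key] = ""
            filled.append(key)
    return normalized, filled
```

A caller could log or count the returned `filled` list, so the result stays valid for the strict schema while missing fields remain visible.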

… test (#39)

* feat: contract test scaffold and conventionality contract test
* chore: fix build issues
* ci: fixing build
* chore: moved capture script to scripts folder within python sdk
* Align conventionality_evaluator notebook with main

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore: addressing PR comments
* feat(python-sdk): vocabulary evaluator (#36)
* feat: vocabulary evaluator
* chore: update vocabulary settings to use instead of for prompt settings
* chore: fix capture and contract tests
* chore: vocabulary settings are required
* feat: eval instance settings overrides
* chore: addressing PR comments
* chore: restore vocabulary notebook
* feat: base eval support for json normalizers
* chore: cleaner implementation of vocab
* chore: same step name as typescript sdk + edge case unit test

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(python-sdk): python SDK scaffold
* feat(python-sdk): conventionality evaluator (#38)
* feat(python-sdk): contract test scaffold and conventionality contract test (#39)
* feat(python-sdk): vocabulary evaluator (#36)

feat: vocabulary evaluator

c56de93

czi-fsisenda requested review from adnanrhussain, Copilot and georgemelvin April 30, 2026 14:07

Copilot started reviewing on behalf of czi-fsisenda April 30, 2026 14:08

Copilot AI reviewed Apr 30, 2026

czi-fsisenda changed the title ~~feat: vocabulary evaluator~~ feat(python-sdk): vocabulary evaluator Apr 30, 2026

czi-fsisenda and others added 9 commits May 12, 2026 20:19

Merge branch 'fsisenda/sdk_python_contract_tests' into fsisenda/sdk_python_vocabulary

451a29e

chore: update vocabulary settings to use instead of for prompt settings

4a6ec4d

chore: fix capture and contract tests

1ff2251

chore: vocabulary settings are required

dd8d1f4

feat: eval instance settings overrides

3d0e870

chore: addressing PR comments

bf383cc

chore: restore vocabulary notebook

918a659

feat: base eval support for json normalizers

ffa652f

chore: cleaner implementation of vocab

0b449c8

adnanrhussain approved these changes May 13, 2026

czi-fsisenda and others added 2 commits May 13, 2026 19:39

Merge branch 'fsisenda/sdk_python_contract_tests' into fsisenda/sdk_python_vocabulary

785d60a

chore: same step name as typescript sdk + edge case unit test

c6ac496

czi-fsisenda merged commit 5d1089c into fsisenda/sdk_python_contract_tests May 14, 2026
4 checks passed

czi-fsisenda deleted the fsisenda/sdk_python_vocabulary branch May 14, 2026 09:17

czi-fsisenda merged 12 commits into fsisenda/sdk_python_contract_tests from fsisenda/sdk_python_vocabulary

3 participants
