Demonstrates the full lifecycle of an LLM-as-judge evaluator in LangSmith — from offline testing to live deployment.
The notebook (`evaluator_lifecycle.ipynb`) walks through:
- Dataset — a labelled set of professional/unprofessional business responses, including edge cases, used to validate that the evaluator behaves as expected
- Evaluator prompt v1 — a minimal prompt pushed to the LangSmith Hub and tested offline against the dataset
- Online deployment — an online evaluator created referencing the v1 prompt commit, running automatically on every live trace
- Evaluator prompt v2 — an improved prompt commit with explicit criteria, tested offline head-to-head with v1
- Canary deployment — a second online evaluator created referencing the v2 prompt, running alongside the first so both collect real-traffic signal simultaneously
- Quality gate & promotion — if v2's offline accuracy meets or exceeds v1's, the original online evaluator is updated to use the v2 prompt; the canary online evaluator is then retired
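The quality gate in the final step reduces to a small pure function. A minimal sketch of the decision logic (the function names and accuracy computation are illustrative, not LangSmith SDK calls):

```python
def accuracy(verdicts: list[bool], labels: list[bool]) -> float:
    """Fraction of judge verdicts that match the dataset's gold labels."""
    assert len(verdicts) == len(labels)
    return sum(v == l for v, l in zip(verdicts, labels)) / len(labels)

def should_promote(v1_accuracy: float, v2_accuracy: float) -> bool:
    """Promote v2 only if its offline accuracy meets or exceeds v1's."""
    return v2_accuracy >= v1_accuracy

# If the gate passes, the original online evaluator is pointed at the v2
# prompt commit and the canary evaluator is retired; otherwise v1 stays.
```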
The appendix shows how to create numeric and categorical evaluator output schemas in addition to the boolean type used in the main flow.
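To make the three output types concrete, here are JSON-schema-style sketches of what a judge could emit for each; the property names (`professional`, `professionalism`, `tone`) are illustrative, and the exact schema fields LangSmith expects may differ:

```python
# Boolean verdict (used in the main flow): pass/fail judgement.
boolean_schema = {
    "type": "object",
    "properties": {"professional": {"type": "boolean"}},
    "required": ["professional"],
}

# Numeric score: a graded judgement on a fixed range.
numeric_schema = {
    "type": "object",
    "properties": {
        "professionalism": {"type": "number", "minimum": 0, "maximum": 10},
    },
    "required": ["professionalism"],
}

# Categorical label: one of a closed set of classes.
categorical_schema = {
    "type": "object",
    "properties": {
        "tone": {"enum": ["professional", "neutral", "unprofessional"]},
    },
    "required": ["tone"],
}
```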
```
cp .env.example .env
```

Fill in `.env`:
| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key used by the LLM judge |
| `LANGSMITH_API_KEY` | LangSmith API key |
| `LANGSMITH_PROJECT_ID` | UUID of the tracing project online evaluators will be attached to (visible in the LangSmith UI URL) |
```
uv sync
uv run jupyter lab
```