Demonstrates the full lifecycle of an LLM-as-judge evaluator in LangSmith — from offline testing to live deployment.
The notebook (`evaluator_lifecycle.ipynb`) walks through:
- Dataset — a labelled set of professional/unprofessional business responses, including edge cases, used to validate that the evaluator behaves as expected
- Evaluator prompt v1 — a minimal prompt pushed to the LangSmith Hub and tested offline against the dataset
- Online deployment — an online evaluator created referencing the v1 prompt commit, running automatically on every live trace
- Evaluator prompt v2 — an improved prompt commit with explicit criteria, tested offline head-to-head with v1
- Canary deployment — a second online evaluator created referencing the v2 prompt, running alongside the first so both collect real-traffic signal simultaneously
- Quality gate & promotion — if v2's offline accuracy meets or exceeds v1's, the original online evaluator is updated to use the v2 prompt; the canary online evaluator is then retired
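The quality gate in the final step reduces to a small pure function. A minimal sketch of the decision logic (the function names and accuracy computation are illustrative, not LangSmith SDK calls):

```python
def accuracy(verdicts: list[bool], labels: list[bool]) -> float:
    """Fraction of judge verdicts that match the dataset's gold labels."""
    assert len(verdicts) == len(labels)
    return sum(v == l for v, l in zip(verdicts, labels)) / len(labels)

def should_promote(v1_accuracy: float, v2_accuracy: float) -> bool:
    """Promote v2 only if its offline accuracy meets or exceeds v1's."""
    return v2_accuracy >= v1_accuracy

# If the gate passes, the original online evaluator is pointed at the v2
# prompt commit and the canary evaluator is retired; otherwise v1 stays.
```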
The appendix shows how to create numeric and categorical evaluator output schemas in addition to the boolean type used in the main flow.
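To make the three output types concrete, here are JSON-schema-style sketches of what a judge could emit for each; the property names (`professional`, `professionalism`, `tone`) are illustrative, and the exact schema fields LangSmith expects may differ:

```python
# Boolean verdict (used in the main flow): pass/fail judgement.
boolean_schema = {
    "type": "object",
    "properties": {"professional": {"type": "boolean"}},
    "required": ["professional"],
}

# Numeric score: a graded judgement on a fixed range.
numeric_schema = {
    "type": "object",
    "properties": {
        "professionalism": {"type": "number", "minimum": 0, "maximum": 10},
    },
    "required": ["professionalism"],
}

# Categorical label: one of a closed set of classes.
categorical_schema = {
    "type": "object",
    "properties": {
        "tone": {"enum": ["professional", "neutral", "unprofessional"]},
    },
    "required": ["tone"],
}
```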
```
cp .env.example .env
```

Fill in `.env`:
| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key used by the LLM judge |
| `LANGSMITH_API_KEY` | LangSmith API key |
| `LANGSMITH_PROJECT_ID` | UUID of the tracing project online evaluators will be attached to (visible in the LangSmith UI URL) |
```
uv sync
uv run jupyter lab
```