langchain-samples/eval-ci-cd

Evaluator Lifecycle

Demonstrates the full lifecycle of an LLM-as-judge evaluator in LangSmith — from offline testing to live deployment.

The notebook (evaluator_lifecycle.ipynb) walks through:

  1. Dataset — a labelled set of professional/unprofessional business responses, including edge cases, used to validate that the evaluator works as expected.
  2. Evaluator prompt v1 — a minimal prompt pushed to the LangSmith Hub and tested offline against the dataset.
  3. Online deployment — an online evaluator created referencing the v1 prompt commit, running automatically on every live trace.
  4. Evaluator prompt v2 — an improved prompt commit with explicit criteria, tested offline head-to-head against v1.
  5. Canary deployment — a second online evaluator created referencing the v2 prompt, running alongside the first so both collect real-traffic signal simultaneously.
  6. Quality gate & promotion — if v2's offline accuracy meets or exceeds v1's, the original online evaluator is updated to use the v2 prompt and the canary online evaluator is retired.
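The promotion decision in step 6 reduces to a comparison of offline accuracies on the labelled dataset. A minimal sketch of that gate — the function names, verdicts, and labels below are illustrative, not taken from the notebook:

```python
def accuracy(predictions, labels):
    """Fraction of judge verdicts that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)

def should_promote(v1_acc, v2_acc):
    """Quality gate: v2 must meet or exceed v1 to be promoted."""
    return v2_acc >= v1_acc

# Hypothetical judge verdicts vs. gold labels from the labelled dataset
gold = [True, True, False, False, True]
v1   = [True, True, False, True,  True]   # 4/5 correct
v2   = [True, True, False, False, True]   # 5/5 correct

print(should_promote(accuracy(v1, gold), accuracy(v2, gold)))  # True -> promote v2, retire the canary
```

If the gate fails, the canary evaluator is simply deleted instead, and the original v1 deployment keeps running untouched.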

The appendix shows how to create numeric and categorical evaluator output schemas in addition to the boolean type used in the main flow.

Setup

cp .env.example .env

Fill in .env:

| Variable | Description |
| --- | --- |
| `OPENAI_API_KEY` | OpenAI API key used by the LLM judge |
| `LANGSMITH_API_KEY` | LangSmith API key |
| `LANGSMITH_PROJECT_ID` | UUID of the tracing project online evaluators will be attached to (visible in the LangSmith UI URL) |
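A small convenience check (not part of the repo; the helper name is illustrative) to confirm all three variables are set before launching the notebook:

```python
import os

REQUIRED = ["OPENAI_API_KEY", "LANGSMITH_API_KEY", "LANGSMITH_PROJECT_ID"]

def missing_vars(env=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

if missing_vars():
    print("Missing from environment:", ", ".join(missing_vars()))
```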

Running

uv sync
uv run jupyter lab
