
feat(evaluator): complete Langfuse observability pipeline (v2.2.3B)#64

Merged
Hidden-History merged 13 commits into main from feature/v2.2.3B-langfuse-observability
Mar 15, 2026

Conversation


@Hidden-History Hidden-History commented Mar 15, 2026

Summary

Complete Langfuse observability pipeline — observation-level evaluation for all 6 evaluators, automated scheduling, exponential backoff retry, and security hardening.

  • Adds observation-level evaluation path to the evaluator runner — EV-01 through EV-04 now score individual Langfuse spans filtered by event_type name
  • Creates automated evaluator-scheduler Docker container with croniter-based scheduling under the langfuse profile
  • Adds exponential backoff retry logic for transient provider errors (500, 502, 503, 429)
  • Creates all 12 evaluator YAML + prompt definition files with filters aligned to actual codebase event_types
  • Makes create_score_configs.py truly idempotent with pre-check and --cleanup-duplicates (archive via isArchived)
  • Sanitizes all log injection vectors in monitoring/main.py for CodeQL compliance
  • Adds Ollama cloud auto-detection when OLLAMA_API_KEY is set
  • Fixes installer to copy requirements.txt and import user .env on Option 1 updates
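The observation-level path can be sketched as a simple name filter over fetched spans (the dict shape and helper below are illustrative, not the actual runner or SDK schema; the real runner reads its filters from per-evaluator YAML):

```python
def select_observations(observations, event_types):
    """Pick the spans an evaluator should score by matching Langfuse
    observation names against the evaluator's configured event_types.
    Field names here are illustrative, not the exact SDK schema."""
    wanted = set(event_types)
    return [obs for obs in observations if obs.get("name") in wanted]
```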

Changes

Evaluator Pipeline

  • src/memory/evaluator/runner.py — observation-level eval path, target routing from per-evaluator YAML, CATEGORICAL score handling, page-based pagination, score_id collision prevention
  • src/memory/evaluator/provider.py — exponential backoff retry with jitter, Retry-After header support, Ollama cloud auto-detection
  • evaluators/ev01-ev06*.yaml + *_prompt.md — all 12 evaluator definition files
  • scripts/create_score_configs.py — idempotent score config creation with archive-based duplicate cleanup
  • evaluator_config.yaml — max_retries config, gemma3:4b default model
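A minimal sketch of the retry behavior described for provider.py: exponential backoff with full jitter, a Retry-After override, and retries limited to the transient statuses listed above. The exception's `.status` and `.retry_after` attributes are assumed shapes for illustration, not the real provider's error type:

```python
import random
import time

# Transient provider errors worth retrying (from the PR description).
TRANSIENT_STATUSES = {429, 500, 502, 503}

def compute_backoff(attempt, base=1.0, cap=30.0, retry_after=None):
    """Delay before retry `attempt` (0-based): exponential growth with
    full jitter, capped, honoring a server-supplied Retry-After value."""
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retry(fn, max_retries=3, sleep=time.sleep):
    """Call fn(), retrying on transient HTTP errors up to max_retries.
    Assumes fn raises an exception carrying `.status` and optionally
    `.retry_after` (hypothetical attributes, not the real SDK error)."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            status = getattr(exc, "status", None)
            if status not in TRANSIENT_STATUSES or attempt == max_retries:
                raise
            sleep(compute_backoff(attempt,
                                  retry_after=getattr(exc, "retry_after", None)))
```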

Scheduler Container

  • scripts/memory/evaluator_scheduler.py — cron daemon with health check, graceful shutdown, live config reload
  • docker/Dockerfile.evaluator-scheduler — python:3.12-slim based container
  • docker/docker-compose.langfuse.yml — evaluator-scheduler service under langfuse profile
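The daemon's core loop can be sketched as below. The real evaluator_scheduler.py presumably gets the next run time from croniter (e.g. `croniter(expr, now).get_next(datetime)`) and adds health checks, graceful shutdown, and config reload; here `get_next` is injected so the skeleton stays self-contained:

```python
from datetime import datetime, timedelta

def scheduler_loop(get_next, run_job, sleep, clock, stop):
    """Minimal cron-daemon skeleton (hypothetical shape): sleep until the
    next scheduled time, run the evaluation job, repeat until stopped.
    get_next(now) -> datetime stands in for a croniter lookup."""
    while not stop():
        next_run = get_next(clock())
        sleep(max(0.0, (next_run - clock()).total_seconds()))
        run_job()
```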

Security

  • monitoring/main.py — inline sanitize_log_input() at all log call sites (CodeQL py/log-injection)
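The idea behind the sanitizer is to neutralize CRLF sequences so untrusted input cannot forge extra log lines. A minimal sketch; the actual helper in monitoring/main.py may escape more characters:

```python
def sanitize_log_input(value):
    """Neutralize log-injection vectors (CodeQL py/log-injection):
    escape carriage returns and newlines so attacker-controlled input
    cannot fabricate additional log entries."""
    return str(value).replace("\r", "\\r").replace("\n", "\\n")
```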

Installer Fixes

  • scripts/install.sh — always copy requirements.txt/pyproject.toml on updates; run import_user_env() on Option 1; fix SOURCE_DIR unbound variable

Documentation

  • CHANGELOG.md — complete v2.2.3 entry with upgrade instructions
  • docs/LANGFUSE-INTEGRATION.md — LLM-as-Judge evaluation pipeline section

Test Plan

  • 2540 tests pass locally (0 failures)
  • CI green on all checks (Lint, Unit Tests 3.10/3.11/3.12, Integration, CodeQL, Install Ubuntu/macOS)
  • Live test: 224/224 observations and traces scored via Ollama cloud (gemma3:4b)
  • Scheduler container starts, runs healthy, next evaluation scheduled
  • Score config idempotency verified (30 found, 0 created, 24 archived)
  • Installer Option 1 copies all required files including requirements.txt and user .env credentials

Resolves: TD-280, TD-281, TD-282, TD-283, TD-287, TD-288, TD-100, BUG-217

WB Solutions and others added 13 commits March 15, 2026 01:22
…on-level evaluation, automated scheduling, retry logic

- Add observation-level evaluation path to runner (EV-01 to EV-04 score
  individual spans by event_type name filtering)
- Fix pagination: cursor-based for observations.get_many(), page-based
  for trace.list() per V3 SDK
- Create all 12 evaluator YAML + prompt files with correct filter
  alignment against actual emit_trace_event() event_types
- Add evaluator-scheduler Docker container (croniter-based cron daemon)
  in docker-compose.langfuse.yml under langfuse profile
- Add exponential backoff retry logic for transient provider errors
  (500, 502, 503, 429) with configurable max_retries
- Make create_score_configs.py truly idempotent with pre-check and
  --cleanup-duplicates flag
- Sanitize all log injection vectors in monitoring/main.py (inline
  sanitize_log_input at every call site for CodeQL compliance)
- Add evaluator files to installer copy paths (both fresh and update)
- Add croniter>=2.0.0,<3.0.0 dependency

Resolves: TD-280, TD-281, TD-282, TD-283, TD-287, TD-288, TD-100, BUG-217

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…date

Option 1 (add-project) and copy_files() both skipped requirements.txt
if it already existed, preventing new dependencies like croniter from
reaching Docker builds. Now always overwrites both files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
V3 SDK ScoreConfigsClient exposes get(page=, limit=) not list().
Also fixes test mocks to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…no delete)

Langfuse V3 API returns 405 on DELETE for score configs. Uses
update(isArchived=True) instead — archived configs hidden in UI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
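The archive-instead-of-delete cleanup can be sketched as keeping the first config of each name and archiving the rest. `archive` stands in for the SDK update call that sets isArchived=True (the exact client signature is not shown in this PR):

```python
def archive_duplicate_configs(configs, archive):
    """Hide duplicate score configs while keeping the first of each name.
    The Langfuse V3 API returns 405 on DELETE, so duplicates are archived
    via an update setting isArchived=True; `archive(id)` is a stand-in
    for that call. Returns the ids that were archived."""
    seen, archived = set(), []
    for cfg in configs:
        if cfg["name"] in seen:
            archive(cfg["id"])
            archived.append(cfg["id"])
        else:
            seen.add(cfg["name"])
    return archived
```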
V3 SDK observations.get_many() uses page=/total_pages, not cursor.
Both trace.list() and observations.get_many() are page-based in V3.
Fixed runner and all test mocks to match actual SDK signatures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
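Page-based pagination as described in this fix walks page 1 through total_pages rather than following a cursor. `get_page(page, limit)` below is a stand-in returning `(items, total_pages)`, not the real SDK call:

```python
def fetch_all_pages(get_page, limit=50):
    """Collect every item from a page-based listing endpoint: the V3
    list calls take page=/limit= and report total_pages, so we loop
    until the reported page count is exhausted."""
    items, page, total_pages = [], 1, 1
    while page <= total_pages:
        batch, total_pages = get_page(page, limit)
        items.extend(batch)
        page += 1
    return items
```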
…etect Ollama cloud

- Add import_user_env() call to update_shared_scripts() (Option 1 path)
  so credentials like OLLAMA_API_KEY are imported on updates, not just
  fresh installs
- Auto-detect Ollama cloud vs local: if OLLAMA_API_KEY env var is set
  and no explicit base_url configured, use https://api.ollama.com/v1
  instead of localhost

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
import_user_env() used SOURCE_DIR which is only set during full install.
Fall back to SCRIPT_DIR parent for Option 1 (add-project) updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
api.ollama.com returns 401; ollama.com/v1 is the correct OpenAI-compat endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
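Putting the two endpoint commits together, the auto-detection logic amounts to: explicit base_url wins, a set OLLAMA_API_KEY selects the cloud endpoint, otherwise fall back to localhost. The function name and the default local port (Ollama's usual 11434) are assumptions for this sketch:

```python
def resolve_ollama_base_url(api_key=None, base_url=None):
    """Cloud vs local auto-detection: an explicitly configured base_url
    always wins; otherwise a set OLLAMA_API_KEY selects the cloud
    endpoint (https://ollama.com/v1, since api.ollama.com returns 401),
    and the local default assumes Ollama's standard port 11434."""
    if base_url:
        return base_url
    if api_key:
        return "https://ollama.com/v1"
    return "http://localhost:11434/v1"
```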
…mpatible)

llama3.2:8b is not available on Ollama cloud. gemma3:4b is small, fast,
and suitable for LLM-as-judge evaluation tasks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Changelog: complete v2.2.3 entry with upgrade instructions including
  scheduler build, score config setup, and provider configuration
- Langfuse docs: add LLM-as-Judge evaluation pipeline section with
  evaluator table, config reference, provider auto-detection, and
  manual evaluation commands. Add evaluator-scheduler to Docker services

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Matches evaluator_config.yaml change. Updates dataclass default and
test assertions from llama3.2:8b to gemma3:4b.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test_evaluator_provider.py and test_evaluator_runner.py fixture
still had hardcoded llama3.2:8b model name assertions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Hidden-History Hidden-History merged commit ab5fb89 into main Mar 15, 2026
13 checks passed
