
feat: anonymizer measurement instrumentation and benchmark tooling #177

Merged: binaryaaron merged 26 commits into main from binaryaaron/perf-epic on Jun 12, 2026
@binaryaaron binaryaaron commented Jun 3, 2026

Collaborator

Summary

  • Adds in-repo measurement sessions, stage timers, DataDesigner workflow metrics, direct model workflow metrics, and sanitized per-record safety metrics.
  • Records sanitized evaluation_record rows when benchmark replace configs set evaluate: true, preserving judge verdict booleans and invalid-item counts without persisting evaluated dataframes, raw judge traces, prompts, entity values, or replacement strings.
  • Refactors measurement internals into the anonymizer.measurement package while preserving the public anonymizer.measurement import surface.
  • Adds the core benchmark runner for repeatable workloads: suite preflight validation, row slicing, case retries, per-case raw measurement shards, combined measurements.jsonl, table export, detection-artifact sidecars, raw DataDesigner message trace capture, and sanitized DataDesigner scheduler task traces.
  • Uses DataDesigner native LLM column tracing for standard LLM columns, adds a temporary Anonymizer private model-registry/facade shim for model-backed CustomColumnConfig traces, and emits safe dd_trace_coverage records that show native, private-facade, and unsupported coverage.
  • Adds the front-door measurement table exporter plus first-order benchmark and detection-artifact analysis tools, including case/group rollups for sanitized replace judge evaluation rows.
  • Factors shared measurement-tool support into tools/measurement/measurement_tools/ for CLI logging, export formats, table writing, manifests, and small aggregation helpers. Scripts keep their row models and metric semantics local.
  • Documents the in-repo observability system and puts the measurements.jsonl to Parquet/CSV/JSONL workflow front and center in the measurement tool README.
  • Refactors benchmark-output tests to use checked-in measurement fixtures, and uses the repo synthetic biography sample for valid benchmark workload tests.
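
To illustrate the sanitized evaluation_record idea from the summary above, here is a minimal sketch. It is not the code in src/anonymizer/measurement/records/row.py: the `EvaluationRecord` class, the `sanitize_judge_result` helper, and the shape of the raw judge dictionary (`verdicts`, `invalid_items`) are all assumptions made for illustration. The point is only that booleans and counts survive while free text is dropped.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class EvaluationRecord:
    """Hypothetical sanitized row: verdict booleans and counts only."""

    case_id: str
    row_index: int
    verdicts: dict[str, bool]
    invalid_item_count: int


def sanitize_judge_result(case_id: str, row_index: int, raw: dict[str, Any]) -> EvaluationRecord:
    # Keep only boolean verdicts. Any string value (judge reasoning,
    # entity values, prompts) is dropped rather than copied.
    verdicts = {k: v for k, v in raw.get("verdicts", {}).items() if isinstance(v, bool)}
    # Persist how many items were invalid, never the items themselves.
    invalid_item_count = len(raw.get("invalid_items", []))
    return EvaluationRecord(case_id, row_index, verdicts, invalid_item_count)


if __name__ == "__main__":
    raw = {
        "verdicts": {"leak_free": True, "reasoning": "mentions a name"},
        "invalid_items": ["item-a"],
        "prompt": "not persisted",
    }
    print(asdict(sanitize_judge_result("case-1", 0, raw)))
```

Filtering on `isinstance(v, bool)` is a deliberately strict allow-list: anything that is not a plain boolean is treated as potentially sensitive.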

This PR intentionally does not carry the larger derivative strategy/probe/comparison tools. Those are split to a stacked follow-up so this PR stays focused on measurement capture, export, and the basic benchmark harness.

Stack

Alignment

Distributed DataDesigner execution is outside this PR. Detection export APIs, such as the work in #182, should build configs for external runtimes; the measurement tools here should consume the resulting measurement JSONL, detection artifacts, and trace sidecars.

Validation

Latest checks after benchmark analysis evaluation rollups:

  • uv run --frozen pytest tests/tools/test_measurement_tools.py tests/tools/test_benchmark_output_analysis.py -q
    • Result: 39 passed, 6 existing DataDesigner model-config deprecation warnings
  • uv run --frozen pytest tests/tools/test_benchmark_output_analysis.py -q
    • Result: 10 passed
  • uv run --frozen ruff check tools/measurement/analyze_benchmark_output.py tests/tools/test_benchmark_output_analysis.py
  • uv run --frozen ruff format --check tools/measurement/analyze_benchmark_output.py tests/tools/test_benchmark_output_analysis.py
  • uv run tools/codestyle/format.sh --check
  • tynav --ty-bin /root/.local/share/uv/tools/ty/bin/ty diagnostics tools/measurement/analyze_benchmark_output.py
    • Result: no diagnostics for the edited file; existing pyproject.toml deprecated tool.ty.src.root warning remains
  • git diff --check

Earlier checks after sanitized evaluation metrics:

  • uv run --frozen pytest tests/test_measurement.py tests/tools/test_measurement_tools.py -q
    • Result: 75 passed, 9 existing DataDesigner model-config deprecation warnings
  • uv run --frozen ruff check src/anonymizer/measurement/records/row.py src/anonymizer/measurement/__init__.py tools/measurement/run_benchmarks.py tests/tools/test_measurement_tools.py
  • uv run --frozen ruff format --check src/anonymizer/measurement/records/row.py src/anonymizer/measurement/__init__.py tools/measurement/run_benchmarks.py tests/tools/test_measurement_tools.py
  • tynav --ty-bin /root/.local/share/uv/tools/ty/bin/ty diagnostics src/anonymizer/measurement/records/row.py
    • Result: no diagnostics for the edited file; existing pyproject.toml deprecated tool.ty.src.root warning remains
  • git diff --check

Earlier checks after the custom-column DD trace shim:

  • uv run --frozen pytest tests/test_measurement.py tests/tools/test_measurement_tools.py tests/tools/test_benchmark_output_analysis.py -q
    • Result: 71 passed, 6 existing DataDesigner model-config deprecation warnings
  • uv run --frozen ruff check src/anonymizer/engine/ndd/adapter.py src/anonymizer/measurement/sinks.py tests/test_measurement.py
  • uv run --frozen ruff format --check src/anonymizer/engine/ndd/adapter.py src/anonymizer/measurement/sinks.py tests/test_measurement.py
  • uv run tools/codestyle/format.sh --check
  • git diff --check

Earlier checks after the split:

  • uv run --frozen pytest tests/test_measurement.py tests/engine/test_ndd_adapter.py tests/tools/test_measurement_tools.py tests/tools/test_benchmark_output_analysis.py tests/tools/test_detection_artifact_analysis.py -q
    • Result: 82 passed, 9 existing DataDesigner model-config deprecation warnings
  • uv run --frozen ruff check tools/measurement/run_benchmarks.py tools/measurement/export_measurements.py tools/measurement/analyze_benchmark_output.py tools/measurement/analyze_detection_artifacts.py tools/measurement/measurement_tools tests/tools/test_measurement_tools.py tests/tools/test_benchmark_output_analysis.py tests/tools/test_detection_artifact_analysis.py
  • uv run tools/codestyle/format.sh --check
  • git diff --cached --check
  • CLI smoke:
    • uv run python tools/measurement/run_benchmarks.py --help
    • uv run python tools/measurement/export_measurements.py --help
    • uv run python tools/measurement/analyze_benchmark_output.py --help
    • uv run python tools/measurement/analyze_detection_artifacts.py --help

Earlier branch checks also covered docs build, benchmark dry-run validation for tools/measurement/examples/repo-data-smoke.yaml, shell syntax validation for the DD-trace smoke script, and tynav diagnostics on the measurement module.

Dogfood

Dogfood with local vLLM endpoint:

  • Endpoint: http://nemotron-3-super-h100-svc.aagonzales-dev.svc.cluster.local:8000/v1
  • Model: nvidia/nemotron-3-super
  • Output: /tmp/anonymizer-dogfood-mini-h100
  • Result: 2/2 cases completed, 0 errors
  • biographies__biographies-redact-default__r000: ~12.0s, 21 final entities
  • legal__legal-hash-agent-labels__r000: ~9.1s, 14 final entities
  • measurements.jsonl: 14 records
  • task traces: 10 records per case
  • table export: 5 tables from 14 records (run, dd_trace_coverage, ndd_workflow, stage, record)
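
The "5 tables from 14 records" step can be pictured as grouping JSONL rows by their record type. The sketch below is an assumption-laden illustration, not the exporter in tools/measurement/export_measurements.py: the `group_records_by_type` helper and the `record_type` field name are hypothetical.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_records_by_type(
    lines: Iterable[str], type_field: str = "record_type"
) -> dict[str, list[dict]]:
    """Group JSONL rows into one list per record type (field name is assumed)."""
    tables: dict[str, list[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        tables[row.get(type_field, "unknown")].append(row)
    return dict(tables)


if __name__ == "__main__":
    sample = [
        '{"record_type": "run", "case": "biographies"}',
        '{"record_type": "stage", "stage": "detect"}',
        '{"record_type": "stage", "stage": "replace"}',
    ]
    for name, rows in group_records_by_type(sample).items():
        print(name, len(rows))
```

Each resulting list could then be written out as its own Parquet, CSV, or JSONL table.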

Notes

  • Checked-in benchmark suites use repository-relative paths and env-backed runtime config; they avoid machine-specific endpoints and absolute local paths.
  • DD message traces, raw case shards, and DataDesigner artifacts are sensitive run artifacts and may contain prompts, model outputs, secrets, or PII.
  • DataDesigner scheduler task traces are sanitized timing sidecars. They include queue/execution/total durations and error presence, but intentionally omit raw DD error strings.
  • Optional replace judge evaluation metrics are sanitized into evaluation_record rows. They include verdict booleans and invalid-item counts, but not original text, entity values, replacement values, raw judge outputs, prompts, or model responses.
  • Standard DD message tracing uses native DataDesigner trace side effects for LLMTextColumnConfig and LLMStructuredColumnConfig.
  • Model-backed CustomColumnConfig traces currently use a temporary Anonymizer shim that instruments the per-run private DataDesigner model registry and returned model facades. This is intentionally documented as brittle and should be replaced by a public DataDesigner model-call trace sink. No DataDesigner issue or PR has been opened from this PR.
  • Runtime timings, retry counts, token counts, and local endpoint names are environment-dependent and should not be treated as portable fixtures.
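
As a rough picture of what a sanitized scheduler task trace might hold, the following sketch keeps durations and an error-presence flag and discards the error text. The `TaskTrace` class, the `sanitize_task_event` helper, and the timestamp arguments are hypothetical; the real sidecar schema may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskTrace:
    """Hypothetical timing sidecar row with no raw error strings."""

    task_id: str
    queue_sec: float
    execution_sec: float
    total_sec: float
    has_error: bool


def sanitize_task_event(
    task_id: str,
    enqueued_at: float,
    started_at: float,
    finished_at: float,
    error: Optional[BaseException],
) -> TaskTrace:
    return TaskTrace(
        task_id=task_id,
        queue_sec=started_at - enqueued_at,
        execution_sec=finished_at - started_at,
        total_sec=finished_at - enqueued_at,
        # Only record that an error happened; the message may contain PII.
        has_error=error is not None,
    )


if __name__ == "__main__":
    print(sanitize_task_event("t1", 1.0, 1.5, 4.0, ValueError("raw text stays out")))
```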

Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com>
@binaryaaron binaryaaron changed the title from "Add anonymizer measurement instrumentation and benchmark tooling" to "feat: anonymizer measurement instrumentation and benchmark tooling" Jun 3, 2026
@binaryaaron binaryaaron marked this pull request as ready for review June 8, 2026 21:15
@binaryaaron binaryaaron requested review from a team as code owners June 8, 2026 21:15
@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

This PR introduces a comprehensive measurement and benchmarking system for the Anonymizer project. It adds a new anonymizer.measurement package with session management, streaming JSONL sinks, per-record metrics, and sanitized evaluation capture, plus a full benchmark runner (run_benchmarks.py) that orchestrates repeatable workloads with per-case retries, detection-artifact sidecars, and DataDesigner trace shims.

  • Measurement core: MeasurementCollector with HMAC-keyed record hashing, ContextVar-backed session propagation, streaming vs. batch write modes, and three separate sinks (records, DD message traces, DD task traces) isolated to protect against PII leakage.
  • NDD adapter instrumentation: Instance-level _DataDesignerUsageProbe that patches _create_resource_provider, wraps ModelFacade methods per-instance, and captures native DD column traces — replacing the class-level monkey-patching flagged in a previous review.
  • Benchmark runner: Suite YAML spec validation, workload row-slicing, case retry loop, combined measurements.jsonl, table export (Parquet/CSV/JSONL), and detection-artifact analysis sidecars.
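
Two of the mechanisms named above, HMAC-keyed record hashing and ContextVar-backed session propagation, can be sketched with the standard library alone. This is not the MeasurementCollector implementation: the `Collector` class, `measurement_session`, and `current_collector` names are hypothetical stand-ins chosen for the example.

```python
import hashlib
import hmac
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Any, Iterator, Optional


class Collector:
    """Hypothetical collector that stores keyed hashes instead of raw text."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self.records: list[dict[str, Any]] = []

    def record_id(self, text: str) -> str:
        # A keyed hash: stable within a run, not reversible by hashing
        # guessed inputs unless the key is also known.
        return hmac.new(self._key, text.encode("utf-8"), hashlib.sha256).hexdigest()

    def record(self, text: str, **metrics: Any) -> None:
        self.records.append({"record_id": self.record_id(text), **metrics})


_active: ContextVar[Optional[Collector]] = ContextVar("active_collector", default=None)


@contextmanager
def measurement_session(key: bytes) -> Iterator[Collector]:
    collector = Collector(key)
    token = _active.set(collector)
    try:
        yield collector
    finally:
        _active.reset(token)  # restore whatever was active before


def current_collector() -> Optional[Collector]:
    """Lets deep call sites find the session without threading it through arguments."""
    return _active.get()


if __name__ == "__main__":
    with measurement_session(b"per-run-secret") as collector:
        current_collector().record("some sensitive row text", entity_count=2)
    print(collector.records)
```

A ContextVar, unlike a module-level global, gives each asyncio task or copied context its own view, which is presumably why the PR uses it for session propagation.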

Confidence Score: 5/5

Safe to merge. The measurement instrumentation is opt-in and observability-only; no core anonymization paths are altered.

The class-level monkey-patching race and enum value mismatch flagged in earlier reviews are both resolved. The new _DataDesignerUsageProbe patches at the resource_provider and ModelFacade instance level, with patches properly restored in reverse order. ContextVar propagation, streaming sink thread-safety, and error-priority ordering in configured_measurement_session are all correct. The two findings are a minor error-drop in close() and a documentation gap on a placeholder function, neither of which affects measurement correctness.
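
The difference between class-level and instance-level patching, and the reverse-order restore, can be shown in a few lines. This sketch does not reproduce _DataDesignerUsageProbe: `patched_attr`, `counting`, and the toy `Facade` class are hypothetical, and `contextlib.ExitStack` is used here simply because it unwinds in last-in, first-out order.

```python
from contextlib import ExitStack, contextmanager
from typing import Any, Callable, Iterator, Optional


@contextmanager
def patched_attr(
    obj: Any, name: str, replacement: Callable[..., Any], log: Optional[list[str]] = None
) -> Iterator[None]:
    """Patch one attribute on one instance and undo it on exit."""
    had_own = name in vars(obj)
    original = vars(obj).get(name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        if had_own:
            setattr(obj, name, original)
        else:
            delattr(obj, name)  # fall back to the class attribute, if any
        if log is not None:
            log.append(name)


class Facade:
    """Toy stand-in for a model facade."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


def counting(facade: Facade, calls: list[int]) -> Callable[[str], str]:
    original = facade.generate  # bound method captured before patching

    def wrapper(prompt: str) -> str:
        calls.append(len(prompt))
        return original(prompt)

    return wrapper


if __name__ == "__main__":
    a, b = Facade(), Facade()
    calls: list[int] = []
    restored: list[str] = []
    with ExitStack() as stack:
        stack.enter_context(patched_attr(a, "generate", counting(a, calls), restored))
        stack.enter_context(patched_attr(a, "extra", lambda: None, restored))
        a.generate("hi")      # counted
        b.generate("hello")   # other instance is untouched
    print(calls, restored)
```

Because only instance `a` is patched, a concurrent run using a different instance never sees the wrapper, which is the property the class-level approach lacked.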

No files require special attention beyond the minor close() error-drop in src/anonymizer/measurement/collector.py.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/anonymizer/measurement/collector.py | New MeasurementCollector with HMAC-keyed record hashing, three separate sinks, and streaming support. Minor: close() only preserves the first error when multiple sinks fail to close. |
| src/anonymizer/measurement/session.py | ContextVar-backed session management with correct error propagation: body errors take priority over write/close errors, write errors take priority over close errors. |
| src/anonymizer/engine/ndd/adapter.py | Large instrumentation layer added: instance-level _DataDesignerUsageProbe replacing the class-level patching flagged in a prior review, _DDMessageTracePlan for native column tracing, and _temporary_dd_task_trace. RLock serialises concurrent run_workflow calls on the same adapter instance. |
| src/anonymizer/measurement/records/row.py | Per-row record and evaluation metric capture. Evaluation records preserve only verdict booleans and invalid-item counts, correctly excluding raw text and entity values. |
| tools/measurement/run_benchmarks.py | Full benchmark runner with suite validation, row slicing, retry loop, per-case streaming measurement, combined JSONL output, table export, and detection-artifact sidecars. CLI flags correctly match DDTraceMode enum values. |
| .github/workflows/benchmark-ci.yml | Workflow_dispatch CI with correct enum choices (last_message/all_messages matching DDTraceMode), proper secret gating, always-run summary/upload steps, and appropriate self-hosted runner config. |
| src/anonymizer/measurement/sinks.py | Thread-safe line-buffered JSONL streaming sink with Lock, plus batch JSONL/JSON writers. Parent directory creation on init ensures directories exist before streaming starts. |
| src/anonymizer/measurement/recorders.py | stage_timer yields a mutable dict so callers can inject output_row_count/failed_record_count after the timed block; the finally clause then spreads those updates into the measurement record correctly. |
| src/anonymizer/engine/replace/llm_replace_workflow.py | Adds synthetic-original collision repair for replacement maps and a new COL_REPLACEMENT_MAP_SOURCE column; collision candidates are generated via an index-incrementing loop with a protected-values guard. |
| src/anonymizer/measurement/_coerce.py | Coercion helpers for JSON-safe output, text token counting (tiktoken with word-count fallback), and size bucketing. Buckets are non-overlapping and correctly labelled. |
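
The stage_timer pattern described for recorders.py (yield a mutable dict, then merge it into the record in a finally clause) might look roughly like the sketch below. The signature is an assumption: the real function's parameters and record fields are not shown in this conversation, and the `records` list stands in for whatever sink the collector uses.

```python
import time
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def stage_timer(records: list[dict[str, Any]], stage: str) -> Iterator[dict[str, Any]]:
    """Time a block and let the caller attach counts discovered inside it."""
    updates: dict[str, Any] = {}
    start = time.perf_counter()
    try:
        yield updates
    finally:
        # Runs on success and on error, so failed stages are still measured.
        records.append(
            {"stage": stage, "elapsed_sec": time.perf_counter() - start, **updates}
        )


if __name__ == "__main__":
    records: list[dict[str, Any]] = []
    with stage_timer(records, "detect") as info:
        rows = ["a", "b", "c"]  # placeholder for real stage work
        info["output_row_count"] = len(rows)
    print(records)
```

Yielding a dict rather than returning a value lets the caller report counts that are only known after the timed work has run.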

Sequence Diagram

sequenceDiagram
    participant CLI as run_benchmarks.py
    participant Session as configured_measurement_session
    participant Collector as MeasurementCollector
    participant Anon as Anonymizer._run_internal
    participant NDD as NddAdapter.run_workflow
    participant Probe as _DataDesignerUsageProbe
    participant Sink as _JsonlMeasurementSink

    CLI->>Session: "MeasurementConfig(streaming=True)"
    Session->>Sink: open JSONL sinks
    Session->>Collector: create collector
    Session-->>CLI: yield collector

    CLI->>Anon: anonymizer.run(config, data)
    Anon->>Collector: record_run_metadata()
    Anon->>NDD: run_workflow(...)
    NDD->>Probe: patch _create_resource_provider
    NDD->>NDD: acquire _run_lock
    NDD->>NDD: DataDesigner.create
    Probe->>Probe: wrap ModelFacade per instance
    NDD->>Collector: record_ndd_workflow()
    NDD->>Probe: flush_private_trace_records()
    Probe->>Collector: record_dd_message_trace()
    Anon->>Collector: record_record_metrics() per row

    opt config.evaluate
        CLI->>Collector: record_evaluation_metrics()
    end

    Session->>Collector: close() all sinks
    CLI->>CLI: combine_measurements()
    CLI->>CLI: export_measurement_tables()

Reviews (19): Last reviewed commit: "Add evaluation rollups to benchmark anal..."

Comment thread src/anonymizer/measurement.py Outdated
Comment thread src/anonymizer/engine/ndd/adapter.py Outdated
Comment thread src/anonymizer/engine/ndd/adapter.py Outdated
Comment thread tools/measurement/run_benchmarks.py
Comment thread tools/measurement/run_benchmarks.py Outdated
Comment thread src/anonymizer/measurement.py Outdated
@andreatgretel

Collaborator

This overlaps with my benchmark CI PR #162 enough that I’m happy to close mine and let this be the main benchmark tooling direction.

The one thing I’d want to preserve from #162 is the CI/workflow shape: a manual GitHub Actions workflow, NVIDIA_API_KEY setup, benchmark artifact upload, and a step summary. This PR has the more complete runner/measurement stack, so it probably makes sense to retire my bespoke scripts/benchmark_ci.py path and have the workflow call tools/measurement/run_benchmarks.py instead.

Comment thread src/anonymizer/engine/ndd/adapter.py Outdated
Comment thread src/anonymizer/engine/ndd/adapter.py Outdated
@andreatgretel

Collaborator

One more related follow-up, since GitHub does not let me anchor this on the unchanged chunked_validation._dispatch_chunk() call site in this PR diff:

_dispatch_chunk() was the main blind spot in the offline replace-mode async profile. On the biographies sample, each measured row made one validation call through that path, and those calls dominated pipeline wall time.

As an incremental path before a general DD model-call hook exists, it would be useful to record a sanitized validator_chunk_model_call event around the facade.generate() call there. Useful fields would be alias, chunk_index, attempt_index, elapsed_sec, ok, error_type, and prompt_char_count for the final prompt after PydanticResponseRecipe is applied. Token usage does not appear to be available from facade.generate() directly, so the workflow-level aggregate usage probe can keep covering that part.
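
A sketch of what such a sanitized event could look like follows. It is only an illustration of the proposal: `timed_generate` and the `emit` callback are hypothetical, the `generate` callable stands in for facade.generate() (whose real signature is not shown here), and the event carries the fields listed above without the prompt text or the error message.

```python
import time
from typing import Any, Callable, Optional


def timed_generate(
    generate: Callable[[str], Any],
    prompt: str,
    *,
    alias: str,
    chunk_index: int,
    attempt_index: int,
    emit: Callable[[dict[str, Any]], None],
) -> Any:
    """Call generate(prompt) and emit one sanitized event, success or failure."""
    start = time.perf_counter()
    ok = False
    error_type: Optional[str] = None
    try:
        result = generate(prompt)
        ok = True
        return result
    except Exception as exc:
        error_type = type(exc).__name__  # class name only, never the message
        raise
    finally:
        emit(
            {
                "event": "validator_chunk_model_call",
                "alias": alias,
                "chunk_index": chunk_index,
                "attempt_index": attempt_index,
                "elapsed_sec": time.perf_counter() - start,
                "ok": ok,
                "error_type": error_type,
                "prompt_char_count": len(prompt),  # size only, not content
            }
        )


if __name__ == "__main__":
    events: list[dict[str, Any]] = []
    timed_generate(
        lambda p: p[::-1],  # placeholder model call
        "final prompt text",
        alias="validator",
        chunk_index=0,
        attempt_index=0,
        emit=events.append,
    )
    print(events)
```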

Comment thread tools/measurement/run_benchmarks.py
Comment thread src/anonymizer/engine/ndd/adapter.py Outdated
Comment thread tools/measurement/run_benchmarks.py
Comment thread tools/measurement/examples/repo-data-smoke.yaml
@lipikaramaswamy

Collaborator

This is awesome! Thank you @binaryaaron for setting up measurement 🤩

Took me a while, but I reviewed the measurement core, benchmark runner, exporters/analyzers, benchmark CI workflow, and the Anonymizer/NDD instrumentation. I also tried running the smoke suite locally. Some notes from that are below:

The dry-run passed and planned 2 cases. Running the default smoke against build/integrate completed the biographies/redact case and produced the expected measurement outputs: run metadata, stage timings, NDD workflow request/token usage, tokens/sec, and per-record entity/replacement counts.

The legal/hash case initially failed on a transient openai/gpt-oss-120b health-check rate limit. I reran that case with explicit build provider/model config, skip_health_check: true, and lower model parallelism, and it completed successfully:

  • 5/5 rows completed
  • ~69s elapsed
  • observed_total_tokens: 38168
  • observed_tokens_per_sec: ~553
  • outputs included measurements.jsonl and normalized parquet tables for run, ndd_workflow, stage, and record

A few things I think would be good to have before merge:

  1. Can we add runner-level support for arbitrary run_tags? I left an inline comment on this, but for our GitLab flow we’ll want to stamp metadata like anonymizer_ref, commit_sha, benchmark_suite_ref, benchmark_suite_commit_sha, and pipeline_id into every measurement row.

  2. Can we make provider/model selection explicit in the benchmark docs/examples? Model IDs are provider-specific: build/integrate uses names like openai/gpt-oss-120b, while internal inference uses names like nvidia/openai/gpt-oss-120b, and other providers can use entirely different names. I think benchmark suites should treat model_configs and model_providers as a matched pair for reproducibility.

  3. Can we clarify the recommended health-check behavior for benchmark/smoke runs? The smoke can fail before producing useful measurements if a provider health check hits rate limits. It would be helpful to document when benchmark suites should use skip_health_check, retries/backoff, and lower parallelism, perhaps in AGENTS.md.

  4. Can we add an optional evaluate step to the runner if quality benchmarking is intended to be in scope? Right now the runner measures anonymizer.run(...), but does not call Anonymizer.evaluate(...), so LLM judge quality metrics are not produced or stored by the benchmark runner.

  5. Can we document the emitted record types and key fields? The outputs are useful, but downstream consumers will need a stable contract for run, ndd_workflow, stage, and record rows.
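
For item 1, stamping run_tags into every measurement row could be as simple as the sketch below. It is an assumption about shape, not the runner's behavior: `parse_run_tags`, `stamp_run_tags`, the KEY=VALUE input format, and the nested `run_tags` field are all hypothetical.

```python
from typing import Any, Iterable, Iterator, Mapping


def parse_run_tags(pairs: Iterable[str]) -> dict[str, str]:
    """Parse KEY=VALUE strings, for example from repeated CLI arguments."""
    tags: dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        tags[key] = value
    return tags


def stamp_run_tags(
    rows: Iterable[Mapping[str, Any]], run_tags: Mapping[str, str]
) -> Iterator[dict[str, Any]]:
    """Yield copies of each row with the same run_tags attached."""
    for row in rows:
        yield {**row, "run_tags": dict(run_tags)}


if __name__ == "__main__":
    tags = parse_run_tags(["commit_sha=abc123", "pipeline_id=42"])
    rows = [{"record_type": "run"}, {"record_type": "stage"}]
    for stamped in stamp_run_tags(rows, tags):
        print(stamped)
```

Keeping the tags under one nested key, rather than merging them into the top level, avoids collisions with existing measurement fields.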

Comment thread src/anonymizer/engine/ndd/adapter.py Outdated
@binaryaaron

Collaborator Author

Thanks for the detailed run notes. I addressed the merge-blocking items in 3afb44c: runner-level run_tags, explicit provider/model examples and docs, optional replace-mode evaluate, and documentation for emitted record types/key fields. The runner already has retries/backoff, and the example provider config shows skip_health_check so smoke runs can avoid provider health-check rate-limit noise when appropriate.

@binaryaaron

Collaborator Author

Agreed that _dispatch_chunk() remains an important observability gap. I did not add a dedicated validator_chunk_model_call event in this PR because the current branch now leans on DD/native/private model-call tracing where possible, and the validator chunk path deserves a small focused follow-up. I would keep the proposed sanitized fields: alias, chunk index, attempt index, elapsed time, ok/error type, and final prompt character count.

Comment thread .github/workflows/benchmark-ci.yml
Comment thread src/anonymizer/engine/ndd/adapter.py
Comment thread tools/measurement/run_benchmarks.py

@lipikaramaswamy lipikaramaswamy left a comment

Collaborator


Looks great, just a small comment on the analysis tables :) Thanks!!

Comment thread tools/measurement/analyze_benchmark_output.py
@binaryaaron binaryaaron merged commit f6dd05d into main Jun 12, 2026
13 checks passed
@binaryaaron binaryaaron deleted the binaryaaron/perf-epic branch June 12, 2026 20:54
3 participants