feat: anonymizer measurement instrumentation and benchmark tooling #177
Signed-off-by: Aaron Gonzales <aagonzales@nvidia.com>
Greptile Summary

This PR introduces a comprehensive measurement and benchmarking system for the Anonymizer project. It adds a new
Confidence Score: 5/5

Safe to merge. The measurement instrumentation is opt-in and observability-only; no core anonymization paths are altered. The class-level monkey-patching race and enum value mismatch flagged in earlier reviews are both resolved. The new `_DataDesignerUsageProbe` patches at the `resource_provider` and `ModelFacade` instance level, with patches properly restored in reverse order. `ContextVar` propagation, streaming sink thread-safety, and error-priority ordering in `configured_measurement_session` are all correct. The two findings are a minor error-drop in `close()` and a documentation gap on a placeholder function, neither of which affects measurement correctness.

No files require special attention beyond the minor `close()` error-drop in `src/anonymizer/measurement/collector.py`.
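To illustrate the instance-level patching with reverse-order restore mentioned in the summary, here is a minimal sketch. It is not the PR's `_DataDesignerUsageProbe`; `FakeFacade`, `patched_attr`, and `make_counting_wrapper` are hypothetical names used only for illustration. It assumes the general pattern is "set an attribute on one instance, then undo patches last-in, first-out", which `contextlib.ExitStack` provides.

```python
from contextlib import ExitStack, contextmanager


class FakeFacade:
    """Hypothetical stand-in for a model facade; not the real DataDesigner class."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


@contextmanager
def patched_attr(obj, name, replacement):
    """Patch one attribute on a single instance and restore the prior state on exit."""
    had_own = name in vars(obj)      # was there already an instance-level attribute?
    previous = vars(obj).get(name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        if had_own:
            setattr(obj, name, previous)
        else:
            delattr(obj, name)       # fall back to the class-level method


def make_counting_wrapper(original, calls):
    """Wrap a callable so that each prompt is recorded before delegating."""
    def wrapper(prompt):
        calls.append(prompt)
        return original(prompt)
    return wrapper


facade = FakeFacade()
untouched = FakeFacade()             # other instances are never affected
first_calls, second_calls = [], []

with ExitStack() as stack:
    # Two stacked patches on the same attribute; ExitStack undoes them in LIFO order,
    # so the second patch is removed before the first.
    stack.enter_context(
        patched_attr(facade, "generate", make_counting_wrapper(facade.generate, first_calls))
    )
    stack.enter_context(
        patched_attr(facade, "generate", make_counting_wrapper(facade.generate, second_calls))
    )
    result = facade.generate("hi")
    untouched.generate("not recorded")
```

Because the patch lives on the instance rather than the class, concurrent runs with their own facades do not see each other's wrappers, which is the property the class-level approach lacked.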
This overlaps with my benchmark CI PR #162 enough that I’m happy to close mine and let this be the main benchmark tooling direction. The one thing I’d want to preserve from #162 is the CI/workflow shape: a manual GitHub Actions workflow, |
One more related follow-up, since GitHub does not let me anchor this on the unchanged
As an incremental path before a general DD model-call hook exists, it would be useful to record a sanitized |
This is awesome! Thank you @binaryaaron for setting up measurement 🤩

Took me a while, but I reviewed the measurement core, benchmark runner, exporters/analyzers, benchmark CI workflow, and the Anonymizer/NDD instrumentation. I also tried running the smoke suite locally. Some notes from that below:

The dry-run passed and planned 2 cases. Running the default smoke against build/integrate completed the biographies/redact case and produced the expected measurement outputs: run metadata, stage timings, NDD workflow request/token usage, tokens/sec, and per-record entity/replacement counts. The legal/hash case initially failed on a transient
A few things I think would be good to have before merge:
Thanks for the detailed run notes. I addressed the merge-blocking items in |
Agreed that |
lipikaramaswamy
left a comment
Looks great, just small comment on the analysis tables :) Thanks!!
Summary
- `evaluation_record` rows when benchmark replace configs set `evaluate: true`, preserving judge verdict booleans and invalid-item counts without persisting evaluated dataframes, raw judge traces, prompts, entity values, or replacement strings.
- `anonymizer.measurement` package while preserving the public `anonymizer.measurement` import surface.
- `measurements.jsonl`, table export, detection-artifact sidecars, raw DataDesigner message trace capture, and sanitized DataDesigner scheduler task traces.
- `CustomColumnConfig` traces, and emits safe `dd_trace_coverage` records that show native, private-facade, and unsupported coverage.
- `tools/measurement/measurement_tools/` for CLI logging, export formats, table writing, manifests, and small aggregation helpers. Scripts keep their row models and metric semantics local.
- `measurements.jsonl` to Parquet/CSV/JSONL workflow front and center in the measurement tool README.

This PR intentionally does not carry the larger derivative strategy/probe/comparison tools. Those are split to a stacked follow-up so this PR stays focused on measurement capture, export, and the basic benchmark harness.
Stack
Alignment
Distributed DataDesigner execution is outside this PR. Detection export APIs, such as the work in #182, should build configs for external runtimes; the measurement tools here should consume the resulting measurement JSONL, detection artifacts, and trace sidecars.
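As a rough illustration of what consuming measurement JSONL can look like, the sketch below tallies rows by record kind. The field name `record_type` and the sample rows are assumptions made for this example and are not taken from the PR's schema; only the record kind names (`run`, `stage`, `record`) appear in this PR's dogfood output.

```python
import io
import json
from collections import Counter


def count_record_types(lines, type_field="record_type"):
    """Count JSONL rows per record kind, skipping blank lines.

    `type_field` is an assumed key name; adjust it to the real schema.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        counts[row.get(type_field, "<missing>")] += 1
    return counts


# In-memory stand-in for a measurements.jsonl file (made-up rows).
sample = io.StringIO(
    '{"record_type": "run"}\n'
    '{"record_type": "stage", "seconds": 1.5}\n'
    '{"record_type": "stage", "seconds": 0.5}\n'
    "\n"
    '{"record_type": "record", "final_entities": 21}\n'
)

counts = count_record_types(sample)
for kind, n in sorted(counts.items()):
    print(f"{kind}: {n}")
```

The same function works on a real file by passing an open file handle instead of the `StringIO` object.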
Validation
Latest checks after benchmark analysis evaluation rollups:
- `uv run --frozen pytest tests/tools/test_measurement_tools.py tests/tools/test_benchmark_output_analysis.py -q`
- `uv run --frozen pytest tests/tools/test_benchmark_output_analysis.py -q`
- `uv run --frozen ruff check tools/measurement/analyze_benchmark_output.py tests/tools/test_benchmark_output_analysis.py`
- `uv run --frozen ruff format --check tools/measurement/analyze_benchmark_output.py tests/tools/test_benchmark_output_analysis.py`
- `uv run tools/codestyle/format.sh --check`
- `tynav --ty-bin /root/.local/share/uv/tools/ty/bin/ty diagnostics tools/measurement/analyze_benchmark_output.py`
  - `pyproject.toml` deprecated `tool.ty.src.root` warning remains
- `git diff --check`

Earlier checks after sanitized evaluation metrics:

- `uv run --frozen pytest tests/test_measurement.py tests/tools/test_measurement_tools.py -q`
- `uv run --frozen ruff check src/anonymizer/measurement/records/row.py src/anonymizer/measurement/__init__.py tools/measurement/run_benchmarks.py tests/tools/test_measurement_tools.py`
- `uv run --frozen ruff format --check src/anonymizer/measurement/records/row.py src/anonymizer/measurement/__init__.py tools/measurement/run_benchmarks.py tests/tools/test_measurement_tools.py`
- `tynav --ty-bin /root/.local/share/uv/tools/ty/bin/ty diagnostics src/anonymizer/measurement/records/row.py`
  - `pyproject.toml` deprecated `tool.ty.src.root` warning remains
- `git diff --check`

Earlier checks after the custom-column DD trace shim:

- `uv run --frozen pytest tests/test_measurement.py tests/tools/test_measurement_tools.py tests/tools/test_benchmark_output_analysis.py -q`
- `uv run --frozen ruff check src/anonymizer/engine/ndd/adapter.py src/anonymizer/measurement/sinks.py tests/test_measurement.py`
- `uv run --frozen ruff format --check src/anonymizer/engine/ndd/adapter.py src/anonymizer/measurement/sinks.py tests/test_measurement.py`
- `uv run tools/codestyle/format.sh --check`
- `git diff --check`

Earlier checks after the split:

- `uv run --frozen pytest tests/test_measurement.py tests/engine/test_ndd_adapter.py tests/tools/test_measurement_tools.py tests/tools/test_benchmark_output_analysis.py tests/tools/test_detection_artifact_analysis.py -q`
- `uv run --frozen ruff check tools/measurement/run_benchmarks.py tools/measurement/export_measurements.py tools/measurement/analyze_benchmark_output.py tools/measurement/analyze_detection_artifacts.py tools/measurement/measurement_tools tests/tools/test_measurement_tools.py tests/tools/test_benchmark_output_analysis.py tests/tools/test_detection_artifact_analysis.py`
- `uv run tools/codestyle/format.sh --check`
- `git diff --cached --check`
- `uv run python tools/measurement/run_benchmarks.py --help`
- `uv run python tools/measurement/export_measurements.py --help`
- `uv run python tools/measurement/analyze_benchmark_output.py --help`
- `uv run python tools/measurement/analyze_detection_artifacts.py --help`

Earlier branch checks also covered docs build, benchmark dry-run validation for `tools/measurement/examples/repo-data-smoke.yaml`, shell syntax validation for the DD-trace smoke script, and `tynav` diagnostics on the measurement module.

Dogfood
Dogfood with local vLLM endpoint:
- `http://nemotron-3-super-h100-svc.aagonzales-dev.svc.cluster.local:8000/v1`
- `nvidia/nemotron-3-super`
- `/tmp/anonymizer-dogfood-mini-h100`
- `biographies__biographies-redact-default__r000`: ~12.0s, 21 final entities
- `legal__legal-hash-agent-labels__r000`: ~9.1s, 14 final entities
- `measurements.jsonl`: 14 records
- `run`, `dd_trace_coverage`, `ndd_workflow`, `stage`, `record`

Notes

- `evaluation_record` rows. They include verdict booleans and invalid-item counts, but not original text, entity values, replacement values, raw judge outputs, prompts, or model responses.
- `LLMTextColumnConfig` and `LLMStructuredColumnConfig`.
- `CustomColumnConfig` traces currently use a temporary Anonymizer shim that instruments the per-run private DataDesigner model registry and returned model facades. This is intentionally documented as brittle and should be replaced by a public DataDesigner model-call trace sink. No DataDesigner issue or PR has been opened from this PR.
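
The sanitization idea in the first note (keep verdict booleans and invalid-item counts, drop everything textual) can be sketched with an allow-list. This is an illustrative assumption, not the PR's row model: `SanitizedEvaluationRecord`, `sanitize_evaluation`, and every key in `raw_judge_result` are hypothetical names.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SanitizedEvaluationRecord:
    """Hypothetical sanitized row: only booleans and counts, never text."""

    case_id: str
    verdicts: dict
    invalid_item_count: int


def sanitize_evaluation(case_id, judge_result):
    """Build a sanitized row from a raw judge result using an allow-list.

    Only boolean verdicts and the number of invalid items are copied;
    prompts, raw outputs, and entity values are never read into the row.
    """
    verdicts = {
        name: value
        for name, value in judge_result.get("verdicts", {}).items()
        if isinstance(value, bool)
    }
    invalid_items = judge_result.get("invalid_items", [])
    return SanitizedEvaluationRecord(
        case_id=case_id,
        verdicts=verdicts,
        invalid_item_count=len(invalid_items),
    )


# Made-up raw judge result containing fields that must not be persisted.
raw_judge_result = {
    "prompt": "Is 'Jane Doe' still present in the text?",
    "raw_output": "Yes, the name appears in sentence two.",
    "verdicts": {"leak_free": False, "readable": True, "note": "free text"},
    "invalid_items": ["Jane Doe", "Acme Corp"],
}

record = sanitize_evaluation("example-case-r000", raw_judge_result)
row = asdict(record)
```

An allow-list fails closed: a new free-text field added to the judge output later is dropped by default rather than leaking into the measurement file.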