Open
34 commits
856d9b7
add entity detection validation
memadi-nv May 12, 2026
8a8636e
add type fidelity metric
memadi-nv May 12, 2026
6c3ae51
add relational consistency metric
memadi-nv May 12, 2026
7946920
add attribute fidelity metric
memadi-nv May 13, 2026
8776b53
update prompts
memadi-nv May 14, 2026
dfad43b
disp;ay replacement map
memadi-nv May 14, 2026
96c297a
update metric display
memadi-nv May 14, 2026
186d311
more specific prompt
memadi-nv May 14, 2026
63592ad
change judge models for sparce error
memadi-nv May 14, 2026
8997e2c
nit-update namings in metric
memadi-nv May 20, 2026
f203d96
add a toggle for replace evaluation
memadi-nv May 20, 2026
514b1e5
format-nit
memadi-nv May 20, 2026
3e38013
run evaluate judges in parallel
memadi-nv May 21, 2026
f6c7d22
seperate evaluate_replace from preview
memadi-nv May 22, 2026
52c2f3a
merge conflicts
memadi-nv May 22, 2026
133f069
nit-format
memadi-nv May 23, 2026
c04aa0d
nit
memadi-nv May 23, 2026
74735b2
address greptile feedback
memadi-nv May 23, 2026
3ad5929
nit
memadi-nv May 23, 2026
b17b95d
feat: Make anonymizer evaluation mode-independent (#168)
memadi-nv May 26, 2026
e6859a2
address feedback-fixed empty-section render in detection judge.
memadi-nv May 26, 2026
e28ce8d
address feedback-preserve rows when LLM drops some during judge workflow
memadi-nv May 26, 2026
ab54a95
address feedback-refactor: move judge modules under engine/evaluation/
memadi-nv May 26, 2026
5c63ee2
address feedback-refactor(config): split judge aliases into a dedicat…
memadi-nv May 26, 2026
187c3a5
add evaluation as an optional step to replace notebooks
memadi-nv May 27, 2026
da6c682
Update src/anonymizer/interface/display.py
memadi-nv May 27, 2026
6a759f6
address Lipika's feedback
memadi-nv May 28, 2026
9c1379c
nit-format
memadi-nv May 28, 2026
495d8ab
refactor(evaluation): consolidate judge workflows behind a base class
memadi-nv May 29, 2026
4119738
add judge base
memadi-nv May 29, 2026
532c5c5
fix merge conflicts
memadi-nv Jun 1, 2026
f11b19b
fix(dependabot): move open-pull-requests-limit to the python group
memadi-nv Jun 1, 2026
e32aff4
Update SKILL according to evaluate; replace addition
memadi-nv Jun 1, 2026
6d58161
address greptile feedback
memadi-nv Jun 1, 2026
3 changes: 1 addition & 2 deletions .github/dependabot.yml
@@ -15,19 +15,18 @@ multi-ecosystem-groups:
python:
schedule:
interval: "weekly"
open-pull-requests-limit: 5

updates:
- package-ecosystem: "pip"
directory: "/"
multi-ecosystem-group: python
patterns: ["*"]
open-pull-requests-limit: 5

- package-ecosystem: "uv"
directory: "/"
multi-ecosystem-group: python
patterns: ["*"]
open-pull-requests-limit: 5

- package-ecosystem: "github-actions"
directory: "/"
39 changes: 36 additions & 3 deletions skills/anonymizer/SKILL.md
@@ -16,7 +16,7 @@ Anonymize a text dataset using NeMo Anonymizer in the way the user describes:

$ARGUMENTS

The output is a single runnable Python script that builds an `AnonymizerConfig`, previews results on a few rows, inspects failures and quality metrics, and (on user approval) runs the full pipeline. The script is the durable artifact — the user keeps it for re-runs, version control, and production.
The output is a single runnable Python script that builds an `AnonymizerConfig`, previews results on a few rows, inspects failures and quality metrics, optionally scores Replace output with LLM-as-judge evaluation, and (on user approval) runs the full pipeline. The script is the durable artifact — the user keeps it for re-runs, version control, and production.

# Workflow

@@ -30,6 +30,7 @@ Read `workflows/interactive.md` and follow it. Anonymization is high-stakes —
- **For cross-record consistency** (same value → same replacement everywhere), use `Hash`, not `Substitute`. `Substitute` is consistent within a row only.
- **In Replace mode, default to `Substitute`** if the user hasn't specified a strategy. It's the most general-purpose choice and matches the bulk of production usage.
- **`Annotate` is for inspection, not production.** Its output keeps the original entity text and is not privacy-safe. Use it during iteration to confirm detection is working, then switch.
- **Evaluation is opt-in and runs as a separate step** (Replace mode). After `preview()` / `run()`, call `anonymizer.evaluate(result)` to score the output with LLM-as-judge. `Substitute` gets four judges (detection validity, type fidelity, relational consistency, attribute fidelity); `Redact` / `Annotate` / `Hash` get the detection-validity judge only. Evaluation is diagnostic — it scores quality, it does not change the anonymized output.
- **Always set `AnonymizerInput.data_summary`**, even briefly. It is the single cheapest quality lever and it improves both detection and rewrite.
- **Never claim privacy guarantees.** Anonymizer is best-effort. Outputs may need human review depending on `risk_tolerance`. Tell the user this when you finalize.

@@ -41,6 +42,9 @@ Read `workflows/interactive.md` and follow it. Anonymization is high-stakes —
- **`risk_tolerance` only applies to Rewrite mode**, not Replace.
- **`PrivacyGoal.protect` and `.preserve` must each be 10–1000 chars and at least 3 words.** Be specific (categories, named identifiers, structural facets); avoid generic phrasing like "preserve meaning".
- **Validator pool is the only model role with built-in load-spreading.** Set `entity_validator: [a, b, c]` in `models.yaml` if rate limits drop rows. Other roles (rewriter, evaluator, etc.) are single-alias.
- **The evaluation judges use their own model roles** (`detection_validity_judge`, `replace_type_fidelity_judge`, `replace_relational_consistency_judge`, `replace_attribute_fidelity_judge`), configured in the `evaluate` section of `models.yaml`. They are **not** consumed by `preview()` / `run()`, so a config that anonymizes fine can still fail validation at `evaluate()` if those roles are unset. Defaults ship in `src/anonymizer/config/default_model_configs/evaluate.yaml`.
- **`*_valid` verdict columns are `True` / `False` / `None`.** `None` means the judge was unavailable (model/infra failure), **not** that the row passed — treat it as "unscored", never as a pass. Inspect verdicts per record with `evaluated.display_record(i)`.
- **`EvaluateConfig` is an empty placeholder today** — no knobs to set. `anonymizer.evaluate(result)` is the whole API; pass nothing else.
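The tri-state verdict rule above (`True` / `False` / `None`, where `None` is "unscored") can be illustrated with a minimal sketch in plain Python. The helper name `summarize_verdicts` is hypothetical and not part of Anonymizer; it only shows why identity checks (`is True`, `is None`) are safer than truthiness when tallying a `*_valid` column.

```python
from typing import Iterable, Optional


def summarize_verdicts(verdicts: Iterable[Optional[bool]]) -> dict[str, int]:
    """Tally one `*_valid` column. Hypothetical helper, for illustration only."""
    values = list(verdicts)
    # Identity checks keep the three states apart: a plain truthiness test
    # would lump None (unscored) together with False (failed).
    passed = sum(1 for v in values if v is True)
    failed = sum(1 for v in values if v is False)
    unscored = sum(1 for v in values if v is None)
    return {"passed": passed, "failed": failed, "unscored": unscored}


# Example column: two passes, one failure, one row the judge could not score.
summary = summarize_verdicts([True, False, None, True])
print(summary)
```

Reporting `unscored` as its own count makes a judge outage visible instead of letting it shrink the denominator silently.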

# Reference Docs

@@ -69,8 +73,10 @@ Write a Python script to the current directory. Name it after the dataset (e.g.
Generated by the anonymizer agent skill.

Usage:
python <this_script>.py # preview on 5 rows (fast, cheap)
python <this_script>.py --full # run on the full dataset
python <this_script>.py # preview on 5 rows (fast, cheap)
python <this_script>.py --full # run on the full dataset
python <this_script>.py --evaluate # preview 5 rows, then LLM-judge-score those rows
python <this_script>.py --full --evaluate # run full dataset, then score the full output
"""

from __future__ import annotations
@@ -134,6 +140,11 @@ def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--full", action="store_true", help="Run on full dataset (default: preview 5 rows)")
    parser.add_argument("--num-records", type=int, default=5, help="Rows to preview (ignored with --full)")
    parser.add_argument(
        "--evaluate",
        action="store_true",
        help="LLM-judge-score the output produced this run (preview rows, or full output with --full)",
    )
    args = parser.parse_args()

    anonymizer = Anonymizer()
@@ -164,6 +175,28 @@ def main() -> None:
print("\nFix dropped rows before tweaking strategy. See docs/troubleshooting.md.")
sys.exit(1)

    # Optional LLM-as-judge evaluation (Replace mode). Opt-in, separate step:
    # scores how well detection + replacement worked without changing the
    # output. Substitute -> 4 judges (detection validity, type fidelity,
    # relational consistency, attribute fidelity); Redact/Annotate/Hash ->
    # detection-validity judge only. Needs the `evaluate` model roles in
    # models.yaml (see src/anonymizer/config/default_model_configs/evaluate.yaml).
    if args.evaluate and config.replace is not None:
        result = anonymizer.evaluate(result)
        df = result.dataframe
        for col in (
            "detection_valid",
            "type_fidelity_valid",
            "relational_consistency_valid",
            "attribute_fidelity_valid",
        ):
            if col in df.columns:
                passed = int(df[col].eq(True).sum())  # None = unscored, never a pass
                scored = int(df[col].notna().sum())
                print(f"{col}: {passed}/{scored} passed ({len(df) - scored} unscored)")
        # In a notebook, inspect per-record verdicts visually:
        # result.display_record(0)

    # Rewrite-mode quality summary (skip for Replace mode).
    if config.rewrite is not None:
        df = result.dataframe
170 changes: 24 additions & 146 deletions src/anonymizer/engine/evaluation/detection_judge.py
@@ -5,14 +5,11 @@

import json
import logging
from dataclasses import dataclass
from typing import ClassVar

import pandas as pd
from data_designer.config.column_configs import LLMStructuredColumnConfig
from data_designer.config.models import ModelConfig
from pydantic import BaseModel, Field

from anonymizer.config.models import EvaluateModelSelection
from anonymizer.engine.constants import (
COL_DETECTION_INVALID_ENTITIES,
COL_DETECTION_JUDGE,
@@ -22,10 +19,8 @@
ENTITY_LABEL_EXAMPLES,
_jinja,
)
from anonymizer.engine.ndd.adapter import FailedRecord, NddAdapter
from anonymizer.engine.ndd.model_loader import resolve_model_alias
from anonymizer.engine.evaluation.judge_base import _BaseJudgeWorkflow
from anonymizer.engine.prompt_utils import substitute_placeholders
from anonymizer.engine.row_partitioning import merge_and_reorder, split_rows
from anonymizer.engine.schemas import EntitiesByValueSchema

logger = logging.getLogger("anonymizer.evaluation.detection_judge")
@@ -57,17 +52,6 @@ class DetectionJudgmentSchema(BaseModel):
)


# ---------------------------------------------------------------------------
# Result
# ---------------------------------------------------------------------------


@dataclass(frozen=True)
class DetectionJudgeResult:
dataframe: pd.DataFrame
failed_records: list[FailedRecord]


# ---------------------------------------------------------------------------
# Prompt
# ---------------------------------------------------------------------------
@@ -192,150 +176,44 @@ def _label_examples_for_judge(parsed: EntitiesByValueSchema) -> str:
return json.dumps(examples, ensure_ascii=True)


def _flatten_judgment(raw: object) -> tuple[bool | None, list[dict[str, str]]]:
"""Normalize an LLM judge output into (all_valid, invalid_entities).

Returns ``(None, [])`` for any malformed or missing payload so downstream
display can render "judge unavailable" rather than fabricate a verdict.
"""
if raw is None:
return None, []
if hasattr(raw, "model_dump"):
raw = raw.model_dump(mode="python")
if isinstance(raw, str):
try:
raw = json.loads(raw)
except (json.JSONDecodeError, ValueError):
return None, []
if not isinstance(raw, dict):
return None, []
try:
parsed = DetectionJudgmentSchema.model_validate(raw)
except Exception:
return None, []
return parsed.all_valid, [entry.model_dump() for entry in parsed.invalid_entities]


# ---------------------------------------------------------------------------
# Workflow
# ---------------------------------------------------------------------------


class DetectionJudgeWorkflow:
class DetectionJudgeWorkflow(_BaseJudgeWorkflow):
"""LLM-as-judge evaluator that flags invalid PII detections per record.

Runs after replacement and validates the detection step that fed the
replacement. Output columns: ``COL_DETECTION_VALID`` (bool|None) and
``COL_DETECTION_INVALID_ENTITIES`` (list of {value, label, reasoning}).
"""

def __init__(self, adapter: NddAdapter) -> None:
self._adapter = adapter

# ------------------------------------------------------------------------
# Decomposed pieces — the orchestrator in ReplacementWorkflow uses these
# to merge all 4 judges into a single adapter.run_workflow() call.
# ------------------------------------------------------------------------

def prepare(
self,
dataframe: pd.DataFrame,
*,
entities_column: str = COL_ENTITIES_BY_VALUE,
) -> pd.DataFrame:
"""Add the intermediate columns this judge's prompt template references.

Returns a copy of ``dataframe`` with ``_entities_for_detection_judge`` and
``_entity_examples_for_detection_judge`` populated.
"""
    RAW_COL: ClassVar[str] = COL_DETECTION_JUDGE
    VALID_COL: ClassVar[str] = COL_DETECTION_VALID
    INVALID_COL: ClassVar[str] = COL_DETECTION_INVALID_ENTITIES
    SCHEMA: ClassVar[type[BaseModel]] = DetectionJudgmentSchema
    VERDICT_FIELD: ClassVar[str] = "all_valid"
    DEFAULT_PAYLOAD: ClassVar[dict] = {"all_valid": True, "invalid_entities": []}
    MODEL_ROLE: ClassVar[str] = "detection_validity_judge"
    WORKFLOW_NAME: ClassVar[str] = "replace-detection-judge"

def prepare(self, dataframe: pd.DataFrame) -> pd.DataFrame:
working_df = dataframe.copy()
parsed = working_df[entities_column].apply(EntitiesByValueSchema.from_raw)
parsed = working_df[COL_ENTITIES_BY_VALUE].apply(EntitiesByValueSchema.from_raw)
working_df[_ENTITIES_FOR_JUDGE_COL] = parsed.apply(_entities_for_judge)
working_df[_ENTITY_EXAMPLES_FOR_JUDGE_COL] = parsed.apply(_label_examples_for_judge)
return working_df

def column_config(self, selected_models: EvaluateModelSelection) -> LLMStructuredColumnConfig:
"""The DD column config — name, prompt, model alias, structured-output schema."""
return LLMStructuredColumnConfig(
name=COL_DETECTION_JUDGE,
prompt=_judge_prompt(),
model_alias=resolve_model_alias("detection_validity_judge", selected_models),
output_format=DetectionJudgmentSchema,
)

def postprocess(self, dataframe: pd.DataFrame) -> pd.DataFrame:
"""Flatten the raw judge output into VALID / INVALID columns and apply
the passthrough default (rows with no detected entities trivially pass).
"""
out = dataframe.copy()
flattened = out[COL_DETECTION_JUDGE].apply(_flatten_judgment) if COL_DETECTION_JUDGE in out.columns else None
def _passthrough_mask(self, dataframe: pd.DataFrame) -> pd.Series:
# `items` may be a numpy array after a parquet round-trip via DD, so use
# `len()` rather than `bool()` (which is ambiguous on multi-element arrays).
passthrough_mask = out[_ENTITIES_FOR_JUDGE_COL].apply(lambda items: items is None or len(items) == 0)

valid: list[bool | None] = []
invalid: list[list[dict[str, str]]] = []
for idx in out.index:
if passthrough_mask.loc[idx]:
valid.append(True)
invalid.append([])
elif flattened is not None:
v, inv = flattened.loc[idx]
valid.append(v)
invalid.append(inv)
else:
valid.append(None)
invalid.append([])
out[COL_DETECTION_VALID] = valid
out[COL_DETECTION_INVALID_ENTITIES] = invalid
# Stamp passthrough rows with the default raw judge payload so display logic stays consistent.
if COL_DETECTION_JUDGE in out.columns:
out.loc[passthrough_mask, COL_DETECTION_JUDGE] = [{"all_valid": True, "invalid_entities": []}] * int(
passthrough_mask.sum()
)
return out

# ------------------------------------------------------------------------
# Legacy single-judge entry point. Kept so existing callers/tests still work.
# ------------------------------------------------------------------------

def evaluate(
self,
dataframe: pd.DataFrame,
*,
model_configs: list[ModelConfig],
selected_models: EvaluateModelSelection,
entities_column: str = COL_ENTITIES_BY_VALUE,
preview_num_records: int | None = None,
) -> DetectionJudgeResult:
working_df = self.prepare(dataframe, entities_column=entities_column)

entity_rows, passthrough_rows = split_rows(working_df, column=_ENTITIES_FOR_JUDGE_COL, predicate=bool)
passthrough_rows[COL_DETECTION_JUDGE] = [
{"all_valid": True, "invalid_entities": []} for _ in range(len(passthrough_rows))
]
passthrough_rows[COL_DETECTION_VALID] = True
passthrough_rows[COL_DETECTION_INVALID_ENTITIES] = [[] for _ in range(len(passthrough_rows))]

if entity_rows.empty:
combined = merge_and_reorder(passthrough_rows)
return DetectionJudgeResult(dataframe=combined, failed_records=[])

effective_preview_num_records = (
min(preview_num_records, len(entity_rows)) if preview_num_records is not None else None
)
run_result = self._adapter.run_workflow(
entity_rows,
model_configs=model_configs,
columns=[self.column_config(selected_models)],
workflow_name="replace-detection-judge",
preview_num_records=effective_preview_num_records,
)

judged_df = run_result.dataframe.copy()
flattened = judged_df[COL_DETECTION_JUDGE].apply(_flatten_judgment)
judged_df[COL_DETECTION_VALID] = flattened.apply(lambda pair: pair[0])
judged_df[COL_DETECTION_INVALID_ENTITIES] = flattened.apply(lambda pair: pair[1])

combined = merge_and_reorder(judged_df, passthrough_rows)
return DetectionJudgeResult(dataframe=combined, failed_records=run_result.failed_records)
return dataframe[_ENTITIES_FOR_JUDGE_COL].apply(lambda items: items is None or len(items) == 0)

    @classmethod
    def _build_prompt(cls) -> str:
        return _judge_prompt()

    @classmethod
    def _extract_invalid(cls, parsed: BaseModel) -> list[dict[str, object]]:
        return [entry.model_dump() for entry in parsed.invalid_entities]
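
The refactor above moves the shared flatten/passthrough logic out of `DetectionJudgeWorkflow` and leaves only class-level constants and small hooks in the subclass. `judge_base.py` is not shown in this diff, so the following is only a rough sketch of how such a base class could consume those constants. It uses the standard library alone, and `SketchJudgeBase`, `SketchDetectionJudge`, `flatten`, `postprocess`, and `extract_invalid` are hypothetical names, not the actual `_BaseJudgeWorkflow` API.

```python
import json
from typing import Any, ClassVar, Optional


class SketchJudgeBase:
    """Hypothetical stand-in for a shared judge base class (not the PR's code)."""

    VERDICT_FIELD: ClassVar[str]
    DEFAULT_PAYLOAD: ClassVar[dict]

    def extract_invalid(self, payload: dict) -> list:
        """Subclass hook: pull the judge-specific list of invalid items."""
        raise NotImplementedError

    def flatten(self, raw: Any) -> tuple[Optional[bool], list]:
        """Normalize a raw judge output into (verdict, invalid_items).

        Any malformed payload yields (None, []) so the row reads as
        "unscored" rather than as a fabricated pass or fail.
        """
        if isinstance(raw, str):
            try:
                raw = json.loads(raw)
            except ValueError:  # json.JSONDecodeError subclasses ValueError
                return None, []
        if not isinstance(raw, dict):
            return None, []
        verdict = raw.get(self.VERDICT_FIELD)
        if not isinstance(verdict, bool):
            return None, []
        return verdict, self.extract_invalid(raw)

    def postprocess(self, raw_outputs: list, passthrough: list[bool]) -> list[tuple]:
        """Passthrough rows (nothing to judge) take the default payload."""
        return [
            self.flatten(self.DEFAULT_PAYLOAD if skip else raw)
            for raw, skip in zip(raw_outputs, passthrough)
        ]


class SketchDetectionJudge(SketchJudgeBase):
    # Mirrors the constants in the diff; everything else is inherited.
    VERDICT_FIELD = "all_valid"
    DEFAULT_PAYLOAD = {"all_valid": True, "invalid_entities": []}

    def extract_invalid(self, payload: dict) -> list:
        return list(payload.get("invalid_entities") or [])


judge = SketchDetectionJudge()
rows = judge.postprocess(
    [None, '{"all_valid": false, "invalid_entities": [{"value": "x"}]}', "not json"],
    [True, False, False],
)
print(rows)
```

With this shape, a new judge would only declare its verdict field, default payload, and extraction hook, which is consistent with the 146 deleted lines in this file.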