NVIDIA-NeMo · memadi-nv · May 12, 2026 · May 13, 2026
@@ -15,19 +15,18 @@ multi-ecosystem-groups:
   python:
     schedule:
       interval: "weekly"
+    open-pull-requests-limit: 5
 
 updates:
   - package-ecosystem: "pip"
     directory: "/"
     multi-ecosystem-group: python
     patterns: ["*"]
-    open-pull-requests-limit: 5
 
   - package-ecosystem: "uv"
     directory: "/"
     multi-ecosystem-group: python
     patterns: ["*"]
-    open-pull-requests-limit: 5
 
   - package-ecosystem: "github-actions"
     directory: "/"

@@ -16,7 +16,7 @@ Anonymize a text dataset using NeMo Anonymizer in the way the user describes:
 
 $ARGUMENTS
 
-The output is a single runnable Python script that builds an `AnonymizerConfig`, previews results on a few rows, inspects failures and quality metrics, and (on user approval) runs the full pipeline. The script is the durable artifact — the user keeps it for re-runs, version control, and production.
+The output is a single runnable Python script that builds an `AnonymizerConfig`, previews results on a few rows, inspects failures and quality metrics, optionally scores Replace output with LLM-as-judge evaluation, and (on user approval) runs the full pipeline. The script is the durable artifact — the user keeps it for re-runs, version control, and production.
 
 # Workflow
 
@@ -30,6 +30,7 @@ Read `workflows/interactive.md` and follow it. Anonymization is high-stakes —
 - **For cross-record consistency** (same value → same replacement everywhere), use `Hash`, not `Substitute`. `Substitute` is consistent within a row only.
 - **In Replace mode, default to `Substitute`** if the user hasn't specified a strategy. It's the most general-purpose choice and matches the bulk of production usage.
 - **`Annotate` is for inspection, not production.** Its output keeps the original entity text and is not privacy-safe. Use it during iteration to confirm detection is working, then switch.
+- **Evaluation is opt-in and runs as a separate step** (Replace mode). After `preview()` / `run()`, call `anonymizer.evaluate(result)` to score the output with LLM-as-judge. `Substitute` gets four judges (detection validity, type fidelity, relational consistency, attribute fidelity); `Redact` / `Annotate` / `Hash` get the detection-validity judge only. Evaluation is diagnostic — it scores quality, it does not change the anonymized output.
 - **Always set `AnonymizerInput.data_summary`**, even briefly. It is the single cheapest quality lever and it improves both detection and rewrite.
 - **Never claim privacy guarantees.** Anonymizer is best-effort. Outputs may need human review depending on `risk_tolerance`. Tell the user this when you finalize.
 
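The strategy-to-judge mapping in the notes above can be sketched as a small pre-flight check. The role names are the ones this skill file lists for the `evaluate` section of `models.yaml`; the helpers `required_judge_roles` and `missing_roles` are hypothetical and are not part of the Anonymizer API. Whether the library itself validates all four roles or only those a strategy needs is not stated here, so treat this as an illustration only.

```python
from __future__ import annotations

# Role names as listed in the skill notes; everything else here is hypothetical.
DETECTION_ROLE = "detection_validity_judge"
SUBSTITUTE_ONLY_ROLES = (
    "replace_type_fidelity_judge",
    "replace_relational_consistency_judge",
    "replace_attribute_fidelity_judge",
)


def required_judge_roles(strategy: str) -> tuple[str, ...]:
    """Return the judge roles a Replace strategy is scored with."""
    if strategy == "Substitute":
        # Substitute is scored by all four judges.
        return (DETECTION_ROLE, *SUBSTITUTE_ONLY_ROLES)
    if strategy in {"Redact", "Annotate", "Hash"}:
        # The other strategies get the detection-validity judge only.
        return (DETECTION_ROLE,)
    raise ValueError(f"unknown Replace strategy: {strategy!r}")


def missing_roles(strategy: str, configured: dict[str, str | None]) -> list[str]:
    """List required roles that have no model alias configured."""
    return [role for role in required_judge_roles(strategy) if not configured.get(role)]


# A config that only sets the detection judge is enough for Hash,
# but leaves three roles unset for Substitute.
configured = {"detection_validity_judge": "judge-a"}  # alias name is made up
hash_missing = missing_roles("Hash", configured)
substitute_missing = missing_roles("Substitute", configured)
print(hash_missing)
print(substitute_missing)
```

Running a check like this before calling `anonymizer.evaluate(result)` would surface unset roles early, instead of after a full anonymization run has already completed.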
@@ -41,6 +42,9 @@ Read `workflows/interactive.md` and follow it. Anonymization is high-stakes —
 - **`risk_tolerance` only applies to Rewrite mode**, not Replace.
 - **`PrivacyGoal.protect` and `.preserve` must each be 10–1000 chars and at least 3 words.** Be specific (categories, named identifiers, structural facets); avoid generic phrasing like "preserve meaning".
 - **Validator pool is the only model role with built-in load-spreading.** Set `entity_validator: [a, b, c]` in `models.yaml` if rate limits drop rows. Other roles (rewriter, evaluator, etc.) are single-alias.
+- **The evaluation judges use their own model roles** (`detection_validity_judge`, `replace_type_fidelity_judge`, `replace_relational_consistency_judge`, `replace_attribute_fidelity_judge`), configured in the `evaluate` section of `models.yaml`. They are **not** consumed by `preview()` / `run()`, so a config that anonymizes fine can still fail validation at `evaluate()` if those roles are unset. Defaults ship in `src/anonymizer/config/default_model_configs/evaluate.yaml`.
+- **`*_valid` verdict columns are `True` / `False` / `None`.** `None` means the judge was unavailable (model/infra failure), **not** that the row passed — treat it as "unscored", never as a pass. Inspect verdicts per record with `evaluated.display_record(i)`.
+- **`EvaluateConfig` is an empty placeholder today** — no knobs to set. `anonymizer.evaluate(result)` is the whole API; pass nothing else.
 
 # Reference Docs
 
@@ -69,8 +73,10 @@ Write a Python script to the current directory. Name it after the dataset (e.g.
 Generated by the anonymizer agent skill.
 
 Usage:
-    python <this_script>.py            # preview on 5 rows (fast, cheap)
-    python <this_script>.py --full     # run on the full dataset
+    python <this_script>.py                 # preview on 5 rows (fast, cheap)
+    python <this_script>.py --full          # run on the full dataset
+    python <this_script>.py --evaluate      # preview 5 rows, then LLM-judge-score those rows
+    python <this_script>.py --full --evaluate  # run full dataset, then score the full output
 """
 
 from __future__ import annotations
@@ -134,6 +140,11 @@ def main() -> None:
     parser = argparse.ArgumentParser(description=__doc__)
     parser.add_argument("--full", action="store_true", help="Run on full dataset (default: preview 5 rows)")
     parser.add_argument("--num-records", type=int, default=5, help="Rows to preview (ignored with --full)")
+    parser.add_argument(
+        "--evaluate",
+        action="store_true",
+        help="LLM-judge-score the output produced this run (preview rows, or full output with --full)",
+    )
     args = parser.parse_args()
 
     anonymizer = Anonymizer()
@@ -164,6 +175,28 @@ def main() -> None:
         print("\nFix dropped rows before tweaking strategy. See docs/troubleshooting.md.")
         sys.exit(1)
 
+    # Optional LLM-as-judge evaluation (Replace mode). Opt-in, separate step:
+    # scores how well detection + replacement worked without changing the
+    # output. Substitute -> 4 judges (detection validity, type fidelity,
+    # relational consistency, attribute fidelity); Redact/Annotate/Hash ->
+    # detection-validity judge only. Needs the `evaluate` model roles in
+    # models.yaml (see src/anonymizer/config/default_model_configs/evaluate.yaml).
+    if args.evaluate and config.replace is not None:
+        result = anonymizer.evaluate(result)
+        df = result.dataframe
+        for col in (
+            "detection_valid",
+            "type_fidelity_valid",
+            "relational_consistency_valid",
+            "attribute_fidelity_valid",
+        ):
+            if col in df.columns:
+                passed = int(df[col].eq(True).sum())  # None = unscored, never a pass
+                scored = int(df[col].notna().sum())
+                print(f"{col}: {passed}/{scored} passed ({len(df) - scored} unscored)")
+        # In a notebook, inspect per-record verdicts visually:
+        #   result.display_record(0)
+
     # Rewrite-mode quality summary (skip for Replace mode).
     if config.rewrite is not None:
         df = result.dataframe

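The tri-state handling in the summary loop above can be checked in isolation. This is a minimal sketch on a toy DataFrame, assuming the verdict column holds Python `True` / `False` / `None` in an object-dtype column; the helper name `summarize_verdicts` is hypothetical.

```python
import pandas as pd

# Toy verdict column: two passes, one failure, one row the judge could not score.
df = pd.DataFrame(
    {"detection_valid": pd.Series([True, False, None, True], dtype=object)}
)


def summarize_verdicts(frame: pd.DataFrame, col: str) -> tuple[int, int, int]:
    """Return (passed, scored, unscored) for a True/False/None verdict column."""
    passed = int(frame[col].eq(True).sum())   # None compares unequal to True
    scored = int(frame[col].notna().sum())    # True and False both count as scored
    unscored = len(frame) - scored            # None rows: judge unavailable
    return passed, scored, unscored


passed, scored, unscored = summarize_verdicts(df, "detection_valid")
print(f"detection_valid: {passed}/{scored} passed ({unscored} unscored)")
```

Counting passes with `.eq(True)` rather than truthiness or a `fillna` default keeps unscored rows out of both the numerator and the denominator, which matches the "unscored, never a pass" rule in the skill notes.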
@@ -5,14 +5,11 @@
 
 import json
 import logging
-from dataclasses import dataclass
+from typing import ClassVar
 
 import pandas as pd
-from data_designer.config.column_configs import LLMStructuredColumnConfig
-from data_designer.config.models import ModelConfig
 from pydantic import BaseModel, Field
 
-from anonymizer.config.models import EvaluateModelSelection
 from anonymizer.engine.constants import (
     COL_DETECTION_INVALID_ENTITIES,
     COL_DETECTION_JUDGE,
@@ -22,10 +19,8 @@
     ENTITY_LABEL_EXAMPLES,
     _jinja,
 )
-from anonymizer.engine.ndd.adapter import FailedRecord, NddAdapter
-from anonymizer.engine.ndd.model_loader import resolve_model_alias
+from anonymizer.engine.evaluation.judge_base import _BaseJudgeWorkflow
 from anonymizer.engine.prompt_utils import substitute_placeholders
-from anonymizer.engine.row_partitioning import merge_and_reorder, split_rows
 from anonymizer.engine.schemas import EntitiesByValueSchema
 
 logger = logging.getLogger("anonymizer.evaluation.detection_judge")
@@ -57,17 +52,6 @@ class DetectionJudgmentSchema(BaseModel):
     )
 
 
-# ---------------------------------------------------------------------------
-# Result
-# ---------------------------------------------------------------------------
-
-
-@dataclass(frozen=True)
-class DetectionJudgeResult:
-    dataframe: pd.DataFrame
-    failed_records: list[FailedRecord]
-
-
 # ---------------------------------------------------------------------------
 # Prompt
 # ---------------------------------------------------------------------------
@@ -192,150 +176,44 @@ def _label_examples_for_judge(parsed: EntitiesByValueSchema) -> str:
     return json.dumps(examples, ensure_ascii=True)
 
 
-def _flatten_judgment(raw: object) -> tuple[bool | None, list[dict[str, str]]]:
-    """Normalize an LLM judge output into (all_valid, invalid_entities).
-
-    Returns ``(None, [])`` for any malformed or missing payload so downstream
-    display can render "judge unavailable" rather than fabricate a verdict.
-    """
-    if raw is None:
-        return None, []
-    if hasattr(raw, "model_dump"):
-        raw = raw.model_dump(mode="python")
-    if isinstance(raw, str):
-        try:
-            raw = json.loads(raw)
-        except (json.JSONDecodeError, ValueError):
-            return None, []
-    if not isinstance(raw, dict):
-        return None, []
-    try:
-        parsed = DetectionJudgmentSchema.model_validate(raw)
-    except Exception:
-        return None, []
-    return parsed.all_valid, [entry.model_dump() for entry in parsed.invalid_entities]
-
-
 # ---------------------------------------------------------------------------
 # Workflow
 # ---------------------------------------------------------------------------
 
 
-class DetectionJudgeWorkflow:
+class DetectionJudgeWorkflow(_BaseJudgeWorkflow):
     """LLM-as-judge evaluator that flags invalid PII detections per record.
 
     Runs after replacement and validates the detection step that fed the
     replacement. Output columns: ``COL_DETECTION_VALID`` (bool|None) and
     ``COL_DETECTION_INVALID_ENTITIES`` (list of {value, label, reasoning}).
     """
 
-    def __init__(self, adapter: NddAdapter) -> None:
-        self._adapter = adapter
-
-    # ------------------------------------------------------------------------
-    # Decomposed pieces — the orchestrator in ReplacementWorkflow uses these
-    # to merge all 4 judges into a single adapter.run_workflow() call.
-    # ------------------------------------------------------------------------
-
-    def prepare(
-        self,
-        dataframe: pd.DataFrame,
-        *,
-        entities_column: str = COL_ENTITIES_BY_VALUE,
-    ) -> pd.DataFrame:
-        """Add the intermediate columns this judge's prompt template references.
-
-        Returns a copy of ``dataframe`` with ``_entities_for_detection_judge`` and
-        ``_entity_examples_for_detection_judge`` populated.
-        """
+    RAW_COL: ClassVar[str] = COL_DETECTION_JUDGE
+    VALID_COL: ClassVar[str] = COL_DETECTION_VALID
+    INVALID_COL: ClassVar[str] = COL_DETECTION_INVALID_ENTITIES
+    SCHEMA: ClassVar[type[BaseModel]] = DetectionJudgmentSchema
+    VERDICT_FIELD: ClassVar[str] = "all_valid"
+    DEFAULT_PAYLOAD: ClassVar[dict] = {"all_valid": True, "invalid_entities": []}
+    MODEL_ROLE: ClassVar[str] = "detection_validity_judge"
+    WORKFLOW_NAME: ClassVar[str] = "replace-detection-judge"
+
+    def prepare(self, dataframe: pd.DataFrame) -> pd.DataFrame:
         working_df = dataframe.copy()
-        parsed = working_df[entities_column].apply(EntitiesByValueSchema.from_raw)
+        parsed = working_df[COL_ENTITIES_BY_VALUE].apply(EntitiesByValueSchema.from_raw)
         working_df[_ENTITIES_FOR_JUDGE_COL] = parsed.apply(_entities_for_judge)
         working_df[_ENTITY_EXAMPLES_FOR_JUDGE_COL] = parsed.apply(_label_examples_for_judge)
         return working_df
 
-    def column_config(self, selected_models: EvaluateModelSelection) -> LLMStructuredColumnConfig:
-        """The DD column config — name, prompt, model alias, structured-output schema."""
-        return LLMStructuredColumnConfig(
-            name=COL_DETECTION_JUDGE,
-            prompt=_judge_prompt(),
-            model_alias=resolve_model_alias("detection_validity_judge", selected_models),
-            output_format=DetectionJudgmentSchema,
-        )
-
-    def postprocess(self, dataframe: pd.DataFrame) -> pd.DataFrame:
-        """Flatten the raw judge output into VALID / INVALID columns and apply
-        the passthrough default (rows with no detected entities trivially pass).
-        """
-        out = dataframe.copy()
-        flattened = out[COL_DETECTION_JUDGE].apply(_flatten_judgment) if COL_DETECTION_JUDGE in out.columns else None
+    def _passthrough_mask(self, dataframe: pd.DataFrame) -> pd.Series:
         # `items` may be a numpy array after a parquet round-trip via DD, so use
         # `len()` rather than `bool()` (which is ambiguous on multi-element arrays).
-        passthrough_mask = out[_ENTITIES_FOR_JUDGE_COL].apply(lambda items: items is None or len(items) == 0)
-
-        valid: list[bool | None] = []
-        invalid: list[list[dict[str, str]]] = []
-        for idx in out.index:
-            if passthrough_mask.loc[idx]:
-                valid.append(True)
-                invalid.append([])
-            elif flattened is not None:
-                v, inv = flattened.loc[idx]
-                valid.append(v)
-                invalid.append(inv)
-            else:
-                valid.append(None)
-                invalid.append([])
-        out[COL_DETECTION_VALID] = valid
-        out[COL_DETECTION_INVALID_ENTITIES] = invalid
-        # Stamp passthrough rows with the default raw judge payload so display logic stays consistent.
-        if COL_DETECTION_JUDGE in out.columns:
-            out.loc[passthrough_mask, COL_DETECTION_JUDGE] = [{"all_valid": True, "invalid_entities": []}] * int(
-                passthrough_mask.sum()
-            )
-        return out
-
-    # ------------------------------------------------------------------------
-    # Legacy single-judge entry point. Kept so existing callers/tests still work.
-    # ------------------------------------------------------------------------
-
-    def evaluate(
-        self,
-        dataframe: pd.DataFrame,
-        *,
-        model_configs: list[ModelConfig],
-        selected_models: EvaluateModelSelection,
-        entities_column: str = COL_ENTITIES_BY_VALUE,
-        preview_num_records: int | None = None,
-    ) -> DetectionJudgeResult:
-        working_df = self.prepare(dataframe, entities_column=entities_column)
-
-        entity_rows, passthrough_rows = split_rows(working_df, column=_ENTITIES_FOR_JUDGE_COL, predicate=bool)
-        passthrough_rows[COL_DETECTION_JUDGE] = [
-            {"all_valid": True, "invalid_entities": []} for _ in range(len(passthrough_rows))
-        ]
-        passthrough_rows[COL_DETECTION_VALID] = True
-        passthrough_rows[COL_DETECTION_INVALID_ENTITIES] = [[] for _ in range(len(passthrough_rows))]
-
-        if entity_rows.empty:
-            combined = merge_and_reorder(passthrough_rows)
-            return DetectionJudgeResult(dataframe=combined, failed_records=[])
-
-        effective_preview_num_records = (
-            min(preview_num_records, len(entity_rows)) if preview_num_records is not None else None
-        )
-        run_result = self._adapter.run_workflow(
-            entity_rows,
-            model_configs=model_configs,
-            columns=[self.column_config(selected_models)],
-            workflow_name="replace-detection-judge",
-            preview_num_records=effective_preview_num_records,
-        )
-
-        judged_df = run_result.dataframe.copy()
-        flattened = judged_df[COL_DETECTION_JUDGE].apply(_flatten_judgment)
-        judged_df[COL_DETECTION_VALID] = flattened.apply(lambda pair: pair[0])
-        judged_df[COL_DETECTION_INVALID_ENTITIES] = flattened.apply(lambda pair: pair[1])
-
-        combined = merge_and_reorder(judged_df, passthrough_rows)
-        return DetectionJudgeResult(dataframe=combined, failed_records=run_result.failed_records)
+        return dataframe[_ENTITIES_FOR_JUDGE_COL].apply(lambda items: items is None or len(items) == 0)
+
+    @classmethod
+    def _build_prompt(cls) -> str:
+        return _judge_prompt()
+
+    @classmethod
+    def _extract_invalid(cls, parsed: BaseModel) -> list[dict[str, object]]:
+        return [entry.model_dump() for entry in parsed.invalid_entities]
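The refactor above moves the shared prepare/flatten/postprocess logic into `_BaseJudgeWorkflow`, whose source is not shown in this diff. The following is a rough, dependency-free sketch of that template-method shape, reconstructed from the removed `postprocess` and `_flatten_judgment` code. Class names, column names, and the use of plain dict rows instead of a DataFrame are all assumptions for illustration; the real base class may differ.

```python
from __future__ import annotations

import copy
import json
from typing import Any, ClassVar


class SketchJudgeBase:
    """Hypothetical stand-in for a shared judge base (not the real class)."""

    RAW_COL: ClassVar[str]
    VALID_COL: ClassVar[str]
    INVALID_COL: ClassVar[str]
    VERDICT_FIELD: ClassVar[str]
    DEFAULT_PAYLOAD: ClassVar[dict[str, Any]]

    def is_passthrough(self, row: dict[str, Any]) -> bool:
        raise NotImplementedError

    def extract_invalid(self, payload: dict[str, Any]) -> list[dict[str, Any]]:
        raise NotImplementedError

    def flatten(self, raw: Any) -> tuple[bool | None, list[dict[str, Any]]]:
        """Malformed or missing judge output becomes (None, []), never a verdict."""
        if isinstance(raw, str):
            try:
                raw = json.loads(raw)
            except ValueError:
                return None, []
        if not isinstance(raw, dict) or not isinstance(raw.get(self.VERDICT_FIELD), bool):
            return None, []
        return raw[self.VERDICT_FIELD], self.extract_invalid(raw)

    def postprocess(self, rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
        out = []
        for row in rows:
            new_row = dict(row)
            if self.is_passthrough(row):
                # Nothing to judge: stamp the default payload and pass trivially.
                new_row[self.RAW_COL] = copy.deepcopy(self.DEFAULT_PAYLOAD)
                verdict, invalid = True, []
            else:
                verdict, invalid = self.flatten(row.get(self.RAW_COL))
            new_row[self.VALID_COL] = verdict
            new_row[self.INVALID_COL] = invalid
            out.append(new_row)
        return out


class SketchDetectionJudge(SketchJudgeBase):
    RAW_COL = "judge_raw"              # hypothetical column names
    VALID_COL = "detection_valid"
    INVALID_COL = "invalid_entities"
    VERDICT_FIELD = "all_valid"
    DEFAULT_PAYLOAD = {"all_valid": True, "invalid_entities": []}

    def is_passthrough(self, row: dict[str, Any]) -> bool:
        entities = row.get("entities")
        return entities is None or len(entities) == 0

    def extract_invalid(self, payload: dict[str, Any]) -> list[dict[str, Any]]:
        return list(payload.get("invalid_entities", []))


rows = [
    {"entities": [], "judge_raw": None},
    {"entities": ["Alice"], "judge_raw": '{"all_valid": false, "invalid_entities": '
     '[{"value": "Alice", "label": "city", "reasoning": "a name, not a city"}]}'},
    {"entities": ["Bob"], "judge_raw": "not json"},
]
judged = SketchDetectionJudge().postprocess(rows)
verdicts = [r["detection_valid"] for r in judged]
print(verdicts)
```

With this shape, each concrete judge only declares its column names, verdict field, and passthrough rule, while the "unavailable means `None`" behaviour lives in one place.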