2 changes: 1 addition & 1 deletion README.md
@@ -65,7 +65,7 @@ Run `anonymizer --help` or `anonymizer <subcommand> --help` for all options.
from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Redact
DATA_URL = "https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv"

-# Uses default model providers (build.nvidia.com) via NVIDIA_API_KEY env var
+# Uses Anonymizer's bundled model providers (see src/anonymizer/config/default_model_configs/providers.yaml)
anonymizer = Anonymizer()

config = AnonymizerConfig(replace=Redact())
6 changes: 4 additions & 2 deletions docs/concepts/models.md
@@ -9,7 +9,9 @@ Anonymizer uses LLMs for entity detection, replacement, and rewriting. Models ar

## Defaults

-Set your API key for Anonymizer to use models hosted on [build.nvidia.com](https://build.nvidia.com).
+Plain `Anonymizer()` uses Anonymizer's bundled provider and model configs — not DataDesigner's machine-local defaults from `~/.data-designer/model_providers.yaml`. Bundled providers live at [`providers.yaml`](https://github.com/NVIDIA-NeMo/Anonymizer/blob/main/src/anonymizer/config/default_model_configs/providers.yaml); bundled models at [`models.yaml`](https://github.com/NVIDIA-NeMo/Anonymizer/blob/main/src/anonymizer/config/default_model_configs/models.yaml).
+
+Set your API key for Anonymizer to use models hosted on [build.nvidia.com](https://build.nvidia.com):

```bash
export NVIDIA_API_KEY="your-nvidia-api-key"
@@ -31,7 +33,7 @@ Each pipeline stage has a **role** mapped to one of these aliases. See the full

## Custom providers

-Use `model_providers` to define named API endpoints for hosted models such as OpenAI or OpenRouter.
+Pass `model_providers` when you need a non-default endpoint — for example OpenAI, OpenRouter, a local GLiNER server, or an internal inference deployment. Plain `Anonymizer()` already uses bundled [build.nvidia.com](https://build.nvidia.com) settings; override only when your models point at a different provider name or URL.

Set your API keys first:

2 changes: 1 addition & 1 deletion skills/anonymizer/SKILL.md
@@ -58,7 +58,7 @@ The agent should consult these as it goes — *do not* try to enumerate field re
Environment-level issues only. Quality and pipeline issues are in `docs/troubleshooting.md`.

- **`anonymizer` not installed:** Tell the user `nemo-anonymizer` is not in this Python environment (requires Python ≥ 3.11). Ask if they want you to install it (`pip install nemo-anonymizer`) or do it themselves. Do not install without permission.
-- **Model aliases not configured:** Anonymizer can't run without `model_configs` and `model_providers` (YAML files or Python objects). Tell the user to set these up — see `docs/concepts/models.md`. If they don't have a config yet, point them at `src/anonymizer/config/default_model_configs/` for the shipped defaults.
+- **Model/provider setup:** Plain `Anonymizer()` ships with bundled `models.yaml` and `providers.yaml` (see `src/anonymizer/config/default_model_configs/`). For the default path, confirm `NVIDIA_API_KEY` is set. Pass custom `model_configs` and/or `model_providers` only when targeting non-default endpoints or model pools — see `docs/concepts/models.md`.
- **LLM calls failing at preview:** Usually an auth issue (missing or invalid API key), a network problem, or a wrong endpoint URL. See `docs/troubleshooting.md` "Validation passed but `preview` errors at LLM call".
- **Local / on-prem GLiNER:** Clone or download `tools/serve_gliner.py` from the Anonymizer repo, start the server, add a provider with `endpoint: http://localhost:8001/v1`, and point `gliner-pii-detector` at `provider: local-gliner` with `skip_health_check: true`. Preflight errors about missing aliases usually mean `model_configs` only listed the detector — include the full default pool. Wrong `endpoint` or a down server surfaces as detection failures at preview — see [`docs/concepts/self-hosting-gliner.md`](../../docs/concepts/self-hosting-gliner.md).

5 changes: 1 addition & 4 deletions skills/anonymizer/workflows/interactive.md
@@ -7,10 +7,7 @@ Iterative design with the user. Do not disengage from the loop until the user sa

1. **Verify environment**
- **Install**: run `python -c "import anonymizer; print(anonymizer.__version__)"`. If the import fails, STOP and follow the Troubleshooting section in `SKILL.md`.
-   - **Model providers**: before going further, confirm an LLM provider is configured. Anonymizer cannot run without one. Check that:
-     - An API key is set in the environment (`NVIDIA_API_KEY` for the shipped default, or the equivalent for the user's provider)
-     - A `providers.yaml` exists (defaults ship at `src/anonymizer/config/default_model_configs/providers.yaml`)
-     - If either is missing, STOP and walk the user through [`docs/concepts/models.md`](../../../docs/concepts/models.md) setup. Do not proceed to data inspection until the user confirms providers are ready.
+   - **Model providers**: plain `Anonymizer()` loads bundled providers from `src/anonymizer/config/default_model_configs/providers.yaml`. Before going further, confirm the API key for those defaults is set (`NVIDIA_API_KEY` for build.nvidia.com). Only ask for a custom `providers.yaml` when the user targets a non-default endpoint. If the key is missing, STOP and walk the user through [`docs/concepts/models.md`](../../../docs/concepts/models.md) setup.

2. **Inspect the data** — Read the first few rows of the source file with pandas. You need to know:
- Path, format, encoding.
1 change: 1 addition & 0 deletions src/anonymizer/config/default_model_configs/README.md
@@ -5,6 +5,7 @@ This directory contains the default model configurations used by the Anonymizer
## Files

- **`models.yaml`** — Defines the pool of available models (alias, provider, inference parameters). Each entry becomes a `ModelConfig` that NeMo Data Designer can route requests to.
+- **`providers.yaml`** — Defines named API endpoints (provider name, endpoint, API key env var). Loaded automatically when `Anonymizer(model_providers=None)`.
- **`detection.yaml`** — Maps detection workflow roles (e.g. `entity_detector`, `entity_validator`) to model aliases from `models.yaml`.
- **`replace.yaml`** — Maps replacement workflow roles (e.g. `replacement_generator`) to model aliases from `models.yaml`.
- **`rewrite.yaml`** — Maps rewrite workflow roles (`domain_classifier`, `disposition_analyzer`, `meaning_extractor`, `qa_generator`, `rewriter`, `evaluator`, `repairer`, `judge`) to model aliases from `models.yaml`.
8 changes: 8 additions & 0 deletions src/anonymizer/config/default_model_configs/providers.yaml
@@ -0,0 +1,8 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+providers:
+  - name: nvidia
+    endpoint: https://integrate.api.nvidia.com/v1
+    provider_type: openai
+    api_key: NVIDIA_API_KEY
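
Reviewer sketch, not part of this change: the `api_key` field above appears to name an environment variable rather than hold a secret (the directory README calls it the "API key env var"). Assuming that reading is right, a preflight check along these lines could report which providers would fail authentication before any LLM call is made. `BUNDLED_PROVIDERS` and `providers_missing_api_key` are hypothetical names; the dict mirrors the YAML so the example needs no YAML parser.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from typing import Any

# Plain-dict mirror of the bundled providers.yaml (assumption: same keys as the file above).
BUNDLED_PROVIDERS: dict[str, Any] = {
    "providers": [
        {
            "name": "nvidia",
            "endpoint": "https://integrate.api.nvidia.com/v1",
            "provider_type": "openai",
            "api_key": "NVIDIA_API_KEY",
        }
    ]
}


def providers_missing_api_key(
    config: Mapping[str, Any],
    environ: Mapping[str, str] | None = None,
) -> list[str]:
    """Hypothetical helper: names of providers whose api_key env var is unset or blank."""
    env = os.environ if environ is None else environ
    return [
        provider["name"]
        for provider in config["providers"]
        if not env.get(provider["api_key"], "").strip()
    ]


if __name__ == "__main__":
    # Reads the real environment; prints an empty list when NVIDIA_API_KEY is set.
    print(providers_missing_api_key(BUNDLED_PROVIDERS))
```

Passing `environ` explicitly keeps the check testable without touching the real process environment.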
2 changes: 2 additions & 0 deletions src/anonymizer/engine/detection/detection_workflow.py
@@ -117,6 +117,7 @@ def detect_and_validate_entities(
gliner_detection_threshold=gliner_detection_threshold,
validation_max_entities_per_call=validation_max_entities_per_call,
validation_excerpt_window_chars=validation_excerpt_window_chars,
+validation_single_chunk_full_text=validation_single_chunk_full_text,
entity_labels=entity_labels,
data_summary=data_summary,
)
@@ -138,6 +139,7 @@ def _build_detection_spec(
gliner_detection_threshold: float,
validation_max_entities_per_call: int = _DEFAULT_VALIDATION_MAX_ENTITIES_PER_CALL,
validation_excerpt_window_chars: int = _DEFAULT_VALIDATION_EXCERPT_WINDOW_CHARS,
+validation_single_chunk_full_text: bool = True,
entity_labels: list[str] | None = None,
data_summary: str | None = None,
) -> tuple[list[ModelConfig], list[ColumnConfigT]]:
65 changes: 63 additions & 2 deletions src/anonymizer/engine/ndd/model_loader.py
@@ -8,7 +8,7 @@
from pathlib import Path
from typing import Any

-from data_designer.config.models import ModelConfig, load_model_configs
+from data_designer.config.models import ModelConfig, ModelProvider, load_model_configs
from data_designer.config.utils.io_helpers import load_config_file
from pydantic import BaseModel

@@ -62,8 +62,10 @@ def parse_model_configs(raw: str | Path | None) -> ParsedModelConfigs:
parsed = _parse_yaml_string(raw)

user_selections = parsed.pop("selected_models", None)
+_validate_raw_model_configs_have_provider(parsed)
+model_configs = load_model_configs(parsed)
return ParsedModelConfigs(
-model_configs=load_model_configs(parsed),
+model_configs=model_configs,
selected_models=_merge_selections(user_selections),
)

@@ -82,6 +84,16 @@ def load_default_model_selection(config_dir: Path | None = None) -> ModelSelecti
)


+def load_default_model_providers(config_dir: Path | None = None) -> list[ModelProvider]:
+    """Load bundled provider definitions from ``providers.yaml``."""
+    resolved_dir = config_dir or DEFAULT_CONFIG_DIR
+    config_dict = _load_yaml_dict(resolved_dir / "providers.yaml")
+    raw_providers = config_dict.get("providers")
+    if not isinstance(raw_providers, list):
+        raise ValueError("Bundled providers YAML must contain a top-level 'providers' list.")
+    return [ModelProvider.model_validate(provider) for provider in raw_providers]


def load_models_config(config_dir: Path | None = None) -> dict[str, Any]:
"""Load raw model definitions from models.yaml.

@@ -222,6 +234,25 @@ def _merge(section: BaseModel, overrides: dict[str, Any]) -> BaseModel:
)


+def validate_model_configs_reference_providers(
+    model_configs: list[ModelConfig],
+    providers: list[ModelProvider],
+) -> None:
+    """Validate that every model config ``provider`` name exists in ``providers``."""
+    known_providers = {provider.name for provider in providers}
+    unknown_by_alias = {
+        model_config.alias: model_config.provider
+        for model_config in model_configs
+        if model_config.provider is not None and model_config.provider not in known_providers
+    }
+    if unknown_by_alias:
+        details = ", ".join(f"{alias}={provider!r}" for alias, provider in sorted(unknown_by_alias.items()))
+        raise ValueError(
+            f"Model config provider names not found in model_providers: {details}. "
+            f"Known providers: {sorted(known_providers)}"
+        )


def validate_model_alias_references(
model_configs: list[ModelConfig],
selected_models: ModelSelection,
@@ -303,6 +334,36 @@ def _validate_alias_references(
)


+def _provider_field_missing(entry: dict[str, Any]) -> bool:
+    provider = entry.get("provider")
+    if provider is None:
+        return True
+    if isinstance(provider, str):
+        return not provider.strip()
+    return False
+
+
+def _validate_raw_model_configs_have_provider(parsed: dict[str, Any]) -> None:
+    """Require an explicit ``provider`` on every user-supplied model config entry."""
+    raw_configs = parsed.get("model_configs")
+    if raw_configs is None:
+        return
+    if not isinstance(raw_configs, list):
+        raise ValueError("model_configs must be a list.")
+    missing: list[str] = []
+    for idx, entry in enumerate(raw_configs):
+        if not isinstance(entry, dict):
+            raise ValueError(f"model_configs[{idx}] must be a mapping.")
+        if _provider_field_missing(entry):
+            missing.append(str(entry.get("alias", f"<index {idx}>")))
+    if missing:
+        aliases = ", ".join(repr(alias) for alias in missing)
+        raise ValueError(
+            f"Model config entries missing required field 'provider': {aliases}. "
+            "Each entry in model_configs must specify provider= explicitly."
+        )


def _parse_yaml_string(raw: str) -> dict[str, Any]:
import yaml
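
Illustrative sketch only, not code from this PR: the two new checks in this file run in sequence. The first inspects raw entries (each needs a non-blank `provider`), and the second inspects parsed configs (each `provider` must name a known provider). The version below reproduces that logic on plain dicts so it runs without `data_designer`; `find_missing_provider` and `find_unknown_providers` are hypothetical stand-ins.

```python
from __future__ import annotations

from typing import Any


def find_missing_provider(entries: list[dict[str, Any]]) -> list[str]:
    """Stage 1: labels of entries whose ``provider`` is absent or blank."""
    missing: list[str] = []
    for idx, entry in enumerate(entries):
        provider = entry.get("provider")
        is_blank = isinstance(provider, str) and not provider.strip()
        if provider is None or is_blank:
            # Fall back to the list position when the entry has no alias.
            missing.append(str(entry.get("alias", f"<index {idx}>")))
    return missing


def find_unknown_providers(entries: list[dict[str, Any]], known: set[str]) -> dict[str, str]:
    """Stage 2: alias -> provider for provider names not in ``known``.

    Assumes stage 1 already passed, so every entry has ``alias`` and ``provider``.
    """
    return {
        entry["alias"]: entry["provider"]
        for entry in entries
        if entry["provider"] not in known
    }


raw_entries = [
    {"alias": "stub-model", "model": "stub-model", "provider": "   "},
    {"model": "some/model"},
]
complete_entries = [
    {"alias": "gliner-pii-detector", "model": "nvidia/nemotron-pii", "provider": "local-gliner"},
    {"alias": "known", "model": "some/model", "provider": "nvidia"},
]

if __name__ == "__main__":
    print(find_missing_provider(raw_entries))
    print(find_unknown_providers(complete_entries, known={"nvidia"}))
```

Running stage 1 on raw dicts before parsing means a missing `provider` is reported by alias rather than surfacing later as an unknown-provider mismatch.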

42 changes: 24 additions & 18 deletions src/anonymizer/interface/anonymizer.py
@@ -51,7 +51,12 @@
from anonymizer.engine.evaluation.replace.type_fidelity_judge import TypeFidelityJudgeWorkflow
from anonymizer.engine.io.reader import read_input
from anonymizer.engine.ndd.adapter import FailedRecord, NddAdapter
-from anonymizer.engine.ndd.model_loader import parse_model_configs, validate_model_alias_references
+from anonymizer.engine.ndd.model_loader import (
+    load_default_model_providers,
+    parse_model_configs,
+    validate_model_alias_references,
+    validate_model_configs_reference_providers,
+)
from anonymizer.engine.replace.llm_replace_workflow import LlmReplaceWorkflow
from anonymizer.engine.replace.replace_runner import ReplacementWorkflow
from anonymizer.engine.resolved_input import ResolvedInput
@@ -114,7 +119,8 @@ def __init__(
pool and optional ``selected_models`` overrides. ``None`` uses
bundled defaults. See ``default_model_configs/README.md``.
model_providers: Provider definitions (list, YAML string, or file path).
-Each provider maps a name to an endpoint and API key.
+Each provider maps a name to an endpoint and API key. ``None`` uses
+bundled defaults from ``default_model_configs/providers.yaml``.
artifact_path: Directory for intermediate artifacts. Defaults to
``.anonymizer-artifacts``.
data_designer: Pre-configured DataDesigner instance (advanced usage).
@@ -128,10 +134,14 @@ def __init__(
         os.environ.setdefault("NEMO_SESSION_PREFIX", "anonymizer-")
         os.environ.setdefault("NEMO_DEPLOYMENT_TYPE", "sdk")
         resolved_artifact_path = Path(artifact_path or ".anonymizer-artifacts")
-        parsed = parse_model_configs(model_configs)
-        self._model_configs = parsed.model_configs
-        self._selected_models = parsed.selected_models
-        self._resolved_providers: list[ModelProvider] | None = _resolve_model_providers(model_providers)
+        try:
+            parsed = parse_model_configs(model_configs)
+            self._model_configs = parsed.model_configs
+            self._selected_models = parsed.selected_models
+            self._resolved_providers = _resolve_model_providers(model_providers)
+            validate_model_configs_reference_providers(self._model_configs, self._resolved_providers)
+        except ValueError as exc:
+            raise InvalidConfigError(str(exc)) from exc
         logger.info("🔧 Anonymizer initialized with %d model configs", len(self._model_configs))
         det = self._selected_models.detection
         logger.info(LOG_INDENT + "🔎 detector: %s", det.entity_detector)
@@ -751,10 +761,12 @@ def _count_entities(df: pd.DataFrame) -> int:

 def _resolve_model_providers(
     model_providers: list[ModelProvider] | str | Path | None,
-) -> list[ModelProvider] | None:
+) -> list[ModelProvider]:
     if model_providers is None:
-        return None
+        return load_default_model_providers()
     if isinstance(model_providers, list):
+        if not model_providers:
+            raise ValueError("model_providers must contain at least one provider.")
         return model_providers
     if isinstance(model_providers, str) and "\n" not in model_providers:
         candidate = Path(model_providers.strip()).expanduser()
@@ -766,6 +778,8 @@ def _resolve_model_providers(
     raw_providers = config_dict.get("providers")
     if not isinstance(raw_providers, list):
         raise ValueError("model_providers YAML must contain a top-level 'providers' list.")
+    if not raw_providers:
+        raise ValueError("model_providers must contain at least one provider.")
     return [ModelProvider.model_validate(provider) for provider in raw_providers]


@@ -977,14 +991,6 @@ def _repair_iterations_triggered(failed: list[FailedRecord], is_rewrite: bool) -
return len(iterations)


-def _resolve_model_hosts(providers: list[ModelProvider] | None) -> list[str]:
-    """Sorted, deduplicated list of provider host classifications.
-
-    Returns ``["nvidia-build"]`` when no custom providers are configured —
-    anonymizer's defaults route through build.nvidia.com.
-    """
-    if not providers:
-        from anonymizer.telemetry import ModelHostEnum as _MH
-
-        return [_MH.NVIDIA_BUILD.value]
+def _resolve_model_hosts(providers: list[ModelProvider]) -> list[str]:
+    """Sorted, deduplicated list of provider host classifications."""
     return collect_model_hosts([classify_model_host(p) for p in providers])
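
Illustrative sketch only: after this change `None` and an empty list are no longer interchangeable. `None` selects the bundled defaults, while `[]` is rejected and surfaces as `InvalidConfigError` from the constructor. The stand-ins below use provider names instead of `ModelProvider` objects; `resolve_providers`, `init_providers`, and `DEFAULT_PROVIDER_NAMES` are hypothetical names chosen for the example.

```python
from __future__ import annotations


class InvalidConfigError(Exception):
    """Stand-in for anonymizer.interface.errors.InvalidConfigError."""


# Stand-in for load_default_model_providers(); the real call returns ModelProvider objects.
DEFAULT_PROVIDER_NAMES = ["nvidia"]


def resolve_providers(model_providers: list[str] | None) -> list[str]:
    """None selects the bundled defaults; an empty list is rejected."""
    if model_providers is None:
        return list(DEFAULT_PROVIDER_NAMES)
    if not model_providers:
        raise ValueError("model_providers must contain at least one provider.")
    return model_providers


def init_providers(model_providers: list[str] | None) -> list[str]:
    """Mirror the constructor's try/except: re-raise ValueError as InvalidConfigError."""
    try:
        return resolve_providers(model_providers)
    except ValueError as exc:
        raise InvalidConfigError(str(exc)) from exc
```

Because the resolver now always returns a non-empty list, downstream helpers such as `_resolve_model_hosts` no longer need a special case for missing providers.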
2 changes: 1 addition & 1 deletion src/anonymizer/interface/errors.py
@@ -13,7 +13,7 @@ class InvalidInputError(AnonymizerError):


class InvalidConfigError(AnonymizerError):
-"""Raised when model aliases or semantic configuration are invalid."""
+"""Raised when model, provider, alias, or semantic configuration is invalid."""


class AnonymizerIOError(AnonymizerError):
6 changes: 3 additions & 3 deletions tests/conftest.py
@@ -50,19 +50,19 @@ def _isolate_telemetry_env(monkeypatch: pytest.MonkeyPatch) -> None:
@pytest.fixture
def stub_detector_model_configs() -> list[ModelConfig]:
"""Model configs with the GLiNER PII detector alias."""
-return [ModelConfig(alias="gliner-pii-detector", model="nvidia/nemotron-pii")]
+return [ModelConfig(alias="gliner-pii-detector", model="nvidia/nemotron-pii", provider="stub")]


@pytest.fixture
def stub_model_configs() -> list[ModelConfig]:
"""Generic model configs for workflows that don't care about the alias."""
-return [ModelConfig(alias="stub-model", model="stub-model")]
+return [ModelConfig(alias="stub-model", model="stub-model", provider="stub")]


@pytest.fixture
def stub_known_model_configs() -> list[ModelConfig]:
"""Minimal model pool for alias validation tests."""
-return [ModelConfig(alias="known", model="some/model")]
+return [ModelConfig(alias="known", model="some/model", provider="stub")]


@pytest.fixture