NVIDIA-NeMo · lipikaramaswamy · Jun 10, 2026 · Jun 12, 2026 · Jun 12, 2026 · Jun 12, 2026
@@ -65,7 +65,7 @@ Run `anonymizer --help` or `anonymizer <subcommand> --help` for all options.
 from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Redact
 DATA_URL = "https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv"
 
-# Uses default model providers (build.nvidia.com) via NVIDIA_API_KEY env var
+# Uses Anonymizer's bundled model providers (see src/anonymizer/config/default_model_configs/providers.yaml); requires NVIDIA_API_KEY
 anonymizer = Anonymizer()
 
 config = AnonymizerConfig(replace=Redact())

@@ -9,7 +9,9 @@ Anonymizer uses LLMs for entity detection, replacement, and rewriting. Models ar
 
 ## Defaults
 
-Set your API key for Anonymizer to use models hosted on [build.nvidia.com](https://build.nvidia.com).
+Plain `Anonymizer()` uses Anonymizer's bundled provider and model configs — not DataDesigner's machine-local defaults from `~/.data-designer/model_providers.yaml`. Bundled providers live at [`providers.yaml`](https://github.com/NVIDIA-NeMo/Anonymizer/blob/main/src/anonymizer/config/default_model_configs/providers.yaml); bundled models at [`models.yaml`](https://github.com/NVIDIA-NeMo/Anonymizer/blob/main/src/anonymizer/config/default_model_configs/models.yaml).
+
+Set your API key for Anonymizer to use models hosted on [build.nvidia.com](https://build.nvidia.com):
 
 ```bash
 export NVIDIA_API_KEY="your-nvidia-api-key"
@@ -31,7 +33,7 @@ Each pipeline stage has a **role** mapped to one of these aliases. See the full
 
 ## Custom providers
 
-Use `model_providers` to define named API endpoints for hosted models such as OpenAI or OpenRouter.
+Pass `model_providers` when you need a non-default endpoint — for example OpenAI, OpenRouter, a local GLiNER server, or an internal inference deployment. Plain `Anonymizer()` already uses bundled [build.nvidia.com](https://build.nvidia.com) settings; override only when your models point at a different provider name or URL.
 
 Set your API keys first:
 

@@ -58,7 +58,7 @@ The agent should consult these as it goes — *do not* try to enumerate field re
 Environment-level issues only. Quality and pipeline issues are in `docs/troubleshooting.md`.
 
 - **`anonymizer` not installed:** Tell the user `nemo-anonymizer` is not in this Python environment (requires Python ≥ 3.11). Ask if they want you to install it (`pip install nemo-anonymizer`) or do it themselves. Do not install without permission.
-- **Model aliases not configured:** Anonymizer can't run without `model_configs` and `model_providers` (YAML files or Python objects). Tell the user to set these up — see `docs/concepts/models.md`. If they don't have a config yet, point them at `src/anonymizer/config/default_model_configs/` for the shipped defaults.
+- **Model/provider setup:** Plain `Anonymizer()` ships with bundled `models.yaml` and `providers.yaml` (see `src/anonymizer/config/default_model_configs/`). For the default path, confirm `NVIDIA_API_KEY` is set. Pass custom `model_configs` and/or `model_providers` only when targeting non-default endpoints or model pools — see `docs/concepts/models.md`.
 - **LLM calls failing at preview:** Usually an auth issue (missing or invalid API key), a network problem, or a wrong endpoint URL. See `docs/troubleshooting.md` "Validation passed but `preview` errors at LLM call".
 - **Local / on-prem GLiNER:** Clone or download `tools/serve_gliner.py` from the Anonymizer repo, start the server, add a provider with `endpoint: http://localhost:8001/v1`, and point `gliner-pii-detector` at `provider: local-gliner` with `skip_health_check: true`. Preflight errors about missing aliases usually mean `model_configs` only listed the detector — include the full default pool. Wrong `endpoint` or a down server surfaces as detection failures at preview — see [`docs/concepts/self-hosting-gliner.md`](../../docs/concepts/self-hosting-gliner.md).
 

@@ -7,10 +7,7 @@ Iterative design with the user. Do not disengage from the loop until the user sa
 
 1. **Verify environment**
    - **Install**: run `python -c "import anonymizer; print(anonymizer.__version__)"`. If the import fails, STOP and follow the Troubleshooting section in `SKILL.md`.
-   - **Model providers**: before going further, confirm an LLM provider is configured. Anonymizer cannot run without one. Check that:
-     - An API key is set in the environment (`NVIDIA_API_KEY` for the shipped default, or the equivalent for the user's provider)
-     - A `providers.yaml` exists (defaults ship at `src/anonymizer/config/default_model_configs/providers.yaml`)
-   - If either is missing, STOP and walk the user through [`docs/concepts/models.md`](../../../docs/concepts/models.md) setup. Do not proceed to data inspection until the user confirms providers are ready.
+   - **Model providers**: plain `Anonymizer()` loads bundled providers from `src/anonymizer/config/default_model_configs/providers.yaml`. Before going further, confirm the API key for those defaults is set (`NVIDIA_API_KEY` for build.nvidia.com). Only ask for a custom `providers.yaml` when the user targets a non-default endpoint. If the key is missing, STOP and walk the user through [`docs/concepts/models.md`](../../../docs/concepts/models.md) setup.
 
 2. **Inspect the data** — Read the first few rows of the source file with pandas. You need to know:
    - Path, format, encoding.

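The provider check in step 1 above could be sketched as a small preflight helper. This is an illustration only and is not part of the PR: `missing_api_key_vars` is a hypothetical name, providers are modeled as plain dicts shaped like the entry in the bundled `providers.yaml`, and the sketch assumes the `api_key` field holds the name of an environment variable.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def missing_api_key_vars(
    providers: list[dict[str, str]],
    env: Mapping[str, str] | None = None,
) -> list[str]:
    """Return the sorted names of API-key env vars that are unset or empty."""
    env = os.environ if env is None else env
    wanted = {p["api_key"] for p in providers if p.get("api_key")}
    return sorted(name for name in wanted if not env.get(name))


# Shaped like the single entry in the bundled providers.yaml.
bundled = [
    {
        "name": "nvidia",
        "endpoint": "https://integrate.api.nvidia.com/v1",
        "provider_type": "openai",
        "api_key": "NVIDIA_API_KEY",
    }
]

# An explicit mapping keeps the example independent of the real environment.
print(missing_api_key_vars(bundled, env={}))  # ['NVIDIA_API_KEY']
print(missing_api_key_vars(bundled, env={"NVIDIA_API_KEY": "set"}))  # []
```

Calling the helper without `env` reads `os.environ`, which is what an agent would do before deciding whether to stop and walk the user through setup.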
@@ -5,6 +5,7 @@ This directory contains the default model configurations used by the Anonymizer
 ## Files
 
 - **`models.yaml`** — Defines the pool of available models (alias, provider, inference parameters). Each entry becomes a `ModelConfig` that NeMo Data Designer can route requests to.
+- **`providers.yaml`** — Defines named API endpoints (provider name, endpoint, API key env var). Loaded automatically when `model_providers` is `None`, as in plain `Anonymizer()`.
 - **`detection.yaml`** — Maps detection workflow roles (e.g. `entity_detector`, `entity_validator`) to model aliases from `models.yaml`.
 - **`replace.yaml`** — Maps replacement workflow roles (e.g. `replacement_generator`) to model aliases from `models.yaml`.
 - **`rewrite.yaml`** — Maps rewrite workflow roles (`domain_classifier`, `disposition_analyzer`, `meaning_extractor`, `qa_generator`, `rewriter`, `evaluator`, `repairer`, `judge`) to model aliases from `models.yaml`.

@@ -0,0 +1,8 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+providers:
+  - name: nvidia
+    endpoint: https://integrate.api.nvidia.com/v1
+    provider_type: openai
+    api_key: NVIDIA_API_KEY
@@ -117,6 +117,7 @@ def detect_and_validate_entities(
             gliner_detection_threshold=gliner_detection_threshold,
             validation_max_entities_per_call=validation_max_entities_per_call,
             validation_excerpt_window_chars=validation_excerpt_window_chars,
+            validation_single_chunk_full_text=validation_single_chunk_full_text,
             entity_labels=entity_labels,
             data_summary=data_summary,
         )
@@ -138,6 +139,7 @@ def _build_detection_spec(
         gliner_detection_threshold: float,
         validation_max_entities_per_call: int = _DEFAULT_VALIDATION_MAX_ENTITIES_PER_CALL,
         validation_excerpt_window_chars: int = _DEFAULT_VALIDATION_EXCERPT_WINDOW_CHARS,
+        validation_single_chunk_full_text: bool = True,
         entity_labels: list[str] | None = None,
         data_summary: str | None = None,
     ) -> tuple[list[ModelConfig], list[ColumnConfigT]]:

@@ -8,7 +8,7 @@
 from pathlib import Path
 from typing import Any
 
-from data_designer.config.models import ModelConfig, load_model_configs
+from data_designer.config.models import ModelConfig, ModelProvider, load_model_configs
 from data_designer.config.utils.io_helpers import load_config_file
 from pydantic import BaseModel
 
@@ -62,8 +62,10 @@ def parse_model_configs(raw: str | Path | None) -> ParsedModelConfigs:
             parsed = _parse_yaml_string(raw)
 
     user_selections = parsed.pop("selected_models", None)
+    _validate_raw_model_configs_have_provider(parsed)
+    model_configs = load_model_configs(parsed)
     return ParsedModelConfigs(
-        model_configs=load_model_configs(parsed),
+        model_configs=model_configs,
         selected_models=_merge_selections(user_selections),
     )
 
@@ -82,6 +84,16 @@ def load_default_model_selection(config_dir: Path | None = None) -> ModelSelecti
     )
 
 
+def load_default_model_providers(config_dir: Path | None = None) -> list[ModelProvider]:
+    """Load bundled provider definitions from ``providers.yaml``."""
+    resolved_dir = config_dir or DEFAULT_CONFIG_DIR
+    config_dict = _load_yaml_dict(resolved_dir / "providers.yaml")
+    raw_providers = config_dict.get("providers")
+    if not isinstance(raw_providers, list):
+        raise ValueError("Bundled providers YAML must contain a top-level 'providers' list.")
+    return [ModelProvider.model_validate(provider) for provider in raw_providers]
+
+
 def load_models_config(config_dir: Path | None = None) -> dict[str, Any]:
     """Load raw model definitions from models.yaml.
 
@@ -222,6 +234,25 @@ def _merge(section: BaseModel, overrides: dict[str, Any]) -> BaseModel:
     )
 
 
+def validate_model_configs_reference_providers(
+    model_configs: list[ModelConfig],
+    providers: list[ModelProvider],
+) -> None:
+    """Validate that every model config ``provider`` name exists in ``providers``."""
+    known_providers = {provider.name for provider in providers}
+    unknown_by_alias = {
+        model_config.alias: model_config.provider
+        for model_config in model_configs
+        if model_config.provider is not None and model_config.provider not in known_providers
+    }
+    if unknown_by_alias:
+        details = ", ".join(f"{alias}={provider!r}" for alias, provider in sorted(unknown_by_alias.items()))
+        raise ValueError(
+            f"Model config provider names not found in model_providers: {details}. "
+            f"Known providers: {sorted(known_providers)}"
+        )
+
+
 def validate_model_alias_references(
     model_configs: list[ModelConfig],
     selected_models: ModelSelection,
@@ -303,6 +334,36 @@ def _validate_alias_references(
         )
 
 
+def _provider_field_missing(entry: dict[str, Any]) -> bool:
+    provider = entry.get("provider")
+    if provider is None:
+        return True
+    if isinstance(provider, str):
+        return not provider.strip()
+    return False
+
+
+def _validate_raw_model_configs_have_provider(parsed: dict[str, Any]) -> None:
+    """Require an explicit ``provider`` on every user-supplied model config entry."""
+    raw_configs = parsed.get("model_configs")
+    if raw_configs is None:
+        return
+    if not isinstance(raw_configs, list):
+        raise ValueError("model_configs must be a list.")
+    missing: list[str] = []
+    for idx, entry in enumerate(raw_configs):
+        if not isinstance(entry, dict):
+            raise ValueError(f"model_configs[{idx}] must be a mapping.")
+        if _provider_field_missing(entry):
+            missing.append(str(entry.get("alias", f"<index {idx}>")))
+    if missing:
+        aliases = ", ".join(repr(alias) for alias in missing)
+        raise ValueError(
+            f"Model config entries missing required field 'provider': {aliases}. "
+            "Each entry in model_configs must specify provider= explicitly."
+        )
+
+
 def _parse_yaml_string(raw: str) -> dict[str, Any]:
     import yaml
 

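The two provider checks added in this file run at different stages: one on raw YAML entries before `load_model_configs`, the other on parsed configs against the resolved provider list. A minimal standalone sketch of both follows. It uses plain dicts and hypothetical `Provider` and `Model` dataclasses as stand-ins for `ModelProvider` and `ModelConfig`, and returns the offending entries rather than raising, so the behavior is easy to inspect.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Provider:
    """Hypothetical stand-in for ModelProvider (name only)."""

    name: str


@dataclass(frozen=True)
class Model:
    """Hypothetical stand-in for ModelConfig (alias and provider only)."""

    alias: str
    provider: str | None


def find_missing_provider(raw_configs: list[dict[str, Any]]) -> list[str]:
    """Labels of raw entries whose provider is absent, None, or blank."""
    missing = []
    for idx, entry in enumerate(raw_configs):
        provider = entry.get("provider")
        blank = provider is None or (isinstance(provider, str) and not provider.strip())
        if blank:
            # Fall back to the list index when the entry has no alias.
            missing.append(str(entry.get("alias", f"<index {idx}>")))
    return missing


def find_unknown_providers(models: list[Model], providers: list[Provider]) -> dict[str, str]:
    """Map alias -> provider name for models that reference an undefined provider."""
    known = {p.name for p in providers}
    return {m.alias: m.provider for m in models if m.provider is not None and m.provider not in known}


raw = [
    {"alias": "detector", "provider": "nvidia"},
    {"alias": "rewriter", "provider": "  "},
    {"model": "some/model"},
]
print(find_missing_provider(raw))  # ['rewriter', '<index 2>']

models = [Model("detector", "nvidia"), Model("judge", "local-gliner")]
print(find_unknown_providers(models, [Provider("nvidia")]))  # {'judge': 'local-gliner'}
```

In the diff itself both checks raise `ValueError` with the offending aliases listed, which `Anonymizer.__init__` then converts to `InvalidConfigError`.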
@@ -51,7 +51,12 @@
 from anonymizer.engine.evaluation.replace.type_fidelity_judge import TypeFidelityJudgeWorkflow
 from anonymizer.engine.io.reader import read_input
 from anonymizer.engine.ndd.adapter import FailedRecord, NddAdapter
-from anonymizer.engine.ndd.model_loader import parse_model_configs, validate_model_alias_references
+from anonymizer.engine.ndd.model_loader import (
+    load_default_model_providers,
+    parse_model_configs,
+    validate_model_alias_references,
+    validate_model_configs_reference_providers,
+)
 from anonymizer.engine.replace.llm_replace_workflow import LlmReplaceWorkflow
 from anonymizer.engine.replace.replace_runner import ReplacementWorkflow
 from anonymizer.engine.resolved_input import ResolvedInput
@@ -114,7 +119,8 @@ def __init__(
                 pool and optional ``selected_models`` overrides. ``None`` uses
                 bundled defaults. See ``default_model_configs/README.md``.
             model_providers: Provider definitions (list, YAML string, or file path).
-                Each provider maps a name to an endpoint and API key.
+                Each provider maps a name to an endpoint and API key. ``None`` uses
+                bundled defaults from ``default_model_configs/providers.yaml``.
             artifact_path: Directory for intermediate artifacts. Defaults to
                 ``.anonymizer-artifacts``.
             data_designer: Pre-configured DataDesigner instance (advanced usage).
@@ -128,10 +134,14 @@ def __init__(
         os.environ.setdefault("NEMO_SESSION_PREFIX", "anonymizer-")
         os.environ.setdefault("NEMO_DEPLOYMENT_TYPE", "sdk")
         resolved_artifact_path = Path(artifact_path or ".anonymizer-artifacts")
-        parsed = parse_model_configs(model_configs)
-        self._model_configs = parsed.model_configs
-        self._selected_models = parsed.selected_models
-        self._resolved_providers: list[ModelProvider] | None = _resolve_model_providers(model_providers)
+        try:
+            parsed = parse_model_configs(model_configs)
+            self._model_configs = parsed.model_configs
+            self._selected_models = parsed.selected_models
+            self._resolved_providers = _resolve_model_providers(model_providers)
+            validate_model_configs_reference_providers(self._model_configs, self._resolved_providers)
+        except ValueError as exc:
+            raise InvalidConfigError(str(exc)) from exc
         logger.info("🔧 Anonymizer initialized with %d model configs", len(self._model_configs))
         det = self._selected_models.detection
         logger.info(LOG_INDENT + "🔎 detector:  %s", det.entity_detector)
@@ -751,10 +761,12 @@ def _count_entities(df: pd.DataFrame) -> int:
 
 def _resolve_model_providers(
     model_providers: list[ModelProvider] | str | Path | None,
-) -> list[ModelProvider] | None:
+) -> list[ModelProvider]:
     if model_providers is None:
-        return None
+        return load_default_model_providers()
     if isinstance(model_providers, list):
+        if not model_providers:
+            raise ValueError("model_providers must contain at least one provider.")
         return model_providers
     if isinstance(model_providers, str) and "\n" not in model_providers:
         candidate = Path(model_providers.strip()).expanduser()
@@ -766,6 +778,8 @@ def _resolve_model_providers(
     raw_providers = config_dict.get("providers")
     if not isinstance(raw_providers, list):
         raise ValueError("model_providers YAML must contain a top-level 'providers' list.")
+    if not raw_providers:
+        raise ValueError("model_providers must contain at least one provider.")
     return [ModelProvider.model_validate(provider) for provider in raw_providers]
 
 
@@ -977,14 +991,6 @@ def _repair_iterations_triggered(failed: list[FailedRecord], is_rewrite: bool) -
     return len(iterations)
 
 
-def _resolve_model_hosts(providers: list[ModelProvider] | None) -> list[str]:
-    """Sorted, deduplicated list of provider host classifications.
-
-    Returns ``["nvidia-build"]`` when no custom providers are configured —
-    anonymizer's defaults route through build.nvidia.com.
-    """
-    if not providers:
-        from anonymizer.telemetry import ModelHostEnum as _MH
-
-        return [_MH.NVIDIA_BUILD.value]
+def _resolve_model_hosts(providers: list[ModelProvider]) -> list[str]:
+    """Sorted, deduplicated list of provider host classifications."""
     return collect_model_hosts([classify_model_host(p) for p in providers])
@@ -13,7 +13,7 @@ class InvalidInputError(AnonymizerError):
 
 
 class InvalidConfigError(AnonymizerError):
-    """Raised when model aliases or semantic configuration are invalid."""
+    """Raised when model, provider, alias, or semantic configuration is invalid."""
 
 
 class AnonymizerIOError(AnonymizerError):

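The resolution flow this PR introduces in `Anonymizer.__init__` can be sketched in isolation: `None` falls back to bundled providers, an empty list is rejected, and any `ValueError` is re-raised as `InvalidConfigError`. The sketch below is an assumption-laden simplification. `BUNDLED_PROVIDERS`, `resolve_providers`, and `init_providers` are hypothetical names, providers are plain dicts, and the YAML string and file-path branches are omitted.

```python
from __future__ import annotations


class InvalidConfigError(Exception):
    """Stand-in for anonymizer's InvalidConfigError."""


# Hypothetical in-memory copy of the bundled providers.yaml contents.
BUNDLED_PROVIDERS = [
    {
        "name": "nvidia",
        "endpoint": "https://integrate.api.nvidia.com/v1",
        "provider_type": "openai",
        "api_key": "NVIDIA_API_KEY",
    }
]


def resolve_providers(model_providers: list[dict] | None) -> list[dict]:
    """Return the caller's providers, or copies of the bundled ones when None."""
    if model_providers is None:
        return [dict(p) for p in BUNDLED_PROVIDERS]
    if not model_providers:
        raise ValueError("model_providers must contain at least one provider.")
    return model_providers


def init_providers(model_providers: list[dict] | None) -> list[dict]:
    """Wrap low-level ValueError in the public configuration error type."""
    try:
        return resolve_providers(model_providers)
    except ValueError as exc:
        raise InvalidConfigError(str(exc)) from exc


print(init_providers(None)[0]["name"])  # nvidia

try:
    init_providers([])
except InvalidConfigError as exc:
    print(f"rejected: {exc}")  # rejected: model_providers must contain at least one provider.
```

Because the resolver never returns `None`, downstream helpers such as `_resolve_model_hosts` no longer need a fallback branch for missing providers.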
@@ -50,19 +50,19 @@ def _isolate_telemetry_env(monkeypatch: pytest.MonkeyPatch) -> None:
 @pytest.fixture
 def stub_detector_model_configs() -> list[ModelConfig]:
     """Model configs with the GLiNER PII detector alias."""
-    return [ModelConfig(alias="gliner-pii-detector", model="nvidia/nemotron-pii")]
+    return [ModelConfig(alias="gliner-pii-detector", model="nvidia/nemotron-pii", provider="stub")]
 
 
 @pytest.fixture
 def stub_model_configs() -> list[ModelConfig]:
     """Generic model configs for workflows that don't care about the alias."""
-    return [ModelConfig(alias="stub-model", model="stub-model")]
+    return [ModelConfig(alias="stub-model", model="stub-model", provider="stub")]
 
 
 @pytest.fixture
 def stub_known_model_configs() -> list[ModelConfig]:
     """Minimal model pool for alias validation tests."""
-    return [ModelConfig(alias="known", model="some/model")]
+    return [ModelConfig(alias="known", model="some/model", provider="stub")]
 
 
 @pytest.fixture