Skip to content

feat: load bundled model providers by default#189

Open
lipikaramaswamy wants to merge 4 commits into
mainfrom
lipikaramaswamy/feature/180-bundled-model-providers
Open

feat: load bundled model providers by default#189
lipikaramaswamy wants to merge 4 commits into
mainfrom
lipikaramaswamy/feature/180-bundled-model-providers

Conversation

@lipikaramaswamy

Copy link
Copy Markdown
Collaborator

Summary

  • Plain Anonymizer() now loads bundled providers.yaml and passes it to DataDesigner, so notebooks, CI, and benchmarks no longer depend on ~/.data-designer/model_providers.yaml.
  • User-supplied model configs must declare explicit provider: fields; init validates provider names against the resolved provider list and rejects empty provider lists with InvalidConfigError.
  • Docs, agent skill, and tests updated to reflect bundled defaults and hardened provider/model resolution.

Test plan

  • make test (833 passed)
  • make format-check
  • Verified bundled models.yaml provider names match providers.yaml
  • Verified Anonymizer() passes bundled providers to DataDesigner
  • Verified custom model_providers= override and empty-list rejection
  • Verified missing/unknown provider config raises InvalidConfigError

Plain Anonymizer() now uses bundled providers.yaml instead of DataDesigner's
machine-local defaults, with validation for explicit model provider fields and
cross-reference checks between model configs and provider definitions.
@lipikaramaswamy lipikaramaswamy requested review from a team as code owners June 10, 2026 23:42
@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR makes Anonymizer() load bundled providers.yaml by default instead of relying on DataDesigner's machine-local ~/.data-designer/model_providers.yaml, and adds validation that rejects empty provider lists and model configs that reference unknown provider names.

  • Bundled provider defaults: _resolve_model_providers(None) now calls load_default_model_providers() to load the new src/anonymizer/config/default_model_configs/providers.yaml, which ships the NVIDIA build endpoint. Anonymizer.__init__ always passes a resolved list[ModelProvider] to DataDesigner — never None.
  • Stricter config validation: User-supplied model configs must now declare an explicit provider: field; the new _validate_raw_model_configs_have_provider and validate_model_configs_reference_providers helpers enforce this at init time and wrap all ValueErrors as InvalidConfigError.
  • detection_workflow.py bug fix: validation_single_chunk_full_text was silently dropped when detect_and_validate_entities called _build_detection_spec; the parameter is now forwarded correctly.

Confidence Score: 5/5

Safe to merge; the bundled defaults path is well-tested, existing callers that relied on None providers are handled by the new fallback, and all validation errors are correctly surfaced.

The change is focused and thoroughly tested (833 passing tests, dedicated new tests for all new code paths). The bundled providers.yaml matches the provider names in models.yaml, validated by a dedicated test. The only gaps are minor defensive-programming omissions in non-critical paths.

model_loader.py (load_default_model_providers lacks an empty-list guard) and detection_workflow.py (build_detection_config and build_detection_builder_for_seed still don't expose validation_single_chunk_full_text).

Important Files Changed

Filename Overview
src/anonymizer/interface/anonymizer.py Core init now loads bundled providers when None, wraps all ValueError from validation as InvalidConfigError, and removes the now-dead None-providers fallback in _resolve_model_hosts. Changes are correct and well-tested.
src/anonymizer/engine/ndd/model_loader.py Adds load_default_model_providers, validate_model_configs_reference_providers, and _validate_raw_model_configs_have_provider. Logic is sound; load_default_model_providers lacks the empty-list guard present in the user-supplied path.
src/anonymizer/engine/detection/detection_workflow.py Fixes validation_single_chunk_full_text not being forwarded in detect_and_validate_entities. build_detection_config and build_detection_builder_for_seed still don't expose the parameter.
src/anonymizer/config/default_model_configs/providers.yaml New bundled provider file defining the NVIDIA build endpoint. Correctly structured with name, endpoint, provider_type, and api_key fields.
tests/engine/test_model_loader.py Solid new tests covering missing provider, empty provider string, unknown provider, bundled coverage check, and empty list rejection. All existing tests updated to include provider field.
tests/interface/test_anonymizer_interface.py New integration tests verify bundled providers are passed to DataDesigner, custom overrides work, and invalid configs raise InvalidConfigError. Good coverage of the new behavior.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Anonymizer.__init__(model_configs, model_providers)"] --> B["parse_model_configs(model_configs)"]
    B --> C{model_configs is None?}
    C -->|Yes| D["Load bundled models.yaml\n→ ParsedModelConfigs"]
    C -->|No| E["Parse user YAML/Path\n_validate_raw_model_configs_have_provider\nload_model_configs()"]
    E --> F["ParsedModelConfigs"]
    D --> G["_resolve_model_providers(model_providers)"]
    F --> G
    G --> H{model_providers is None?}
    H -->|Yes| I["load_default_model_providers()\nbundled providers.yaml"]
    H -->|list| J{empty?}
    H -->|str/Path| K["load_config_file()\ncheck providers key\ncheck not empty"]
    J -->|Yes| L["raise ValueError"]
    J -->|No| M["list[ModelProvider]"]
    I --> N["list[ModelProvider]"]
    K --> O["list[ModelProvider]"]
    M --> P["validate_model_configs_reference_providers"]
    N --> P
    O --> P
    P --> Q{unknown providers?}
    Q -->|Yes| R["raise ValueError → InvalidConfigError"]
    Q -->|No| S["DataDesigner(model_providers=resolved_providers)"]
Loading

Reviews (2): Last reviewed commit: "fix: resolve CI failures after merging m..." | Re-trigger Greptile

Comment thread src/anonymizer/interface/anonymizer.py Outdated
Comment on lines 872 to 878
def _resolve_model_hosts(providers: list[ModelProvider] | None) -> list[str]:
"""Sorted, deduplicated list of provider host classifications.

Returns ``["nvidia-build"]`` when no custom providers are configured —
anonymizer's defaults route through build.nvidia.com.
"""
"""Sorted, deduplicated list of provider host classifications."""
if not providers:
from anonymizer.telemetry import ModelHostEnum as _MH

return [_MH.NVIDIA_BUILD.value]
return [_MH.OTHER.value]
return collect_model_hosts([classify_model_host(p) for p in providers])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unreachable fallback with stale type hint

_resolve_model_providers now guarantees a non-empty list[ModelProvider] (loading bundled defaults when None, raising on empty list), so self._resolved_providers is always non-empty here. The if not providers: branch at line 874 is dead code and its return value was quietly changed from NVIDIA_BUILD to OTHER. The function signature also still advertises list[ModelProvider] | None, which contradicts the actual call-site type. If a future caller ever passes None or an empty list while relying on the OTHER fallback, the telemetry host would be misclassified compared to prior behaviour.

@lipikaramaswamy lipikaramaswamy changed the title feat: load bundled model providers by default (#180) feat: load bundled model providers by default Jun 12, 2026
Reject empty providers lists from YAML files and remove the unreachable
_resolve_model_hosts fallback now that bundled providers are always resolved.
Thread validation_single_chunk_full_text through _build_detection_spec,
add provider fields to benchmark preflight fixtures, and fix measurement
test ModelConfig deprecation warnings.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant