feat: load bundled model providers by default#189
Conversation
Plain Anonymizer() now uses bundled providers.yaml instead of DataDesigner's machine-local defaults, with validation for explicit model provider fields and cross-reference checks between model configs and provider definitions.
Greptile SummaryThis PR makes
Confidence Score: 5/5Safe to merge; the bundled defaults path is well-tested, existing callers that relied on None providers are handled by the new fallback, and all validation errors are correctly surfaced. The change is focused and thoroughly tested (833 passing tests, dedicated new tests for all new code paths). The bundled providers.yaml matches the provider names in models.yaml, validated by a dedicated test. The only gaps are minor defensive-programming omissions in non-critical paths. model_loader.py (load_default_model_providers lacks an empty-list guard) and detection_workflow.py (build_detection_config and build_detection_builder_for_seed still don't expose validation_single_chunk_full_text). Important Files Changed
|
| def _resolve_model_hosts(providers: list[ModelProvider] | None) -> list[str]: | ||
| """Sorted, deduplicated list of provider host classifications. | ||
|
|
||
| Returns ``["nvidia-build"]`` when no custom providers are configured — | ||
| anonymizer's defaults route through build.nvidia.com. | ||
| """ | ||
| """Sorted, deduplicated list of provider host classifications.""" | ||
| if not providers: | ||
| from anonymizer.telemetry import ModelHostEnum as _MH | ||
|
|
||
| return [_MH.NVIDIA_BUILD.value] | ||
| return [_MH.OTHER.value] | ||
| return collect_model_hosts([classify_model_host(p) for p in providers]) |
There was a problem hiding this comment.
Unreachable fallback with stale type hint
_resolve_model_providers now guarantees a non-empty list[ModelProvider] (loading bundled defaults when None, raising on empty list), so self._resolved_providers is always non-empty here. The if not providers: branch at line 874 is dead code and its return value was quietly changed from NVIDIA_BUILD to OTHER. The function signature also still advertises list[ModelProvider] | None, which contradicts the actual call-site type. If a future caller ever passes None or an empty list while relying on the OTHER fallback, the telemetry host would be misclassified compared to prior behaviour.
Reject empty providers lists from YAML files and remove the unreachable _resolve_model_hosts fallback now that bundled providers are always resolved.
Thread validation_single_chunk_full_text through _build_detection_spec, add provider fields to benchmark preflight fixtures, and fix measurement test ModelConfig deprecation warnings.
Summary
Anonymizer()now loads bundledproviders.yamland passes it to DataDesigner, so notebooks, CI, and benchmarks no longer depend on~/.data-designer/model_providers.yaml.provider:fields; init validates provider names against the resolved provider list and rejects empty provider lists withInvalidConfigError.Test plan
make test(833 passed)make format-checkmodels.yamlprovider names matchproviders.yamlAnonymizer()passes bundled providers to DataDesignermodel_providers=override and empty-list rejectionInvalidConfigError