feat(detection): export detection workflow for distributed/SLURM exec… by mvansegbroeck · Pull Request #182 · NVIDIA-NeMo/Anonymizer

mvansegbroeck · 2026-06-08T19:11:16Z

Run the GLiNER + LLM detection workflow on an external at-scale DataDesigner runtime (e.g. a SLURM orchestrator) instead of in-process.

- Anonymizer.export_detection_config() → EntityDetectionWorkflow.build_detection_config() → adapter.build_config(): assemble the DataDesignerConfigBuilder for the detection workflow without executing it. Extract _build_detection_spec() so the run path (detect_and_validate_entities) and the export path build identical (model_configs, columns); no behavior change to run().
- In-process builder factory for distributed workers. The detection workflow uses CustomColumnConfig columns whose generator_function is a live Python callable (DataDesigner custom columns are "library only"), which can't survive JSON serialization. So a distributed runtime rebuilds the live builder per worker rather than shipping a serialized config:
  - adapter.build_config_for_seed: assemble reading an EXISTING seed parquet (no rewrite; workers may share it) with an optional ordered PartitionBlock(job_index/num_jobs).
  - detection_workflow.build_detection_builder_for_seed + Anonymizer.export_detection_builder_for_seed.
  - anonymizer/distributed.py:build_detection_builder(seed_path, spec, job_index, num_jobs): the factory the runtime imports and calls in-process. Custom columns reference models by alias and get facades injected by the DataDesigner runtime, so the runtime's alias→server provider wiring routes their LLM calls.
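
To illustrate why the description rebuilds the builder per worker, here is a minimal sketch using only the standard library. The names `tag_entities` and `build_builder` and the dict-shaped "config" are hypothetical stand-ins, not the DataDesigner or Anonymizer API. The sketch only shows that a config holding a live function cannot pass through `json.dumps`, while a factory called in-process can hand back the callable intact.

```python
import json


def tag_entities(row: dict) -> dict:
    """Hypothetical stand-in for a custom column's generator_function."""
    return {**row, "entities": []}


# A config-like dict that holds a live callable, as the detection columns do.
column = {"name": "entities", "generator_function": tag_entities}

try:
    json.dumps(column)
    serializable = True
except TypeError:
    # The json module has no encoding for function objects.
    serializable = False


def build_builder(seed_path: str) -> dict:
    """Hypothetical per-worker factory: rebuild the live config in-process."""
    return {"seed_path": seed_path, "columns": [column]}


# Each worker calls the factory itself instead of loading a serialized config.
worker_builder = build_builder("seed.parquet")
```

Because each worker imports and calls the factory itself, only plain values (a path, a spec, a job index) need to cross the process boundary.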

Verified locally (DataDesigner installed): the factory builds a live builder from an existing parquet, the seed is not rewritten, all custom generator_functions are callables (not strings), and the ordered partition is set.
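
A local check along these lines could look like the sketch below. It is an assumption about how such a verification might be written, not the test that was actually run. `read_only_factory` is a hypothetical stand-in for the real factory, and the seed file holds placeholder bytes rather than real parquet.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used to detect any rewrite."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def read_only_factory(seed_path: Path) -> list[dict]:
    """Hypothetical factory: references the seed by path and never writes it."""
    return [{"name": "entities", "generator_function": len, "seed": str(seed_path)}]


with tempfile.TemporaryDirectory() as tmp:
    seed = Path(tmp) / "seed.parquet"
    seed.write_bytes(b"placeholder bytes, not real parquet")
    before = file_digest(seed)
    columns = read_only_factory(seed)
    # The seed must be byte-identical after the builder is constructed.
    seed_unchanged = file_digest(seed) == before

# Every generator_function must be a live callable, not a string reference.
all_callable = all(callable(c["generator_function"]) for c in columns)
```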

Summary

Type of Change

Testing

make test passes locally
make check passes locally (format + lint + typecheck + lock-check)
Added/updated tests for changes

Documentation

If docs changed: make docs-build passes locally

Related Issues

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-06-08T19:14:41Z

Greptile Summary

This PR adds an export path so the GLiNER + LLM detection workflow can be handed to an external distributed runtime (e.g. SLURM) instead of running in-process. The core detection logic is unchanged; _build_detection_spec is extracted so both the run path and the two new export paths build exactly the same column/model configuration.

Anonymizer.export_detection_config → build_detection_config → adapter.build_config: assembles and writes the seed parquet, returns the DataDesignerConfigBuilder without executing it.
Anonymizer.export_detection_builder_for_seed → adapter.build_config_for_seed: reads an existing seed parquet (no rewrite), selects an ordered PartitionBlock when num_jobs > 1, and validates job_index bounds eagerly.
anonymizer/distributed.py:build_detection_builder: the in-process worker factory the external runtime imports; reconstitutes a live Anonymizer with placeholder providers (overridden by the runtime) and calls export_detection_builder_for_seed.
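
The ordered partition and the eager bounds check can be sketched as follows. This is an illustration under assumptions: `partition_bounds` is a hypothetical helper, and the contiguous even split shown here is one plausible scheme, not necessarily how DataDesigner's PartitionBlock divides rows.

```python
def partition_bounds(num_rows: int, job_index: int, num_jobs: int) -> tuple[int, int]:
    """Return the [start, stop) row range for one job in an ordered split."""
    # Fail fast on a misconfigured shard instead of silently using the full seed.
    if num_jobs < 1:
        raise ValueError(f"num_jobs must be >= 1, got {num_jobs}")
    if not 0 <= job_index < num_jobs:
        raise ValueError(f"job_index must satisfy 0 <= job_index < {num_jobs}, got {job_index}")
    # Spread the remainder over the first `extra` jobs so sizes differ by at most one.
    base, extra = divmod(num_rows, num_jobs)
    start = job_index * base + min(job_index, extra)
    stop = start + base + (1 if job_index < extra else 0)
    return start, stop


# Ten rows over three jobs: contiguous, ordered, and non-overlapping.
shards = [partition_bounds(10, i, 3) for i in range(3)]
```

Validating before any work starts means that an out-of-range job_index with num_jobs=1 raises an error rather than being ignored.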

Confidence Score: 5/5

Safe to merge; both previously raised concerns are resolved and no new defects were found.

The refactoring is well-contained: _build_detection_spec is a pure extraction with no logic change, the run path is unmodified, and the two new export paths delegate cleanly to new adapter methods. The prior review concerns — missing detect key producing a confusing error, and job_index out of range being silently ignored for num_jobs=1 — are both addressed with explicit early checks. No new correctness, data-integrity, or security issues were found.
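
The extraction pattern described here can be sketched in a few lines. All names below (`DetectionSpec`, `build_detection_spec`, `run`, `export`, the model and column labels) are hypothetical and only mirror the shape of the refactor: one function builds the spec, and both the executing path and the exporting path call it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionSpec:
    """Hypothetical container for the shared (model_configs, columns) pair."""
    model_configs: tuple[str, ...]
    columns: tuple[str, ...]


def build_detection_spec(threshold: float) -> DetectionSpec:
    """Single source of truth for what the workflow contains."""
    return DetectionSpec(
        model_configs=("detector", "validator"),
        columns=("raw_entities", f"validated_entities@{threshold}"),
    )


def run(threshold: float) -> dict:
    """Run path: build the spec, then execute (execution is elided here)."""
    return {"executed": True, "spec": build_detection_spec(threshold)}


def export(threshold: float) -> dict:
    """Export path: build the same spec, but hand it back without executing."""
    return {"executed": False, "spec": build_detection_spec(threshold)}
```

Since both paths go through the same function, the exported configuration cannot drift from what `run` would have executed.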

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/anonymizer/distributed.py | New distributed-worker factory; previous thread concerns (missing `detect` key check, silent no-op for out-of-range job_index) are both addressed. Logic is clean. |
| src/anonymizer/engine/detection/detection_workflow.py | Extracts `_build_detection_spec` so the run path and both export paths share identical column/model construction; adds `build_detection_config` and `build_detection_builder_for_seed` with clean delegation to the adapter. |
| src/anonymizer/engine/ndd/adapter.py | Adds `build_config` (submitter path, writes seed) and `build_config_for_seed` (worker path, reads existing seed); bounds-check on job_index/num_jobs correctly replaces the previous silent no-op. |
| src/anonymizer/interface/anonymizer.py | Surfaces the two new export methods as public API; parameter threading to detection workflow is correct and consistent with the existing `run` path. |

Sequence Diagram

sequenceDiagram
    participant S as Submitter
    participant A as Anonymizer
    participant DW as EntityDetectionWorkflow
    participant AD as NddAdapter
    participant FS as Filesystem (seed.parquet)

    note over S,FS: Submitter side (export_detection_config)
    S->>A: export_detection_config(config, data, seed_path)
    A->>DW: build_detection_config(dataframe, seed_path, ...)
    DW->>DW: _build_detection_spec(...)
    DW->>AD: build_config(df, model_configs, columns, seed_path)
    AD->>FS: LocalFileSeedSource.from_dataframe → write seed.parquet
    AD-->>DW: DataDesignerConfigBuilder
    DW-->>A: DataDesignerConfigBuilder
    A-->>S: DataDesignerConfigBuilder (+ spec dict for workers)

    note over W,FS: Worker side (distributed.py / SLURM)
    participant W as Worker (distributed.py)
    participant R as DataDesigner Runtime
    W->>W: build_detection_builder(seed_path, spec, job_index, num_jobs)
    W->>A: Anonymizer(model_configs_yaml, placeholder_providers)
    W->>A: export_detection_builder_for_seed(config, seed_path, job_index, num_jobs)
    A->>DW: build_detection_builder_for_seed(seed_path, ..., job_index, num_jobs)
    DW->>DW: _build_detection_spec(...)
    DW->>AD: build_config_for_seed(model_configs, columns, seed_path, job_index, num_jobs)
    AD->>FS: LocalFileSeedSource(path) — read only, no write
    AD-->>DW: "DataDesignerConfigBuilder (with PartitionBlock if num_jobs > 1)"
    DW-->>A: DataDesignerConfigBuilder
    A-->>W: DataDesignerConfigBuilder
    W-->>R: DataDesignerConfigBuilder (live callables intact)
    R->>R: inject real model providers, run workflow

Reviews (4): Last reviewed commit: "fix(detection): fail fast on missing 'de..."

lipikaramaswamy

Validated this branch against a real Slurm/Big Iron POC on cw-pdx-cs-001.

Setup:

Anonymizer: feature/export-detection-config @ 42fad0efec9520515c6809a20adbb7970faa9dff
anonymizer-big-iron: main @ 678f905c2a06d2713b6284dee89c36e1bb6bf91f
big-iron: feature/gliner-service @ edce953075e4308b8cc6bf9ec291f56e435cffe3

Result:

Big Iron started both vLLM and GLiNER services successfully.
Runtime built the worker config through anonymizer.distributed:build_detection_builder.
Positive smoke job completed on Slurm: job 5744327, exit 0:0, elapsed 00:04:19.
Downstream scoring in the anonymizer-big-iron harness processed 8 records, including 2 GT-positive records, with recall/precision/F1 all 1.000 on the scored positives. We'll probably change anonymizer-big-iron to use measurement tools from #177 once it lands in main.
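
For reference, the three reported metrics relate as in the sketch below. The counts passed in are hypothetical: the comment above reports only 2 GT-positive records and scores of 1.000, not the underlying true/false positive tallies, and `prf1` is not the harness's actual scoring code.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: every positive found, nothing spurious.
perfect = prf1(tp=2, fp=0, fn=0)
# Hypothetical counts: one of two positives missed.
one_missed = prf1(tp=1, fp=0, fn=1)
```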

So I think this PR is necessary and sufficient for the Anonymizer-side API/export surface needed by the Slurm path. The full end-to-end path still depends on the Big Iron GLiNER/factory support branch until that lands, but we should be OK to use that branch for now. A few nits in the Greptile comments are worth addressing. And of course, make format is required.

Addresses review feedback on the distributed export path:
- distributed.build_detection_builder raises a clear KeyError when the 'detect' section is omitted, instead of a misleading KeyError on 'gliner_threshold' that points at the wrong config level.
- NddAdapter.build_config_for_seed validates num_jobs >= 1 and 0 <= job_index < num_jobs, so a misconfigured shard fails fast before the distributed run starts instead of silently processing the full seed.
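
The first fix can be sketched as below. The spec shape (a top-level 'detect' section containing 'gliner_threshold') is inferred from this commit message and is an assumption. `read_gliner_threshold` is a hypothetical helper, not the code in distributed.py.

```python
def read_gliner_threshold(spec: dict) -> float:
    """Look up the threshold, failing early if the whole section is absent."""
    if "detect" not in spec:
        # Name the missing section so the error points at the right config level.
        raise KeyError("spec is missing the top-level 'detect' section")
    return spec["detect"]["gliner_threshold"]


threshold = read_gliner_threshold({"detect": {"gliner_threshold": 0.5}})

try:
    read_gliner_threshold({})
    error_message = ""
except KeyError as exc:
    error_message = str(exc)
```

Checking for the section first means the error names 'detect' rather than a nested key the caller may never have heard of.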

mvansegbroeck requested a review from a team as a code owner June 8, 2026 19:11

greptile-apps Bot reviewed Jun 8, 2026


Comment thread src/anonymizer/distributed.py Outdated

Comment thread src/anonymizer/engine/ndd/adapter.py

binaryaaron mentioned this pull request Jun 9, 2026

feat: anonymizer measurement instrumentation and benchmark tooling #177

Merged

lipikaramaswamy approved these changes Jun 12, 2026


mvansegbroeck force-pushed the feature/export-detection-config branch 2 times, most recently from 515f74d to 0d59ecc Compare June 12, 2026 18:52

mvansegbroeck force-pushed the feature/export-detection-config branch from 0d59ecc to b14e1e1 Compare June 12, 2026 18:55

mvansegbroeck merged commit 4247433 into main Jun 12, 2026
11 checks passed

mvansegbroeck deleted the feature/export-detection-config branch June 12, 2026 19:02

lipikaramaswamy mentioned this pull request Jun 12, 2026

fix(detection): pass single chunk validation flag to exports #190

Open

9 tasks

mvansegbroeck merged 2 commits into main from feature/export-detection-config
