
feat(detection): export detection workflow for distributed/SLURM exec…#182

Merged
mvansegbroeck merged 2 commits into main from feature/export-detection-config on Jun 12, 2026

@mvansegbroeck

Contributor

Run the GLiNER + LLM detection workflow on an external at-scale DataDesigner runtime (e.g. a SLURM orchestrator) instead of in-process.

  • Anonymizer.export_detection_config() → EntityDetectionWorkflow.build_detection_config() → adapter.build_config(): assemble the DataDesignerConfigBuilder for the detection workflow without executing it. Extract _build_detection_spec() so the run path (detect_and_validate_entities) and the export path build identical (model_configs, columns); no behavior change to run().

  • In-process builder factory for distributed workers. The detection workflow uses CustomColumnConfig columns whose generator_function is a live Python callable (DataDesigner custom columns are "library only"), which can't survive JSON serialization. So a distributed runtime rebuilds the live builder per worker rather than shipping a serialized config:

    • adapter.build_config_for_seed: assemble reading an EXISTING seed parquet (no rewrite — workers may share it) with an optional ordered PartitionBlock(job_index/num_jobs).
    • detection_workflow.build_detection_builder_for_seed + Anonymizer.export_detection_builder_for_seed.
    • anonymizer/distributed.py:build_detection_builder(seed_path, spec, job_index, num_jobs): the factory the runtime imports and calls in-process. Custom columns reference models by alias and get facades injected by the DataDesigner runtime, so the runtime's alias→server provider wiring routes their LLM calls.

Verified locally (DataDesigner installed): the factory builds a live builder from an existing parquet, the seed is not rewritten, all custom generator_functions are callables (not strings), and the ordered partition is set.
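
The serialization constraint described above can be illustrated with a minimal, standard-library-only sketch. It assumes nothing about DataDesigner itself: `redact_name`, `column_config`, and `build_columns` are hypothetical stand-ins for a custom column's `generator_function` and for the per-worker factory, not the project's real classes.

```python
import json


def redact_name(row: dict) -> dict:
    """Hypothetical stand-in for a custom column's generator function."""
    return {**row, "name": "<REDACTED>"}


# A config that holds a live callable cannot be JSON-encoded:
# json.dumps raises TypeError for function objects.
column_config = {"name": "redacted", "generator_function": redact_name}
try:
    json.dumps(column_config)
    serializable = True
except TypeError:
    serializable = False


def build_columns() -> list[dict]:
    """Hypothetical factory: each worker calls this in-process,
    so the callables are rebuilt locally instead of being shipped."""
    return [{"name": "redacted", "generator_function": redact_name}]


columns = build_columns()
# Mirrors the local check in the description: callables, not strings.
all_callable = all(callable(c["generator_function"]) for c in columns)
print(serializable, all_callable)
```

Because the factory is imported and called inside each worker process, only plain data (a seed path, a spec dict, and the job indices) has to cross the process boundary.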

Summary

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring

Testing

  • make test passes locally
  • make check passes locally (format + lint + typecheck + lock-check)
  • Added/updated tests for changes

Documentation

  • If docs changed: make docs-build passes locally

Related Issues

…ution


Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mvansegbroeck mvansegbroeck requested a review from a team as a code owner June 8, 2026 19:11
@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

This PR adds an export path so the GLiNER + LLM detection workflow can be handed to an external distributed runtime (e.g. SLURM) instead of running in-process. The core detection logic is unchanged; _build_detection_spec is extracted so both the run path and the two new export paths build exactly the same column/model configuration.

  • Anonymizer.export_detection_config → build_detection_config → adapter.build_config: assembles and writes the seed parquet, returns the DataDesignerConfigBuilder without executing it.
  • Anonymizer.export_detection_builder_for_seed → adapter.build_config_for_seed: reads an existing seed parquet (no rewrite), selects an ordered PartitionBlock when num_jobs > 1, and validates job_index bounds eagerly.
  • anonymizer/distributed.py:build_detection_builder: the in-process worker factory the external runtime imports; reconstitutes a live Anonymizer with placeholder providers (overridden by the runtime) and calls export_detection_builder_for_seed.
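
As a rough illustration of what an ordered partition by `job_index`/`num_jobs` could look like, the sketch below splits a seed of `num_rows` rows into contiguous, non-overlapping row ranges. The helper `ordered_partition_bounds` is hypothetical, and the even-split-with-remainder rule is an assumption; DataDesigner's actual PartitionBlock may divide rows differently.

```python
def ordered_partition_bounds(num_rows: int, job_index: int, num_jobs: int) -> tuple[int, int]:
    """Return the half-open row range [start, stop) for one job.

    Assumed rule: rows are split into contiguous blocks in seed order,
    and the first (num_rows % num_jobs) jobs each take one extra row.
    """
    base, extra = divmod(num_rows, num_jobs)
    start = job_index * base + min(job_index, extra)
    stop = start + base + (1 if job_index < extra else 0)
    return start, stop


# Ten seed rows shared by three workers reading the same parquet file.
bounds = [ordered_partition_bounds(10, i, 3) for i in range(3)]
print(bounds)  # [(0, 4), (4, 7), (7, 10)]
```

Since every worker derives its range from the same three integers, the workers can share one read-only seed file without coordinating with each other.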

Confidence Score: 5/5

Safe to merge; both previously raised concerns are resolved and no new defects were found.

The refactoring is well-contained: _build_detection_spec is a pure extraction with no logic change, the run path is unmodified, and the two new export paths delegate cleanly to new adapter methods. The prior review concerns — missing detect key producing a confusing error, and job_index out of range being silently ignored for num_jobs=1 — are both addressed with explicit early checks. No new correctness, data-integrity, or security issues were found.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/anonymizer/distributed.py | New distributed-worker factory; previous thread concerns (missing detect key check, silent no-op for out-of-range job_index) are both addressed. Logic is clean. |
| src/anonymizer/engine/detection/detection_workflow.py | Extracts _build_detection_spec so the run path and both export paths share identical column/model construction; adds build_detection_config and build_detection_builder_for_seed with clean delegation to the adapter. |
| src/anonymizer/engine/ndd/adapter.py | Adds build_config (submitter path, writes seed) and build_config_for_seed (worker path, reads existing seed); bounds-check on job_index/num_jobs correctly replaces the previous silent no-op. |
| src/anonymizer/interface/anonymizer.py | Surfaces the two new export methods as public API; parameter threading to detection workflow is correct and consistent with the existing run path. |

Sequence Diagram

sequenceDiagram
    participant S as Submitter
    participant A as Anonymizer
    participant DW as EntityDetectionWorkflow
    participant AD as NddAdapter
    participant FS as Filesystem (seed.parquet)

    note over S,FS: Submitter side (export_detection_config)
    S->>A: export_detection_config(config, data, seed_path)
    A->>DW: build_detection_config(dataframe, seed_path, ...)
    DW->>DW: _build_detection_spec(...)
    DW->>AD: build_config(df, model_configs, columns, seed_path)
    AD->>FS: LocalFileSeedSource.from_dataframe → write seed.parquet
    AD-->>DW: DataDesignerConfigBuilder
    DW-->>A: DataDesignerConfigBuilder
    A-->>S: DataDesignerConfigBuilder (+ spec dict for workers)

    note over W,FS: Worker side (distributed.py / SLURM)
    participant W as Worker (distributed.py)
    participant R as DataDesigner Runtime
    W->>W: build_detection_builder(seed_path, spec, job_index, num_jobs)
    W->>A: Anonymizer(model_configs_yaml, placeholder_providers)
    W->>A: export_detection_builder_for_seed(config, seed_path, job_index, num_jobs)
    A->>DW: build_detection_builder_for_seed(seed_path, ..., job_index, num_jobs)
    DW->>DW: _build_detection_spec(...)
    DW->>AD: build_config_for_seed(model_configs, columns, seed_path, job_index, num_jobs)
    AD->>FS: LocalFileSeedSource(path) — read only, no write
    AD-->>DW: "DataDesignerConfigBuilder (with PartitionBlock if num_jobs > 1)"
    DW-->>A: DataDesignerConfigBuilder
    A-->>W: DataDesignerConfigBuilder
    W-->>R: DataDesignerConfigBuilder (live callables intact)
    R->>R: inject real model providers, run workflow

Reviews (4): Last reviewed commit: "fix(detection): fail fast on missing 'de..."

Comment thread src/anonymizer/distributed.py Outdated
Comment thread src/anonymizer/engine/ndd/adapter.py

@lipikaramaswamy lipikaramaswamy left a comment

Collaborator


Validated this branch against a real Slurm/Big Iron POC on cw-pdx-cs-001.

Setup:

  • Anonymizer: feature/export-detection-config @ 42fad0efec9520515c6809a20adbb7970faa9dff
  • anonymizer-big-iron: main @ 678f905c2a06d2713b6284dee89c36e1bb6bf91f
  • big-iron: feature/gliner-service @ edce953075e4308b8cc6bf9ec291f56e435cffe3

Result:

  • Big Iron started both vLLM and GLiNER services successfully.
  • Runtime built the worker config through anonymizer.distributed:build_detection_builder.
  • Positive smoke job completed on Slurm: job 5744327, exit 0:0, elapsed 00:04:19.
  • Downstream scoring in the anonymizer-big-iron harness processed 8 records, including 2 GT-positive records, with recall/precision/F1 all 1.000 on the scored positives. We'll probably change anonymizer-big-iron to use measurement tools from #177 once it lands in main.

So this PR appears necessary and sufficient for the Anonymizer-side API/export surface needed by the Slurm path. The full end-to-end path still depends on the Big Iron GLiNER/factory support branch until that lands, but we should be OK using that branch for now. A few nits in the Greptile comments are worth addressing. And of course, make format is required.

@mvansegbroeck mvansegbroeck force-pushed the feature/export-detection-config branch 2 times, most recently from 515f74d to 0d59ecc Compare June 12, 2026 18:52
Addresses review feedback on the distributed export path:
- distributed.build_detection_builder raises a clear KeyError when the
  'detect' section is omitted, instead of a misleading KeyError on
  'gliner_threshold' that points at the wrong config level.
- NddAdapter.build_config_for_seed validates num_jobs >= 1 and
  0 <= job_index < num_jobs, so a misconfigured shard fails fast before
  the distributed run starts instead of silently processing the full seed.
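
The two fail-fast checks described in this commit message might look roughly like the sketch below. The function names `require_detect_section` and `validate_shard` are hypothetical, the error messages are invented for illustration, and the use of ValueError for the shard check is an assumption (the commit message only names KeyError for the missing 'detect' section).

```python
def require_detect_section(spec: dict) -> dict:
    """Fail with a clear KeyError if the top-level 'detect' section is absent,
    rather than failing later on a nested key such as 'gliner_threshold'."""
    if "detect" not in spec:
        raise KeyError("spec is missing the required 'detect' section")
    return spec["detect"]


def validate_shard(job_index: int, num_jobs: int) -> None:
    """Reject misconfigured shards before any work starts.

    Enforces num_jobs >= 1 and 0 <= job_index < num_jobs, including the
    num_jobs == 1 case where an out-of-range index was previously ignored.
    """
    if num_jobs < 1:
        raise ValueError(f"num_jobs must be >= 1, got {num_jobs}")
    if not 0 <= job_index < num_jobs:
        raise ValueError(
            f"job_index must satisfy 0 <= job_index < {num_jobs}, got {job_index}"
        )


# A valid single-job run passes silently.
validate_shard(job_index=0, num_jobs=1)
detect = require_detect_section({"detect": {"gliner_threshold": 0.5}})
print(detect)
```

Checking both conditions up front means a bad shard index surfaces as an immediate error on the worker instead of a job that quietly processes the whole seed.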
@mvansegbroeck mvansegbroeck force-pushed the feature/export-detection-config branch from 0d59ecc to b14e1e1 Compare June 12, 2026 18:55
@mvansegbroeck mvansegbroeck merged commit 4247433 into main Jun 12, 2026
11 checks passed
@mvansegbroeck mvansegbroeck deleted the feature/export-detection-config branch June 12, 2026 19:02