feat(detection): export detection workflow for distributed/SLURM execution #182
Run the GLiNER + LLM detection workflow on an external at-scale DataDesigner runtime
(e.g. a SLURM orchestrator) instead of in-process.
- Anonymizer.export_detection_config() → EntityDetectionWorkflow.build_detection_config()
→ adapter.build_config(): assemble the DataDesignerConfigBuilder for the detection
workflow without executing it. Extract _build_detection_spec() so the run path
(detect_and_validate_entities) and the export path build identical (model_configs,
columns); no behavior change to run().
- In-process builder factory for distributed workers. The detection workflow uses
CustomColumnConfig columns whose generator_function is a live Python callable
(DataDesigner custom columns are "library only"), which can't survive JSON
serialization. So a distributed runtime rebuilds the live builder per worker rather
than shipping a serialized config:
  - adapter.build_config_for_seed: assemble the builder while reading an EXISTING seed
    parquet (no rewrite; workers may share it) with an optional ordered
    PartitionBlock(job_index/num_jobs).
  - detection_workflow.build_detection_builder_for_seed + Anonymizer.export_detection_builder_for_seed.
  - anonymizer/distributed.py:build_detection_builder(seed_path, spec, job_index, num_jobs):
    the factory the runtime imports and calls in-process. Custom columns reference models
    by alias and get facades injected by the DataDesigner runtime, so the runtime's
    alias→server provider wiring routes their LLM calls.
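The serialization constraint described above can be illustrated with a minimal, standard-library-only sketch. ToyColumn, tag_entities, build_columns, and try_serialize are hypothetical stand-ins, not DataDesigner or Anonymizer APIs. The point is only that a config holding a live callable fails JSON encoding, while a factory called inside each worker process yields a live copy.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyColumn:
    """Hypothetical stand-in for a custom column config (not the DataDesigner class)."""

    name: str
    generator_function: Callable[[dict], dict]


def tag_entities(row: dict) -> dict:
    """Placeholder generator; a real one would call models referenced by alias."""
    return {**row, "entities": []}


def build_columns() -> list[ToyColumn]:
    """Factory a worker calls in-process, so the callable is always live."""
    return [ToyColumn(name="entities", generator_function=tag_entities)]


def try_serialize(column: ToyColumn) -> str:
    """Attempt to JSON-encode the column; report the failure instead of raising."""
    payload = {"name": column.name, "generator_function": column.generator_function}
    try:
        return json.dumps(payload)
    except TypeError as exc:
        return f"not serializable: {exc}"


columns = build_columns()
serialize_result = try_serialize(columns[0])  # json cannot encode a function object
live_output = columns[0].generator_function({"text": "hello"})
```

Shipping only the small, JSON-friendly inputs (seed path, spec, job index, job count) and rebuilding the callables on the worker side sidesteps the encoding problem entirely.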
Verified locally (DataDesigner installed): the factory builds a live builder from an
existing parquet, the seed is not rewritten, all custom generator_functions are callables
(not strings), and the ordered partition is set.
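A check along these lines could back the local verification described above. It is a sketch under assumptions: fake_factory is a hypothetical placeholder for the real builder factory, the seed is an arbitrary byte file rather than real parquet, and the returned dict shape is invented for illustration.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Hash file contents so any rewrite of the seed is detectable."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def fake_factory(seed_path: Path, job_index: int, num_jobs: int) -> dict:
    """Hypothetical placeholder: reads the seed and returns a builder-like dict."""
    _ = seed_path.read_bytes()  # read only, never write
    return {
        "generators": [lambda row: row],
        "partition": (job_index, num_jobs),
    }


def verify_factory(factory: Callable[[Path, int, int], dict], seed_path: Path) -> dict:
    """Run the factory once and collect the three properties being checked."""
    before = sha256_of(seed_path)
    built = factory(seed_path, 0, 2)
    return {
        "seed_unchanged": sha256_of(seed_path) == before,
        "all_callable": all(callable(fn) for fn in built["generators"]),
        "partition": built["partition"],
    }


with tempfile.TemporaryDirectory() as tmp:
    seed = Path(tmp) / "seed.parquet"
    seed.write_bytes(b"placeholder bytes, not real parquet")
    report = verify_factory(fake_factory, seed)
```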
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Greptile Summary
This PR adds an export path so the GLiNER + LLM detection workflow can be handed to an external distributed runtime (e.g. SLURM) instead of running in-process. The core detection logic is unchanged;
Confidence Score: 5/5
Safe to merge; both previously raised concerns are resolved and no new defects were found.
The refactoring is well-contained:
No files require special attention.
Important Files Changed
Validated this branch against a real Slurm/Big Iron POC on cw-pdx-cs-001.
Setup:
- Anonymizer: feature/export-detection-config@42fad0efec9520515c6809a20adbb7970faa9dff
- anonymizer-big-iron: main@678f905c2a06d2713b6284dee89c36e1bb6bf91f
- big-iron: feature/gliner-service@edce953075e4308b8cc6bf9ec291f56e435cffe3
Result:
- Big Iron started both vLLM and GLiNER services successfully.
- Runtime built the worker config through anonymizer.distributed:build_detection_builder.
- Positive smoke job completed on Slurm: job 5744327, exit 0:0, elapsed 00:04:19.
- Downstream scoring in the anonymizer-big-iron harness processed 8 records, including 2 GT-positive records, with recall/precision/F1 all 1.000 on the scored positives. We'll probably change anonymizer-big-iron to use measurement tools from #177 once it lands in main.
So this PR appears necessary and sufficient for the Anonymizer-side API/export surface needed by the Slurm path. The full end-to-end path still depends on the Big Iron GLiNER/factory support branch until that lands, but we should be OK to use that branch for now. A few nits in the Greptile comments are worth addressing. And of course, make format is required.
515f74d to 0d59ecc
Addresses review feedback on the distributed export path:
- distributed.build_detection_builder raises a clear KeyError when the 'detect' section is omitted, instead of a misleading KeyError on 'gliner_threshold' that points at the wrong config level.
- NddAdapter.build_config_for_seed validates num_jobs >= 1 and 0 <= job_index < num_jobs, so a misconfigured shard fails fast before the distributed run starts instead of silently processing the full seed.
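The two fail-fast checks described above might look roughly like the following. This is a sketch, not the code in the PR: the function names, the ValueError type for bad shard arguments, the error messages, and the contiguous row split in partition_bounds are all assumptions. Only the KeyError for a missing 'detect' section and the bounds num_jobs >= 1 and 0 <= job_index < num_jobs come from the commit message.

```python
def require_detect_section(spec: dict) -> dict:
    """Raise a KeyError naming the missing section, not a nested key."""
    if "detect" not in spec:
        raise KeyError("spec is missing the required 'detect' section")
    return spec["detect"]


def validate_partition(job_index: int, num_jobs: int) -> None:
    """Reject misconfigured shards before any work starts (ValueError is an assumption)."""
    if num_jobs < 1:
        raise ValueError(f"num_jobs must be >= 1, got {num_jobs}")
    if not 0 <= job_index < num_jobs:
        raise ValueError(
            f"job_index must satisfy 0 <= job_index < {num_jobs}, got {job_index}"
        )


def partition_bounds(total_rows: int, job_index: int, num_jobs: int) -> tuple[int, int]:
    """Assumed contiguous, ordered split: shard i covers rows [start, end)."""
    validate_partition(job_index, num_jobs)
    start = job_index * total_rows // num_jobs
    end = (job_index + 1) * total_rows // num_jobs
    return start, end


# Three shards over ten rows cover every row exactly once, in order.
shards = [partition_bounds(10, i, 3) for i in range(3)]
```

Validating before computing bounds means an out-of-range job_index can never produce an empty or full-range slice that looks like a successful run.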
0d59ecc to b14e1e1
Summary
Type of Change
Testing
- make test passes locally
- make check passes locally (format + lint + typecheck + lock-check)
Documentation
- make docs-build passes locally
Related Issues