Skip to content

fix: prevent duplicate DAGTask names in foreach + split-switch flows#3204

Open
odncode wants to merge 4 commits into
Netflix:masterfrom
odncode:fix-argo-duplicate-dag-task
Open

fix: prevent duplicate DAGTask names in foreach + split-switch flows#3204
odncode wants to merge 4 commits into
Netflix:masterfrom
odncode:fix-argo-duplicate-dag-task

Conversation

@odncode
Copy link
Copy Markdown

@odncode odncode commented May 19, 2026

Summary

Fixes the duplicate DAGTask name bug reported in #3196.

When a split-switch's matching_conditional_join resolves to the same node as a foreach's matching_join, _visit in _dag_templates emitted two DAGTasks with the same name. Argo rejects this at submit time with:

templates.<flow> sorting failed: duplicated nodeName <step>

Root Cause

The foreach branch (around line 1918) appends a join_foreach_task to dag_tasks but does not add the matching_join name to the seen list.

When a later traversal arrives at the same node via the split-switch handler, the seen check on line 1240 does not fire, and a second DAGTask is emitted.

Fix

One-line change:

seen.append(self.graph[node.matching_join].name)

added immediately after:

dag_tasks.append(join_foreach_task)

Testing

Added test/unit/test_argo_foreach_dedup.py, a regression test that:

Existing tests in test/unit/test_argo_workflows_cli.py continue to pass (16/16, as confirmed by the original reporter).

Closes #3196

Disclosure

I used Claude (Anthropic) to help understand the codebase. The fix and rationale reflect my own understanding, and I’m happy to walk through the logic during review.

When a split-switch's matching_conditional_join resolves to the same node
as a foreach's matching_join, _visit in _dag_templates emitted two DAGTasks
with the same name. Argo rejects this at submit time with:

    templates.<flow> sorting failed: duplicated nodeName <step>

Root cause: the foreach branch appended a join DAGTask but did not add the
matching_join name to the seen list, allowing a later traversal via the
split-switch handler to emit a second DAGTask for the same node.

Fix: add seen.append(self.graph[node.matching_join].name) immediately after
the foreach join task is appended.

Includes a regression test that compiles a minimal repro flow and checks
for duplicate task names in the generated Argo JSON.

Closes Netflix#3196
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR fixes a duplicate DAGTask name bug in Argo Workflows by adding the foreach matching_join node to the seen list immediately after emitting join_foreach_task, so that any subsequent traversal arriving at the same node via a split-switch's matching_conditional_join finds it already visited and does not emit a second DAGTask.

  • argo_workflows.py: Single-line fix — seen.append(self.graph[node.matching_join].name) — inserted after dag_tasks.append(join_foreach_task), correctly closing the deduplication gap.
  • test_argo_foreach_dedup.py: New regression test that directly instantiates ArgoWorkflows (bypassing CLI/cloud requirements) and asserts no DAG template contains duplicate task names; has three unused imports (json, subprocess, sys) that should be removed.

Confidence Score: 4/5

The production fix is a precise, well-reasoned one-liner that correctly closes the deduplication gap; the test scaffolding works but the chosen flow topology does not actually trigger the bug path it is meant to guard against.

The core change to argo_workflows.py is correct and minimal. However, the accompanying regression test uses a flow graph where the split-switch matching_conditional_join resolves to end (not join_step), so the foreach and split-switch join pointers are different nodes and the duplicate-emission condition never fires.

test/unit/test_argo_foreach_dedup.py — the flow topology should be revised so that all branches of the split-switch converge on the foreach join node.

Important Files Changed

Filename Overview
metaflow/plugins/argo/argo_workflows.py One-line fix adds the foreach's matching_join name to seen immediately after emitting join_foreach_task, correctly preventing a second DAGTask from being emitted for the same node when a split-switch traversal later arrives at it.
test/unit/test_argo_foreach_dedup.py Regression test bypasses CLI and directly exercises _dag_templates(). Has three unused imports (json, subprocess, sys). The specific flow topology in REPRO_FLOW does not exercise the exact code path that triggers the bug, so the assertion passes with or without the fix.

Reviews (4): Last reviewed commit: "test: use FlowGraph to build real graph ..." | Re-trigger Greptile

Comment thread test/unit/test_argo_foreach_dedup.py Outdated
Comment thread test/unit/test_argo_foreach_dedup.py Outdated
Addresses review feedback: the subprocess approach always skipped in CI
due to cloud datastore requirements. The new test builds a minimal graph
stand-in and calls _dag_templates() directly, consistent with the
existing test patterns in test_argo_workflows_cli.py.
Comment thread test/unit/test_argo_foreach_dedup.py Outdated
Fixed dunder method dispatch issue (class-level not instance-level).
Added enable_heartbeat_daemon attribute needed by _dag_templates.
Verified locally: new test passes, all 16 existing tests still pass.
Comment thread test/unit/test_argo_foreach_dedup.py Outdated
Previous mock graph did not trigger the duplicate-emission condition
because _parse_conditional_branches requires real graph attributes.
Now uses Metaflow's FlowGraph to parse the repro flow directly,
bypassing CLI and cloud datastore requirements.

Verified: test FAILS without the fix (detects duplicate 'join-step'),
PASSES with it. All 16 existing tests unaffected.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Argo compiler emits duplicate DAGTask for foreach matching_join

1 participant