JOIN/RACE inside a loop stalls on iteration ≥2 (continue_as_new resets sub-orchestration id counter → child id collision) #230

Description

@crprashant

Summary

A df.join() / df.race() placed inside a df.loop() stalls permanently once the loop reaches its second iteration. The instance never completes or fails; it hangs after iteration 1 and stays in the running state indefinitely.

This is independent of df.break(); it reproduces with any JOIN/RACE in a loop body that runs ≥ 2 iterations. It was discovered while implementing #148 / #229. Test 4 in tests/e2e/sql/22_break_in_join_race.sql is deliberately scoped to break on iteration 1 specifically to avoid tripping this bug.

Mechanism

  • The loop node calls ctx.continue_as_new(...) once per iteration (execute_loop_node, src/orchestrations/execute_function_graph.rs ~L644).
  • JOIN (execute_join_node, ~L894) and RACE (execute_race_node, ~L987–988) schedule their branches with ctx.schedule_sub_orchestration(SUBTREE_NAME, input) — no explicit child instance id and no per-iteration discriminator (branch inputs carry only graph, node_id, results, vars, label).
  • duroxide derives the child (sub-orchestration) instance id deterministically from the parent instance id plus a per-instance counter seeded from orchestration history. continue_as_new truncates/restarts history, so that counter resets to the same starting value on every iteration.
  • Result: iteration 2's JOIN/RACE derives the same child instance id(s) as iteration 1. Those ids already exist in the provider store as Completed, so duroxide does not re-run them or deliver a fresh completion signal — the parent's await on the sub-orchestration future never resolves → permanent stall.
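The collision described above can be illustrated with a small standalone model. This is only a sketch: `ToyParent`, `derive_child_id`, the `parent::sub::N` id format, and `colliding_ids` are hypothetical names invented for illustration, not duroxide's real types or id scheme. The only behavior it assumes is what the bullets state: the child id is a function of the parent id plus a counter, and continue_as_new restarts that counter.

```rust
use std::collections::HashSet;

/// Toy stand-in for a parent orchestration instance (not a duroxide type).
struct ToyParent {
    instance_id: String,
    /// Per-instance counter; in this model it is rebuilt from history.
    next_child: u32,
}

impl ToyParent {
    fn new(instance_id: &str) -> Self {
        Self { instance_id: instance_id.to_string(), next_child: 0 }
    }

    /// Hypothetical derivation: parent id plus counter.
    /// The real id format is an assumption; the issue does not give it.
    fn derive_child_id(&mut self) -> String {
        let id = format!("{}::sub::{}", self.instance_id, self.next_child);
        self.next_child += 1;
        id
    }

    /// Models continue_as_new: history is truncated, so the counter restarts.
    fn continue_as_new(&mut self) {
        self.next_child = 0;
    }
}

/// Runs `iterations` loop iterations with `branches` JOIN/RACE branches each
/// and returns every derived id that was already present in the toy store.
fn colliding_ids(iterations: u32, branches: u32) -> Vec<String> {
    let mut store: HashSet<String> = HashSet::new();
    let mut parent = ToyParent::new("parent");
    let mut collisions = Vec::new();
    for _ in 0..iterations {
        for _ in 0..branches {
            let id = parent.derive_child_id();
            // `insert` returns false when the id already exists in the store.
            if !store.insert(id.clone()) {
                collisions.push(id);
            }
        }
        parent.continue_as_new();
    }
    collisions
}

fn main() {
    for id in colliding_ids(2, 2) {
        println!("collision: {id}");
    }
}
```

In this model a single iteration produces no collisions, while every branch of iteration 2 reuses an id from iteration 1, which matches the "iterates ≥ 2 times" condition in the report.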

Symptom / impact

  • Any workflow with a JOIN/RACE inside a loop that iterates ≥ 2 times hangs after the first iteration.
  • The instance remains running forever (no completion, no failure, no timeout).

Repro (illustrative)

A loop whose body contains a JOIN and is allowed to iterate at least twice, conceptually:

df.loop(
  df.seq(
    df.join(df.sql('SELECT 1'), df.sql('SELECT 2')),
    <advance / while-condition that permits a 2nd iteration>
  )
)

Concrete in-repo reference: tests/e2e/sql/22_break_in_join_race.sql Test 4 (break in IF-in-JOIN-in-loop) passes only because it breaks on iteration 1. Moving the break to iteration 2 — or removing it so the loop iterates again — reproduces the stall.

Suggested fix direction

Give each loop iteration's sub-orchestrations a unique, replay-stable instance id so they don't collide across continue_as_new. Options:

  • Thread a monotonic iteration counter (persisted in the loop's continue_as_new input / vars) into the child instance id or branch input so duroxide derives distinct ids per iteration; or
  • Schedule JOIN/RACE branches with an explicit instance id that includes the iteration ordinal; or
  • Confirm with duroxide whether sub-orchestration id derivation can be made stable across continue_as_new, and adopt the recommended pattern.

Any fix must remain deterministic and replay-safe, because src/orchestrations/execute_function_graph.rs is orchestration code.
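As a rough sketch of the first two options, the id for each branch could be built purely from values that are already persisted, so replay yields the same string every time. Everything here is hypothetical: `LoopState`, `child_instance_id`, `ids_for_iteration`, and the id layout are illustrative names, and whether duroxide accepts an explicit child instance id in this form is an assumption that would need to be confirmed.

```rust
/// Hypothetical loop state carried in the continue_as_new input / vars.
#[derive(Debug, Clone, PartialEq)]
struct LoopState {
    /// Monotonic iteration ordinal, starting at 0.
    iteration: u64,
}

impl LoopState {
    /// State to pass into the next continue_as_new call.
    fn next(&self) -> LoopState {
        LoopState { iteration: self.iteration + 1 }
    }
}

/// Hypothetical id builder. It uses only its arguments (no clock, no RNG),
/// so the same persisted inputs always produce the same id on replay.
fn child_instance_id(parent_id: &str, iteration: u64, node_id: &str, branch: usize) -> String {
    format!("{parent_id}::iter-{iteration}::{node_id}::branch-{branch}")
}

/// Ids for all branches of one JOIN/RACE node in one loop iteration.
fn ids_for_iteration(
    parent_id: &str,
    state: &LoopState,
    node_id: &str,
    branches: usize,
) -> Vec<String> {
    (0..branches)
        .map(|b| child_instance_id(parent_id, state.iteration, node_id, b))
        .collect()
}

fn main() {
    let first = LoopState { iteration: 0 };
    let second = first.next();
    for id in ids_for_iteration("parent", &first, "join1", 2) {
        println!("{id}");
    }
    for id in ids_for_iteration("parent", &second, "join1", 2) {
        println!("{id}");
    }
}
```

Because the ordinal travels with the loop's own input rather than with history, it survives the history truncation that resets the per-instance counter, and ids from different iterations never overlap.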
Labels: bug
