Summary
A df.join() / df.race() placed inside a df.loop() stalls permanently once the loop reaches its second iteration. The instance never completes or fails — it simply hangs after iteration 1, sitting in running indefinitely.
This is independent of df.break(); it reproduces with any JOIN/RACE in a loop body that runs ≥ 2 iterations. It was discovered while implementing #148 / #229. Test 4 in tests/e2e/sql/22_break_in_join_race.sql is deliberately scoped to break on iteration 1 specifically to avoid tripping this bug.
Mechanism
- The loop node calls ctx.continue_as_new(...) once per iteration (execute_loop_node, src/orchestrations/execute_function_graph.rs ~L644).
- JOIN (execute_join_node, ~L894) and RACE (execute_race_node, ~L987–988) schedule their branches with ctx.schedule_sub_orchestration(SUBTREE_NAME, input), with no explicit child instance id and no per-iteration discriminator (branch inputs carry only graph, node_id, results, vars, label).
- duroxide derives the child (sub-orchestration) instance id deterministically from the parent instance id plus a per-instance counter seeded from orchestration history. continue_as_new truncates/restarts history, so that counter resets to the same starting value on every iteration.
- Result: iteration 2's JOIN/RACE derives the same child instance id(s) as iteration 1. Those ids already exist in the provider store as Completed, so duroxide does not re-run them or deliver a fresh completion signal. The parent's await on the sub-orchestration future never resolves, so the instance stalls permanently.
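The collision described above can be modelled in a few lines of plain Rust. This is a toy sketch, not duroxide code: the Execution struct, the "::sub::" id format, and the HashSet standing in for the provider store are all assumptions made for illustration. It shows only that a counter reseeded on every continue_as_new yields the same child ids in each iteration.

```rust
use std::collections::HashSet;

/// Toy stand-in for one execution of the parent orchestration.
/// The id format is hypothetical; it only mirrors the
/// "parent instance id + per-instance counter" idea from the issue.
struct Execution<'a> {
    parent_id: &'a str,
    /// Reseeded from the (truncated) history after each continue_as_new.
    next_child: u64,
}

impl<'a> Execution<'a> {
    fn new(parent_id: &'a str) -> Self {
        Execution { parent_id, next_child: 0 }
    }

    /// Derives the next child id and advances the counter.
    fn derive_child_id(&mut self) -> String {
        let id = format!("{}::sub::{}", self.parent_id, self.next_child);
        self.next_child += 1;
        id
    }
}

/// Simulates `iterations` loop iterations that each schedule `branches`
/// children, and returns every id that was already present in the
/// (simulated) provider store at the time it was scheduled.
fn colliding_ids(parent_id: &str, iterations: usize, branches: usize) -> Vec<String> {
    let mut store: HashSet<String> = HashSet::new();
    let mut collisions = Vec::new();
    for _ in 0..iterations {
        // continue_as_new: a fresh execution, so the counter starts over.
        let mut exec = Execution::new(parent_id);
        for _ in 0..branches {
            let id = exec.derive_child_id();
            // `insert` returns false when the id already exists in the store.
            if !store.insert(id.clone()) {
                collisions.push(id);
            }
        }
    }
    collisions
}
```

With one iteration the model reports no collisions. With two iterations and two branches, both ids derived in iteration 2 are already in the store, which corresponds to the point where the real parent would wait on children that never run again.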
Symptom / impact
- Any workflow with a JOIN/RACE inside a loop that iterates ≥ 2 times hangs after the first iteration.
- The instance remains running forever (no completion, no failure, no timeout).
Repro (illustrative)
A loop whose body contains a JOIN and is allowed to iterate at least twice, conceptually:
df.loop(
  df.seq(
    df.join(df.sql('SELECT 1'), df.sql('SELECT 2')),
    <advance / while-condition that permits a 2nd iteration>
  )
)
Concrete in-repo reference: tests/e2e/sql/22_break_in_join_race.sql Test 4 (break in IF-in-JOIN-in-loop) passes only because it breaks on iteration 1. Moving the break to iteration 2 — or removing it so the loop iterates again — reproduces the stall.
Suggested fix direction
Give each loop iteration's sub-orchestrations a unique, replay-stable instance id so they don't collide across continue_as_new. Options:
- Thread a monotonic iteration counter (persisted in the loop's continue_as_new input / vars) into the child instance id or branch input so duroxide derives distinct ids per iteration; or
- Schedule JOIN/RACE branches with an explicit instance id that includes the iteration ordinal; or
- Confirm with duroxide whether sub-orchestration id derivation can be made stable across continue_as_new, and adopt the recommended pattern.
Any fix must remain deterministic / replay-safe, because src/orchestrations/execute_function_graph.rs is orchestration code.
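As a rough sketch of the first two options, a child id could be built from values that are all either persisted or static, so it stays the same on replay but differs between iterations. The helper names and the id layout below are hypothetical. How such an id would be passed to duroxide when scheduling the branch is not shown here and would need to be checked against its API.

```rust
/// Hypothetical helper: builds a child instance id that is unique per loop
/// iteration. Every input is replay-stable: `iteration` is assumed to be
/// carried in the loop's continue_as_new input, while `node_id` and `branch`
/// come from the static graph definition.
fn iteration_scoped_child_id(
    parent_id: &str,
    iteration: u64,
    node_id: &str,
    branch: usize,
) -> String {
    format!("{}::iter-{}::{}::branch-{}", parent_id, iteration, node_id, branch)
}

/// The iteration ordinal to persist in the next continue_as_new input.
/// It is a pure function of the current ordinal, with no clock or randomness.
fn next_iteration(current: u64) -> u64 {
    current + 1
}
```

Because the id depends only on its arguments, replaying the same iteration reproduces the same id, while iteration 2 gets an id that iteration 1 never used.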
["bug"]