Fix supervisor exiting before all workers complete tests by ianks · Pull Request #384 · Shopify/ci-queue

ianks · 2026-03-26T20:36:35Z

Situation

The Summary/report step can pass and exit while test workers are still running, causing real failures to be missed. Two compounding bugs:

Heartbeat countdown: The heartbeat thread stopped sending heartbeats after config.timeout seconds (a countdown in base.rb), causing running ZSET entries to go stale for long-running tests.
Heartbeat poisoning: When a stale worker loses its lease via reserve_lost and then finishes its test, acknowledge.lua's unconditional sadd(processed) poisons the current owner's heartbeat — heartbeat.lua checked sismember(processed) before the ownership check, so the real owner's heartbeat stopped, the entry went stale, reserve_lost cleaned it up, and the running set emptied while the owner was still executing.

Execution

Lease-based authorization replaces worker-queue-key ownership checks. Each reserve / reserve_lost generates a monotonic lease ID stored in a new leases hash. heartbeat.lua, acknowledge.lua, and requeue.lua now check the lease instead of worker identity or processed-set membership.

This eliminates the poisoning vector entirely: when reserve_lost steals a test, it generates a new lease. The old worker's heartbeat and acknowledge are cleanly rejected by lease mismatch — no interaction with the processed set needed.

Key design decisions:

acknowledge.lua gates zrem/hdel/sadd on lease match, but sadd(processed) remains unconditional to avoid livelock when both workers take longer than timeout (first finisher wins)
The owners hash (entry → worker_queue_key) is preserved for release.lua and requeued-by routing, which identify entries by worker, not by lease
reserve_lost.lua still cleans up stale processed entries from running (crash recovery)
The heartbeat countdown removal from base.rb is kept — heartbeats tick indefinitely while a test is running

Tradeoff: stuck tests

Removing the heartbeat countdown means a truly stuck test will heartbeat forever. Existing mitigations: max_test_duration kills tests at the application level, and report_timeout / inactive_workers_timeout ensure the supervisor eventually exits. A dedicated heartbeat cap could be added as a follow-up.

The heartbeat thread countdown limited heartbeats to config.timeout seconds, causing running ZSET entries to go stale for long tests. Combined with acknowledge.lua's unconditional zrem, a stale worker whose test was stolen via reserve_lost could remove the running entry when acknowledging, making exhausted? return true prematurely. Three-layer fix: - Remove heartbeat countdown so heartbeats continue for full test duration - Guard acknowledge/requeue zrem on ownership so non-owners can't empty the running set - Clean up stale processed entries in reserve_lost

The heartbeat poisoning bug: when a stale worker's acknowledge adds an entry to the processed set (sadd is unconditional), heartbeat.lua's sismember(processed) check causes the current lease holder's heartbeat to stop, making the entry go stale and get cleaned up from running while the holder is still executing the test. Fix by introducing per-reservation lease IDs. Each reserve/reserve_lost generates a monotonic lease ID stored in a separate 'leases' hash. heartbeat.lua, acknowledge.lua, and requeue.lua now check the lease instead of worker_queue_key ownership or processed set membership. The owners hash (entry → worker_queue_key) is preserved for release.lua and requeued-by routing, which need to identify entries by worker, not by lease.

- Update Python distributed.py to pass leases_key and lease_counter_key to reserve, reserve_lost, acknowledge, and requeue Lua scripts - Parse [entry, lease] array return from reserve/reserve_lost - Store and pass lease IDs through acknowledge/requeue - Update heartbeat integration test: with lease-based heartbeat the entry stays alive for the full test duration, so the test is never stolen and no RESERVED_LOST_TEST warning is generated

The test_lost_test_with_heartbeat_monitor test was using --heartbeat 1, meaning a test would be considered 'lost' after just 1 second without a heartbeat tick. Since the heartbeat thread ticks at most every ~1 second, TruffleRuby's slower thread scheduling left essentially zero margin, causing the test to be stolen and a warning to be generated. Increasing to --heartbeat 5 gives ample margin for the heartbeat to tick while still testing the core behavior: heartbeat keeps the test alive so it is never stolen.

danielahrnsbrak · 2026-03-30T14:20:59Z

          test_id = CI::Queue::QueueEntry.test_id(entry)
          assert_reserved!(test_id)
          entry = reserved_entries.fetch(test_id, entry)
+          lease = @reserved_leases[test_id]


this reads @reserved_leases[test_id] but never deletes it, which could leak the old lease on successful requeue. Delete and restore on failure, matching acknowledge's pattern at line 189 maybe

done in 7ebc11c

danielahrnsbrak · 2026-03-30T14:21:48Z


          yield
        ensure
          heartbeat_state.set(:reset) if heartbeat_enabled?


can we kill this?

done in 7ebc11c

Tried removing it but it caused the subprocess to hang — the heartbeat thread keeps ticking the last entry indefinitely after the test finishes, which prevents the monitor process from exiting. Restored it and added a :reset handler in the heartbeat loop that clears the ticking state between tests.

…at state

…ost-test reclamation A stuck test would heartbeat forever (since PR #384 removed the countdown), preventing other workers from reclaiming it via reserve_lost. Add --heartbeat-max-test-duration N (defaults to timeout*10) so the heartbeat thread stops ticking after N seconds per test. Once ticking stops the ZSET score goes stale and reserve_lost can steal the entry. The started_at timestamp is passed through the tick state from with_heartbeat so elapsed is measured from when the test actually started, not from when the heartbeat thread woke up (which could be up to 1 second late due to the @cond.wait(1) timeout causing a skewed stale threshold). Fixes #395

ianks added 5 commits March 26, 2026 16:35

Fix Python decode error: _reserve returns raw bytes for poll to decode

feac86b

danielahrnsbrak reviewed Mar 30, 2026

View reviewed changes

ianks requested a review from danielahrnsbrak March 30, 2026 15:19

danielahrnsbrak approved these changes Mar 30, 2026

View reviewed changes

ianks force-pushed the fix-supervisor-premature-exhaustion branch from 7ebc11c to 7766b3a Compare March 30, 2026 15:49

Address review: fix lease leak in requeue, remove dead :reset heartbe…

1b88d74

…at state

ianks force-pushed the fix-supervisor-premature-exhaustion branch from 7766b3a to 1b88d74 Compare March 30, 2026 17:14

ianks merged commit fd9e0f6 into main Mar 30, 2026
22 checks passed

ianks mentioned this pull request Apr 9, 2026

Add max_test_duration cap to heartbeat thread #395

Closed

ianks mentioned this pull request Apr 9, 2026

Add heartbeat_max_test_duration to cap per-test heartbeat duration #397

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix supervisor exiting before all workers complete tests#384

Fix supervisor exiting before all workers complete tests#384
ianks merged 6 commits intomainfrom
fix-supervisor-premature-exhaustion

ianks commented Mar 26, 2026 •

edited

Loading

Uh oh!

danielahrnsbrak Mar 30, 2026 •

edited

Loading

Uh oh!

ianks Mar 30, 2026

Uh oh!

danielahrnsbrak Mar 30, 2026

Uh oh!

ianks Mar 30, 2026

Uh oh!

ianks Mar 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ianks commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Situation

Execution

Tradeoff: stuck tests

Uh oh!

danielahrnsbrak Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ianks Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

danielahrnsbrak Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

ianks Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

ianks Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ianks commented Mar 26, 2026 •

edited

Loading

danielahrnsbrak Mar 30, 2026 •

edited

Loading