Fix supervisor exiting before all workers complete tests#384
Merged
Conversation
The heartbeat thread countdown limited heartbeats to config.timeout seconds, causing running ZSET entries to go stale for long tests. Combined with acknowledge.lua's unconditional zrem, a stale worker whose test was stolen via reserve_lost could remove the running entry when acknowledging, making exhausted? return true prematurely. Three-layer fix: - Remove heartbeat countdown so heartbeats continue for full test duration - Guard acknowledge/requeue zrem on ownership so non-owners can't empty the running set - Clean up stale processed entries in reserve_lost
The heartbeat poisoning bug: when a stale worker's acknowledge adds an entry to the processed set (sadd is unconditional), heartbeat.lua's sismember(processed) check causes the current lease holder's heartbeat to stop, making the entry go stale and get cleaned up from running while the holder is still executing the test. Fix by introducing per-reservation lease IDs. Each reserve/reserve_lost generates a monotonic lease ID stored in a separate 'leases' hash. heartbeat.lua, acknowledge.lua, and requeue.lua now check the lease instead of worker_queue_key ownership or processed set membership. The owners hash (entry → worker_queue_key) is preserved for release.lua and requeued-by routing, which need to identify entries by worker, not by lease.
- Update Python distributed.py to pass leases_key and lease_counter_key to reserve, reserve_lost, acknowledge, and requeue Lua scripts - Parse [entry, lease] array return from reserve/reserve_lost - Store and pass lease IDs through acknowledge/requeue - Update heartbeat integration test: with lease-based heartbeat the entry stays alive for the full test duration, so the test is never stolen and no RESERVED_LOST_TEST warning is generated
The test_lost_test_with_heartbeat_monitor test was using --heartbeat 1, meaning a test would be considered 'lost' after just 1 second without a heartbeat tick. Since the heartbeat thread ticks at most every ~1 second, TruffleRuby's slower thread scheduling left essentially zero margin, causing the test to be stolen and a warning to be generated. Increasing to --heartbeat 5 gives ample margin for the heartbeat to tick while still testing the core behavior: heartbeat keeps the test alive so it is never stolen.
| test_id = CI::Queue::QueueEntry.test_id(entry) | ||
| assert_reserved!(test_id) | ||
| entry = reserved_entries.fetch(test_id, entry) | ||
| lease = @reserved_leases[test_id] |
Contributor
There was a problem hiding this comment.
this reads @reserved_leases[test_id] but never deletes it, which could leak the old lease on successful requeue. Delete and restore on failure, matching acknowledge's pattern at line 189 maybe
|
|
||
| yield | ||
| ensure | ||
| heartbeat_state.set(:reset) if heartbeat_enabled? |
Contributor
Author
There was a problem hiding this comment.
Tried removing it but it caused the subprocess to hang — the heartbeat thread keeps ticking the last entry indefinitely after the test finishes, which prevents the monitor process from exiting. Restored it and added a :reset handler in the heartbeat loop that clears the ticking state between tests.
danielahrnsbrak
approved these changes
Mar 30, 2026
7ebc11c to
7766b3a
Compare
7766b3a to
1b88d74
Compare
ianks
added a commit
that referenced
this pull request
Apr 9, 2026
…ost-test reclamation A stuck test would heartbeat forever (since PR #384 removed the countdown), preventing other workers from reclaiming it via reserve_lost. Add --heartbeat-max-test-duration N (defaults to timeout*10) so the heartbeat thread stops ticking after N seconds per test. Once ticking stops the ZSET score goes stale and reserve_lost can steal the entry. The started_at timestamp is passed through the tick state from with_heartbeat so elapsed is measured from when the test actually started, not from when the heartbeat thread woke up (which could be up to 1 second late due to the @cond.wait(1) timeout causing a skewed stale threshold). Fixes #395
ianks
added a commit
that referenced
this pull request
Apr 17, 2026
…ost-test reclamation A stuck test would heartbeat forever (since PR #384 removed the countdown), preventing other workers from reclaiming it via reserve_lost. Add --heartbeat-max-test-duration N (defaults to timeout*10) so the heartbeat thread stops ticking after N seconds per test. Once ticking stops the ZSET score goes stale and reserve_lost can steal the entry. The started_at timestamp is passed through the tick state from with_heartbeat so elapsed is measured from when the test actually started, not from when the heartbeat thread woke up (which could be up to 1 second late due to the @cond.wait(1) timeout causing a skewed stale threshold). Fixes #395
ianks
added a commit
that referenced
this pull request
Apr 17, 2026
…ost-test reclamation A stuck test would heartbeat forever (since PR #384 removed the countdown), preventing other workers from reclaiming it via reserve_lost. Add --heartbeat-max-test-duration N (defaults to timeout*10) so the heartbeat thread stops ticking after N seconds per test. Once ticking stops the ZSET score goes stale and reserve_lost can steal the entry. The started_at timestamp is passed through the tick state from with_heartbeat so elapsed is measured from when the test actually started, not from when the heartbeat thread woke up (which could be up to 1 second late due to the @cond.wait(1) timeout causing a skewed stale threshold). Fixes #395
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Situation
The Summary/report step can pass and exit while test workers are still running, causing real failures to be missed. Two compounding bugs:
Heartbeat countdown: The heartbeat thread stopped sending heartbeats after
config.timeoutseconds (a countdown inbase.rb), causing running ZSET entries to go stale for long-running tests.Heartbeat poisoning: When a stale worker loses its lease via
reserve_lostand then finishes its test,acknowledge.lua's unconditionalsadd(processed)poisons the current owner's heartbeat —heartbeat.luacheckedsismember(processed)before the ownership check, so the real owner's heartbeat stopped, the entry went stale,reserve_lostcleaned it up, and the running set emptied while the owner was still executing.Execution
Lease-based authorization replaces worker-queue-key ownership checks. Each
reserve/reserve_lostgenerates a monotonic lease ID stored in a newleaseshash.heartbeat.lua,acknowledge.lua, andrequeue.luanow check the lease instead of worker identity or processed-set membership.This eliminates the poisoning vector entirely: when
reserve_loststeals a test, it generates a new lease. The old worker's heartbeat and acknowledge are cleanly rejected by lease mismatch — no interaction with the processed set needed.Key design decisions:
acknowledge.luagateszrem/hdel/saddon lease match, butsadd(processed)remains unconditional to avoid livelock when both workers take longer than timeout (first finisher wins)ownershash (entry → worker_queue_key) is preserved forrelease.luaand requeued-by routing, which identify entries by worker, not by leasereserve_lost.luastill cleans up stale processed entries from running (crash recovery)base.rbis kept — heartbeats tick indefinitely while a test is runningTradeoff: stuck tests
Removing the heartbeat countdown means a truly stuck test will heartbeat forever. Existing mitigations:
max_test_durationkills tests at the application level, andreport_timeout/inactive_workers_timeoutensure the supervisor eventually exits. A dedicated heartbeat cap could be added as a follow-up.