Skip to content

Fix supervisor exiting before all workers complete tests#384

Merged
ianks merged 6 commits intomainfrom
fix-supervisor-premature-exhaustion
Mar 30, 2026
Merged

Fix supervisor exiting before all workers complete tests#384
ianks merged 6 commits intomainfrom
fix-supervisor-premature-exhaustion

Conversation

@ianks
Copy link
Copy Markdown
Contributor

@ianks ianks commented Mar 26, 2026

Situation

The Summary/report step can pass and exit while test workers are still running, causing real failures to be missed. Two compounding bugs:

  1. Heartbeat countdown: The heartbeat thread stopped sending heartbeats after config.timeout seconds (a countdown in base.rb), causing running ZSET entries to go stale for long-running tests.

  2. Heartbeat poisoning: When a stale worker loses its lease via reserve_lost and then finishes its test, acknowledge.lua's unconditional sadd(processed) poisons the current owner's heartbeat — heartbeat.lua checked sismember(processed) before the ownership check, so the real owner's heartbeat stopped, the entry went stale, reserve_lost cleaned it up, and the running set emptied while the owner was still executing.

Execution

Lease-based authorization replaces worker-queue-key ownership checks. Each reserve / reserve_lost generates a monotonic lease ID stored in a new leases hash. heartbeat.lua, acknowledge.lua, and requeue.lua now check the lease instead of worker identity or processed-set membership.

This eliminates the poisoning vector entirely: when reserve_lost steals a test, it generates a new lease. The old worker's heartbeat and acknowledge are cleanly rejected by lease mismatch — no interaction with the processed set needed.

Key design decisions:

  • acknowledge.lua gates zrem/hdel/sadd on lease match, but sadd(processed) remains unconditional to avoid livelock when both workers take longer than timeout (first finisher wins)
  • The owners hash (entry → worker_queue_key) is preserved for release.lua and requeued-by routing, which identify entries by worker, not by lease
  • reserve_lost.lua still cleans up stale processed entries from running (crash recovery)
  • The heartbeat countdown removal from base.rb is kept — heartbeats tick indefinitely while a test is running

Tradeoff: stuck tests

Removing the heartbeat countdown means a truly stuck test will heartbeat forever. Existing mitigations: max_test_duration kills tests at the application level, and report_timeout / inactive_workers_timeout ensure the supervisor eventually exits. A dedicated heartbeat cap could be added as a follow-up.

ianks added 5 commits March 26, 2026 16:35
The heartbeat thread countdown limited heartbeats to config.timeout
seconds, causing running ZSET entries to go stale for long tests.
Combined with acknowledge.lua's unconditional zrem, a stale worker
whose test was stolen via reserve_lost could remove the running entry
when acknowledging, making exhausted? return true prematurely.

Three-layer fix:
- Remove heartbeat countdown so heartbeats continue for full test duration
- Guard acknowledge/requeue zrem on ownership so non-owners can't empty
  the running set
- Clean up stale processed entries in reserve_lost
The heartbeat poisoning bug: when a stale worker's acknowledge adds an
entry to the processed set (sadd is unconditional), heartbeat.lua's
sismember(processed) check causes the current lease holder's heartbeat
to stop, making the entry go stale and get cleaned up from running
while the holder is still executing the test.

Fix by introducing per-reservation lease IDs. Each reserve/reserve_lost
generates a monotonic lease ID stored in a separate 'leases' hash.
heartbeat.lua, acknowledge.lua, and requeue.lua now check the lease
instead of worker_queue_key ownership or processed set membership.

The owners hash (entry → worker_queue_key) is preserved for release.lua
and requeued-by routing, which need to identify entries by worker, not
by lease.
- Update Python distributed.py to pass leases_key and lease_counter_key
  to reserve, reserve_lost, acknowledge, and requeue Lua scripts
- Parse [entry, lease] array return from reserve/reserve_lost
- Store and pass lease IDs through acknowledge/requeue
- Update heartbeat integration test: with lease-based heartbeat the entry
  stays alive for the full test duration, so the test is never stolen
  and no RESERVED_LOST_TEST warning is generated
The test_lost_test_with_heartbeat_monitor test was using --heartbeat 1,
meaning a test would be considered 'lost' after just 1 second without a
heartbeat tick. Since the heartbeat thread ticks at most every ~1 second,
TruffleRuby's slower thread scheduling left essentially zero margin,
causing the test to be stolen and a warning to be generated.

Increasing to --heartbeat 5 gives ample margin for the heartbeat to tick
while still testing the core behavior: heartbeat keeps the test alive so
it is never stolen.
Comment thread ruby/lib/ci/queue/redis/worker.rb Outdated
test_id = CI::Queue::QueueEntry.test_id(entry)
assert_reserved!(test_id)
entry = reserved_entries.fetch(test_id, entry)
lease = @reserved_leases[test_id]
Copy link
Copy Markdown
Contributor

@danielahrnsbrak danielahrnsbrak Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this reads @reserved_leases[test_id] but never deletes it, which could leak the old lease on successful requeue. Delete and restore on failure, matching acknowledge's pattern at line 189 maybe

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 7ebc11c


yield
ensure
heartbeat_state.set(:reset) if heartbeat_enabled?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we kill this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 7ebc11c

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried removing it but it caused the subprocess to hang — the heartbeat thread keeps ticking the last entry indefinitely after the test finishes, which prevents the monitor process from exiting. Restored it and added a :reset handler in the heartbeat loop that clears the ticking state between tests.

@ianks ianks requested a review from danielahrnsbrak March 30, 2026 15:19
@ianks ianks force-pushed the fix-supervisor-premature-exhaustion branch from 7ebc11c to 7766b3a Compare March 30, 2026 15:49
@ianks ianks force-pushed the fix-supervisor-premature-exhaustion branch from 7766b3a to 1b88d74 Compare March 30, 2026 17:14
@ianks ianks merged commit fd9e0f6 into main Mar 30, 2026
22 checks passed
ianks added a commit that referenced this pull request Apr 9, 2026
…ost-test reclamation

A stuck test would heartbeat forever (since PR #384 removed the countdown),
preventing other workers from reclaiming it via reserve_lost.

Add --heartbeat-max-test-duration N (defaults to timeout*10) so the heartbeat
thread stops ticking after N seconds per test. Once ticking stops the ZSET
score goes stale and reserve_lost can steal the entry.

The started_at timestamp is passed through the tick state from with_heartbeat
so elapsed is measured from when the test actually started, not from when the
heartbeat thread woke up (which could be up to 1 second late due to the
@cond.wait(1) timeout causing a skewed stale threshold).

Fixes #395
ianks added a commit that referenced this pull request Apr 17, 2026
…ost-test reclamation

A stuck test would heartbeat forever (since PR #384 removed the countdown),
preventing other workers from reclaiming it via reserve_lost.

Add --heartbeat-max-test-duration N (defaults to timeout*10) so the heartbeat
thread stops ticking after N seconds per test. Once ticking stops the ZSET
score goes stale and reserve_lost can steal the entry.

The started_at timestamp is passed through the tick state from with_heartbeat
so elapsed is measured from when the test actually started, not from when the
heartbeat thread woke up (which could be up to 1 second late due to the
@cond.wait(1) timeout causing a skewed stale threshold).

Fixes #395
ianks added a commit that referenced this pull request Apr 17, 2026
…ost-test reclamation

A stuck test would heartbeat forever (since PR #384 removed the countdown),
preventing other workers from reclaiming it via reserve_lost.

Add --heartbeat-max-test-duration N (defaults to timeout*10) so the heartbeat
thread stops ticking after N seconds per test. Once ticking stops the ZSET
score goes stale and reserve_lost can steal the entry.

The started_at timestamp is passed through the tick state from with_heartbeat
so elapsed is measured from when the test actually started, not from when the
heartbeat thread woke up (which could be up to 1 second late due to the
@cond.wait(1) timeout causing a skewed stale threshold).

Fixes #395
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants