fix(remediation): serialize concurrent fixes on a host + live status (no refresh) #607
Closed
remyluslosius wants to merge 2 commits into
…ling
Clicking Fix on several findings on the same host enqueued multiple jobs that ran concurrently; the second collided on the per-host SSH guard (ErrHostBusy) and the remediation worker marked it failed. Now the worker treats a busy host as transient: it backs off and requeues (queue.EnqueueAfter) until the host is free, so the fixes apply one at a time.
- queue: add a delayed-visibility column (migration 0039, available_at) + EnqueueAfter(delay); Dequeue skips not-yet-available rows so the requeue does not busy-loop the drain (job-queue AC-13).
- remediation: HostHasExecuting + RevertToApproved primitives (api-remediation AC-08); worker processExecute/processRollback pre-check the host and revert + requeue on an ErrHostBusy race instead of failing the request.
The Remediation tab required a manual refresh to see a fix finish. The worker already publishes remediation.completed on the event bus; useLiveEvents now subscribes to it and invalidates ['host', id, 'remediations'] + ['host', id], so the tab and the compliance score update automatically when a queued fix or rollback reaches its terminal state. frontend-live-events AC-09 + AC-01 (topic set grows to 6).
Folded into #609 (release: bundle 0.2.0-rc.11) and merged there to avoid the CHANGELOG rebase cascade. Content is on main.
remyluslosius added a commit that referenced this pull request on Jun 20, 2026
… 110) (#610)
- CLAUDE.md: Last Updated 2026-06-20; Remediation row -> Complete (#601/#606/#607); scanning-status note -> v0.2.0-rc.11 incl. free-core remediation; spec count 108 -> 110
- BACKLOG.md: drop done rows (Remediation tab, specter 100%-all-tiers, -p 1 -> -p 4)
- scan_remaining_work.md: Phase 7 first-slice shipped banner; remaining = licensed track
- SESSION_LOG.md: 2026-06-20 entry (rc.11 cut, bundle mechanics, gotchas)
Serialize concurrent remediation on a host + live remediation status
Two issues found while testing the Remediation tab. Off `main`, independent of #604/#605/#606.

1. Concurrent "Fix" clicks failed instead of queueing

Clicking Fix on several findings on the same host enqueued multiple jobs that ran concurrently; the second collided on the per-host SSH guard (`ErrHostBusy`) and the worker marked the request failed, unlike the scan worker, which already treats host-busy as transient.

Fix: the remediation worker now treats a busy host as transient and requeues with a backoff until the host is free, so fixes apply one at a time.

- Queue: delayed-visibility column (`available_at`, migration 0039) + `EnqueueAfter(delay)`. `Dequeue` skips not-yet-available rows, so the requeue does not busy-loop the `drainOnce` loop. Backward-compatible: `available_at` defaults to `now()`, so scans/diagnostics are unchanged (system-job-queue/AC-13).
- Remediation: `HostHasExecuting` + `RevertToApproved` primitives (api-remediation/AC-08); `processExecute`/`processRollback` pre-check the host and revert+requeue on an `ErrHostBusy` race rather than failing.

2. Remediation status needed a manual refresh

The worker already publishes `remediation.completed` on the event bus, but the frontend SSE hook never subscribed to it.

Fix: `useLiveEvents` now subscribes to `remediation.completed` and invalidates `['host', id, 'remediations']` + `['host', id]`, so the Remediation tab and the compliance score update automatically when a fix or rollback finishes (frontend-live-events/AC-09; `ALL_TOPICS` grows to 6).

Verified locally

Full `queue+remediation+worker+server` suite green (exit 0; scans unaffected by the queue change); frontend live-events + host-detail + remediation-tab tests green (35/35); Specter 110 specs valid, 100% structural coverage, 0 annotation-hygiene errors; gofmt clean.