fix(router): serialize backlog-manager + guard MoveWorkItem against parallel races #1257
Merged
zbigniewsobiecki merged 2 commits into dev on May 6, 2026
…arallel races

Live incident 2026-05-06 (ucho): two `backlog-manager` runs auto-chained in parallel, one from MNG-536's PR-merge and one from MNG-537's splitting auto-chain. Both scanned the same backlog, both selected MNG-538, and both moved it to TODO. The two `pm:status-changed` webhooks each fired the implementation trigger, producing duplicate PRs (#287 and #288).

Two complementary defenses:

1. **Project-singleton lock for `backlog-manager`** (primary): the per-(projectId, workItemId, agentType) lock did NOT serialize the two runs because their nominal workItemId differed (MNG-536 vs MNG-537). `work-item-lock.ts` now collapses workItemId to a sentinel for project-singleton agents, both in the in-memory lock and in the DB count, so a second backlog-manager dispatch on the same project is blocked while the first is in flight.

2. **MoveWorkItem `expectedSourceState` guard** (defense-in-depth): if a second run somehow proceeds (lock TTL expiry, restart, future regression), the gadget refuses to move an item whose current status doesn't match the caller's expectation, and treats already-at-destination as a silent no-op.

The backlog-manager prompt now instructs the agent to pass `expectedSourceState: <%= backlogSourceLabel %>` on every move-to-TODO. The label is provider-correct (Trello list ID, JIRA/Linear status name) and is matched case-insensitively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
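As a rough illustration of the project-singleton lock described in this commit, the key collapse could look like the TypeScript sketch below. The names `lockKey`, `tryAcquire`, `release`, `SINGLETON_SENTINEL`, and the `implementation` agent type are hypothetical; the actual code in `work-item-lock.ts` is not shown on this page and may differ. The DB-side count and any TTL handling are left out.

```typescript
// Minimal in-memory sketch; all identifiers here are assumptions for illustration.
const PROJECT_SINGLETON_AGENTS = new Set<string>(["backlog-manager"]);
const SINGLETON_SENTINEL = "__project_singleton__";

// Build the lock key. For project-singleton agents the workItemId is replaced
// by a sentinel, so every run on the same project maps to the same key.
function lockKey(projectId: string, workItemId: string, agentType: string): string {
  const effectiveId = PROJECT_SINGLETON_AGENTS.has(agentType)
    ? SINGLETON_SENTINEL
    : workItemId;
  return `${projectId}:${effectiveId}:${agentType}`;
}

const inFlight = new Set<string>();

// Returns true if the lock was taken, false if a matching run is already in flight.
function tryAcquire(projectId: string, workItemId: string, agentType: string): boolean {
  const key = lockKey(projectId, workItemId, agentType);
  if (inFlight.has(key)) {
    return false;
  }
  inFlight.add(key);
  return true;
}

// Frees the lock once the run has finished.
function release(projectId: string, workItemId: string, agentType: string): void {
  inFlight.delete(lockKey(projectId, workItemId, agentType));
}
```

Under these assumptions, a `backlog-manager` dispatch chained from MNG-537 is refused while one chained from MNG-536 still holds the lock for the same project, whereas regular agents keep their per-work-item keys.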
Pre-existing dep drift on dev: `npm audit --omit=dev --audit-level=high` started failing CI when new axios advisories were published after the last dev push (2026-05-03). `npm audit fix` (no --force) bumps axios 1.15.0 → 1.16.0 inside the trello.js / jira.js subtrees via lockfile-only updates; no package.json changes, no breaking-change cascade.

After: 5 moderate advisories remain (all transitive ip-address via @modelcontextprotocol/sdk → express-rate-limit) but the high-severity axios block is cleared, so the audit step exits 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Live incident 2026-05-06 on ucho: two `backlog-manager` runs auto-chained in parallel, one from MNG-536's PR-merge and one from MNG-537's splitting auto-chain. Both scanned the backlog, both picked MNG-538, and both moved it to TODO. Each move fired the implementation trigger → duplicate PRs #287 and #288.

The per-`(projectId, workItemId, agentType)` work-item lock did not serialize them because the chained `workItemId` differed (MNG-536 vs MNG-537). The `PM_COALESCE_WINDOW_MS` debounce also did nothing: the second webhook arrived ~13s after the first job had already left the delayed queue.

Two defenses
1. Project-singleton lock for `backlog-manager` (primary)

`src/router/work-item-lock.ts` now collapses `workItemId` to a sentinel for project-singleton agents (currently just `backlog-manager`). Both the in-memory map key and the DB `countActiveRuns` query omit `workItemId` for these agents, so a second backlog-manager dispatch on the same project is blocked while the first is in flight, regardless of which parent work-item triggered the auto-chain.

2. `MoveWorkItem` `expectedSourceState` guard (defense-in-depth)

New optional `expectedSourceState` parameter on `MoveWorkItem`. If set, the gadget fetches the current item, compares case-insensitively, and:

- proceeds with the move when the current status matches the expected source state;
- refuses to move the item when the current status does not match the caller's expectation;
- treats an item that is already at the destination as a silent no-op.

The `backlog-manager` prompt now instructs the agent to pass `expectedSourceState: <%= backlogSourceLabel %>` on every move-to-TODO. `backlogSourceLabel` is computed in `promptContext.ts` per provider: Trello list ID, JIRA/Linear status name (`'Backlog'` default).

This catches any future regression in the lock layer (TTL expiry, restart-induced amnesia, etc.) before the duplicate move actually fires the second `pm:status-changed` trigger.

Test plan

- `npm test`: 8793 unit tests pass (23 pre-existing skips), no new failures
- `tests/unit/router/work-item-lock.test.ts`: pin the singleton-lock semantics: blocks across different workItemIds in the same project, allows across projects, DB query omits `workItemId` for backlog-manager, regular agents unaffected
- `tests/unit/gadgets/pm/core/moveWorkItem.test.ts`: pin every branch of the `expectedSourceState` guard: match → proceed, mismatch → abort, idempotent same-state → no-op, omitted → backwards-compatible, getWorkItem failure → structured error, case-insensitive match

🤖 Generated with Claude Code
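The guard branches pinned by the test plan can be sketched as a pure decision function. This is only an illustration under stated assumptions: `decideMove` and `GuardDecision` are hypothetical names, the order of the checks (already-at-destination before mismatch) is a guess, and the real gadget additionally fetches the item and returns a structured error when `getWorkItem` fails, which is omitted here.

```typescript
// Hypothetical sketch of the expectedSourceState decision; not the gadget's actual code.
type GuardDecision = "proceed" | "noop" | "abort";

function decideMove(
  currentStatus: string,
  destination: string,
  expectedSourceState?: string,
): GuardDecision {
  // Omitted parameter: keep the old behaviour and move unconditionally.
  if (expectedSourceState === undefined) {
    return "proceed";
  }
  const sameState = (a: string, b: string): boolean =>
    a.toLowerCase() === b.toLowerCase();

  // Item is already where the caller wants it: silent no-op.
  if (sameState(currentStatus, destination)) {
    return "noop";
  }
  // Move only if the item is still in the state the caller expected.
  return sameState(currentStatus, expectedSourceState) ? "proceed" : "abort";
}
```

Replaying the incident against this sketch: the first run sees MNG-538 in `Backlog` and proceeds, while the second run sees it already in `TODO` and gets a no-op, so no second status change is made.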