Skip to content

fix(router): serialize backlog-manager + guard MoveWorkItem against parallel races#1257

Merged
zbigniewsobiecki merged 2 commits intodevfrom
fix/backlog-manager-singleton-lock-and-idempotency
May 6, 2026
Merged

fix(router): serialize backlog-manager + guard MoveWorkItem against parallel races#1257
zbigniewsobiecki merged 2 commits intodevfrom
fix/backlog-manager-singleton-lock-and-idempotency

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

Summary

Live incident 2026-05-06 on ucho: two backlog-manager runs auto-chained in parallel — one from MNG-536's PR-merge, one from MNG-537's splitting auto-chain — both scanned the backlog, both picked MNG-538, both moved it to TODO. Each move fired the implementation trigger → duplicate PRs #287 and #288.

The per-(projectId, workItemId, agentType) work-item lock did not serialize them because the chained workItemId differed (MNG-536 vs MNG-537). The PM_COALESCE_WINDOW_MS debounce also did nothing — the second webhook arrived ~13s after the first job had already left the delayed queue.

Two defenses

1. Project-singleton lock for backlog-manager (primary)

src/router/work-item-lock.ts now collapses workItemId to a sentinel for project-singleton agents (currently just backlog-manager). Both the in-memory map key and the DB countActiveRuns query omit workItemId for these agents — a second backlog-manager dispatch on the same project is blocked while the first is in flight, regardless of which parent work-item triggered the auto-chain.

2. MoveWorkItem expectedSourceState guard (defense-in-depth)

New optional expectedSourceState parameter on MoveWorkItem. If set, the gadget fetches the current item, compares case-insensitively, and:

  • aborts with a structured "likely already moved by a parallel agent" message on mismatch
  • silently no-ops if the item is already in the destination state
  • preserves prior behaviour when the parameter is omitted

The backlog-manager prompt now instructs the agent to pass expectedSourceState: <%= backlogSourceLabel %> on every move-to-TODO. backlogSourceLabel is computed in promptContext.ts per provider — Trello list ID, JIRA/Linear status name ('Backlog' default).

This catches any future regression in the lock layer (TTL expiry, restart-induced amnesia, etc.) before the duplicate move actually fires the second pm:status-changed trigger.

Test plan

  • npm test — 8793 unit tests pass (23 pre-existing skips), no new failures
  • New tests in tests/unit/router/work-item-lock.test.ts — pin the singleton-lock semantics: blocks across different workItemIds in the same project, allows across projects, DB query omits workItemId for backlog-manager, regular agents unaffected
  • New tests in tests/unit/gadgets/pm/core/moveWorkItem.test.ts — pin every branch of the expectedSourceState guard: match → proceed, mismatch → abort, idempotent same-state → no-op, omitted → backwards-compatible, getWorkItem failure → structured error, case-insensitive match
  • Lint + typecheck clean (lefthook pre-commit ran them)
  • Integration suite (will run in CI)

🤖 Generated with Claude Code

zbigniewsobiecki and others added 2 commits May 6, 2026 11:38
…arallel races

Live incident 2026-05-06 (ucho): two `backlog-manager` runs auto-chained
in parallel — one from MNG-536's PR-merge, one from MNG-537's splitting
auto-chain — both scanned the same backlog, both selected MNG-538, both
moved it to TODO. The two `pm:status-changed` webhooks each fired the
implementation trigger, producing duplicate PRs (#287 and #288).

Two complementary defenses:

1. **Project-singleton lock for `backlog-manager`** (primary): the
   per-(projectId, workItemId, agentType) lock did NOT serialize the two
   runs because their nominal workItemId differed (MNG-536 vs MNG-537).
   `work-item-lock.ts` now collapses workItemId to a sentinel for
   project-singleton agents — both in-memory and the DB count — so a
   second backlog-manager dispatch on the same project is blocked while
   the first is in flight.

2. **MoveWorkItem `expectedSourceState` guard** (defense-in-depth): if a
   second run somehow proceeds (lock TTL expiry, restart, future
   regression), the gadget refuses to move an item whose current status
   doesn't match the caller's expectation, and treats already-at-
   destination as a silent no-op. The backlog-manager prompt now
   instructs the agent to pass `expectedSourceState: <%= backlogSourceLabel %>`
   on every move-to-TODO. The label is provider-correct (Trello list
   ID, JIRA/Linear status name) and case-insensitive matched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-existing dep drift on dev — `npm audit --omit=dev --audit-level=high`
started failing CI when new axios advisories were published since the
last dev push (2026-05-03). `npm audit fix` (no --force) bumps axios
1.15.0 → 1.16.0 inside the trello.js / jira.js subtrees via lockfile-
only updates; no package.json changes, no breaking-change cascade.

After: 5 moderate advisories remain (all transitive ip-address via
@modelcontextprotocol/sdk → express-rate-limit) but the high-severity
axios block is cleared, so the audit step exits 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki merged commit 2234bbe into dev May 6, 2026
9 checks passed
@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 96.36364% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/gadgets/pm/core/moveWorkItem.ts 88.23% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@zbigniewsobiecki zbigniewsobiecki deleted the fix/backlog-manager-singleton-lock-and-idempotency branch May 6, 2026 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant