
[Fix] recover stale external db requests#1428

Merged
nams1570 merged 1 commit into dev from inspect-stuck-db-sync-req on May 13, 2026
Conversation

@nams1570 (Collaborator) commented May 12, 2026

Failures between claiming outgoing requests and deleting them from the handler can leave those requests stale, with nothing ever cleaning them up. Some stale requests may also have fresh duplicates in the outgoing queue. These requests need to be deleted or retried.

It's important to still log the stale requests to Sentry so the root cause can be investigated.

Summary by CodeRabbit

  • Bug Fixes

    • Improved detection and recovery of stale outgoing requests; telemetry now records precise reset/deleted counts and includes sampled affected IDs.
    • Added an early fast path to skip unnecessary external calls when there are no pending requests.
  • Refactor

    • Consolidated stale-request handling into a dedicated helper and optimized recovery logic; poller telemetry now includes claim-limit attributes.


Copilot AI review requested due to automatic review settings May 12, 2026 21:16

vercel Bot commented May 12, 2026

The latest updates on your projects.

Project Status Updated (UTC)
stack-auth-hosted-components Ready May 13, 2026 0:54am
stack-auth-mcp Ready May 13, 2026 0:54am
stack-backend Ready May 13, 2026 0:54am
stack-dashboard Ready May 13, 2026 0:54am
stack-demo Ready May 13, 2026 0:54am
stack-docs Ready May 13, 2026 0:54am
stack-preview-backend Ready May 13, 2026 0:54am
stack-preview-dashboard Ready May 13, 2026 0:54am


coderabbitai Bot commented May 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: af016897-6495-4fb1-b928-f9edb027669a

📥 Commits

Reviewing files that changed from the base of the PR and between cd9431a and d2e3bd9.

📒 Files selected for processing (2)
  • apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts
  • apps/backend/src/lib/external-db-sync-queue.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts

📝 Walkthrough

Consolidates stale outgoing-request recovery into an exported transactional recoverStaleOutgoingRequests function (with retry) and integrates it into the poller via handleStaleRequests(), updating telemetry to report reset and deleted counts and capturing assertion payloads when recoveries occur.

Changes

Stale Request Recovery

  • Core stale recovery implementation (apps/backend/src/lib/external-db-sync-queue.ts): Adds RecoverStaleResult type and recoverStaleOutgoingRequests function that uses a single raw SQL CTE to classify stale OutgoingRequest rows into reset vs delete, performs bulk UPDATE/DELETE with RETURNING id, and returns resetIds and deletedIds.
  • Poller integration and telemetry (apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts): Adds import/use of recoverStaleOutgoingRequests, introduces a handleStaleRequests() helper that records telemetry attributes and conditionally captures a StackAssertionError payload with totals and sampled IDs, replaces the prior inline stale-detection SQL, updates poller telemetry attributes, and clarifies a no-pending-requests fast path.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hopped through rows both tired and wary,
Reset a few, deleted the cherry,
Transactions retried, IDs set free,
Poller hums light — that's delight for me,
I munch on logs and sip telemetry tea 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): the title clearly and specifically describes the main change: recovering stale external database requests from the handler.
  • Description check (✅ Passed): the description explains the problem, context, and requirements, but lacks details about the solution approach and implementation.
  • Linked Issues check (✅ Passed): skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR replaces the old read-only stale-row detection with an active CTE-based recovery that resets or deletes stale OutgoingRequest rows in the same transaction. A new handleStaleRequests helper wraps the recovery in a best-effort try/catch so that transient errors (including the documented P2010 unique-constraint race) never abort the rest of the poller iteration.

  • recoverStaleOutgoingRequests: Multi-CTE query in external-db-sync-queue.ts that classifies each stale row (null dedup key → reset; active sibling exists → delete; multiple stale siblings → reset oldest, delete rest) and returns precisely the IDs that were mutated by reading from the mutation CTEs rather than the planning CTE.
  • handleStaleRequests: Inlined helper in poller/route.ts wrapping recovery with a try/catch, rich Sentry telemetry (sampled IDs, counts), and graceful degradation; replaces the previous bare $queryRaw SELECT with no recovery action.
  • Early exit in processRequests: Skips the Upstash batchJSON call when no rows were claimed, avoiding an unnecessary external-API round-trip on idle iterations.
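The classification rules described above (null dedup key → reset; active sibling → delete; multiple stale siblings → reset oldest, delete rest) can be sketched as a pure function. This is an in-memory model with illustrative names, not the PR's actual CTE:

```typescript
// Simplified in-memory model of the stale-row classification. hasActiveSibling
// stands in for the SQL EXISTS check: true when a fresh (non-stale) sibling
// exists for that deduplicationKey. Illustrative only.
type StaleRow = {
  id: string;
  deduplicationKey: string | null;
  createdAt: number; // epoch millis
};

type Action = "reset" | "delete";

function classifyStaleRows(
  rows: StaleRow[],
  hasActiveSibling: (key: string) => boolean,
): Map<string, Action> {
  const actions = new Map<string, Action>();
  const byKey = new Map<string, StaleRow[]>();
  for (const row of rows) {
    if (row.deduplicationKey === null) {
      // Null keys are excluded from the partial unique index: always reset.
      actions.set(row.id, "reset");
      continue;
    }
    const group = byKey.get(row.deduplicationKey) ?? [];
    group.push(row);
    byKey.set(row.deduplicationKey, group);
  }
  for (const [key, group] of byKey) {
    if (hasActiveSibling(key)) {
      // An active sibling will do the work; stale duplicates are dropped.
      for (const row of group) actions.set(row.id, "delete");
      continue;
    }
    // No active sibling: reset the oldest row, delete the newer duplicates.
    group.sort((a, b) => a.createdAt - b.createdAt);
    actions.set(group[0].id, "reset");
    for (const row of group.slice(1)) actions.set(row.id, "delete");
  }
  return actions;
}
```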

Confidence Score: 5/5

Safe to merge — the three previously flagged issues have all been resolved and the new recovery logic is correct.

The recovery CTE correctly guards against every concurrent-access scenario: mutation CTEs repeat the staleness predicate so EvalPlanQual skips rows modified between snapshot and execution; the try/catch in handleStaleRequests absorbs the documented P2010 unique-constraint race without crashing the iteration; and RETURNING from the mutation CTEs ensures accurate counts. No correctness or data-safety issues were found.

No files require special attention.

Important Files Changed

  • apps/backend/src/lib/external-db-sync-queue.ts: Adds recoverStaleOutgoingRequests: well-structured multi-CTE query with correct staleness predicates repeated in mutation CTEs, proper ROW_NUMBER logic for multi-stale-sibling groups, and RETURNING from mutation CTEs (not the planning CTE) for accurate counts. Concurrency edge cases are documented and handled at the call site.
  • apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts: Replaces the bare stale-row SELECT with a handleStaleRequests helper; adds a best-effort try/catch and sampled Sentry telemetry, and removes the misleading staleClaimIntervalMinutes attribute. The early empty-batch fast path in processRequests is a clean optimization.

Reviews (4): Last reviewed commit: "fix: recover stale external db requests"

Copilot AI left a comment

Pull request overview

This PR addresses a reliability issue in the external DB sync poller where failures between “claiming” outgoing requests and deleting them can leave rows permanently stuck as in-flight. It adds a recovery step that resets or deletes stale claimed OutgoingRequest rows so the queue can make forward progress.

Changes:

  • Added recoverStaleOutgoingRequests() to classify stale claimed rows and either reset them (retry) or delete them (if a pending sibling exists / duplicates).
  • Integrated stale-recovery into the external-db-sync poller iteration loop and added telemetry + bounded ID sampling in error reporting.
  • Removed the previous “detect only” stale-query logic from the poller.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • apps/backend/src/lib/external-db-sync-queue.ts: Introduces stale-claim recovery logic for OutgoingRequest, including retry handling for partial-unique-index races.
  • apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts: Calls the new recovery logic each poll iteration and records recovery metrics / diagnostic error payloads.


coderabbitai Bot left a comment
Actionable comments posted: 2

🧹 Nitpick comments (1)
apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts (1)

127-152: ⚡ Quick win

Emit one stale-recovery event per poller run instead of per batch.

recoverStaleOutgoingRequests() is capped per call, so a backlog can generate several near-identical captureError(...) events in one cron execution. Aggregating the counts and sampled IDs across iterations, then reporting once near the final response, will keep the signal useful without flooding error monitoring.
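The suggested aggregation might look roughly like this. RecoverStaleResult and ID_SAMPLE_LIMIT are named in the PR; the accumulator shape here is an illustrative assumption:

```typescript
// Sketch of combining stale-recovery results across batches so the poller can
// emit one telemetry event per run instead of one per batch. Illustrative.
type RecoverStaleResult = { resetIds: string[]; deletedIds: string[] };

const ID_SAMPLE_LIMIT = 10; // assumed cap on sampled IDs per event

function aggregateRecoveries(batches: RecoverStaleResult[]): {
  totalReset: number;
  totalDeleted: number;
  resetIdsSample: string[];
  deletedIdsSample: string[];
} {
  const resetIds: string[] = [];
  const deletedIds: string[] = [];
  for (const batch of batches) {
    resetIds.push(...batch.resetIds);
    deletedIds.push(...batch.deletedIds);
  }
  return {
    totalReset: resetIds.length,
    totalDeleted: deletedIds.length,
    // Keep only a bounded sample so a large backlog can't bloat the event.
    resetIdsSample: resetIds.slice(0, ID_SAMPLE_LIMIT),
    deletedIdsSample: deletedIds.slice(0, ID_SAMPLE_LIMIT),
  };
}
```

A single captureError(...) call with the combined counts and sampled ID slices would then replace the per-batch calls.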

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts`
around lines 127 - 152, The poller currently calls captureError inside each
batch of recoverStaleOutgoingRequests(), causing many near-duplicate
stale-recovery events; instead aggregate totals and ID samples across all
recoverStaleOutgoingRequests() calls within a single poller run and emit one
captureError at the end of the poller route. Modify the poller/route code that
calls recoverStaleOutgoingRequests() to accumulate total, resetIds and
deletedIds (keep sampling logic with ID_SAMPLE_LIMIT and the notes like
staleResetIdsSampleNote/staleDeletedIdsSampleNote) across iterations, then call
captureError once with the combined counts, combined sampled id slices, and
STALE_REQUEST_THRESHOLD_MS in the message; keep the current StackAssertionError
payload shape and keys so downstream tooling is unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/backend/src/lib/external-db-sync-queue.ts`:
- Around line 78-91: The CASE must also treat rows as duplicates when there
exists any other in-flight sibling for the same deduplicationKey so we don't
reset a stale row while a newer claim is active; update the CASE in the query
that computes action for OutgoingRequest (references o."deduplicationKey",
o."startedFulfillingAt", o."createdAt", o."id") to add an EXISTS clause that
checks for any other row with the same deduplicationKey and startedFulfillingAt
IS NOT NULL (and id != o.id) and return 'delete' in that branch before the
'reset' branch (keep the existing pending-sibling check too and still use
staleThresholdMs for the stale check).
- Around line 117-128: The error-handling branch in recoverStaleOutgoingRequests
is checking for Prisma.PrismaClientKnownRequestError with code "P2002" which
never matches unique-constraint errors thrown by tx.$queryRaw (they surface as
"P2010"); change the check to error.code === "P2010" (keep the existing
PARTIAL_UNIQUE_INDEX_NAME and MAX_PARTIAL_UNIQUE_RETRIES logic intact) so the
retry/StackAssertionError path executes correctly when $queryRaw hits the
partial unique index contention.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 677f7488-bf39-4a88-a8f3-34270d413183

📥 Commits

Reviewing files that changed from the base of the PR and between d2030e8 and adf33e9.

📒 Files selected for processing (2)
  • apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts
  • apps/backend/src/lib/external-db-sync-queue.ts

@nams1570 force-pushed the inspect-stuck-db-sync-req branch from adf33e9 to 638f49b on May 12, 2026 21:34
@nams1570 (Collaborator, Author) commented:

@greptile review again please. Please check the sql query in detail and consider edge cases


coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/backend/src/lib/external-db-sync-queue.ts`:
- Around line 101-118: The mutation CTEs deleted and reset currently only join
on id/action and must also enforce the staleness predicate so they become no-ops
if a row was reclaimed before mutation; update the deleted CTE (DELETE FROM
"OutgoingRequest" o USING stale s WHERE o."id" = s."id" AND s.action = 'delete')
and the reset CTE (UPDATE "OutgoingRequest" o SET "startedFulfillingAt" = NULL
FROM stale s WHERE o."id" = s."id" AND s.action = 'reset') to also include the
same stale condition used in the planning CTE (e.g., check the stale predicate
fields from stale) so the WHERE clauses guard by both id/action and staleness.
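The guarded mutation this comment asks for can be modeled in memory: the reset becomes a no-op when the row no longer satisfies the staleness predicate at mutation time. This is a simplified TypeScript model of the READ COMMITTED re-check behavior, not the actual SQL:

```typescript
// In-memory model of re-checking the staleness predicate at mutation time.
// If another poller reclaimed a row (fresh startedFulfillingAt) between
// planning and mutation, the guarded reset skips it. Illustrative only.
type Row = { id: string; startedFulfillingAt: number | null };

function resetIfStillStale(
  rows: Map<string, Row>,
  plannedIds: string[], // ids classified as 'reset' by the planning step
  now: number,
  staleThresholdMs: number,
): string[] {
  const resetIds: string[] = [];
  for (const id of plannedIds) {
    const row = rows.get(id);
    // Guard: the row must still satisfy the staleness predicate.
    const stillStale =
      row !== undefined &&
      row.startedFulfillingAt !== null &&
      now - row.startedFulfillingAt > staleThresholdMs;
    if (!stillStale) continue; // reclaimed, already reset, or gone: no-op
    row.startedFulfillingAt = null; // reset to pending
    resetIds.push(id);
  }
  return resetIds;
}
```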

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f3f26ec-3580-4091-84bf-0c01a59f6888

📥 Commits

Reviewing files that changed from the base of the PR and between adf33e9 and 638f49b.

📒 Files selected for processing (2)
  • apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts
  • apps/backend/src/lib/external-db-sync-queue.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts

@nams1570 force-pushed the inspect-stuck-db-sync-req branch 2 times, most recently from 255a197 to 720128e on May 12, 2026 21:46
@nams1570 (Collaborator, Author) commented:

@greptile review again please. Please check the sql query in detail and consider edge cases

@nams1570 (Collaborator, Author) commented:

@coderabbitai

coderabbitai Bot commented May 12, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cat apps/backend/src/lib/external-db-sync-queue.ts
echo "---FILE_BREAK---"
cat apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts

Length of output: 21484


@nams1570 Here is a detailed review of the SQL query in recoverStaleOutgoingRequests, along with edge cases:


SQL Review: recoverStaleOutgoingRequests

✅ Correct Behaviours

  1. NOW() consistency: PostgreSQL fixes NOW() to the transaction start time, so all four usages of ${staleThresholdMs} * INTERVAL '1 millisecond' compare against the same instant. No drift.

  2. ROW_NUMBER() scope is all stale rows (pre-LIMIT): Window functions are evaluated over the full WHERE-filtered set before LIMIT 100 is applied. So ROW_NUMBER() = 1 correctly identifies the globally oldest stale row per deduplicationKey, not just the oldest among the limited 100.

  3. EXISTS doesn't accidentally match o itself: Row o has startedFulfillingAt IS NOT NULL AND < NOW() - threshold, so it can't match the EXISTS conditions (IS NULL or >= threshold). No self-match risk.

  4. EXISTS is per-key, not per-row: Since the EXISTS condition is keyed on deduplicationKey, all stale rows for the same key see the same result — either all get 'delete' (because an active sibling exists) or all fall through to the ROW_NUMBER tiebreaker. Consistent classification.

  5. Mutation CTE re-checks are necessary and correct: Under READ COMMITTED, a concurrent poller re-claiming a row (setting a fresh startedFulfillingAt) or a concurrent recovery resetting it (NULL) would cause EvalPlanQual to re-evaluate the predicate and skip the row. This prevents double-processing.

  6. DELETE FROM "OutgoingRequest" o USING stale s: Aliasing the target table in a DELETE … USING is valid PostgreSQL syntax.


⚠️ Edge Cases & Issues

1. Oldest stale row outside LIMIT 100 — all siblings get deleted

Scenario: Key X has 150 stale rows. The globally oldest by createdAt sits at position 101 in the startedFulfillingAt ASC ordering and is excluded by LIMIT 100. The 100 rows that ARE included all have ROW_NUMBER ≥ 2 in their partition. With no active siblings, they all get 'delete'.

In the next iteration, the oldest row is alone, gets ROW_NUMBER = 1, and is correctly reset.

Impact: Safe — work is never lost — but the oldest row can lag by one full iteration before being re-queued.


2. deduplicationKey IS NULL rows all get unconditional 'reset'

If multiple stale rows exist with deduplicationKey IS NULL, they all get reset to pending in one pass. Because NULL values are excluded from the partial unique index (WHERE "startedFulfillingAt" IS NULL), this is safe. However, it means N null-key stale rows → N new pending rows, potentially flooding the queue on a bad restart.

Suggestion (optional): consider capping null-key resets per recovery pass if NULL-keyed requests are a realistic pattern in practice.


3. 23505 unique index violation (documented but swallowed)

Between the EXISTS check in the planning CTE and the UPDATE … SET startedFulfillingAt = NULL in the reset CTE, a concurrent enqueueExternalDbSync for the same key can insert a new pending row. PostgreSQL raises SQLSTATE 23505, which Prisma surfaces as P2010. retryTransaction does not retry this (it's a constraint error, not a serialization failure). The error bubbles to handleStaleRequests, which swallows it and returns { resetIds: [], deletedIds: [] }.

Impact: The stale row survives until the next cron tick. Acceptable, and well-documented in the code comments. One thing to confirm: does retryTransaction inspect error codes and rethrow 23505 without retrying? If it retries indiscriminately, the second attempt would likely fail the same way.
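The confirmation asked for above amounts to an error-code-aware retry: serialization failures are retried, but constraint violations (surfaced by Prisma's $queryRaw as code "P2010") are rethrown immediately, since a second attempt would fail the same way. A simplified synchronous sketch; the project's actual retryTransaction may differ:

```typescript
// Hypothetical sketch of a retry wrapper that distinguishes retryable errors
// from constraint violations. DbError models a driver error with a code.
class DbError extends Error {
  code: string;
  constructor(code: string) {
    super(`db error ${code}`);
    this.code = code;
  }
}

function retryTransaction<T>(fn: () => T, maxAttempts = 3): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return fn();
    } catch (error) {
      if (error instanceof DbError && error.code === "P2010") {
        throw error; // constraint race: retrying would fail identically
      }
      lastError = error; // e.g. a serialization failure: worth retrying
    }
  }
  throw lastError;
}
```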


4. stale CTE may be inlined twice under PostgreSQL 12+

PostgreSQL 12+ can inline non-recursive, non-data-modifying CTEs. If stale is inlined, the deleted and reset mutation CTEs each execute the planning query independently against the same transaction snapshot. Since NOW() is constant and no rows change between the two inlined executions (they operate on non-overlapping action sets), the results are identical. No correctness issue, but it doubles the planning query cost.

Suggestion: Add MATERIALIZED to the stale CTE to force single evaluation and guarantee consistent results:

WITH stale AS MATERIALIZED (
  ...
)

5. ORDER BY startedFulfillingAt ASC for LIMIT vs ORDER BY createdAt ASC for ROW_NUMBER

These two orderings are intentionally different and correct: the LIMIT drains oldest-claimed-first (fair processing order), while ROW_NUMBER identifies oldest-created-first (to preserve the original request semantics). No bug — just worth confirming this asymmetry is intentional.


6. Mutation CTEs operate on the same snapshot

PostgreSQL executes all data-modifying CTEs against the same snapshot, so deleted and reset cannot interfere with each other even though they target the same table. A row classified as 'delete' in stale will only be processed by deleted, and vice versa. No double-processing risk.


Summary Table

  • Oldest row outside LIMIT deleted before reset: Low severity. Safe; multi-iteration drain.
  • NULL-key stale rows all reset unconditionally: Low severity. Safe; potential queue-flood edge case.
  • 23505 swallowed, stale row persists one tick: Low severity. Acceptable; well-documented.
  • stale CTE potentially inlined twice: Low severity. Add MATERIALIZED to force single evaluation.
  • ROW_NUMBER computed pre-LIMIT: ✅ Correct; works as intended.
  • Mutation CTE re-checks for READ COMMITTED: ✅ Correct concurrency guard.

The most actionable fix is adding MATERIALIZED to the stale CTE (Issue 4). The rest are well-handled by the existing design or acceptable trade-offs given the poller's cron cadence.

Failures between claiming and the deletion of outgoing requests from the handler can leave requests stale and never clean them up.
Some of these requests may also have duplicates that are fresh in the outgoing queue. These requests need to be deleted or retried.
@nams1570 force-pushed the inspect-stuck-db-sync-req branch from cd9431a to d2e3bd9 on May 13, 2026 00:48
@nams1570 (Collaborator, Author) commented:

@greptileai could you do a proper rereview? I feel like the materialized comment was a premature optimization so we'll wait on it for now. Correctness logic should still be the same I think

