[Fix] Recover stale external db requests (#1428)
Conversation
Note: Reviews paused. This branch is under active development, so CodeRabbit automatically paused the review to avoid overwhelming you with review comments from an influx of new commits; this behavior is configurable.

No actionable comments were generated in the recent review. 🎉
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough: Consolidates stale outgoing-request recovery into an exported transactional routine.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Greptile Summary: This PR replaces the old read-only stale-row detection with an active CTE-based recovery that resets or deletes stale claimed rows.
Confidence Score: 5/5. Safe to merge: the three previously flagged issues have all been resolved and the new recovery logic is correct. The recovery CTE correctly guards against every concurrent-access scenario: mutation CTEs repeat the staleness predicate so EvalPlanQual skips rows modified between snapshot and execution; the try/catch in handleStaleRequests absorbs the documented P2010 unique-constraint race without crashing the iteration; and RETURNING from the mutation CTEs ensures accurate counts. No correctness or data-safety issues were found, and no files require special attention.
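The P2010 race handling mentioned above can be sketched roughly as follows. Only the error-code behavior comes from Prisma's documented semantics (`$queryRaw` surfaces database failures as code "P2010", not the mapped unique-constraint code "P2002"); the index name and retry limit are assumptions, not the real values.

```typescript
// Hedged sketch of the P2010 retry check. Prisma surfaces raw-query
// failures from $queryRaw as PrismaClientKnownRequestError with code
// "P2010" (raw query failed), carrying the underlying Postgres error
// text in error.meta.
const PARTIAL_UNIQUE_INDEX_NAME = "OutgoingRequest_pending_dedup_key"; // hypothetical name
const MAX_PARTIAL_UNIQUE_RETRIES = 3; // hypothetical limit

type KnownRequestError = { code: string; meta?: { message?: string } };

function shouldRetryPartialUniqueConflict(
  error: KnownRequestError,
  attempt: number,
): boolean {
  return (
    // P2010, not P2002: $queryRaw wraps the database error rather than
    // mapping it to Prisma's unique-constraint code.
    error.code === "P2010" &&
    (error.meta?.message ?? "").includes(PARTIAL_UNIQUE_INDEX_NAME) &&
    attempt < MAX_PARTIAL_UNIQUE_RETRIES
  );
}
```

When the check returns false after exhausting retries, the try/catch path described above would raise a StackAssertionError instead of retrying again.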
Reviews (4). Last reviewed commit: "fix: recover stale external db requests"
Pull request overview
This PR addresses a reliability issue in the external DB sync poller where failures between “claiming” outgoing requests and deleting them can leave rows permanently stuck as in-flight. It adds a recovery step that resets or deletes stale claimed OutgoingRequest rows so the queue can make forward progress.
Changes:
- Added `recoverStaleOutgoingRequests()` to classify stale claimed rows and either reset them (retry) or delete them (if a pending sibling exists / duplicates).
- Integrated stale-recovery into the external-db-sync poller iteration loop and added telemetry plus bounded ID sampling in error reporting.
- Removed the previous "detect only" stale-query logic from the poller.
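A minimal sketch of the reset-vs-delete classification described in these bullets, assuming field names that mirror the review's references to OutgoingRequest; the real implementation performs this inside a SQL CTE, not in application code:

```typescript
// Hypothetical in-memory model of the classification; the production
// version expresses the same decision as a CASE expression in SQL.
type StaleRow = {
  id: string;
  deduplicationKey: string | null;
  startedFulfillingAt: Date; // non-null means the row was claimed
};

function classifyStaleRow(
  row: StaleRow,
  hasPendingSibling: (dedupKey: string) => boolean,
): "reset" | "delete" {
  // A pending (unclaimed) sibling with the same deduplication key already
  // covers this work, so the stale claim can simply be dropped.
  if (row.deduplicationKey !== null && hasPendingSibling(row.deduplicationKey)) {
    return "delete";
  }
  // Otherwise clear the claim so the poller can retry the request.
  return "reset";
}
```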
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| apps/backend/src/lib/external-db-sync-queue.ts | Introduces stale-claim recovery logic for OutgoingRequest, including retry handling for partial-unique-index races. |
| apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts | Calls the new recovery logic each poll iteration and records recovery metrics / diagnostic error payloads. |
Actionable comments posted: 2
🧹 Nitpick comments (1)
apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts (1)
127-152: ⚡ Quick win: Emit one stale-recovery event per poller run instead of per batch.

`recoverStaleOutgoingRequests()` is capped per call, so a backlog can generate several near-identical `captureError(...)` events in one cron execution. Aggregating the counts and sampled IDs across iterations, then reporting once near the final response, will keep the signal useful without flooding error monitoring.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/backend/src/lib/external-db-sync-queue.ts`:
- Around line 78-91: The CASE must also treat rows as duplicates when there
exists any other in-flight sibling for the same deduplicationKey so we don't
reset a stale row while a newer claim is active; update the CASE in the query
that computes action for OutgoingRequest (references o."deduplicationKey",
o."startedFulfillingAt", o."createdAt", o."id") to add an EXISTS clause that
checks for any other row with the same deduplicationKey and startedFulfillingAt
IS NOT NULL (and id != o.id) and return 'delete' in that branch before the
'reset' branch (keep the existing pending-sibling check too and still use
staleThresholdMs for the stale check).
- Around line 117-128: The error-handling branch in recoverStaleOutgoingRequests
is checking for Prisma.PrismaClientKnownRequestError with code "P2002" which
never matches unique-constraint errors thrown by tx.$queryRaw (they surface as
"P2010"); change the check to error.code === "P2010" (keep the existing
PARTIAL_UNIQUE_INDEX_NAME and MAX_PARTIAL_UNIQUE_RETRIES logic intact) so the
retry/StackAssertionError path executes correctly when $queryRaw hits the
partial unique index contention.
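The amended CASE the first comment describes could look roughly like this. The SQL is illustrative, held in a TypeScript template string; the table and column names follow the comment, everything else is an assumption:

```typescript
// Sketch of the action classification with the extra duplicate branch:
// a stale row counts as a duplicate when any OTHER sibling with the same
// deduplicationKey is in flight, checked before the pending-sibling branch
// and the final 'reset'.
const actionCase = `
  CASE
    WHEN o."deduplicationKey" IS NOT NULL AND EXISTS (
      SELECT 1 FROM "OutgoingRequest" d
      WHERE d."deduplicationKey" = o."deduplicationKey"
        AND d."id" <> o."id"
        AND d."startedFulfillingAt" IS NOT NULL
    ) THEN 'delete'  -- a newer claim is active; drop the stale row
    WHEN o."deduplicationKey" IS NOT NULL AND EXISTS (
      SELECT 1 FROM "OutgoingRequest" p
      WHERE p."deduplicationKey" = o."deduplicationKey"
        AND p."id" <> o."id"
        AND p."startedFulfillingAt" IS NULL
    ) THEN 'delete'  -- a pending sibling already covers this work
    ELSE 'reset'     -- no sibling: clear the claim so it can be retried
  END AS action`;
```

Ordering matters: both duplicate branches must be evaluated before the 'reset' fallback, otherwise a stale row with an active sibling would be re-queued.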
---
Nitpick comments:
In `@apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts`:
- Around line 127-152: The poller currently calls captureError inside each batch
of recoverStaleOutgoingRequests(), causing many near-duplicate stale-recovery
events; instead aggregate totals and ID samples across all
recoverStaleOutgoingRequests() calls within a single poller run and emit one
captureError at the end of the poller route. Modify the poller/route code that
calls recoverStaleOutgoingRequests() to accumulate total, resetIds and
deletedIds (keep sampling logic with ID_SAMPLE_LIMIT and the notes like
staleResetIdsSampleNote/staleDeletedIdsSampleNote) across iterations, then call
captureError once with the combined counts, combined sampled id slices, and
STALE_REQUEST_THRESHOLD_MS in the message; keep the current StackAssertionError
payload shape and keys so downstream tooling is unchanged.
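A minimal sketch of the suggested aggregation; all names here (`ID_SAMPLE_LIMIT`, `RecoveryResult`) are assumptions for illustration, not the real API:

```typescript
// Accumulate per-batch recovery results, then report once per poller run.
const ID_SAMPLE_LIMIT = 20; // assumed cap on IDs included in the error payload

type RecoveryResult = { resetIds: string[]; deletedIds: string[] };

function aggregateRecoveryResults(batches: RecoveryResult[]) {
  const resetIds: string[] = [];
  const deletedIds: string[] = [];
  for (const b of batches) {
    resetIds.push(...b.resetIds);
    deletedIds.push(...b.deletedIds);
  }
  return {
    total: resetIds.length + deletedIds.length,
    resetCount: resetIds.length,
    deletedCount: deletedIds.length,
    // Keep only a bounded sample so the error payload stays small.
    resetIdsSample: resetIds.slice(0, ID_SAMPLE_LIMIT),
    deletedIdsSample: deletedIds.slice(0, ID_SAMPLE_LIMIT),
  };
}
// A single captureError(...) with this aggregate near the final response
// would then replace the per-batch events.
```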
ℹ️ Review info
⚙️ Run configuration: defaults | Review profile: CHILL | Plan: Pro
Run ID: 677f7488-bf39-4a88-a8f3-34270d413183
📒 Files selected for processing (2)
- apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts
- apps/backend/src/lib/external-db-sync-queue.ts
Force-pushed: adf33e9 → 638f49b
@greptile review again please. Please check the sql query in detail and consider edge cases
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/backend/src/lib/external-db-sync-queue.ts`:
- Around line 101-118: The mutation CTEs deleted and reset currently only join
on id/action and must also enforce the staleness predicate so they become no-ops
if a row was reclaimed before mutation; update the deleted CTE (DELETE FROM
"OutgoingRequest" o USING stale s WHERE o."id" = s."id" AND s.action = 'delete')
and the reset CTE (UPDATE "OutgoingRequest" o SET "startedFulfillingAt" = NULL
FROM stale s WHERE o."id" = s."id" AND s.action = 'reset') to also include the
same stale condition used in the planning CTE (e.g., check the stale predicate
fields from stale) so the WHERE clauses guard by both id/action and staleness.
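The guarded mutation CTEs this comment asks for might be shaped like the following. This is illustrative, not the actual query; the threshold parameterization and guard wording are assumptions:

```typescript
// The staleness predicate is repeated in both mutation CTEs so that, under
// READ COMMITTED, a row reclaimed between the planning CTE's snapshot and
// the mutation becomes a no-op (EvalPlanQual re-checks the WHERE clause).
const staleGuard = `o."startedFulfillingAt" IS NOT NULL
      AND o."startedFulfillingAt" < now() - make_interval(secs => $1 / 1000.0)`;

const recoveryMutations = `
  deleted AS (
    DELETE FROM "OutgoingRequest" o
    USING stale s
    WHERE o."id" = s."id" AND s.action = 'delete'
      AND ${staleGuard}
    RETURNING o."id"
  ),
  reset AS (
    UPDATE "OutgoingRequest" o
    SET "startedFulfillingAt" = NULL
    FROM stale s
    WHERE o."id" = s."id" AND s.action = 'reset'
      AND ${staleGuard}
    RETURNING o."id"
  )`;
```

The RETURNING clauses are what let the caller report accurate reset/deleted counts rather than trusting the planning CTE's row set.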
ℹ️ Review info
⚙️ Run configuration: defaults | Review profile: CHILL | Plan: Pro
Run ID: 0f3f26ec-3580-4091-84bf-0c01a59f6888
📒 Files selected for processing (2)
- apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts
- apps/backend/src/lib/external-db-sync-queue.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- apps/backend/src/app/api/latest/internal/external-db-sync/poller/route.ts
Force-pushed: 255a197 → 720128e
@greptile review again please. Please check the sql query in detail and consider edge cases
SQL Review:

| Concern | Severity | Status |
|---|---|---|
| Oldest row outside LIMIT deleted before reset | Low | Safe; multi-iteration drain |
| NULL-key stale rows all reset unconditionally | Low | Safe; potential queue flood edge case |
| 23505 swallowed, stale row persists one tick | Low | Acceptable; well-documented |
| `stale` CTE potentially inlined twice | Low | Add MATERIALIZED to force single eval |
| ROW_NUMBER computed pre-LIMIT (correct) | ✅ N/A | Works as intended |
| Mutation CTE re-checks for READ COMMITTED | ✅ N/A | Correct concurrency guard |
The most actionable fix is adding MATERIALIZED to the stale CTE; the rest are well-handled by the existing design or are acceptable trade-offs given the poller's cron cadence.
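The suggested hint, sketched with placeholder predicates and assumed names (the threshold and LIMIT are not the real values):

```typescript
// PostgreSQL 12+ may inline a CTE into each consumer; AS MATERIALIZED
// forces a single evaluation, so the delete and reset CTEs act on the
// exact same planned row set.
const staleCte = `
  WITH stale AS MATERIALIZED (
    SELECT o."id",
           CASE WHEN /* duplicate / pending-sibling checks */ false
                THEN 'delete' ELSE 'reset'
           END AS action
    FROM "OutgoingRequest" o
    WHERE o."startedFulfillingAt" IS NOT NULL
      AND o."startedFulfillingAt" < now() - interval '5 minutes'
    ORDER BY o."createdAt"
    LIMIT 100
  )`;
```

Without the keyword the planner is free to evaluate the stale predicate twice, which is harmless here but makes the two mutation CTEs' inputs harder to reason about.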
Force-pushed: 720128e → cd9431a
Failures between claiming outgoing requests and deleting them from the handler can leave requests stale, never to be cleaned up. Some of these requests may also have fresh duplicates in the outgoing queue. These stale requests need to be deleted or retried.
Force-pushed: cd9431a → d2e3bd9
@greptileai could you do a proper rereview? I feel like the materialized comment was a premature optimization so we'll wait on it for now. Correctness logic should still be the same I think
Failures between claiming outgoing requests and deleting them from the handler can leave requests stale, never to be cleaned up. Some of these requests may also have fresh duplicates in the outgoing queue. These stale requests need to be deleted or retried.
It's important to still log the stale requests to Sentry so the root cause can be investigated.
Summary by CodeRabbit
Bug Fixes
Refactor