fix(admin): bulk SQL upsert in Core sync phases — replace per-row $transaction that timed out at 5s in prod#846
Open
Kneesal wants to merge 3 commits into
Open
fix(admin): bulk SQL upsert in Core sync phases — replace per-row $transaction that timed out at 5s in prod#846Kneesal wants to merge 3 commits into
Kneesal wants to merge 3 commits into
Conversation
…ansaction that timed out at 5s in prod
All 5 Core sync phases (languages, countries, keywords, videos, dubs)
wrapped 500 sequential prisma.upsert() calls in
$transaction({ timeout: 5_000 }). Confirmed during the R1 prod smoke
on 2026-04-27: every page hit "Transaction API error: Transaction not
found", logged 5 identical errors over ~16s, and wrote zero rows. 500
upserts × ~10ms over Railway's proxy = 5s with zero margin.
Replaces the per-row transaction-wrapped upsert with one bulk
INSERT ... ON CONFLICT DO UPDATE statement per entity per page:
- Single round-trip instead of N (~100x faster).
- Atomic without a transaction wrapper (single statement).
- Not subject to the interactive-transaction timeout at all.
Per-phase notes:
- languages: page-loop, 1 bulk INSERT per page.
- keywords: single batch, 1 bulk INSERT.
- countries: 2 bulk INSERTs per run — deduped continent set first
(RETURNING the (id, core_id) map), then countries with continent_id
resolved from that map.
- videos: 2 bulk INSERTs per page — Video bulk upsert with the
source='manager' protection moved into the ON CONFLICT WHERE clause
("WHERE \"video\".\"source\" != 'manager'::\"SourceTier\""), then a
single bulk VideoLocale upsert flattened across (videoId, locale)
tuples for the videos that survived the WHERE filter (manager-
protected videos are absent from the RETURNING set, so their locale
rows are also untouched).
- dubs: 1 bulk INSERT per page with the same MANAGER WHERE clause.
Variants whose video FK can't be resolved are now logged + skipped
(event=core-sync.video-dub.skipped, reason=missing_video_fk) rather
than throwing — one stale FK shouldn't poison the rest of the page.
Behaviour preserved:
- Per-page error handling (catch, increment stats.errors, log
structured event, continue paging).
- Soft-delete sweep at end of each phase (still uses prisma.updateMany,
unchanged).
- Loop termination: still break when rawRows.length < PAGE_SIZE.
- Structured log event names unchanged.
- MANAGER protection guarantee unchanged (semantically equivalent;
enforced by the SQL WHERE clause instead of a JS pre-pass findUnique).
Helpers extracted into src/services/core-sync/bulk-upsert.ts:
- newRowId() — wraps @paralleldrive/cuid2's createId() to mirror
Prisma's @default(cuid()) since DB columns have no Postgres-side
default (verified via 0001_init migration).
- jsonbParam(value) — `${JSON.stringify(value)}::jsonb` — required
cast for jsonb columns under bound-parameter INSERT (cf. PG18 cast
rules called out in apps/admin/CLAUDE.md "Common pitfalls").
Tests:
- sync-languages.test.ts: existing 3 tests updated for the new bulk
shape; added 2 (bulk-INSERT SQL invariant + bulk-error path). Uses
mockReset (not clearAllMocks) so queued mockResolvedValueOnce
values don't leak across tests — caught while writing the new
cases.
- sync-keywords / sync-countries / sync-videos / sync-dubs: NEW test
files (3-5 cases each) covering happy path, SQL invariant
(INSERT shape, ON CONFLICT DO UPDATE, no $transaction), edge cases
(empty batch, bulk-error), and entity-specific concerns:
* countries: deduped continent INSERT + RETURNING-driven Country
INSERT; skip continent INSERT when none in batch.
* videos: MANAGER protection — when RETURNING omits a coreId, no
VideoLocale upsert fires for it. Empty-locale-set short-circuit.
* dubs: missing-video-FK skip path.
Verification expected after deploy:
- triggerSync(scope: "languages", incremental: false) →
phases: [{ phase: "languages", updated: 10000+, errors: 0,
durationMs: a few seconds }]
- Then triggerSync(scope: "all", incremental: false) to populate the
full catalog.
- Then re-fire triggerSceneEmbeddingBackfill(coreIds:
["2_FabioStory"], locales: ["en"]) — should now produce non-zero
targets.
Refs the R1 prod smoke handoff
(docs/handoffs/2026-04-21-admin-migration-r1-smoke-and-r2-handoff.md)
and apps/admin/CLAUDE.md R1 runbook "R0 prereq" paragraph (added
2026-04-27 alongside PR #845).
|
🚅 Deployed to the forge-pr-846 environment in forge 4 services not affected by this PR
|
P1 (resilience):
- Tighten CoreVideoSchema and CoreDubSchema `updatedAt` from
`z.string().min(1)` to `z.string().datetime()`. Without the strict
parse, a malformed Core timestamp produced `Invalid Date` at
`new Date(...)` time, which then bound to TIMESTAMPTZ and aborted
the entire 500-row bulk INSERT. Now bad rows are dropped at parse
time and counted as `core-sync.<entity>.parse-error` per phase,
preserving the rest of the batch.
- Fix `lengthInMilliseconds === 0` silently round-tripping as NULL.
The truthiness check `variant.lengthInMilliseconds ? BigInt(...) :
null` treated 0 as falsy. Replaced with `== null` check. Latent in
legacy too; fixing here since the bulk path is the new canonical
writer.
- Restore `mapLabel` typed enum return type. Now returns
`VideoLabelDb | null` instead of bare `string | null`, so a typo
in the MAP fails at compile time rather than at SQL bind time
(where the `?::"VideoLabel"` cast would reject the value and abort
the page).
P2 (observability + atomicity):
- Add `skipped` counter to SyncStats; sync-dubs increments it on
each missing-video-FK skip. Distinct from `errors` (which gates
soft-delete) since skipping is an expected outcome of inter-phase
ordering, not a failure. Operators can now alert on
`skipped > threshold` instead of mining logs.
- Wrap the two-statement sequences in sync-countries (Continent →
Country) and sync-videos (Video → VideoLocale) in
`$transaction({ timeout: 30_000, maxWait: 5_000 })`. Avoids the
partial-state window where step 1 commits before step 2 fails. Safe
vs the prod-failure that motivated this PR — each statement is a
single bulk INSERT (sub-second), not a 500-iteration per-row loop;
30s has plenty of headroom.
- Pre-bucket each video's localized arrays into Map<bcp47, value>
once instead of running four `.find()` linear scans per locale
per video. Removes an O(L × N × 4) hot-path cost. New
`bucketByLocale` helper at the bottom of sync-videos.ts.
- Add `bulkErrorLogFields(err)` helper in bulk-upsert.ts that
surfaces Prisma's `code` / `meta` (Postgres SQLSTATE + constraint
name on UNIQUE/CHECK violations) in the structured catch logs,
alongside `firstCoreId` / `lastCoreId` of the failing page. Lets
operators identify the bad row range without bisecting the Core
payload.
P3 (docs):
- Update newRowId() JSDoc to acknowledge cuid v2 (this helper) vs
Prisma `@default(cuid())` v1 mismatch and document why the
divergence is intentional.
Tests:
- 5 new test cases across sync-videos.test.ts and sync-countries.
test.ts — including the cross-statement atomicity case (step 1
succeeds, step 2 fails → wrapped transaction rolls back step 1)
flagged as a coverage gap by the testing reviewer.
- Existing sync-videos / sync-countries tests refactored to share a
`makePrismaStub` helper whose `$transaction(async tx => …)`
invokes the callback with a `tx` proxy that delegates to the
mocked $queryRaw / $executeRaw. The existing per-statement mocks
keep working unchanged.
- sync-dubs.test.ts orphan-FK case asserts `stats.skipped === 1`.
- All five $transaction-wrapping assertions also pin the deliberate
`{ timeout: 30_000, maxWait: 5_000 }` options so anyone re-tightens
the timeout has to touch the test on purpose.
Verification:
- pnpm --filter @forge/admin typecheck: clean.
- pnpm --filter @forge/admin test: 953 passed (up from 951 — net +2
cases counting the dubs/test-mock refactors).
- pnpm --filter @forge/admin lint: clean.
Deferred (out of this PR's scope; documented for follow-up):
- Soft-delete `notIn: [...seenCoreIds]` will exceed Postgres 65535
bind-param limit at scale (perf-001). Real risk for videos/dubs at
~50K+ rows. Fix is to invert the sweep to a `synced_at < runStart`
watermark — touches every phase + watermark.ts. Follow-up issue.
- No real-Postgres integration tests covering the actual SQL
semantics. Separate test-infra work.
- `bulkUpsertByCoreId` helper extraction to collapse the five
near-identical per-phase scaffolds. Defer until the 6th phase
(personalization) needs it.
…s predecessor Four new docs from PR #846 (admin Core sync bulk-upsert refactor) + one refresh of the cms predecessor. New: - docs/solutions/database-issues/prisma-transaction-timeout-wrong-tool-for-per-row-bulk-20260428.md — the failure mode: $transaction({timeout:5_000}) wrapping a per-row upsert loop reliably hits "Transaction API error: Transaction not found" at scale. Surfaced during admin's R1 prod smoke 2026-04-27 (5 errors / 0 rows written / 24s per page). - docs/solutions/best-practices/prisma-bulk-upsert-pattern-20260428.md — the replacement: $executeRaw + Prisma.join + INSERT ... ON CONFLICT DO UPDATE. Includes the helpers (newRowId via @paralleldrive/cuid2, jsonbParam for the PG18 cast, bulkErrorLogFields for catch logging) and the multi-statement-in-$transaction variant with rationale for why a 30s timeout is safe (each statement is sub-second, vs the 500-iteration loop that motivated this PR). - docs/solutions/best-practices/prisma-on-conflict-where-row-protection-20260428.md — per-row protection at the SQL layer via ON CONFLICT ... WHERE + RETURNING absence-as-signal. Replaces the legacy JS pre-pass findUnique pattern; admin uses this for MANAGER protection on Video and VideoDub. Documents the absence-as-signal trick so dependent code can correctly skip protected rows. - docs/solutions/performance-issues/soft-delete-notIn-watermark-anti-pattern-20260428.md — deferred follow-up identified during PR #846 review: prisma.<entity>.updateMany({where:{coreId:{notIn:[...seenCoreIds]}}}) exceeds Postgres's 65535 bind-param limit at ~50K+ rows. Status: documented but not yet implemented (admin's catalog is empty so the failure isn't yet observable). Recommended fix: invert to synced_at < runStartedAt watermark, mirroring the manager backfill worker pattern. Refreshed: - docs/solutions/cms/core-sync-per-page-upsert-pattern.md — added "Sibling implementation: admin (PR #846)" section with cross- references to the 3 new admin-side docs above; added the soft-delete notIn watermark caveat to the existing "Gotchas" section (the cms pattern is fine at cms scale but doesn't generalize to admin's eventual catalog); bumped frontmatter last_updated and added PR #846 to related_issues. 🤖 Generated with [Claude Code](https://claude.com/claude-code) — Claude Opus 4.7 (1M context) via Compound Engineering ce:compound + ce:compound-refresh
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
All 5 Core sync phases (
languages,countries,keywords,videos,dubs) wrapped 500 sequentialprisma.upsert()calls in$transaction({ timeout: 5_000 }). Confirmed during the R1 prod smoke on 2026-04-27: every page hitTransaction API error: Transaction not found, logged 5 identical errors over ~16s, and wrote zero rows. 500 upserts × ~10ms over Railway's proxy = 5s with zero margin.This PR replaces the per-row pattern with one bulk
INSERT ... ON CONFLICT DO UPDATEstatement per entity per page:Per-phase notes
languageskeywordscountriescontinent_idvideosON CONFLICT WHERE \"video\".\"source\" != 'manager'; videos that survive the WHERE drive the flattened (videoId, locale) tuples for the locale upsertdubsBehaviour preserved
stats.errors, structured log, continue paging)prisma.updateMany, unchanged)rawRows.length < PAGE_SIZEfindUniqueHelpers
src/services/core-sync/bulk-upsert.ts:newRowId()— wraps@paralleldrive/cuid2'screateId()to mirror Prisma's@default(cuid())since the DB columns have no Postgres-side default (verified via0001_initmigration)jsonbParam(value)— `${JSON.stringify(value)}::jsonb` cast required for jsonb columns under bound-parameter INSERT (cf. PG18 cast rules inapps/admin/CLAUDE.md"Common pitfalls")Testing
sync-languages.test.ts: 3 tests updated for the new bulk shape; +2 new (bulk-INSERT SQL invariant, bulk-error path). Switchedvi.clearAllMocks→mockResetso queuedmockResolvedValueOncevalues don't leak across testssync-keywords.test.ts,sync-countries.test.ts,sync-videos.test.ts,sync-dubs.test.ts. Each covers happy path + SQL invariant (INSERT INTO,ON CONFLICT ... DO UPDATE, no\$transaction) + edge cases. Entity-specific tests:Lint + typecheck clean.
Post-Deploy Monitoring & Validation
service=forge-admin event=core-sync.phase.complete(existing structured event) — successful runs should show non-zerocreated/updatedanddurationMsof seconds, not the previous 24s/5-error pattern.core-sync.<entity>.errorevent with message matchingTransaction API error(the old failure mode) — should be impossible after this fix; if it appears, escalate.```js
const r = await fetch('/api/graphql', { method:'POST', headers:{'content-type':'application/json'}, credentials:'include', body: JSON.stringify({ query: 'mutation { triggerSync(scope: "languages", incremental: false) }' })});
console.log(JSON.stringify(await r.json(), null, 2));
```
phases: [{ phase: \"languages\", updated: 10000+, errors: 0, durationMs: <few seconds> }]— replaces the previouserrors: 5, updated: 0, durationMs: 24588shape.SELECT count(*) FROM languageagainst admin prod Postgres → matches Core's language count.triggerSync(scope: \"all\", incremental: false)to populate the full catalog.triggerSceneEmbeddingBackfill(coreIds: [\"2_FabioStory\"], locales: [\"en\"])— should now produce non-zerototalTargets(was 0 because admin's video table was empty).triggerSync(scope: \"languages\")returnserrors: > 0post-deploy, capture the structuredcore-sync.<entity>.errorlog line, revert this PR, and investigate. The fallback is to ship the simpler timeout bump (5_000→60_000) discussed in the PR thread.Refs
apps/admin/CLAUDE.mdR1 runbook "R0 prereq" paragraph (added 2026-04-27 alongside PR feat(admin): split S3 client — read manager artifacts from manager's bucket #845)docs/handoffs/2026-04-21-admin-migration-r1-smoke-and-r2-handoff.md🤖 Generated with Claude Code — Claude Opus 4.7 (1M context) via Compound Engineering ce:work