Skip to content

fix(admin): bulk SQL upsert in Core sync phases — replace per-row $transaction that timed out at 5s in prod#846

Open
Kneesal wants to merge 3 commits into
mainfrom
fix/admin-core-sync-bulk-upsert
Open

fix(admin): bulk SQL upsert in Core sync phases — replace per-row $transaction that timed out at 5s in prod#846
Kneesal wants to merge 3 commits into
mainfrom
fix/admin-core-sync-bulk-upsert

Conversation

@Kneesal
Copy link
Copy Markdown
Member

@Kneesal Kneesal commented Apr 28, 2026

Summary

All 5 Core sync phases (languages, countries, keywords, videos, dubs) wrapped 500 sequential prisma.upsert() calls in $transaction({ timeout: 5_000 }). Confirmed during the R1 prod smoke on 2026-04-27: every page hit Transaction API error: Transaction not found, logged 5 identical errors over ~16s, and wrote zero rows. 500 upserts × ~10ms over Railway's proxy = 5s with zero margin.

This PR replaces the per-row pattern with one bulk INSERT ... ON CONFLICT DO UPDATE statement per entity per page:

  • Single round-trip instead of N (~100x faster)
  • Atomic without a transaction wrapper (single statement is implicit-tx)
  • Not subject to the interactive-transaction timeout at all

Per-phase notes

Phase Bulk statements per page Notes
languages 1 (Language INSERT) page-loop, 500/page
keywords 1 (Keyword INSERT) single batch (no paging in Core query)
countries 2 (Continent then Country) deduped continent set first via RETURNING; Country uses the resulting (core_id → id) map for continent_id
videos 2 (Video then VideoLocale) MANAGER protection moved into ON CONFLICT WHERE \"video\".\"source\" != 'manager'; videos that survive the WHERE drive the flattened (videoId, locale) tuples for the locale upsert
dubs 1 (VideoDub INSERT) same MANAGER WHERE clause; orphaned dubs (no video FK) are now logged + skipped instead of throwing — one stale FK shouldn't poison the rest of a 500-row batch

Behaviour preserved

  • Per-page error handling (catch, increment stats.errors, structured log, continue paging)
  • Soft-delete sweep at end of each phase (still prisma.updateMany, unchanged)
  • Loop termination: still break when rawRows.length < PAGE_SIZE
  • Structured log event names unchanged
  • MANAGER protection guarantee unchanged — semantically equivalent, enforced by SQL WHERE instead of a JS pre-pass findUnique

Helpers

src/services/core-sync/bulk-upsert.ts:

  • newRowId() — wraps @paralleldrive/cuid2's createId() to mirror Prisma's @default(cuid()) since the DB columns have no Postgres-side default (verified via 0001_init migration)
  • jsonbParam(value) — `${JSON.stringify(value)}::jsonb` cast required for jsonb columns under bound-parameter INSERT (cf. PG18 cast rules in apps/admin/CLAUDE.md "Common pitfalls")

Testing

  • Existing sync-languages.test.ts: 3 tests updated for the new bulk shape; +2 new (bulk-INSERT SQL invariant, bulk-error path). Switched vi.clearAllMocksmockReset so queued mockResolvedValueOnce values don't leak across tests
  • New test files for the previously-uncovered phases: sync-keywords.test.ts, sync-countries.test.ts, sync-videos.test.ts, sync-dubs.test.ts. Each covers happy path + SQL invariant (INSERT INTO, ON CONFLICT ... DO UPDATE, no \$transaction) + edge cases. Entity-specific tests:
    • countries: deduped continent INSERT + RETURNING-driven Country INSERT; skip continent INSERT when none in batch
    • videos: MANAGER protection — when RETURNING omits a coreId, no VideoLocale upsert fires for it; empty-locale-set short-circuit
    • dubs: missing-video-FK skip path
Test Files  71 passed (71)
     Tests  951 passed | 1 todo (952)   ← +24 from baseline

Lint + typecheck clean.

Post-Deploy Monitoring & Validation

  • What to monitor
    • Logs: service=forge-admin event=core-sync.phase.complete (existing structured event) — successful runs should show non-zero created/updated and durationMs of seconds, not the previous 24s/5-error pattern.
    • Failure signal: any core-sync.<entity>.error event with message matching Transaction API error (the old failure mode) — should be impossible after this fix; if it appears, escalate.
  • Validation checks (post-deploy, against admin prod)
    • Fire from a logged-in admin browser console:
      ```js
      const r = await fetch('/api/graphql', { method:'POST', headers:{'content-type':'application/json'}, credentials:'include', body: JSON.stringify({ query: 'mutation { triggerSync(scope: "languages", incremental: false) }' })});
      console.log(JSON.stringify(await r.json(), null, 2));
      ```
    • Expected: phases: [{ phase: \"languages\", updated: 10000+, errors: 0, durationMs: <few seconds> }] — replaces the previous errors: 5, updated: 0, durationMs: 24588 shape.
    • Then SELECT count(*) FROM language against admin prod Postgres → matches Core's language count.
    • Then triggerSync(scope: \"all\", incremental: false) to populate the full catalog.
    • Then re-fire triggerSceneEmbeddingBackfill(coreIds: [\"2_FabioStory\"], locales: [\"en\"]) — should now produce non-zero totalTargets (was 0 because admin's video table was empty).
  • Failure signal / rollback trigger
    • If triggerSync(scope: \"languages\") returns errors: > 0 post-deploy, capture the structured core-sync.<entity>.error log line, revert this PR, and investigate. The fallback is to ship the simpler timeout bump (5_00060_000) discussed in the PR thread.
  • Validation window & owner

Refs

🤖 Generated with Claude Code — Claude Opus 4.7 (1M context) via Compound Engineering ce:work

…ansaction that timed out at 5s in prod

All 5 Core sync phases (languages, countries, keywords, videos, dubs)
wrapped 500 sequential prisma.upsert() calls in
$transaction({ timeout: 5_000 }). Confirmed during the R1 prod smoke
on 2026-04-27: every page hit "Transaction API error: Transaction not
found", logged 5 identical errors over ~16s, and wrote zero rows. 500
upserts × ~10ms over Railway's proxy = 5s with zero margin.

Replaces the per-row transaction-wrapped upsert with one bulk
INSERT ... ON CONFLICT DO UPDATE statement per entity per page:

- Single round-trip instead of N (~100x faster).
- Atomic without a transaction wrapper (single statement).
- Not subject to the interactive-transaction timeout at all.

Per-phase notes:
- languages: page-loop, 1 bulk INSERT per page.
- keywords: single batch, 1 bulk INSERT.
- countries: 2 bulk INSERTs per run — deduped continent set first
  (RETURNING the (id, core_id) map), then countries with continent_id
  resolved from that map.
- videos: 2 bulk INSERTs per page — Video bulk upsert with the
  source='manager' protection moved into the ON CONFLICT WHERE clause
  ("WHERE \"video\".\"source\" != 'manager'::\"SourceTier\""), then a
  single bulk VideoLocale upsert flattened across (videoId, locale)
  tuples for the videos that survived the WHERE filter (manager-
  protected videos are absent from the RETURNING set, so their locale
  rows are also untouched).
- dubs: 1 bulk INSERT per page with the same MANAGER WHERE clause.
  Variants whose video FK can't be resolved are now logged + skipped
  (event=core-sync.video-dub.skipped, reason=missing_video_fk) rather
  than throwing — one stale FK shouldn't poison the rest of the page.

Behaviour preserved:
- Per-page error handling (catch, increment stats.errors, log
  structured event, continue paging).
- Soft-delete sweep at end of each phase (still uses prisma.updateMany,
  unchanged).
- Loop termination: still break when rawRows.length < PAGE_SIZE.
- Structured log event names unchanged.
- MANAGER protection guarantee unchanged (semantically equivalent;
  enforced by the SQL WHERE clause instead of a JS pre-pass findUnique).

Helpers extracted into src/services/core-sync/bulk-upsert.ts:
- newRowId() — wraps @paralleldrive/cuid2's createId() to mirror
  Prisma's @default(cuid()) since DB columns have no Postgres-side
  default (verified via 0001_init migration).
- jsonbParam(value) — `${JSON.stringify(value)}::jsonb` — required
  cast for jsonb columns under bound-parameter INSERT (cf. PG18 cast
  rules called out in apps/admin/CLAUDE.md "Common pitfalls").

Tests:
- sync-languages.test.ts: existing 3 tests updated for the new bulk
  shape; added 2 (bulk-INSERT SQL invariant + bulk-error path). Uses
  mockReset (not clearAllMocks) so queued mockResolvedValueOnce
  values don't leak across tests — caught while writing the new
  cases.
- sync-keywords / sync-countries / sync-videos / sync-dubs: NEW test
  files (3-5 cases each) covering happy path, SQL invariant
  (INSERT shape, ON CONFLICT DO UPDATE, no $transaction), edge cases
  (empty batch, bulk-error), and entity-specific concerns:
  * countries: deduped continent INSERT + RETURNING-driven Country
    INSERT; skip continent INSERT when none in batch.
  * videos: MANAGER protection — when RETURNING omits a coreId, no
    VideoLocale upsert fires for it. Empty-locale-set short-circuit.
  * dubs: missing-video-FK skip path.

Verification expected after deploy:
- triggerSync(scope: "languages", incremental: false) →
  phases: [{ phase: "languages", updated: 10000+, errors: 0,
  durationMs: a few seconds }]
- Then triggerSync(scope: "all", incremental: false) to populate the
  full catalog.
- Then re-fire triggerSceneEmbeddingBackfill(coreIds:
  ["2_FabioStory"], locales: ["en"]) — should now produce non-zero
  targets.

Refs the R1 prod smoke handoff
(docs/handoffs/2026-04-21-admin-migration-r1-smoke-and-r2-handoff.md)
and apps/admin/CLAUDE.md R1 runbook "R0 prereq" paragraph (added
2026-04-27 alongside PR #845).
@railway-app
Copy link
Copy Markdown

railway-app Bot commented Apr 28, 2026

🚅 Deployed to the forge-pr-846 environment in forge

4 services not affected by this PR
  • @forge/cms/db
  • @forge/cms
  • @forge/manager
  • @forge/web

Kneesal added 2 commits April 28, 2026 00:21
P1 (resilience):
- Tighten CoreVideoSchema and CoreDubSchema `updatedAt` from
  `z.string().min(1)` to `z.string().datetime()`. Without the strict
  parse, a malformed Core timestamp produced `Invalid Date` at
  `new Date(...)` time, which then bound to TIMESTAMPTZ and aborted
  the entire 500-row bulk INSERT. Now bad rows are dropped at parse
  time and counted as `core-sync.<entity>.parse-error` per phase,
  preserving the rest of the batch.
- Fix `lengthInMilliseconds === 0` silently round-tripping as NULL.
  The truthiness check `variant.lengthInMilliseconds ? BigInt(...) :
  null` treated 0 as falsy. Replaced with `== null` check. Latent in
  legacy too; fixing here since the bulk path is the new canonical
  writer.
- Restore `mapLabel` typed enum return type. Now returns
  `VideoLabelDb | null` instead of bare `string | null`, so a typo
  in the MAP fails at compile time rather than at SQL bind time
  (where the `?::"VideoLabel"` cast would reject the value and abort
  the page).

P2 (observability + atomicity):
- Add `skipped` counter to SyncStats; sync-dubs increments it on
  each missing-video-FK skip. Distinct from `errors` (which gates
  soft-delete) since skipping is an expected outcome of inter-phase
  ordering, not a failure. Operators can now alert on
  `skipped > threshold` instead of mining logs.
- Wrap the two-statement sequences in sync-countries (Continent →
  Country) and sync-videos (Video → VideoLocale) in
  `$transaction({ timeout: 30_000, maxWait: 5_000 })`. Avoids the
  partial-state window where step 1 commits before step 2 fails. Safe
  vs the prod-failure that motivated this PR — each statement is a
  single bulk INSERT (sub-second), not a 500-iteration per-row loop;
  30s has plenty of headroom.
- Pre-bucket each video's localized arrays into Map<bcp47, value>
  once instead of running four `.find()` linear scans per locale
  per video. Removes an O(L × N × 4) hot-path cost. New
  `bucketByLocale` helper at the bottom of sync-videos.ts.
- Add `bulkErrorLogFields(err)` helper in bulk-upsert.ts that
  surfaces Prisma's `code` / `meta` (Postgres SQLSTATE + constraint
  name on UNIQUE/CHECK violations) in the structured catch logs,
  alongside `firstCoreId` / `lastCoreId` of the failing page. Lets
  operators identify the bad row range without bisecting the Core
  payload.

P3 (docs):
- Update newRowId() JSDoc to acknowledge cuid v2 (this helper) vs
  Prisma `@default(cuid())` v1 mismatch and document why the
  divergence is intentional.

Tests:
- 5 new test cases across sync-videos.test.ts and sync-countries.
  test.ts — including the cross-statement atomicity case (step 1
  succeeds, step 2 fails → wrapped transaction rolls back step 1)
  flagged as a coverage gap by the testing reviewer.
- Existing sync-videos / sync-countries tests refactored to share a
  `makePrismaStub` helper whose `$transaction(async tx => …)`
  invokes the callback with a `tx` proxy that delegates to the
  mocked $queryRaw / $executeRaw. The existing per-statement mocks
  keep working unchanged.
- sync-dubs.test.ts orphan-FK case asserts `stats.skipped === 1`.
- All five $transaction-wrapping assertions also pin the deliberate
  `{ timeout: 30_000, maxWait: 5_000 }` options so anyone re-tightens
  the timeout has to touch the test on purpose.

Verification:
- pnpm --filter @forge/admin typecheck: clean.
- pnpm --filter @forge/admin test: 953 passed (up from 951 — net +2
  cases counting the dubs/test-mock refactors).
- pnpm --filter @forge/admin lint: clean.

Deferred (out of this PR's scope; documented for follow-up):
- Soft-delete `notIn: [...seenCoreIds]` will exceed Postgres 65535
  bind-param limit at scale (perf-001). Real risk for videos/dubs at
  ~50K+ rows. Fix is to invert the sweep to a `synced_at < runStart`
  watermark — touches every phase + watermark.ts. Follow-up issue.
- No real-Postgres integration tests covering the actual SQL
  semantics. Separate test-infra work.
- `bulkUpsertByCoreId` helper extraction to collapse the five
  near-identical per-phase scaffolds. Defer until the 6th phase
  (personalization) needs it.
…s predecessor

Four new docs from PR #846 (admin Core sync bulk-upsert refactor) +
one refresh of the cms predecessor.

New:
- docs/solutions/database-issues/prisma-transaction-timeout-wrong-tool-for-per-row-bulk-20260428.md
  — the failure mode: $transaction({timeout:5_000}) wrapping a per-row
  upsert loop reliably hits "Transaction API error: Transaction not
  found" at scale. Surfaced during admin's R1 prod smoke 2026-04-27
  (5 errors / 0 rows written / 24s per page).
- docs/solutions/best-practices/prisma-bulk-upsert-pattern-20260428.md
  — the replacement: $executeRaw + Prisma.join + INSERT ... ON CONFLICT
  DO UPDATE. Includes the helpers (newRowId via @paralleldrive/cuid2,
  jsonbParam for the PG18 cast, bulkErrorLogFields for catch logging)
  and the multi-statement-in-$transaction variant with rationale for
  why a 30s timeout is safe (each statement is sub-second, vs the
  500-iteration loop that motivated this PR).
- docs/solutions/best-practices/prisma-on-conflict-where-row-protection-20260428.md
  — per-row protection at the SQL layer via ON CONFLICT ... WHERE +
  RETURNING absence-as-signal. Replaces the legacy JS pre-pass
  findUnique pattern; admin uses this for MANAGER protection on Video
  and VideoDub. Documents the absence-as-signal trick so dependent
  code can correctly skip protected rows.
- docs/solutions/performance-issues/soft-delete-notIn-watermark-anti-pattern-20260428.md
  — deferred follow-up identified during PR #846 review:
  prisma.<entity>.updateMany({where:{coreId:{notIn:[...seenCoreIds]}}})
  exceeds Postgres's 65535 bind-param limit at ~50K+ rows. Status:
  documented but not yet implemented (admin's catalog is empty so the
  failure isn't yet observable). Recommended fix: invert to
  synced_at < runStartedAt watermark, mirroring the manager backfill
  worker pattern.

Refreshed:
- docs/solutions/cms/core-sync-per-page-upsert-pattern.md — added
  "Sibling implementation: admin (PR #846)" section with cross-
  references to the 3 new admin-side docs above; added the soft-delete
  notIn watermark caveat to the existing "Gotchas" section (the cms
  pattern is fine at cms scale but doesn't generalize to admin's
  eventual catalog); bumped frontmatter last_updated and added PR #846
  to related_issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code) — Claude Opus 4.7 (1M context) via Compound Engineering ce:compound + ce:compound-refresh
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant