Skip to content

fix: chunk byline IN clauses to stay within D1 SQL variable limit#223

Open
baezor wants to merge 4 commits intoemdash-cms:mainfrom
baezor:fix/byline-sql-batch-chunking
Open

fix: chunk byline IN clauses to stay within D1 SQL variable limit#223
baezor wants to merge 4 commits intoemdash-cms:mainfrom
baezor:fix/byline-sql-batch-chunking

Conversation

@baezor
Copy link
Copy Markdown

@baezor baezor commented Apr 4, 2026

What does this PR do?

Fixes unbounded IN (?, ?, …) clauses in byline hydration that exceed Cloudflare D1's SQL bound-parameter limit when querying large collections.

Adds a chunks() utility (utils/chunks.ts) and applies it defense-in-depth at the repository level so any caller is protected:

  • BylineRepository.getContentBylinesMany — deduplicates IDs, then chunks content_id IN (…)
  • BylineRepository.findByUserIds — chunks user_id IN (…)
  • getAuthorIds (bylines/index.ts) — chunks id IN (…) raw SQL

Each batch is capped at 50 IDs, well within D1's limit. Content IDs are also deduplicated before chunking to prevent duplicate credits when the same ID spans multiple chunks.

Closes #219

Type of change

  • Bug fix
  • Feature (requires approved Discussion)
  • Refactor (no behavior change)
  • Documentation
  • Performance improvement
  • Tests
  • Chore (dependencies, CI, tooling)

Checklist

  • I have read CONTRIBUTING.md
  • pnpm typecheck passes
  • pnpm --silent lint:json | jq '.diagnostics | length' returns 0
  • pnpm test passes (or targeted tests for my change)
  • pnpm format has been run
  • I have added/updated tests for my changes (if applicable)
  • I have added a changeset (if this PR changes a published package)
  • New features link to an approved Discussion: https://github.com/emdash-cms/emdash/discussions/...

AI-generated code disclosure

  • This PR includes AI-generated code

Screenshots / test output

All 26 tests pass across 3 test files (10 new):

 ✓ chunks > returns empty array for empty input
 ✓ chunks > returns single chunk when array fits within size
 ✓ chunks > splits array into even chunks
 ✓ chunks > handles remainder in last chunk
 ✓ chunks > handles chunk size of 1
 ✓ chunks > handles array exactly equal to chunk size
 ✓ SQL_BATCH_SIZE > is 50
 ✓ BylineRepository > getContentBylinesMany handles more IDs than SQL_BATCH_SIZE
 ✓ BylineRepository > getContentBylinesMany does not duplicate credits for repeated content IDs
 ✓ BylineRepository > findByUserIds handles more IDs than SQL_BATCH_SIZE
 ✓ getBylinesForEntries > handles batches larger than SQL_BATCH_SIZE across explicit and inferred bylines

baezor added 3 commits April 4, 2026 01:26
Fixes emdash-cms#219. hydrateEntryBylines builds unbounded IN (?, ?, …) clauses
that exceed Cloudflare D1's bound-parameter limit on large collections.

Adds a chunks() utility and applies it defense-in-depth at the
repository level: getContentBylinesMany, findByUserIds, and
getAuthorIds now batch IDs in groups of 50.
Deduplicates contentIds in getContentBylinesMany to prevent duplicate
credits when the same ID appears across chunk boundaries. Adds tests
for the duplication edge case and an end-to-end getBylinesForEntries
test spanning both explicit and inferred byline paths.
@changeset-bot
Copy link
Copy Markdown

changeset-bot bot commented Apr 4, 2026

🦋 Changeset detected

Latest commit: fddf1ee

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 9 packages
Name Type
emdash Patch
@emdash-cms/cloudflare Patch
@emdash-cms/plugin-ai-moderation Patch
@emdash-cms/plugin-atproto Patch
@emdash-cms/plugin-audit-log Patch
@emdash-cms/plugin-color Patch
@emdash-cms/plugin-embeds Patch
@emdash-cms/plugin-forms Patch
@emdash-cms/plugin-webhook-notifier Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@baezor
Copy link
Copy Markdown
Author

baezor commented Apr 4, 2026

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Apr 4, 2026
@baezor
Copy link
Copy Markdown
Author

baezor commented Apr 9, 2026

Confirming this addresses a real production bug — we just hit it at prepain.mx with emdash@0.1.1 on Cloudflare D1, with 124 posts in the collection. When the admin bundle loads /admin and issues GET /_emdash/api/content/posts?limit=100, the worker logs:

Content list error: Error: D1_ERROR: too many SQL variables at offset 369: SQLITE_ERROR

Minimal standalone repro of the D1 limit itself (just to isolate the constraint from emdash):

wrangler d1 execute <db> --remote --command \
  "SELECT 1 WHERE 'x' = ? AND 1 IN (?,?,…×100)"
# → too many SQL variables at offset 231: SQLITE_ERROR [code: 7500]

Per the D1 limits docs, D1 caps bound parameters at 100 per query — far below SQLite upstream's default of 32,766. Any WHERE col = ? AND id IN (?×100) clause produces 101 params and trips the limit. That's why this bug only reproduces on D1 at scale, and not on local SQLite fixtures.

Downstream (React #300 "rendered fewer hooks than expected") is a symptom — the admin bundle throws on the 500 response and the hook count goes inconsistent during the re-render. Fixing the query fixes the crash.

Heads-up: one uncovered site

While tracing the failure in prod we noticed that SeoRepository.getMany (packages/core/src/database/repositories/seo.ts:67-78) has the exact same unbounded IN pattern:

const rows = await this.db
  .selectFrom("_emdash_seo")
  .selectAll()
  .where("collection", "=", collection)
  .where("content_id", "in", contentIds)  // ← 100 ids + 1 collection = 101 params
  .execute();

And it's called from handleContentList before hydrateBylinesMany:

// packages/core/src/api/handlers/content.ts:257-259
const hasSeo = await collectionHasSeo(db, collection);
await hydrateSeoMany(db, collection, result.items, hasSeo);
await hydrateBylinesMany(db, collection, result.items);

So any collection with has_seo = 1 and ≥99 items will still 500 on the content list endpoint even after this PR merges. Our own posts collection has has_seo = 1, and it's the first function to fail in our trace.

Happy to open a follow-up PR that reuses the new chunks() helper + SQL_BATCH_SIZE for SeoRepository.getMany, or you can fold it into this PR if you'd prefer to land both together — whichever you prefer.

(As a short-term unblock we shipped a local patch-package hot-patch against our node_modules/emdash/dist/** that chunks all three functions — SEO, bylines, and findByUserIds — in batches of 90. We'll drop it the moment a release with the upstream fix ships.)

@github-actions
Copy link
Copy Markdown
Contributor

Overlapping PRs

This PR modifies files that are also changed by other open PRs:

This may cause merge conflicts or duplicated work. A maintainer will coordinate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

hydrateEntryBylines exceeds D1 SQL variable limit on large collections

1 participant