diff --git a/docs/pipeline/indesign-pdf-fidelity.md b/docs/pipeline/indesign-pdf-fidelity.md
new file mode 100644
index 0000000..bde3efe
--- /dev/null
+++ b/docs/pipeline/indesign-pdf-fidelity.md
@@ -0,0 +1,103 @@
+# InDesign PDF fallback: fidelity guide
+
+The InDesign-to-WordPress pipeline prefers **IDML** (`.idml`) as its input. When
+IDML isn't available — a client only has the exported PDF, or you want to
+cross-check IDML output — the pipeline can parse a **PDF exported from InDesign**
+instead.
+
+PDF is intentionally a *fallback*. A PDF has no named styles, no text frames,
+and no swatch palette; it's a bag of absolutely-positioned glyph runs, fill
+colors, and placed images. The parser reconstructs an approximate version of the
+same [intermediate representation](../../packages/pipeline/src/indesign/ir.js)
+the IDML parser produces, so the downstream mapper and generator can consume
+either source. **We never aim for pixel-perfect reconstruction — we aim for a
+usable, styled IR that the generator can turn into WordPress patterns with
+manual touch-ups.**
+
+## Usage
+
+```bash
+# Print the reconstructed IR as JSON; fidelity warnings go to stderr.
+node packages/pipeline/bin/parse-pdf.mjs brochure.pdf > ir.json
+
+# Also extract embedded images to an asset cache directory (as PNG).
+node packages/pipeline/bin/parse-pdf.mjs brochure.pdf --asset-dir ./assets > ir.json
+```
+
+```js
+import { parsePdf } from '@flavian/pipeline';
+
+const ir = await parsePdf('./brochure.pdf', {
+  dpi: 96,                       // unit normalization (default 96)
+  assetCacheDir: './assets',     // optional; write extracted images here
+  swatchPalette: idml.swatches,  // optional; swatches from a parseIdml() result, used to snap detected colors
+});
+```
+
+## How reconstruction works
+
+| IR element | How it's derived from the PDF |
+| --- | --- |
+| `Spread` (one per page) | One spread per PDF page; page size from the MediaBox. |
+| `TextFrame` | Glyph runs are grouped into lines (shared baseline), then lines into frames (vertically adjacent + horizontally overlapping). A wide horizontal gap on a shared baseline is treated as a **column gutter**, so side-by-side columns become separate frames. |
+| `Story` / `TextRun` | One story per text frame; each run carries the paragraph-style reference of its font-size bucket. |
+| `Style` | Synthesized from the font-size distribution: the most-used size is **Body**, larger sizes become **Heading 1..6** (largest first), smaller sizes become **Caption**. Each bucket records its dominant font and fill color. |
+| `Font` | Resolved from each run's PostScript name (subset prefixes like `ABCDEF+` stripped); family/style split on the `-` and refined with pdfjs bold/italic flags. |
+| `Swatch` | Distinct fill colors found in the content stream, normalized to hex. With a `swatchPalette`, each color snaps to the nearest IDML swatch (so PDF and IDML produce aligned token names). |
+| `ImageFrame` | Image XObjects, placed via the current transform matrix. Pixels are PNG-encoded into the asset cache; `href` points at the cache-relative path. |
+| `MasterSpread` | Always empty — PDF has no master pages. |
+
+## Fidelity warnings
+
+Every PDF parse attaches warnings describing the approximations made. They appear
+on the CLI's stderr and in `ir.warnings`. Treat them as a checklist of things to
+verify by eye.
+
+| Code | Meaning | When |
+| --- | --- | --- |
+| `pdf-fallback` | The whole IR is approximate; prefer IDML if you have it. | Always |
+| `text-reconstructed-from-glyphs` | Text came from positioned glyph runs; ligatures, hidden text, and reading order may differ. | Always |
+| `styles-synthesized` | Paragraph styles are font-size buckets, not real named styles. | When any text exists |
+| `no-embedded-fonts` | No fonts are embedded; family/style mapping is best-effort from PostScript names. | Fonts are referenced but none are embedded |
+| `color-attribution-approximate` | Colors are bucketed by font size, not resolved per run. | When any colored text exists |
+| `vector-paths-dropped` | Vector paths / image masks were detected but aren't represented in the IR. | When the page draws vector fills/strokes |
+| `multi-column-layout` | A page was split into N columns / separate frames. | When >1 column is detected |
+| `image-extract-failed` | An image couldn't be decoded (e.g. an unsupported filter). | Per failed image |
+| `empty-page` | A page produced no text or image frames. | Per empty page |
+| `asset-write-failed` | An extracted image couldn't be written to the asset cache. | Per failed write |
+
+## Round-trip tolerances
+
+The test suite builds the *same logical document* as both IDML and PDF and
+asserts the two IRs agree within these tolerances (see
+`packages/pipeline/tests/indesign/pdf-roundtrip.test.mjs`):
+
+| Quantity | Tolerance |
+| --- | --- |
+| Page / spread count | Exact |
+| Image frame count | Exact |
+| Text frame count | Within ±1 |
+| Style bucket count | Within ±1 |
+| Swatch identity | Detected PDF colors snap onto the IDML swatch palette |
+
+These are deliberately loose on text-frame and style counts: InDesign knows
+exactly where each frame begins and ends, but the PDF only shows glyph positions,
+so a heading and its body paragraph may merge or split, shifting a count by one.
+
+## Known limitations
+
+- **Geometry is approximate.** Frame rectangles are derived from glyph baselines
+  using nominal ascent/descent ratios (0.8 / 0.2 of font size), not true font
+  metrics. Rotated or skewed text is flattened to its axis-aligned bounding box.
+- **Per-run color is not resolved.** Color is attributed at the style-bucket
+  (font-size) level, because the IR carries color on `Style`, not `TextRun`.
+- **Vector art is dropped.** Backgrounds, rules, and shapes drawn as vector paths
+  are noted via `vector-paths-dropped` but not reconstructed.
+- **Image masks aren't extracted.** Stencil-masked images paint the current fill
+  through a 1-bit mask and have no extractable raster; they're treated as vector
+  content.
+- **Leading/tracking are omitted.** The fallback doesn't infer line spacing or
+  tracking; the mapper applies its own defaults.
+
+When fidelity matters, export IDML from InDesign and use the
+[IDML parser](../../packages/pipeline/README.md) instead.
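
Illustration for the "How reconstruction works" table in `docs/pipeline/indesign-pdf-fidelity.md` (commentary between file diffs, not part of the patch). The guide says glyph runs are grouped into lines by shared baseline, and that a wide horizontal gap on a shared baseline is treated as a column gutter. A minimal sketch of that idea follows. It is not the code in `src/indesign/pdf/cluster.js`: the function names, the run shape (`x`, `y`, `width` in points), and both thresholds are assumptions chosen for illustration.

```javascript
// Hypothetical sketch only; the real cluster.js may use different rules.
const BASELINE_TOLERANCE_PT = 2; // assumed: runs this close in y share a line
const GUTTER_PT = 18;            // assumed: gaps wider than this split columns

// Group glyph runs whose baselines sit within the tolerance into lines.
function groupIntoLines(runs) {
  // PDF y grows upward, so sort top-to-bottom by descending y.
  const sorted = [...runs].sort((a, b) => b.y - a.y || a.x - b.x);
  const lines = [];
  for (const run of sorted) {
    const line = lines[lines.length - 1];
    if (line && Math.abs(line.y - run.y) <= BASELINE_TOLERANCE_PT) {
      line.runs.push(run);
    } else {
      lines.push({ y: run.y, runs: [run] });
    }
  }
  // Restore left-to-right order inside each line.
  for (const line of lines) line.runs.sort((a, b) => a.x - b.x);
  return lines;
}

// Split one line wherever the gap between neighbouring runs exceeds the gutter.
function splitAtGutters(line) {
  const segments = [[line.runs[0]]];
  for (let i = 1; i < line.runs.length; i += 1) {
    const prev = line.runs[i - 1];
    const gap = line.runs[i].x - (prev.x + prev.width);
    if (gap > GUTTER_PT) segments.push([]);
    segments[segments.length - 1].push(line.runs[i]);
  }
  return segments;
}

// Made-up sample: two columns on one baseline, then a second line on the left.
const sampleRuns = [
  { text: 'Left', x: 72, y: 700, width: 100 },
  { text: 'Right', x: 320, y: 700.5, width: 100 },
  { text: 'Next', x: 72, y: 686, width: 80 },
];
const sampleLines = groupIntoLines(sampleRuns);
const sampleSegments = splitAtGutters(sampleLines[0]);
```

With the sample data, the first two runs land on one line and are then split into two segments at the 148 pt gap, which is how side-by-side columns would end up as separate frames under these assumed thresholds.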
diff --git a/packages/pipeline/README.md b/packages/pipeline/README.md
index 1c1ebec..144fb50 100644
--- a/packages/pipeline/README.md
+++ b/packages/pipeline/README.md
@@ -4,26 +4,39 @@ Conversion pipeline for InDesign (and future) sources into WordPress FSE themes.
 
 ## Status
 
-This package currently ships the **IDML parser and intermediate representation** (sub-issue #62 of the InDesign-to-WordPress epic). Downstream stages — PDF fallback (#63), style + token mapper (#64), output generator (#65) — will land as separate PRs. The IR shape produced here is the contract those stages consume.
+This package ships the **IDML parser** (sub-issue #62) and the **PDF fallback parser** (sub-issue #63) of the InDesign-to-WordPress epic. Both emit the same intermediate representation. Downstream stages — style + token mapper (#64), output generator (#65) — will land as separate PRs. The IR shape produced here is the contract those stages consume.
+
+IDML is the primary path (full access to stories, frames, styles, swatches, masters). PDF is a lossy fallback for when only the exported PDF is available, or as a verification source against IDML output — see [`docs/pipeline/indesign-pdf-fidelity.md`](../../docs/pipeline/indesign-pdf-fidelity.md).
 
 ## Layout
 
 ```
 packages/pipeline/
-├── bin/parse-idml.mjs        CLI entry; prints validated IR JSON on stdout
+├── bin/
+│   ├── parse-idml.mjs        CLI: IDML → validated IR JSON on stdout
+│   └── parse-pdf.mjs         CLI: PDF → reconstructed IR JSON on stdout
 └── src/
     ├── index.js              Re-exports the InDesign surface
     └── indesign/
         ├── ir.js             zod schemas + JSDoc typedefs for the IR
-        ├── parse-idml.js     Main entry: unzips + orchestrates + cross-refs + validates
+        ├── parse-idml.js     IDML entry: unzips + orchestrates + cross-refs + validates
+        ├── parse-pdf.js      PDF entry: extracts + clusters + classifies + validates
         ├── units.js          pt/pc/mm/cm/in → px at configurable DPI
         ├── warnings.js       Non-fatal warning collector
-        └── parsers/
-            ├── xml.js        fast-xml-parser wrapper
-            ├── designmap.js  designmap.xml → manifest with paths
-            ├── resources.js  Graphic.xml + Fonts.xml + Styles.xml
-            ├── stories.js    Stories/Story_*.xml → text runs
-            └── spreads.js    Spreads/*.xml + MasterSpreads/*.xml
+        ├── parsers/          IDML XML decoders
+        │   ├── xml.js        fast-xml-parser wrapper
+        │   ├── designmap.js  designmap.xml → manifest with paths
+        │   ├── resources.js  Graphic.xml + Fonts.xml + Styles.xml
+        │   ├── stories.js    Stories/Story_*.xml → text runs
+        │   └── spreads.js    Spreads/*.xml + MasterSpreads/*.xml
+        └── pdf/              PDF reconstruction modules
+            ├── pdfjs.js      Lazy pdfjs-dist loader (headless, extraction-only)
+            ├── extract.js    Per-page: text runs, fonts, colors, images, vector flag
+            ├── cluster.js    Glyph runs → lines → frames; column detection (pure)
+            ├── classify.js   Font-size buckets → heading/body/caption styles (pure)
+            ├── color.js      RGB/gray/CMYK → hex; nearest-swatch matching (pure)
+            ├── png.js        Decoded pixels → PNG via node:zlib (pure)
+            └── assets.js     Write extracted images to the asset cache
 ```
 
 ## Quick start
@@ -47,6 +60,26 @@ Or from the command line:
 node packages/pipeline/bin/parse-idml.mjs my-document.idml > ir.json
 ```
 
+### PDF fallback
+
+When you only have a PDF exported from InDesign, use the fallback parser. It
+emits the same IR, plus fidelity warnings describing every approximation it made.
+
+```js
+import { parsePdf } from '@flavian/pipeline';
+
+const ir = await parsePdf('./brochure.pdf', {
+  assetCacheDir: './assets',     // optional: write extracted images (PNG) here
+  swatchPalette: idml?.swatches, // optional: snap detected colors to IDML swatches
+});
+```
+
+```bash
+node packages/pipeline/bin/parse-pdf.mjs brochure.pdf --asset-dir ./assets > ir.json
+```
+
+PDF reconstruction is lossy by design. See [`docs/pipeline/indesign-pdf-fidelity.md`](../../docs/pipeline/indesign-pdf-fidelity.md) for how each IR element is derived, the full list of fidelity-warning codes, and the round-trip tolerances against IDML.
+
 ## IR shape
 
 The intermediate representation is described in [`src/indesign/ir.js`](src/indesign/ir.js). At the top level:
@@ -70,10 +103,12 @@ Geometry (`Page.bounds`, `Frame.bounds`) is normalized to pixels at `dpi` (defau
 
 ## Failure mode
 
-- **Throws** on structural problems that make the IR meaningless: missing `designmap.xml`, malformed zip, a `<Spread>` element that lacks `Self`.
-- **Warns and continues** on everything else: missing optional resource files, dangling style references, unknown color spaces, empty stories, unrecognized unit suffixes.
+Both parsers share the same philosophy: throw only when the document can't be read at all; otherwise emit a partial IR with warnings.
+
+- **IDML throws** on missing `designmap.xml`, a malformed zip, or a `<Spread>` lacking `Self`; **warns** on missing optional resources, dangling references, unknown color spaces, empty stories, unrecognized units.
+- **PDF throws** only when the file can't be opened as a PDF; **warns** on every approximation (text reconstructed from glyphs, synthesized styles, dropped vector paths, undecodable images, …). PDF parses always carry fidelity warnings — that's expected.
 
-The CLI surfaces warnings on stderr and exits 0 unless the IR itself failed to build.
+Each CLI surfaces warnings on stderr and exits 0 unless the IR itself failed to build.
 
 ## Testing
 
@@ -81,7 +116,9 @@ The CLI surfaces warnings on stderr and exits 0 unless the IR itself failed to b
 pnpm --filter @flavian/pipeline test
 ```
 
-Tests build minimal IDML zips programmatically (see `tests/indesign/helpers/build-idml.js`) — no binary fixtures in git. The fixture builder mirrors the IDML XML grammar the parser reads, so adding a new test case is usually one option flag.
+Tests build minimal fixtures programmatically — no binary fixtures in git. `tests/indesign/helpers/build-idml.js` emits IDML zips; `tests/indesign/helpers/build-pdf.js` emits PDFs (positioned text in base-14 fonts, FlateDecode image XObjects, vector fills). Building the *same logical document* both ways powers the IDML↔PDF round-trip test.
+
+The PDF heuristics (clustering, classification, color, PNG encoding) are split into pure modules under `src/indesign/pdf/` and unit-tested without a PDF engine; only `extract.js` and the orchestrator touch pdfjs.
 
 ## Adding a new input format
 
diff --git a/packages/pipeline/bin/parse-pdf.mjs b/packages/pipeline/bin/parse-pdf.mjs
new file mode 100644
index 0000000..8c052cf
--- /dev/null
+++ b/packages/pipeline/bin/parse-pdf.mjs
@@ -0,0 +1,85 @@
+#!/usr/bin/env node
+// CLI: print the reconstructed IR as JSON on stdout, fidelity warnings on stderr.
+//
+//   flavian-parse-pdf <path.pdf> [--dpi <n>] [--asset-dir <dir>] [--quiet]
+//
+// PDF is the fallback path. Expect fidelity warnings on every run — that's the
+// parser telling you which parts are approximate.
+
+import { parsePdf } from '../src/indesign/parse-pdf.js';
+
+const args = process.argv.slice(2);
+let inputPath;
+let dpi;
+let assetCacheDir;
+let quiet = false;
+
+for (let i = 0; i < args.length; i += 1) {
+	const arg = args[i];
+	if (arg === '--dpi') {
+		const next = args[i + 1];
+		if (!next || !Number.isFinite(Number(next)) || Number(next) <= 0) {
+			console.error('--dpi requires a positive number');
+			process.exit(2);
+		}
+		dpi = Number(next);
+		i += 1;
+	} else if (arg === '--asset-dir') {
+		const next = args[i + 1];
+		if (!next) {
+			console.error('--asset-dir requires a directory path');
+			process.exit(2);
+		}
+		assetCacheDir = next;
+		i += 1;
+	} else if (arg === '--quiet') {
+		quiet = true;
+	} else if (arg === '-h' || arg === '--help') {
+		printUsage();
+		process.exit(0);
+	} else if (!inputPath && !arg.startsWith('-')) {
+		inputPath = arg;
+	} else {
+		console.error(`Unknown argument: ${arg}`);
+		printUsage();
+		process.exit(2);
+	}
+}
+
+if (!inputPath) {
+	printUsage();
+	process.exit(2);
+}
+
+try {
+	const options = {};
+	if (dpi !== undefined) options.dpi = dpi;
+	if (assetCacheDir !== undefined) options.assetCacheDir = assetCacheDir;
+	const ir = await parsePdf(inputPath, options);
+	if (!quiet && ir.warnings.length > 0) {
+		for (const w of ir.warnings) {
+			const where = w.context?.file ? ` (${w.context.file}${w.context.id ? `#${w.context.id}` : ''})` : '';
+			process.stderr.write(`[${w.code}] ${w.message}${where}\n`);
+		}
+		process.stderr.write(`\n${ir.warnings.length} warning(s).\n`);
+	}
+	process.stdout.write(JSON.stringify(ir, null, 2) + '\n');
+} catch (err) {
+	process.stderr.write(`error: ${err.message}\n`);
+	process.exit(1);
+}
+
+function printUsage() {
+	process.stderr.write(
+		[
+			'Usage: flavian-parse-pdf <path.pdf> [options]',
+			'',
+			'Options:',
+			'  --dpi <n>          Pixels per inch for unit normalization (default 96)',
+			'  --asset-dir <dir>  Write extracted images (PNG) under this directory',
+			'  --quiet            Suppress fidelity warnings on stderr',
+			'  -h, --help         Show this help',
+			'',
+		].join('\n'),
+	);
+}
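
Illustration for the `--dpi` option in `bin/parse-pdf.mjs` (commentary between file diffs, not part of the patch). PDF geometry is expressed in points, and there are 72 points per inch, so the DPI value scales every bound in the IR. The helper below is an assumed stand-in for `ptToPx()` in `src/indesign/units.js`, which this diff does not show; the real helper may round or validate differently.

```javascript
// Assumed stand-in for ptToPx(); units.js itself is not part of this diff.
function ptToPxSketch(pt, dpi = 96) {
  return (pt * dpi) / 72; // 72 PostScript points per inch
}

// A US Letter page is 612 x 792 pt.
const letterAt96 = {
  width: ptToPxSketch(612),      // default 96 dpi
  height: ptToPxSketch(792),
};
const letterAt72 = {
  width: ptToPxSketch(612, 72),  // at 72 dpi one point maps to one pixel
  height: ptToPxSketch(792, 72),
};
```

Under this formula a Letter page is 816 x 1056 px at the default 96 dpi and 612 x 792 px at 72 dpi.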
diff --git a/packages/pipeline/package.json b/packages/pipeline/package.json
index 4f79a4c..ba2dcf6 100644
--- a/packages/pipeline/package.json
+++ b/packages/pipeline/package.json
@@ -9,7 +9,8 @@
     "./indesign": "./src/indesign/index.js"
   },
   "bin": {
-    "flavian-parse-idml": "./bin/parse-idml.mjs"
+    "flavian-parse-idml": "./bin/parse-idml.mjs",
+    "flavian-parse-pdf": "./bin/parse-pdf.mjs"
   },
   "scripts": {
     "test": "node --test \"tests/**/*.test.mjs\""
@@ -17,6 +18,7 @@
   "dependencies": {
     "fast-xml-parser": "^5.7.0",
     "fflate": "^0.8.2",
+    "pdfjs-dist": "^4.10.38",
     "zod": "^3.23.8"
   },
   "engines": {
diff --git a/packages/pipeline/src/indesign/index.js b/packages/pipeline/src/indesign/index.js
index 8fa1f9b..e76cd88 100644
--- a/packages/pipeline/src/indesign/index.js
+++ b/packages/pipeline/src/indesign/index.js
@@ -1,4 +1,5 @@
 export { parseIdml, parseIdmlBuffer } from './parse-idml.js';
+export { parsePdf, parsePdfBuffer } from './parse-pdf.js';
 export * as ir from './ir.js';
 export { WarningCollector } from './warnings.js';
 export { lengthToPx, ptToPx, roundPx } from './units.js';
diff --git a/packages/pipeline/src/indesign/parse-pdf.js b/packages/pipeline/src/indesign/parse-pdf.js
new file mode 100644
index 0000000..79c1d84
--- /dev/null
+++ b/packages/pipeline/src/indesign/parse-pdf.js
@@ -0,0 +1,327 @@
+// PDF fallback parser. Reads a PDF exported from InDesign and reconstructs an
+// approximation of the same IR the IDML parser emits, so the downstream mapper
+// and generator can consume either source.
+//
+// PDF is intentionally lossy: there are no named styles, no text frames, no
+// swatch palette — just absolutely-positioned glyph runs, fill colors, and
+// placed images. We rebuild a *usable, styled* IR (not a pixel-perfect one) and
+// attach fidelity warnings describing every approximation we made. Use IDML
+// when you have it; use this when you don't, or to cross-check IDML output.
+//
+// Failure philosophy mirrors parse-idml.js: throw only when the document can't
+// be opened at all; everything else becomes a warning and a partial IR.
+
+import { promises as fs } from 'node:fs';
+
+import { Document } from './ir.js';
+import { WarningCollector } from './warnings.js';
+import { ptToPx, roundPx } from './units.js';
+import { openDocument, loadPdfjs } from './pdf/pdfjs.js';
+import { extractPage } from './pdf/extract.js';
+import { clusterIntoFrames, detectColumns } from './pdf/cluster.js';
+import { classifyStyles } from './pdf/classify.js';
+import { nearestSwatch } from './pdf/color.js';
+import { assetHref, writeAsset } from './pdf/assets.js';
+
+const DEFAULT_DPI = 96;
+
+/**
+ * @typedef {Object} ParsePdfOptions
+ * @property {number} [dpi]                 Pixels-per-inch for unit normalization. Default 96.
+ * @property {string} [name]                Fallback document name, used only when the PDF has no /Title (parsePdf defaults it to the file basename).
+ * @property {string} [assetCacheDir]       If set, extracted images are PNG-encoded and written here.
+ * @property {Array<import('./ir.js').SwatchIR>} [swatchPalette]  IDML-derived swatches to snap detected colors to.
+ */
+
+/**
+ * Parse a PDF file from disk.
+ *
+ * @param {string} path
+ * @param {ParsePdfOptions} [options]
+ * @returns {Promise<import('./ir.js').DocumentIR>}
+ */
+export async function parsePdf(path, options = {}) {
+	const bytes = await fs.readFile(path);
+	const fallbackName = path.split(/[\\/]/).pop()?.replace(/\.pdf$/i, '');
+	return parsePdfBuffer(bytes, { ...options, name: options.name ?? fallbackName });
+}
+
+/**
+ * Parse PDF bytes already in memory.
+ *
+ * @param {Uint8Array} bytes
+ * @param {ParsePdfOptions} [options]
+ * @returns {Promise<import('./ir.js').DocumentIR>}
+ */
+export async function parsePdfBuffer(bytes, options = {}) {
+	const dpi = options.dpi ?? DEFAULT_DPI;
+	const palette = options.swatchPalette ?? [];
+	const warnings = new WarningCollector();
+
+	let doc;
+	try {
+		doc = await openDocument(bytes);
+	} catch (err) {
+		throw new Error(`PDF could not be opened: ${err.message}`, { cause: err });
+	}
+
+	let title;
+	try {
+		const md = await doc.getMetadata();
+		title = md?.info?.Title || undefined;
+	} catch {
+		// Metadata is optional; ignore.
+	}
+
+	const pdfjs = await loadPdfjs();
+
+	// --- Pass 1: extract every page, normalizing font keys to a global identity.
+	const fontsById = new Map(); // fontId -> { id, family, style, postScriptName }
+	const pages = [];
+	const allItems = [];
+	const allColorSamples = [];
+	let anyVector = false;
+	let anyEmbeddedFont = false;
+	let sawFonts = false;
+
+	for (let p = 0; p < doc.numPages; p += 1) {
+		const page = await doc.getPage(p + 1);
+		const extracted = await extractPage(page, pdfjs);
+		// Note: we deliberately don't call page.cleanup() here, because decoded image bytes
+		// are PNG-encoded later, and cleanup can clear page.objs out from under us.
+		// doc.cleanup() at the end releases everything.
+
+		// loader font key (page-scoped, e.g. "g_d0_f1") -> stable global font id.
+		const localToGlobal = new Map();
+		for (const [localKey, font] of extracted.fonts) {
+			sawFonts = true;
+			if (font.embedded) anyEmbeddedFont = true;
+			const id = fontId(font.name);
+			if (!fontsById.has(id)) {
+				fontsById.set(id, { id, family: font.family, style: font.style, postScriptName: font.name });
+			}
+			localToGlobal.set(localKey, id);
+		}
+
+		const items = extracted.textItems.map((it) => ({
+			...it,
+			fontKey: localToGlobal.get(it.fontKey) ?? it.fontKey,
+		}));
+		allItems.push(...items);
+		allColorSamples.push(...extracted.colorSamples);
+		if (extracted.hasVector) anyVector = true;
+
+		pages.push({ index: p, ...extracted, items });
+	}
+
+	// --- Swatches: distinct detected colors, snapped to the IDML palette if given.
+	const swatches = [];
+	const hexToSwatchId = new Map();
+	for (const sample of allColorSamples) {
+		if (hexToSwatchId.has(sample.hex)) continue;
+		const matched = palette.length ? nearestSwatch(sample.hex, palette) : null;
+		if (matched) {
+			hexToSwatchId.set(sample.hex, matched.id);
+			if (!swatches.some((s) => s.id === matched.id)) swatches.push(matched);
+		} else {
+			const id = `pdf-color-${sample.hex.slice(1)}`;
+			hexToSwatchId.set(sample.hex, id);
+			swatches.push({ id, name: sample.hex.toUpperCase(), color: { hex: sample.hex, space: 'RGB' } });
+		}
+	}
+
+	// --- Styles: synthesize buckets from font-size distribution.
+	const { buckets, styleIdForSize } = classifyStyles({ items: allItems, colorSamples: allColorSamples, dpi });
+	const styles = buckets.map((b) => ({
+		id: b.id,
+		name: b.name,
+		kind: 'paragraph',
+		fontSize: b.fontSizePx,
+		fontRef: b.dominantFontKey && fontsById.has(b.dominantFontKey) ? b.dominantFontKey : undefined,
+		fillColorRef: b.dominantHex ? hexToSwatchId.get(b.dominantHex) : undefined,
+		properties: { role: b.role, sourceSizePt: b.sizePt },
+	}));
+
+	// --- Spreads: one per page, with reconstructed text + image frames.
+	const stories = [];
+	const spreads = [];
+	const pendingWrites = []; // { href, image } to persist if assetCacheDir is set
+	for (const page of pages) {
+		const pageNum = page.index + 1;
+		const frames = [];
+
+		const blocks = clusterIntoFrames(page.items);
+		const columns = detectColumns(blocks);
+		if (columns > 1) {
+			warnings.add(
+				'multi-column-layout',
+				`Page ${pageNum}: detected ${columns} columns; emitted as ${blocks.length} separate text frames`,
+				{ file: `pdf:page:${pageNum}` },
+			);
+		}
+
+		blocks.forEach((block, idx) => {
+			const storyId = `pdf-story-p${pageNum}-${idx + 1}`;
+			stories.push({ id: storyId, source: `pdf:page:${pageNum}`, runs: blockToRuns(block, styleIdForSize) });
+			frames.push({
+				kind: 'text',
+				id: `pdf-frame-p${pageNum}-t${idx + 1}`,
+				bounds: rectToPx(block.bounds, dpi),
+				storyRef: storyId,
+			});
+		});
+
+		page.images.forEach((img, idx) => {
+			const href = assetHref(page.index, idx);
+			frames.push({
+				kind: 'image',
+				id: `pdf-frame-p${pageNum}-i${idx + 1}`,
+				bounds: boxToPx(img, dpi),
+				href,
+				embedded: true,
+			});
+			if (img.failed) {
+				warnings.add('image-extract-failed', `Page ${pageNum}: could not decode image ${href}`, {
+					file: `pdf:page:${pageNum}`,
+				});
+			} else if (options.assetCacheDir) {
+				// Defer the actual write (performed after all pages) so a failed write doesn't abort the parse.
+				pendingWrites.push({ href, image: img.image });
+			}
+		});
+
+		if (frames.length === 0) {
+			warnings.add('empty-page', `Page ${pageNum} produced no text or image frames`, {
+				file: `pdf:page:${pageNum}`,
+			});
+		}
+
+		spreads.push({
+			id: `pdf-spread-${pageNum}`,
+			source: `pdf:page:${pageNum}`,
+			pages: [{ id: `pdf-page-${pageNum}`, bounds: { x: 0, y: 0, width: roundPx(ptToPx(page.widthPt, dpi)), height: roundPx(ptToPx(page.heightPt, dpi)) } }],
+			frames,
+			appliedMasterRef: undefined,
+		});
+	}
+
+	// --- Persist extracted images, if a cache dir was given.
+	if (options.assetCacheDir) {
+		for (const w of pendingWrites) {
+			try {
+				await writeAsset(options.assetCacheDir, w.href, w.image);
+			} catch (err) {
+				warnings.add('asset-write-failed', `Failed writing ${w.href}: ${err.message}`);
+			}
+		}
+	}
+
+	// --- Fidelity warnings: describe every approximation.
+	addFidelityWarnings(warnings, {
+		sawFonts,
+		anyEmbeddedFont,
+		anyColor: allColorSamples.length > 0,
+		anyVector,
+		hasStyles: styles.length > 0,
+	});
+
+	const document = Document.parse({
+		irVersion: 1,
+		// Embedded /Title wins over the caller-supplied/filename fallback, mirroring
+		// how the IDML parser prefers designmap's @Name.
+		meta: { name: title ?? options.name },
+		dpi,
+		swatches,
+		fonts: [...fontsById.values()],
+		styles,
+		stories,
+		spreads,
+		masterSpreads: [],
+		warnings: warnings.list(),
+	});
+
+	await doc.cleanup();
+	return document;
+}
+
+/**
+ * @param {import('./pdf/cluster.js').TextBlock} block
+ * @param {(sizePt: number) => string | undefined} styleIdForSize
+ * @returns {Array<import('./ir.js').TextRun>}
+ */
+function blockToRuns(block, styleIdForSize) {
+	const runs = [];
+	for (const line of block.lines) {
+		line.items.forEach((item, idx) => {
+			let text = item.text;
+			if (idx < line.items.length - 1) text += ' ';
+			runs.push({ text, paragraphStyleRef: styleIdForSize(item.fontSize) });
+		});
+		// Close each line with a newline so prose re-flows downstream.
+		if (runs.length > 0 && !runs[runs.length - 1].text.endsWith('\n')) {
+			runs[runs.length - 1].text += '\n';
+		}
+	}
+	return runs;
+}
+
+function rectToPx(b, dpi) {
+	return {
+		x: roundPx(ptToPx(b.minX, dpi)),
+		y: roundPx(ptToPx(b.minY, dpi)),
+		width: roundPx(ptToPx(Math.max(0, b.maxX - b.minX), dpi)),
+		height: roundPx(ptToPx(Math.max(0, b.maxY - b.minY), dpi)),
+	};
+}
+
+function boxToPx(box, dpi) {
+	return {
+		x: roundPx(ptToPx(box.x, dpi)),
+		y: roundPx(ptToPx(box.y, dpi)),
+		width: roundPx(ptToPx(Math.max(0, box.width), dpi)),
+		height: roundPx(ptToPx(Math.max(0, box.height), dpi)),
+	};
+}
+
+function fontId(psName) {
+	const slug = psName
+		.toLowerCase()
+		.replace(/[^a-z0-9]+/g, '-')
+		.replace(/^-+|-+$/g, '');
+	return `pdf-font-${slug || 'unknown'}`;
+}
+
+function addFidelityWarnings(warnings, flags) {
+	warnings.add(
+		'pdf-fallback',
+		'IR reconstructed from PDF; layout, frames, and styles are approximate. Prefer IDML when available.',
+	);
+	warnings.add(
+		'text-reconstructed-from-glyphs',
+		'Text was recovered from positioned glyph runs; ligatures, hidden text, and reading order may differ from the source.',
+	);
+	if (flags.hasStyles) {
+		warnings.add(
+			'styles-synthesized',
+			'Paragraph styles were inferred from font-size buckets, not real named styles.',
+		);
+	}
+	if (flags.sawFonts && !flags.anyEmbeddedFont) {
+		warnings.add(
+			'no-embedded-fonts',
+			'No embedded fonts found; font family/style mapping is best-effort from PostScript names.',
+		);
+	}
+	if (flags.anyColor) {
+		warnings.add(
+			'color-attribution-approximate',
+			'Swatch attribution is heuristic: colors are bucketed by font size, not resolved per run.',
+		);
+	}
+	if (flags.anyVector) {
+		warnings.add(
+			'vector-paths-dropped',
+			'Vector paths and image masks were detected but are not represented in the IR.',
+		);
+	}
+}
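
Illustration for the swatch-snapping step in `parsePdfBuffer` (commentary between file diffs, not part of the patch). The parser calls `nearestSwatch(sample.hex, palette)` and falls back to a synthetic `pdf-color-…` swatch when nothing matches. `src/indesign/pdf/color.js` is not shown in this diff, so the sketch below is only one plausible way such a lookup could work: the Euclidean RGB metric, the `MAX_DISTANCE` cutoff, and the function names are all assumptions. The swatch shape (`id`, `name`, `color.hex`) follows the objects built in `parsePdfBuffer`.

```javascript
// Hypothetical stand-in for nearestSwatch(); the real color.js may differ.
const MAX_DISTANCE = 48; // assumed cutoff, in 0-255 RGB units

// '#rrggbb' -> [r, g, b]
function hexToRgb(hex) {
  const n = Number.parseInt(hex.slice(1), 16);
  return [(n >> 16) & 0xff, (n >> 8) & 0xff, n & 0xff];
}

// Return the palette swatch closest to `hex`, or null if none is close enough.
function nearestSwatchSketch(hex, palette, maxDistance = MAX_DISTANCE) {
  const [r, g, b] = hexToRgb(hex);
  let best = null;
  let bestDist = Infinity;
  for (const swatch of palette) {
    const [sr, sg, sb] = hexToRgb(swatch.color.hex);
    const dist = Math.hypot(r - sr, g - sg, b - sb);
    if (dist < bestDist) {
      bestDist = dist;
      best = swatch;
    }
  }
  return bestDist <= maxDistance ? best : null;
}

// Made-up palette for illustration.
const samplePalette = [
  { id: 'Color/Brand Red', name: 'Brand Red', color: { hex: '#cc0000', space: 'RGB' } },
  { id: 'Color/Ink', name: 'Ink', color: { hex: '#111111', space: 'RGB' } },
];
const snapped = nearestSwatchSketch('#c80404', samplePalette);   // near Brand Red
const unmatched = nearestSwatchSketch('#00ff00', samplePalette); // far from both
```

A cutoff like this keeps an unrelated color from being forced onto a distant swatch; in that case the caller would mint its own swatch id, as the `else` branch in `parsePdfBuffer` does.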
diff --git a/packages/pipeline/src/indesign/pdf/assets.js b/packages/pipeline/src/indesign/pdf/assets.js
new file mode 100644
index 0000000..8765f58
--- /dev/null
+++ b/packages/pipeline/src/indesign/pdf/assets.js
@@ -0,0 +1,41 @@
+// Asset cache writer. Extracted images are PNG-encoded and written under a
+// caller-provided directory; the IR's ImageFrame.href points at the
+// cache-relative path so the downstream media importer can find them.
+//
+// Writing is opt-in: with no assetCacheDir the parser still records image
+// frames and their hrefs (so the IR is complete and addressable), it just
+// doesn't persist bytes — useful for verification runs and tests that only
+// assert structure.
+
+import { promises as fs } from 'node:fs';
+import path from 'node:path';
+
+import { encodePng } from './png.js';
+
+/**
+ * Stable, collision-free href for an extracted image.
+ * @param {number} pageIndex 0-based
+ * @param {number} imageIndex 0-based
+ * @returns {string}
+ */
+export function assetHref(pageIndex, imageIndex) {
+	const p = String(pageIndex + 1).padStart(3, '0');
+	const n = String(imageIndex + 1).padStart(3, '0');
+	return `assets/pdf-p${p}-img${n}.png`;
+}
+
+/**
+ * Encode + write one image to the cache. Returns the byte length written.
+ *
+ * @param {string} cacheDir
+ * @param {string} href cache-relative path (from assetHref)
+ * @param {{ width: number, height: number, kind: number, data: Uint8Array }} image
+ * @returns {Promise<number>}
+ */
+export async function writeAsset(cacheDir, href, image) {
+	const png = encodePng(image);
+	const dest = path.join(cacheDir, href);
+	await fs.mkdir(path.dirname(dest), { recursive: true });
+	await fs.writeFile(dest, png);
+	return png.length;
+}
diff --git a/packages/pipeline/src/indesign/pdf/classify.js b/packages/pipeline/src/indesign/pdf/classify.js
new file mode 100644
index 0000000..63c773a
--- /dev/null
+++ b/packages/pipeline/src/indesign/pdf/classify.js
@@ -0,0 +1,140 @@
+// Heuristic style synthesis. A PDF carries no named paragraph styles, so we
+// infer them from the only signal we have: font size. The most-used size is
+// "body"; larger sizes become headings (largest = Heading 1); smaller sizes
+// become captions. Each synthesized bucket also remembers the font and fill
+// color most associated with that size, so the token mapper downstream can turn
+// it into a theme.json preset.
+//
+// This is deliberately coarse. The IDML parser reports real styles; this is the
+// fallback's best approximation and is flagged as such in the IR warnings.
+
+import { ptToPx, roundPx } from '../units.js';
+
+// Round sizes so 11.999pt and 12.001pt land in the same bucket.
+const SIZE_QUANTUM = 0.5;
+const MAX_HEADING_LEVEL = 6;
+
+function roundSize(pt) {
+	return Math.round(pt / SIZE_QUANTUM) * SIZE_QUANTUM;
+}
+
+/**
+ * @param {Map<string, number>} tally
+ * @returns {string | undefined} key with the highest count
+ */
+function argmax(tally) {
+	let best;
+	let bestN = -Infinity;
+	for (const [key, n] of tally) {
+		if (n > bestN) {
+			bestN = n;
+			best = key;
+		}
+	}
+	return best;
+}
+
+/**
+ * @typedef {Object} StyleBucket
+ * @property {string} id
+ * @property {string} name
+ * @property {'heading'|'body'|'caption'} role
+ * @property {number} sizePt
+ * @property {number} fontSizePx
+ * @property {string} [dominantFontKey]
+ * @property {string} [dominantHex]
+ *
+ * @typedef {Object} ClassifyResult
+ * @property {StyleBucket[]} buckets
+ * @property {(sizePt: number) => string | undefined} styleIdForSize
+ */
+
+/**
+ * @param {{
+ *   items: Array<{ fontSize: number, fontKey?: string, text: string }>,
+ *   colorSamples?: Array<{ fontSizePt: number, hex: string, glyphs: number }>,
+ *   dpi: number,
+ * }} input
+ * @returns {ClassifyResult}
+ */
+export function classifyStyles({ items, colorSamples = [], dpi }) {
+	// chars-per-size, plus per-size font and color tallies.
+	const charsBySize = new Map();
+	const fontBySize = new Map(); // size -> Map(fontKey -> chars)
+	for (const item of items) {
+		const size = roundSize(item.fontSize);
+		const len = item.text.length;
+		charsBySize.set(size, (charsBySize.get(size) ?? 0) + len);
+		if (item.fontKey) {
+			const fonts = fontBySize.get(size) ?? new Map();
+			fonts.set(item.fontKey, (fonts.get(item.fontKey) ?? 0) + len);
+			fontBySize.set(size, fonts);
+		}
+	}
+
+	const colorBySize = new Map(); // size -> Map(hex -> glyphs)
+	for (const sample of colorSamples) {
+		const size = roundSize(sample.fontSizePt);
+		const colors = colorBySize.get(size) ?? new Map();
+		colors.set(sample.hex, (colors.get(sample.hex) ?? 0) + sample.glyphs);
+		colorBySize.set(size, colors);
+	}
+
+	const sizes = [...charsBySize.keys()];
+	if (sizes.length === 0) {
+		return { buckets: [], styleIdForSize: () => undefined };
+	}
+
+	// Body = most-used size. Ties resolve to the smaller size (body text usually
+	// outnumbers display text, and the smaller of two equal counts is the safer
+	// "body" pick for sparse pages).
+	let bodySize = sizes[0];
+	for (const size of sizes) {
+		const n = charsBySize.get(size);
+		const bestN = charsBySize.get(bodySize);
+		if (n > bestN || (n === bestN && size < bodySize)) {
+			bodySize = size;
+		}
+	}
+
+	const larger = sizes.filter((s) => s > bodySize).sort((a, b) => b - a);
+	const smaller = sizes.filter((s) => s < bodySize).sort((a, b) => b - a);
+
+	/** @type {Map<number, StyleBucket>} */
+	const bySize = new Map();
+	const makeBucket = (size, role, id, name) => {
+		const fonts = fontBySize.get(size);
+		const colors = colorBySize.get(size);
+		const bucket = {
+			id,
+			name,
+			role,
+			sizePt: size,
+			fontSizePx: roundPx(ptToPx(size, dpi)),
+			dominantFontKey: fonts ? argmax(fonts) : undefined,
+			dominantHex: colors ? argmax(colors) : undefined,
+		};
+		bySize.set(size, bucket);
+		return bucket;
+	};
+
+	const buckets = [];
+	larger.forEach((size, i) => {
+		const level = Math.min(i + 1, MAX_HEADING_LEVEL);
+		buckets.push(makeBucket(size, 'heading', `pdf-style-h${level}`, `Heading ${level}`));
+	});
+	buckets.push(makeBucket(bodySize, 'body', 'pdf-style-body', 'Body'));
+	smaller.forEach((size, i) => {
+		const suffix = i === 0 ? '' : `-${i + 1}`;
+		const name = i === 0 ? 'Caption' : `Caption ${i + 1}`;
+		buckets.push(makeBucket(size, 'caption', `pdf-style-caption${suffix}`, name));
+	});
+
+	// Headings first (largest → smallest), then body, then captions.
+	buckets.sort((a, b) => b.sizePt - a.sizePt);
+
+	return {
+		buckets,
+		styleIdForSize: (sizePt) => bySize.get(roundSize(sizePt))?.id,
+	};
+}
diff --git a/packages/pipeline/src/indesign/pdf/cluster.js b/packages/pipeline/src/indesign/pdf/cluster.js
new file mode 100644
index 0000000..ba7b1ae
--- /dev/null
+++ b/packages/pipeline/src/indesign/pdf/cluster.js
@@ -0,0 +1,194 @@
+// Positional clustering: turn a flat bag of positioned text runs into logical
+// text frames, the way a reader would group them.
+//
+// PDF has no concept of a "text frame" — InDesign flattens everything to
+// absolutely-positioned glyph runs. We reconstruct frames in two passes:
+//   1. Runs sharing a baseline (within tolerance) become a line.
+//   2. Lines that are vertically adjacent AND horizontally overlapping become a
+//      block (frame). Processing lines top-to-bottom while keeping several
+//      blocks "open" lets side-by-side columns fall out naturally — a line only
+//      joins a block in its own column.
+//
+// All geometry here is in points with a top-left origin (y grows downward),
+// which is what the orchestrator converts to px when emitting the IR.
+
+// Fractions of font size used to approximate a glyph run's vertical box from
+// its baseline. Real ascent/descent vary per font; these are good enough to
+// cluster and to draw a frame rectangle a human would accept.
+const ASCENT_RATIO = 0.8;
+const DESCENT_RATIO = 0.2;
+
+// A run joins the current line if its baseline is within this fraction of the
+// smaller of the run's font size and the line's largest font size so far.
+const LINE_BASELINE_TOL = 0.5;
+
+// A line joins a block only if the vertical gap to the block's last line is no
+// more than this multiple of the line height — enough for paragraph spacing,
+// not enough to swallow a separate block further down the page.
+const BLOCK_GAP_FACTOR = 1.8;
+
+// Runs on the same baseline but separated by more than this multiple of the
+// font size are treated as different columns, not one wide line. Print columns
+// commonly share baselines, so baseline proximity alone can't tell them apart —
+// the horizontal gutter does.
+const GUTTER_FACTOR = 2.5;
+
+/**
+ * @typedef {Object} TextItem
+ * @property {string} text
+ * @property {number} x          Left edge (pt).
+ * @property {number} baseline   Baseline y, top-left origin (pt).
+ * @property {number} width      Run advance width (pt).
+ * @property {number} fontSize   Font size (pt).
+ * @property {string} fontKey    Stable key into the page's font table.
+ *
+ * @typedef {Object} Line
+ * @property {number} baseline
+ * @property {number} top
+ * @property {number} bottom
+ * @property {number} left
+ * @property {number} right
+ * @property {number} lineHeight
+ * @property {TextItem[]} items
+ *
+ * @typedef {Object} TextBlock
+ * @property {Line[]} lines
+ * @property {TextItem[]} items
+ * @property {{minX: number, minY: number, maxX: number, maxY: number}} bounds
+ */
+
+function itemTop(item) {
+	return item.baseline - item.fontSize * ASCENT_RATIO;
+}
+
+function itemBottom(item) {
+	return item.baseline + item.fontSize * DESCENT_RATIO;
+}
+
+function rangesOverlap(aMin, aMax, bMin, bMax) {
+	return aMin < bMax && bMin < aMax;
+}
+
+/**
+ * Group runs sharing a baseline into lines.
+ *
+ * @param {TextItem[]} items
+ * @returns {Line[]}
+ */
+export function groupLines(items) {
+	const sorted = [...items].sort((a, b) => a.baseline - b.baseline || a.x - b.x);
+	/** @type {Line[]} */
+	const lines = [];
+	for (const item of sorted) {
+		const last = lines[lines.length - 1];
+		const tol = Math.min(item.fontSize, last ? last.lineHeight : item.fontSize) * LINE_BASELINE_TOL;
+		const sameBaseline = last && Math.abs(item.baseline - last.baseline) <= tol;
+		// A wide horizontal gap on the same baseline is a column gutter, not a
+		// space — start a new line so the two columns cluster apart.
+		const gutter = Math.max(last ? last.lineHeight : item.fontSize, item.fontSize) * GUTTER_FACTOR;
+		const acrossGutter = last && item.x - last.right > gutter;
+		if (sameBaseline && !acrossGutter) {
+			last.items.push(item);
+			last.left = Math.min(last.left, item.x);
+			last.right = Math.max(last.right, item.x + item.width);
+			last.top = Math.min(last.top, itemTop(item));
+			last.bottom = Math.max(last.bottom, itemBottom(item));
+			last.lineHeight = Math.max(last.lineHeight, item.fontSize);
+		} else {
+			lines.push({
+				baseline: item.baseline,
+				top: itemTop(item),
+				bottom: itemBottom(item),
+				left: item.x,
+				right: item.x + item.width,
+				lineHeight: item.fontSize,
+				items: [item],
+			});
+		}
+	}
+	// Reading order within each line.
+	for (const line of lines) {
+		line.items.sort((a, b) => a.x - b.x);
+	}
+	return lines;
+}
+
+/**
+ * Group lines into blocks (frames). Multiple blocks stay open at once so that
+ * two columns processed in interleaved vertical order don't merge.
+ *
+ * @param {Line[]} lines
+ * @returns {TextBlock[]}
+ */
+export function groupBlocks(lines) {
+	/** @type {Array<{lines: Line[], left: number, right: number, lastBottom: number}>} */
+	const open = [];
+	for (const line of [...lines].sort((a, b) => a.top - b.top)) {
+		let target = null;
+		for (const block of open) {
+			const gap = line.top - block.lastBottom;
+			const tol = BLOCK_GAP_FACTOR * line.lineHeight;
+			if (gap <= tol && rangesOverlap(block.left, block.right, line.left, line.right)) {
+				target = block;
+				break;
+			}
+		}
+		if (!target) {
+			target = { lines: [], left: line.left, right: line.right, lastBottom: -Infinity };
+			open.push(target);
+		}
+		target.lines.push(line);
+		target.left = Math.min(target.left, line.left);
+		target.right = Math.max(target.right, line.right);
+		target.lastBottom = Math.max(target.lastBottom, line.bottom);
+	}
+
+	return open.map((block) => {
+		const items = block.lines.flatMap((l) => l.items);
+		const minY = Math.min(...block.lines.map((l) => l.top));
+		const maxY = Math.max(...block.lines.map((l) => l.bottom));
+		return {
+			lines: block.lines,
+			items,
+			bounds: { minX: block.left, minY, maxX: block.right, maxY },
+		};
+	});
+}
+
+/**
+ * Full pipeline: positioned runs → text frames, in top-to-bottom reading order.
+ *
+ * @param {TextItem[]} items
+ * @returns {TextBlock[]}
+ */
+export function clusterIntoFrames(items) {
+	const withText = items.filter((it) => it.text && it.text.trim().length > 0);
+	if (withText.length === 0) return [];
+	const blocks = groupBlocks(groupLines(withText));
+	return blocks.sort((a, b) => a.bounds.minY - b.bounds.minY || a.bounds.minX - b.bounds.minX);
+}
+
+/**
+ * Count distinct columns: clusters of frames whose horizontal x-ranges don't
+ * overlap. Used for the multi-column fidelity check and round-trip reporting.
+ *
+ * @param {TextBlock[]} blocks
+ * @returns {number}
+ */
+export function detectColumns(blocks) {
+	if (blocks.length === 0) return 0;
+	const intervals = blocks
+		.map((b) => ({ min: b.bounds.minX, max: b.bounds.maxX }))
+		.sort((a, b) => a.min - b.min);
+	let columns = 1;
+	let currentMax = intervals[0].max;
+	for (let i = 1; i < intervals.length; i += 1) {
+		if (intervals[i].min >= currentMax) {
+			columns += 1;
+			currentMax = intervals[i].max;
+		} else {
+			currentMax = Math.max(currentMax, intervals[i].max);
+		}
+	}
+	return columns;
+}
diff --git a/packages/pipeline/src/indesign/pdf/color.js b/packages/pipeline/src/indesign/pdf/color.js
new file mode 100644
index 0000000..9be97bb
--- /dev/null
+++ b/packages/pipeline/src/indesign/pdf/color.js
@@ -0,0 +1,93 @@
+// Color normalization for PDF fill operators, plus nearest-match against an
+// IDML-derived swatch palette.
+//
+// PDF content streams set fill color via three operator families:
+//   rg  -> DeviceRGB   (pdfjs hands us 0..255 ints)
+//   g   -> DeviceGray  (pdfjs hands us a single 0..255 int)
+//   k   -> DeviceCMYK  (pdfjs hands us 0..1 floats)
+// We collapse all of them to "#rrggbb". When a palette from a sibling IDML
+// parse is available, we snap each detected color to the closest swatch so the
+// PDF and IDML pipelines produce aligned token names downstream.
+
+/**
+ * @param {number} n
+ * @returns {string} two-digit lowercase hex
+ */
+function hexByte(n) {
+	return Math.max(0, Math.min(255, Math.round(n))).toString(16).padStart(2, '0');
+}
+
+/**
+ * @param {[number, number, number]} rgb 0..255 per channel
+ * @returns {string}
+ */
+export function rgbToHex([r, g, b]) {
+	return `#${hexByte(r)}${hexByte(g)}${hexByte(b)}`;
+}
+
+/**
+ * @param {number} gray 0..255
+ * @returns {string}
+ */
+export function grayToHex(gray) {
+	return rgbToHex([gray, gray, gray]);
+}
+
+/**
+ * DeviceCMYK (0..1) → hex via the same naive conversion the IDML graphic
+ * parser uses, so identical CMYK swatches land on identical hex in both pipelines.
+ *
+ * @param {[number, number, number, number]} cmyk 0..1 per channel
+ * @returns {string}
+ */
+export function cmykToHex([c, m, y, k]) {
+	const r = 255 * (1 - c) * (1 - k);
+	const g = 255 * (1 - m) * (1 - k);
+	const b = 255 * (1 - y) * (1 - k);
+	return rgbToHex([r, g, b]);
+}
+
+/**
+ * @param {string} hex "#rrggbb"
+ * @returns {[number, number, number]}
+ */
+export function hexToRgb(hex) {
+	const m = /^#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(hex);
+	if (!m) return [0, 0, 0];
+	return [parseInt(m[1], 16), parseInt(m[2], 16), parseInt(m[3], 16)];
+}
+
+/**
+ * Squared Euclidean distance in RGB. Squared is enough for "which is closest"
+ * and avoids a sqrt per comparison.
+ *
+ * @param {string} a "#rrggbb"
+ * @param {string} b "#rrggbb"
+ * @returns {number}
+ */
+export function colorDistance(a, b) {
+	const [r1, g1, b1] = hexToRgb(a);
+	const [r2, g2, b2] = hexToRgb(b);
+	return (r1 - r2) ** 2 + (g1 - g2) ** 2 + (b1 - b2) ** 2;
+}
+
+/**
+ * Find the closest swatch in a palette, within a tolerance.
+ *
+ * @param {string} hex "#rrggbb"
+ * @param {Array<import('../ir.js').SwatchIR>} palette
+ * @param {number} [maxDistance] Squared-distance cutoff (default ~24/channel).
+ * @returns {import('../ir.js').SwatchIR | null}
+ */
+export function nearestSwatch(hex, palette, maxDistance = 24 * 24 * 3) {
+	let best = null;
+	let bestDist = Infinity;
+	for (const swatch of palette) {
+		const dist = colorDistance(hex, swatch.color.hex);
+		if (dist < bestDist) {
+			bestDist = dist;
+			best = swatch;
+		}
+	}
+	return best && bestDist <= maxDistance ? best : null;
+}
diff --git a/packages/pipeline/src/indesign/pdf/extract.js b/packages/pipeline/src/indesign/pdf/extract.js
new file mode 100644
index 0000000..7134551
--- /dev/null
+++ b/packages/pipeline/src/indesign/pdf/extract.js
@@ -0,0 +1,228 @@
+// The one module that talks to pdfjs. Everything pdfjs-shaped is converted here
+// into plain data the pure modules (cluster, classify, color, png) can consume,
+// so the heuristics stay unit-testable without a PDF engine.
+//
+// Per page we pull three things:
+//   - positioned text runs (getTextContent) → geometry + font key per run
+//   - font metadata (commonObjs) → real PostScript name + embedded flag
+//   - a walk of the operator list → fill-color-per-size samples, placed images,
+//     and whether any vector paths were drawn (which we cannot represent)
+//
+// Coordinates are converted from PDF's bottom-left origin to a top-left origin
+// (points). The orchestrator scales points → px.
+
+import { rgbToHex, grayToHex, cmykToHex } from './color.js';
+
+const SUBSET_PREFIX = /^[A-Z]{6}\+/;
+
+/**
+ * @param {[number, number, number, number, number, number]} m
+ * @param {number} x
+ * @param {number} y
+ * @returns {[number, number]}
+ */
+function applyMatrix(m, x, y) {
+	return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
+}
+
+/** Concatenate `cm` onto the current matrix (PDF row-vector convention). */
+function multiply(cm, ctm) {
+	return [
+		cm[0] * ctm[0] + cm[1] * ctm[2],
+		cm[0] * ctm[1] + cm[1] * ctm[3],
+		cm[2] * ctm[0] + cm[3] * ctm[2],
+		cm[2] * ctm[1] + cm[3] * ctm[3],
+		cm[4] * ctm[0] + cm[5] * ctm[2] + ctm[4],
+		cm[4] * ctm[1] + cm[5] * ctm[3] + ctm[5],
+	];
+}
+
+function parseFontName(psName, fontObj) {
+	const clean = psName.replace(SUBSET_PREFIX, '');
+	const dash = clean.indexOf('-');
+	let family = dash >= 0 ? clean.slice(0, dash) : clean;
+	let style = dash >= 0 ? clean.slice(dash + 1) : '';
+	if (!style) {
+		if (fontObj?.bold && fontObj?.italic) style = 'Bold Italic';
+		else if (fontObj?.bold) style = 'Bold';
+		else if (fontObj?.italic) style = 'Italic';
+		else style = 'Regular';
+	}
+	// "Times-Roman" reads better as family Times, style Regular for web mapping.
+	if (style === 'Roman') style = 'Regular';
+	return { family: family || clean, style };
+}
+
+function getImageObject(page, name) {
+	return new Promise((resolve) => {
+		try {
+			if (page.objs.has(name)) {
+				resolve(page.objs.get(name));
+			} else {
+				page.objs.get(name, resolve);
+			}
+		} catch {
+			resolve(null);
+		}
+	});
+}
+
+/**
+ * @param {import('pdfjs-dist/legacy/build/pdf.mjs').PDFPageProxy} page
+ * @param {typeof import('pdfjs-dist/legacy/build/pdf.mjs')} pdfjs
+ * @returns {Promise<{
+ *   widthPt: number,
+ *   heightPt: number,
+ *   textItems: Array<{ text: string, x: number, baseline: number, width: number, fontSize: number, fontKey: string }>,
+ *   fonts: Map<string, { name: string, family: string, style: string, embedded: boolean, type: string }>,
+ *   colorSamples: Array<{ fontSizePt: number, hex: string, glyphs: number }>,
+ *   images: Array<{ x: number, y: number, width: number, height: number, image: object | null, failed: boolean }>,
+ *   hasVector: boolean,
+ * }>}
+ */
+export async function extractPage(page, pdfjs) {
+	const { OPS } = pdfjs;
+	const [x0, y0, x1, y1] = page.view;
+	const widthPt = x1 - x0;
+	const heightPt = y1 - y0;
+
+	// Operator list first: it populates objs/commonObjs and gives us colors,
+	// images, and vector presence.
+	const opList = await page.getOperatorList();
+
+	let ctm = [1, 0, 0, 1, 0, 0];
+	const ctmStack = [];
+	let fillHex = '#000000';
+	let currentSize = 0;
+	let hasVector = false;
+	const colorSamples = [];
+	const images = [];
+
+	const countGlyphs = (glyphs) =>
+		Array.isArray(glyphs) ? glyphs.filter((g) => g && typeof g === 'object' && 'unicode' in g).length : 0;
+
+	for (let i = 0; i < opList.fnArray.length; i += 1) {
+		const fn = opList.fnArray[i];
+		const args = opList.argsArray[i];
+		switch (fn) {
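+			case OPS.save:
+				ctmStack.push([ctm, fillHex, currentSize]); // q saves fill color and font size too
+				break;
+			case OPS.restore:
+				[ctm, fillHex, currentSize] = ctmStack.pop() ?? [ctm, fillHex, currentSize];
+				break;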
+			case OPS.transform:
+				ctm = multiply(args, ctm);
+				break;
+			case OPS.setFillRGBColor:
+				fillHex = rgbToHex([args[0], args[1], args[2]]);
+				break;
+			case OPS.setFillGray:
+				fillHex = grayToHex(args[0]);
+				break;
+			case OPS.setFillCMYKColor:
+				fillHex = cmykToHex([args[0], args[1], args[2], args[3]]);
+				break;
+			case OPS.setFont:
+				currentSize = Math.abs(args[1]);
+				break;
+			case OPS.showText:
+			case OPS.showSpacedText: {
+				const glyphs = countGlyphs(args[0]);
+				if (glyphs > 0 && currentSize > 0) {
+					colorSamples.push({ fontSizePt: currentSize, hex: fillHex, glyphs });
+				}
+				break;
+			}
+			case OPS.fill:
+			case OPS.eoFill:
+			case OPS.stroke:
+			case OPS.fillStroke:
+			case OPS.eoFillStroke:
+			case OPS.closeFillStroke:
+			case OPS.closeEOFillStroke:
+			case OPS.closeStroke:
+				hasVector = true;
+				break;
+			case OPS.paintImageXObject:
+			case OPS.paintImageXObjectRepeat: {
+				const name = args[0];
+				// Unit-square corners (0,0) and (1,1) mapped through the CTM give the placed rectangle (assumes no rotation or skew).
+				const c0 = applyMatrix(ctm, 0, 0);
+				const c1 = applyMatrix(ctm, 1, 1);
+				const left = Math.min(c0[0], c1[0]);
+				const right = Math.max(c0[0], c1[0]);
+				const bottom = Math.min(c0[1], c1[1]);
+				const top = Math.max(c0[1], c1[1]);
+				const obj = await getImageObject(page, name);
+				images.push({
+					x: left,
+					y: heightPt - top, // flip to top-left origin
+					width: right - left,
+					height: top - bottom,
+					image: obj && obj.data ? obj : null,
+					failed: !(obj && obj.data),
+				});
+				break;
+			}
+			case OPS.paintInlineImageXObject: {
+				const obj = args[0];
+				const c0 = applyMatrix(ctm, 0, 0);
+				const c1 = applyMatrix(ctm, 1, 1);
+				const left = Math.min(c0[0], c1[0]);
+				const right = Math.max(c0[0], c1[0]);
+				const bottom = Math.min(c0[1], c1[1]);
+				const top = Math.max(c0[1], c1[1]);
+				images.push({
+					x: left,
+					y: heightPt - top,
+					width: right - left,
+					height: top - bottom,
+					image: obj && obj.data ? obj : null,
+					failed: !(obj && obj.data),
+				});
+				break;
+			}
+			case OPS.paintImageMaskXObject:
+				// Stencil masks paint the current fill through a 1-bit mask; there's
+				// no extractable raster, so we note it as vector-like content.
+				hasVector = true;
+				break;
+			default:
+				break;
+		}
+	}
+
+	// Text geometry + font keys.
+	const textContent = await page.getTextContent();
+	const fonts = new Map();
+	const textItems = [];
+	for (const item of textContent.items) {
+		if (!('str' in item) || item.str.trim().length === 0) continue;
+		const t = item.transform; // [a,b,c,d,e,f]
+		const fontSize = Math.hypot(t[2], t[3]) || item.height || 0;
+		if (fontSize === 0) continue;
+		textItems.push({
+			text: item.str,
+			x: t[4],
+			baseline: heightPt - t[5],
+			width: item.width,
+			fontSize,
+			fontKey: item.fontName,
+		});
+		if (item.fontName && !fonts.has(item.fontName)) {
+			const obj = page.commonObjs.has(item.fontName) ? page.commonObjs.get(item.fontName) : null;
+			const psName = obj?.name ?? item.fontName;
+			const { family, style } = parseFontName(psName, obj);
+			fonts.set(item.fontName, {
+				name: psName,
+				family,
+				style,
+				embedded: !!obj && obj.missingFile === false,
+				type: obj?.type ?? 'unknown',
+			});
+		}
+	}
+
+	return { widthPt, heightPt, textItems, fonts, colorSamples, images, hasVector };
+}
diff --git a/packages/pipeline/src/indesign/pdf/pdfjs.js b/packages/pipeline/src/indesign/pdf/pdfjs.js
new file mode 100644
index 0000000..c5ab833
--- /dev/null
+++ b/packages/pipeline/src/indesign/pdf/pdfjs.js
@@ -0,0 +1,56 @@
+// Thin loader around pdfjs-dist's legacy build (the one that runs in Node
+// without a DOM). We import it lazily so consumers that only ever parse IDML
+// don't pull pdfjs — a multi-megabyte dependency — into their bundle/startup.
+
+let pdfjsPromise;
+
+// pdfjs-dist 4.x calls Promise.withResolvers, which only exists on Node 22+.
+// This package supports Node >=20, so polyfill it (guarded) before pdfjs loads.
+if (typeof Promise.withResolvers !== 'function') {
+	Promise.withResolvers = function withResolvers() {
+		let resolve;
+		let reject;
+		const promise = new Promise((res, rej) => {
+			resolve = res;
+			reject = rej;
+		});
+		return { promise, resolve, reject };
+	};
+}
+
+/**
+ * Resolve the pdfjs module once and cache it.
+ * @returns {Promise<typeof import('pdfjs-dist/legacy/build/pdf.mjs')>}
+ */
+export function loadPdfjs() {
+	if (!pdfjsPromise) {
+		pdfjsPromise = import('pdfjs-dist/legacy/build/pdf.mjs');
+	}
+	return pdfjsPromise;
+}
+
+/**
+ * Open a PDF document from raw bytes with settings tuned for headless,
+ * extraction-only use:
+ *   - no worker / no eval / no system fonts (we never render to a canvas)
+ *   - verbosity errors-only (base-14 fonts otherwise spam "standard font data"
+ *     warnings we handle ourselves via fidelity warnings)
+ *   - a private copy of the bytes, because pdfjs transfers/detaches the buffer
+ *
+ * @param {Uint8Array} bytes
+ * @returns {Promise<import('pdfjs-dist/legacy/build/pdf.mjs').PDFDocumentProxy>}
+ */
+export async function openDocument(bytes) {
+	const pdfjs = await loadPdfjs();
+	// pdfjs requires a *plain* Uint8Array and detaches the buffer it's given.
+	// Node's fs.readFile returns a Buffer (a Uint8Array subclass) which pdfjs
+	// rejects, so always copy into a fresh Uint8Array we can safely hand over.
+	const data = new Uint8Array(bytes.byteLength);
+	data.set(bytes);
+	return pdfjs.getDocument({
+		data,
+		isEvalSupported: false,
+		useSystemFonts: false,
+		verbosity: 0,
+	}).promise;
+}
diff --git a/packages/pipeline/src/indesign/pdf/png.js b/packages/pipeline/src/indesign/pdf/png.js
new file mode 100644
index 0000000..7ea0562
--- /dev/null
+++ b/packages/pipeline/src/indesign/pdf/png.js
@@ -0,0 +1,117 @@
+// Minimal PNG encoder for extracted image data.
+//
+// pdfjs hands back *decoded* pixels (RGB/RGBA/grayscale), not the original
+// encoded stream, so we re-encode to a real image file the downstream media
+// importer can use. PNG is lossless and dependency-free here: deflate comes
+// from node:zlib, and CRC-32 is a tiny table we build once. We deliberately
+// avoid pulling in an image library — a fallback parser shouldn't add weight.
+
+import { deflateSync } from 'node:zlib';
+
+// pdfjs ImageKind values (stable across pdfjs 4.x). Re-declared so we don't
+// depend on importing the enum from the lazily-loaded pdfjs module.
+export const ImageKind = {
+	GRAYSCALE_1BPP: 1,
+	RGB_24BPP: 2,
+	RGBA_32BPP: 3,
+};
+
+const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
+
+const CRC_TABLE = (() => {
+	const table = new Uint32Array(256);
+	for (let n = 0; n < 256; n += 1) {
+		let c = n;
+		for (let k = 0; k < 8; k += 1) {
+			c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
+		}
+		table[n] = c >>> 0;
+	}
+	return table;
+})();
+
+function crc32(buf) {
+	let c = 0xffffffff;
+	for (let i = 0; i < buf.length; i += 1) {
+		c = CRC_TABLE[(c ^ buf[i]) & 0xff] ^ (c >>> 8);
+	}
+	return (c ^ 0xffffffff) >>> 0;
+}
+
+function chunk(type, data) {
+	const typeBuf = Buffer.from(type, 'latin1');
+	const len = Buffer.alloc(4);
+	len.writeUInt32BE(data.length, 0);
+	const crc = Buffer.alloc(4);
+	crc.writeUInt32BE(crc32(Buffer.concat([typeBuf, data])), 0);
+	return Buffer.concat([len, typeBuf, data, crc]);
+}
+
+/**
+ * Normalize pdfjs pixel data to packed RGB or RGBA scanlines.
+ *
+ * @param {{width: number, height: number, kind: number, data: Uint8Array|Uint8ClampedArray}} image
+ * @returns {{channels: 3|4, pixels: Uint8Array}}
+ */
+function toPixels(image) {
+	const { width, height, kind, data } = image;
+	if (kind === ImageKind.RGBA_32BPP) {
+		return { channels: 4, pixels: data instanceof Uint8Array ? data : Uint8Array.from(data) };
+	}
+	if (kind === ImageKind.RGB_24BPP) {
+		return { channels: 3, pixels: data instanceof Uint8Array ? data : Uint8Array.from(data) };
+	}
+	if (kind === ImageKind.GRAYSCALE_1BPP) {
+		// 1 bit per pixel, MSB-first, rows padded to whole bytes. Expand to RGB.
+		const rowBytes = (width + 7) >> 3;
+		const pixels = new Uint8Array(width * height * 3);
+		for (let y = 0; y < height; y += 1) {
+			for (let x = 0; x < width; x += 1) {
+				const bit = (data[y * rowBytes + (x >> 3)] >> (7 - (x & 7))) & 1;
+				const v = bit ? 255 : 0;
+				const o = (y * width + x) * 3;
+				pixels[o] = v;
+				pixels[o + 1] = v;
+				pixels[o + 2] = v;
+			}
+		}
+		return { channels: 3, pixels };
+	}
+	throw new Error(`unsupported image kind ${kind}`);
+}
+
+/**
+ * Encode decoded pixel data as a PNG buffer.
+ *
+ * @param {{width: number, height: number, kind: number, data: Uint8Array|Uint8ClampedArray}} image
+ * @returns {Buffer}
+ */
+export function encodePng(image) {
+	const { width, height } = image;
+	const { channels, pixels } = toPixels(image);
+
+	// Prefix each scanline with a filter byte (0 = none). Keeps the encoder
+	// trivial; deflate still compresses solid regions well.
+	const stride = width * channels;
+	const raw = Buffer.alloc((stride + 1) * height);
+	for (let y = 0; y < height; y += 1) {
+		raw[y * (stride + 1)] = 0;
+		Buffer.from(pixels.buffer, pixels.byteOffset + y * stride, stride).copy(raw, y * (stride + 1) + 1);
+	}
+
+	const ihdr = Buffer.alloc(13);
+	ihdr.writeUInt32BE(width, 0);
+	ihdr.writeUInt32BE(height, 4);
+	ihdr[8] = 8; // bit depth
+	ihdr[9] = channels === 4 ? 6 : 2; // color type: 6 = RGBA, 2 = RGB
+	ihdr[10] = 0; // compression
+	ihdr[11] = 0; // filter
+	ihdr[12] = 0; // interlace
+
+	return Buffer.concat([
+		PNG_SIGNATURE,
+		chunk('IHDR', ihdr),
+		chunk('IDAT', deflateSync(raw)),
+		chunk('IEND', Buffer.alloc(0)),
+	]);
+}
diff --git a/packages/pipeline/tests/indesign/helpers/build-pdf.js b/packages/pipeline/tests/indesign/helpers/build-pdf.js
new file mode 100644
index 0000000..593850a
--- /dev/null
+++ b/packages/pipeline/tests/indesign/helpers/build-pdf.js
@@ -0,0 +1,298 @@
+// In-memory PDF fixture builder. Tests call buildPdf({...}) and get back a
+// Uint8Array they can hand straight to parsePdfBuffer().
+//
+// Mirrors the philosophy of build-idml.js: no binary blobs in git, fixtures are
+// generated from a readable spec. We emit the small subset of PDF the fallback
+// parser reads — positioned text runs in base-14 (non-embedded) fonts, placed
+// FlateDecode/DeviceRGB image XObjects, and optional vector fills.
+//
+// Coordinates in the spec use a TOP-LEFT origin measured in points (y grows
+// downward), because that matches how the parser reports geometry and how a
+// designer thinks about a page. We flip to PDF's native bottom-left origin
+// while writing.
+
+import { deflateSync } from 'node:zlib';
+
+const DEFAULT_PAGE = { width: 612, height: 792 }; // US Letter, points.
+
+/**
+ * @typedef {Object} TextSpec
+ * @property {string} text
+ * @property {number} x       Left edge, points from page left.
+ * @property {number} y       Baseline, points from page top.
+ * @property {number} size    Font size in points.
+ * @property {string} [font]  Base-14 font name (default 'Helvetica').
+ * @property {[number, number, number]} [color] Fill color, 0..1 RGB (default black).
+ *
+ * @typedef {Object} ImageSpec
+ * @property {number} x        Left edge, points from page left.
+ * @property {number} y        Top edge, points from page top.
+ * @property {number} width    Display width in points.
+ * @property {number} height   Display height in points.
+ * @property {{width: number, height: number, data: Uint8Array}} rgb  Raw 8bpc RGB pixels (width*height*3 bytes).
+ *
+ * @typedef {Object} RectSpec
+ * @property {number} x
+ * @property {number} y
+ * @property {number} width
+ * @property {number} height
+ * @property {[number, number, number]} [color]
+ *
+ * @typedef {Object} PageSpec
+ * @property {number} [width]
+ * @property {number} [height]
+ * @property {TextSpec[]} [texts]
+ * @property {ImageSpec[]} [images]
+ * @property {RectSpec[]} [rects]
+ *
+ * @typedef {Object} BuildPdfOptions
+ * @property {string} [title]   Document title (/Info /Title).
+ * @property {PageSpec[]} pages
+ */
+
+/**
+ * @param {BuildPdfOptions} options
+ * @returns {Uint8Array}
+ */
+export function buildPdf(options) {
+	const pages = (options.pages ?? []).map((p) => ({
+		width: p.width ?? DEFAULT_PAGE.width,
+		height: p.height ?? DEFAULT_PAGE.height,
+		texts: p.texts ?? [],
+		images: p.images ?? [],
+		rects: p.rects ?? [],
+	}));
+
+	const writer = new ObjectWriter();
+
+	// Object 1 is the catalog, object 2 the pages tree. We reserve them up front
+	// so child page objects can reference the parent by a known id.
+	const catalogId = writer.reserve();
+	const pagesId = writer.reserve();
+
+	const pageIds = [];
+	for (const page of pages) {
+		pageIds.push(buildPageObjects(writer, page, pagesId));
+	}
+
+	writer.define(catalogId, `<< /Type /Catalog /Pages ${pagesId} 0 R >>`);
+	writer.define(
+		pagesId,
+		`<< /Type /Pages /Kids [ ${pageIds.map((id) => `${id} 0 R`).join(' ')} ] /Count ${pageIds.length} >>`,
+	);
+
+	let infoId;
+	if (options.title) {
+		infoId = writer.add(`<< /Title (${escapePdfString(options.title)}) /Producer (flavian-test) >>`);
+	}
+
+	return writer.serialize(catalogId, infoId);
+}
+
+/**
+ * Emit the content stream + page dict + its resource objects.
+ * Returns the page object id.
+ */
+function buildPageObjects(writer, page, pagesId) {
+	const fontResources = new Map(); // base-font name -> resource key (F1, F2…)
+	const xobjectResources = new Map(); // image object id -> resource key (Im1…)
+
+	const ops = [];
+
+	// Vector fills first (drawn underneath).
+	for (const rect of page.rects) {
+		const [r, g, b] = rect.color ?? [0, 0, 0];
+		const yPdf = page.height - rect.y - rect.height;
+		ops.push(`${fmt(r)} ${fmt(g)} ${fmt(b)} rg`);
+		ops.push(`${fmt(rect.x)} ${fmt(yPdf)} ${fmt(rect.width)} ${fmt(rect.height)} re f`);
+	}
+
+	// Images.
+	for (const image of page.images) {
+		const imgId = writer.add(imageXObject(image.rgb));
+		let key = xobjectResources.get(imgId);
+		if (!key) {
+			key = `Im${xobjectResources.size + 1}`;
+			xobjectResources.set(imgId, key);
+		}
+		const yPdf = page.height - image.y - image.height;
+		// cm maps the unit square to the placement rectangle.
+		ops.push('q');
+		ops.push(`${fmt(image.width)} 0 0 ${fmt(image.height)} ${fmt(image.x)} ${fmt(yPdf)} cm`);
+		ops.push(`/${key} Do`);
+		ops.push('Q');
+	}
+
+	// Text runs.
+	for (const t of page.texts) {
+		const fontName = t.font ?? 'Helvetica';
+		let fontKey = fontResources.get(fontName);
+		if (!fontKey) {
+			fontKey = `F${fontResources.size + 1}`;
+			fontResources.set(fontName, fontKey);
+		}
+		const [r, g, b] = t.color ?? [0, 0, 0];
+		const yPdf = page.height - t.y;
+		ops.push('BT');
+		ops.push(`${fmt(r)} ${fmt(g)} ${fmt(b)} rg`);
+		ops.push(`/${fontKey} ${fmt(t.size)} Tf`);
+		ops.push(`${fmt(t.x)} ${fmt(yPdf)} Td`);
+		ops.push(`(${escapePdfString(t.text)}) Tj`);
+		ops.push('ET');
+	}
+
+	const contentStream = ops.join('\n') + '\n';
+	const contentId = writer.add(streamObject('<< /Length LEN >>', Buffer.from(contentStream, 'latin1')));
+
+	// Font objects.
+	const fontEntries = [];
+	for (const [baseFont, key] of fontResources) {
+		const fontId = writer.add(
+			`<< /Type /Font /Subtype /Type1 /BaseFont /${baseFont} /Encoding /WinAnsiEncoding >>`,
+		);
+		fontEntries.push(`/${key} ${fontId} 0 R`);
+	}
+
+	const xobjectEntries = [];
+	for (const [imgId, key] of xobjectResources) {
+		xobjectEntries.push(`/${key} ${imgId} 0 R`);
+	}
+
+	const resourceParts = ['/ProcSet [ /PDF /Text /ImageC ]'];
+	if (fontEntries.length > 0) {
+		resourceParts.push(`/Font << ${fontEntries.join(' ')} >>`);
+	}
+	if (xobjectEntries.length > 0) {
+		resourceParts.push(`/XObject << ${xobjectEntries.join(' ')} >>`);
+	}
+
+	return writer.add(
+		`<< /Type /Page /Parent ${pagesId} 0 R /MediaBox [ 0 0 ${fmt(page.width)} ${fmt(page.height)} ] ` +
+			`/Resources << ${resourceParts.join(' ')} >> /Contents ${contentId} 0 R >>`,
+	);
+}
+
+/**
+ * A FlateDecode/DeviceRGB 8-bit image XObject. pdfjs decodes this to raw RGB
+ * without needing a canvas, which is exactly what the parser's extractor reads.
+ */
+function imageXObject(rgb) {
+	const raw = Buffer.from(rgb.data.buffer ?? rgb.data, rgb.data.byteOffset ?? 0, rgb.data.byteLength ?? rgb.data.length);
+	const compressed = deflateSync(raw);
+	const dict =
+		`<< /Type /XObject /Subtype /Image /Width ${rgb.width} /Height ${rgb.height} ` +
+		`/ColorSpace /DeviceRGB /BitsPerComponent 8 /Filter /FlateDecode /Length LEN >>`;
+	return streamObject(dict, compressed);
+}
+
+/** Marker so the writer knows this object carries a binary stream payload. */
+function streamObject(dict, payload) {
+	return { dict, payload };
+}
+
+class ObjectWriter {
+	constructor() {
+		/** @type {Array<string | {dict: string, payload: Buffer} | null>} */
+		this.objects = [];
+	}
+
+	/** Reserve an id, to be filled in later with define(). */
+	reserve() {
+		this.objects.push(null);
+		return this.objects.length;
+	}
+
+	define(id, body) {
+		this.objects[id - 1] = body;
+	}
+
+	/** Append a fully-formed object and return its id. */
+	add(body) {
+		this.objects.push(body);
+		return this.objects.length;
+	}
+
+	/**
+	 * Assemble the file with a correct classic xref table.
+	 * @param {number} rootId
+	 * @param {number} [infoId]
+	 * @returns {Uint8Array}
+	 */
+	serialize(rootId, infoId) {
+		const chunks = [];
+		let offset = 0;
+		const offsets = new Array(this.objects.length + 1).fill(0);
+
+		const push = (buf) => {
+			chunks.push(buf);
+			offset += buf.length;
+		};
+
+		// Binary marker comment keeps tools (and pdfjs heuristics) treating the
+		// file as binary.
+		push(Buffer.from('%PDF-1.7\n%\xE2\xE3\xCF\xD3\n', 'latin1'));
+
+		for (let i = 0; i < this.objects.length; i += 1) {
+			const id = i + 1;
+			const body = this.objects[i];
+			if (body === null) {
+				throw new Error(`PDF object ${id} was reserved but never defined`);
+			}
+			offsets[id] = offset;
+			push(Buffer.from(`${id} 0 obj\n`, 'latin1'));
+			if (typeof body === 'string') {
+				push(Buffer.from(body + '\n', 'latin1'));
+			} else {
+				const dict = body.dict.replace('LEN', String(body.payload.length));
+				push(Buffer.from(dict + '\nstream\n', 'latin1'));
+				push(body.payload);
+				push(Buffer.from('\nendstream\n', 'latin1'));
+			}
+			push(Buffer.from('endobj\n', 'latin1'));
+		}
+
+		const xrefOffset = offset;
+		const count = this.objects.length + 1;
+		let xref = `xref\n0 ${count}\n0000000000 65535 f \n`;
+		for (let id = 1; id < count; id += 1) {
+			xref += `${String(offsets[id]).padStart(10, '0')} 00000 n \n`;
+		}
+		push(Buffer.from(xref, 'latin1'));
+
+		const trailerParts = [`/Size ${count}`, `/Root ${rootId} 0 R`];
+		if (infoId) {
+			trailerParts.push(`/Info ${infoId} 0 R`);
+		}
+		push(Buffer.from(`trailer\n<< ${trailerParts.join(' ')} >>\nstartxref\n${xrefOffset}\n%%EOF\n`, 'latin1'));
+
+		return new Uint8Array(Buffer.concat(chunks));
+	}
+}
+
+/** Format a number for PDF content streams: round to 3 decimals; non-finite values become 0. */
+function fmt(n) {
+	if (!Number.isFinite(n)) return '0';
+	return (Math.round(n * 1000) / 1000).toString();
+}
+
+function escapePdfString(s) {
+	return String(s).replace(/\\/g, '\\\\').replace(/\(/g, '\\(').replace(/\)/g, '\\)');
+}
+
+/**
+ * Convenience: a solid-color RGB image buffer for image fixtures.
+ *
+ * @param {number} width
+ * @param {number} height
+ * @param {[number, number, number]} rgb 0..255 per channel.
+ * @returns {{width: number, height: number, data: Uint8Array}}
+ */
+export function solidRgbImage(width, height, [r, g, b]) {
+	const data = new Uint8Array(width * height * 3);
+	for (let i = 0; i < width * height; i += 1) {
+		data[i * 3] = r;
+		data[i * 3 + 1] = g;
+		data[i * 3 + 2] = b;
+	}
+	return { width, height, data };
+}
diff --git a/packages/pipeline/tests/indesign/parse-pdf.test.mjs b/packages/pipeline/tests/indesign/parse-pdf.test.mjs
new file mode 100644
index 0000000..cb68b68
--- /dev/null
+++ b/packages/pipeline/tests/indesign/parse-pdf.test.mjs
@@ -0,0 +1,215 @@
+// Integration tests: drive the whole PDF parser against programmatically-built
+// fixtures (text-heavy, image-heavy, multi-column, single-page brochure) and
+// assert the reconstructed IR validates and matches expectations.
+
+import { test } from 'node:test';
+import assert from 'node:assert/strict';
+import { promises as fs } from 'node:fs';
+import os from 'node:os';
+import path from 'node:path';
+import { parsePdf, parsePdfBuffer } from '../../src/indesign/parse-pdf.js';
+import { Document } from '../../src/indesign/ir.js';
+import { buildPdf, solidRgbImage } from './helpers/build-pdf.js';
+
+const warningCodes = (ir) => ir.warnings.map((w) => w.code);
+const textFrames = (ir) => ir.spreads.flatMap((s) => s.frames.filter((f) => f.kind === 'text'));
+const imageFrames = (ir) => ir.spreads.flatMap((s) => s.frames.filter((f) => f.kind === 'image'));
+
+function bodyLines(count, { x = 72, startY = 120, size = 11, leading = 15, font = 'Helvetica' } = {}) {
+	return Array.from({ length: count }, (_, i) => ({
+		text: `Body copy line number ${i + 1} with enough words to be realistic.`,
+		x,
+		y: startY + i * leading,
+		size,
+		font,
+	}));
+}
+
+test('text-heavy PDF: validates, one body style, single clustered frame', async () => {
+	const ir = await parsePdfBuffer(buildPdf({ pages: [{ texts: bodyLines(8) }] }));
+	const validated = Document.parse(ir);
+	assert.equal(validated.irVersion, 1);
+	assert.equal(validated.spreads.length, 1);
+
+	const bodyStyles = ir.styles.filter((s) => s.properties.role === 'body');
+	assert.equal(bodyStyles.length, 1);
+	// 8 evenly-spaced lines in one column collapse to a single text frame.
+	assert.equal(textFrames(ir).length, 1);
+	const story = ir.stories.find((s) => s.id === textFrames(ir)[0].storyRef);
+	assert.ok(story.runs.every((r) => r.paragraphStyleRef === 'pdf-style-body'));
+});
+
+test('image-heavy PDF: every image becomes an addressable image frame', async () => {
+	const ir = await parsePdfBuffer(
+		buildPdf({
+			pages: [
+				{
+					texts: [{ text: 'Gallery', x: 72, y: 90, size: 18, font: 'Helvetica-Bold' }],
+					images: [
+						{ x: 72, y: 120, width: 150, height: 120, rgb: solidRgbImage(6, 5, [200, 30, 30]) },
+						{ x: 240, y: 120, width: 150, height: 120, rgb: solidRgbImage(6, 5, [30, 200, 60]) },
+						{ x: 72, y: 280, width: 150, height: 120, rgb: solidRgbImage(6, 5, [30, 60, 200]) },
+					],
+				},
+			],
+		}),
+	);
+	Document.parse(ir);
+	const imgs = imageFrames(ir);
+	assert.equal(imgs.length, 3);
+	assert.ok(imgs.every((f) => f.embedded === true));
+	assert.deepEqual(
+		imgs.map((f) => f.href),
+		['assets/pdf-p001-img001.png', 'assets/pdf-p001-img002.png', 'assets/pdf-p001-img003.png'],
+	);
+	assert.ok(imgs.every((f) => f.bounds.width > 0 && f.bounds.height > 0));
+});
+
+test('multi-column PDF: columns become separate frames + a warning', async () => {
+	const left = bodyLines(4, { x: 72, startY: 120 }).map((t) => ({ ...t, text: 'Left ' + t.text.slice(0, 20) }));
+	const right = bodyLines(4, { x: 340, startY: 120 }).map((t) => ({ ...t, text: 'Right ' + t.text.slice(0, 20) }));
+	const ir = await parsePdfBuffer(buildPdf({ pages: [{ texts: [...left, ...right] }] }));
+	Document.parse(ir);
+
+	const frames = textFrames(ir).sort((a, b) => a.bounds.x - b.bounds.x);
+	assert.ok(frames.length >= 2, `expected >=2 text frames, got ${frames.length}`);
+	// The two columns don't horizontally overlap.
+	assert.ok(frames[0].bounds.x + frames[0].bounds.width <= frames[1].bounds.x);
+	assert.ok(warningCodes(ir).includes('multi-column-layout'));
+});
+
+test('single-page brochure: heading + body + caption + image all reconstructed', async () => {
+	const ir = await parsePdfBuffer(
+		buildPdf({
+			title: 'Brochure',
+			pages: [
+				{
+					texts: [
+						{ text: 'Welcome', x: 72, y: 90, size: 36, font: 'Helvetica-Bold', color: [0, 0.4, 0.8] },
+						...bodyLines(3, { startY: 150, size: 12 }),
+						{ text: 'Figure 1: the hero image.', x: 72, y: 470, size: 8, color: [0.4, 0.4, 0.4] },
+					],
+					images: [{ x: 72, y: 220, width: 240, height: 160, rgb: solidRgbImage(8, 6, [120, 120, 120]) }],
+				},
+			],
+		}),
+	);
+	Document.parse(ir);
+	assert.equal(ir.meta.name, 'Brochure');
+
+	const roles = new Set(ir.styles.map((s) => s.properties.role));
+	assert.ok(roles.has('heading') && roles.has('body') && roles.has('caption'));
+	assert.equal(imageFrames(ir).length, 1);
+	assert.ok(textFrames(ir).length >= 3); // heading, body, caption separated
+
+	// The heading style resolves both a font and a swatch.
+	const h1 = ir.styles.find((s) => s.id === 'pdf-style-h1');
+	assert.ok(h1.fontRef && ir.fonts.some((f) => f.id === h1.fontRef));
+	assert.ok(h1.fillColorRef && ir.swatches.some((s) => s.id === h1.fillColorRef));
+});
+
+test('fidelity warnings are always present and describe approximations', async () => {
+	const ir = await parsePdfBuffer(
+		buildPdf({
+			pages: [
+				{
+					texts: [{ text: 'Hello world', x: 72, y: 90, size: 12, color: [0.1, 0.1, 0.1] }],
+					rects: [{ x: 0, y: 0, width: 612, height: 60, color: [0.9, 0.9, 0.9] }],
+				},
+			],
+		}),
+	);
+	const codes = warningCodes(ir);
+	assert.ok(codes.includes('pdf-fallback'));
+	assert.ok(codes.includes('text-reconstructed-from-glyphs'));
+	assert.ok(codes.includes('styles-synthesized'));
+	assert.ok(codes.includes('no-embedded-fonts')); // base-14, not embedded
+	assert.ok(codes.includes('color-attribution-approximate'));
+	assert.ok(codes.includes('vector-paths-dropped')); // the rect fill
+});
+
+test('assetCacheDir: extracted images are written as readable PNGs', async () => {
+	const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'flavian-pdf-'));
+	try {
+		const ir = await parsePdfBuffer(
+			buildPdf({
+				pages: [
+					{
+						texts: [{ text: 'Pic', x: 72, y: 90, size: 12 }],
+						images: [{ x: 72, y: 120, width: 100, height: 80, rgb: solidRgbImage(4, 3, [10, 20, 30]) }],
+					},
+				],
+			}),
+			{ assetCacheDir: dir },
+		);
+		const href = imageFrames(ir)[0].href;
+		const buf = await fs.readFile(path.join(dir, href));
+		assert.ok(buf.subarray(0, 4).equals(Buffer.from([0x89, 0x50, 0x4e, 0x47])));
+		assert.equal(buf.readUInt32BE(16), 4); // IHDR width
+		assert.equal(buf.readUInt32BE(20), 3); // IHDR height
+	} finally {
+		await fs.rm(dir, { recursive: true, force: true });
+	}
+});
+
+test('swatchPalette: detected colors snap to IDML swatch ids', async () => {
+	const palette = [{ id: 'col-brand', name: 'Brand Blue', color: { hex: '#0066cc', space: 'RGB' } }];
+	const ir = await parsePdfBuffer(
+		// 0,0.4,0.8 → #0066cc exactly; use a slightly-off shade to prove snapping.
+		buildPdf({ pages: [{ texts: [{ text: 'Brand', x: 72, y: 90, size: 24, color: [0.01, 0.4, 0.79] }] }] }),
+		{ swatchPalette: palette },
+	);
+	assert.ok(ir.swatches.some((s) => s.id === 'col-brand'));
+	const h1 = ir.styles.find((s) => s.properties.role === 'heading') ?? ir.styles[0];
+	assert.equal(h1.fillColorRef, 'col-brand');
+});
+
+test('dpi scales geometry linearly', async () => {
+	const make = (dpi) =>
+		parsePdfBuffer(buildPdf({ pages: [{ texts: [{ text: 'Scale me', x: 72, y: 100, size: 12 }] }] }), { dpi });
+	const lo = await make(72);
+	const hi = await make(144);
+	const loFrame = lo.spreads[0].frames.find((f) => f.kind === 'text').bounds;
+	const hiFrame = hi.spreads[0].frames.find((f) => f.kind === 'text').bounds;
+	assert.ok(Math.abs(hiFrame.width - loFrame.width * 2) < 0.01);
+	assert.ok(Math.abs(hiFrame.x - loFrame.x * 2) < 0.01);
+});
+
+test('multi-page PDF yields one spread per page', async () => {
+	const page = { texts: [{ text: 'Page text', x: 72, y: 90, size: 12 }] };
+	const ir = await parsePdfBuffer(buildPdf({ pages: [page, page, page] }));
+	assert.equal(ir.spreads.length, 3);
+	assert.deepEqual(ir.spreads.map((s) => s.source), ['pdf:page:1', 'pdf:page:2', 'pdf:page:3']);
+});
+
+test('throws on bytes that are not a PDF', async () => {
+	const garbage = new TextEncoder().encode('this is definitely not a pdf');
+	await assert.rejects(() => parsePdfBuffer(garbage), /could not be opened/i);
+});
+
+test('parsePdf reads from disk (Buffer input) and prefers the embedded /Title', async () => {
+	const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'flavian-pdf-disk-'));
+	try {
+		const file = path.join(dir, 'report-2026.pdf');
+		await fs.writeFile(file, buildPdf({ title: 'Quarterly Report', pages: [{ texts: bodyLines(3) }] }));
+		const ir = await parsePdf(file);
+		Document.parse(ir);
+		// Embedded /Title beats the filename fallback.
+		assert.equal(ir.meta.name, 'Quarterly Report');
+		assert.equal(ir.spreads.length, 1);
+	} finally {
+		await fs.rm(dir, { recursive: true, force: true });
+	}
+});
+
+test('parsePdf falls back to the filename when there is no /Title', async () => {
+	const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'flavian-pdf-disk-'));
+	try {
+		const file = path.join(dir, 'untitled-doc.pdf');
+		await fs.writeFile(file, buildPdf({ pages: [{ texts: bodyLines(2) }] }));
+		const ir = await parsePdf(file);
+		assert.equal(ir.meta.name, 'untitled-doc');
+	} finally {
+		await fs.rm(dir, { recursive: true, force: true });
+	}
+});
diff --git a/packages/pipeline/tests/indesign/pdf-classify.test.mjs b/packages/pipeline/tests/indesign/pdf-classify.test.mjs
new file mode 100644
index 0000000..b2f8b17
--- /dev/null
+++ b/packages/pipeline/tests/indesign/pdf-classify.test.mjs
@@ -0,0 +1,78 @@
+// Style synthesis from font-size buckets (pure, no pdfjs).
+
+import { test } from 'node:test';
+import assert from 'node:assert/strict';
+import { classifyStyles } from '../../src/indesign/pdf/classify.js';
+
+function makeInput() {
+	return {
+		dpi: 96,
+		items: [
+			{ fontSize: 36, fontKey: 'pdf-font-helvetica-bold', text: 'Title' },
+			// Body dominates by character count.
+			{ fontSize: 12, fontKey: 'pdf-font-helvetica', text: 'The quick brown fox jumps over the lazy dog.' },
+			{ fontSize: 12, fontKey: 'pdf-font-helvetica', text: 'Another full line of ordinary body copy here.' },
+			{ fontSize: 8, fontKey: 'pdf-font-helvetica', text: 'fig' },
+		],
+		colorSamples: [
+			{ fontSizePt: 36, hex: '#0066cc', glyphs: 5 },
+			{ fontSizePt: 12, hex: '#111111', glyphs: 80 },
+			{ fontSizePt: 8, hex: '#888888', glyphs: 3 },
+		],
+	};
+}
+
+test('largest size becomes Heading 1, most-used becomes Body, smallest becomes Caption', () => {
+	const { buckets } = classifyStyles(makeInput());
+	const byRole = Object.fromEntries(buckets.map((b) => [b.role, b]));
+	assert.equal(byRole.heading.id, 'pdf-style-h1');
+	assert.equal(byRole.heading.sizePt, 36);
+	assert.equal(byRole.heading.fontSizePx, 48); // 36pt @ 96dpi
+	assert.equal(byRole.body.id, 'pdf-style-body');
+	assert.equal(byRole.body.sizePt, 12);
+	assert.equal(byRole.caption.id, 'pdf-style-caption');
+	assert.equal(byRole.caption.sizePt, 8);
+});
+
+test('buckets carry the dominant font and color for each size', () => {
+	const { buckets } = classifyStyles(makeInput());
+	const body = buckets.find((b) => b.role === 'body');
+	assert.equal(body.dominantFontKey, 'pdf-font-helvetica');
+	assert.equal(body.dominantHex, '#111111');
+	const heading = buckets.find((b) => b.role === 'heading');
+	assert.equal(heading.dominantFontKey, 'pdf-font-helvetica-bold');
+	assert.equal(heading.dominantHex, '#0066cc');
+});
+
+test('styleIdForSize maps a size back to its bucket id', () => {
+	const { styleIdForSize } = classifyStyles(makeInput());
+	assert.equal(styleIdForSize(36), 'pdf-style-h1');
+	assert.equal(styleIdForSize(12), 'pdf-style-body');
+	assert.equal(styleIdForSize(8), 'pdf-style-caption');
+	assert.equal(styleIdForSize(99), undefined);
+});
+
+test('multiple heading sizes get descending levels', () => {
+	const { buckets } = classifyStyles({
+		dpi: 96,
+		items: [
+			{ fontSize: 48, fontKey: 'f', text: 'Big' },
+			{ fontSize: 24, fontKey: 'f', text: 'Med' },
+			{ fontSize: 10, fontKey: 'f', text: 'lots of body copy lots of body copy' },
+		],
+	});
+	const headings = buckets.filter((b) => b.role === 'heading').sort((a, b) => b.sizePt - a.sizePt);
+	assert.equal(headings[0].id, 'pdf-style-h1');
+	assert.equal(headings[0].sizePt, 48);
+	assert.equal(headings[1].id, 'pdf-style-h2');
+	assert.equal(headings[1].sizePt, 24);
+});
+
+test('a single font size yields only a Body bucket', () => {
+	const { buckets } = classifyStyles({
+		dpi: 96,
+		items: [{ fontSize: 11, fontKey: 'f', text: 'uniform text everywhere' }],
+	});
+	assert.equal(buckets.length, 1);
+	assert.equal(buckets[0].role, 'body');
+});
diff --git a/packages/pipeline/tests/indesign/pdf-cluster.test.mjs b/packages/pipeline/tests/indesign/pdf-cluster.test.mjs
new file mode 100644
index 0000000..fce90ca
--- /dev/null
+++ b/packages/pipeline/tests/indesign/pdf-cluster.test.mjs
@@ -0,0 +1,60 @@
+// Positional clustering heuristics (pure, no pdfjs).
+
+import { test } from 'node:test';
+import assert from 'node:assert/strict';
+import { groupLines, clusterIntoFrames, detectColumns } from '../../src/indesign/pdf/cluster.js';
+
+function item(text, x, baseline, { width = 50, fontSize = 12, fontKey = 'f1' } = {}) {
+	return { text, x, baseline, width, fontSize, fontKey };
+}
+
+test('groupLines merges runs sharing a baseline', () => {
+	const lines = groupLines([
+		item('Hello', 72, 100),
+		item('World', 130, 100.2), // within baseline tolerance
+		item('Next', 72, 130), // new line
+	]);
+	assert.equal(lines.length, 2);
+	assert.equal(lines[0].items.length, 2);
+	assert.equal(lines[0].items[0].text, 'Hello'); // sorted by x
+	assert.equal(lines[1].items.length, 1);
+});
+
+test('clusterIntoFrames keeps adjacent body lines in one frame', () => {
+	const frames = clusterIntoFrames([
+		item('Line one of the paragraph', 72, 100),
+		item('Line two of the paragraph', 72, 116),
+		item('Line three of the paragraph', 72, 132),
+	]);
+	assert.equal(frames.length, 1);
+	assert.equal(frames[0].lines.length, 3);
+});
+
+test('clusterIntoFrames splits a far-apart block into a second frame', () => {
+	const frames = clusterIntoFrames([
+		item('Top block', 72, 100),
+		item('Bottom block far below', 72, 500),
+	]);
+	assert.equal(frames.length, 2);
+});
+
+test('clusterIntoFrames separates side-by-side columns', () => {
+	const left = [item('L1', 72, 100), item('L2', 72, 116), item('L3', 72, 132)];
+	const right = [item('R1', 340, 100), item('R2', 340, 116), item('R3', 340, 132)];
+	const frames = clusterIntoFrames([...left, ...right]);
+	assert.equal(frames.length, 2);
+	assert.equal(detectColumns(frames), 2);
+	// Frames don't horizontally overlap.
+	const [a, b] = frames.sort((x, y) => x.bounds.minX - y.bounds.minX);
+	assert.ok(a.bounds.maxX <= b.bounds.minX);
+});
+
+test('detectColumns is 1 for a single column', () => {
+	const frames = clusterIntoFrames([item('A', 72, 100), item('B', 72, 120)]);
+	assert.equal(detectColumns(frames), 1);
+});
+
+test('clusterIntoFrames ignores whitespace-only runs', () => {
+	const frames = clusterIntoFrames([item('   ', 72, 100), item('', 80, 100)]);
+	assert.equal(frames.length, 0);
+});
diff --git a/packages/pipeline/tests/indesign/pdf-color.test.mjs b/packages/pipeline/tests/indesign/pdf-color.test.mjs
new file mode 100644
index 0000000..a17b861
--- /dev/null
+++ b/packages/pipeline/tests/indesign/pdf-color.test.mjs
@@ -0,0 +1,47 @@
+// Color normalization + nearest-swatch matching (pure, no pdfjs).
+
+import { test } from 'node:test';
+import assert from 'node:assert/strict';
+import { rgbToHex, grayToHex, cmykToHex, hexToRgb, colorDistance, nearestSwatch } from '../../src/indesign/pdf/color.js';
+
+test('rgbToHex clamps and lowercases', () => {
+	assert.equal(rgbToHex([0, 102, 204]), '#0066cc');
+	assert.equal(rgbToHex([300, -5, 16]), '#ff0010');
+});
+
+test('grayToHex mirrors the channel', () => {
+	assert.equal(grayToHex(0), '#000000');
+	assert.equal(grayToHex(255), '#ffffff');
+	assert.equal(grayToHex(128), '#808080');
+});
+
+test('cmykToHex matches the IDML naive conversion (0/0/0/1 = black)', () => {
+	assert.equal(cmykToHex([0, 0, 0, 1]), '#000000');
+	assert.equal(cmykToHex([0, 0, 0, 0]), '#ffffff');
+});
+
+test('hexToRgb round-trips', () => {
+	assert.deepEqual(hexToRgb('#0066cc'), [0, 102, 204]);
+	assert.deepEqual(hexToRgb('ffffff'), [255, 255, 255]);
+});
+
+test('colorDistance is zero for identical colors', () => {
+	assert.equal(colorDistance('#123456', '#123456'), 0);
+	assert.ok(colorDistance('#000000', '#ffffff') > 0);
+});
+
+test('nearestSwatch snaps to the closest palette entry within tolerance', () => {
+	const palette = [
+		{ id: 'col-brand', name: 'Brand Blue', color: { hex: '#0066cc', space: 'RGB' } },
+		{ id: 'col-ink', name: 'Ink', color: { hex: '#000000', space: 'CMYK' } },
+	];
+	// Slightly-off brand blue should snap to the brand swatch.
+	assert.equal(nearestSwatch('#0265cb', palette)?.id, 'col-brand');
+	// Near-black snaps to ink.
+	assert.equal(nearestSwatch('#050505', palette)?.id, 'col-ink');
+});
+
+test('nearestSwatch returns null when nothing is close enough', () => {
+	const palette = [{ id: 'col-ink', name: 'Ink', color: { hex: '#000000', space: 'CMYK' } }];
+	assert.equal(nearestSwatch('#00ff00', palette), null);
+});
diff --git a/packages/pipeline/tests/indesign/pdf-png.test.mjs b/packages/pipeline/tests/indesign/pdf-png.test.mjs
new file mode 100644
index 0000000..4ca1fb1
--- /dev/null
+++ b/packages/pipeline/tests/indesign/pdf-png.test.mjs
@@ -0,0 +1,58 @@
+// PNG encoder (pure). We decode the output back with node:zlib to prove the
+// bytes are a real, readable PNG without pulling in an image library.
+
+import { test } from 'node:test';
+import assert from 'node:assert/strict';
+import { inflateSync } from 'node:zlib';
+import { encodePng, ImageKind } from '../../src/indesign/pdf/png.js';
+
+const SIG = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
+
+function readChunks(buf) {
+	const chunks = {};
+	let off = 8;
+	while (off < buf.length) {
+		const len = buf.readUInt32BE(off);
+		const type = buf.toString('latin1', off + 4, off + 8);
+		chunks[type] = buf.subarray(off + 8, off + 8 + len);
+		off += 12 + len;
+	}
+	return chunks;
+}
+
+test('encodes a 2x1 RGB image with valid signature and IHDR', () => {
+	const data = new Uint8Array([255, 0, 0, 0, 255, 0]); // red, green
+	const png = encodePng({ width: 2, height: 1, kind: ImageKind.RGB_24BPP, data });
+	assert.ok(png.subarray(0, 8).equals(SIG));
+	const { IHDR, IDAT, IEND } = readChunks(png);
+	assert.ok(IHDR && IDAT && IEND);
+	assert.equal(IHDR.readUInt32BE(0), 2); // width
+	assert.equal(IHDR.readUInt32BE(4), 1); // height
+	assert.equal(IHDR[8], 8); // bit depth
+	assert.equal(IHDR[9], 2); // color type RGB
+});
+
+test('IDAT decompresses to filtered scanlines that preserve pixels', () => {
+	const data = new Uint8Array([10, 20, 30, 40, 50, 60]);
+	const png = encodePng({ width: 2, height: 1, kind: ImageKind.RGB_24BPP, data });
+	const { IDAT } = readChunks(png);
+	const raw = inflateSync(IDAT);
+	// One scanline: filter byte (0) + 6 pixel bytes.
+	assert.equal(raw.length, 7);
+	assert.equal(raw[0], 0);
+	assert.deepEqual([...raw.subarray(1)], [10, 20, 30, 40, 50, 60]);
+});
+
+test('RGBA input produces a color-type-6 PNG', () => {
+	const data = new Uint8Array([1, 2, 3, 255]);
+	const png = encodePng({ width: 1, height: 1, kind: ImageKind.RGBA_32BPP, data });
+	assert.equal(readChunks(png).IHDR[9], 6); // RGBA
+});
+
+test('grayscale 1bpp expands to RGB', () => {
+	// One row, 2px, MSB-first: bits 1,0 -> white, black. Padded to a byte.
+	const data = new Uint8Array([0b10000000]);
+	const png = encodePng({ width: 2, height: 1, kind: ImageKind.GRAYSCALE_1BPP, data });
+	const raw = inflateSync(readChunks(png).IDAT);
+	assert.deepEqual([...raw], [0, 255, 255, 255, 0, 0, 0]); // filter + white + black
+});
diff --git a/packages/pipeline/tests/indesign/pdf-roundtrip.test.mjs b/packages/pipeline/tests/indesign/pdf-roundtrip.test.mjs
new file mode 100644
index 0000000..5169b80
--- /dev/null
+++ b/packages/pipeline/tests/indesign/pdf-roundtrip.test.mjs
@@ -0,0 +1,127 @@
+// Round-trip agreement: build the *same logical document* two ways — as IDML
+// and as an InDesign-style PDF export — parse both, and assert the IRs agree
+// within documented tolerances. This is the cross-check the issue calls for:
+// PDF is a lossy fallback, so we assert structural agreement, not equality.
+//
+// Documented tolerances (see docs/pipeline/indesign-pdf-fidelity.md):
+//   - page / spread count ........ exact
+//   - image frame count .......... exact
+//   - text frame count ........... within ±1
+//   - style bucket count ......... within ±1
+//   - swatches ................... PDF colors snap to the IDML swatch palette
+
+import { test } from 'node:test';
+import assert from 'node:assert/strict';
+import { parseIdmlBuffer } from '../../src/indesign/parse-idml.js';
+import { parsePdfBuffer } from '../../src/indesign/parse-pdf.js';
+import { buildIdml } from './helpers/build-idml.js';
+import { buildPdf } from './helpers/build-pdf.js';
+
+const BRAND = [0, 102, 204]; // #0066cc
+const INK = [0, 0, 0]; // #000000
+
+function buildIdmlVersion() {
+	return buildIdml({
+		name: 'Round Trip',
+		colors: [
+			{ id: 'col-brand', name: 'Brand Blue', space: 'RGB', values: BRAND },
+			{ id: 'col-ink', name: 'Ink', space: 'CMYK', values: [0, 0, 0, 100] },
+		],
+		fonts: [
+			{ id: 'font-helv-bold', family: 'Helvetica', style: 'Bold', postScriptName: 'Helvetica-Bold' },
+			{ id: 'font-helv-reg', family: 'Helvetica', style: 'Regular', postScriptName: 'Helvetica' },
+		],
+		styles: [
+			{ id: 'pstyle-h1', name: 'Heading 1', kind: 'paragraph', pointSize: 36, appliedFont: 'font-helv-bold', fillColor: 'col-brand' },
+			{ id: 'pstyle-body', name: 'Body', kind: 'paragraph', pointSize: 12, appliedFont: 'font-helv-reg', fillColor: 'col-ink' },
+		],
+		stories: [
+			{ id: 'story-headline', runs: [{ text: 'Welcome', paragraphStyle: 'pstyle-h1' }] },
+			{
+				id: 'story-body',
+				runs: [{ text: 'Print to web in one pass with a usable styled result.', paragraphStyle: 'pstyle-body' }],
+			},
+		],
+		spreads: [
+			{
+				id: 'spread-1',
+				pages: [{ id: 'page-1', bounds: [0, 0, 792, 612] }],
+				frames: [
+					{ kind: 'text', id: 'frame-headline', bounds: [72, 72, 130, 400], parentStory: 'story-headline' },
+					{ kind: 'text', id: 'frame-body', bounds: [140, 72, 220, 400], parentStory: 'story-body' },
+					{ kind: 'image', id: 'frame-hero', bounds: [250, 72, 430, 400], href: 'file:Resources/hero.jpg' },
+				],
+			},
+		],
+	});
+}
+
+function buildPdfVersion() {
+	const toUnit = ([r, g, b]) => [r / 255, g / 255, b / 255];
+	return buildPdf({
+		title: 'Round Trip',
+		pages: [
+			{
+				width: 612,
+				height: 792,
+				texts: [
+					{ text: 'Welcome', x: 72, y: 96, size: 36, font: 'Helvetica-Bold', color: toUnit(BRAND) },
+					{ text: 'Print to web in one pass with a usable', x: 72, y: 150, size: 12, font: 'Helvetica', color: toUnit(INK) },
+					{ text: 'styled result that needs only light touch-ups.', x: 72, y: 166, size: 12, font: 'Helvetica', color: toUnit(INK) },
+				],
+				images: [{ x: 72, y: 220, width: 240, height: 160, rgb: { width: 6, height: 4, data: new Uint8Array(6 * 4 * 3).fill(128) } }],
+			},
+		],
+	});
+}
+
+test('round-trip: page, frame, and style-bucket counts agree within tolerance', async () => {
+	const idml = parseIdmlBuffer(buildIdmlVersion());
+	const pdf = await parsePdfBuffer(buildPdfVersion(), { swatchPalette: idml.swatches });
+
+	// Page / spread count: exact.
+	assert.equal(pdf.spreads.length, idml.spreads.length);
+
+	const countFrames = (ir, kind) =>
+		ir.spreads.flatMap((s) => s.frames).filter((f) => f.kind === kind).length;
+
+	// Image frames: exact.
+	assert.equal(countFrames(pdf, 'image'), countFrames(idml, 'image'));
+
+	// Text frames: within ±1.
+	const idmlText = countFrames(idml, 'text');
+	const pdfText = countFrames(pdf, 'text');
+	assert.ok(Math.abs(idmlText - pdfText) <= 1, `text frames: idml=${idmlText} pdf=${pdfText}`);
+
+	// Style buckets: within ±1 (IDML has h1 + body; PDF synthesizes the same two).
+	assert.ok(Math.abs(idml.styles.length - pdf.styles.length) <= 1, `styles: idml=${idml.styles.length} pdf=${pdf.styles.length}`);
+});
+
+test('round-trip: detected PDF colors snap onto the IDML swatch palette', async () => {
+	const idml = parseIdmlBuffer(buildIdmlVersion());
+	const pdf = await parsePdfBuffer(buildPdfVersion(), { swatchPalette: idml.swatches });
+
+	const pdfSwatchIds = new Set(pdf.swatches.map((s) => s.id));
+	assert.ok(pdfSwatchIds.has('col-brand'), 'brand blue should snap to col-brand');
+	assert.ok(pdfSwatchIds.has('col-ink'), 'black ink should snap to col-ink');
+
+	// And the synthesized heading style references the shared swatch id.
+	const h1 = pdf.styles.find((s) => s.properties.role === 'heading');
+	assert.equal(h1.fillColorRef, 'col-brand');
+});
+
+test('round-trip: the same prose is recoverable from both IRs', async () => {
+	const idml = parseIdmlBuffer(buildIdmlVersion());
+	const pdf = await parsePdfBuffer(buildPdfVersion());
+
+	const prose = (ir) =>
+		ir.stories
+			.flatMap((s) => s.runs.map((r) => r.text))
+			.join(' ')
+			.replace(/\s+/g, ' ')
+			.toLowerCase();
+
+	assert.ok(prose(idml).includes('welcome'));
+	assert.ok(prose(pdf).includes('welcome'));
+	assert.ok(prose(pdf).includes('print to web in one pass'));
+});
diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml
index 5c6ec36..eafe875 100644
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
@@ -38,6 +38,9 @@ importers:
       fflate:
         specifier: ^0.8.2
         version: 0.8.3
+      pdfjs-dist:
+        specifier: ^4.10.38
+        version: 4.10.38
       zod:
         specifier: ^3.23.8
         version: 3.23.8
@@ -151,6 +154,76 @@ packages:
   '@lhci/utils@0.14.0':
     resolution: {integrity: sha512-LyP1RbvYQ9xNl7uLnl5AO8fDRata9MG/KYfVFKFkYenlsVS6QJsNjLzWNEoMIaE4jOPdQQlSp4tO7dtnyDxzbQ==}
 
+  '@napi-rs/canvas-android-arm64@0.1.100':
+    resolution: {integrity: sha512-hjhCKhntPv9+t4ckHymdx0phYNcVW+GKQR6Lzw2zE+pOVjOplSmtx9nNNknTjbEDLcuLZqA1y8ufKg1XfgftzQ==}
+    engines: {node: '>= 10'}
+    cpu: [arm64]
+    os: [android]
+
+  '@napi-rs/canvas-darwin-arm64@0.1.100':
+    resolution: {integrity: sha512-2PcswRaC7Ly645DGt88///zuFDhJxJYdKAs1uU3mfk1atYkXufgcgLfBpk6Tm12nCQBaNt1wpybuPZ4qOhTo8A==}
+    engines: {node: '>= 10'}
+    cpu: [arm64]
+    os: [darwin]
+
+  '@napi-rs/canvas-darwin-x64@0.1.100':
+    resolution: {integrity: sha512-ePNZtj7pNIva/siZMg+HmbeozkIjqUIYdoymH8HaA3qK7LfzFN4WMBM8G6HQ9ZC+H3+Dnn5pqtiXpgLykaPOhw==}
+    engines: {node: '>= 10'}
+    cpu: [x64]
+    os: [darwin]
+
+  '@napi-rs/canvas-linux-arm-gnueabihf@0.1.100':
+    resolution: {integrity: sha512-d5cDB48oWFGU8/XPhUOFAlySgb/VAu7D+s8fi55K1Pcfg8aPplHWqMgibhVLU8ky7Pyg/fuiVLz4Nf3JrSTuUA==}
+    engines: {node: '>= 10'}
+    cpu: [arm]
+    os: [linux]
+
+  '@napi-rs/canvas-linux-arm64-gnu@0.1.100':
+    resolution: {integrity: sha512-rDxgxRu69RvDlX/bh9o22DxLsGr8EqsNgotL9+RwQE1S0b0cqeatqsw6aW45mukm0B42DIAaAacKaYQ8cqS1nw==}
+    engines: {node: '>= 10'}
+    cpu: [arm64]
+    os: [linux]
+
+  '@napi-rs/canvas-linux-arm64-musl@0.1.100':
+    resolution: {integrity: sha512-K3mDW66N+xT2/V439u1alFANiBUjdEx2gLiNYnCmUsva5jZMxWTjafBYwTzYK+EMFMHrUoabuU+T1BIP5CgbYQ==}
+    engines: {node: '>= 10'}
+    cpu: [arm64]
+    os: [linux]
+
+  '@napi-rs/canvas-linux-riscv64-gnu@0.1.100':
+    resolution: {integrity: sha512-mooqUBTIsccZpnoQC4NgrC1v6C1vof39etLNMnBwCY+p0gajWJvAHLGQ6g/gGyS5YrpDW+GefSN4+Cvcr08UWw==}
+    engines: {node: '>= 10'}
+    cpu: [riscv64]
+    os: [linux]
+
+  '@napi-rs/canvas-linux-x64-gnu@0.1.100':
+    resolution: {integrity: sha512-1eCvkDCazm7FFhsT7DfGOdSaHgZVK3bt/dSBl5EWHOWmnz+I7j8tPseJqqD81NF+MH21jKUK4wQSDjN0mdhnTg==}
+    engines: {node: '>= 10'}
+    cpu: [x64]
+    os: [linux]
+
+  '@napi-rs/canvas-linux-x64-musl@0.1.100':
+    resolution: {integrity: sha512-20arT6lnI19S68qNlii73TSEDbECNgzMz2EpldC1V3mZFuRkeujXkcebRk0LRJe9SEUAooYiLokfMViY8IX7yA==}
+    engines: {node: '>= 10'}
+    cpu: [x64]
+    os: [linux]
+
+  '@napi-rs/canvas-win32-arm64-msvc@0.1.100':
+    resolution: {integrity: sha512-DZFFT1wIAg37LJw37yhMRFfjATd3vTQzjZ1Yki8u2vhO6Hi5VE6BVaGQ1aaDu7xb4iMErz+9EOwjpS7xcxFeBw==}
+    engines: {node: '>= 10'}
+    cpu: [arm64]
+    os: [win32]
+
+  '@napi-rs/canvas-win32-x64-msvc@0.1.100':
+    resolution: {integrity: sha512-MyT1j3mHC2+Lu4pBi9mKyMJhtP6U7k7EldY7sj/uS5gJA65gTXt8MefJQXLJo5d/vZbuWmfxzkEUNc/urV3pHA==}
+    engines: {node: '>= 10'}
+    cpu: [x64]
+    os: [win32]
+
+  '@napi-rs/canvas@0.1.100':
+    resolution: {integrity: sha512-xglYA6q3XO5P3BNJYxVZ1IV7DLVjp1Py6nwag88YntrS+3vKHyYcMqXVS4ZztJmwz2uGvz1FWhI/4LgbR5uQDA==}
+    engines: {node: '>= 10'}
+
   '@nodable/entities@2.1.0':
     resolution: {integrity: sha512-nyT7T3nbMyBI/lvr6L5TyWbFJAI9FTgVRakNoBqCD+PmID8DzFrrNdLLtHMwMszOtqZa8PAOV24ZqDnQrhQINA==}
 
@@ -1124,6 +1197,10 @@ packages:
   path-to-regexp@0.1.13:
     resolution: {integrity: sha512-A/AGNMFN3c8bOlvV9RreMdrv7jsmF9XIfDeCd87+I8RNg6s78BhJxMu69NEMHBSJFxKidViTEdruRwEk/WIKqA==}
 
+  pdfjs-dist@4.10.38:
+    resolution: {integrity: sha512-/Y3fcFrXEAsMjJXeL9J8+ZG9U01LbuWaYypvDW2ycW1jL269L3js3DVBjDJ0Up9Np1uqDXsDrRihHANhZOlwdQ==}
+    engines: {node: '>=20'}
+
   pend@1.2.0:
     resolution: {integrity: sha512-F3asv42UuXchdzt+xXqfW1OGlVBe+mxa2mqI0pg5yAHZPvFmY3Y6drSf/GQ1A86WgWEN9Kzh/WrgKa6iGcHXLg==}
 
@@ -1739,6 +1816,54 @@ snapshots:
       - supports-color
       - utf-8-validate
 
+  '@napi-rs/canvas-android-arm64@0.1.100':
+    optional: true
+
+  '@napi-rs/canvas-darwin-arm64@0.1.100':
+    optional: true
+
+  '@napi-rs/canvas-darwin-x64@0.1.100':
+    optional: true
+
+  '@napi-rs/canvas-linux-arm-gnueabihf@0.1.100':
+    optional: true
+
+  '@napi-rs/canvas-linux-arm64-gnu@0.1.100':
+    optional: true
+
+  '@napi-rs/canvas-linux-arm64-musl@0.1.100':
+    optional: true
+
+  '@napi-rs/canvas-linux-riscv64-gnu@0.1.100':
+    optional: true
+
+  '@napi-rs/canvas-linux-x64-gnu@0.1.100':
+    optional: true
+
+  '@napi-rs/canvas-linux-x64-musl@0.1.100':
+    optional: true
+
+  '@napi-rs/canvas-win32-arm64-msvc@0.1.100':
+    optional: true
+
+  '@napi-rs/canvas-win32-x64-msvc@0.1.100':
+    optional: true
+
+  '@napi-rs/canvas@0.1.100':
+    optionalDependencies:
+      '@napi-rs/canvas-android-arm64': 0.1.100
+      '@napi-rs/canvas-darwin-arm64': 0.1.100
+      '@napi-rs/canvas-darwin-x64': 0.1.100
+      '@napi-rs/canvas-linux-arm-gnueabihf': 0.1.100
+      '@napi-rs/canvas-linux-arm64-gnu': 0.1.100
+      '@napi-rs/canvas-linux-arm64-musl': 0.1.100
+      '@napi-rs/canvas-linux-riscv64-gnu': 0.1.100
+      '@napi-rs/canvas-linux-x64-gnu': 0.1.100
+      '@napi-rs/canvas-linux-x64-musl': 0.1.100
+      '@napi-rs/canvas-win32-arm64-msvc': 0.1.100
+      '@napi-rs/canvas-win32-x64-msvc': 0.1.100
+    optional: true
+
   '@nodable/entities@2.1.0': {}
 
   '@paulirish/trace_engine@0.0.23': {}
@@ -2752,6 +2877,10 @@ snapshots:
 
   path-to-regexp@0.1.13: {}
 
+  pdfjs-dist@4.10.38:
+    optionalDependencies:
+      '@napi-rs/canvas': 0.1.100
+
   pend@1.2.0: {}
 
   picocolors@1.1.1: {}