diff --git a/docs/pipeline/indesign-pdf-fidelity.md b/docs/pipeline/indesign-pdf-fidelity.md new file mode 100644 index 0000000..bde3efe --- /dev/null +++ b/docs/pipeline/indesign-pdf-fidelity.md @@ -0,0 +1,103 @@ +# InDesign PDF fallback: fidelity guide + +The InDesign-to-WordPress pipeline prefers **IDML** (`.idml`) as its input. When +IDML isn't available — a client only has the exported PDF, or you want to +cross-check IDML output — the pipeline can parse a **PDF exported from InDesign** +instead. + +PDF is intentionally a *fallback*. A PDF has no named styles, no text frames, +and no swatch palette; it's a bag of absolutely-positioned glyph runs, fill +colors, and placed images. The parser reconstructs an approximate version of the +same [intermediate representation](../../packages/pipeline/src/indesign/ir.js) +the IDML parser produces, so the downstream mapper and generator can consume +either source. **We never aim for pixel-perfect reconstruction — we aim for a +usable, styled IR that the generator can turn into WordPress patterns with +manual touch-ups.** + +## Usage + +```bash +# Print the reconstructed IR as JSON; fidelity warnings go to stderr. +node packages/pipeline/bin/parse-pdf.mjs brochure.pdf > ir.json + +# Also extract embedded images to an asset cache directory (as PNG). +node packages/pipeline/bin/parse-pdf.mjs brochure.pdf --asset-dir ./assets > ir.json +``` + +```js +import { parsePdf } from '@flavian/pipeline'; + +const ir = await parsePdf('./brochure.pdf', { + dpi: 96, // unit normalization (default 96) + assetCacheDir: './assets', // optional; write extracted images here + swatchPalette: idml.swatches, // optional; snap detected colors to IDML swatches +}); +``` + +## How reconstruction works + +| IR element | How it's derived from the PDF | +| --- | --- | +| `Spread` (one per page) | One spread per PDF page; page size from the MediaBox. 
| +| `TextFrame` | Glyph runs are grouped into lines (shared baseline), then lines into frames (vertically adjacent + horizontally overlapping). A wide horizontal gap on a shared baseline is treated as a **column gutter**, so side-by-side columns become separate frames. | +| `Story` / `TextRun` | One story per text frame; each run carries the paragraph-style reference of its font-size bucket. | +| `Style` | Synthesized from the font-size distribution: the most-used size is **Body**, larger sizes become **Heading 1..6** (largest first), smaller sizes become **Caption**. Each bucket records its dominant font and fill color. | +| `Font` | Resolved from each run's PostScript name (subset prefixes like `ABCDEF+` stripped); family/style split on the `-` and refined with pdfjs bold/italic flags. | +| `Swatch` | Distinct fill colors found in the content stream, normalized to hex. With a `swatchPalette`, each color snaps to the nearest IDML swatch (so PDF and IDML produce aligned token names). | +| `ImageFrame` | Image XObjects, placed via the current transform matrix. Pixels are PNG-encoded into the asset cache; `href` points at the cache-relative path. | +| `MasterSpread` | Always empty — PDF has no master pages. | + +## Fidelity warnings + +Every PDF parse attaches warnings describing the approximations made. They appear +on the CLI's stderr and in `ir.warnings`. Treat them as a checklist of things to +verify by eye. + +| Code | Meaning | When | +| --- | --- | --- | +| `pdf-fallback` | The whole IR is approximate; prefer IDML if you have it. | Always | +| `text-reconstructed-from-glyphs` | Text came from positioned glyph runs; ligatures, hidden text, and reading order may differ. | Always | +| `styles-synthesized` | Paragraph styles are font-size buckets, not real named styles. | When any text exists | +| `no-embedded-fonts` | No fonts are embedded; family/style mapping is best-effort from PostScript names. 
| No embedded fonts found | +| `color-attribution-approximate` | Colors are bucketed by font size, not resolved per run. | When any colored text exists | +| `vector-paths-dropped` | Vector paths / image masks were detected but aren't represented in the IR. | When the page draws vector fills/strokes | +| `multi-column-layout` | A page was split into N columns / separate frames. | When >1 column is detected | +| `image-extract-failed` | An image couldn't be decoded (e.g. an unsupported filter). | Per failed image | +| `empty-page` | A page produced no text or image frames. | Per empty page | +| `asset-write-failed` | An extracted image couldn't be written to the asset cache. | Per failed write | + +## Round-trip tolerances + +The test suite builds the *same logical document* as both IDML and PDF and +asserts the two IRs agree within these tolerances (see +`packages/pipeline/tests/indesign/pdf-roundtrip.test.mjs`): + +| Quantity | Tolerance | +| --- | --- | +| Page / spread count | Exact | +| Image frame count | Exact | +| Text frame count | Within ±1 | +| Style bucket count | Within ±1 | +| Swatch identity | Detected PDF colors snap onto the IDML swatch palette | + +These are deliberately loose on text-frame and style counts: where InDesign knows +a frame is one frame, the PDF only shows glyph positions, so a heading and its +body paragraph may merge or split by ±1 depending on spacing. + +## Known limitations + +- **Geometry is approximate.** Frame rectangles are derived from glyph baselines + using nominal ascent/descent ratios (0.8 / 0.2 of font size), not true font + metrics. Rotated or skewed text is flattened to its axis-aligned bounding box. +- **Per-run color is not resolved.** Color is attributed at the style-bucket + (font-size) level, because the IR carries color on `Style`, not `TextRun`. +- **Vector art is dropped.** Backgrounds, rules, and shapes drawn as vector paths + are noted via `vector-paths-dropped` but not reconstructed. 
+- **Image masks aren't extracted.** Stencil-masked images paint the current fill + through a 1-bit mask and have no extractable raster; they're treated as vector + content. +- **Leading/tracking are omitted.** The fallback doesn't infer line spacing or + tracking; the mapper applies its own defaults. + +When fidelity matters, export IDML from InDesign and use the +[IDML parser](../../packages/pipeline/README.md) instead. diff --git a/packages/pipeline/README.md b/packages/pipeline/README.md index 1c1ebec..144fb50 100644 --- a/packages/pipeline/README.md +++ b/packages/pipeline/README.md @@ -4,26 +4,39 @@ Conversion pipeline for InDesign (and future) sources into WordPress FSE themes. ## Status -This package currently ships the **IDML parser and intermediate representation** (sub-issue #62 of the InDesign-to-WordPress epic). Downstream stages — PDF fallback (#63), style + token mapper (#64), output generator (#65) — will land as separate PRs. The IR shape produced here is the contract those stages consume. +This package ships the **IDML parser** (sub-issue #62) and the **PDF fallback parser** (sub-issue #63) of the InDesign-to-WordPress epic. Both emit the same intermediate representation. Downstream stages — style + token mapper (#64), output generator (#65) — will land as separate PRs. The IR shape produced here is the contract those stages consume. + +IDML is the primary path (full access to stories, frames, styles, swatches, masters). PDF is a lossy fallback for when only the exported PDF is available, or as a verification source against IDML output — see [`docs/pipeline/indesign-pdf-fidelity.md`](../../docs/pipeline/indesign-pdf-fidelity.md). 
## Layout ``` packages/pipeline/ -├── bin/parse-idml.mjs CLI entry; prints validated IR JSON on stdout +├── bin/ +│ ├── parse-idml.mjs CLI: IDML → validated IR JSON on stdout +│ └── parse-pdf.mjs CLI: PDF → reconstructed IR JSON on stdout └── src/ ├── index.js Re-exports the InDesign surface └── indesign/ ├── ir.js zod schemas + JSDoc typedefs for the IR - ├── parse-idml.js Main entry: unzips + orchestrates + cross-refs + validates + ├── parse-idml.js IDML entry: unzips + orchestrates + cross-refs + validates + ├── parse-pdf.js PDF entry: extracts + clusters + classifies + validates ├── units.js pt/pc/mm/cm/in → px at configurable DPI ├── warnings.js Non-fatal warning collector - └── parsers/ - ├── xml.js fast-xml-parser wrapper - ├── designmap.js designmap.xml → manifest with paths - ├── resources.js Graphic.xml + Fonts.xml + Styles.xml - ├── stories.js Stories/Story_*.xml → text runs - └── spreads.js Spreads/*.xml + MasterSpreads/*.xml + ├── parsers/ IDML XML decoders + │ ├── xml.js fast-xml-parser wrapper + │ ├── designmap.js designmap.xml → manifest with paths + │ ├── resources.js Graphic.xml + Fonts.xml + Styles.xml + │ ├── stories.js Stories/Story_*.xml → text runs + │ └── spreads.js Spreads/*.xml + MasterSpreads/*.xml + └── pdf/ PDF reconstruction modules + ├── pdfjs.js Lazy pdfjs-dist loader (headless, extraction-only) + ├── extract.js Per-page: text runs, fonts, colors, images, vector flag + ├── cluster.js Glyph runs → lines → frames; column detection (pure) + ├── classify.js Font-size buckets → heading/body/caption styles (pure) + ├── color.js RGB/gray/CMYK → hex; nearest-swatch matching (pure) + ├── png.js Decoded pixels → PNG via node:zlib (pure) + └── assets.js Write extracted images to the asset cache ``` ## Quick start @@ -47,6 +60,26 @@ Or from the command line: node packages/pipeline/bin/parse-idml.mjs my-document.idml > ir.json ``` +### PDF fallback + +When you only have a PDF exported from InDesign, use the fallback parser. 
It +emits the same IR, plus fidelity warnings describing every approximation it made. + +```js +import { parsePdf } from '@flavian/pipeline'; + +const ir = await parsePdf('./brochure.pdf', { + assetCacheDir: './assets', // optional: write extracted images (PNG) here + swatchPalette: idml?.swatches, // optional: snap detected colors to IDML swatches +}); +``` + +```bash +node packages/pipeline/bin/parse-pdf.mjs brochure.pdf --asset-dir ./assets > ir.json +``` + +PDF reconstruction is lossy by design. See [`docs/pipeline/indesign-pdf-fidelity.md`](../../docs/pipeline/indesign-pdf-fidelity.md) for how each IR element is derived, the full list of fidelity-warning codes, and the round-trip tolerances against IDML. + ## IR shape The intermediate representation is described in [`src/indesign/ir.js`](src/indesign/ir.js). At the top level: @@ -70,10 +103,12 @@ Geometry (`Page.bounds`, `Frame.bounds`) is normalized to pixels at `dpi` (defau ## Failure mode -- **Throws** on structural problems that make the IR meaningless: missing `designmap.xml`, malformed zip, a `` element that lacks `Self`. -- **Warns and continues** on everything else: missing optional resource files, dangling style references, unknown color spaces, empty stories, unrecognized unit suffixes. +Both parsers share the same philosophy: throw only when the document can't be read at all; otherwise emit a partial IR with warnings. + +- **IDML throws** on missing `designmap.xml`, a malformed zip, or a `` lacking `Self`; **warns** on missing optional resources, dangling references, unknown color spaces, empty stories, unrecognized units. +- **PDF throws** only when the file can't be opened as a PDF; **warns** on every approximation (text reconstructed from glyphs, synthesized styles, dropped vector paths, undecodable images, …). PDF parses always carry fidelity warnings — that's expected. -The CLI surfaces warnings on stderr and exits 0 unless the IR itself failed to build. 
+Each CLI surfaces warnings on stderr and exits 0 unless the IR itself failed to build. ## Testing @@ -81,7 +116,9 @@ The CLI surfaces warnings on stderr and exits 0 unless the IR itself failed to b pnpm --filter @flavian/pipeline test ``` -Tests build minimal IDML zips programmatically (see `tests/indesign/helpers/build-idml.js`) — no binary fixtures in git. The fixture builder mirrors the IDML XML grammar the parser reads, so adding a new test case is usually one option flag. +Tests build minimal fixtures programmatically — no binary fixtures in git. `tests/indesign/helpers/build-idml.js` emits IDML zips; `tests/indesign/helpers/build-pdf.js` emits PDFs (positioned text in base-14 fonts, FlateDecode image XObjects, vector fills). Building the *same logical document* both ways powers the IDML↔PDF round-trip test. + +The PDF heuristics (clustering, classification, color, PNG encoding) are split into pure modules under `src/indesign/pdf/` and unit-tested without a PDF engine; only `extract.js` and the orchestrator touch pdfjs. ## Adding a new input format diff --git a/packages/pipeline/bin/parse-pdf.mjs b/packages/pipeline/bin/parse-pdf.mjs new file mode 100644 index 0000000..8c052cf --- /dev/null +++ b/packages/pipeline/bin/parse-pdf.mjs @@ -0,0 +1,85 @@ +#!/usr/bin/env node +// CLI: print the reconstructed IR as JSON on stdout, fidelity warnings on stderr. +// +// flavian-parse-pdf FILE.pdf [--dpi N] [--asset-dir DIR] [--quiet] +// +// PDF is the fallback path. Expect fidelity warnings on every run — that's the +// parser telling you which parts are approximate. 
+ +import { parsePdf } from '../src/indesign/parse-pdf.js'; + +const args = process.argv.slice(2); +let inputPath; +let dpi; +let assetCacheDir; +let quiet = false; + +for (let i = 0; i < args.length; i += 1) { + const arg = args[i]; + if (arg === '--dpi') { + const next = args[i + 1]; + if (!next || !Number.isFinite(Number(next)) || Number(next) <= 0) { + console.error('--dpi requires a positive number'); + process.exit(2); + } + dpi = Number(next); + i += 1; + } else if (arg === '--asset-dir') { + const next = args[i + 1]; + if (!next) { + console.error('--asset-dir requires a directory path'); + process.exit(2); + } + assetCacheDir = next; + i += 1; + } else if (arg === '--quiet') { + quiet = true; + } else if (arg === '-h' || arg === '--help') { + printUsage(); + process.exit(0); + } else if (!inputPath && !arg.startsWith('-')) { + inputPath = arg; + } else { + console.error(`Unknown argument: ${arg}`); + printUsage(); + process.exit(2); + } +} + +if (!inputPath) { + printUsage(); + process.exit(2); +} + +try { + const options = {}; + if (dpi !== undefined) options.dpi = dpi; + if (assetCacheDir !== undefined) options.assetCacheDir = assetCacheDir; + const ir = await parsePdf(inputPath, options); + if (!quiet && ir.warnings.length > 0) { + for (const w of ir.warnings) { + const where = w.context?.file ? ` (${w.context.file}${w.context.id ? 
`#${w.context.id}` : ''})` : ''; + process.stderr.write(`[${w.code}] ${w.message}${where}\n`); + } + process.stderr.write(`\n${ir.warnings.length} warning(s).\n`); + } + process.stdout.write(JSON.stringify(ir, null, 2) + '\n'); +} catch (err) { + process.stderr.write(`error: ${err.message}\n`); + process.exit(1); +} + +function printUsage() { + process.stderr.write( + [ + 'Usage: flavian-parse-pdf FILE.pdf [options]', + '', + 'Options:', + ' --dpi N Pixels per inch for unit normalization (default 96)', + ' --asset-dir DIR Write extracted images (PNG) under this directory', + ' --quiet Suppress fidelity warnings on stderr', + ' -h, --help Show this help', + '', + ].join('\n'), + ); +} diff --git a/packages/pipeline/package.json b/packages/pipeline/package.json index 4f79a4c..ba2dcf6 100644 --- a/packages/pipeline/package.json +++ b/packages/pipeline/package.json @@ -9,7 +9,8 @@ "./indesign": "./src/indesign/index.js" }, "bin": { - "flavian-parse-idml": "./bin/parse-idml.mjs" + "flavian-parse-idml": "./bin/parse-idml.mjs", + "flavian-parse-pdf": "./bin/parse-pdf.mjs" }, "scripts": { "test": "node --test \"tests/**/*.test.mjs\"" @@ -17,6 +18,7 @@ "dependencies": { "fast-xml-parser": "^5.7.0", "fflate": "^0.8.2", + "pdfjs-dist": "^4.10.38", "zod": "^3.23.8" }, "engines": { diff --git a/packages/pipeline/src/indesign/index.js b/packages/pipeline/src/indesign/index.js index 8fa1f9b..e76cd88 100644 --- a/packages/pipeline/src/indesign/index.js +++ b/packages/pipeline/src/indesign/index.js @@ -1,4 +1,5 @@ export { parseIdml, parseIdmlBuffer } from './parse-idml.js'; +export { parsePdf, parsePdfBuffer } from './parse-pdf.js'; export * as ir from './ir.js'; export { WarningCollector } from './warnings.js'; export { lengthToPx, ptToPx, roundPx } from './units.js'; diff --git a/packages/pipeline/src/indesign/parse-pdf.js b/packages/pipeline/src/indesign/parse-pdf.js new file mode 100644 index 0000000..79c1d84 --- /dev/null +++ b/packages/pipeline/src/indesign/parse-pdf.js @@ -0,0 
+1,327 @@ 
+// PDF fallback parser. Reads a PDF exported from InDesign and reconstructs an +// approximation of the same IR the IDML parser emits, so the downstream mapper +// and generator can consume either source. +// +// PDF is intentionally lossy: there are no named styles, no text frames, no +// swatch palette — just absolutely-positioned glyph runs, fill colors, and +// placed images. We rebuild a *usable, styled* IR (not a pixel-perfect one) and +// attach fidelity warnings describing every approximation we made. Use IDML +// when you have it; use this when you don't, or to cross-check IDML output. +// +// Failure philosophy mirrors parse-idml.js: throw only when the document can't +// be opened at all; everything else becomes a warning and a partial IR. + +import { promises as fs } from 'node:fs'; + +import { Document } from './ir.js'; +import { WarningCollector } from './warnings.js'; +import { ptToPx, roundPx } from './units.js'; +import { openDocument, loadPdfjs } from './pdf/pdfjs.js'; +import { extractPage } from './pdf/extract.js'; +import { clusterIntoFrames, detectColumns } from './pdf/cluster.js'; +import { classifyStyles } from './pdf/classify.js'; +import { nearestSwatch } from './pdf/color.js'; +import { assetHref, writeAsset } from './pdf/assets.js'; + +const DEFAULT_DPI = 96; + +/** + * @typedef {Object} ParsePdfOptions + * @property {number} [dpi] Pixels-per-inch for unit normalization. Default 96. + * @property {string} [name] Override the document name (defaults to PDF /Title or the file basename). + * @property {string} [assetCacheDir] If set, extracted images are PNG-encoded and written here. + * @property {Array} [swatchPalette] IDML-derived swatches to snap detected colors to. + */ + +/** + * Parse a PDF file from disk. 
+ * + * @param {string} path + * @param {ParsePdfOptions} [options] + * @returns {Promise} + */ +export async function parsePdf(path, options = {}) { + const bytes = await fs.readFile(path); + const fallbackName = path.split(/[\\/]/).pop()?.replace(/\.pdf$/i, ''); + return parsePdfBuffer(bytes, { ...options, name: options.name ?? fallbackName }); +} + +/** + * Parse PDF bytes already in memory. + * + * @param {Uint8Array} bytes + * @param {ParsePdfOptions} [options] + * @returns {Promise} + */ +export async function parsePdfBuffer(bytes, options = {}) { + const dpi = options.dpi ?? DEFAULT_DPI; + const palette = options.swatchPalette ?? []; + const warnings = new WarningCollector(); + + let doc; + try { + doc = await openDocument(bytes); + } catch (err) { + throw new Error(`PDF could not be opened: ${err.message}`); + } + + let title; + try { + const md = await doc.getMetadata(); + title = md?.info?.Title || undefined; + } catch { + // Metadata is optional; ignore. + } + + const pdfjs = await loadPdfjs(); + + // --- Pass 1: extract every page, normalizing font keys to a global identity. + const fontsById = new Map(); // fontId -> { id, family, style, postScriptName } + const pages = []; + const allItems = []; + const allColorSamples = []; + let anyVector = false; + let anyEmbeddedFont = false; + let sawFonts = false; + + for (let p = 0; p < doc.numPages; p += 1) { + const page = await doc.getPage(p + 1); + const extracted = await extractPage(page, pdfjs); + // Note: we deliberately don't page.cleanup() here — decoded image bytes + // are PNG-encoded later, and cleanup can clear page.objs out from under us. + // doc.cleanup() at the end releases everything. + + // loader font key (page-scoped, e.g. "g_d0_f1") -> stable global font id. 
+ const localToGlobal = new Map(); + for (const [localKey, font] of extracted.fonts) { + sawFonts = true; + if (font.embedded) anyEmbeddedFont = true; + const id = fontId(font.name); + if (!fontsById.has(id)) { + fontsById.set(id, { id, family: font.family, style: font.style, postScriptName: font.name }); + } + localToGlobal.set(localKey, id); + } + + const items = extracted.textItems.map((it) => ({ + ...it, + fontKey: localToGlobal.get(it.fontKey) ?? it.fontKey, + })); + allItems.push(...items); + allColorSamples.push(...extracted.colorSamples); + if (extracted.hasVector) anyVector = true; + + pages.push({ index: p, ...extracted, items }); + } + + // --- Swatches: distinct detected colors, snapped to the IDML palette if given. + const swatches = []; + const hexToSwatchId = new Map(); + for (const sample of allColorSamples) { + if (hexToSwatchId.has(sample.hex)) continue; + const matched = palette.length ? nearestSwatch(sample.hex, palette) : null; + if (matched) { + hexToSwatchId.set(sample.hex, matched.id); + if (!swatches.some((s) => s.id === matched.id)) swatches.push(matched); + } else { + const id = `pdf-color-${sample.hex.slice(1)}`; + hexToSwatchId.set(sample.hex, id); + swatches.push({ id, name: sample.hex.toUpperCase(), color: { hex: sample.hex, space: 'RGB' } }); + } + } + + // --- Styles: synthesize buckets from font-size distribution. + const { buckets, styleIdForSize } = classifyStyles({ items: allItems, colorSamples: allColorSamples, dpi }); + const styles = buckets.map((b) => ({ + id: b.id, + name: b.name, + kind: 'paragraph', + fontSize: b.fontSizePx, + fontRef: b.dominantFontKey && fontsById.has(b.dominantFontKey) ? b.dominantFontKey : undefined, + fillColorRef: b.dominantHex ? hexToSwatchId.get(b.dominantHex) : undefined, + properties: { role: b.role, sourceSizePt: b.sizePt }, + })); + + // --- Spreads: one per page, with reconstructed text + image frames. 
+ const stories = []; + const spreads = []; + const pendingWrites = []; // { href, image } to persist if assetCacheDir is set + for (const page of pages) { + const pageNum = page.index + 1; + const frames = []; + + const blocks = clusterIntoFrames(page.items); + const columns = detectColumns(blocks); + if (columns > 1) { + warnings.add( + 'multi-column-layout', + `Page ${pageNum}: detected ${columns} columns; emitted as ${blocks.length} separate text frames`, + { file: `pdf:page:${pageNum}` }, + ); + } + + blocks.forEach((block, idx) => { + const storyId = `pdf-story-p${pageNum}-${idx + 1}`; + stories.push({ id: storyId, source: `pdf:page:${pageNum}`, runs: blockToRuns(block, styleIdForSize) }); + frames.push({ + kind: 'text', + id: `pdf-frame-p${pageNum}-t${idx + 1}`, + bounds: rectToPx(block.bounds, dpi), + storyRef: storyId, + }); + }); + + page.images.forEach((img, idx) => { + const href = assetHref(page.index, idx); + frames.push({ + kind: 'image', + id: `pdf-frame-p${pageNum}-i${idx + 1}`, + bounds: boxToPx(img, dpi), + href, + embedded: true, + }); + if (img.failed) { + warnings.add('image-extract-failed', `Page ${pageNum}: could not decode image ${href}`, { + file: `pdf:page:${pageNum}`, + }); + } else if (options.assetCacheDir) { + // Defer the actual write (collected below) so failures don't abort. + pendingWrites.push({ href, image: img.image }); + } + }); + + if (frames.length === 0) { + warnings.add('empty-page', `Page ${pageNum} produced no text or image frames`, { + file: `pdf:page:${pageNum}`, + }); + } + + spreads.push({ + id: `pdf-spread-${pageNum}`, + source: `pdf:page:${pageNum}`, + pages: [{ id: `pdf-page-${pageNum}`, bounds: { x: 0, y: 0, width: roundPx(ptToPx(page.widthPt, dpi)), height: roundPx(ptToPx(page.heightPt, dpi)) } }], + frames, + appliedMasterRef: undefined, + }); + } + + // --- Persist extracted images, if a cache dir was given. 
+ if (options.assetCacheDir) { + for (const w of pendingWrites) { + try { + await writeAsset(options.assetCacheDir, w.href, w.image); + } catch (err) { + warnings.add('asset-write-failed', `Failed writing ${w.href}: ${err.message}`); + } + } + } + + // --- Fidelity warnings: describe every approximation. + addFidelityWarnings(warnings, { + sawFonts, + anyEmbeddedFont, + anyColor: allColorSamples.length > 0, + anyVector, + hasStyles: styles.length > 0, + }); + + const document = Document.parse({ + irVersion: 1, + // Embedded /Title wins over the caller-supplied/filename fallback, mirroring + // how the IDML parser prefers designmap's @Name. + meta: { name: title ?? options.name }, + dpi, + swatches, + fonts: [...fontsById.values()], + styles, + stories, + spreads, + masterSpreads: [], + warnings: warnings.list(), + }); + + await doc.cleanup(); + return document; +} + +/** + * @param {import('./pdf/cluster.js').TextBlock} block + * @param {(sizePt: number) => string | undefined} styleIdForSize + * @returns {Array} + */ +function blockToRuns(block, styleIdForSize) { + const runs = []; + for (const line of block.lines) { + line.items.forEach((item, idx) => { + let text = item.text; + if (idx < line.items.length - 1) text += ' '; + runs.push({ text, paragraphStyleRef: styleIdForSize(item.fontSize) }); + }); + // Close each line with a newline so prose re-flows downstream. 
+ if (runs.length > 0 && !runs[runs.length - 1].text.endsWith('\n')) { + runs[runs.length - 1].text += '\n'; + } + } + return runs; +} + +function rectToPx(b, dpi) { + return { + x: roundPx(ptToPx(b.minX, dpi)), + y: roundPx(ptToPx(b.minY, dpi)), + width: roundPx(ptToPx(Math.max(0, b.maxX - b.minX), dpi)), + height: roundPx(ptToPx(Math.max(0, b.maxY - b.minY), dpi)), + }; +} + +function boxToPx(box, dpi) { + return { + x: roundPx(ptToPx(box.x, dpi)), + y: roundPx(ptToPx(box.y, dpi)), + width: roundPx(ptToPx(Math.max(0, box.width), dpi)), + height: roundPx(ptToPx(Math.max(0, box.height), dpi)), + }; +} + +function fontId(psName) { + const slug = psName + .toLowerCase() + .replace(/[^a-z0-9]+/g, '-') + .replace(/^-+|-+$/g, ''); + return `pdf-font-${slug || 'unknown'}`; +} + +function addFidelityWarnings(warnings, flags) { + warnings.add( + 'pdf-fallback', + 'IR reconstructed from PDF; layout, frames, and styles are approximate. Prefer IDML when available.', + ); + warnings.add( + 'text-reconstructed-from-glyphs', + 'Text was recovered from positioned glyph runs; ligatures, hidden text, and reading order may differ from the source.', + ); + if (flags.hasStyles) { + warnings.add( + 'styles-synthesized', + 'Paragraph styles were inferred from font-size buckets, not real named styles.', + ); + } + if (flags.sawFonts && !flags.anyEmbeddedFont) { + warnings.add( + 'no-embedded-fonts', + 'No embedded fonts found; font family/style mapping is best-effort from PostScript names.', + ); + } + if (flags.anyColor) { + warnings.add( + 'color-attribution-approximate', + 'Swatch attribution is heuristic: colors are bucketed by font size, not resolved per run.', + ); + } + if (flags.anyVector) { + warnings.add( + 'vector-paths-dropped', + 'Vector paths and image masks were detected but are not represented in the IR.', + ); + } +} diff --git a/packages/pipeline/src/indesign/pdf/assets.js b/packages/pipeline/src/indesign/pdf/assets.js new file mode 100644 index 0000000..8765f58 --- 
/dev/null +++ b/packages/pipeline/src/indesign/pdf/assets.js @@ -0,0 +1,41 @@ +// Asset cache writer. Extracted images are PNG-encoded and written under a +// caller-provided directory; the IR's ImageFrame.href points at the +// cache-relative path so the downstream media importer can find them. +// +// Writing is opt-in: with no assetCacheDir the parser still records image +// frames and their hrefs (so the IR is complete and addressable), it just +// doesn't persist bytes — useful for verification runs and tests that only +// assert structure. + +import { promises as fs } from 'node:fs'; +import path from 'node:path'; + +import { encodePng } from './png.js'; + +/** + * Stable, collision-free href for an extracted image. + * @param {number} pageIndex 0-based + * @param {number} imageIndex 0-based + * @returns {string} + */ +export function assetHref(pageIndex, imageIndex) { + const p = String(pageIndex + 1).padStart(3, '0'); + const n = String(imageIndex + 1).padStart(3, '0'); + return `assets/pdf-p${p}-img${n}.png`; +} + +/** + * Encode + write one image to the cache. Returns the byte length written. + * + * @param {string} cacheDir + * @param {string} href cache-relative path (from assetHref) + * @param {{ width: number, height: number, kind: number, data: Uint8Array }} image + * @returns {Promise} + */ +export async function writeAsset(cacheDir, href, image) { + const png = encodePng(image); + const dest = path.join(cacheDir, href); + await fs.mkdir(path.dirname(dest), { recursive: true }); + await fs.writeFile(dest, png); + return png.length; +} diff --git a/packages/pipeline/src/indesign/pdf/classify.js b/packages/pipeline/src/indesign/pdf/classify.js new file mode 100644 index 0000000..63c773a --- /dev/null +++ b/packages/pipeline/src/indesign/pdf/classify.js @@ -0,0 +1,140 @@ +// Heuristic style synthesis. A PDF carries no named paragraph styles, so we +// infer them from the only signal we have: font size. 
The most-used size is +// "body"; larger sizes become headings (largest = Heading 1); smaller sizes +// become captions. Each synthesized bucket also remembers the font and fill +// color most associated with that size, so the token mapper downstream can turn +// it into a theme.json preset. +// +// This is deliberately coarse. The IDML parser reports real styles; this is the +// fallback's best approximation and is flagged as such in the IR warnings. + +import { ptToPx, roundPx } from '../units.js'; + +// Round sizes so 11.999pt and 12.001pt land in the same bucket. +const SIZE_QUANTUM = 0.5; +const MAX_HEADING_LEVEL = 6; + +function roundSize(pt) { + return Math.round(pt / SIZE_QUANTUM) * SIZE_QUANTUM; +} + +/** + * @param {Map} tally + * @returns {string | undefined} key with the highest count + */ +function argmax(tally) { + let best; + let bestN = -Infinity; + for (const [key, n] of tally) { + if (n > bestN) { + bestN = n; + best = key; + } + } + return best; +} + +/** + * @typedef {Object} StyleBucket + * @property {string} id + * @property {string} name + * @property {'heading'|'body'|'caption'} role + * @property {number} sizePt + * @property {number} fontSizePx + * @property {string} [dominantFontKey] + * @property {string} [dominantHex] + * + * @typedef {Object} ClassifyResult + * @property {StyleBucket[]} buckets + * @property {(sizePt: number) => string | undefined} styleIdForSize + */ + +/** + * @param {{ + * items: Array<{ fontSize: number, fontKey?: string, text: string }>, + * colorSamples?: Array<{ fontSizePt: number, hex: string, glyphs: number }>, + * dpi: number, + * }} input + * @returns {ClassifyResult} + */ +export function classifyStyles({ items, colorSamples = [], dpi }) { + // chars-per-size, plus per-size font and color tallies. 
+ const charsBySize = new Map(); + const fontBySize = new Map(); // size -> Map(fontKey -> chars) + for (const item of items) { + const size = roundSize(item.fontSize); + const len = item.text.length; + charsBySize.set(size, (charsBySize.get(size) ?? 0) + len); + if (item.fontKey) { + const fonts = fontBySize.get(size) ?? new Map(); + fonts.set(item.fontKey, (fonts.get(item.fontKey) ?? 0) + len); + fontBySize.set(size, fonts); + } + } + + const colorBySize = new Map(); // size -> Map(hex -> glyphs) + for (const sample of colorSamples) { + const size = roundSize(sample.fontSizePt); + const colors = colorBySize.get(size) ?? new Map(); + colors.set(sample.hex, (colors.get(sample.hex) ?? 0) + sample.glyphs); + colorBySize.set(size, colors); + } + + const sizes = [...charsBySize.keys()]; + if (sizes.length === 0) { + return { buckets: [], styleIdForSize: () => undefined }; + } + + // Body = most-used size. Ties resolve to the smaller size (body text usually + // outnumbers display text, and the smaller of two equal counts is the safer + // "body" pick for sparse pages). + let bodySize = sizes[0]; + for (const size of sizes) { + const n = charsBySize.get(size); + const bestN = charsBySize.get(bodySize); + if (n > bestN || (n === bestN && size < bodySize)) { + bodySize = size; + } + } + + const larger = sizes.filter((s) => s > bodySize).sort((a, b) => b - a); + const smaller = sizes.filter((s) => s < bodySize).sort((a, b) => b - a); + + /** @type {Map} */ + const bySize = new Map(); + const makeBucket = (size, role, id, name) => { + const fonts = fontBySize.get(size); + const colors = colorBySize.get(size); + const bucket = { + id, + name, + role, + sizePt: size, + fontSizePx: roundPx(ptToPx(size, dpi)), + dominantFontKey: fonts ? argmax(fonts) : undefined, + dominantHex: colors ? 
argmax(colors) : undefined, + }; + bySize.set(size, bucket); + return bucket; + }; + + const buckets = []; + larger.forEach((size, i) => { + const level = Math.min(i + 1, MAX_HEADING_LEVEL); + buckets.push(makeBucket(size, 'heading', `pdf-style-h${level}`, `Heading ${level}`)); + }); + buckets.push(makeBucket(bodySize, 'body', 'pdf-style-body', 'Body')); + smaller.forEach((size, i) => { + const suffix = i === 0 ? '' : `-${i + 1}`; + const name = i === 0 ? 'Caption' : `Caption ${i + 1}`; + buckets.push(makeBucket(size, 'caption', `pdf-style-caption${suffix}`, name)); + }); + + // Headings first (largest → smallest), then body, then captions. + buckets.sort((a, b) => b.sizePt - a.sizePt); + + return { + buckets, + styleIdForSize: (sizePt) => bySize.get(roundSize(sizePt))?.id, + }; +} diff --git a/packages/pipeline/src/indesign/pdf/cluster.js b/packages/pipeline/src/indesign/pdf/cluster.js new file mode 100644 index 0000000..ba7b1ae --- /dev/null +++ b/packages/pipeline/src/indesign/pdf/cluster.js @@ -0,0 +1,194 @@ +// Positional clustering: turn a flat bag of positioned text runs into logical +// text frames, the way a reader would group them. +// +// PDF has no concept of a "text frame" — InDesign flattens everything to +// absolutely-positioned glyph runs. We reconstruct frames in two passes: +// 1. Runs sharing a baseline (within tolerance) become a line. +// 2. Lines that are vertically adjacent AND horizontally overlapping become a +// block (frame). Processing lines top-to-bottom while keeping several +// blocks "open" lets side-by-side columns fall out naturally — a line only +// joins a block in its own column. +// +// All geometry here is in points with a top-left origin (y grows downward), +// which is what the orchestrator converts to px when emitting the IR. + +// Fractions of font size used to approximate a glyph run's vertical box from +// its baseline. 
Real ascent/descent vary per font; these are good enough to +// cluster and to draw a frame rectangle a human would accept. +const ASCENT_RATIO = 0.8; +const DESCENT_RATIO = 0.2; + +// Two runs are on the same line if their baselines are within this fraction of +// the smaller font size. +const LINE_BASELINE_TOL = 0.5; + +// A line joins a block only if the vertical gap to the block's last line is no +// more than this multiple of the line height — enough for paragraph spacing, +// not enough to swallow a separate block further down the page. +const BLOCK_GAP_FACTOR = 1.8; + +// Runs on the same baseline but separated by more than this multiple of the +// font size are treated as different columns, not one wide line. Print columns +// commonly share baselines, so baseline proximity alone can't tell them apart — +// the horizontal gutter does. +const GUTTER_FACTOR = 2.5; + +/** + * @typedef {Object} TextItem + * @property {string} text + * @property {number} x Left edge (pt). + * @property {number} baseline Baseline y, top-left origin (pt). + * @property {number} width Run advance width (pt). + * @property {number} fontSize Font size (pt). + * @property {string} fontKey Stable key into the page's font table. + * + * @typedef {Object} Line + * @property {number} baseline + * @property {number} top + * @property {number} bottom + * @property {number} left + * @property {number} right + * @property {number} lineHeight + * @property {TextItem[]} items + * + * @typedef {Object} TextBlock + * @property {Line[]} lines + * @property {TextItem[]} items + * @property {{minX: number, minY: number, maxX: number, maxY: number}} bounds + */ + +function itemTop(item) { + return item.baseline - item.fontSize * ASCENT_RATIO; +} + +function itemBottom(item) { + return item.baseline + item.fontSize * DESCENT_RATIO; +} + +function rangesOverlap(aMin, aMax, bMin, bMax) { + return aMin < bMax && bMin < aMax; +} + +/** + * Group runs sharing a baseline into lines. 
+ * + * @param {TextItem[]} items + * @returns {Line[]} + */ +export function groupLines(items) { + const sorted = [...items].sort((a, b) => a.baseline - b.baseline || a.x - b.x); + /** @type {Line[]} */ + const lines = []; + for (const item of sorted) { + const last = lines[lines.length - 1]; + const tol = Math.min(item.fontSize, last ? last.lineHeight : item.fontSize) * LINE_BASELINE_TOL; + const sameBaseline = last && Math.abs(item.baseline - last.baseline) <= tol; + // A wide horizontal gap on the same baseline is a column gutter, not a + // space — start a new line so the two columns cluster apart. + const gutter = Math.max(last ? last.lineHeight : item.fontSize, item.fontSize) * GUTTER_FACTOR; + const acrossGutter = last && item.x - last.right > gutter; + if (sameBaseline && !acrossGutter) { + last.items.push(item); + last.left = Math.min(last.left, item.x); + last.right = Math.max(last.right, item.x + item.width); + last.top = Math.min(last.top, itemTop(item)); + last.bottom = Math.max(last.bottom, itemBottom(item)); + last.lineHeight = Math.max(last.lineHeight, item.fontSize); + } else { + lines.push({ + baseline: item.baseline, + top: itemTop(item), + bottom: itemBottom(item), + left: item.x, + right: item.x + item.width, + lineHeight: item.fontSize, + items: [item], + }); + } + } + // Reading order within each line. + for (const line of lines) { + line.items.sort((a, b) => a.x - b.x); + } + return lines; +} + +/** + * Group lines into blocks (frames). Multiple blocks stay open at once so that + * two columns processed in interleaved vertical order don't merge. 
+ * + * @param {Line[]} lines + * @returns {TextBlock[]} + */ +export function groupBlocks(lines) { + /** @type {Array<{lines: Line[], left: number, right: number, lastBottom: number}>} */ + const open = []; + for (const line of [...lines].sort((a, b) => a.top - b.top)) { + let target = null; + for (const block of open) { + const gap = line.top - block.lastBottom; + const tol = BLOCK_GAP_FACTOR * line.lineHeight; + if (gap <= tol && rangesOverlap(block.left, block.right, line.left, line.right)) { + target = block; + break; + } + } + if (!target) { + target = { lines: [], left: line.left, right: line.right, lastBottom: -Infinity }; + open.push(target); + } + target.lines.push(line); + target.left = Math.min(target.left, line.left); + target.right = Math.max(target.right, line.right); + target.lastBottom = Math.max(target.lastBottom, line.bottom); + } + + return open.map((block) => { + const items = block.lines.flatMap((l) => l.items); + const minY = Math.min(...block.lines.map((l) => l.top)); + const maxY = Math.max(...block.lines.map((l) => l.bottom)); + return { + lines: block.lines, + items, + bounds: { minX: block.left, minY, maxX: block.right, maxY }, + }; + }); +} + +/** + * Full pipeline: positioned runs → text frames, in top-to-bottom reading order. + * + * @param {TextItem[]} items + * @returns {TextBlock[]} + */ +export function clusterIntoFrames(items) { + const withText = items.filter((it) => it.text && it.text.trim().length > 0); + if (withText.length === 0) return []; + const blocks = groupBlocks(groupLines(withText)); + return blocks.sort((a, b) => a.bounds.minY - b.bounds.minY || a.bounds.minX - b.bounds.minX); +} + +/** + * Count distinct columns: clusters of frames whose horizontal x-ranges don't + * overlap. Used for the multi-column fidelity check and round-trip reporting. 
+ * + * @param {TextBlock[]} blocks + * @returns {number} + */ +export function detectColumns(blocks) { + if (blocks.length === 0) return 0; + const intervals = blocks + .map((b) => ({ min: b.bounds.minX, max: b.bounds.maxX })) + .sort((a, b) => a.min - b.min); + let columns = 1; + let currentMax = intervals[0].max; + for (let i = 1; i < intervals.length; i += 1) { + if (intervals[i].min >= currentMax) { + columns += 1; + currentMax = intervals[i].max; + } else { + currentMax = Math.max(currentMax, intervals[i].max); + } + } + return columns; +} diff --git a/packages/pipeline/src/indesign/pdf/color.js b/packages/pipeline/src/indesign/pdf/color.js new file mode 100644 index 0000000..9be97bb --- /dev/null +++ b/packages/pipeline/src/indesign/pdf/color.js @@ -0,0 +1,93 @@ +// Color normalization for PDF fill operators, plus nearest-match against an +// IDML-derived swatch palette. +// +// PDF content streams set fill color via three operator families: +// rg -> DeviceRGB (pdfjs hands us 0..255 ints) +// g -> DeviceGray (pdfjs hands us a single 0..255 int) +// k -> DeviceCMYK (pdfjs hands us 0..1 floats) +// We collapse all of them to "#rrggbb". When a palette from a sibling IDML +// parse is available, we snap each detected color to the closest swatch so the +// PDF and IDML pipelines produce aligned token names downstream. 
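As a rough illustration of the normalization and snapping this header describes, the sketch below converts a CMYK fill to hex with the naive formula, then snaps a detected color to the nearest entry of a palette. The palette contents, the flat `{ name, hex }` swatch shape, and the helper names (`toHex`, `cmykToRgb`, `dist2`, `snap`) are assumptions made for brevity, not this module's API. The `24 * 24 * 3` cutoff mirrors the default of `nearestSwatch` in this file.

```javascript
// Illustrative sketch only: helper names and the palette are hypothetical.
const byte = (n) => Math.max(0, Math.min(255, Math.round(n))).toString(16).padStart(2, '0');
const toHex = ([r, g, b]) => `#${byte(r)}${byte(g)}${byte(b)}`;

// Naive DeviceCMYK (0..1) -> RGB (0..255); no ICC profile is involved.
const cmykToRgb = ([c, m, y, k]) => [
  255 * (1 - c) * (1 - k),
  255 * (1 - m) * (1 - k),
  255 * (1 - y) * (1 - k),
];

// "#rrggbb" -> [r, g, b]
const toRgb = (hex) => [1, 3, 5].map((i) => parseInt(hex.slice(i, i + 2), 16));

// Squared Euclidean distance in RGB; enough to rank candidates.
const dist2 = (a, b) => {
  const [r1, g1, b1] = toRgb(a);
  const [r2, g2, b2] = toRgb(b);
  return (r1 - r2) ** 2 + (g1 - g2) ** 2 + (b1 - b2) ** 2;
};

// Hypothetical palette standing in for swatches from a sibling IDML parse.
const palette = [
  { name: 'Brand Red', hex: '#ff0000' },
  { name: 'Ink', hex: '#000000' },
];

// Return the name of the closest swatch, or null if nothing is within the cutoff.
function snap(hex, maxDistance = 24 * 24 * 3) {
  let best = null;
  let bestDist = Infinity;
  for (const swatch of palette) {
    const d = dist2(hex, swatch.hex);
    if (d < bestDist) {
      bestDist = d;
      best = swatch;
    }
  }
  return best && bestDist <= maxDistance ? best.name : null;
}

const pureRed = toHex(cmykToRgb([0, 1, 1, 0])); // '#ff0000'
console.log(pureRed, snap('#fe0101'), snap('#808080')); // #ff0000 Brand Red null
```

A near-red such as `#fe0101` sits at squared distance 3 from the red swatch and snaps to it. Mid-gray is tens of thousands of units from either swatch, so it stays unmatched and would keep its own hex.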
+ +/** + * @param {number} n + * @returns {string} two-digit lowercase hex + */ +function hexByte(n) { + return Math.max(0, Math.min(255, Math.round(n))).toString(16).padStart(2, '0'); +} + +/** + * @param {[number, number, number]} rgb 0..255 per channel + * @returns {string} + */ +export function rgbToHex([r, g, b]) { + return `#${hexByte(r)}${hexByte(g)}${hexByte(b)}`; +} + +/** + * @param {number} gray 0..255 + * @returns {string} + */ +export function grayToHex(gray) { + return rgbToHex([gray, gray, gray]); +} + +/** + * DeviceCMYK (0..1) → hex via the same naive conversion the IDML graphic + * parser uses, so identical CMYK swatches land on identical hex in both pipelines. + * + * @param {[number, number, number, number]} cmyk 0..1 per channel + * @returns {string} + */ +export function cmykToHex([c, m, y, k]) { + const r = 255 * (1 - c) * (1 - k); + const g = 255 * (1 - m) * (1 - k); + const b = 255 * (1 - y) * (1 - k); + return rgbToHex([r, g, b]); +} + +/** + * @param {string} hex "#rrggbb" + * @returns {[number, number, number]} + */ +export function hexToRgb(hex) { + const m = /^#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(hex); + if (!m) return [0, 0, 0]; + return [parseInt(m[1], 16), parseInt(m[2], 16), parseInt(m[3], 16)]; +} + +/** + * Squared Euclidean distance in RGB. Squared is enough for "which is closest" + * and avoids a sqrt per comparison. + * + * @param {string} a "#rrggbb" + * @param {string} b "#rrggbb" + * @returns {number} + */ +export function colorDistance(a, b) { + const [r1, g1, b1] = hexToRgb(a); + const [r2, g2, b2] = hexToRgb(b); + return (r1 - r2) ** 2 + (g1 - g2) ** 2 + (b1 - b2) ** 2; +} + +/** + * Find the closest swatch in a palette, within a tolerance. + * + * @param {string} hex "#rrggbb" + * @param {Array} palette + * @param {number} [maxDistance] Squared-distance cutoff (default ~24/channel). 
 + * @returns {import('../ir.js').SwatchIR | null} + */ +export function nearestSwatch(hex, palette, maxDistance = 24 * 24 * 3) { + let best = null; + let bestDist = Infinity; + for (const swatch of palette) { + const dist = colorDistance(hex, swatch.color.hex); + if (dist < bestDist) { + bestDist = dist; + best = swatch; + } + } + return best && bestDist <= maxDistance ? best : null; +} diff --git a/packages/pipeline/src/indesign/pdf/extract.js b/packages/pipeline/src/indesign/pdf/extract.js new file mode 100644 index 0000000..7134551 --- /dev/null +++ b/packages/pipeline/src/indesign/pdf/extract.js @@ -0,0 +1,228 @@ +// The one module that talks to pdfjs. Everything pdfjs-shaped is converted here +// into plain data the pure modules (cluster, classify, color, png) can consume, +// so the heuristics stay unit-testable without a PDF engine. +// +// Per page we pull three things: +// - positioned text runs (getTextContent) → geometry + font key per run +// - font metadata (commonObjs) → real PostScript name + embedded flag +// - a walk of the operator list → fill-color-per-size samples, placed images, +// and whether any vector paths were drawn (which we cannot represent) +// +// Coordinates are converted from PDF's bottom-left origin to a top-left origin +// (points). The orchestrator scales points → px. + +import { rgbToHex, grayToHex, cmykToHex } from './color.js'; + +const SUBSET_PREFIX = /^[A-Z]{6}\+/; + +/** + * @param {[number, number, number, number, number, number]} m + * @param {number} x + * @param {number} y + * @returns {[number, number]} + */ +function applyMatrix(m, x, y) { + return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]]; +} + +/** Concatenate `cm` onto the current matrix (PDF row-vector convention).
*/ +function multiply(cm, ctm) { + return [ + cm[0] * ctm[0] + cm[1] * ctm[2], + cm[0] * ctm[1] + cm[1] * ctm[3], + cm[2] * ctm[0] + cm[3] * ctm[2], + cm[2] * ctm[1] + cm[3] * ctm[3], + cm[4] * ctm[0] + cm[5] * ctm[2] + ctm[4], + cm[4] * ctm[1] + cm[5] * ctm[3] + ctm[5], + ]; +} + +function parseFontName(psName, fontObj) { + const clean = psName.replace(SUBSET_PREFIX, ''); + const dash = clean.indexOf('-'); + let family = dash >= 0 ? clean.slice(0, dash) : clean; + let style = dash >= 0 ? clean.slice(dash + 1) : ''; + if (!style) { + if (fontObj?.bold && fontObj?.italic) style = 'Bold Italic'; + else if (fontObj?.bold) style = 'Bold'; + else if (fontObj?.italic) style = 'Italic'; + else style = 'Regular'; + } + // "Times-Roman" reads better as family Times, style Regular for web mapping. + if (style === 'Roman') style = 'Regular'; + return { family: family || clean, style }; +} + +function getImageObject(page, name) { + return new Promise((resolve) => { + try { + if (page.objs.has(name)) { + resolve(page.objs.get(name)); + } else { + page.objs.get(name, resolve); + } + } catch { + resolve(null); + } + }); +} + +/** + * @param {import('pdfjs-dist/legacy/build/pdf.mjs').PDFPageProxy} page + * @param {typeof import('pdfjs-dist/legacy/build/pdf.mjs')} pdfjs + * @returns {Promise<{ + * widthPt: number, + * heightPt: number, + * textItems: Array<{ text: string, x: number, baseline: number, width: number, fontSize: number, fontKey: string }>, + * fonts: Map, + * colorSamples: Array<{ fontSizePt: number, hex: string, glyphs: number }>, + * images: Array<{ x: number, y: number, width: number, height: number, image: object | null, failed: boolean }>, + * hasVector: boolean, + * }>} + */ +export async function extractPage(page, pdfjs) { + const { OPS } = pdfjs; + const [x0, y0, x1, y1] = page.view; + const widthPt = x1 - x0; + const heightPt = y1 - y0; + + // Operator list first: it populates objs/commonObjs and gives us colors, + // images, and vector presence. 
+ const opList = await page.getOperatorList(); + + let ctm = [1, 0, 0, 1, 0, 0]; + const ctmStack = []; + let fillHex = '#000000'; + let currentSize = 0; + let hasVector = false; + const colorSamples = []; + const images = []; + + const countGlyphs = (glyphs) => + Array.isArray(glyphs) ? glyphs.filter((g) => g && typeof g === 'object' && 'unicode' in g).length : 0; + + for (let i = 0; i < opList.fnArray.length; i += 1) { + const fn = opList.fnArray[i]; + const args = opList.argsArray[i]; + switch (fn) { + case OPS.save: + ctmStack.push(ctm); + break; + case OPS.restore: + ctm = ctmStack.pop() ?? ctm; + break; + case OPS.transform: + ctm = multiply(args, ctm); + break; + case OPS.setFillRGBColor: + fillHex = rgbToHex([args[0], args[1], args[2]]); + break; + case OPS.setFillGray: + fillHex = grayToHex(args[0]); + break; + case OPS.setFillCMYKColor: + fillHex = cmykToHex([args[0], args[1], args[2], args[3]]); + break; + case OPS.setFont: + currentSize = Math.abs(args[1]); + break; + case OPS.showText: + case OPS.showSpacedText: { + const glyphs = countGlyphs(args[0]); + if (glyphs > 0 && currentSize > 0) { + colorSamples.push({ fontSizePt: currentSize, hex: fillHex, glyphs }); + } + break; + } + case OPS.fill: + case OPS.eoFill: + case OPS.stroke: + case OPS.fillStroke: + case OPS.eoFillStroke: + case OPS.closeFillStroke: + case OPS.closeEOFillStroke: + case OPS.closeStroke: + hasVector = true; + break; + case OPS.paintImageXObject: + case OPS.paintImageXObjectRepeat: { + const name = args[0]; + // Unit square mapped through the CTM gives the placed rectangle. 
+ const c0 = applyMatrix(ctm, 0, 0); + const c1 = applyMatrix(ctm, 1, 1); + const left = Math.min(c0[0], c1[0]); + const right = Math.max(c0[0], c1[0]); + const bottom = Math.min(c0[1], c1[1]); + const top = Math.max(c0[1], c1[1]); + const obj = await getImageObject(page, name); + images.push({ + x: left, + y: heightPt - top, // flip to top-left origin + width: right - left, + height: top - bottom, + image: obj && obj.data ? obj : null, + failed: !(obj && obj.data), + }); + break; + } + case OPS.paintInlineImageXObject: { + const obj = args[0]; + const c0 = applyMatrix(ctm, 0, 0); + const c1 = applyMatrix(ctm, 1, 1); + const left = Math.min(c0[0], c1[0]); + const right = Math.max(c0[0], c1[0]); + const bottom = Math.min(c0[1], c1[1]); + const top = Math.max(c0[1], c1[1]); + images.push({ + x: left, + y: heightPt - top, + width: right - left, + height: top - bottom, + image: obj && obj.data ? obj : null, + failed: !(obj && obj.data), + }); + break; + } + case OPS.paintImageMaskXObject: + // Stencil masks paint the current fill through a 1-bit mask; there's + // no extractable raster, so we note it as vector-like content. + hasVector = true; + break; + default: + break; + } + } + + // Text geometry + font keys. + const textContent = await page.getTextContent(); + const fonts = new Map(); + const textItems = []; + for (const item of textContent.items) { + if (!('str' in item) || item.str.trim().length === 0) continue; + const t = item.transform; // [a,b,c,d,e,f] + const fontSize = Math.hypot(t[2], t[3]) || item.height || 0; + if (fontSize === 0) continue; + textItems.push({ + text: item.str, + x: t[4], + baseline: heightPt - t[5], + width: item.width, + fontSize, + fontKey: item.fontName, + }); + if (item.fontName && !fonts.has(item.fontName)) { + const obj = page.commonObjs.has(item.fontName) ? page.commonObjs.get(item.fontName) : null; + const psName = obj?.name ?? 
item.fontName; + const { family, style } = parseFontName(psName, obj); + fonts.set(item.fontName, { + name: psName, + family, + style, + embedded: !!obj && obj.missingFile === false, + type: obj?.type ?? 'unknown', + }); + } + } + + return { widthPt, heightPt, textItems, fonts, colorSamples, images, hasVector }; +} diff --git a/packages/pipeline/src/indesign/pdf/pdfjs.js b/packages/pipeline/src/indesign/pdf/pdfjs.js new file mode 100644 index 0000000..c5ab833 --- /dev/null +++ b/packages/pipeline/src/indesign/pdf/pdfjs.js @@ -0,0 +1,56 @@ +// Thin loader around pdfjs-dist's legacy build (the one that runs in Node +// without a DOM). We import it lazily so consumers that only ever parse IDML +// don't pull pdfjs — a multi-megabyte dependency — into their bundle/startup. + +let pdfjsPromise; + +// pdfjs-dist 4.x calls Promise.withResolvers, which only exists on Node 22+. +// This package supports Node >=20, so polyfill it (guarded) before pdfjs loads. +if (typeof Promise.withResolvers !== 'function') { + Promise.withResolvers = function withResolvers() { + let resolve; + let reject; + const promise = new Promise((res, rej) => { + resolve = res; + reject = rej; + }); + return { promise, resolve, reject }; + }; +} + +/** + * Resolve the pdfjs module once and cache it. 
+ * @returns {Promise} + */ +export function loadPdfjs() { + if (!pdfjsPromise) { + pdfjsPromise = import('pdfjs-dist/legacy/build/pdf.mjs'); + } + return pdfjsPromise; +} + +/** + * Open a PDF document from raw bytes with settings tuned for headless, + * extraction-only use: + * - no worker / no eval (we never render to a canvas) + * - verbosity errors-only (base-14 fonts otherwise spam "standard font data" + * warnings we handle ourselves via fidelity warnings) + * - a private copy of the bytes, because pdfjs transfers/detaches the buffer + * + * @param {Uint8Array} bytes + * @returns {Promise} + */ +export async function openDocument(bytes) { + const pdfjs = await loadPdfjs(); + // pdfjs requires a *plain* Uint8Array and detaches the buffer it's given. + // Node's fs.readFile returns a Buffer (a Uint8Array subclass) which pdfjs + // rejects, so always copy into a fresh Uint8Array we can safely hand over. + const data = new Uint8Array(bytes.byteLength); + data.set(bytes); + return pdfjs.getDocument({ + data, + isEvalSupported: false, + useSystemFonts: false, + verbosity: 0, + }).promise; +} diff --git a/packages/pipeline/src/indesign/pdf/png.js b/packages/pipeline/src/indesign/pdf/png.js new file mode 100644 index 0000000..7ea0562 --- /dev/null +++ b/packages/pipeline/src/indesign/pdf/png.js @@ -0,0 +1,117 @@ +// Minimal PNG encoder for extracted image data. +// +// pdfjs hands back *decoded* pixels (RGB/RGBA/grayscale), not the original +// encoded stream, so we re-encode to a real image file the downstream media +// importer can use. PNG is lossless and dependency-free here: deflate comes +// from node:zlib, and CRC-32 is a tiny table we build once. We deliberately +// avoid pulling in an image library — a fallback parser shouldn't add weight. + +import { deflateSync } from 'node:zlib'; + +// pdfjs ImageKind values (stable across pdfjs 4.x). Re-declared so we don't +// depend on importing the enum from the lazily-loaded pdfjs module. 
+export const ImageKind = { + GRAYSCALE_1BPP: 1, + RGB_24BPP: 2, + RGBA_32BPP: 3, +}; + +const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]); + +const CRC_TABLE = (() => { + const table = new Uint32Array(256); + for (let n = 0; n < 256; n += 1) { + let c = n; + for (let k = 0; k < 8; k += 1) { + c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1; + } + table[n] = c >>> 0; + } + return table; +})(); + +function crc32(buf) { + let c = 0xffffffff; + for (let i = 0; i < buf.length; i += 1) { + c = CRC_TABLE[(c ^ buf[i]) & 0xff] ^ (c >>> 8); + } + return (c ^ 0xffffffff) >>> 0; +} + +function chunk(type, data) { + const typeBuf = Buffer.from(type, 'latin1'); + const len = Buffer.alloc(4); + len.writeUInt32BE(data.length, 0); + const crc = Buffer.alloc(4); + crc.writeUInt32BE(crc32(Buffer.concat([typeBuf, data])), 0); + return Buffer.concat([len, typeBuf, data, crc]); +} + +/** + * Normalize pdfjs pixel data to packed RGB or RGBA scanlines. + * + * @param {{width: number, height: number, kind: number, data: Uint8Array|Uint8ClampedArray}} image + * @returns {{channels: 3|4, pixels: Uint8Array}} + */ +function toPixels(image) { + const { width, height, kind, data } = image; + if (kind === ImageKind.RGBA_32BPP) { + return { channels: 4, pixels: data instanceof Uint8Array ? data : Uint8Array.from(data) }; + } + if (kind === ImageKind.RGB_24BPP) { + return { channels: 3, pixels: data instanceof Uint8Array ? data : Uint8Array.from(data) }; + } + if (kind === ImageKind.GRAYSCALE_1BPP) { + // 1 bit per pixel, MSB-first, rows padded to whole bytes. Expand to RGB. + const rowBytes = (width + 7) >> 3; + const pixels = new Uint8Array(width * height * 3); + for (let y = 0; y < height; y += 1) { + for (let x = 0; x < width; x += 1) { + const bit = (data[y * rowBytes + (x >> 3)] >> (7 - (x & 7))) & 1; + const v = bit ? 
255 : 0; + const o = (y * width + x) * 3; + pixels[o] = v; + pixels[o + 1] = v; + pixels[o + 2] = v; + } + } + return { channels: 3, pixels }; + } + throw new Error(`unsupported image kind ${kind}`); +} + +/** + * Encode decoded pixel data as a PNG buffer. + * + * @param {{width: number, height: number, kind: number, data: Uint8Array|Uint8ClampedArray}} image + * @returns {Buffer} + */ +export function encodePng(image) { + const { width, height } = image; + const { channels, pixels } = toPixels(image); + + // Prefix each scanline with a filter byte (0 = none). Keeps the encoder + // trivial; deflate still compresses solid regions well. + const stride = width * channels; + const raw = Buffer.alloc((stride + 1) * height); + for (let y = 0; y < height; y += 1) { + raw[y * (stride + 1)] = 0; + Buffer.from(pixels.buffer, pixels.byteOffset + y * stride, stride).copy(raw, y * (stride + 1) + 1); + } + + const ihdr = Buffer.alloc(13); + ihdr.writeUInt32BE(width, 0); + ihdr.writeUInt32BE(height, 4); + ihdr[8] = 8; // bit depth + ihdr[9] = channels === 4 ? 6 : 2; // color type: 6 = RGBA, 2 = RGB + ihdr[10] = 0; // compression + ihdr[11] = 0; // filter + ihdr[12] = 0; // interlace + + return Buffer.concat([ + PNG_SIGNATURE, + chunk('IHDR', ihdr), + chunk('IDAT', deflateSync(raw)), + chunk('IEND', Buffer.alloc(0)), + ]); +} diff --git a/packages/pipeline/tests/indesign/helpers/build-pdf.js b/packages/pipeline/tests/indesign/helpers/build-pdf.js new file mode 100644 index 0000000..593850a --- /dev/null +++ b/packages/pipeline/tests/indesign/helpers/build-pdf.js @@ -0,0 +1,298 @@ +// In-memory PDF fixture builder. Tests call buildPdf({...}) and get back a +// Uint8Array they can hand straight to parsePdfBuffer(). +// +// Mirrors the philosophy of build-idml.js: no binary blobs in git, fixtures are +// generated from a readable spec. 
We emit the small subset of PDF the fallback +// parser reads — positioned text runs in base-14 (non-embedded) fonts, placed +// FlateDecode/DeviceRGB image XObjects, and optional vector fills. +// +// Coordinates in the spec use a TOP-LEFT origin measured in points (y grows +// downward), because that matches how the parser reports geometry and how a +// designer thinks about a page. We flip to PDF's native bottom-left origin +// while writing. + +import { deflateSync } from 'node:zlib'; + +const DEFAULT_PAGE = { width: 612, height: 792 }; // US Letter, points. + +/** + * @typedef {Object} TextSpec + * @property {string} text + * @property {number} x Left edge, points from page left. + * @property {number} y Baseline, points from page top. + * @property {number} size Font size in points. + * @property {string} [font] Base-14 font name (default 'Helvetica'). + * @property {[number, number, number]} [color] Fill color, 0..1 RGB (default black). + * + * @typedef {Object} ImageSpec + * @property {number} x Left edge, points from page left. + * @property {number} y Top edge, points from page top. + * @property {number} width Display width in points. + * @property {number} height Display height in points. + * @property {{width: number, height: number, data: Uint8Array}} rgb Raw 8bpc RGB pixels (width*height*3 bytes). + * + * @typedef {Object} RectSpec + * @property {number} x + * @property {number} y + * @property {number} width + * @property {number} height + * @property {[number, number, number]} [color] + * + * @typedef {Object} PageSpec + * @property {number} [width] + * @property {number} [height] + * @property {TextSpec[]} [texts] + * @property {ImageSpec[]} [images] + * @property {RectSpec[]} [rects] + * + * @typedef {Object} BuildPdfOptions + * @property {string} [title] Document title (/Info /Title). 
+ * @property {PageSpec[]} pages + */ + +/** + * @param {BuildPdfOptions} options + * @returns {Uint8Array} + */ +export function buildPdf(options) { + const pages = (options.pages ?? []).map((p) => ({ + width: p.width ?? DEFAULT_PAGE.width, + height: p.height ?? DEFAULT_PAGE.height, + texts: p.texts ?? [], + images: p.images ?? [], + rects: p.rects ?? [], + })); + + const writer = new ObjectWriter(); + + // Object 1 is the catalog, object 2 the pages tree. We reserve them up front + // so child page objects can reference the parent by a known id. + const catalogId = writer.reserve(); + const pagesId = writer.reserve(); + + const pageIds = []; + for (const page of pages) { + pageIds.push(buildPageObjects(writer, page, pagesId)); + } + + writer.define(catalogId, `<< /Type /Catalog /Pages ${pagesId} 0 R >>`); + writer.define( + pagesId, + `<< /Type /Pages /Kids [ ${pageIds.map((id) => `${id} 0 R`).join(' ')} ] /Count ${pageIds.length} >>`, + ); + + let infoId; + if (options.title) { + infoId = writer.add(`<< /Title (${escapePdfString(options.title)}) /Producer (flavian-test) >>`); + } + + return writer.serialize(catalogId, infoId); +} + +/** + * Emit the content stream + page dict + its resource objects. + * Returns the page object id. + */ +function buildPageObjects(writer, page, pagesId) { + const fontResources = new Map(); // base-font name -> resource key (F1, F2…) + const xobjectResources = new Map(); // image object id -> resource key (Im1…) + + const ops = []; + + // Vector fills first (drawn underneath). + for (const rect of page.rects) { + const [r, g, b] = rect.color ?? [0, 0, 0]; + const yPdf = page.height - rect.y - rect.height; + ops.push(`${fmt(r)} ${fmt(g)} ${fmt(b)} rg`); + ops.push(`${fmt(rect.x)} ${fmt(yPdf)} ${fmt(rect.width)} ${fmt(rect.height)} re f`); + } + + // Images. 
+ for (const image of page.images) { + const imgId = writer.add(imageXObject(image.rgb)); + let key = xobjectResources.get(imgId); + if (!key) { + key = `Im${xobjectResources.size + 1}`; + xobjectResources.set(imgId, key); + } + const yPdf = page.height - image.y - image.height; + // cm maps the unit square to the placement rectangle. + ops.push('q'); + ops.push(`${fmt(image.width)} 0 0 ${fmt(image.height)} ${fmt(image.x)} ${fmt(yPdf)} cm`); + ops.push(`/${key} Do`); + ops.push('Q'); + } + + // Text runs. + for (const t of page.texts) { + const fontName = t.font ?? 'Helvetica'; + let fontKey = fontResources.get(fontName); + if (!fontKey) { + fontKey = `F${fontResources.size + 1}`; + fontResources.set(fontName, fontKey); + } + const [r, g, b] = t.color ?? [0, 0, 0]; + const yPdf = page.height - t.y; + ops.push('BT'); + ops.push(`${fmt(r)} ${fmt(g)} ${fmt(b)} rg`); + ops.push(`/${fontKey} ${fmt(t.size)} Tf`); + ops.push(`${fmt(t.x)} ${fmt(yPdf)} Td`); + ops.push(`(${escapePdfString(t.text)}) Tj`); + ops.push('ET'); + } + + const contentStream = ops.join('\n') + '\n'; + const contentId = writer.add(streamObject('<< /Length LEN >>', Buffer.from(contentStream, 'latin1'))); + + // Font objects. 
+ const fontEntries = []; + for (const [baseFont, key] of fontResources) { + const fontId = writer.add( + `<< /Type /Font /Subtype /Type1 /BaseFont /${baseFont} /Encoding /WinAnsiEncoding >>`, + ); + fontEntries.push(`/${key} ${fontId} 0 R`); + } + + const xobjectEntries = []; + for (const [imgId, key] of xobjectResources) { + xobjectEntries.push(`/${key} ${imgId} 0 R`); + } + + const resourceParts = ['/ProcSet [ /PDF /Text /ImageC ]']; + if (fontEntries.length > 0) { + resourceParts.push(`/Font << ${fontEntries.join(' ')} >>`); + } + if (xobjectEntries.length > 0) { + resourceParts.push(`/XObject << ${xobjectEntries.join(' ')} >>`); + } + + return writer.add( + `<< /Type /Page /Parent ${pagesId} 0 R /MediaBox [ 0 0 ${fmt(page.width)} ${fmt(page.height)} ] ` + + `/Resources << ${resourceParts.join(' ')} >> /Contents ${contentId} 0 R >>`, + ); +} + +/** + * A FlateDecode/DeviceRGB 8-bit image XObject. pdfjs decodes this to raw RGB + * without needing a canvas, which is exactly what the parser's extractor reads. + */ +function imageXObject(rgb) { + const raw = Buffer.from(rgb.data.buffer ?? rgb.data, rgb.data.byteOffset ?? 0, rgb.data.byteLength ?? rgb.data.length); + const compressed = deflateSync(raw); + const dict = + `<< /Type /XObject /Subtype /Image /Width ${rgb.width} /Height ${rgb.height} ` + + `/ColorSpace /DeviceRGB /BitsPerComponent 8 /Filter /FlateDecode /Length LEN >>`; + return streamObject(dict, compressed); +} + +/** Marker so the writer knows this object carries a binary stream payload. */ +function streamObject(dict, payload) { + return { dict, payload }; +} + +class ObjectWriter { + constructor() { + /** @type {Array} */ + this.objects = []; + } + + /** Reserve an id, to be filled in later with define(). */ + reserve() { + this.objects.push(null); + return this.objects.length; + } + + define(id, body) { + this.objects[id - 1] = body; + } + + /** Append a fully-formed object and return its id. 
*/ + add(body) { + this.objects.push(body); + return this.objects.length; + } + + /** + * Assemble the file with a correct classic xref table. + * @param {number} rootId + * @param {number} [infoId] + * @returns {Uint8Array} + */ + serialize(rootId, infoId) { + const chunks = []; + let offset = 0; + const offsets = new Array(this.objects.length + 1).fill(0); + + const push = (buf) => { + chunks.push(buf); + offset += buf.length; + }; + + // Binary marker comment keeps tools (and pdfjs heuristics) treating the + // file as binary. + push(Buffer.from('%PDF-1.7\n%\xE2\xE3\xCF\xD3\n', 'latin1')); + + for (let i = 0; i < this.objects.length; i += 1) { + const id = i + 1; + const body = this.objects[i]; + if (body === null) { + throw new Error(`PDF object ${id} was reserved but never defined`); + } + offsets[id] = offset; + push(Buffer.from(`${id} 0 obj\n`, 'latin1')); + if (typeof body === 'string') { + push(Buffer.from(body + '\n', 'latin1')); + } else { + const dict = body.dict.replace('LEN', String(body.payload.length)); + push(Buffer.from(dict + '\nstream\n', 'latin1')); + push(body.payload); + push(Buffer.from('\nendstream\n', 'latin1')); + } + push(Buffer.from('endobj\n', 'latin1')); + } + + const xrefOffset = offset; + const count = this.objects.length + 1; + let xref = `xref\n0 ${count}\n0000000000 65535 f \n`; + for (let id = 1; id < count; id += 1) { + xref += `${String(offsets[id]).padStart(10, '0')} 00000 n \n`; + } + push(Buffer.from(xref, 'latin1')); + + const trailerParts = [`/Size ${count}`, `/Root ${rootId} 0 R`]; + if (infoId) { + trailerParts.push(`/Info ${infoId} 0 R`); + } + push(Buffer.from(`trailer\n<< ${trailerParts.join(' ')} >>\nstartxref\n${xrefOffset}\n%%EOF\n`, 'latin1')); + + return new Uint8Array(Buffer.concat(chunks)); + } +} + +/** Format a number for PDF content streams: trim trailing zeros, no exponent. 
*/ +function fmt(n) { + if (!Number.isFinite(n)) return '0'; + return (Math.round(n * 1000) / 1000).toString(); +} + +function escapePdfString(s) { + return String(s).replace(/\\/g, '\\\\').replace(/\(/g, '\\(').replace(/\)/g, '\\)'); +} + +/** + * Convenience: a solid-color RGB image buffer for image fixtures. + * + * @param {number} width + * @param {number} height + * @param {[number, number, number]} rgb 0..255 per channel. + * @returns {{width: number, height: number, data: Uint8Array}} + */ +export function solidRgbImage(width, height, [r, g, b]) { + const data = new Uint8Array(width * height * 3); + for (let i = 0; i < width * height; i += 1) { + data[i * 3] = r; + data[i * 3 + 1] = g; + data[i * 3 + 2] = b; + } + return { width, height, data }; +} diff --git a/packages/pipeline/tests/indesign/parse-pdf.test.mjs b/packages/pipeline/tests/indesign/parse-pdf.test.mjs new file mode 100644 index 0000000..cb68b68 --- /dev/null +++ b/packages/pipeline/tests/indesign/parse-pdf.test.mjs @@ -0,0 +1,215 @@ +// Integration tests: drive the whole PDF parser against programmatically-built +// fixtures (text-heavy, image-heavy, multi-column, single-page brochure) and +// assert the reconstructed IR validates and matches expectations. 
+ +import { test } from 'node:test'; +import assert from 'node:assert/strict'; +import { promises as fs } from 'node:fs'; +import os from 'node:os'; +import path from 'node:path'; +import { parsePdf, parsePdfBuffer } from '../../src/indesign/parse-pdf.js'; +import { Document } from '../../src/indesign/ir.js'; +import { buildPdf, solidRgbImage } from './helpers/build-pdf.js'; + +const warningCodes = (ir) => ir.warnings.map((w) => w.code); +const textFrames = (ir) => ir.spreads.flatMap((s) => s.frames.filter((f) => f.kind === 'text')); +const imageFrames = (ir) => ir.spreads.flatMap((s) => s.frames.filter((f) => f.kind === 'image')); + +function bodyLines(count, { x = 72, startY = 120, size = 11, leading = 15, font = 'Helvetica' } = {}) { + return Array.from({ length: count }, (_, i) => ({ + text: `Body copy line number ${i + 1} with enough words to be realistic.`, + x, + y: startY + i * leading, + size, + font, + })); +} + +test('text-heavy PDF: validates, one body style, single clustered frame', async () => { + const ir = await parsePdfBuffer(buildPdf({ pages: [{ texts: bodyLines(8) }] })); + const validated = Document.parse(ir); + assert.equal(validated.irVersion, 1); + assert.equal(validated.spreads.length, 1); + + const bodyStyles = ir.styles.filter((s) => s.properties.role === 'body'); + assert.equal(bodyStyles.length, 1); + // 8 evenly-spaced lines in one column collapse to a single text frame. 
+ assert.equal(textFrames(ir).length, 1); + const story = ir.stories.find((s) => s.id === textFrames(ir)[0].storyRef); + assert.ok(story.runs.every((r) => r.paragraphStyleRef === 'pdf-style-body')); +}); + +test('image-heavy PDF: every image becomes an addressable image frame', async () => { + const ir = await parsePdfBuffer( + buildPdf({ + pages: [ + { + texts: [{ text: 'Gallery', x: 72, y: 90, size: 18, font: 'Helvetica-Bold' }], + images: [ + { x: 72, y: 120, width: 150, height: 120, rgb: solidRgbImage(6, 5, [200, 30, 30]) }, + { x: 240, y: 120, width: 150, height: 120, rgb: solidRgbImage(6, 5, [30, 200, 60]) }, + { x: 72, y: 280, width: 150, height: 120, rgb: solidRgbImage(6, 5, [30, 60, 200]) }, + ], + }, + ], + }), + ); + Document.parse(ir); + const imgs = imageFrames(ir); + assert.equal(imgs.length, 3); + assert.ok(imgs.every((f) => f.embedded === true)); + assert.deepEqual( + imgs.map((f) => f.href), + ['assets/pdf-p001-img001.png', 'assets/pdf-p001-img002.png', 'assets/pdf-p001-img003.png'], + ); + assert.ok(imgs.every((f) => f.bounds.width > 0 && f.bounds.height > 0)); +}); + +test('multi-column PDF: columns become separate frames + a warning', async () => { + const left = bodyLines(4, { x: 72, startY: 120 }).map((t) => ({ ...t, text: 'Left ' + t.text.slice(0, 20) })); + const right = bodyLines(4, { x: 340, startY: 120 }).map((t) => ({ ...t, text: 'Right ' + t.text.slice(0, 20) })); + const ir = await parsePdfBuffer(buildPdf({ pages: [{ texts: [...left, ...right] }] })); + Document.parse(ir); + + const frames = textFrames(ir).sort((a, b) => a.bounds.x - b.bounds.x); + assert.ok(frames.length >= 2, `expected >=2 text frames, got ${frames.length}`); + // The two columns don't horizontally overlap. 
+ assert.ok(frames[0].bounds.x + frames[0].bounds.width <= frames[1].bounds.x); + assert.ok(warningCodes(ir).includes('multi-column-layout')); +}); + +test('single-page brochure: heading + body + caption + image all reconstructed', async () => { + const ir = await parsePdfBuffer( + buildPdf({ + title: 'Brochure', + pages: [ + { + texts: [ + { text: 'Welcome', x: 72, y: 90, size: 36, font: 'Helvetica-Bold', color: [0, 0.4, 0.8] }, + ...bodyLines(3, { startY: 150, size: 12 }), + { text: 'Figure 1: the hero image.', x: 72, y: 470, size: 8, color: [0.4, 0.4, 0.4] }, + ], + images: [{ x: 72, y: 220, width: 240, height: 160, rgb: solidRgbImage(8, 6, [120, 120, 120]) }], + }, + ], + }), + ); + Document.parse(ir); + assert.equal(ir.meta.name, 'Brochure'); + + const roles = new Set(ir.styles.map((s) => s.properties.role)); + assert.ok(roles.has('heading') && roles.has('body') && roles.has('caption')); + assert.equal(imageFrames(ir).length, 1); + assert.ok(textFrames(ir).length >= 3); // heading, body, caption separated + + // The heading style resolves both a font and a swatch. 
+ const h1 = ir.styles.find((s) => s.id === 'pdf-style-h1'); + assert.ok(h1.fontRef && ir.fonts.some((f) => f.id === h1.fontRef)); + assert.ok(h1.fillColorRef && ir.swatches.some((s) => s.id === h1.fillColorRef)); +}); + +test('fidelity warnings are always present and describe approximations', async () => { + const ir = await parsePdfBuffer( + buildPdf({ + pages: [ + { + texts: [{ text: 'Hello world', x: 72, y: 90, size: 12, color: [0.1, 0.1, 0.1] }], + rects: [{ x: 0, y: 0, width: 612, height: 60, color: [0.9, 0.9, 0.9] }], + }, + ], + }), + ); + const codes = warningCodes(ir); + assert.ok(codes.includes('pdf-fallback')); + assert.ok(codes.includes('text-reconstructed-from-glyphs')); + assert.ok(codes.includes('styles-synthesized')); + assert.ok(codes.includes('no-embedded-fonts')); // base-14, not embedded + assert.ok(codes.includes('color-attribution-approximate')); + assert.ok(codes.includes('vector-paths-dropped')); // the rect fill +}); + +test('assetCacheDir: extracted images are written as readable PNGs', async () => { + const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'flavian-pdf-')); + try { + const ir = await parsePdfBuffer( + buildPdf({ + pages: [ + { + texts: [{ text: 'Pic', x: 72, y: 90, size: 12 }], + images: [{ x: 72, y: 120, width: 100, height: 80, rgb: solidRgbImage(4, 3, [10, 20, 30]) }], + }, + ], + }), + { assetCacheDir: dir }, + ); + const href = imageFrames(ir)[0].href; + const buf = await fs.readFile(path.join(dir, href)); + assert.ok(buf.subarray(0, 4).equals(Buffer.from([0x89, 0x50, 0x4e, 0x47]))); + assert.equal(buf.readUInt32BE(16), 4); // IHDR width + assert.equal(buf.readUInt32BE(20), 3); // IHDR height + } finally { + await fs.rm(dir, { recursive: true, force: true }); + } +}); + +test('swatchPalette: detected colors snap to IDML swatch ids', async () => { + const palette = [{ id: 'col-brand', name: 'Brand Blue', color: { hex: '#0066cc', space: 'RGB' } }]; + const ir = await parsePdfBuffer( + // 0,0.4,0.8 → #0066cc exactly; use a 
slightly-off shade to prove snapping. + buildPdf({ pages: [{ texts: [{ text: 'Brand', x: 72, y: 90, size: 24, color: [0.01, 0.4, 0.79] }] }] }), + { swatchPalette: palette }, + ); + assert.ok(ir.swatches.some((s) => s.id === 'col-brand')); + const h1 = ir.styles.find((s) => s.properties.role === 'heading') ?? ir.styles[0]; + assert.equal(h1.fillColorRef, 'col-brand'); +}); + +test('dpi scales geometry linearly', async () => { + const make = (dpi) => + parsePdfBuffer(buildPdf({ pages: [{ texts: [{ text: 'Scale me', x: 72, y: 100, size: 12 }] }] }), { dpi }); + const lo = await make(72); + const hi = await make(144); + const loFrame = lo.spreads[0].frames.find((f) => f.kind === 'text').bounds; + const hiFrame = hi.spreads[0].frames.find((f) => f.kind === 'text').bounds; + assert.ok(Math.abs(hiFrame.width - loFrame.width * 2) < 0.01); + assert.ok(Math.abs(hiFrame.x - loFrame.x * 2) < 0.01); +}); + +test('multi-page PDF yields one spread per page', async () => { + const page = { texts: [{ text: 'Page text', x: 72, y: 90, size: 12 }] }; + const ir = await parsePdfBuffer(buildPdf({ pages: [page, page, page] })); + assert.equal(ir.spreads.length, 3); + assert.deepEqual(ir.spreads.map((s) => s.source), ['pdf:page:1', 'pdf:page:2', 'pdf:page:3']); +}); + +test('throws on bytes that are not a PDF', async () => { + const garbage = new TextEncoder().encode('this is definitely not a pdf'); + await assert.rejects(() => parsePdfBuffer(garbage), /could not be opened/i); +}); + +test('parsePdf reads from disk (Buffer input) and prefers the embedded /Title', async () => { + const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'flavian-pdf-disk-')); + try { + const file = path.join(dir, 'report-2026.pdf'); + await fs.writeFile(file, buildPdf({ title: 'Quarterly Report', pages: [{ texts: bodyLines(3) }] })); + const ir = await parsePdf(file); + Document.parse(ir); + // Embedded /Title beats the filename fallback. 
+ assert.equal(ir.meta.name, 'Quarterly Report'); + assert.equal(ir.spreads.length, 1); + } finally { + await fs.rm(dir, { recursive: true, force: true }); + } +}); + +test('parsePdf falls back to the filename when there is no /Title', async () => { + const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'flavian-pdf-disk-')); + try { + const file = path.join(dir, 'untitled-doc.pdf'); + await fs.writeFile(file, buildPdf({ pages: [{ texts: bodyLines(2) }] })); + const ir = await parsePdf(file); + assert.equal(ir.meta.name, 'untitled-doc'); + } finally { + await fs.rm(dir, { recursive: true, force: true }); + } +}); diff --git a/packages/pipeline/tests/indesign/pdf-classify.test.mjs b/packages/pipeline/tests/indesign/pdf-classify.test.mjs new file mode 100644 index 0000000..b2f8b17 --- /dev/null +++ b/packages/pipeline/tests/indesign/pdf-classify.test.mjs @@ -0,0 +1,78 @@ +// Style synthesis from font-size buckets (pure, no pdfjs). + +import { test } from 'node:test'; +import assert from 'node:assert/strict'; +import { classifyStyles } from '../../src/indesign/pdf/classify.js'; + +function makeInput() { + return { + dpi: 96, + items: [ + { fontSize: 36, fontKey: 'pdf-font-helvetica-bold', text: 'Title' }, + // Body dominates by character count. + { fontSize: 12, fontKey: 'pdf-font-helvetica', text: 'The quick brown fox jumps over the lazy dog.' }, + { fontSize: 12, fontKey: 'pdf-font-helvetica', text: 'Another full line of ordinary body copy here.' 
}, + { fontSize: 8, fontKey: 'pdf-font-helvetica', text: 'fig' }, + ], + colorSamples: [ + { fontSizePt: 36, hex: '#0066cc', glyphs: 5 }, + { fontSizePt: 12, hex: '#111111', glyphs: 80 }, + { fontSizePt: 8, hex: '#888888', glyphs: 3 }, + ], + }; +} + +test('largest size becomes Heading 1, most-used becomes Body, smallest becomes Caption', () => { + const { buckets } = classifyStyles(makeInput()); + const byRole = Object.fromEntries(buckets.map((b) => [b.role, b])); + assert.equal(byRole.heading.id, 'pdf-style-h1'); + assert.equal(byRole.heading.sizePt, 36); + assert.equal(byRole.heading.fontSizePx, 48); // 36pt @ 96dpi + assert.equal(byRole.body.id, 'pdf-style-body'); + assert.equal(byRole.body.sizePt, 12); + assert.equal(byRole.caption.id, 'pdf-style-caption'); + assert.equal(byRole.caption.sizePt, 8); +}); + +test('buckets carry the dominant font and color for each size', () => { + const { buckets } = classifyStyles(makeInput()); + const body = buckets.find((b) => b.role === 'body'); + assert.equal(body.dominantFontKey, 'pdf-font-helvetica'); + assert.equal(body.dominantHex, '#111111'); + const heading = buckets.find((b) => b.role === 'heading'); + assert.equal(heading.dominantFontKey, 'pdf-font-helvetica-bold'); + assert.equal(heading.dominantHex, '#0066cc'); +}); + +test('styleIdForSize maps a size back to its bucket id', () => { + const { styleIdForSize } = classifyStyles(makeInput()); + assert.equal(styleIdForSize(36), 'pdf-style-h1'); + assert.equal(styleIdForSize(12), 'pdf-style-body'); + assert.equal(styleIdForSize(8), 'pdf-style-caption'); + assert.equal(styleIdForSize(99), undefined); +}); + +test('multiple heading sizes get descending levels', () => { + const { buckets } = classifyStyles({ + dpi: 96, + items: [ + { fontSize: 48, fontKey: 'f', text: 'Big' }, + { fontSize: 24, fontKey: 'f', text: 'Med' }, + { fontSize: 10, fontKey: 'f', text: 'lots of body copy lots of body copy' }, + ], + }); + const headings = buckets.filter((b) => b.role === 
'heading').sort((a, b) => b.sizePt - a.sizePt); + assert.equal(headings[0].id, 'pdf-style-h1'); + assert.equal(headings[0].sizePt, 48); + assert.equal(headings[1].id, 'pdf-style-h2'); + assert.equal(headings[1].sizePt, 24); +}); + +test('a single font size yields only a Body bucket', () => { + const { buckets } = classifyStyles({ + dpi: 96, + items: [{ fontSize: 11, fontKey: 'f', text: 'uniform text everywhere' }], + }); + assert.equal(buckets.length, 1); + assert.equal(buckets[0].role, 'body'); +}); diff --git a/packages/pipeline/tests/indesign/pdf-cluster.test.mjs b/packages/pipeline/tests/indesign/pdf-cluster.test.mjs new file mode 100644 index 0000000..fce90ca --- /dev/null +++ b/packages/pipeline/tests/indesign/pdf-cluster.test.mjs @@ -0,0 +1,60 @@ +// Positional clustering heuristics (pure, no pdfjs). + +import { test } from 'node:test'; +import assert from 'node:assert/strict'; +import { groupLines, clusterIntoFrames, detectColumns } from '../../src/indesign/pdf/cluster.js'; + +function item(text, x, baseline, { width = 50, fontSize = 12, fontKey = 'f1' } = {}) { + return { text, x, baseline, width, fontSize, fontKey }; +} + +test('groupLines merges runs sharing a baseline', () => { + const lines = groupLines([ + item('Hello', 72, 100), + item('World', 130, 100.2), // within baseline tolerance + item('Next', 72, 130), // new line + ]); + assert.equal(lines.length, 2); + assert.equal(lines[0].items.length, 2); + assert.equal(lines[0].items[0].text, 'Hello'); // sorted by x + assert.equal(lines[1].items.length, 1); +}); + +test('clusterIntoFrames keeps adjacent body lines in one frame', () => { + const frames = clusterIntoFrames([ + item('Line one of the paragraph', 72, 100), + item('Line two of the paragraph', 72, 116), + item('Line three of the paragraph', 72, 132), + ]); + assert.equal(frames.length, 1); + assert.equal(frames[0].lines.length, 3); +}); + +test('clusterIntoFrames splits a far-apart block into a second frame', () => { + const frames = 
clusterIntoFrames([ + item('Top block', 72, 100), + item('Bottom block far below', 72, 500), + ]); + assert.equal(frames.length, 2); +}); + +test('clusterIntoFrames separates side-by-side columns', () => { + const left = [item('L1', 72, 100), item('L2', 72, 116), item('L3', 72, 132)]; + const right = [item('R1', 340, 100), item('R2', 340, 116), item('R3', 340, 132)]; + const frames = clusterIntoFrames([...left, ...right]); + assert.equal(frames.length, 2); + assert.equal(detectColumns(frames), 2); + // Frames don't horizontally overlap. + const [a, b] = frames.sort((x, y) => x.bounds.minX - y.bounds.minX); + assert.ok(a.bounds.maxX <= b.bounds.minX); +}); + +test('detectColumns is 1 for a single column', () => { + const frames = clusterIntoFrames([item('A', 72, 100), item('B', 72, 120)]); + assert.equal(detectColumns(frames), 1); +}); + +test('clusterIntoFrames ignores whitespace-only runs', () => { + const frames = clusterIntoFrames([item(' ', 72, 100), item('', 80, 100)]); + assert.equal(frames.length, 0); +}); diff --git a/packages/pipeline/tests/indesign/pdf-color.test.mjs b/packages/pipeline/tests/indesign/pdf-color.test.mjs new file mode 100644 index 0000000..a17b861 --- /dev/null +++ b/packages/pipeline/tests/indesign/pdf-color.test.mjs @@ -0,0 +1,47 @@ +// Color normalization + nearest-swatch matching (pure, no pdfjs). 
+ +import { test } from 'node:test'; +import assert from 'node:assert/strict'; +import { rgbToHex, grayToHex, cmykToHex, hexToRgb, colorDistance, nearestSwatch } from '../../src/indesign/pdf/color.js'; + +test('rgbToHex clamps and lowercases', () => { + assert.equal(rgbToHex([0, 102, 204]), '#0066cc'); + assert.equal(rgbToHex([300, -5, 16]), '#ff0010'); +}); + +test('grayToHex mirrors the channel', () => { + assert.equal(grayToHex(0), '#000000'); + assert.equal(grayToHex(255), '#ffffff'); + assert.equal(grayToHex(128), '#808080'); +}); + +test('cmykToHex matches the IDML naive conversion (0/0/0/1 = black)', () => { + assert.equal(cmykToHex([0, 0, 0, 1]), '#000000'); + assert.equal(cmykToHex([0, 0, 0, 0]), '#ffffff'); +}); + +test('hexToRgb round-trips', () => { + assert.deepEqual(hexToRgb('#0066cc'), [0, 102, 204]); + assert.deepEqual(hexToRgb('ffffff'), [255, 255, 255]); +}); + +test('colorDistance is zero for identical colors', () => { + assert.equal(colorDistance('#123456', '#123456'), 0); + assert.ok(colorDistance('#000000', '#ffffff') > 0); +}); + +test('nearestSwatch snaps to the closest palette entry within tolerance', () => { + const palette = [ + { id: 'col-brand', name: 'Brand Blue', color: { hex: '#0066cc', space: 'RGB' } }, + { id: 'col-ink', name: 'Ink', color: { hex: '#000000', space: 'CMYK' } }, + ]; + // Slightly-off brand blue should snap to the brand swatch. + assert.equal(nearestSwatch('#0265cb', palette)?.id, 'col-brand'); + // Near-black snaps to ink. 
+ assert.equal(nearestSwatch('#050505', palette)?.id, 'col-ink'); +}); + +test('nearestSwatch returns null when nothing is close enough', () => { + const palette = [{ id: 'col-ink', name: 'Ink', color: { hex: '#000000', space: 'CMYK' } }]; + assert.equal(nearestSwatch('#00ff00', palette), null); +}); diff --git a/packages/pipeline/tests/indesign/pdf-png.test.mjs b/packages/pipeline/tests/indesign/pdf-png.test.mjs new file mode 100644 index 0000000..4ca1fb1 --- /dev/null +++ b/packages/pipeline/tests/indesign/pdf-png.test.mjs @@ -0,0 +1,58 @@ +// PNG encoder (pure). We decode the output back with node:zlib to prove the +// bytes are a real, readable PNG without pulling in an image library. + +import { test } from 'node:test'; +import assert from 'node:assert/strict'; +import { inflateSync } from 'node:zlib'; +import { encodePng, ImageKind } from '../../src/indesign/pdf/png.js'; + +const SIG = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]); + +function readChunks(buf) { + const chunks = {}; + let off = 8; + while (off < buf.length) { + const len = buf.readUInt32BE(off); + const type = buf.toString('latin1', off + 4, off + 8); + chunks[type] = buf.subarray(off + 8, off + 8 + len); + off += 12 + len; + } + return chunks; +} + +test('encodes a 2x1 RGB image with valid signature and IHDR', () => { + const data = new Uint8Array([255, 0, 0, 0, 255, 0]); // red, green + const png = encodePng({ width: 2, height: 1, kind: ImageKind.RGB_24BPP, data }); + assert.ok(png.subarray(0, 8).equals(SIG)); + const { IHDR, IDAT, IEND } = readChunks(png); + assert.ok(IHDR && IDAT && IEND); + assert.equal(IHDR.readUInt32BE(0), 2); // width + assert.equal(IHDR.readUInt32BE(4), 1); // height + assert.equal(IHDR[8], 8); // bit depth + assert.equal(IHDR[9], 2); // color type RGB +}); + +test('IDAT decompresses to filtered scanlines that preserve pixels', () => { + const data = new Uint8Array([10, 20, 30, 40, 50, 60]); + const png = encodePng({ width: 2, height: 1, kind: 
ImageKind.RGB_24BPP, data }); + const { IDAT } = readChunks(png); + const raw = inflateSync(IDAT); + // One scanline: filter byte (0) + 6 pixel bytes. + assert.equal(raw.length, 7); + assert.equal(raw[0], 0); + assert.deepEqual([...raw.subarray(1)], [10, 20, 30, 40, 50, 60]); +}); + +test('RGBA input produces a color-type-6 PNG', () => { + const data = new Uint8Array([1, 2, 3, 255]); + const png = encodePng({ width: 1, height: 1, kind: ImageKind.RGBA_32BPP, data }); + assert.equal(readChunks(png).IHDR[9], 6); // RGBA +}); + +test('grayscale 1bpp expands to RGB', () => { + // One row, 2px, MSB-first: bits 1,0 -> white, black. Padded to a byte. + const data = new Uint8Array([0b10000000]); + const png = encodePng({ width: 2, height: 1, kind: ImageKind.GRAYSCALE_1BPP, data }); + const raw = inflateSync(readChunks(png).IDAT); + assert.deepEqual([...raw], [0, 255, 255, 255, 0, 0, 0]); // filter + white + black +}); diff --git a/packages/pipeline/tests/indesign/pdf-roundtrip.test.mjs b/packages/pipeline/tests/indesign/pdf-roundtrip.test.mjs new file mode 100644 index 0000000..5169b80 --- /dev/null +++ b/packages/pipeline/tests/indesign/pdf-roundtrip.test.mjs @@ -0,0 +1,127 @@ +// Round-trip agreement: build the *same logical document* two ways — as IDML +// and as an InDesign-style PDF export — parse both, and assert the IRs agree +// within documented tolerances. This is the cross-check the issue calls for: +// PDF is a lossy fallback, so we assert structural agreement, not equality. +// +// Documented tolerances (see docs/pipeline/indesign-pdf-fidelity.md): +// - page / spread count ........ exact +// - image frame count .......... exact +// - text frame count ........... within ±1 +// - style bucket count ......... within ±1 +// - swatches ................... 
PDF colors snap to the IDML swatch palette + +import { test } from 'node:test'; +import assert from 'node:assert/strict'; +import { parseIdmlBuffer } from '../../src/indesign/parse-idml.js'; +import { parsePdfBuffer } from '../../src/indesign/parse-pdf.js'; +import { buildIdml } from './helpers/build-idml.js'; +import { buildPdf } from './helpers/build-pdf.js'; + +const BRAND = [0, 102, 204]; // #0066cc +const INK = [0, 0, 0]; // #000000 + +function buildIdmlVersion() { + return buildIdml({ + name: 'Round Trip', + colors: [ + { id: 'col-brand', name: 'Brand Blue', space: 'RGB', values: BRAND }, + { id: 'col-ink', name: 'Ink', space: 'CMYK', values: [0, 0, 0, 100] }, + ], + fonts: [ + { id: 'font-helv-bold', family: 'Helvetica', style: 'Bold', postScriptName: 'Helvetica-Bold' }, + { id: 'font-helv-reg', family: 'Helvetica', style: 'Regular', postScriptName: 'Helvetica' }, + ], + styles: [ + { id: 'pstyle-h1', name: 'Heading 1', kind: 'paragraph', pointSize: 36, appliedFont: 'font-helv-bold', fillColor: 'col-brand' }, + { id: 'pstyle-body', name: 'Body', kind: 'paragraph', pointSize: 12, appliedFont: 'font-helv-reg', fillColor: 'col-ink' }, + ], + stories: [ + { id: 'story-headline', runs: [{ text: 'Welcome', paragraphStyle: 'pstyle-h1' }] }, + { + id: 'story-body', + runs: [{ text: 'Print to web in one pass with a usable styled result.', paragraphStyle: 'pstyle-body' }], + }, + ], + spreads: [ + { + id: 'spread-1', + pages: [{ id: 'page-1', bounds: [0, 0, 792, 612] }], + frames: [ + { kind: 'text', id: 'frame-headline', bounds: [72, 72, 130, 400], parentStory: 'story-headline' }, + { kind: 'text', id: 'frame-body', bounds: [140, 72, 220, 400], parentStory: 'story-body' }, + { kind: 'image', id: 'frame-hero', bounds: [250, 72, 430, 400], href: 'file:Resources/hero.jpg' }, + ], + }, + ], + }); +} + +function buildPdfVersion() { + const toUnit = ([r, g, b]) => [r / 255, g / 255, b / 255]; + return buildPdf({ + title: 'Round Trip', + pages: [ + { + width: 612, + height: 
792, + texts: [ + { text: 'Welcome', x: 72, y: 96, size: 36, font: 'Helvetica-Bold', color: toUnit(BRAND) }, + { text: 'Print to web in one pass with a usable', x: 72, y: 150, size: 12, font: 'Helvetica', color: toUnit(INK) }, + { text: 'styled result that needs only light touch-ups.', x: 72, y: 166, size: 12, font: 'Helvetica', color: toUnit(INK) }, + ], + images: [{ x: 72, y: 220, width: 240, height: 160, rgb: { width: 6, height: 4, data: new Uint8Array(6 * 4 * 3).fill(128) } }], + }, + ], + }); +} + +test('round-trip: page, frame, and style-bucket counts agree within tolerance', async () => { + const idml = parseIdmlBuffer(buildIdmlVersion()); + const pdf = await parsePdfBuffer(buildPdfVersion(), { swatchPalette: idml.swatches }); + + // Page / spread count: exact. + assert.equal(pdf.spreads.length, idml.spreads.length); + + const countFrames = (ir, kind) => + ir.spreads.flatMap((s) => s.frames).filter((f) => f.kind === kind).length; + + // Image frames: exact. + assert.equal(countFrames(pdf, 'image'), countFrames(idml, 'image')); + + // Text frames: within ±1. + const idmlText = countFrames(idml, 'text'); + const pdfText = countFrames(pdf, 'text'); + assert.ok(Math.abs(idmlText - pdfText) <= 1, `text frames: idml=${idmlText} pdf=${pdfText}`); + + // Style buckets: within ±1 (IDML has h1 + body; PDF synthesizes the same two). 
+ assert.ok(Math.abs(idml.styles.length - pdf.styles.length) <= 1, `styles: idml=${idml.styles.length} pdf=${pdf.styles.length}`); +}); + +test('round-trip: detected PDF colors snap onto the IDML swatch palette', async () => { + const idml = parseIdmlBuffer(buildIdmlVersion()); + const pdf = await parsePdfBuffer(buildPdfVersion(), { swatchPalette: idml.swatches }); + + const pdfSwatchIds = new Set(pdf.swatches.map((s) => s.id)); + assert.ok(pdfSwatchIds.has('col-brand'), 'brand blue should snap to col-brand'); + assert.ok(pdfSwatchIds.has('col-ink'), 'near-black should snap to col-ink'); + + // And the synthesized heading style references the shared swatch id. + const h1 = pdf.styles.find((s) => s.properties.role === 'heading'); + assert.equal(h1.fillColorRef, 'col-brand'); +}); + +test('round-trip: the same prose is recoverable from both IRs', async () => { + const idml = parseIdmlBuffer(buildIdmlVersion()); + const pdf = await parsePdfBuffer(buildPdfVersion()); + + const prose = (ir) => + ir.stories + .flatMap((s) => s.runs.map((r) => r.text)) + .join(' ') + .replace(/\s+/g, ' ') + .toLowerCase(); + + assert.ok(prose(idml).includes('welcome')); + assert.ok(prose(pdf).includes('welcome')); + assert.ok(prose(pdf).includes('print to web in one pass')); +}); diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index 5c6ec36..eafe875 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -38,6 +38,9 @@ importers: fflate: specifier: ^0.8.2 version: 0.8.3 + pdfjs-dist: + specifier: ^4.10.38 + version: 4.10.38 zod: specifier: ^3.23.8 version: 3.23.8 @@ -151,6 +154,76 @@ packages: '@lhci/utils@0.14.0': resolution: {integrity: sha512-LyP1RbvYQ9xNl7uLnl5AO8fDRata9MG/KYfVFKFkYenlsVS6QJsNjLzWNEoMIaE4jOPdQQlSp4tO7dtnyDxzbQ==} + '@napi-rs/canvas-android-arm64@0.1.100': + resolution: {integrity: sha512-hjhCKhntPv9+t4ckHymdx0phYNcVW+GKQR6Lzw2zE+pOVjOplSmtx9nNNknTjbEDLcuLZqA1y8ufKg1XfgftzQ==} + engines: {node: '>= 10'} + cpu: [arm64] + os: [android] + + 
'@napi-rs/canvas-darwin-arm64@0.1.100': + resolution: {integrity: sha512-2PcswRaC7Ly645DGt88///zuFDhJxJYdKAs1uU3mfk1atYkXufgcgLfBpk6Tm12nCQBaNt1wpybuPZ4qOhTo8A==} + engines: {node: '>= 10'} + cpu: [arm64] + os: [darwin] + + '@napi-rs/canvas-darwin-x64@0.1.100': + resolution: {integrity: sha512-ePNZtj7pNIva/siZMg+HmbeozkIjqUIYdoymH8HaA3qK7LfzFN4WMBM8G6HQ9ZC+H3+Dnn5pqtiXpgLykaPOhw==} + engines: {node: '>= 10'} + cpu: [x64] + os: [darwin] + + '@napi-rs/canvas-linux-arm-gnueabihf@0.1.100': + resolution: {integrity: sha512-d5cDB48oWFGU8/XPhUOFAlySgb/VAu7D+s8fi55K1Pcfg8aPplHWqMgibhVLU8ky7Pyg/fuiVLz4Nf3JrSTuUA==} + engines: {node: '>= 10'} + cpu: [arm] + os: [linux] + + '@napi-rs/canvas-linux-arm64-gnu@0.1.100': + resolution: {integrity: sha512-rDxgxRu69RvDlX/bh9o22DxLsGr8EqsNgotL9+RwQE1S0b0cqeatqsw6aW45mukm0B42DIAaAacKaYQ8cqS1nw==} + engines: {node: '>= 10'} + cpu: [arm64] + os: [linux] + + '@napi-rs/canvas-linux-arm64-musl@0.1.100': + resolution: {integrity: sha512-K3mDW66N+xT2/V439u1alFANiBUjdEx2gLiNYnCmUsva5jZMxWTjafBYwTzYK+EMFMHrUoabuU+T1BIP5CgbYQ==} + engines: {node: '>= 10'} + cpu: [arm64] + os: [linux] + + '@napi-rs/canvas-linux-riscv64-gnu@0.1.100': + resolution: {integrity: sha512-mooqUBTIsccZpnoQC4NgrC1v6C1vof39etLNMnBwCY+p0gajWJvAHLGQ6g/gGyS5YrpDW+GefSN4+Cvcr08UWw==} + engines: {node: '>= 10'} + cpu: [riscv64] + os: [linux] + + '@napi-rs/canvas-linux-x64-gnu@0.1.100': + resolution: {integrity: sha512-1eCvkDCazm7FFhsT7DfGOdSaHgZVK3bt/dSBl5EWHOWmnz+I7j8tPseJqqD81NF+MH21jKUK4wQSDjN0mdhnTg==} + engines: {node: '>= 10'} + cpu: [x64] + os: [linux] + + '@napi-rs/canvas-linux-x64-musl@0.1.100': + resolution: {integrity: sha512-20arT6lnI19S68qNlii73TSEDbECNgzMz2EpldC1V3mZFuRkeujXkcebRk0LRJe9SEUAooYiLokfMViY8IX7yA==} + engines: {node: '>= 10'} + cpu: [x64] + os: [linux] + + '@napi-rs/canvas-win32-arm64-msvc@0.1.100': + resolution: {integrity: sha512-DZFFT1wIAg37LJw37yhMRFfjATd3vTQzjZ1Yki8u2vhO6Hi5VE6BVaGQ1aaDu7xb4iMErz+9EOwjpS7xcxFeBw==} + engines: {node: '>= 10'} + 
cpu: [arm64] + os: [win32] + + '@napi-rs/canvas-win32-x64-msvc@0.1.100': + resolution: {integrity: sha512-MyT1j3mHC2+Lu4pBi9mKyMJhtP6U7k7EldY7sj/uS5gJA65gTXt8MefJQXLJo5d/vZbuWmfxzkEUNc/urV3pHA==} + engines: {node: '>= 10'} + cpu: [x64] + os: [win32] + + '@napi-rs/canvas@0.1.100': + resolution: {integrity: sha512-xglYA6q3XO5P3BNJYxVZ1IV7DLVjp1Py6nwag88YntrS+3vKHyYcMqXVS4ZztJmwz2uGvz1FWhI/4LgbR5uQDA==} + engines: {node: '>= 10'} + '@nodable/entities@2.1.0': resolution: {integrity: sha512-nyT7T3nbMyBI/lvr6L5TyWbFJAI9FTgVRakNoBqCD+PmID8DzFrrNdLLtHMwMszOtqZa8PAOV24ZqDnQrhQINA==} @@ -1124,6 +1197,10 @@ packages: path-to-regexp@0.1.13: resolution: {integrity: sha512-A/AGNMFN3c8bOlvV9RreMdrv7jsmF9XIfDeCd87+I8RNg6s78BhJxMu69NEMHBSJFxKidViTEdruRwEk/WIKqA==} + pdfjs-dist@4.10.38: + resolution: {integrity: sha512-/Y3fcFrXEAsMjJXeL9J8+ZG9U01LbuWaYypvDW2ycW1jL269L3js3DVBjDJ0Up9Np1uqDXsDrRihHANhZOlwdQ==} + engines: {node: '>=20'} + pend@1.2.0: resolution: {integrity: sha512-F3asv42UuXchdzt+xXqfW1OGlVBe+mxa2mqI0pg5yAHZPvFmY3Y6drSf/GQ1A86WgWEN9Kzh/WrgKa6iGcHXLg==} @@ -1739,6 +1816,54 @@ snapshots: - supports-color - utf-8-validate + '@napi-rs/canvas-android-arm64@0.1.100': + optional: true + + '@napi-rs/canvas-darwin-arm64@0.1.100': + optional: true + + '@napi-rs/canvas-darwin-x64@0.1.100': + optional: true + + '@napi-rs/canvas-linux-arm-gnueabihf@0.1.100': + optional: true + + '@napi-rs/canvas-linux-arm64-gnu@0.1.100': + optional: true + + '@napi-rs/canvas-linux-arm64-musl@0.1.100': + optional: true + + '@napi-rs/canvas-linux-riscv64-gnu@0.1.100': + optional: true + + '@napi-rs/canvas-linux-x64-gnu@0.1.100': + optional: true + + '@napi-rs/canvas-linux-x64-musl@0.1.100': + optional: true + + '@napi-rs/canvas-win32-arm64-msvc@0.1.100': + optional: true + + '@napi-rs/canvas-win32-x64-msvc@0.1.100': + optional: true + + '@napi-rs/canvas@0.1.100': + optionalDependencies: + '@napi-rs/canvas-android-arm64': 0.1.100 + '@napi-rs/canvas-darwin-arm64': 0.1.100 + '@napi-rs/canvas-darwin-x64': 
0.1.100 + '@napi-rs/canvas-linux-arm-gnueabihf': 0.1.100 + '@napi-rs/canvas-linux-arm64-gnu': 0.1.100 + '@napi-rs/canvas-linux-arm64-musl': 0.1.100 + '@napi-rs/canvas-linux-riscv64-gnu': 0.1.100 + '@napi-rs/canvas-linux-x64-gnu': 0.1.100 + '@napi-rs/canvas-linux-x64-musl': 0.1.100 + '@napi-rs/canvas-win32-arm64-msvc': 0.1.100 + '@napi-rs/canvas-win32-x64-msvc': 0.1.100 + optional: true + '@nodable/entities@2.1.0': {} '@paulirish/trace_engine@0.0.23': {} @@ -2752,6 +2877,10 @@ snapshots: path-to-regexp@0.1.13: {} + pdfjs-dist@4.10.38: + optionalDependencies: + '@napi-rs/canvas': 0.1.100 + pend@1.2.0: {} picocolors@1.1.1: {}