103 changes: 103 additions & 0 deletions docs/pipeline/indesign-pdf-fidelity.md
@@ -0,0 +1,103 @@
# InDesign PDF fallback: fidelity guide

The InDesign-to-WordPress pipeline prefers **IDML** (`.idml`) as its input. When
IDML isn't available — a client only has the exported PDF, or you want to
cross-check IDML output — the pipeline can parse a **PDF exported from InDesign**
instead.

PDF is intentionally a *fallback*. A PDF has no named styles, no text frames,
and no swatch palette; it's a bag of absolutely-positioned glyph runs, fill
colors, and placed images. The parser reconstructs an approximate version of the
same [intermediate representation](../../packages/pipeline/src/indesign/ir.js)
the IDML parser produces, so the downstream mapper and generator can consume
either source. **We never aim for pixel-perfect reconstruction — we aim for a
usable, styled IR that the generator can turn into WordPress patterns with
manual touch-ups.**

## Usage

```bash
# Print the reconstructed IR as JSON; fidelity warnings go to stderr.
node packages/pipeline/bin/parse-pdf.mjs brochure.pdf > ir.json

# Also extract embedded images to an asset cache directory (as PNG).
node packages/pipeline/bin/parse-pdf.mjs brochure.pdf --asset-dir ./assets > ir.json
```

```js
import { parsePdf } from '@flavian/pipeline';

// `idml` is assumed to be an IR you already parsed from the matching IDML file;
// omit `swatchPalette` when you only have the PDF.
const ir = await parsePdf('./brochure.pdf', {
  dpi: 96,                       // unit normalization (default 96)
  assetCacheDir: './assets',     // optional; write extracted images here
  swatchPalette: idml.swatches,  // optional; snap detected colors to IDML swatches
});
```
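
As a rough illustration of what the `dpi` option controls: PDF default user space measures lengths in points (1/72 inch), so normalizing to pixels is a single multiplication. The helper below is a hypothetical sketch, not the package's own `ptToPx`, whose exact signature isn't shown here.

```js
// Hypothetical helper: convert PDF points (1/72 inch) to pixels at a given DPI.
function pointsToPixels(pt, dpi = 96) {
  return (pt * dpi) / 72;
}

// A US Letter MediaBox is 612 x 792 pt.
const letterPage = {
  width: pointsToPixels(612),
  height: pointsToPixels(792),
};
console.log(letterPage); // { width: 816, height: 1056 }
```

Passing a different `dpi` only rescales the geometry; it doesn't change how text or images are detected.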

## How reconstruction works

| IR element | How it's derived from the PDF |
| --- | --- |
| `Spread` (one per page) | One spread per PDF page; page size from the MediaBox. |
| `TextFrame` | Glyph runs are grouped into lines (shared baseline), then lines into frames (vertically adjacent + horizontally overlapping). A wide horizontal gap on a shared baseline is treated as a **column gutter**, so side-by-side columns become separate frames. |
| `Story` / `TextRun` | One story per text frame; each run carries the paragraph-style reference of its font-size bucket. |
| `Style` | Synthesized from the font-size distribution: the most-used size is **Body**, larger sizes become **Heading 1..6** (largest first), smaller sizes become **Caption**. Each bucket records its dominant font and fill color. |
| `Font` | Resolved from each run's PostScript name (subset prefixes like `ABCDEF+` stripped); family/style split on the `-` and refined with pdfjs bold/italic flags. |
| `Swatch` | Distinct fill colors found in the content stream, normalized to hex. With a `swatchPalette`, each color snaps to the nearest IDML swatch (so PDF and IDML produce aligned token names). |
| `ImageFrame` | Image XObjects, placed via the current transform matrix. Pixels are PNG-encoded into the asset cache; `href` points at the cache-relative path. |
| `MasterSpread` | Always empty — PDF has no master pages. |
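
The `Style` row can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `classify.js`: the run shape `{ size, chars }`, weighting usage by character count, and clamping extra large sizes to Heading 6 are all hypothetical choices.

```js
// Hypothetical sketch of font-size bucketing.
// `runs` is an array of { size, chars }: font size and character count per run.
function classifySizes(runs) {
  // Tally how many characters use each font size.
  const usage = new Map();
  for (const { size, chars } of runs) {
    usage.set(size, (usage.get(size) ?? 0) + chars);
  }

  // The most-used size becomes Body (first one seen wins a tie).
  let bodySize = null;
  let bodyCount = -1;
  for (const [size, count] of usage) {
    if (count > bodyCount) {
      bodySize = size;
      bodyCount = count;
    }
  }

  // Larger sizes become headings, largest first; smaller sizes become Caption.
  const larger = [...usage.keys()].filter((s) => s > bodySize).sort((a, b) => b - a);
  const styles = new Map();
  for (const size of usage.keys()) {
    if (size === bodySize) styles.set(size, 'Body');
    else if (size < bodySize) styles.set(size, 'Caption');
    // Assumption: anything beyond the sixth larger size stays at Heading 6.
    else styles.set(size, `Heading ${Math.min(larger.indexOf(size) + 1, 6)}`);
  }
  return styles;
}
```

For a document with 10 pt body copy, 24 pt and 16 pt headings, and 8 pt footnotes, this yields Heading 1 for 24, Heading 2 for 16, Body for 10, and Caption for 8.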

## Fidelity warnings

Every PDF parse attaches warnings describing the approximations made. They appear
on the CLI's stderr and in `ir.warnings`. Treat them as a checklist of things to
verify by eye.

| Code | Meaning | When |
| --- | --- | --- |
| `pdf-fallback` | The whole IR is approximate; prefer IDML if you have it. | Always |
| `text-reconstructed-from-glyphs` | Text came from positioned glyph runs; ligatures, hidden text, and reading order may differ. | Always |
| `styles-synthesized` | Paragraph styles are font-size buckets, not real named styles. | When any text exists |
| `no-embedded-fonts` | No fonts are embedded; family/style mapping is best-effort from PostScript names. | No embedded fonts found |
| `color-attribution-approximate` | Colors are bucketed by font size, not resolved per run. | When any colored text exists |
| `vector-paths-dropped` | Vector paths / image masks were detected but aren't represented in the IR. | When the page draws vector fills/strokes |
| `multi-column-layout` | A page was split into N columns / separate frames. | When >1 column is detected |
| `image-extract-failed` | An image couldn't be decoded (e.g. an unsupported filter). | Per failed image |
| `empty-page` | A page produced no text or image frames. | Per empty page |
| `asset-write-failed` | An extracted image couldn't be written to the asset cache. | Per failed write |
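
Because two of these codes appear on every parse, a review script may want to separate them from the warnings that actually vary per document. The sketch below assumes only that each warning has `code` and `message` fields, as the CLI's stderr output suggests; the helper name is hypothetical.

```js
// Codes the table above lists as emitted on every PDF parse.
const ALWAYS_PRESENT = new Set(['pdf-fallback', 'text-reconstructed-from-glyphs']);

// Hypothetical helper: group the remaining warnings by code for a review checklist.
function actionableWarnings(warnings) {
  const byCode = new Map();
  for (const w of warnings) {
    if (ALWAYS_PRESENT.has(w.code)) continue;
    if (!byCode.has(w.code)) byCode.set(w.code, []);
    byCode.get(w.code).push(w.message);
  }
  return byCode;
}
```

Calling `actionableWarnings(ir.warnings)` on a parsed IR would then give one entry per code that needs a visual check.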

## Round-trip tolerances

The test suite builds the *same logical document* as both IDML and PDF and
asserts the two IRs agree within these tolerances (see
`packages/pipeline/tests/indesign/pdf-roundtrip.test.mjs`):

| Quantity | Tolerance |
| --- | --- |
| Page / spread count | Exact |
| Image frame count | Exact |
| Text frame count | Within ±1 |
| Style bucket count | Within ±1 |
| Swatch identity | Detected PDF colors snap onto the IDML swatch palette |

The text-frame and style-bucket tolerances are deliberately loose. InDesign knows
where each frame begins and ends, but the PDF exposes only glyph positions, so a
heading and its body paragraph may merge into one frame or split into two
depending on spacing, shifting either count by one.
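
The count tolerances in the table can be expressed as a small comparison. This is a sketch, not the assertions in the actual test file: the summary shape `{ spreads, imageFrames, textFrames, styles }` is a hypothetical stand-in for counts you would derive from each IR.

```js
// Hypothetical comparison of count summaries taken from the IDML and PDF IRs.
function compareCounts(idml, pdf) {
  const failures = [];
  const check = (name, tolerance) => {
    const delta = Math.abs(idml[name] - pdf[name]);
    if (delta > tolerance) {
      failures.push(`${name}: IDML ${idml[name]} vs PDF ${pdf[name]} (allowed ±${tolerance})`);
    }
  };
  check('spreads', 0);     // page / spread count: exact
  check('imageFrames', 0); // image frame count: exact
  check('textFrames', 1);  // text frame count: within ±1
  check('styles', 1);      // style bucket count: within ±1
  return failures;
}
```

An empty array means the two IRs agree within tolerance; each string in a non-empty result names the quantity that drifted.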

## Known limitations

- **Geometry is approximate.** Frame rectangles are derived from glyph baselines
using nominal ascent/descent ratios (0.8 / 0.2 of font size), not true font
metrics. Rotated or skewed text is flattened to its axis-aligned bounding box.
- **Per-run color is not resolved.** Color is attributed at the style-bucket
(font-size) level, because the IR carries color on `Style`, not `TextRun`.
- **Vector art is dropped.** Backgrounds, rules, and shapes drawn as vector paths
are noted via `vector-paths-dropped` but not reconstructed.
- **Image masks aren't extracted.** Stencil-masked images paint the current fill
through a 1-bit mask and have no extractable raster; they're treated as vector
content.
- **Leading/tracking are omitted.** The fallback doesn't infer line spacing or
tracking; the mapper applies its own defaults.
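
To make the geometry approximation concrete, here is a sketch of deriving a line rectangle from a baseline with the nominal 0.8 / 0.2 ratios. The input shape and the choice to stay in PDF user space (y grows upward, `y` is the rectangle's bottom edge) are assumptions for illustration.

```js
// Nominal ratios from the limitation above, not true font metrics.
const ASCENT_RATIO = 0.8;
const DESCENT_RATIO = 0.2;

// Hypothetical helper: approximate the box of one text line in PDF user space.
function lineRect({ x, baseline, width, fontSize }) {
  const ascent = ASCENT_RATIO * fontSize;
  const descent = DESCENT_RATIO * fontSize;
  return {
    x,
    y: baseline - descent,    // bottom edge sits below the baseline
    width,
    height: ascent + descent, // total height equals the font size
  };
}
```

A font with unusually tall ascenders or deep descenders will overflow this box, which is one reason frame bounds should be checked by eye.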

When fidelity matters, export IDML from InDesign and use the
[IDML parser](../../packages/pipeline/README.md) instead.
63 changes: 50 additions & 13 deletions packages/pipeline/README.md
@@ -4,26 +4,39 @@ Conversion pipeline for InDesign (and future) sources into WordPress FSE themes.

## Status

-This package currently ships the **IDML parser and intermediate representation** (sub-issue #62 of the InDesign-to-WordPress epic). Downstream stages — PDF fallback (#63), style + token mapper (#64), output generator (#65) — will land as separate PRs. The IR shape produced here is the contract those stages consume.
+This package ships the **IDML parser** (sub-issue #62) and the **PDF fallback parser** (sub-issue #63) of the InDesign-to-WordPress epic. Both emit the same intermediate representation. Downstream stages — style + token mapper (#64), output generator (#65) — will land as separate PRs. The IR shape produced here is the contract those stages consume.
+
+IDML is the primary path (full access to stories, frames, styles, swatches, masters). PDF is a lossy fallback for when only the exported PDF is available, or as a verification source against IDML output — see [`docs/pipeline/indesign-pdf-fidelity.md`](../../docs/pipeline/indesign-pdf-fidelity.md).

## Layout

```
 packages/pipeline/
-├── bin/parse-idml.mjs          CLI entry; prints validated IR JSON on stdout
+├── bin/
+│   ├── parse-idml.mjs          CLI: IDML → validated IR JSON on stdout
+│   └── parse-pdf.mjs           CLI: PDF → reconstructed IR JSON on stdout
 └── src/
     ├── index.js                Re-exports the InDesign surface
     └── indesign/
         ├── ir.js               zod schemas + JSDoc typedefs for the IR
-        ├── parse-idml.js       Main entry: unzips + orchestrates + cross-refs + validates
+        ├── parse-idml.js       IDML entry: unzips + orchestrates + cross-refs + validates
+        ├── parse-pdf.js        PDF entry: extracts + clusters + classifies + validates
         ├── units.js            pt/pc/mm/cm/in → px at configurable DPI
         ├── warnings.js         Non-fatal warning collector
-        └── parsers/
-            ├── xml.js          fast-xml-parser wrapper
-            ├── designmap.js    designmap.xml → manifest with paths
-            ├── resources.js    Graphic.xml + Fonts.xml + Styles.xml
-            ├── stories.js      Stories/Story_*.xml → text runs
-            └── spreads.js      Spreads/*.xml + MasterSpreads/*.xml
+        ├── parsers/            IDML XML decoders
+        │   ├── xml.js          fast-xml-parser wrapper
+        │   ├── designmap.js    designmap.xml → manifest with paths
+        │   ├── resources.js    Graphic.xml + Fonts.xml + Styles.xml
+        │   ├── stories.js      Stories/Story_*.xml → text runs
+        │   └── spreads.js      Spreads/*.xml + MasterSpreads/*.xml
+        └── pdf/                PDF reconstruction modules
+            ├── pdfjs.js        Lazy pdfjs-dist loader (headless, extraction-only)
+            ├── extract.js      Per-page: text runs, fonts, colors, images, vector flag
+            ├── cluster.js      Glyph runs → lines → frames; column detection (pure)
+            ├── classify.js     Font-size buckets → heading/body/caption styles (pure)
+            ├── color.js        RGB/gray/CMYK → hex; nearest-swatch matching (pure)
+            ├── png.js          Decoded pixels → PNG via node:zlib (pure)
+            └── assets.js       Write extracted images to the asset cache
```

## Quick start
@@ -47,6 +60,26 @@ Or from the command line:
node packages/pipeline/bin/parse-idml.mjs my-document.idml > ir.json
```

+### PDF fallback
+
+When you only have a PDF exported from InDesign, use the fallback parser. It
+emits the same IR, plus fidelity warnings describing every approximation it made.
+
+```js
+import { parsePdf } from '@flavian/pipeline';
+
+const ir = await parsePdf('./brochure.pdf', {
+  assetCacheDir: './assets',      // optional: write extracted images (PNG) here
+  swatchPalette: idml?.swatches,  // optional: snap detected colors to IDML swatches
+});
+```
+
+```bash
+node packages/pipeline/bin/parse-pdf.mjs brochure.pdf --asset-dir ./assets > ir.json
+```
+
+PDF reconstruction is lossy by design. See [`docs/pipeline/indesign-pdf-fidelity.md`](../../docs/pipeline/indesign-pdf-fidelity.md) for how each IR element is derived, the full list of fidelity-warning codes, and the round-trip tolerances against IDML.
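
The `swatchPalette` option snaps each detected color to its nearest palette entry. One plausible way to do that is sketched below; the palette entry shape `{ name, hex }` and the squared Euclidean distance in RGB are assumptions for illustration, and `color.js` may use a different metric.

```js
// Parse '#rrggbb' into [r, g, b] integers.
function hexToRgb(hex) {
  const n = parseInt(hex.slice(1), 16);
  return [(n >> 16) & 255, (n >> 8) & 255, n & 255];
}

// Hypothetical helper: return the palette entry closest to `hex`, or null for an empty palette.
function snapToSwatch(hex, palette) {
  const [r, g, b] = hexToRgb(hex);
  let best = null;
  let bestDistance = Infinity;
  for (const swatch of palette) {
    const [sr, sg, sb] = hexToRgb(swatch.hex);
    // Squared Euclidean distance in RGB (an assumed metric).
    const distance = (r - sr) ** 2 + (g - sg) ** 2 + (b - sb) ** 2;
    if (distance < bestDistance) {
      best = swatch;
      bestDistance = distance;
    }
  }
  return best;
}
```

Snapping is what lets a slightly shifted PDF color, such as a CMYK value that was converted on export, resolve to the same token name the IDML path produces.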

## IR shape

The intermediate representation is described in [`src/indesign/ir.js`](src/indesign/ir.js). At the top level:
@@ -70,18 +103,22 @@ Geometry (`Page.bounds`, `Frame.bounds`) is normalized to pixels at `dpi` (defau

## Failure mode

-- **Throws** on structural problems that make the IR meaningless: missing `designmap.xml`, malformed zip, a `<Spread>` element that lacks `Self`.
-- **Warns and continues** on everything else: missing optional resource files, dangling style references, unknown color spaces, empty stories, unrecognized unit suffixes.
+Both parsers share the same philosophy: throw only when the document can't be read at all; otherwise emit a partial IR with warnings.
+
+- **IDML throws** on missing `designmap.xml`, a malformed zip, or a `<Spread>` lacking `Self`; **warns** on missing optional resources, dangling references, unknown color spaces, empty stories, unrecognized units.
+- **PDF throws** only when the file can't be opened as a PDF; **warns** on every approximation (text reconstructed from glyphs, synthesized styles, dropped vector paths, undecodable images, …). PDF parses always carry fidelity warnings — that's expected.

-The CLI surfaces warnings on stderr and exits 0 unless the IR itself failed to build.
+Each CLI surfaces warnings on stderr and exits 0 unless the IR itself failed to build.

## Testing

```bash
pnpm --filter @flavian/pipeline test
```

-Tests build minimal IDML zips programmatically (see `tests/indesign/helpers/build-idml.js`) — no binary fixtures in git. The fixture builder mirrors the IDML XML grammar the parser reads, so adding a new test case is usually one option flag.
+Tests build minimal fixtures programmatically — no binary fixtures in git. `tests/indesign/helpers/build-idml.js` emits IDML zips; `tests/indesign/helpers/build-pdf.js` emits PDFs (positioned text in base-14 fonts, FlateDecode image XObjects, vector fills). Building the *same logical document* both ways powers the IDML↔PDF round-trip test.
+
+The PDF heuristics (clustering, classification, color, PNG encoding) are split into pure modules under `src/indesign/pdf/` and unit-tested without a PDF engine; only `extract.js` and the orchestrator touch pdfjs.
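
To show why a pure clustering module is easy to test without a PDF engine, here is a sketch of the first clustering step: grouping runs into lines by shared baseline and splitting on a wide horizontal gap. The run shape `{ text, x, width, baseline }`, the 1 pt baseline buckets, and the 24 pt gutter threshold are hypothetical; `cluster.js` may choose differently.

```js
// Hypothetical sketch: group positioned runs into lines, splitting at column gutters.
function groupIntoLines(runs, { gutter = 24 } = {}) {
  // Bucket runs whose baselines round to the same point value.
  const byBaseline = new Map();
  for (const run of runs) {
    const key = Math.round(run.baseline);
    if (!byBaseline.has(key)) byBaseline.set(key, []);
    byBaseline.get(key).push(run);
  }

  const lines = [];
  // PDF y grows upward, so a larger baseline is higher on the page.
  for (const key of [...byBaseline.keys()].sort((a, b) => b - a)) {
    const row = byBaseline.get(key).sort((a, b) => a.x - b.x);
    let current = null;
    for (const run of row) {
      if (current && run.x - current.right < gutter) {
        // Close enough to the previous run: same line.
        current.runs.push(run);
        current.right = Math.max(current.right, run.x + run.width);
      } else {
        // First run on this baseline, or a gap wide enough to be a gutter.
        current = { baseline: key, left: run.x, right: run.x + run.width, runs: [run] };
        lines.push(current);
      }
    }
  }
  return lines;
}
```

Since the function takes plain objects and returns plain objects, a unit test can feed it hand-written coordinates instead of a parsed PDF.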

## Adding a new input format

85 changes: 85 additions & 0 deletions packages/pipeline/bin/parse-pdf.mjs
@@ -0,0 +1,85 @@
#!/usr/bin/env node
// CLI: print the reconstructed IR as JSON on stdout, fidelity warnings on stderr.
//
// flavian-parse-pdf <path.pdf> [--dpi <n>] [--asset-dir <dir>] [--quiet]
//
// PDF is the fallback path. Expect fidelity warnings on every run — that's the
// parser telling you which parts are approximate.

import { parsePdf } from '../src/indesign/parse-pdf.js';

const args = process.argv.slice(2);
let inputPath;
let dpi;
let assetCacheDir;
let quiet = false;

for (let i = 0; i < args.length; i += 1) {
  const arg = args[i];
  if (arg === '--dpi') {
    const next = args[i + 1];
    // Reject missing, non-numeric, zero, and negative values, as the message promises.
    if (!next || !Number.isFinite(Number(next)) || Number(next) <= 0) {
      console.error('--dpi requires a positive number');
      process.exit(2);
    }
    dpi = Number(next);
    i += 1;
  } else if (arg === '--asset-dir') {
    const next = args[i + 1];
    if (!next) {
      console.error('--asset-dir requires a directory path');
      process.exit(2);
    }
    assetCacheDir = next;
    i += 1;
  } else if (arg === '--quiet') {
    quiet = true;
  } else if (arg === '-h' || arg === '--help') {
    printUsage();
    process.exit(0);
  } else if (!inputPath && !arg.startsWith('-')) {
    inputPath = arg;
  } else {
    console.error(`Unknown argument: ${arg}`);
    printUsage();
    process.exit(2);
  }
}

if (!inputPath) {
  printUsage();
  process.exit(2);
}

try {
  const options = {};
  if (dpi !== undefined) options.dpi = dpi;
  if (assetCacheDir !== undefined) options.assetCacheDir = assetCacheDir;
  const ir = await parsePdf(inputPath, options);
  if (!quiet && ir.warnings.length > 0) {
    for (const w of ir.warnings) {
      const where = w.context?.file ? ` (${w.context.file}${w.context.id ? `#${w.context.id}` : ''})` : '';
      process.stderr.write(`[${w.code}] ${w.message}${where}\n`);
    }
    process.stderr.write(`\n${ir.warnings.length} warning(s).\n`);
  }
  process.stdout.write(JSON.stringify(ir, null, 2) + '\n');
} catch (err) {
  process.stderr.write(`error: ${err.message}\n`);
  process.exit(1);
}

function printUsage() {
  process.stderr.write(
    [
      'Usage: flavian-parse-pdf <path.pdf> [options]',
      '',
      'Options:',
      '  --dpi <n>          Pixels per inch for unit normalization (default 96)',
      '  --asset-dir <dir>  Write extracted images (PNG) under this directory',
      '  --quiet            Suppress fidelity warnings on stderr',
      '  -h, --help         Show this help',
      '',
    ].join('\n'),
  );
}
4 changes: 3 additions & 1 deletion packages/pipeline/package.json
@@ -9,14 +9,16 @@
     "./indesign": "./src/indesign/index.js"
   },
   "bin": {
-    "flavian-parse-idml": "./bin/parse-idml.mjs"
+    "flavian-parse-idml": "./bin/parse-idml.mjs",
+    "flavian-parse-pdf": "./bin/parse-pdf.mjs"
   },
   "scripts": {
     "test": "node --test \"tests/**/*.test.mjs\""
   },
   "dependencies": {
     "fast-xml-parser": "^5.7.0",
     "fflate": "^0.8.2",
+    "pdfjs-dist": "^4.10.38",
     "zod": "^3.23.8"
   },
   "engines": {
1 change: 1 addition & 0 deletions packages/pipeline/src/indesign/index.js
@@ -1,4 +1,5 @@
export { parseIdml, parseIdmlBuffer } from './parse-idml.js';
+export { parsePdf, parsePdfBuffer } from './parse-pdf.js';
export * as ir from './ir.js';
export { WarningCollector } from './warnings.js';
export { lengthToPx, ptToPx, roundPx } from './units.js';