Summary
Implement a first-class PDF ingestion path for the InDesign-to-React pipeline. PDFs exported from InDesign are the primary handoff artifact in our real-world designer-to-developer workflow: designers routinely export to PDF and pass that file to engineering rather than sharing source .indd or .idml files. This pipeline must accept a PDF as a top-level input and produce the same intermediate representation (IR) that the IDML parser (#63) emits, so downstream stages (mapper #65, generator #66, agent #67) are PDF-source-agnostic.
This replaces the previous framing of "PDF as fallback parser." PDF is now a primary input, on equal footing with IDML. IDML remains preferred when available because it carries richer style metadata, but the pipeline must not assume IDML availability.
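As a rough illustration of "IDML preferred when available, but never assumed", input selection could look something like the sketch below. Only the three values pdf, idml, and auto come from the --source-priority flag described under Details; the function name, the types, and the choice to fail loudly when a forced source is missing are assumptions made for the example, not decided behavior.

```typescript
// Hypothetical sketch: chooseSource, AvailableSources, and SourceChoice are
// illustrative names, not existing pipeline APIs.
type SourcePriority = "pdf" | "idml" | "auto";

interface AvailableSources {
  pdfPath?: string;
  idmlPath?: string;
}

interface SourceChoice {
  kind: "pdf" | "idml";
  path: string;
}

function chooseSource(
  available: AvailableSources,
  priority: SourcePriority = "auto",
): SourceChoice {
  const { pdfPath, idmlPath } = available;
  if (!pdfPath && !idmlPath) {
    throw new Error("No .pdf or .idml input supplied");
  }

  // Forced priorities: assumed here to error out rather than silently fall back.
  if (priority === "pdf") {
    if (!pdfPath) throw new Error("--source-priority pdf given, but no PDF found");
    return { kind: "pdf", path: pdfPath };
  }
  if (priority === "idml") {
    if (!idmlPath) throw new Error("--source-priority idml given, but no IDML found");
    return { kind: "idml", path: idmlPath };
  }

  // auto: prefer IDML for its richer style metadata, otherwise use the PDF.
  if (idmlPath) return { kind: "idml", path: idmlPath };
  return { kind: "pdf", path: pdfPath as string };
}
```

With only a PDF on disk, auto resolves to the PDF path, which is the common case this issue is about.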
Motivation
The graphic designer driving the InDesign feature additions confirmed that their normal handoff to developers is a PDF, not an IDML/INDD file.
Many design-to-dev workflows in agency settings never expose source files at all — engineers only ever see the PDF.
Treating PDF as a "fallback" understates how often it is, in practice, the only available input.
Details
Accept .pdf as a top-level CLI input alongside .idml (e.g., aurelius indesign ./brochure.pdf and aurelius indesign ./brochure.idml are both valid entry points).
Use pdfjs-dist (or equivalent) to extract text runs with positional data, font metrics, image XObjects, and vector paths.
Group adjacent text runs into logical text frames using positional clustering and font/style continuity.
Detect columns, headings, body text, and captions heuristically (font size buckets, leading, horizontal alignment).
Extract embedded images and write them to the pipeline's asset cache.
Derive a swatch palette directly from the PDF (clustered colors from text fills, vector fills, and dominant image colors). When an IDML-derived palette is also available, prefer it; otherwise the PDF-derived palette becomes the canonical one for the run.
Emit the same IR shape as the IDML parser (Document, Spread → 1 spread per page, Frame, TextFrame, Story, ImageFrame).
Surface fidelity warnings (e.g., "no embedded fonts", "text reconstructed from glyph runs", "vector-only page — no extractable text") so downstream stages and humans know what was inferred vs. read.
Provide a --source-priority pdf|idml|auto flag so callers can force PDF ingestion even when both files are present (useful for verifying parity between the two paths).
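To make the text-frame grouping step above more concrete, here is a minimal sketch of positional clustering with a font/style continuity check. It assumes text runs have already been normalized into a flat TextRun shape (for example from the items that pdfjs-dist's page.getTextContent() returns, which carry str, transform, width, and fontName). The TextRun and TextFrameDraft shapes, the function name, and the threshold values are all hypothetical, and a real implementation would likely need smarter column and indent handling.

```typescript
// Hypothetical normalized run; coordinates are PDF points with the origin at
// the bottom-left of the page, so y grows upward.
interface TextRun {
  text: string;
  x: number; // left edge
  y: number; // baseline
  width: number;
  fontSize: number;
  fontName: string;
}

interface TextFrameDraft {
  runs: TextRun[];
  fontName: string;
  fontSize: number;
  left: number;
  right: number;
  lastBaseline: number;
}

function groupRunsIntoFrames(
  runs: TextRun[],
  maxLeadingFactor = 1.6, // assumed: max baseline gap, as a multiple of font size
): TextFrameDraft[] {
  // Reading order: top of page first (highest y), then left to right.
  const sorted = [...runs].sort((a, b) => b.y - a.y || a.x - b.x);
  const frames: TextFrameDraft[] = [];

  for (const run of sorted) {
    const runRight = run.x + run.width;
    const gapTolerance = run.fontSize; // allow roughly one em between neighbors

    const target = frames.find((f) => {
      const sameStyle =
        f.fontName === run.fontName && Math.abs(f.fontSize - run.fontSize) < 0.5;
      // Never negative, because runs are visited from the top of the page down.
      const verticalGap = f.lastBaseline - run.y;
      const closeVertically = verticalGap <= run.fontSize * maxLeadingFactor;
      const overlapsHorizontally =
        run.x <= f.right + gapTolerance && runRight >= f.left - gapTolerance;
      return sameStyle && closeVertically && overlapsHorizontally;
    });

    if (target) {
      target.runs.push(run);
      target.left = Math.min(target.left, run.x);
      target.right = Math.max(target.right, runRight);
      target.lastBaseline = run.y;
    } else {
      frames.push({
        runs: [run],
        fontName: run.fontName,
        fontSize: run.fontSize,
        left: run.x,
        right: runRight,
        lastBaseline: run.y,
      });
    }
  }
  return frames;
}
```

Because the horizontal check only tolerates about one em of gap, side-by-side columns end up in separate drafts, and a heading in a different font or size starts its own draft. Any frame produced this way is inferred rather than read, so it is a natural place to attach a "text reconstructed from glyph runs" fidelity warning.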
Acceptance Criteria
The CLI accepts a .pdf as a primary input and runs the full pipeline end-to-end (PDF → IR → tokens → TSX components).
Parity test: the same source InDesign document exported to both IDML and PDF produces IRs whose page count, frame count, and detected style buckets agree within documented tolerances.
Embedded images are extracted and addressable from the IR.
Swatch palette is correctly derived from the PDF when no IDML is supplied.
Fidelity warnings appear in CLI output and are documented in docs/pipeline/indesign-pdf-fidelity.md.
README and the InDesign pipeline docs describe PDF as a first-class input (not a fallback), and include a "designer hands you a PDF" quickstart.
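The parity criterion above could be checked along these lines. This is only a sketch: IrSummary, ParityTolerances, and checkParity are hypothetical names, the tolerance values still have to be decided and documented, and treating page count as an exact match is an assumption of the example.

```typescript
// Hypothetical summary extracted from each IR before comparison.
interface IrSummary {
  pageCount: number;
  frameCount: number;
  styleBucketCount: number;
}

// Placeholder tolerances; real values belong in the fidelity docs.
interface ParityTolerances {
  frameCountRatio: number; // allowed relative drift in frame count
  styleBucketDelta: number; // allowed absolute difference in style buckets
}

function checkParity(
  idml: IrSummary,
  pdf: IrSummary,
  tol: ParityTolerances,
): string[] {
  const problems: string[] = [];

  // Assumed: page count must match exactly (one spread per page on the PDF side).
  if (idml.pageCount !== pdf.pageCount) {
    problems.push(`page count differs: idml=${idml.pageCount} pdf=${pdf.pageCount}`);
  }

  // Frame count is compared relative to the IDML count, since the PDF path
  // may split or merge frames heuristically.
  const frameDrift =
    Math.abs(idml.frameCount - pdf.frameCount) / Math.max(idml.frameCount, 1);
  if (frameDrift > tol.frameCountRatio) {
    problems.push(`frame count drift ${frameDrift.toFixed(3)} exceeds ${tol.frameCountRatio}`);
  }

  const bucketDelta = Math.abs(idml.styleBucketCount - pdf.styleBucketCount);
  if (bucketDelta > tol.styleBucketDelta) {
    problems.push(`style bucket count differs by ${bucketDelta}`);
  }

  return problems; // an empty array means the two IRs agree within tolerance
}
```

Returning a list of problems rather than a boolean keeps test failures readable when several dimensions drift at once.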
Dependencies
Sub-issue of #62.
Notes
We still do not aim for pixel-perfect reconstruction from PDF. The goal is a usable, styled IR that the generator can turn into React components with manual touch-ups. The shift here is one of product framing and ergonomics: PDF is what designers actually deliver, so the pipeline must treat it as a normal, supported input rather than a degraded path.