[InDesign pipeline] PDF ingestion as primary input (designer-to-developer PDF handoff) #64

@PAMulligan

Sub-issue of #62.

Summary

Implement a first-class PDF ingestion path for the InDesign-to-React pipeline. PDFs exported from InDesign are the primary handoff artifact in our real-world designer-to-developer workflow: designers routinely export to PDF and pass that file to engineering rather than sharing source .indd or .idml files. The pipeline must accept a PDF as a top-level input and produce the same intermediate representation (IR) that the IDML parser (#63) emits, so that downstream stages (mapper #65, generator #66, agent #67) are source-agnostic.

This replaces the previous framing of "PDF as fallback parser." PDF is now a primary input, on equal footing with IDML. IDML remains preferred when available because it carries richer style metadata, but the pipeline must not assume IDML availability.
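
The input-selection rule described here (IDML preferred when present, PDF otherwise, with an override from the proposed --source-priority flag) could be sketched as below. This is only an illustration: `resolveSource`, `SourcePriority`, and `ResolvedSource` are hypothetical names, not existing aurelius APIs, and treating the flag as an ordering preference rather than a hard requirement is an assumption.

```typescript
// Hypothetical sketch: none of these names exist in the codebase yet.
type SourcePriority = "pdf" | "idml" | "auto";
type SourceKind = "pdf" | "idml";

interface ResolvedSource {
  kind: SourceKind;
  path: string;
}

// Classify an input path by its extension (case-insensitive).
function kindOf(path: string): SourceKind | undefined {
  if (/\.pdf$/i.test(path)) return "pdf";
  if (/\.idml$/i.test(path)) return "idml";
  return undefined;
}

function resolveSource(
  paths: string[],
  priority: SourcePriority = "auto"
): ResolvedSource {
  let pdf: string | undefined;
  let idml: string | undefined;
  for (const p of paths) {
    const kind = kindOf(p);
    if (kind === "pdf" && pdf === undefined) pdf = p;
    if (kind === "idml" && idml === undefined) idml = p;
  }

  // Assumption: "auto" behaves like "idml" (IDML carries richer style
  // metadata), and only "pdf" flips the order. The other file is still used
  // when the preferred one is missing.
  const pdfFirst = priority === "pdf";
  const first = pdfFirst ? pdf : idml;
  const second = pdfFirst ? idml : pdf;

  if (first !== undefined) {
    return { kind: pdfFirst ? "pdf" : "idml", path: first };
  }
  if (second !== undefined) {
    return { kind: pdfFirst ? "idml" : "pdf", path: second };
  }
  throw new Error("No .pdf or .idml input found");
}
```

With only `./brochure.pdf` supplied, this resolves to the PDF path without complaint, which is the behavior the reframing asks for.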

Motivation

  • The graphic designer driving the InDesign feature additions confirmed that their normal handoff to developers is a PDF, not an IDML/INDD file.
  • Many design-to-dev workflows in agency settings never expose source files at all — engineers only ever see the PDF.
  • Treating PDF as a "fallback" understates how often it is, in practice, the only available input.

Details

  • Accept .pdf as a top-level CLI input alongside .idml (e.g., aurelius indesign ./brochure.pdf and aurelius indesign ./brochure.idml are both valid entry points).
  • Use pdfjs-dist (or equivalent) to extract text runs with positional data, font metrics, image XObjects, and vector paths.
  • Group adjacent text runs into logical text frames using positional clustering and font/style continuity.
  • Detect columns, headings, body text, and captions heuristically (font size buckets, leading, horizontal alignment).
  • Extract embedded images and write them to the pipeline's asset cache.
  • Derive a swatch palette directly from the PDF (clustered colors from text fills, vector fills, and dominant image colors). When an IDML-derived palette is also available, prefer it; otherwise the PDF-derived palette becomes the canonical one for the run.
  • Emit the same IR shape as the IDML parser (Document, Spread, Frame, TextFrame, Story, ImageFrame), with one Spread per PDF page.
  • Surface fidelity warnings (e.g., "no embedded fonts", "text reconstructed from glyph runs", "vector-only page — no extractable text") so downstream stages and humans know what was inferred vs. read.
  • Provide a --source-priority pdf|idml|auto flag so callers can force PDF ingestion even when both files are present (useful for verifying parity between the two paths).
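
As one possible reading of the "positional clustering and font/style continuity" step above, the sketch below greedily groups text runs into frame candidates. It assumes runs have already been reduced to a simple `TextRun` shape (in pdfjs-dist, the items returned by `getTextContent()` carry `str`, `transform`, `width`, and `fontName`, from which such a shape could be derived). `TextRun`, `TextFrameDraft`, `groupRunsIntoFrames`, and the default thresholds are all hypothetical and would need tuning against real fixtures.

```typescript
// Hypothetical, simplified run shape in PDF user space (origin bottom-left).
interface TextRun {
  text: string;
  x: number; // left edge
  y: number; // baseline
  width: number;
  fontSize: number;
  fontName: string;
}

interface TextFrameDraft {
  runs: TextRun[];
  left: number;
  right: number;
  top: number; // highest baseline in the frame
  bottom: number; // lowest baseline in the frame
}

function groupRunsIntoFrames(
  runs: TextRun[],
  maxLeadingFactor = 1.6, // assumed: max baseline gap, in multiples of font size
  sizeTolerance = 0.5 // assumed: max font-size difference in points
): TextFrameDraft[] {
  // Reading order: top to bottom, then left to right.
  const sorted = [...runs].sort((a, b) => b.y - a.y || a.x - b.x);
  const frames: TextFrameDraft[] = [];

  for (const run of sorted) {
    const runRight = run.x + run.width;
    const slack = 0.5 * run.fontSize; // tolerate word gaps on the same line
    let target: TextFrameDraft | undefined;

    for (const frame of frames) {
      const last = frame.runs[frame.runs.length - 1];
      const sameStyle =
        last.fontName === run.fontName &&
        Math.abs(last.fontSize - run.fontSize) <= sizeTolerance;
      const gap = frame.bottom - run.y; // 0 on the same line
      const closeVertically =
        gap >= 0 && gap <= maxLeadingFactor * run.fontSize;
      const overlapsHorizontally =
        run.x <= frame.right + slack && runRight + slack >= frame.left;
      if (sameStyle && closeVertically && overlapsHorizontally) {
        target = frame;
        break;
      }
    }

    if (target === undefined) {
      frames.push({
        runs: [run],
        left: run.x,
        right: runRight,
        top: run.y,
        bottom: run.y,
      });
    } else {
      target.runs.push(run);
      target.left = Math.min(target.left, run.x);
      target.right = Math.max(target.right, runRight);
      target.bottom = Math.min(target.bottom, run.y);
    }
  }
  return frames;
}
```

A heading set in a different font or size starts its own frame, and two columns separated by a gutter wider than the slack stay separate because their horizontal extents never overlap. Rotated text, drop caps, and tightly spaced columns are not handled here and are likely sources for the fidelity warnings mentioned above.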

Acceptance Criteria

  • Given a fixture PDF exported from InDesign, the parser produces a valid IR object compatible with the downstream mapper (#65) and generator (#66) without requiring a companion IDML file.
  • The CLI accepts a .pdf as a primary input and runs the full pipeline end-to-end (PDF → IR → tokens → TSX components).
  • Parity test: the same source InDesign document exported to both IDML and PDF produces IRs whose page count, frame count, and detected style buckets agree within documented tolerances.
  • Embedded images are extracted and addressable from the IR.
  • Swatch palette is correctly derived from the PDF when no IDML is supplied.
  • Fidelity warnings appear in CLI output and are documented in docs/pipeline/indesign-pdf-fidelity.md.
  • README and the InDesign pipeline docs describe PDF as a first-class input (not a fallback), and include a "designer hands you a PDF" quickstart.
  • Unit tests cover: text-heavy PDF, image-heavy PDF, multi-column PDF, single-page brochure PDF, vector-only / outlined-text PDF.
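
The parity criterion above could be checked with a small comparison helper along these lines. This is a sketch under assumptions: `IrSummary`, `ParityTolerances`, and `checkParity` are hypothetical names, the tolerance values in the usage note are placeholders for whatever ends up documented, and requiring page counts to match exactly is an assumption rather than something this issue specifies.

```typescript
// Hypothetical summary extracted from an IR for comparison purposes.
interface IrSummary {
  pageCount: number;
  frameCount: number;
  styleBuckets: Record<string, number>; // e.g. { heading: 4, body: 12 }
}

// Relative tolerances, expressed as a fraction of the larger count.
interface ParityTolerances {
  frameCountRatio: number;
  bucketCountRatio: number;
}

function withinRatio(a: number, b: number, ratio: number): boolean {
  return Math.abs(a - b) <= ratio * Math.max(a, b, 1);
}

// Returns a list of human-readable mismatches; an empty list means parity.
function checkParity(
  idml: IrSummary,
  pdf: IrSummary,
  tol: ParityTolerances
): string[] {
  const problems: string[] = [];

  // Assumption: page counts must match exactly.
  if (idml.pageCount !== pdf.pageCount) {
    problems.push(`page count: idml=${idml.pageCount} pdf=${pdf.pageCount}`);
  }
  if (!withinRatio(idml.frameCount, pdf.frameCount, tol.frameCountRatio)) {
    problems.push(`frame count: idml=${idml.frameCount} pdf=${pdf.frameCount}`);
  }

  // Compare every bucket seen on either side; a missing bucket counts as 0.
  const names = Object.keys(idml.styleBuckets);
  for (const name of Object.keys(pdf.styleBuckets)) {
    if (names.indexOf(name) === -1) names.push(name);
  }
  for (const name of names) {
    const a = idml.styleBuckets[name] ?? 0;
    const b = pdf.styleBuckets[name] ?? 0;
    if (!withinRatio(a, b, tol.bucketCountRatio)) {
      problems.push(`style bucket "${name}": idml=${a} pdf=${b}`);
    }
  }
  return problems;
}
```

A parity test would build both summaries from the same fixture document and assert that `checkParity(idmlSummary, pdfSummary, { frameCountRatio: 0.15, bucketCountRatio: 0.25 })` returns an empty array, with the two ratios replaced by the documented tolerances.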

Dependencies

Notes

We still do not aim for pixel-perfect reconstruction from PDF. The goal is a usable, styled IR that the generator can turn into React components with manual touch-ups. The shift here is one of product framing and ergonomics: PDF is what designers actually deliver, so the pipeline must treat it as a normal, supported input rather than a degraded path.

Labels

enhancement (New feature or request), pipeline (Figma/Canva-to-React conversion pipeline), react (React-specific functionality)
