[importer] w14:paraId from word/document.xml is preserved in the document-api JSON but not threaded into ProseMirror node attrs #3303

@bjohas

Summary

w14:paraId (and the related w14:textId, w:rsidR, w:rsidRDefault, and w:rsidP attributes) survive in the document-api JSON intermediate that mirrors OOXML, but they are dropped at the ProseMirror (PM) import boundary: they do not appear in the attrs of the corresponding paragraph node in the editor. On save they are re-attached from the JSON layer, so any paragraph that cannot be matched back to it is emitted with a fresh ID (or none at all).

Why it matters

w14:paraId is the single stable per-paragraph identifier inside a .docx that survives edits, re-saves, and round-trips through Word, Google Docs, and most OOXML tooling. With it threaded through PM attrs, a consumer could:

  • Anchor external annotations (margin notes, review comments living outside the doc, AI critique) to specific paragraphs without inventing a shadow ID scheme.
  • Build deterministic deep links into a document (e.g. ?anchor=paraId:0F3E1A22) that survive re-saves.
  • Diff paragraphs across versions by identity, not by position.

Without it surfacing in PM, every consumer either invents its own IDs and decoration overlays (fragile under edits) or falls back to text matching (fragile under any text change).
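
As an illustration of the deep-link use case, here is a minimal sketch of resolving an anchor such as ?anchor=paraId:0F3E1A22 to a paragraph. It assumes the proposed paraId attr already exists on paragraph nodes. ParagraphLike, parseParaIdAnchor, and findParagraphIndex are hypothetical names, not SuperDoc or ProseMirror APIs, and the 8-hex-digit format is an assumption based on the example above.

```typescript
// Hypothetical minimal view of a paragraph node, assuming the proposed
// paraId attr is exposed on the node's attrs.
interface ParagraphLike {
  attrs: { paraId: string | null };
  text: string;
}

// Parses a query string such as "?anchor=paraId:0F3E1A22" and returns the
// upper-cased ID, or null if no well-formed paraId anchor is present.
function parseParaIdAnchor(query: string): string | null {
  const m = /[?&]anchor=paraId:([0-9A-Fa-f]{8})(?:&|$)/.exec(query);
  const id = m ? m[1] : undefined;
  return id ? id.toUpperCase() : null;
}

// Returns the index of the first paragraph carrying the given paraId,
// or -1 if none does. Comparison is case-insensitive.
function findParagraphIndex(paragraphs: ParagraphLike[], paraId: string): number {
  const wanted = paraId.toUpperCase();
  for (let i = 0; i < paragraphs.length; i++) {
    const current = paragraphs[i].attrs.paraId;
    if (current !== null && current.toUpperCase() === wanted) return i;
  }
  return -1;
}
```

Because the lookup is by identity rather than position or text, the link keeps resolving after paragraphs are inserted above the target or the target's text is edited.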

Audit findings

In our testing, we compared word/document.xml before and after a SuperDoc-based round-trip:

  • Long real-world document (100+ pages): w14:paraId preservation 99.8% (483/484 paragraphs round-tripped with an identical paraId). The one drop appeared to be a paragraph that PM had merged or split.
  • Smaller document originating from Google Docs: w14:paraId preservation 100% at the OOXML level.
  • PM layer: in both cases, 0% of paragraphs exposed paraId as a node attr — the data sits in the document-api JSON, gets re-attached at export, but is invisible to anything operating on the PM document.
  • Two independent SuperDoc-based consumers (an Electron-based editor POC and a separate headless tool) both produced byte-identical word/document.xml on save, confirming JSON-layer preservation is solid; the loss is purely at the PM boundary.
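
For anyone wanting to reproduce the OOXML-level numbers, a comparison along these lines would do. This is a sketch rather than the exact script we used: extractParaIds and paraIdPreservation are hypothetical helper names, the regex stands in for a real XML parser, and it counts an ID as preserved if it appears anywhere in the output, without checking position.

```typescript
// Collects w14:paraId values from <w:p> start tags, in document order.
// The \b after "<w:p" keeps tags such as <w:pPr> from matching.
function extractParaIds(documentXml: string): string[] {
  const ids: string[] = [];
  const startTag = /<w:p\b[^>]*>/g;
  let tag: RegExpExecArray | null;
  while ((tag = startTag.exec(documentXml)) !== null) {
    const attr = /\bw14:paraId="([0-9A-Fa-f]+)"/.exec(tag[0]);
    const id = attr ? attr[1] : undefined;
    if (id) ids.push(id.toUpperCase());
  }
  return ids;
}

// Counts how many paraIds from the original document still exist after the
// round-trip. Membership only: reordering is not treated as a loss.
function paraIdPreservation(
  beforeXml: string,
  afterXml: string
): { kept: number; total: number; ratio: number } {
  const before = extractParaIds(beforeXml);
  const after = new Set(extractParaIds(afterXml));
  const kept = before.filter((id) => after.has(id)).length;
  const total = before.length;
  return { kept, total, ratio: total === 0 ? 1 : kept / total };
}
```

Run against the word/document.xml extracted from the input and output .docx files, this yields the kept/total figures of the kind quoted above.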

What's wrong architecturally

The PM paragraph node schema doesn't include paraId / rsidR / rsidRDefault / rsidP / textId as attrs. The importer reads them off the document-api JSON, then drops them when constructing the PM node. On export, they're rehydrated from the cached JSON, which is why round-trip preservation is high — but anything that wants to use them inside the editor session can't reach them.

What would help

  1. Schema: add paraId, rsidR, rsidRDefault, rsidP, textId to the paragraph node attrs (and/or to a meta attr bag if a flat shape is preferred).
  2. Importer: thread the values from the document-api JSON into PM attrs at import time.
  3. Exporter: prefer the PM attr value over the cached JSON value at export time, so PM is the source of truth once loaded; only fall back to the cached JSON if a paragraph was newly created in PM and has no paraId yet.
  4. Optionally: on PM paragraph split, decide whether to inherit the paraId on one side and mint a new one on the other (Word's behaviour) or to clear both and let export remint. Either is fine, as long as the chosen behaviour is documented.
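
To make the four steps above concrete, here is a rough sketch of the data flow using plain objects in place of real PM nodes. Everything in it is hypothetical: ID_ATTRS, importIds, exportIds, and splitIds are illustrative names, the flat attribute-map shape for the document-api JSON is an assumption, and which half of a split keeps the original paraId is an arbitrary choice for the example.

```typescript
// Step 1: the attrs the paragraph schema would declare, mapped to their
// OOXML attribute names.
const ID_ATTRS = {
  paraId: "w14:paraId",
  textId: "w14:textId",
  rsidR: "w:rsidR",
  rsidRDefault: "w:rsidRDefault",
  rsidP: "w:rsidP",
} as const;

type IdKey = keyof typeof ID_ATTRS;
type ParagraphIds = Record<IdKey, string | null>;
const ID_KEYS = Object.keys(ID_ATTRS) as IdKey[];

function emptyIds(): ParagraphIds {
  return { paraId: null, textId: null, rsidR: null, rsidRDefault: null, rsidP: null };
}

// Step 2: at import, copy the IDs from the paragraph's XML attributes
// (assumed here to be a flat name-to-value map) into the node attrs.
function importIds(xmlAttrs: Record<string, string | undefined>): ParagraphIds {
  const ids = emptyIds();
  for (const key of ID_KEYS) ids[key] = xmlAttrs[ID_ATTRS[key]] ?? null;
  return ids;
}

// Step 3: at export, the PM attr wins; the cached JSON value is used only
// when the PM attr is missing.
function exportIds(pmAttrs: ParagraphIds, cached: ParagraphIds | null): ParagraphIds {
  const out = emptyIds();
  for (const key of ID_KEYS) out[key] = pmAttrs[key] ?? (cached ? cached[key] : null);
  return out;
}

// Step 4, first option: one half keeps the paraId and the other gets a freshly
// minted one. Here the leading half keeps it; the other IDs are copied as-is.
function splitIds(ids: ParagraphIds, mint: () => string): [ParagraphIds, ParagraphIds] {
  return [{ ...ids }, { ...ids, paraId: mint() }];
}
```

Passing the minting function in as a parameter keeps the ID-generation policy (format, uniqueness checks) out of the split logic, so either split behaviour can be swapped in without touching import or export.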
