Enhancement: DOCX → ODF (.docx → .odt) conversion #331

@stevenobiajulu

Summary

Add a capability to convert Microsoft Word .docx documents into OpenDocument Text .odt, building on the now-merged @usejunior/docx-core (DOCX model) and @usejunior/odf-core (ODT packaging + document view, landed in #328). This is the natural bridge between the two backends and is directly motivated by Germany's ODF mandate: organizations holding .docx will need faithful .odt equivalents.

Motivation

  • Germany's IT-Planungsrat ODF mandate creates demand for converting existing .docx corpora to .odt.
  • We already parse .docx into a structured model (docx-core) and write valid .odt packages (odf-core). A conversion path closes the loop and is a strong SEO/positioning surface ("convert DOCX to ODF", "edit ODT with AI agents").
  • During the post-merge smoke test for #328 (feat(odf-core): add ODF .odt core library (archive, view, replace)), we converted a real NVCA .docx to .odt via LibreOffice and round-tripped the result through odf-core. That proved the shape works, but it relied on an external binary (see approach trade-offs below).

Scope (proposed)

  • In: .docx (WordprocessingML text documents) → .odt (OpenDocument Text). Body paragraphs, headings, basic run formatting (bold/italic/underline), lists, and tables.
  • Out (defer): .ods/.odp; tracked changes / comments fidelity; headers/footers/footnotes fidelity; pixel-faithful styling. Conversion is explicitly semantic, not byte- or layout-perfect (mirrors the existing export tooling's "intentionally lossy" stance).
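
To make the "semantic, not layout-perfect" scope concrete, here is a minimal sketch of how Phase 1 content (headings, paragraphs, bold/italic/underline runs) might be serialized into ODF body markup. The `Run` and `Block` types, the helper names, and the `RunB`/`RunBI`-style style names are hypothetical and are not the actual docx-core or odf-core APIs; only the `text:h`, `text:p`, `text:span`, `text:outline-level`, and `text:style-name` names come from the ODF vocabulary.

```typescript
// Hypothetical, simplified block model for illustration only.
// These are NOT the real docx-core / odf-core types.
interface Run {
  text: string;
  bold?: boolean;
  italic?: boolean;
  underline?: boolean;
}

type Block =
  | { kind: "heading"; level: number; runs: Run[] }
  | { kind: "paragraph"; runs: Run[] };

// Escape characters that are unsafe in XML text and attribute values.
function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Assumed naming scheme for automatic text styles (e.g. "RunBI").
// A real converter would also have to declare matching style
// definitions in the automatic styles of content.xml.
function runStyleName(run: Run): string | undefined {
  const flags =
    (run.bold ? "B" : "") + (run.italic ? "I" : "") + (run.underline ? "U" : "");
  return flags ? `Run${flags}` : undefined;
}

// Unformatted runs become plain text; formatted runs become a text:span.
function runToOdf(run: Run): string {
  const text = escapeXml(run.text);
  const style = runStyleName(run);
  return style
    ? `<text:span text:style-name="${style}">${text}</text:span>`
    : text;
}

// Headings map to text:h with an outline level, everything else to text:p.
function blockToOdf(block: Block): string {
  const inner = block.runs.map(runToOdf).join("");
  if (block.kind === "heading") {
    return `<text:h text:outline-level="${block.level}">${inner}</text:h>`;
  }
  return `<text:p>${inner}</text:p>`;
}
```

The sketch deliberately ignores everything listed as out of scope, and it does not handle ODF whitespace rules (runs of spaces, tabs, line breaks), which a real mapping would need to address.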

Approaches & trade-offs (to settle in design)

  1. Native model-to-model mapping (docx-core DOM → odf-core document model → content.xml + styles.xml).
    • ✅ No external runtime dependency — consistent with the repo convention of a Node/TypeScript-only runtime (no LibreOffice/Aspose/Python at runtime).
    • ✅ Deterministic, testable, embeddable in the MCP server.
    • ❌ Larger effort; formatting/style mapping is broad; initial fidelity will be partial.
  2. Shell out to LibreOffice headless (soffice --convert-to odt).
    • ✅ High fidelity immediately; trivial to implement.
    • ❌ Heavy runtime dependency that violates the Node/TS-only runtime convention; not viable for the local-first MCP distribution. Acceptable only as a dev/CI reference oracle for differential testing, not as the shipped path.

Recommendation to evaluate in design: native mapping for the shipped capability, with LibreOffice used purely as a test oracle (convert the same .docx both ways and diff visible text / structure).
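
One possible shape for the oracle comparison, as a sketch only: convert the same .docx natively and via LibreOffice in dev/CI, extract the visible paragraph text from both .odt outputs, and compare after whitespace normalization. The function names and the `Mismatch` type below are hypothetical; how paragraph text is extracted from each .odt is left out and would depend on odf-core's document view. The `--headless`, `--convert-to`, and `--outdir` flags are standard soffice options.

```typescript
// Argument list for a LibreOffice headless conversion.
// Intended for dev/CI oracle runs only, never the shipped code path.
function sofficeArgs(inputPath: string, outDir: string): string[] {
  return ["--headless", "--convert-to", "odt", "--outdir", outDir, inputPath];
}

// Collapse whitespace so cosmetic spacing differences do not count as diffs.
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

interface Mismatch {
  index: number;
  native: string | undefined;
  oracle: string | undefined;
}

// Compare visible paragraph text position by position.
// Empty paragraphs are dropped on both sides before comparing.
function diffParagraphs(native: string[], oracle: string[]): Mismatch[] {
  const a = native.map(normalize).filter((t) => t.length > 0);
  const b = oracle.map(normalize).filter((t) => t.length > 0);
  const mismatches: Mismatch[] = [];
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    if (a[i] !== b[i]) {
      mismatches.push({ index: i, native: a[i], oracle: b[i] });
    }
  }
  return mismatches;
}
```

A positional comparison is the simplest option, but one dropped paragraph shifts every later index; a sequence-alignment diff would likely give more useful reports on a real corpus.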

Suggested process

  • New capability → start with an OpenSpec change (add-docx-to-odf-conversion) before implementation, per repo convention.
  • Phase it: (1) text + headings + basic runs; (2) lists + tables; (3) richer styles. Gate each phase on a real-document corpus (e.g. the bundled NVCA / ILPA fixtures) with a LibreOffice-oracle differential test.
  • Keep ODF packages marked private: true until they clear a real publish-readiness gate (the release-isolation guard from #326, ci(release): add ODF release-isolation guard and OpenSpec proposal).

Acceptance criteria (Phase 1)

  • A documented entry point converts a real .docx to a valid .odt (opens cleanly in LibreOffice).
  • Visible text and paragraph/heading structure are preserved (differential check vs. a LibreOffice-converted reference).
  • Produced .odt satisfies odf-core's packaging rules (mimetype first + uncompressed; validateOdfArchiveSafety passes).
  • No new runtime dependency on LibreOffice in the shipped code path.
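
As an illustration of the "mimetype first + uncompressed" criterion, the sketch below inspects the first ZIP local file header of an .odt archive directly. It is a standalone check written for this issue under stated assumptions, not a description of what validateOdfArchiveSafety does; the function name `checkMimetypeEntry` is hypothetical. The header offsets follow the ZIP local file header layout (signature at 0, compression method at 8, name length at 26, extra-field length at 28, name at 30).

```typescript
const ODT_MIME = "application/vnd.oasis.opendocument.text";

// Returns a list of problems found in the first archive entry.
// An empty list means the mimetype entry looks well-formed.
function checkMimetypeEntry(zip: Uint8Array): string[] {
  if (zip.length < 30) {
    return ["archive too short to contain a local file header"];
  }
  const view = new DataView(zip.buffer, zip.byteOffset, zip.byteLength);
  // Local file header signature "PK\x03\x04", read little-endian.
  if (view.getUint32(0, true) !== 0x04034b50) {
    return ["archive does not start with a ZIP local file header"];
  }

  const problems: string[] = [];
  const method = view.getUint16(8, true); // 0 = stored (uncompressed)
  const nameLen = view.getUint16(26, true);
  const extraLen = view.getUint16(28, true);
  const decoder = new TextDecoder(); // UTF-8 by default

  const name = decoder.decode(zip.subarray(30, 30 + nameLen));
  if (name !== "mimetype") {
    problems.push(`first entry is "${name}", expected "mimetype"`);
  }
  if (method !== 0) {
    problems.push(`mimetype entry uses compression method ${method}, expected 0`);
  }

  // Prefix check only: confirms the entry data starts with the ODT media type.
  const dataStart = 30 + nameLen + extraLen;
  const data = decoder.decode(zip.subarray(dataStart, dataStart + ODT_MIME.length));
  if (data !== ODT_MIME) {
    problems.push("mimetype entry does not start with the ODT media type");
  }
  return problems;
}
```

In a Phase 1 test this could run against the bytes of the produced file (for example, the result of `fs.readFileSync`) alongside the odf-core validation, so a packaging regression fails with a specific message.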
