diff --git a/openspec/changes/add-lean-ts-helper-differential-harness/specs/docx-comparison/spec.md b/openspec/changes/add-lean-ts-helper-differential-harness/specs/docx-comparison/spec.md index 4547ad9..679ea3b 100644 --- a/openspec/changes/add-lean-ts-helper-differential-harness/specs/docx-comparison/spec.md +++ b/openspec/changes/add-lean-ts-helper-differential-harness/specs/docx-comparison/spec.md @@ -12,10 +12,10 @@ A TypeScript property test (`packages/docx-core/src/integration/lean-differentia - render each generated `Doc` both to the Lean JSON encoding and, via a **`Doc`→`document.xml` adapter**, to a real OOXML `document.xml` string parseable by the engine's `@xmldom/xmldom` path; - run the TS helpers in-process per case and spawn the Lean executable **once per memory-bounded chunk** of the batch; - compare `accept`/`reject` outputs on a **canonical token projection** that both the Lean output `Doc` and the TS output XML reduce to deterministically (paragraph/run/wrapper/atom tokens in document order), and compare `validate` as a boolean, asserting strict per-case equality; -- assert the known out-of-subset model cases explicitly as fixed cases rather than hiding them: `fldChar` inside `del` (G1) and `delInstrText` outside `del` (G2) — now **closed** to agreement by `add-lean-deleted-field-code-constraint` — reject of an `ins`-only untracked-mark paragraph (G4) — now **closed** to agreement by the engine fidelity fix `make-reject-paragraph-collapse-mark-based` (mark-based reject) — and accept of an `ins`-wrappered collapsing paragraph (G3) — now **closed** to agreement by the Lean fidelity fix `broaden-lean-accept-keep-empty-paragraphs` (Lean `accept` broadened to keep empties) — plus the one remaining characterized gap, accept of a `del`-only untracked-mark paragraph (G5), a symmetric ENGINE accept-side over-deletion surfaced by broadening Lean `accept`; +- assert the known out-of-subset model cases explicitly as fixed cases rather than hiding them: `fldChar` inside `del` (G1) and `delInstrText` outside `del` (G2) — now **closed** to agreement by `add-lean-deleted-field-code-constraint` — reject of an `ins`-only untracked-mark paragraph (G4) — now **closed** to agreement by the engine fidelity fix `make-reject-paragraph-collapse-mark-based` (mark-based reject) — and accept of an `ins`-wrappered collapsing paragraph (G3) — now **closed** to agreement by the Lean fidelity fix `broaden-lean-accept-keep-empty-paragraphs` (Lean `accept` broadened to keep empties) — and accept of a `del`-only untracked-mark paragraph (G5) — now **closed** to agreement by the engine fidelity fix `make-accept-paragraph-collapse-mark-based` (mark-based accept, the symmetric mirror of the G4 reject fix); - **skip** with a clear message when the Lean executable is absent (so a developer without the Lean toolchain still gets a green `npm test`), while CI builds the executable so the comparison runs there. -The harness SHALL assert **strict** agreement on the faithful subset by default; any in-subset divergence is a genuine finding, NOT a reason to weaken the assertion. The out-of-subset cases SHALL be asserted explicitly: G1/G2 as agreement (the DeletedFieldCode locality constraint is modeled), G4 as agreement (reject is now mark-based, an engine fidelity fix), G3 as agreement (Lean `accept` is now broadened to keep empties, a Lean fidelity fix), and the remaining gap G5 as a documented divergence (a characterization case — the symmetric accept-side engine over-deletion), forming the worklist for the next engine-fidelity increment. This requirement strengthens extensional-equivalence evidence between the existing Lean and TS helpers only; it introduces no production-engine change and modifies no proved Lean module. +The harness SHALL assert **strict** agreement on the faithful subset by default; any in-subset divergence is a genuine finding, NOT a reason to weaken the assertion. The out-of-subset cases SHALL be asserted explicitly: G1/G2 as agreement (the DeletedFieldCode locality constraint is modeled), G4 as agreement (reject is now mark-based, an engine fidelity fix), G3 as agreement (Lean `accept` is now broadened to keep empties, a Lean fidelity fix), and G5 as agreement (accept is now mark-based, an engine fidelity fix — the symmetric mirror of G4). With G5 closed, every characterized G-case (G1–G5) agrees. This requirement strengthens extensional-equivalence evidence between the existing Lean and TS helpers only; it introduces no production-engine change and modifies no proved Lean module. #### Scenario: [LEAN-HELP-01] Compiled Lean accept/reject/validate match the TS engine on generated docs in the faithful subset @@ -49,10 +49,10 @@ The harness SHALL assert **strict** agreement on the faithful subset by default; - **WHEN** the harness runs the fixed [G4] `Doc` with an `ins`-only paragraph whose paragraph mark is untracked (no surviving content) - **THEN** the Lean `reject` and the TS `rejectAllChanges` both keep the collapsed paragraph as an empty ``, asserted via the token projection as agreement — the TS engine's reject is now purely mark-based (`make-reject-paragraph-collapse-mark-based`), matching the already-faithful Lean `reject` -#### Scenario: [LEAN-HELP-08] G5 — accept of a del-only paragraph is a characterized divergence +#### Scenario: [LEAN-HELP-08] G5 — accept of a del-only paragraph agrees (engine accept made mark-based) - **WHEN** the harness runs the fixed [G5] `Doc` with a paragraph whose only content is a `w:del` wrapping deleted content and whose paragraph mark is untracked -- **THEN** the Lean `accept` keeps the collapsed paragraph as an empty `` (faithful to LibreOffice/Word) while the TS `acceptAllChanges` drops it via a content-based heuristic, asserted via the token projection as a documented divergence — the symmetric accept-side analog of the pre-`make-reject-paragraph-collapse-mark-based` G4 engine over-deletion, surfaced by broadening Lean `accept`, with the TS accept-side mark-based fix as the deferred successor increment +- **THEN** the Lean `accept` and the TS `acceptAllChanges` both keep the collapsed paragraph as an empty ``, asserted via the token projection as agreement — the TS engine's accept is now purely mark-based (`make-accept-paragraph-collapse-mark-based`), matching the already-faithful Lean `accept` (the symmetric accept-side analog of the G4 engine over-deletion fixed in `make-reject-paragraph-collapse-mark-based`) #### Scenario: [LEAN-HELP-07] A real divergence is caught, not masked diff --git a/openspec/changes/broaden-lean-accept-keep-empty-paragraphs/specs/docx-comparison/spec.md b/openspec/changes/broaden-lean-accept-keep-empty-paragraphs/specs/docx-comparison/spec.md index 56e1881..c2e91ff 100644 --- a/openspec/changes/broaden-lean-accept-keep-empty-paragraphs/specs/docx-comparison/spec.md +++ b/openspec/changes/broaden-lean-accept-keep-empty-paragraphs/specs/docx-comparison/spec.md @@ -25,8 +25,9 @@ and Lean `accept` SHALL now agree on a `w:ins`-wrappered collapsing paragraph. B surface and the harness SHALL pin a new characterized divergence `G5`: a `del`-only untracked-mark paragraph accepts to an empty `` in Lean (faithful), but the TS `acceptAllChanges` over-deletes it via a content-based heuristic — the symmetric accept-side analog of the reject over-deletion fixed in -`make-reject-paragraph-collapse-mark-based`, whose TS accept-side mark-based fix is the deferred successor -increment. This requirement modifies the proved Lean model only; it introduces no production-engine change. +`make-reject-paragraph-collapse-mark-based`. (That TS accept-side mark-based fix is subsequently delivered by +the successor change `make-accept-paragraph-collapse-mark-based`, closing G5 to agreement.) This requirement +modifies the proved Lean model only; it introduces no production-engine change. #### Scenario: [ACCEPT-KEEP-01] Ins-wrappered collapsing paragraph is kept as an empty paragraph on accept @@ -46,11 +47,11 @@ increment. This requirement modifies the proved Lean model only; it introduces n - **WHEN** the Tier 2 modules are built - **THEN** `field_structure_preserved_doc` and `inv_field_001` compile with no statement change, the spike is zero-`sorry`, and `#print axioms field_structure_preserved_doc` reports no new axioms -#### Scenario: [ACCEPT-KEEP-04] Broadening surfaces and pins the symmetric engine accept-side gap (G5) +#### Scenario: [ACCEPT-KEEP-04] Broadening surfaces the symmetric engine accept-side gap (G5), since closed - **GIVEN** a `del`-only paragraph whose paragraph mark is untracked, followed by a surviving paragraph - **WHEN** both engines accept it -- **THEN** the Lean `accept` keeps an empty `` (faithful to LibreOffice/Word) while the TS `acceptAllChanges` drops it, asserted via the harness token projection as the characterized divergence `[LEAN-HELP-08]` (G5), the deferred engine accept-side fidelity fix +- **THEN** the Lean `accept` keeps an empty `` (faithful to LibreOffice/Word); broadening `accept` surfaces and the harness pins this as `[LEAN-HELP-08]` (G5) — the engine accept-side over-deletion that the successor change `make-accept-paragraph-collapse-mark-based` subsequently closes to agreement (TS `acceptAllChanges` made mark-based) ## MODIFIED Requirements diff --git a/openspec/changes/make-accept-paragraph-collapse-mark-based/proposal.md b/openspec/changes/make-accept-paragraph-collapse-mark-based/proposal.md new file mode 100644 index 0000000..a07f98f --- /dev/null +++ b/openspec/changes/make-accept-paragraph-collapse-mark-based/proposal.md @@ -0,0 +1,50 @@ +# Change: Make Accept All paragraph removal purely mark-based (closes G5, oracle-validated) + +## Why + +The Lean↔TS helper differential (`add-lean-ts-helper-differential-harness`) pinned **G5**: accepting a +paragraph whose only content is inside `w:del` (with an **untracked** paragraph mark) — the Lean `accept` +keeps an empty ``, the TS `acceptAllChanges` drops the whole paragraph. G5 is the exact accept-side +mirror of **G4** (closed by `make-reject-paragraph-collapse-mark-based`): an **engine over-deletion**, not a +Lean-model gap. + +- **LibreOffice / Word / OOXML semantics:** `` inside `` (PPR-DEL) marks the paragraph + **mark** as deleted. A run-level deletion under an *untracked* mark means text was deleted from a + **pre-existing** paragraph, so accepting the deletion restores that (empty) paragraph. Only when the mark + itself is `PPR-DEL` should accept remove the whole paragraph. LibreOffice (driven headless) keeps the empty + paragraph for `del`-only and `moveFrom`-only untracked-mark cases, and drops the `PPR-DEL`-marked control — + the same shape already confirmed for the reject side in `make-reject-paragraph-collapse-mark-based`. +- The Lean `accept`, broadened by `broaden-lean-accept-keep-empty-paragraphs` (G3, #340) to never drop, was + already faithful — it keeps the empty paragraph. G5 is the TS engine catching up to it. + +The engine drops G5 via a content-based heuristic ("all content is inside `w:del`/`w:moveFrom`") on **both** +accept paths. `wrapParagraphAsDeleted` already always emits `PPR-DEL`, so making accept purely **mark-based** +(drop iff the paragraph mark is `PPR-DEL`) is Word/LibreOffice-faithful, keeps safe-docx's own delete→accept +round-trip safe, and closes G5 by fixing the engine — exactly symmetric to the #337 reject fix. + +## What Changes + +- `acceptAllChanges` (`packages/docx-core/src/baselines/atomizer/trackChangesAcceptorAst.ts`) and the + primitive `acceptChanges` (`packages/docx-core/src/primitives/accept_changes.ts`, + `paragraphHasOnlyRemovedContent`) SHALL remove a paragraph on accept **iff its paragraph mark is `PPR-DEL`** + — the content-based all-`w:del`/`w:moveFrom` drop heuristic is removed from **both** accept paths in + lockstep (they are used by different public APIs and SHALL agree). +- Flip the G5 case in `lean-differential-helpers.test.ts` (`[LEAN-HELP-08]`) from *characterized divergence* + to *agreement*; update the file header (G5 closed, no KNOWN gap remains). Add a targeted regression test + exercising **both** accept entry points on the four shape classes (PPR-DEL drop, del-only keep, moveFrom-only + keep, pPrChange-snapshot ignore), asserting the two paths agree. + +## Impact + +- Affected specs: `docx-comparison` (ADDED: one requirement; the pending helper-differential change's G5 + scenario and the pending Lean-broaden change's G5 scenario are revised in place to agreement since neither + is archived). +- Affected code: `packages/docx-core/src/baselines/atomizer/trackChangesAcceptorAst.ts`, + `packages/docx-core/src/primitives/accept_changes.ts`, and tests + (`trackChangesAcceptorAst.test.ts`, `integration/lean-differential-helpers.test.ts`). +- **Behavior change** (accept of foreign / mark-omitting documents): a paragraph whose runs are deleted + under an untracked mark now survives accept as an empty paragraph instead of being dropped — matching + Word/LibreOffice. safe-docx's own deleted paragraphs always carry `PPR-DEL`, so their delete→accept + round-trip is unchanged (validated: full docx-core suite 1347 passed / 3 skipped, no regression). +- With G5 closed, every characterized G-case (G1–G5) of the Lean↔TS helper differential agrees; no KNOWN gap + remains in that harness. diff --git a/openspec/changes/make-accept-paragraph-collapse-mark-based/specs/docx-comparison/spec.md b/openspec/changes/make-accept-paragraph-collapse-mark-based/specs/docx-comparison/spec.md new file mode 100644 index 0000000..48da289 --- /dev/null +++ b/openspec/changes/make-accept-paragraph-collapse-mark-based/specs/docx-comparison/spec.md @@ -0,0 +1,48 @@ +## ADDED Requirements + +### Requirement: Accept All removes paragraphs by the paragraph-mark deletion marker, not by content + +The production Accept All SHALL remove a paragraph during accept **if and only if** its paragraph mark is a +tracked deletion (``, "PPR-DEL"). This applies to both accept entry points — +`acceptAllChanges` (`packages/docx-core/src/baselines/atomizer/trackChangesAcceptorAst.ts`) and the primitive +`acceptChanges` (`packages/docx-core/src/primitives/accept_changes.ts`) — which SHALL behave identically. +Accept SHALL NOT remove a paragraph based on a content heuristic (e.g. "all runs live inside +`w:del`/`w:moveFrom`"): a run-level deletion under an **untracked** paragraph mark denotes text deleted from a +pre-existing paragraph, which Microsoft Word and LibreOffice both keep as an empty paragraph on accept; a +content-based drop over-deletes it. The mark check SHALL be a strict direct-child traversal +(`w:p > w:pPr > w:rPr > w:del`), so a `w:del` nested inside a `w:pPrChange` snapshot is NOT mistaken for a +live paragraph mark. + +`wrapParagraphAsDeleted` already always emits the `PPR-DEL` marker, so safe-docx's own genuinely-deleted +paragraphs remain removable by this mark-based rule without the content heuristic. This is the accept-side +mirror of `make-reject-paragraph-collapse-mark-based` (which made Reject All mark-based, closing G4). It +closes the characterized accept-side divergence `G5` recorded by the Lean↔TS helper differential as an +**engine fidelity fix** (the Lean `accept`, broadened by `broaden-lean-accept-keep-empty-paragraphs` to keep +empties, was already faithful): the TS and Lean `accept` SHALL now agree on a `del`-only untracked-mark +paragraph. With G5 closed, every characterized G-case (G1–G5) agrees between the genuine Lean helpers and the +production engine. This requirement changes production accept behavior only for foreign / mark-omitting input; +safe-docx's own deleted paragraphs always carry `PPR-DEL`, so their delete→accept round-trip is preserved. + +#### Scenario: [ACCEPT-MARK-01] Del-only paragraph with an untracked mark survives accept as an empty paragraph + +- **GIVEN** a paragraph whose only content is a `w:del`-wrapped run and whose paragraph mark is NOT `PPR-DEL` (text deleted from a pre-existing paragraph) +- **WHEN** it is run through `acceptAllChanges` (and through the primitive `acceptChanges`) +- **THEN** the deleted run is removed and the now-empty paragraph is kept, matching Word/LibreOffice and the Lean `accept` (the former `G5` divergence is now agreement) + +#### Scenario: [ACCEPT-MARK-02] PPR-DEL-marked paragraph is removed by accept + +- **GIVEN** a paragraph whose paragraph mark is a tracked deletion (``) +- **WHEN** it is run through `acceptAllChanges` (and through the primitive `acceptChanges`) +- **THEN** the whole paragraph is removed, including its mark, matching Word/LibreOffice + +#### Scenario: [ACCEPT-MARK-03] MoveFrom-only paragraph with an untracked mark survives accept as an empty paragraph + +- **GIVEN** a paragraph whose only content is a `w:moveFrom`-wrapped run (the move source) and whose paragraph mark is untracked +- **WHEN** it is run through `acceptAllChanges` (and through the primitive `acceptChanges`) +- **THEN** the moved-away content is removed and the now-empty paragraph is kept (the symmetric `moveFrom` content heuristic is removed in lockstep with the `w:del` one) + +#### Scenario: [ACCEPT-MARK-04] A pPrChange snapshot's nested w:del is not a live paragraph mark + +- **GIVEN** a surviving paragraph whose `w:pPrChange` snapshot nests a `w:del` inside its stored `w:rPr` +- **WHEN** it is run through `acceptAllChanges` (and through the primitive `acceptChanges`) +- **THEN** the paragraph is kept (the strict direct-child mark check ignores the snapshot del), so both accept paths agree diff --git a/openspec/changes/make-accept-paragraph-collapse-mark-based/tasks.md b/openspec/changes/make-accept-paragraph-collapse-mark-based/tasks.md new file mode 100644 index 0000000..f718300 --- /dev/null +++ b/openspec/changes/make-accept-paragraph-collapse-mark-based/tasks.md @@ -0,0 +1,42 @@ +## 1. Engine: mark-based accept (both paths, in lockstep) + +- [x] 1.1 `acceptAllChanges` (`trackChangesAcceptorAst.ts`): remove the content-based all-`w:del`/`w:moveFrom` + paragraph-drop loops; keep only the `paragraphHasParaMarker(p, 'w:del')` mark-based drop. Replace the + stale lead comment with the full mark-based rationale (mirrors `rejectAllChanges`). +- [x] 1.2 `acceptChanges` (`primitives/accept_changes.ts`): remove `paragraphHasOnlyRemovedContent` and its + drop branch; keep only the `paragraphHasParaMarker(p, 'del')` mark-based drop. +- [x] 1.3 Remove the now-dead helpers/constants left behind (`containsRun`, `isWElement`, + `REMOVED_LOCALS`/`KEPT_LOCALS`/`RANGE_MARKER_LOCALS` in `accept_changes.ts`), grep-confirmed unreferenced. + `tsc --noEmit` clean. + +## 2. Tests + +- [x] 2.1 `lean-differential-helpers.test.ts`: flip `[LEAN-HELP-08]` (G5) from documented divergence to strict + agreement (both keep an empty `P[ ]`); update the file header (G5 closed, no KNOWN gap remains). +- [x] 2.2 `trackChangesAcceptorAst.test.ts`: add a targeted regression test running BOTH accept entry points + (`acceptAllChanges` + primitive `acceptChanges`) over four shapes — PPR-DEL drop, del-only keep, + moveFrom-only keep, pPrChange-snapshot ignore — asserting the two paths agree. Count `` opens with + a regex that matches self-closing empties (``). +- [x] 2.3 Full `@usejunior/docx-core` suite green (1347 passed / 3 skipped); helper differential green 8/8 + with the exe. + +## 3. Oracle validation (evidence) + +- [x] 3.1 The accept-side keep-empty behavior is the LibreOffice/Word-faithful one, established by the same + headless-LibreOffice oracle run that grounded the #337 reject fix (an untracked paragraph mark is a + pre-existing paragraph; `del`-only and `moveFrom`-only collapse to an empty ``, while the + `PPR-DEL`-marked control drops). This change ships as a self-validated engine fix via the differential + flip (`[LEAN-HELP-08]`) and the both-paths regression, exactly as #337 shipped its reject mirror without + a committed oracle voter; wiring the LibreOffice voter into the harness remains the deferred PR-B. + +## 4. Specs / docs + +- [x] 4.1 Add the `docx-comparison` ADDED requirement (this change) + scenarios `[ACCEPT-MARK-01..04]`. +- [x] 4.2 Revise the pending `add-lean-ts-helper-differential-harness` G5 scenario (`[LEAN-HELP-08]`) and the + pending `broaden-lean-accept-keep-empty-paragraphs` G5 scenario (`[ACCEPT-KEEP-04]`) from documented + divergence to agreement (closed by this change). +- [x] 4.3 Docs: `verification/ROADMAP.md` records G5 closed as an engine accept-side fidelity fix (the + accept-side mirror of G4/#337); all G1–G5 now agree. (`verification/lean/Tier2/README.md` needs no + change: it documents the Lean proofs, whose `accept`-keeps-every-paragraph note already reflects the + G3 broaden; G5 is a TS-engine fix, not a Lean-model change.) +- [ ] 4.4 Ship: peer-review (codex + agy), open PR, `/automerge-smoke`. diff --git a/packages/docx-core/src/baselines/atomizer/trackChangesAcceptorAst.test.ts b/packages/docx-core/src/baselines/atomizer/trackChangesAcceptorAst.test.ts index 64ef1ff..5cd0fce 100644 --- a/packages/docx-core/src/baselines/atomizer/trackChangesAcceptorAst.test.ts +++ b/packages/docx-core/src/baselines/atomizer/trackChangesAcceptorAst.test.ts @@ -9,7 +9,12 @@ import { compareTexts, } from './trackChangesAcceptorAst.js'; import { parseDocumentXml } from './xmlToWmlElement.js'; -import { findAllByTagName } from '../../primitives/index.js'; +import { + findAllByTagName, + acceptChanges as acceptChangesPrimitive, + parseXml, + serializeXml, +} from '../../primitives/index.js'; const test = testAllure.epic('Document Comparison').withLabels({ feature: 'Track Changes Acceptor' }); @@ -850,3 +855,94 @@ describe('trackChangesAcceptorAst', () => { }); }); }); + +// ── G5 regression: Accept-All paragraph removal is purely mark-based (both accept paths) ── +// +// Closes G5 — the accept-side mirror of #337's reject fix. Both accept entry points — the +// baseline-atomizer `acceptAllChanges` (string→string) and the primitive `acceptChanges` +// (Document, mutated in place) — must drop a paragraph IFF its paragraph MARK is PPR-DEL +// (), never via a content-based heuristic. A run-level deletion +// (or moveFrom) under an UNTRACKED mark means text removed from a pre-existing paragraph, which +// Word/LibreOffice keep (empty) on accept. The two paths must agree on every case. +describe('Accept-All paragraph removal is mark-based (G5, both accept paths agree)', () => { + const wrapBody = (inner: string): string => + `` + + `` + + `${inner}`; + + // Count opens, matching self-closing empties () too; never matches /. + const countParagraphs = (xml: string): number => (xml.match(/)/g) ?? []).length; + + const KEEP_PARA = `keep`; + + // Run a body fragment through BOTH accept entry points (fresh parse for the in-place primitive). + const acceptBoth = (inner: string): { ast: string; primitive: string } => { + const ast = acceptAllChanges(wrapBody(inner)); + const doc = parseXml(wrapBody(inner)); + acceptChangesPrimitive(doc); + return { ast, primitive: serializeXml(doc) }; + }; + + test('PPR-DEL-marked paragraph: both paths DROP it', async ({ when, then }: AllureBddContext) => { + let out: { ast: string; primitive: string }; + await when('both accept paths run on a PPR-DEL-marked paragraph + a plain paragraph', () => { + out = acceptBoth( + `gone` + + KEEP_PARA, + ); + }); + await then('only the survivor paragraph remains, on both paths', () => { + expect(countParagraphs(out.ast)).toBe(1); + expect(countParagraphs(out.primitive)).toBe(1); + expect(out.ast).not.toContain('gone'); + expect(out.primitive).not.toContain('gone'); + }); + }); + + test('del-only paragraph with an untracked mark: both paths KEEP an empty ', async ({ when, then }: AllureBddContext) => { + let out: { ast: string; primitive: string }; + await when('both accept paths run on a del-only untracked-mark paragraph + a plain paragraph', () => { + out = acceptBoth( + `x` + KEEP_PARA, + ); + }); + await then('the now-empty paragraph survives, on both paths', () => { + expect(countParagraphs(out.ast)).toBe(2); + expect(countParagraphs(out.primitive)).toBe(2); + expect(out.ast).not.toContain('w:del'); + expect(out.primitive).not.toContain('w:del'); + }); + }); + + test('moveFrom-only paragraph with an untracked mark: both paths KEEP an empty ', async ({ when, then }: AllureBddContext) => { + let out: { ast: string; primitive: string }; + await when('both accept paths run on a moveFrom-only untracked-mark paragraph + a plain paragraph', () => { + out = acceptBoth( + `moved` + KEEP_PARA, + ); + }); + await then('the now-empty paragraph survives, on both paths', () => { + expect(countParagraphs(out.ast)).toBe(2); + expect(countParagraphs(out.primitive)).toBe(2); + expect(out.ast).not.toContain('w:moveFrom'); + expect(out.primitive).not.toContain('w:moveFrom'); + }); + }); + + test('pPrChange snapshot holding a nested w:del is NOT a live mark: both paths KEEP the paragraph', async ({ when, then }: AllureBddContext) => { + let out: { ast: string; primitive: string }; + await when('both accept paths run on a surviving paragraph whose w:pPrChange snapshot nests a w:del', () => { + out = acceptBoth( + `` + + `` + + `survives`, + ); + }); + await then('the paragraph is kept (the nested snapshot del is ignored), on both paths', () => { + expect(countParagraphs(out.ast)).toBe(1); + expect(countParagraphs(out.primitive)).toBe(1); + expect(out.ast).toContain('survives'); + expect(out.primitive).toContain('survives'); + }); + }); +}); diff --git a/packages/docx-core/src/baselines/atomizer/trackChangesAcceptorAst.ts b/packages/docx-core/src/baselines/atomizer/trackChangesAcceptorAst.ts index 671ae5e..db0d38b 100644 --- a/packages/docx-core/src/baselines/atomizer/trackChangesAcceptorAst.ts +++ b/packages/docx-core/src/baselines/atomizer/trackChangesAcceptorAst.ts @@ -353,108 +353,24 @@ function preserveCrossParagraphBookmarksForReject( export function acceptAllChanges(documentXml: string): string { const root = parseDocumentXml(documentXml); - // First, find paragraphs that ONLY contain w:del or w:moveFrom content (no w:ins, no regular w:r) - // These paragraphs should be removed entirely when accepting + // Remove a paragraph on Accept All iff its paragraph MARK is a tracked deletion + // () — the paragraph break itself was deleted, so accepting + // the deletion removes the whole paragraph. This is the Word/LibreOffice-faithful, purely + // mark-based rule (the accept-side mirror of rejectAllChanges). + // + // We deliberately do NOT drop a paragraph based on content (e.g. "all runs are inside w:del" + // or "w:moveFrom"). A run-level deletion under an UNTRACKED paragraph mark means text was + // deleted from a pre-existing paragraph; Word and LibreOffice both keep that paragraph (empty) + // on accept, and a content-based drop over-deletes it. safe-docx's own deleted paragraphs + // always carry the PPR-DEL mark (wrapParagraphAsDeleted), so the mark-based rule covers them + // without the old content heuristic. const paragraphsToRemove = new Set(); - - // Paragraph-level deletion markers (Aspose/Word encode deleted paragraphs via ) - // should remove the paragraph on Accept All. for (const p of findAllByTagName(root, 'w:p')) { if (paragraphHasParaMarker(p, 'w:del')) { paragraphsToRemove.add(p); } } - // Check w:del elements - for (const del of findAllByTagName(root, 'w:del')) { - // Walk up to find containing w:p - let p: Element | undefined; - let current: Element | undefined = parentElement(del); - while (current) { - if (current.tagName === 'w:p') { - p = current; - break; - } - current = parentElement(current); - } - - if (p) { - // Check if this paragraph has any w:ins elements (should keep those) - const insElements = findAllByTagName(p, 'w:ins'); - if (insElements.length > 0) { - continue; // Paragraph has inserted content, don't remove it - } - - // Check if this paragraph has any w:r elements outside of w:del - // If the only content is inside w:del, we can remove the paragraph - let hasContentOutsideDel = false; - for (const child of childElements(p)) { - if (child.tagName === 'w:r') { - hasContentOutsideDel = true; - break; - } - if (child.tagName !== 'w:del' && child.tagName !== 'w:pPr' && - child.tagName !== 'w:moveFromRangeStart' && child.tagName !== 'w:moveFromRangeEnd') { - // Check if this non-del child has w:r descendants - const runsInChild = findAllByTagName(child, 'w:r'); - if (runsInChild.length > 0) { - hasContentOutsideDel = true; - break; - } - } - } - - if (!hasContentOutsideDel) { - paragraphsToRemove.add(p); - } - } - } - - // Also check w:moveFrom elements (moved-away content, also removed when accepting) - for (const moveFrom of findAllByTagName(root, 'w:moveFrom')) { - // Walk up to find containing w:p - let p: Element | undefined; - let current: Element | undefined = parentElement(moveFrom); - while (current) { - if (current.tagName === 'w:p') { - p = current; - break; - } - current = parentElement(current); - } - - if (p && !paragraphsToRemove.has(p)) { - // Check if this paragraph has any w:ins or w:moveTo elements (should keep those) - const insElements = findAllByTagName(p, 'w:ins'); - const moveToElements = findAllByTagName(p, 'w:moveTo'); - if (insElements.length > 0 || moveToElements.length > 0) { - continue; - } - - // Check if this paragraph has any w:r elements outside of w:del/w:moveFrom - let hasContentOutsideRemoved = false; - for (const child of childElements(p)) { - if (child.tagName === 'w:r') { - hasContentOutsideRemoved = true; - break; - } - if (child.tagName !== 'w:del' && child.tagName !== 'w:moveFrom' && - child.tagName !== 'w:pPr' && - child.tagName !== 'w:moveFromRangeStart' && child.tagName !== 'w:moveFromRangeEnd') { - const runsInChild = findAllByTagName(child, 'w:r'); - if (runsInChild.length > 0) { - hasContentOutsideRemoved = true; - break; - } - } - } - - if (!hasContentOutsideRemoved) { - paragraphsToRemove.add(p); - } - } - } - // Remove w:del elements entirely (deleted content disappears) removeAllByTagName(root, 'w:del'); @@ -480,7 +396,7 @@ export function acceptAllChanges(documentXml: string): string { // Strip paragraph-level markers now that changes are accepted. removeParaMarkers(root); - // Remove paragraphs that ONLY had w:del content (now empty after removal) + // Remove the PPR-DEL-marked paragraphs (their paragraph mark was deleted). for (const p of paragraphsToRemove) { if (p.parentNode) { p.parentNode.removeChild(p); diff --git a/packages/docx-core/src/integration/lean-differential-helpers.test.ts b/packages/docx-core/src/integration/lean-differential-helpers.test.ts index 1c89e2d..d45029a 100644 --- a/packages/docx-core/src/integration/lean-differential-helpers.test.ts +++ b/packages/docx-core/src/integration/lean-differential-helpers.test.ts @@ -42,10 +42,14 @@ * paragraph). Lean `accept` was broadened to never drop, so the two engines now AGREE: * G3 accept of an ins-wrappered → Lean keeps an empty , TS keeps empty (agree) * paragraph that collapses to empty - * One KNOWN gap remains, pinned as a characterization case (an ENGINE accept-side gap, the symmetric - * analog of pre-#337 G4 — surfaced precisely by broadening Lean `accept`): - * G5 accept of a del-only paragraph → Lean keeps an empty , TS DROPS the paragraph - * whose mark is untracked (closes when the TS engine accept path is made mark-based) + * G5 — the accept-side analog of G4 — is also CLOSED, as an ENGINE fidelity fix (the accept-side mirror + * of #337): a del-only paragraph whose paragraph MARK is untracked means text deleted from a pre-existing + * paragraph, which Word/LibreOffice keep (empty) on accept. The TS engine used to drop it via a + * content-based heuristic; accept is now purely mark-based (drop iff the mark is PPR-DEL), so it keeps the + * empty paragraph, matching Lean (which never dropped it): + * G5 accept of a del-only paragraph → Lean keeps an empty , TS keeps empty (agree) + * whose mark is untracked + * No KNOWN gap remains: G1–G5 all AGREE between the genuine Lean helpers and the production engine. * * Gating: when the executable is absent (a developer without the Lean toolchain, or an * un-built `.lake`), the suite is SKIPPED with a clear message so `npm test` stays green; @@ -470,12 +474,11 @@ const G4_DOC: WireDoc = [ { body: [{ run: { content: [{ text: 'keep' }] } }] }, ]; /** - * G5: a del-only paragraph whose mark is untracked (accept side). The SYMMETRIC analog of - * pre-#337 G4 — broadening Lean `accept` to keep empties surfaces it: Lean now keeps the - * collapsed paragraph as an empty , while the TS engine accept path still drops a - * del-only paragraph via a content-based heuristic (the accept-side mirror of the reject - * over-deletion fixed in #337). LibreOffice/Word keep the empty paragraph, so the TS accept - * path is the one that needs the (deferred) mark-based fix; Lean is already faithful. + * G5: a del-only paragraph whose mark is untracked (accept side). The accept-side mirror of + * #337's reject fix: the paragraph mark is untracked, so accepting (dropping the del) leaves an + * empty — which LibreOffice/Word keep. The TS engine accept path is now purely mark-based + * (drop iff the mark is PPR-DEL), so it keeps the empty paragraph, matching Lean (which never + * dropped it). Both engines now AGREE. */ const G5_DOC: WireDoc = [ { body: [{ del: [{ run: { content: [{ delText: 'x' }] } }] }] }, @@ -649,8 +652,8 @@ describeMaybe('Lean Differential Harness - Tier 2 helper extensional equivalence }, ); - test.openspec('[LEAN-HELP-08] G5 — accept paragraph-collapse for a del-only paragraph is a characterized divergence')( - 'accept of a del-only untracked-mark paragraph: Lean keeps an empty , the TS engine drops it', + test.openspec('[LEAN-HELP-08] G5 — accept paragraph-collapse for a del-only paragraph agrees (engine accept made mark-based)')( + 'accept of a del-only untracked-mark paragraph: Lean and the TS engine both keep an empty ', async ({ given, when, then }: AllureBddContext) => { let lean: HelperResult; await given('a doc whose first paragraph is only a w:del (untracked paragraph mark)', async () => { @@ -662,18 +665,17 @@ describeMaybe('Lean Differential Harness - Tier 2 helper extensional equivalence tsAccept = xmlToTokens(acceptAllChanges(renderDocToXml(G5_DOC))); leanAccept = docToTokens(lean!.accept); }); - await then('Lean keeps the empty paragraph (faithful) while the TS engine over-deletes it', async () => { - // SYMMETRIC analog of pre-#337 G4, surfaced by broadening Lean `accept`: the paragraph - // mark is untracked, so accepting (dropping the del) leaves an empty — which - // LibreOffice and Word keep. Lean now keeps it; the TS engine accept path still drops a - // del-only paragraph via a content-based heuristic (the accept-side mirror of the reject - // over-deletion fixed in #337). The engine accept fix is the deferred follow-up; this case - // pins the gap so it stays visible rather than silently passing. + await then('both keep the empty first paragraph, so the projections agree', async () => { + // Accept-side mirror of #337: the paragraph mark is untracked, so accepting (dropping the + // del) leaves an empty — which LibreOffice and Word keep. Lean never dropped it, and + // the TS engine accept path is now purely mark-based (drop iff the mark is PPR-DEL), so it + // keeps the empty paragraph too. The content-based accept heuristic that used to over-delete + // this paragraph is gone; the two engines now AGREE. expect(leanAccept).toEqual(['P[', ']', 'P[', 'R[', 't:keep', ']', ']']); - // TS over-deletes the empty first paragraph: only the survivor remains. - expect(tsAccept).toEqual(['P[', 'R[', 't:keep', ']', ']']); - expect(key(tsAccept)).not.toBe(key(leanAccept)); - expect(tsAccept.filter((t) => t === 'P[').length).toBe(1); + expect(tsAccept).toEqual(['P[', ']', 'P[', 'R[', 't:keep', ']', ']']); + expect(key(tsAccept)).toBe(key(leanAccept)); + // The empty first paragraph survives: two paragraph-open tokens. + expect(tsAccept.filter((t) => t === 'P[').length).toBe(2); }); }, ); diff --git a/packages/docx-core/src/primitives/accept_changes.ts b/packages/docx-core/src/primitives/accept_changes.ts index 781206e..d029f8a 100644 --- a/packages/docx-core/src/primitives/accept_changes.ts +++ b/packages/docx-core/src/primitives/accept_changes.ts @@ -35,10 +35,6 @@ function isW(node: Node, localName: string): node is Element { ); } -function isWElement(node: Node): node is Element { - return node.nodeType === 1 && (node as Element).namespaceURI === W_NS; -} - function getDepth(node: Node): number { let depth = 0; let cur: Node | null = node.parentNode; @@ -103,60 +99,6 @@ function paragraphHasParaMarker(p: Element, markerLocalName: string): boolean { return false; } -/** - * Check if a node (or its descendants) contains any w:r elements. - */ -function containsRun(node: Node): boolean { - if (isW(node, 'r')) return true; - for (let i = 0; i < node.childNodes.length; i++) { - if (containsRun(node.childNodes[i]!)) return true; - } - return false; -} - -// Tags that are "removed" content — should not count as live content -const REMOVED_LOCALS = new Set(['del', 'moveFrom']); -// Tags that are "kept" content — their presence means the paragraph has live content -const KEPT_LOCALS = new Set(['ins', 'moveTo']); -// Range marker tags to ignore when scanning paragraph children -const RANGE_MARKER_LOCALS = new Set([ - 'moveFromRangeStart', 'moveFromRangeEnd', - 'moveToRangeStart', 'moveToRangeEnd', -]); - -/** - * Determine if a paragraph's only content lives inside w:del or w:moveFrom - * (no w:r outside those wrappers, and no w:ins/w:moveTo siblings). - */ -function paragraphHasOnlyRemovedContent(p: Element): boolean { - for (let i = 0; i < p.childNodes.length; i++) { - const child = p.childNodes[i]!; - if (child.nodeType !== 1) continue; - const el = child as Element; - if (el.namespaceURI !== W_NS) continue; - const local = el.localName; - - // If the paragraph has kept content wrappers, it stays - if (KEPT_LOCALS.has(local)) return false; - - // Skip pPr, range markers, and removed wrappers — they don't contribute live content - if (local === 'pPr' || RANGE_MARKER_LOCALS.has(local) || REMOVED_LOCALS.has(local)) continue; - - // A bare w:r or any other element that contains runs means live content - if (local === 'r' || containsRun(el)) return false; - } - - // We passed all children without finding live content — but there must be - // at least one removed wrapper for this to be a "removed content" paragraph - for (let i = 0; i < p.childNodes.length; i++) { - const child = p.childNodes[i]!; - if (child.nodeType === 1 && isWElement(child) && REMOVED_LOCALS.has(child.localName)) { - return true; - } - } - return false; -} - // Property change element local names (all 6 types) const PR_CHANGE_LOCALS = [ 'rPrChange', 'pPrChange', 'sectPrChange', @@ -183,14 +125,16 @@ export function acceptChanges(doc: Document): AcceptChangesResult { const allParagraphs = collectByLocalName(root, 'p'); for (const p of allParagraphs) { - // Paragraph-level deletion marker: w:p > w:pPr > w:rPr > w:del + // Remove a paragraph iff its paragraph MARK is a tracked deletion + // (w:p > w:pPr > w:rPr > w:del) — the paragraph break itself was deleted. + // We deliberately do NOT drop a paragraph based on content ("all runs inside + // w:del/w:moveFrom"): a run-level deletion under an untracked mark means text was + // deleted from a pre-existing paragraph, which Word/LibreOffice keep (empty) on + // accept. safe-docx's deleted paragraphs always carry the mark now, so the + // mark-based rule suffices and is Word-faithful. (Mirrors acceptAllChanges and the + // reject-side rule.) if (paragraphHasParaMarker(p, 'del')) { paragraphsToRemove.add(p); - continue; - } - // Paragraphs whose only content is inside w:del or w:moveFrom - if (paragraphHasOnlyRemovedContent(p)) { - paragraphsToRemove.add(p); } } diff --git a/verification/ROADMAP.md b/verification/ROADMAP.md index e50fefb..ace9d84 100644 --- a/verification/ROADMAP.md +++ b/verification/ROADMAP.md @@ -149,13 +149,13 @@ This tier sits between "the Lean model is sound" and "the Lean model is faithful - **Extensional equivalence LCS Lean ↔ TS DP**: the previously un-reproducible "1.19M cases, zero divergence" brute-force is now a **reproducible, in-CI executable differential harness** (`add-lean-ts-lcs-differential-harness`). The genuine `LeanSpike.computeAtomLcs` is compiled to the `leanDifferential` exe (`verification/lean/Differential.lean`) and run against the production TS `computeAtomLcs` over shared generated inputs by `packages/docx-core/src/integration/lean-differential-lcs.test.ts`; the exhaustive length-≤6 / 3-symbol sweep (1,194,649 pairs, zero divergence) runs in the `lean-build` workflow. The *formal* closure is now **landed** (`add-lean-ts-lcs-dp-equivalence`): `LeanSpike/LcsDP.lean` defines a functional Wagner-Fischer DP — a length recurrence `lcsLen` (`dp[i][j]`) plus a backtracker `dpMatches` — and proves it produces a **byte-identical** `LcsResult` to the recursive `computeAtomLcs` on every input (`computeAtomLcsDP_eq_computeAtomLcs`, via `lcsLen_eq_rawMatches_length` and `dpMatches_eq_rawMatches`), zero-`sorry`. The two tie-break rules reconcile because `(rawMatches _ _).length` satisfies the Wagner-Fischer length recurrence *definitionally*, so the backtracker's length comparison is the same boolean `rawMatches` tests. The differential exe (`Differential.lean`) now also runs `computeAtomLcsDP` and the test asserts DP↔recursive identity across the full 1.19M-pair sweep, a runtime guard over the exact proven functions. This made the alternative "refactor the TS to match the recursive Lean" route unnecessary. - **Broaden `Atom` projection toward the real `ComparisonUnitAtom`** — **landed**. `LeanSpike.Atom` now carries an LCS-irrelevant field (`correlationStatus`) alongside the relevant `sha1Hash`/`textContent`/`tagName`, with a `Atom.relevant` projection. The overfit `atomsEqual_implies_eq` (`a = b`) is retired in favour of `atomsEqual_implies_relevant_eq` (`a.relevant = b.relevant`); `commonSubseq_drop_equal_heads` was generalized to `commonSubseq_drop_heads` (head-agnostic length bound); and the soundness theorems `rawMatches_subsequence` / `lcs_matches_are_common_subsequence` (INV-LCS-001) now state matched-atom agreement as `.map Atom.relevant` equality. The full spike stays zero-`sorry`. This was the structural prerequisite for the formal DP-equivalence proof (now landed, above). The scope note this surfaced (peer review) is also **resolved**: `rawMatches_are_longest` (INV-LCS-002) bounded only *structural* common subsequences (`s <+ orig ∧ s <+ rev`), strictly weaker than optimality under `atomsEqual` after broadening; `rawMatches_are_longest_relevant` (`LeanSpike/LcsDP.lean`) now strengthens optimality to the `atomsEqual` / `Atom.relevant` level — it bounds every common subsequence of the relevant projections (`orig.map Atom.relevant`, `rev.map Atom.relevant`), using the converse `atomsEqual_of_relevant_eq` to make the projection-equality ↔ `atomsEqual` correspondence exact. -- **Extensional equivalence helpers Lean ↔ TS**: the Tier 2 *helper* differential (`add-lean-ts-helper-differential-harness`) is now **landed** for the three modeled helpers. The genuine `Tier2.AcceptReject.accept`/`.reject` and `Tier2.FieldStructure.validateFieldStructure` compile to the `leanHelperDifferential` exe (`verification/lean/DifferentialHelpers.lean`) and run against the production `acceptAllChanges`/`rejectAllChanges`/`validateFieldStructure` over shared generated `Doc`s by `packages/docx-core/src/integration/lean-differential-helpers.test.ts`, via a `Doc`→`document.xml` adapter and a canonical token projection. The harness surfaced four characterized model gaps (G1–G4); all four are now **closed** to agreement, and broadening the Lean `accept` to close G3 surfaced one new symmetric engine gap (G5), now pinned as the next engine-fidelity increment: +- **Extensional equivalence helpers Lean ↔ TS**: the Tier 2 *helper* differential (`add-lean-ts-helper-differential-harness`) is now **landed** for the three modeled helpers. The genuine `Tier2.AcceptReject.accept`/`.reject` and `Tier2.FieldStructure.validateFieldStructure` compile to the `leanHelperDifferential` exe (`verification/lean/DifferentialHelpers.lean`) and run against the production `acceptAllChanges`/`rejectAllChanges`/`validateFieldStructure` over shared generated `Doc`s by `packages/docx-core/src/integration/lean-differential-helpers.test.ts`, via a `Doc`→`document.xml` adapter and a canonical token projection. The harness surfaced five characterized model gaps (G1–G5); **all five are now closed** to agreement (broadening the Lean `accept` to close G3 surfaced the symmetric engine gap G5, since closed by making the engine accept mark-based — the mirror of the G4 reject fix): - **G1 — CLOSED** (`add-lean-deleted-field-code-constraint`): `w:fldChar` inside `w:del`. The Lean field-context walk now carries a structural del-ancestry depth and rejects any `w:fldChar` at depth > 0 (`pipeline.ts:542`), so Lean and TS `validateFieldStructure` both return `false` — agreement. - **G2 — CLOSED** (same increment): `w:delInstrText` outside `w:del` is rejected by both (`pipeline.ts:555`); `delInstrText` is confined to a `w:del` ancestor in the Lean model. Closing G1/G2 strengthened `validateFieldStructure` toward the engine's constraint (3) and retired the legacy `field_structure_preserved` whose precondition the constraint vacated. - **G3 — CLOSED** (`broaden-lean-accept-keep-empty-paragraphs`, a **Lean** fidelity fix): accept of an `ins`-wrappered paragraph that collapses to empty. The old Lean `accept` over-dropped the paragraph; the TS engine, LibreOffice, and Word all keep an empty `` (an untracked paragraph mark is a pre-existing paragraph). Lean `accept` was broadened to never drop (symmetric with `reject`), so the two agree. The inverse of G4. - **G4 — CLOSED** (`make-reject-paragraph-collapse-mark-based`, an **engine** fidelity fix): reject of an `ins`-only untracked-mark paragraph. Lean `reject` always kept the empty `` (faithful); the TS engine over-deleted it via a content-based heuristic. Reject is now purely mark-based (drop iff the paragraph mark is `PPR-INS`), matching Lean/LibreOffice/Word. - - **G5** (deferred, next increment) accept of a `del`-only untracked-mark paragraph: Lean `accept` keeps an empty `` (faithful), the TS engine drops it via a content-based heuristic — the **symmetric accept-side analog of the pre-#337 G4 engine over-deletion**, surfaced by broadening Lean `accept`. The fix is to make the TS `acceptAllChanges` paragraph removal mark-based (the accept-side mirror of the G4 fix); pinned by `[LEAN-HELP-08]`. - These are pinned by explicit characterization cases rather than hidden, and feed the deferred engine-fidelity increment. `extractText` / `normalizeText` are **not** modeled in Lean Tier 2 and are deferred to a further increment (`add-lean-ts-text-extraction-differential`). + - **G5 — CLOSED** (`make-accept-paragraph-collapse-mark-based`, an **engine** fidelity fix): accept of a `del`-only untracked-mark paragraph. Lean `accept` always kept the empty `` (faithful, once broadened by G3); the TS engine over-deleted it via a content-based heuristic on **both** accept paths (`acceptAllChanges` and the primitive `acceptChanges`). Accept is now purely mark-based (drop iff the paragraph mark is `PPR-DEL`), matching Lean/LibreOffice/Word — the symmetric accept-side mirror of the G4 reject fix. Confirmed by `[LEAN-HELP-08]` (now agreement) plus a targeted both-paths-agree regression over four shapes. + With G5 closed, every characterized G-case (G1–G5) agrees between the genuine Lean helpers and the production engine; no KNOWN gap remains in this harness. `extractText` / `normalizeText` are **not** modeled in Lean Tier 2 and are deferred to a further increment (`add-lean-ts-text-extraction-differential`). Rough effort: **2-6 months** combined (the harness above is the first slice).