Skip to content

Fix/linebreaks conversion quotes#7511

Open
Elimpizza wants to merge 19 commits intomainfrom
fix/linebreaks-conversion-quotes
Open

Fix/linebreaks conversion quotes#7511
Elimpizza wants to merge 19 commits intomainfrom
fix/linebreaks-conversion-quotes

Conversation

@Elimpizza
Copy link
Copy Markdown
Contributor

@Elimpizza Elimpizza commented Feb 26, 2026

Summary

Normalize quote selectors so we store and match what users see. <p>foo<br>bar</p> becomes "foo bar", not "foobar". Prefix/suffix are normalized too.

Details

HTML (TextQuoteAnchor)

DOM walk produces rendered text (spaces at <br> and block boundaries, whitespace collapsed) plus forward/reverse offset maps between raw textContent and rendered text.

  • fromRange uses the maps to slice exact / prefix / suffix from rendered text.
  • toPositionAnchor matches against rendered text and maps offsets back to raw DOM coordinates for TextPositionSelector.

Why use explicit maps instead of reusing translateOffsets

translateOffsets aligns by counting non-whitespace chars, which works for PDF where both strings have the same characters with different spacing. Our rendered text contains synthesized characters (the space at a <br> has no source in textContent); counting can't align those, and using translateOffsets shifted real anchors by one character. The maps record the correspondence during the walk, so synthesized characters are tracked correctly.

PDF
Selectors and page text normalized via normalizePDFText; match offsets translated back via translateOffsets (no synthesized characters here, so it's the right tool). isSpace/isNotSpace lifted into util/normalize.ts to share with HTML. Redundant [\r\n]+ in normalizePDFText dropped (\s+ covers it).

@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.61%. Comparing base (b07c1b8) to head (fc8eb1b).
⚠️ Report is 31 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #7511    +/-   ##
========================================
  Coverage   99.61%   99.61%            
========================================
  Files         283      285     +2     
  Lines       11877    11991   +114     
  Branches     2898     2927    +29     
========================================
+ Hits        11831    11945   +114     
  Misses         46       46            

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Elimpizza Elimpizza marked this pull request as ready for review February 26, 2026 16:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request normalizes quote selector creation and anchoring for both HTML and PDF documents to ensure that stored selectors match what users see in rendered text. The key change is that line breaks (from <br> tags and block elements in HTML, or newlines in PDF) are now converted to spaces, and consecutive whitespace is collapsed to single spaces. This prevents issues where text like <p>foo<br>bar</p> was previously stored as "foobar" but is now correctly stored as "foo bar" to match the visual rendering.

Changes:

  • Introduced rendered-text.ts module that builds normalized text from HTML DOM with offset mappings between raw and normalized positions
  • Updated TextQuoteAnchor to use normalized text when creating and matching selectors
  • Applied consistent PDF text normalization in selector creation and anchoring
  • Normalized quote display in the UI component to match the stored format
  • Updated test fixtures and baselines to reflect normalized selector output

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/annotator/anchoring/rendered-text.ts New module providing HTML text normalization with offset mapping for converting between raw and normalized positions
src/annotator/anchoring/types.ts Updated TextQuoteAnchor to use normalized text for selector creation and matching
src/annotator/anchoring/pdf.ts Applied consistent PDF text normalization in describe() and anchor() paths
src/sidebar/components/Annotation/AnnotationQuote.tsx Normalized quote display in UI to match stored format
src/annotator/anchoring/test/rendered-text-test.js New tests for the rendered-text normalization module
src/annotator/anchoring/test/types-test.js Updated test expectations to match normalized selector format and relaxed some assertions
src/annotator/anchoring/test/pdf-test.js Updated test expectations and relaxed some assertions to accommodate normalization
src/annotator/anchoring/test/html-test.js Added normalization helpers and updated tests to compare normalized selectors
src/annotator/anchoring/test/html-baselines/wikipedia-regression-testing.json Updated baseline expectations with normalized prefix/suffix values
src/annotator/anchoring/test/html-baselines/minimal.json Updated baseline expectations with normalized prefix/suffix values

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/annotator/anchoring/pdf.ts Outdated
Comment thread src/annotator/anchoring/types.ts Outdated
Comment thread src/annotator/anchoring/types.ts Outdated
Comment thread src/sidebar/components/Annotation/AnnotationQuote.tsx Outdated
Comment thread src/annotator/anchoring/test/types-test.js Outdated
Comment thread src/annotator/anchoring/test/pdf-test.js Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/annotator/anchoring/pdf.ts Outdated
Comment thread src/annotator/anchoring/pdf.ts Outdated
Comment thread src/annotator/anchoring/pdf.ts
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/annotator/anchoring/types.ts Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@karenrasmussen karenrasmussen changed the title Fix/linebreaks conversion quotes WIP Fix/linebreaks conversion quotes Apr 14, 2026
@karenrasmussen karenrasmussen changed the title WIP Fix/linebreaks conversion quotes Fix/linebreaks conversion quotes Apr 14, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Comment thread src/annotator/util/normalize.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/pdf.ts Outdated
Comment thread src/sidebar/components/Annotation/AnnotationQuote.tsx Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/rendered-text.ts Outdated
Comment thread src/annotator/anchoring/types.ts Outdated
Comment thread src/annotator/anchoring/pdf.ts Outdated
Comment thread src/annotator/anchoring/test/html-test.js Outdated
Comment on lines +471 to +479

// Verify that anchoring lands on the saved quote: the rendered
// text of the anchored range should contain the selector's `exact`.
// This is what production cares about — that re-anchoring an old
// annotation finds the originally-highlighted text.
const quote = selectors.find(s => s.type === 'TextQuoteSelector');
const div = document.createElement('div');
div.appendChild(range.cloneContents());
assert.include(renderedTextOf(div).text, quote.exact);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the helper was being used to test a case that doesn't ever happen in prod, which is the comparison of selectors (and that's why the helper is unnecessary). The old test anchors the saved selectors, then re-described the resulting range and compared the two sets of selectors. And it reqiured normalizeQuoteSelector because baseline JSON has raw whitespace and describe produces normalized output. Btu now it tests directly what prod does; anchoring an old annotation lands on the originally highlighted text.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants