Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #7511 +/- ##
========================================
Coverage 99.61% 99.61%
========================================
Files 283 285 +2
Lines 11877 11991 +114
Branches 2898 2927 +29
========================================
+ Hits 11831 11945 +114
Misses 46 46 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This pull request normalizes quote selector creation and anchoring for both HTML and PDF documents to ensure that stored selectors match what users see in rendered text. The key change is that line breaks (from <br> tags and block elements in HTML, or newlines in PDF) are now converted to spaces, and consecutive whitespace is collapsed to single spaces. This prevents issues where text like <p>foo<br>bar</p> was previously stored as "foobar" but is now correctly stored as "foo bar" to match the visual rendering.
Changes:
- Introduced
rendered-text.tsmodule that builds normalized text from HTML DOM with offset mappings between raw and normalized positions - Updated
TextQuoteAnchorto use normalized text when creating and matching selectors - Applied consistent PDF text normalization in selector creation and anchoring
- Normalized quote display in the UI component to match the stored format
- Updated test fixtures and baselines to reflect normalized selector output
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/annotator/anchoring/rendered-text.ts | New module providing HTML text normalization with offset mapping for converting between raw and normalized positions |
| src/annotator/anchoring/types.ts | Updated TextQuoteAnchor to use normalized text for selector creation and matching |
| src/annotator/anchoring/pdf.ts | Applied consistent PDF text normalization in describe() and anchor() paths |
| src/sidebar/components/Annotation/AnnotationQuote.tsx | Normalized quote display in UI to match stored format |
| src/annotator/anchoring/test/rendered-text-test.js | New tests for the rendered-text normalization module |
| src/annotator/anchoring/test/types-test.js | Updated test expectations to match normalized selector format and relaxed some assertions |
| src/annotator/anchoring/test/pdf-test.js | Updated test expectations and relaxed some assertions to accommodate normalization |
| src/annotator/anchoring/test/html-test.js | Added normalization helpers and updated tests to compare normalized selectors |
| src/annotator/anchoring/test/html-baselines/wikipedia-regression-testing.json | Updated baseline expectations with normalized prefix/suffix values |
| src/annotator/anchoring/test/html-baselines/minimal.json | Updated baseline expectations with normalized prefix/suffix values |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| // Verify that anchoring lands on the saved quote: the rendered | ||
| // text of the anchored range should contain the selector's `exact`. | ||
| // This is what production cares about — that re-anchoring an old | ||
| // annotation finds the originally-highlighted text. | ||
| const quote = selectors.find(s => s.type === 'TextQuoteSelector'); | ||
| const div = document.createElement('div'); | ||
| div.appendChild(range.cloneContents()); | ||
| assert.include(renderedTextOf(div).text, quote.exact); |
There was a problem hiding this comment.
Why did you change this?
There was a problem hiding this comment.
Because the helper was being used to test a case that doesn't ever happen in prod, which is the comparison of selectors (and that's why the helper is unnecessary). The old test anchors the saved selectors, then re-described the resulting range and compared the two sets of selectors. And it reqiured normalizeQuoteSelector because baseline JSON has raw whitespace and describe produces normalized output. Btu now it tests directly what prod does; anchoring an old annotation lands on the originally highlighted text.
Summary
Normalize quote selectors so we store and match what users see.
<p>foo<br>bar</p>becomes"foo bar", not"foobar". Prefix/suffix are normalized too.Details
HTML (
TextQuoteAnchor)DOM walk produces rendered text (spaces at
<br>and block boundaries, whitespace collapsed) plus forward/reverse offset maps between rawtextContentand rendered text.fromRangeuses the maps to sliceexact/prefix/suffixfrom rendered text.toPositionAnchormatches against rendered text and maps offsets back to raw DOM coordinates forTextPositionSelector.Why use explicit maps instead of reusing
translateOffsetstranslateOffsetsaligns by counting non-whitespace chars, which works for PDF where both strings have the same characters with different spacing. Our rendered text contains synthesized characters (the space at a<br>has no source intextContent); counting can't align those, and usingtranslateOffsetsshifted real anchors by one character. The maps record the correspondence during the walk, so synthesized characters are tracked correctly.PDF
Selectors and page text normalized via
normalizePDFText; match offsets translated back viatranslateOffsets(no synthesized characters here, so it's the right tool).isSpace/isNotSpacelifted intoutil/normalize.tsto share with HTML. Redundant[\r\n]+innormalizePDFTextdropped (\s+covers it).