readability: extend score propagation for deeply nested SPA pages #98
Merged
Owner: Closing; superseded by #97 and subsequent changes on master. The deep nesting fix can be revisited separately if needed.
Modern SPA-rendered pages (React, Vue, etc.) wrap content in 3-5 levels of wrapper divs. The original algorithm only propagated paragraph scores to the parent (1x) and grandparent (0.5x), so the actual article container never accumulated enough score to be selected as the best candidate.

Changes:
- Extend score propagation to 4 ancestor levels (1x, 0.5x, 0.33x, 0.25x) using a configurable _ANCESTOR_WEIGHTS list
- Lower sibling inclusion threshold from 0.2 to 0.1 to capture more content sections when the best candidate is found

This fixes extraction of documentation pages like Alibaba Cloud help docs, where the rendered HTML uses deeply nested div structures with no <p> tags.

Before: 97 chars extracted (single paragraph).
After: 1118 chars (full article).

Regression tested against GitHub Docs and large Alibaba Cloud ECS pages; traditional article extraction is unaffected.
2aceeea to ca0f8c9
Problem
Modern SPA-rendered pages (React, Vue, etc.) wrap content in 3-5 levels of <div> wrappers. The readability scoring algorithm only propagated paragraph scores to the parent (1×) and grandparent (½×), so the actual article container never accumulated enough score to be selected.

Result: On pages like Alibaba Cloud help docs, only a single paragraph (97 chars) was extracted instead of the full article (1118 chars).
Root cause
Each paragraph-wrapper gets scored independently and the best one wins, but it only contains one paragraph.

Fix
- Extended score propagation from 2 to 4 ancestor levels with diminishing weights: [1.0, 0.5, 0.333, 0.25] (configurable via _ANCESTOR_WEIGHTS).
- Lowered sibling inclusion threshold from best_score × 0.2 to best_score × 0.1, allowing more adjacent content sections to be included in the extracted article.

Test results
No regressions on traditional article pages.