readability: extend score propagation for deeply nested SPA pages #98

Merged
Oaklight merged 2 commits into master from fix/readability-deep-nesting
Jun 12, 2026

@clementine-oaklight

Contributor

Problem

Modern SPA-rendered pages (React, Vue, etc.) wrap content in 3-5 levels of <div> wrappers. The readability scoring algorithm only propagated paragraph scores to parent (1×) and grandparent (½×), so the actual article container never accumulated enough score to be selected.

Result: On pages like Alibaba Cloud help docs, only a single paragraph (97 chars) was extracted instead of the full article (1118 chars).

Root cause

<div class="doc-body">                    ← 4 levels up, never scored
  <div class="rich-text-container">       ← 3 levels up, never scored
    <div class="section-wrapper">         ← grandparent, gets ½× score
      <div class="paragraph-wrapper">     ← parent, gets 1× score
        <p class="text-content">Content</p>  ← scored element

Each paragraph-wrapper is scored independently and the best one wins, but it contains only one paragraph, so that single paragraph is all that gets extracted.
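The effect can be reproduced on a toy tree. The following is a minimal sketch, not the project's actual code: `Node`, `propagate_two_levels`, and `build_page` are hypothetical names. It assumes every paragraph contributes a base score of 1.0 and that each paragraph sits in its own section-wrapper and paragraph-wrapper, as the snippet above suggests.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    """Toy DOM node: a label, a parent pointer, and an accumulated score."""
    name: str
    parent: Node | None = None
    score: float = 0.0


def propagate_two_levels(paragraph: Node, base: float) -> None:
    """Pre-fix behaviour as described: parent gets 1x, grandparent gets 0.5x."""
    parent = paragraph.parent
    if parent is None:
        return
    parent.score += base
    if parent.parent is not None:
        parent.parent.score += base * 0.5


def build_page(n_paragraphs: int) -> list[Node]:
    """Build the nesting shown above, one wrapper chain per paragraph (assumed)."""
    doc_body = Node("doc-body")
    container = Node("rich-text-container", doc_body)
    nodes = [doc_body, container]
    for i in range(n_paragraphs):
        section = Node(f"section-wrapper[{i}]", container)
        wrapper = Node(f"paragraph-wrapper[{i}]", section)
        paragraph = Node(f"p[{i}]", wrapper)
        propagate_two_levels(paragraph, base=1.0)  # assumed base score
        nodes += [section, wrapper]
    return nodes


nodes = build_page(5)
best = max(nodes, key=lambda n: n.score)
print("doc-body score:", nodes[0].score)
print("rich-text-container score:", nodes[1].score)
print("best candidate:", best.name)
```

In this sketch the two outer containers stay at a score of zero no matter how many paragraphs the page has, so the best candidate is always one of the single-paragraph wrappers.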

Fix

  1. Extended score propagation from 2 to 4 ancestor levels with diminishing weights: [1.0, 0.5, 0.333, 0.25] (configurable via _ANCESTOR_WEIGHTS).

  2. Lowered sibling inclusion threshold from best_score × 0.2 to best_score × 0.1, allowing more adjacent content sections to be included in the extracted article.
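Both changes can be sketched as follows. This is an illustration under stated assumptions rather than the merged implementation: only `_ANCESTOR_WEIGHTS` and the numeric values come from the description above, while `Node`, `propagate`, `included_siblings`, `SIBLING_THRESHOLD_RATIO`, the base score of 1.0, and the `>=` comparison are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Weights from the PR description: parent, grandparent, then two more levels.
_ANCESTOR_WEIGHTS = [1.0, 0.5, 0.333, 0.25]
# Hypothetical constant name; the ratio was 0.2 before this change.
SIBLING_THRESHOLD_RATIO = 0.1


@dataclass
class Node:
    """Toy DOM node: a label, a parent pointer, and an accumulated score."""
    name: str
    parent: Node | None = None
    score: float = 0.0


def propagate(paragraph: Node, base: float, weights=_ANCESTOR_WEIGHTS) -> None:
    """Add base * weight to each successive ancestor, nearest first."""
    ancestor = paragraph.parent
    for weight in weights:
        if ancestor is None:
            break  # ran out of ancestors before running out of weights
        ancestor.score += base * weight
        ancestor = ancestor.parent


def included_siblings(best: Node, siblings: list[Node],
                      ratio: float = SIBLING_THRESHOLD_RATIO) -> list[Node]:
    """Keep the best candidate plus siblings scoring at least best * ratio."""
    threshold = best.score * ratio
    return [s for s in siblings if s is best or s.score >= threshold]


# Same nesting as in the root-cause snippet, five paragraphs (assumed layout).
doc_body = Node("doc-body")
container = Node("rich-text-container", doc_body)
nodes = [doc_body, container]
for i in range(5):
    section = Node(f"section-wrapper[{i}]", container)
    wrapper = Node(f"paragraph-wrapper[{i}]", section)
    propagate(Node(f"p[{i}]", wrapper), base=1.0)
    nodes += [section, wrapper]

best = max(nodes, key=lambda n: n.score)
print("best candidate:", best.name)

# Sibling inclusion with made-up scores: 1.5 passes at 0.1 but not at 0.2.
main = Node("main", score=10.0)
sibs = [main, Node("related", score=1.5), Node("footer", score=0.5)]
kept_new = [s.name for s in included_siblings(main, sibs)]
kept_old = [s.name for s in included_siblings(main, sibs, ratio=0.2)]
print(kept_new, kept_old)
```

With five paragraphs the container three levels up accumulates about 5 × 0.333 and overtakes any single wrapper at 1.0. With only three paragraphs it would not (3 × 0.333 < 1.0), so in this sketch the fix depends on the page having enough paragraphs under one container.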

Test results

| Page | Before | After |
| --- | --- | --- |
| Synthetic Alibaba Cloud docs (deep nesting) | 97 chars ❌ | 1118 chars ✅ |
| GitHub Docs (cloning-a-repository) | 4316 chars ✅ | 4316 chars ✅ |
| Alibaba Cloud ECS instance families (7.1MB) | 199821 chars ✅ | 199821 chars ✅ |

No regressions on traditional article pages.

@Oaklight

Owner

Closing — superseded by #97 and subsequent changes on master. The deep nesting fix can be revisited separately if needed.

@Oaklight Oaklight closed this Jun 12, 2026
@Oaklight Oaklight reopened this Jun 12, 2026
clementine-oaklight Bot and others added 2 commits June 12, 2026 16:32
Modern SPA-rendered pages (React, Vue, etc.) wrap content in 3-5 levels
of wrapper divs.  The original algorithm only propagated paragraph scores
to parent (1x) and grandparent (0.5x), so the actual article container
never accumulated enough score to be selected as best candidate.

Changes:
- Extend score propagation to 4 ancestor levels (1x, 0.5x, 0.33x, 0.25x)
  using a configurable _ANCESTOR_WEIGHTS list
- Lower sibling inclusion threshold from 0.2 to 0.1 to capture more
  content sections when the best candidate is found

This fixes extraction of documentation pages like Alibaba Cloud help docs
where rendered HTML uses deeply nested div structures with no <p> tags.
Before: 97 chars extracted (single paragraph). After: 1118 chars (full article).

Regression tested against GitHub Docs and large Alibaba Cloud ECS pages -
traditional article extraction is unaffected.
@Oaklight Oaklight force-pushed the fix/readability-deep-nesting branch from 2aceeea to ca0f8c9 Compare June 12, 2026 21:33
@Oaklight Oaklight merged commit d13f871 into master Jun 12, 2026
6 checks passed