Implement updates tracking OWID grapher#6

Open
xrendan wants to merge 703 commits into BuildCanada:master from owid:master

Conversation

@xrendan
Member

@xrendan xrendan commented Feb 26, 2026

Context

Links to issues, Figma, Slack, and a technical introduction to the work.

Screenshots / Videos / Diagrams

Add if relevant, i.e. might not be necessary when there are no UI changes.

Testing guidance

Step-by-step instructions on how to test this change

  • Does the change work in the archive?
  • Does the staging experience have sign-off from product stakeholders?

Reminder to annotate the PR diff with design notes, alternatives you considered, and any other helpful context.

Checklist

(delete all that do not apply)

Before merging

  • Google Analytics events were adapted to fit the changes in this PR
  • Changes to CSS/HTML were checked on Desktop and Mobile Safari at all three breakpoints
  • Changes to HTML were checked for accessibility concerns

If DB migrations exist:

  • If columns have been added/deleted, all necessary views were recreated
  • The DB type definitions have been updated
  • The DB types in the ETL have been updated
  • If tables/views were added/removed, the Datasette export has been updated to take this into account
  • Update the documentation in db/docs

After merging

  • If a table was touched that is synced to R2, the sync script to update R2 has been run

mlbrgl and others added 7 commits February 26, 2026 17:42
Mixed-content cells (e.g. a paragraph + a list) only indexed the list
items, silently discarding all other visible text. Now each cell's HTML
is converted to enriched blocks via htmlToEnrichedBlocks and processed
through the same enrichedBlocksToIndexableText path as regular table
cells.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use a generic type constraint instead of a union of specific interfaces,
adding OwidGdocProfileInterface support for search indexing.
- Test "returns undefined for component blocks" now uses prominent-link
  (a true no-text block) instead of chart which can have a caption
- Rename "skip component blocks" test to clarify it targets caption-less charts
- Add new test verifying chart captions are included in indexable text

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r search (#6103)

## Summary

Replaces the indirect markdown-to-plaintext pipeline with a new `enrichedBlocksToIndexableText` module that converts enriched Gdoc blocks directly to plaintext for Algolia search indexing, with comprehensive test coverage.

## Rationale: why branch out of the markdown pipeline

The previous approach to generating search-indexable text repurposed the markdown pipeline: enriched blocks → markdown (via `enrichedToMarkdown`) → strip custom component tags → strip markdown formatting → regex cleanup.

This worked but had two structural issues:

**The markdown round-trip is lossy and wasteful.** `spanToMarkdown` wraps formatting spans in markdown syntax (`**bold**`, `_italic_`, `[text](url)`) and `formatGdocMarkdown` immediately strips it back out — imprecisely, because `MarkdownTextWrap` couldn't handle all variants. The workaround was to remove all asterisks wholesale and use regex heuristics to strip footnote numbers (`word.1` → `word.`). These are fragile patches on a representation that discards the structural information (like `span-ref` for footnotes) that would have made clean extraction trivial.
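To illustrate the fragility, here is a minimal sketch. The regex and the span shapes are hypothetical stand-ins (only `span-ref` is named in this PR); they are not the code that was removed.

```typescript
// Hypothetical stand-in for the old heuristic: drop digits that directly
// follow a sentence-ending period ("word.1" -> "word.").
function stripFootnoteNumbersByRegex(text: string): string {
    return text.replace(/\.(\d+)(?=\s|$)/g, ".")
}

// Simplified span shapes, assumed for illustration; the real enriched span
// types have more variants and fields.
type SketchSpan =
    | { spanType: "span-simple-text"; text: string }
    | { spanType: "span-ref"; children: SketchSpan[] }

// Structural extraction: footnote refs are skipped by type, so no pattern
// matching on the flattened text is needed.
function spansToText(spans: SketchSpan[]): string {
    return spans
        .map((span) => (span.spanType === "span-ref" ? "" : span.text))
        .join("")
}

// Once the structure is flattened, the footnote "2" is just a digit glued
// to a word, and the period-based regex cannot see it.
const flattened = "(V-Dem) project2 and distinguishes between two types."
console.log(stripFootnoteNumbersByRegex(flattened))

// With the span tree still available, the same sentence extracts cleanly.
const spans: SketchSpan[] = [
    { spanType: "span-simple-text", text: "(V-Dem) project" },
    {
        spanType: "span-ref",
        children: [{ spanType: "span-simple-text", text: "2" }],
    },
    {
        spanType: "span-simple-text",
        text: " and distinguishes between two types.",
    },
]
console.log(spansToText(spans))
```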

**The markdown pipeline's inclusion decisions didn't match the search use case.** Some blocks that aren't meaningful narrative content (e.g. `prominent-link` titles/URLs, `research-and-writing` URL lists, `pill-row` navigation links) were included in the markdown and survived the stripping pipeline into search results. This could have been fixed in `enrichedBlocksToMarkdown` itself, but it highlights that search was inheriting inclusion decisions that didn't match its scope.

The new `enrichedBlocksToIndexableText` module sidesteps both issues by operating directly on the enriched block AST with an explicit, search-specific indexing policy:

- **No round-trip:** formatting spans are unwrapped to text content in one step; footnote refs (`span-ref`) are simply skipped — no regex needed
- **Explicit policy per block type:** narrative content is indexed (text, headings, blockquotes, callouts, lists, tables, key insights, captions); navigational/promotional/UI blocks are explicitly excluded
- **Paragraph-aware chunking:** block boundaries are preserved as `\n\n` so `chunkParagraphs` can split semantically, then flattened per-chunk — the old pipeline collapsed all newlines before chunking
- **No dependency on** **`gdoc.markdown`:** reads directly from `gdoc.content.body`, decoupling search indexing from the markdown pipeline

Linked callout resolution carries over from the old pipeline — both resolve `span-callout` values identically.
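The policy described above can be sketched roughly as follows. This is not the module's actual code: the block and span shapes are simplified assumptions, and the function names (`spanToPlainText`, `blockToIndexableTextSketch`, `joinAsParagraphsSketch`, `chunkParagraphsSketch`) are hypothetical. The sketch also leaves out the `.` terminators that the real `joinBlocksAsParagraphs` is described as adding.

```typescript
// Simplified, hypothetical shapes; the real enriched block/span types have
// many more variants and fields.
type IdxSpan =
    | { spanType: "span-simple-text"; text: string }
    | { spanType: "span-bold" | "span-italic" | "span-link"; children: IdxSpan[] }
    | { spanType: "span-ref"; children: IdxSpan[] }

type IdxBlock =
    | { type: "text" | "heading"; value: IdxSpan[] }
    | { type: "chart"; url: string; caption?: IdxSpan[] }
    | { type: "prominent-link"; url: string }

// Unwrap formatting spans to their text in one step; skip footnote refs.
function spanToPlainText(span: IdxSpan): string {
    switch (span.spanType) {
        case "span-simple-text":
            return span.text
        case "span-ref":
            return ""
        default:
            return span.children.map(spanToPlainText).join("")
    }
}

// Explicit per-block policy: narrative blocks are indexed, chart captions
// are kept, navigational blocks return undefined.
function blockToIndexableTextSketch(block: IdxBlock): string | undefined {
    switch (block.type) {
        case "text":
        case "heading":
            return block.value.map(spanToPlainText).join("")
        case "chart":
            return block.caption?.map(spanToPlainText).join("") || undefined
        case "prominent-link":
            return undefined
    }
}

// Preserve block boundaries as "\n\n" so chunking can split on them.
function joinAsParagraphsSketch(blocks: IdxBlock[]): string {
    return blocks
        .map(blockToIndexableTextSketch)
        .filter((t): t is string => t !== undefined && t.trim() !== "")
        .join("\n\n")
}

// Group whole paragraphs into chunks of at most maxChars (a single longer
// paragraph becomes its own chunk), then flatten each chunk to one line.
function chunkParagraphsSketch(text: string, maxChars: number): string[] {
    const chunks: string[] = []
    let current: string[] = []
    let length = 0
    for (const paragraph of text.split("\n\n")) {
        if (current.length > 0 && length + paragraph.length > maxChars) {
            chunks.push(current.join(" "))
            current = []
            length = 0
        }
        current.push(paragraph)
        length += paragraph.length
    }
    if (current.length > 0) chunks.push(current.join(" "))
    return chunks
}

const sampleBlocks: IdxBlock[] = [
    {
        type: "text",
        value: [
            { spanType: "span-simple-text", text: "Measles is a " },
            { spanType: "span-bold", children: [{ spanType: "span-simple-text", text: "viral" }] },
            { spanType: "span-simple-text", text: " infection." },
            { spanType: "span-ref", children: [{ spanType: "span-simple-text", text: "1" }] },
        ],
    },
    { type: "chart", url: "https://example.org/grapher/hypothetical-chart" },
    { type: "prominent-link", url: "https://example.org/hypothetical-page" },
    {
        type: "text",
        value: [{ spanType: "span-simple-text", text: "Complications are most severe in children." }],
    },
]

const indexableText = joinAsParagraphsSketch(sampleBlocks)
console.log(chunkParagraphsSketch(indexableText, 40))
```

Because the caption-less chart and the prominent link return `undefined` rather than an empty string, they leave no trace in the output, while the paragraphs on either side stay separated.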

## Test cases

Issues that fail on production and are fixed on staging (the staging links point to the next staging server up the stack, which has better testing tools via the admin preview):

| Problem | Production | Staging |
| --- | --- | --- |
| Missing spaces between paragraphs with a chart in between | [link](https://ourworldindata.org/search?q=Complications+from+measles+are+most+severe&resultType=writing) — snippet shows `infection.Complications` with no space between sentences | [link](http://staging-site-feat-add-plain-text-preview/search?q=Complications+from+measles+are+most+severe&resultType=writing) — fixed: chart blocks return `undefined`, `joinBlocksAsParagraphs` inserts `\n\n` separators so output becomes `infection. Complications` |
| Missing spaces/delimiters around cells of raw HTML tables | [link](https://ourworldindata.org/search?q=Armed+conflicts%3A+interstate%2C+intrastate%2C+extrastate&resultType=writing) — `toPlaintext()` strips HTML tags but adds no whitespace between cells, producing `UCDPArmed conflicts: interstate, intrastate, extrastate...` | [link](http://staging-site-feat-add-plain-text-preview/search?q=Armed+conflicts%3A+interstate%2C+intrastate%2C+extrastate&resultType=writing) — fixed: cheerio parses HTML tables, cells joined with `\|`, list items with `; ` |
| Missing spaces around delimiters of regular tables | [link](https://ourworldindata.org/search?q=Estimate+of+the+effect+size&resultType=writing) — snippet shows `\|Intervention\|Estimate of the effect size\| \|Handwashing with soap\|48% risk …` with no spaces around pipe delimiters | [link](http://staging-site-feat-add-plain-text-preview/search?q=Estimate+of+the+effect+size&resultType=writing) — fixed: snippet shows `Intervention \| Estimate of the effect size \| Handwashing with soap \| 48% risk reduction` with proper spacing |
| Missing spaces around headers | [link](https://ourworldindata.org/search?q=emissions+changed+over+time+in+the+visualizations+above&resultType=writing) — heading text runs into adjacent paragraph: `...visualizations above.How have emissions changed over time` | [link](http://staging-site-feat-add-plain-text-preview/search?q=emissions+changed+over+time+in+the+visualizations+above&resultType=writing) — fixed: `joinBlocksAsParagraphs` inserts `\n\n` between all blocks including headings |
| Href of prominent links to non-gdoc URLs shown | [link](https://ourworldindata.org/search?q=childhood+stunting&resultType=writing) — snippet shows `What is childhood stunting?https://ourworldindata.org/stunting-definitionExplore our page on …` with raw URL leaked into text | [link](http://staging-site-feat-add-plain-text-preview/search?q=childhood+stunting&resultType=writing) — fixed: `prominent-link` blocks excluded entirely, no raw URLs in snippets |
| Endnotes not being filtered out | [link](https://ourworldindata.org/search?q=L%C3%BChrmann%2C+Anna%2C+Marcus+Tannnberg%2C+and+Staffan+Lindberg&resultType=writing) — snippet shows endnote citation text: `Lührmann, Anna, Marcus Tannnberg, and Staffan Lindberg. 2018. Regimes of the World (RoW): Opening New Avenues …` | [link](http://staging-site-feat-add-plain-text-preview/search?q=L%C3%BChrmann%2C+Anna%2C+Marcus+Tannnberg%2C+and+Staffan+Lindberg&resultType=writing) — fixed: no results returned — endnote content no longer indexed (`span-ref` returns `""`) |
| Footnote numbers not preceded by "." not excluded in body | [link](https://ourworldindata.org/search?q=and+distinguishes+between+two+types+of+democracies&resultType=writing) — snippet shows stray footnote number: `(V-Dem) project2 and distinguishes between two types of democracies` | [link](http://staging-site-feat-add-plain-text-preview/search?q=and+distinguishes+between+two+types+of+democracies&resultType=writing) — fixed: snippet shows `(V-Dem) project and distinguishes between two types of democracies` — stray `2` removed |
| Missing spaces around headers | [link](https://ourworldindata.org/search?q=How+effective+is+the+measles+vaccine%2C+and+is+it+safe%3F&resultType=writing) — heading merges into paragraph: `...end of paragraph.How effective is the measles vaccine` | [link](http://staging-site-feat-add-plain-text-preview/search?q=How+effective+is+the+measles+vaccine%2C+and+is+it+safe%3F&resultType=writing) — fixed: `joinBlocksAsParagraphs` adds `\n\n` separators and `.` terminators, ensuring spaces around all headings |
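
The delimiter behaviour in the table rows above (cells joined with ` | `, list items with `; `) can be sketched like this. The `SketchCell` shape and both function names are assumptions for illustration; the actual code parses the HTML with cheerio and routes cells through `enrichedBlocksToIndexableText`, which this sketch does not attempt to reproduce.

```typescript
// Hypothetical, already-parsed cell: plain paragraphs plus list items.
type SketchCell = { paragraphs: string[]; listItems: string[] }

// Keep all visible text of a mixed-content cell: paragraphs first, then the
// list items joined with "; ".
function cellToIndexableText(cell: SketchCell): string {
    const parts = [...cell.paragraphs]
    if (cell.listItems.length > 0) parts.push(cell.listItems.join("; "))
    return parts.join(" ")
}

// Join cells (and, in this sketch, rows) with a spaced pipe so that adjacent
// cell texts never run into each other.
function tableToIndexableText(rows: SketchCell[][]): string {
    return rows
        .map((row) => row.map(cellToIndexableText).join(" | "))
        .join(" | ")
}

const sampleRows: SketchCell[][] = [
    [
        { paragraphs: ["Intervention"], listItems: [] },
        { paragraphs: ["Estimate of the effect size"], listItems: [] },
    ],
    [
        { paragraphs: ["Handwashing with soap"], listItems: [] },
        { paragraphs: ["48% risk reduction"], listItems: [] },
    ],
]
console.log(tableToIndexableText(sampleRows))
```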

## Test plan

_A global before/after comparison would be too noisy to be useful. A more useful approach is to look at this from the perspective of what content should make it into the index, compare against the extraction rules and promote as a new baseline_

- [x] Run `yarn test run db/model/Gdoc/enrichedToIndexableText.test.ts` — all tests pass
- [x] Run `yarn typecheck` — no type errors
- [x] Verify search results on staging maintain proper formatting across paragraph/sentence boundaries
- [x] Verify all failure modes before/after
- [ ] Does the staging experience have sign-off from product stakeholders?

🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Context

This PR adds a new "Plain text" preview mode to the Algolia index preview drawer in the admin site. This allows users to view the raw text content extracted from a Google Doc, which can be useful for debugging search indexing issues.

## Screenshots / Videos / Diagrams


![Screenshot 2026-02-13 at 10.07.09.png](https://app.graphite.com/user-attachments/assets/34ee1f60-1e5f-400e-891d-f8e11d3f1ea5.png)

## Testing guidance

1. Open a Google Doc in the admin site
2. Click on the "Index" button to open the index preview drawer
3. Toggle between "Algolia records" and "Plain text" modes
4. Verify that both modes display the expected content
5. Verify that the loading states work correctly for both modes
## Context

The BDD test "Search from homepage with country extraction" was flaky — it failed on 2 of 3 retries in [build #28275](https://buildkite.com/our-world-in-data/grapher-automated-staging-environment/builds/28275).

**Root cause**: Country extraction and URL sanitization happen in a React `useEffect` that runs after the first paint. The test's synchronous `page.url()` check could see the stale URL (e.g. `?q=co2+france`) before the effect rewrote it to the sanitized form (`?q=co2&countries=France`).

**Fix**: Replace all synchronous `page.url()` assertions with Playwright's polling `expect(page).toHaveURL()`, which retries until the URL matches. Also consolidate all URL param helpers into a single generic `expectUrlParam` that handles exact match, absence, and multi-value `~`-separated params.
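
A dependency-free sketch of the idea, for readers without the test code at hand. `urlParamMatches` and `pollUntil` are hypothetical names and are not the `expectUrlParam` helper from this PR; the polling loop only imitates what Playwright's retrying assertion does, and the `setTimeout` stands in for the `useEffect` that rewrites the URL after first paint.

```typescript
// expected: a string for an exact match, an array for "~"-separated
// multi-value params, or undefined to assert the param is absent.
function urlParamMatches(
    url: string,
    name: string,
    expected: string | string[] | undefined
): boolean {
    const actual = new URL(url).searchParams.get(name)
    if (expected === undefined) return actual === null
    if (actual === null) return false
    if (Array.isArray(expected)) {
        const values = actual.split("~")
        return (
            values.length === expected.length &&
            values.every((value, i) => value === expected[i])
        )
    }
    return actual === expected
}

// Re-check a condition until it holds or the timeout elapses, instead of
// reading the value once and failing on a stale result.
async function pollUntil(
    check: () => boolean,
    timeoutMs = 5000,
    intervalMs = 50
): Promise<void> {
    const deadline = Date.now() + timeoutMs
    while (!check()) {
        if (Date.now() > deadline) {
            throw new Error("Timed out waiting for condition")
        }
        await new Promise((resolve) => setTimeout(resolve, intervalMs))
    }
}

// Simulate the race: the URL starts stale and is sanitized shortly after.
let currentUrl = "https://example.org/search?q=co2+france"
setTimeout(() => {
    currentUrl = "https://example.org/search?q=co2&countries=France"
}, 100)

// A one-shot check here would see the stale URL; polling waits it out.
pollUntil(() => urlParamMatches(currentUrl, "countries", "France")).then(() =>
    console.log("URL settled:", currentUrl)
)
```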

## Testing guidance

- BDD tests should pass consistently without needing retries for the country extraction scenario.
sophiamersmann and others added 22 commits February 27, 2026 16:28
* Add containerTitle to charts search index

This enables searching by multi-dim and explorer titles, which are
sometimes different than the titles of their views.

Test case queries:

- multi-dim: `childhood vaccination coverage`
- explorer: `species habitat availability`

Fixes #5243

* 🐝 trigger CI

---------

Co-authored-by: Marigold <mojmir.vinkler@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
sophiamersmann and others added 30 commits March 19, 2026 09:40
Small Grapher refactors I did as part of the Causes of Death Treemap

- Rename GrapherTooltipAnchor options: renames enum values in GrapherTypes.ts for clarity
- Make TooltipValue component more flexible: adjusts TooltipContents.tsx to be more reusable
- Extract sparkline component: moves the sparkline out of DataTable.tsx into a new sparkline/Sparkline.tsx file
- Extract SASS variables: pulls shared SCSS variables (colors, sizes, etc.) from grapher.scss into a new core/variables.scss
- Split tooltip components into separate files: breaks the monolithic Tooltip.tsx into TooltipCard.tsx and TooltipContainer.tsx
- Drop NO_DATA_LABEL import from ColorScale: removes an unused import
- Move makeAxisLabel to AxisUtils: relocates the helper from ChartUtils.tsx to axis/AxisUtils.ts
Refactors TextWrap and MarkdownTextWrap. The main motivation is to refactor MobX away so that we can use these utilities in bespoke projects, but I went a bit further and also split state from rendering and introduced a common interface for TextWrap and MarkdownTextWrap.

In summary,
- Remove MobX from `TextWrap` and `MarkdownTextWrap`, either dropping `@computed` or replacing it with `@imemo`
- Convert `MarkdownTextWrap` from a React component to a plain class, removing the JSX rendering pattern
- Separate state from rendering by extracting render methods into standalone React components: `TextWrapSvg`, `TextWrapHtml`, `MarkdownTextWrapSvg`, `MarkdownTextWrapHtml`
- Introduce a shared `TextWrap` interface
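
A rough sketch of the pattern, under loud assumptions: `TextWrapSketch` and `renderTextWrapAsHtml` are hypothetical names, the hand-rolled cache only approximates what a memoizing decorator such as `@imemo` provides, and wrapping is done by character count here rather than by measured text width.

```typescript
// Plain class with no MobX: layout state is derived lazily and cached.
class TextWrapSketch {
    readonly text: string
    readonly maxChars: number
    private cachedLines: string[] | undefined

    constructor(text: string, maxChars: number) {
        this.text = text
        this.maxChars = maxChars
    }

    // Greedy word wrap, computed once on first access.
    get lines(): string[] {
        if (this.cachedLines === undefined) {
            const lines: string[] = []
            let current = ""
            for (const word of this.text.split(/\s+/).filter(Boolean)) {
                const candidate = current ? `${current} ${word}` : word
                if (candidate.length > this.maxChars && current) {
                    lines.push(current)
                    current = word
                } else {
                    current = candidate
                }
            }
            if (current) lines.push(current)
            this.cachedLines = lines
        }
        return this.cachedLines
    }
}

// Rendering lives outside the class, so the same state object could feed an
// SVG renderer, an HTML renderer, or a project that does not use React.
function renderTextWrapAsHtml(wrap: TextWrapSketch): string {
    return wrap.lines.map((line) => `<span>${line}</span>`).join("<br/>")
}

const wrapSketch = new TextWrapSketch("Causes of death worldwide", 10)
console.log(renderTextWrapAsHtml(wrapSketch))
```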
This PR renders Grapher’s Dropdown component in the example bespoke viz, since most bespoke viz projects will likely need it, but there is a bit of setup required to make React Aria work with the Shadow DOM.

I initially thought we’d need to portal the popover into the shadow DOM. But while porting this code over from the Causes of Death project, where I first experimented with this, I realised the solution is much simpler: it’s actually fine to attach the popover to the body outside the shadow DOM.

What’s then missing are the popover styles, because they don’t exist on the demo page. But they should exist on any OWID page, since they’re bundled into owid.css, right? So the simplest fix seems to be to just import those styles on the demo page.

Of course, this is a bit brittle, because we’re relying on the embedder to provide those styles. But I think it should be fine in practice, since we’ll always be embedding these in GDoc articles that live on our site.
Adds a few shared components for bespoke projects and adds an example chart to the example project.

In summary,
- Adds reusable components for bespoke projects: ChartHeader, ChartFooter, Frame, TimeSlider, and BezierArrow
- Adds a shared useDimensions hook for responsive chart sizing via ResizeObserver
- Adds a new "chart" variant to the example bespoke project that demonstrates how to compose these shared components using `@visx` packages
- Improves the layout of the demo page (the boxes looked nice for the small examples, but I found them distracting when working on the Causes of Death Treemap)
Explicitly pass ADMIN_SERVER_PORT, VITE_PORT, WRANGLER_PORT, and
COMPOSE_PROJECT_NAME to tmux shell commands so user overrides are
respected. Also use TMUX_SESSION_NAME in up.devcontainer instead of
a hardcoded session name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Forward env vars to tmux subshells in Makefile