Implement updates tracking OWID grapher #6
Open
xrendan wants to merge 703 commits into BuildCanada:master
Mixed-content cells (e.g. a paragraph + a list) only indexed the list items, silently discarding all other visible text. Now each cell's HTML is converted to enriched blocks via htmlToEnrichedBlocks and processed through the same enrichedBlocksToIndexableText path as regular table cells.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
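A minimal sketch of the behaviour described above, assuming a heavily simplified cell model. The `CellBlock` type and the three helper functions are hypothetical and stand in for the real `htmlToEnrichedBlocks` / `enrichedBlocksToIndexableText` path, whose signatures are not shown here. The `; ` and ` | ` separators follow the test-case table in the related PR description; joining blocks inside a cell with a single space is an assumption.

```typescript
// Hypothetical, simplified cell content: each table cell holds a list of blocks.
type CellBlock =
    | { type: "text"; text: string }
    | { type: "list"; items: string[] }

// Every block in the cell contributes text, not only the list items.
function cellBlockToText(block: CellBlock): string {
    return block.type === "list" ? block.items.join("; ") : block.text
}

// Join the non-empty block texts of one cell (single-space join is an assumption).
function cellToText(blocks: CellBlock[]): string {
    return blocks
        .map(cellBlockToText)
        .filter((text) => text.trim() !== "")
        .join(" ")
}

// Cells in a row are separated by a spaced pipe.
function rowToText(cells: CellBlock[][]): string {
    return cells.map(cellToText).join(" | ")
}
```

With this shape, a cell containing a paragraph followed by a list keeps the paragraph text instead of dropping it.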
Use a generic type constraint instead of a union of specific interfaces, adding OwidGdocProfileInterface support for search indexing.
- Test "returns undefined for component blocks" now uses prominent-link (a true no-text block) instead of chart which can have a caption
- Rename "skip component blocks" test to clarify it targets caption-less charts
- Add new test verifying chart captions are included in indexable text

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r search (#6103)

## Summary

Replaces the indirect markdown-to-plaintext pipeline with a new `enrichedBlocksToIndexableText` module that converts enriched Gdoc blocks directly to plaintext for Algolia search indexing, with comprehensive test coverage.

## Rationale: why branch out of the markdown pipeline

The previous approach to generating search-indexable text repurposed the markdown pipeline: enriched blocks → markdown (via `enrichedToMarkdown`) → strip custom component tags → strip markdown formatting → regex cleanup. This worked but had two structural issues:

**The markdown round-trip is lossy and wasteful.** `spanToMarkdown` wraps formatting spans in markdown syntax (`**bold**`, `_italic_`, `[text](url)`) and `formatGdocMarkdown` immediately strips it back out, imprecisely, because `MarkdownTextWrap` couldn't handle all variants. The workaround was to remove all asterisks wholesale and use regex heuristics to strip footnote numbers (`word.1` → `word.`). These are fragile patches on a representation that discards the structural information (like `span-ref` for footnotes) that would have made clean extraction trivial.

**The markdown pipeline's inclusion decisions didn't match the search use case.** Some blocks that aren't meaningful narrative content (e.g. `prominent-link` titles/URLs, `research-and-writing` URL lists, `pill-row` navigation links) were included in the markdown and survived the stripping pipeline into search results. This could have been fixed in `enrichedBlocksToMarkdown` itself, but it highlights that search was inheriting inclusion decisions that didn't match its scope.

The new `enrichedBlocksToIndexableText` module sidesteps both issues by operating directly on the enriched block AST with an explicit, search-specific indexing policy:

- **No round-trip:** formatting spans are unwrapped to text content in one step; footnote refs (`span-ref`) are simply skipped, no regex needed
- **Explicit policy per block type:** narrative content is indexed (text, headings, blockquotes, callouts, lists, tables, key insights, captions); navigational/promotional/UI blocks are explicitly excluded
- **Paragraph-aware chunking:** block boundaries are preserved as `\n\n` so `chunkParagraphs` can split semantically, then flattened per-chunk; the old pipeline collapsed all newlines before chunking
- **No dependency on `gdoc.markdown`:** reads directly from `gdoc.content.body`, decoupling search indexing from the markdown pipeline

Linked callout resolution carries over from the old pipeline; both resolve `span-callout` values identically.

## Test cases

Issues failing on production, fixed on staging (points the next staging server up the stack to get better testing tools with admin preview):

| Problem | Production | Staging |
| --- | --- | --- |
| Missing spaces between paragraphs with a chart in between | [link](https://ourworldindata.org/search?q=Complications+from+measles+are+most+severe&resultType=writing): snippet shows `infection.Complications` with no space between sentences | [link](http://staging-site-feat-add-plain-text-preview/search?q=Complications+from+measles+are+most+severe&resultType=writing): fixed: chart blocks return `undefined`, `joinBlocksAsParagraphs` inserts `\n\n` separators so output becomes `infection. Complications` |
| Missing spaces/delimiters around cells of raw HTML tables | [link](https://ourworldindata.org/search?q=Armed+conflicts%3A+interstate%2C+intrastate%2C+extrastate&resultType=writing): `toPlaintext()` strips HTML tags but adds no whitespace between cells, producing `UCDPArmed conflicts: interstate, intrastate, extrastate...` | [link](http://staging-site-feat-add-plain-text-preview/search?q=Armed+conflicts%3A+interstate%2C+intrastate%2C+extrastate&resultType=writing): fixed: cheerio parses HTML tables, cells joined with `\|`, list items with `; ` |
| Missing spaces around delimiters of regular tables | [link](https://ourworldindata.org/search?q=Estimate+of+the+effect+size&resultType=writing): snippet shows `\|Intervention\|Estimate of the effect size\| \|Handwashing with soap\|48% risk …` with no spaces around pipe delimiters | [link](http://staging-site-feat-add-plain-text-preview/search?q=Estimate+of+the+effect+size&resultType=writing): fixed: snippet shows `Intervention \| Estimate of the effect size \| Handwashing with soap \| 48% risk reduction` with proper spacing |
| Missing spaces around headers | [link](https://ourworldindata.org/search?q=emissions+changed+over+time+in+the+visualizations+above&resultType=writing): heading text runs into adjacent paragraph: `...visualizations above.How have emissions changed over time` | [link](http://staging-site-feat-add-plain-text-preview/search?q=emissions+changed+over+time+in+the+visualizations+above&resultType=writing): fixed: `joinBlocksAsParagraphs` inserts `\n\n` between all blocks including headings |
| Href of prominent links to non-gdoc URLs shown | [link](https://ourworldindata.org/search?q=childhood+stunting&resultType=writing): snippet shows `What is childhood stunting?https://ourworldindata.org/stunting-definitionExplore our page on …` with raw URL leaked into text | [link](http://staging-site-feat-add-plain-text-preview/search?q=childhood+stunting&resultType=writing): fixed: `prominent-link` blocks excluded entirely, no raw URLs in snippets |
| Endnotes not being filtered out | [link](https://ourworldindata.org/search?q=L%C3%BChrmann%2C+Anna%2C+Marcus+Tannnberg%2C+and+Staffan+Lindberg&resultType=writing): snippet shows endnote citation text: `Lührmann, Anna, Marcus Tannnberg, and Staffan Lindberg. 2018. Regimes of the World (RoW): Opening New Avenues …` | [link](http://staging-site-feat-add-plain-text-preview/search?q=L%C3%BChrmann%2C+Anna%2C+Marcus+Tannnberg%2C+and+Staffan+Lindberg&resultType=writing): fixed: no results returned; endnote content no longer indexed (`span-ref` returns `""`) |
| Footnote numbers not preceded by "." not excluded in body | [link](https://ourworldindata.org/search?q=and+distinguishes+between+two+types+of+democracies&resultType=writing): snippet shows stray footnote number: `(V-Dem) project2 and distinguishes between two types of democracies` | [link](http://staging-site-feat-add-plain-text-preview/search?q=and+distinguishes+between+two+types+of+democracies&resultType=writing): fixed: snippet shows `(V-Dem) project and distinguishes between two types of democracies` (stray `2` removed) |
| Missing spaces around headers | [link](https://ourworldindata.org/search?q=How+effective+is+the+measles+vaccine%2C+and+is+it+safe%3F&resultType=writing): heading merges into paragraph: `...end of paragraph.How effective is the measles vaccine` | [link](http://staging-site-feat-add-plain-text-preview/search?q=How+effective+is+the+measles+vaccine%2C+and+is+it+safe%3F&resultType=writing): fixed: `joinBlocksAsParagraphs` adds `\n\n` separators and `.` terminators, ensuring spaces around all headings |

## Test plan

_A global before/after comparison would be too noisy to be useful. A more useful approach is to look at this from the perspective of what content should make it into the index, compare against the extraction rules and promote as a new baseline_

- [x] Run `yarn test run db/model/Gdoc/enrichedToIndexableText.test.ts`: all tests pass
- [x] Run `yarn typecheck`: no type errors
- [x] Verify search results on staging maintain proper formatting across paragraph/sentence boundaries
- [x] Verify all failure modes before/after
- [ ] Does the staging experience have sign-off from product stakeholders?

🤖 Generated with [Claude Code](https://claude.com/claude-code)
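The indexing policy described in this PR can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's actual code: the `Span` and `Block` types are cut down to a handful of variants, the function bodies are guesses, and only the names `span-ref`, `prominent-link` and `joinBlocksAsParagraphs` are taken from the description. The `.` terminators mentioned in the test-case table are left out.

```typescript
// Simplified, hypothetical types: the real enriched block AST has many more variants.
type Span =
    | { spanType: "span-simple-text"; text: string }
    | { spanType: "span-bold" | "span-italic" | "span-link"; children: Span[] }
    | { spanType: "span-ref"; children: Span[] }

type Block =
    | { type: "text" | "heading"; value: Span[] }
    | { type: "chart"; url: string; caption?: Span[] }
    | { type: "prominent-link"; url: string; title?: string }

// Unwrap formatting spans to their text; footnote refs contribute nothing.
function spanToText(span: Span): string {
    switch (span.spanType) {
        case "span-simple-text":
            return span.text
        case "span-ref":
            return ""
        default:
            return spansToText(span.children)
    }
}

function spansToText(spans: Span[]): string {
    return spans.map(spanToText).join("")
}

// Explicit per-block policy: undefined means "not indexed".
function blockToText(block: Block): string | undefined {
    switch (block.type) {
        case "text":
        case "heading":
            return spansToText(block.value)
        case "chart":
            return block.caption ? spansToText(block.caption) : undefined
        case "prominent-link":
            return undefined
    }
}

// Keep block boundaries as blank lines so later chunking can split on them.
function joinBlocksAsParagraphs(blocks: Block[]): string {
    return blocks
        .map(blockToText)
        .filter((text): text is string => text !== undefined && text.trim() !== "")
        .join("\n\n")
}
```

Because excluded blocks return `undefined` rather than an empty string joined in place, the paragraphs on either side of a chart or prominent link still end up separated by `\n\n`.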
## Context

This PR adds a new "Plain text" preview mode to the Algolia index preview drawer in the admin site. This allows users to view the raw text content extracted from a Google Doc, which can be useful for debugging search indexing issues.

## Testing guidance

1. Open a Google Doc in the admin site
2. Click on the "Index" button to open the index preview drawer
3. Toggle between "Algolia records" and "Plain text" modes
4. Verify that both modes display the expected content
5. Verify that the loading states work correctly for both modes
## Context

The BDD test "Search from homepage with country extraction" was flaky: it failed on 2 of 3 retries in [build #28275](https://buildkite.com/our-world-in-data/grapher-automated-staging-environment/builds/28275).

**Root cause**: Country extraction and URL sanitization happen in a React `useEffect` that runs after the first paint. The test's synchronous `page.url()` check could see the stale URL (e.g. `?q=co2+france`) before the effect rewrote it to the sanitized form (`?q=co2&countries=France`).

**Fix**: Replace all synchronous `page.url()` assertions with Playwright's polling `expect(page).toHaveURL()`, which retries until the URL matches. Consolidated all URL param helpers into a single generic `expectUrlParam` that handles exact match, absence, and multi-value `~`-separated params.

## Testing guidance

- BDD tests should pass consistently without needing retries for the country extraction scenario.
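To illustrate the idea behind the fix without depending on Playwright, here is a self-contained sketch. `urlParamMatches` and `pollUntil` are hypothetical helpers, and this `expectUrlParam` signature is a guess rather than the one in the PR; in the real test, the retry loop is provided by `expect(page).toHaveURL()`.

```typescript
// null means "the param must be absent"; an array means "~"-separated values in order.
type ParamExpectation = string | string[] | null

// Pure check of one query param against an expectation.
function urlParamMatches(url: string, name: string, expected: ParamExpectation): boolean {
    const actual = new URL(url).searchParams.get(name)
    if (expected === null) return actual === null
    if (actual === null) return false
    if (Array.isArray(expected)) {
        const values = actual.split("~")
        return values.length === expected.length && values.every((value, i) => value === expected[i])
    }
    return actual === expected
}

// Re-run the check until it passes or the timeout elapses, instead of checking once.
async function pollUntil(check: () => boolean, timeoutMs = 5000, intervalMs = 50): Promise<void> {
    const deadline = Date.now() + timeoutMs
    while (!check()) {
        if (Date.now() > deadline) throw new Error("condition not met before timeout")
        await new Promise((resolve) => setTimeout(resolve, intervalMs))
    }
}

// Hypothetical signature: getUrl stands in for reading the browser's current URL.
async function expectUrlParam(getUrl: () => string, name: string, expected: ParamExpectation): Promise<void> {
    await pollUntil(() => urlParamMatches(getUrl(), name, expected))
}
```

A one-shot check against `?q=co2+france` fails, while the polling version keeps re-reading the URL and passes once the effect has rewritten it to `?q=co2&countries=France`.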
Improves accessibility and fixes #5930.
* Add containerTitle to charts search index

  This enables searching by multi-dim and explorer titles, which are sometimes different than the titles of their views.

  Test case queries:
  - multi-dim: `childhood vaccination coverage`
  - explorer: `species habitat availability`

  Fixes #5243

* 🐝 trigger CI

---------

Co-authored-by: Marigold <mojmir.vinkler@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Small Grapher refactors I did as part of the Causes of Death Treemap

- Rename GrapherTooltipAnchor options: renames enum values in GrapherTypes.ts (e.g. for clarity)
- Make TooltipValue component more flexible: adjusts TooltipContents.tsx to be more reusable
- Extract sparkline component: moves the sparkline out of DataTable.tsx into a new sparkline/Sparkline.tsx file
- Extract SASS variables: pulls shared SCSS variables (colors, sizes, etc.) from grapher.scss into a new core/variables.scss
- Split tooltip components into separate files: breaks the monolithic Tooltip.tsx into TooltipCard.tsx and TooltipContainer.tsx
- Drop NO_DATA_LABEL import from ColorScale: removes an unused import
- Move makeAxisLabel to AxisUtils: relocates the helper from ChartUtils.tsx to axis/AxisUtils.ts
Refactors TextWrap and MarkdownTextWrap. The main motivation is to refactor MobX away so that we can use these utilities in bespoke projects, but I went a bit further and also split state from rendering and introduced a common interface for TextWrap and MarkdownTextWrap.

In summary,

- Removed MobX from `TextWrap` and `MarkdownTextWrap`, either dropping `@computed` or replacing it with `@imemo`
- Convert MarkdownTextWrap from a React component to a plain class, removing the JSX rendering pattern
- Separate state from rendering by extracting render methods into standalone React components: TextWrapSvg, TextWrapHtml, MarkdownTextWrapSvg, MarkdownTextWrapHtml
- Introduce a shared TextWrap interface
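As a rough illustration of the pattern described above (a plain class with memoized derived state, and rendering kept outside the class), here is a minimal sketch. `SimpleTextWrap`, its props, and the fixed per-character width are all hypothetical; the real `TextWrap` measures text differently, and the hand-written cache only stands in for what a memoizing decorator such as `@imemo` would do.

```typescript
// Hypothetical props: a fixed character width replaces real text measurement.
interface WrapProps {
    text: string
    maxWidth: number // available width in pixels
    charWidth: number // assumed width of every character in pixels
}

// Plain class: no MobX, no React. Derived state is cached on first access.
class SimpleTextWrap {
    private readonly props: WrapProps
    private cachedLines: string[] | undefined

    constructor(props: WrapProps) {
        this.props = props
    }

    // Computed once, then reused on every later access.
    get lines(): string[] {
        if (this.cachedLines === undefined) this.cachedLines = this.computeLines()
        return this.cachedLines
    }

    // Greedy word wrapping: start a new line when the next word would overflow.
    private computeLines(): string[] {
        const { text, maxWidth, charWidth } = this.props
        const maxChars = Math.max(1, Math.floor(maxWidth / charWidth))
        const lines: string[] = []
        let current = ""
        for (const word of text.split(/\s+/).filter(Boolean)) {
            const candidate = current ? `${current} ${word}` : word
            if (candidate.length > maxChars && current) {
                lines.push(current)
                current = word
            } else {
                current = candidate
            }
        }
        if (current) lines.push(current)
        return lines
    }
}

// Rendering is a separate function of the wrap's state (an SVG or HTML
// component would take the same object as input).
function renderAsText(wrap: SimpleTextWrap): string {
    return wrap.lines.join("\n")
}
```

Keeping the class free of framework code is what lets several renderers, or a project without MobX, share the same line-breaking state.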
This PR renders Grapher’s Dropdown component in the example bespoke viz, since most bespoke viz projects will likely need it, but there is a bit of setup required to make React Aria work with the Shadow DOM. I initially thought we’d need to portal the popover into the shadow DOM. But while porting this code over from the Causes of Death project, where I first experimented with this, I realised the solution is much simpler: it’s actually fine to attach the popover to the body outside the shadow DOM. What’s then missing are the popover styles, because they don’t exist on the demo page. But they should exist on any OWID page, since they’re bundled into owid.css, right? So the simplest fix seems to be to just import those styles on the demo page. Of course, this is a bit brittle, because we’re relying on the embedder to provide those styles. But I think it should be fine in practice, since we’ll always be embedding these in GDoc articles that live on our site.
Adds a few shared components for bespoke projects and adds an example chart to the example project.

In summary,

- Adds reusable components for bespoke projects: ChartHeader, ChartFooter, Frame, TimeSlider, and BezierArrow
- Adds a shared useDimensions hook for responsive chart sizing via ResizeObserver
- Adds a new "chart" variant to the example bespoke project that demonstrates how to compose these shared components using `@visx` packages
- Improve the layout of the demo page (the boxes looked nice for the small examples, but I found it distracting when working on the Causes of Death Treemap)
Explicitly pass ADMIN_SERVER_PORT, VITE_PORT, WRANGLER_PORT, and COMPOSE_PROJECT_NAME to tmux shell commands so user overrides are respected. Also use TMUX_SESSION_NAME in up.devcontainer instead of a hardcoded session name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Forward env vars to tmux subshells in Makefile