Add multi-recipe page detection and background extraction #73
When search results contain multi-recipe pages ("Top 10 Pasta Recipes"),
the system now:
1. Detects multi-recipe titles via regex heuristics (IsMultiRecipeTitle)
2. Marks them with is_multi=true and a multi_id in search results
3. Fetches the page in background and extracts all JSON-LD Recipe blocks
4. Creates individual recipe cards with title/image for each
5. Starts parallel background extraction of each recipe via Claude
6. Caches extracted recipes in the canonical repo
7. Never restarts extraction for already-tracked URLs
New endpoints:
- GET /v1/recipes/search/resolve/:multi_id — poll resolution status
- POST /v1/recipes/search/check-multi — late detection when a result
is clicked and turns out to be multi-recipe
Components:
- MultiRecipeRegistry: in-memory tracking of extraction state per URL
- MultiRecipeResolver: orchestrates detection, card extraction, and
parallel background full extraction
- fetchHTML(): dedicated HTML fetch with Firecrawl fallback
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
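The "never restarts extraction for already-tracked URLs" behavior can be sketched as a mutex-guarded map whose Track call reports whether a URL is new. Names, states, and fields below are illustrative, not the actual MultiRecipeRegistry implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// extractionState mirrors the lifecycle a registry entry moves through.
type extractionState int

const (
	statePending extractionState = iota
	stateResolved
	stateFailed
)

// Registry is a minimal in-memory sketch of per-URL extraction tracking.
type Registry struct {
	mu      sync.Mutex
	entries map[string]extractionState
}

func NewRegistry() *Registry {
	return &Registry{entries: make(map[string]extractionState)}
}

// Track registers a URL and returns false if it was already tracked,
// so callers skip re-starting extraction for known URLs.
func (r *Registry) Track(url string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.entries[url]; ok {
		return false
	}
	r.entries[url] = statePending
	return true
}

func main() {
	r := NewRegistry()
	fmt.Println(r.Track("https://example.com/top-10-pasta")) // first time: true
	fmt.Println(r.Track("https://example.com/top-10-pasta")) // already tracked: false
}
```

The second Track call returning false is what keeps background extraction idempotent when a user clicks the same result twice.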
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ceba9805da
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Detection now happens during PreviewFromURL (when user clicks a result),
not during search result post-processing. This eliminates latency from
background page fetching during search.
- Remove PostProcessSearchResults and search-time title detection
- Remove IsMulti/MultiID from SearchResult (not set at search time)
- Remove MultiResolver from SearchService (search is unaware of multi)
- Add PreviewFromURLWithMultiCheck on ImportService: fetches HTML,
counts JSON-LD Recipe blocks, returns multi-recipe response if >1
- Update PreviewFromURL handler to return {is_multi, multi_id, recipes}
when multiple recipes detected
- Wire MultiResolver into ImportHandler for click-time detection
Flow: user clicks result → preview endpoint fetches page → detects
multiple recipes → returns cards → frontend expands in place →
background extraction continues for each card.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
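The click-time detection step — counting JSON-LD Recipe blocks in the fetched HTML — might look roughly like the sketch below. It is deliberately simplified: real pages also nest recipes inside @graph arrays and top-level JSON arrays, which this version skips.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// ldJSONRe grabs the body of each JSON-LD script block.
var ldJSONRe = regexp.MustCompile(`(?is)<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>`)

// countRecipeBlocks counts top-level JSON-LD objects whose @type is
// "Recipe"; more than one means the page is treated as multi-recipe.
func countRecipeBlocks(html string) int {
	n := 0
	for _, m := range ldJSONRe.FindAllStringSubmatch(html, -1) {
		var obj map[string]any
		if err := json.Unmarshal([]byte(m[1]), &obj); err != nil {
			continue // skip arrays/malformed blocks in this sketch
		}
		if t, _ := obj["@type"].(string); t == "Recipe" {
			n++
		}
	}
	return n
}

func main() {
	page := `<script type="application/ld+json">{"@type":"Recipe","name":"Carbonara"}</script>
<script type="application/ld+json">{"@type":"Recipe","name":"Pesto"}</script>`
	fmt.Println(countRecipeBlocks(page) > 1) // multi-recipe page
}
```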
1. Pass page HTML to ExtractRecipeFromText when extracting individual cards — previously only an instruction string was passed with no source content, causing hallucination/failure.
2. Use slug query param (_recipe=title-slug) instead of URL fragment for canonical cache keys — NormalizeURL strips fragments, so all cards from the same page were overwriting each other.
3. Clear extraction error in fetchAndExtractWithHTML when HTML fetch succeeds — callers need the HTML for multi-recipe card detection even if single-recipe extraction failed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
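The cache-key fix hinges on where the per-card marker lives in the URL. A minimal sketch (normalize here is a stand-in for the repo's NormalizeURL; cardKey is illustrative) shows why fragment keys collide and query-param slugs do not:

```go
package main

import (
	"fmt"
	"net/url"
)

// normalize strips the fragment, as URL normalizers typically do —
// which is why fragment-based card keys collapsed into one entry.
func normalize(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return raw
	}
	u.Fragment = ""
	return u.String()
}

// cardKey appends a _recipe= slug as a query parameter, which survives
// normalization and keeps each card's canonical key distinct.
func cardKey(pageURL, slug string) string {
	u, err := url.Parse(pageURL)
	if err != nil {
		return pageURL
	}
	q := u.Query()
	q.Set("_recipe", slug)
	u.RawQuery = q.Encode()
	return u.String()
}

func main() {
	page := "https://example.com/top-10-pasta"
	// Fragments collapse to the same key after normalization...
	fmt.Println(normalize(page+"#carbonara") == normalize(page+"#pesto")) // true
	// ...while query-param slugs stay distinct.
	fmt.Println(cardKey(page, "carbonara") == cardKey(page, "pesto")) // false
}
```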
💡 Codex Review
Reviewed commit: 3b3ce883cb
1. Add ValidateExternalURL call in CheckMultiRecipe handler before any network fetches — closes an SSRF hole where crafted URLs could reach internal/private targets. Export the validator for handler use.
2. Store the distinct URL (with the _recipe= slug) as OriginalURL in canonical entries — refreshStaleCanonicals was re-extracting from the shared listicle URL, overwriting all cards with the same recipe.
3. Clear pageHTML after extraction completes and add a 30-minute TTL eviction loop for resolved/failed registry entries — prevents unbounded memory growth from retained HTML blobs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Eliminate double-fetch in fetchAndExtractWithHTML — now calls fetchHTML once and extracts from the returned HTML via extractRecipeFromHTML, instead of calling extractFromURL (which fetches internally) and then fetchHTML again.
2. Truncate HTML to 100KB before passing to Claude in extractSingleCard — prevents context-window overflow on large listicle pages.
3. PreviewFromURLWithMultiCheck no longer falls back to PreviewFromURL (which would re-fetch) — it extracts from the HTML already in hand, on both the JSON-LD and AI fallback paths.
4. Deep-copy the RecipeDef pointer in GetCards — prevents data races between extractSingleCard writing and handlers reading.
5. The eviction loop no longer holds the registry write lock while acquiring entry read locks — it collects candidates under the read lock first, then deletes under the write lock.
6. Remove dead code in ResolveFromURL — the recipeDef != nil branch that returned nil unconditionally.
7. Remove the unused IsMultiRecipeTitle and multiRecipePatterns — leftovers from the removed search-time detection.
Also: fetchHTML now returns a typed ExtractionError (not_found, site_blocked, fetch_failed) matching extractFromURL behavior, fixing the preview handler test.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
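The deep copy in item 4 is the standard Go pattern of copying the struct value and re-slicing any shared backing arrays. Assuming a RecipeDef shaped roughly like this (the real struct surely has more fields):

```go
package main

import "fmt"

type RecipeDef struct {
	Title       string
	Ingredients []string
}

// deepCopy returns an independent RecipeDef so a reader never shares
// backing storage with the extractor goroutine that keeps writing.
func deepCopy(r *RecipeDef) *RecipeDef {
	if r == nil {
		return nil
	}
	cp := *r // copies Title and the slice header
	cp.Ingredients = append([]string(nil), r.Ingredients...)
	return &cp
}

func main() {
	orig := &RecipeDef{Title: "Pesto", Ingredients: []string{"basil"}}
	snap := deepCopy(orig)
	orig.Ingredients[0] = "garlic" // writer keeps mutating the original
	fmt.Println(snap.Ingredients[0]) // snapshot is unaffected: "basil"
}
```

Without the re-slice, `cp := *r` alone would still share the Ingredients backing array, and the race would remain.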
💡 Codex Review
Reviewed commit: aa638c23d4
1. Skip the canonical-cache early return when the multi-recipe resolver is active — previously cached listicle URLs now reach multi-recipe detection instead of always returning a single recipe.
2. Add a typed ExtractionError (not_found, site_blocked) to fetchHTML's skip-direct-fetch path — it was returning raw Firecrawl errors that didn't map to proper HTTP status codes in handlers.
3. ResolveFromURL now calls fetchHTML directly instead of fetchAndExtractWithHTML — avoids paying for an AI extraction call whose result is discarded (only JSON-LD card detection is needed).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: 5a7f612663
1. Re-enable canonical cache for confirmed single-recipe URLs — check
registry for resolved/failed status with <=1 cards before using
cache, instead of blanket-disabling when resolver is non-nil.
2. Skip multi-recipe card entries (those with _recipe= in OriginalURL)
during refreshStaleCanonicals — extractFromURL grabs the first
recipe from the page, which overwrites card-specific cache entries.
3. Fall back to "card-{idx}" when slug mapper produces empty string
from non-ASCII titles, preventing canonical key collisions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
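The slug fallback in item 3 can be sketched with an ASCII-only slugifier (the repo's actual slug mapper may differ): a title made entirely of non-ASCII characters slugs to the empty string, so the card index steps in.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonSlugRe matches every run of characters that can't appear in a slug.
var nonSlugRe = regexp.MustCompile(`[^a-z0-9]+`)

// cardSlug lowercases, replaces non-slug runs with "-", and falls back
// to "card-{idx}" when nothing slug-safe survives, keeping canonical
// cache keys unique per card.
func cardSlug(title string, idx int) string {
	s := strings.Trim(nonSlugRe.ReplaceAllString(strings.ToLower(title), "-"), "-")
	if s == "" {
		return fmt.Sprintf("card-%d", idx)
	}
	return s
}

func main() {
	fmt.Println(cardSlug("Spicy Pasta!", 0)) // "spicy-pasta"
	fmt.Println(cardSlug("ラーメン", 3))        // all non-ASCII → "card-3"
}
```

A later commit in this PR goes further and appends the index to every slug, not just empty ones, so near-identical titles can't collide either.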
💡 Codex Review
Reviewed commit: 875f0c5f4b
1. Add a semaphore (max 3 concurrent) to extractAllRecipes — prevents an unbounded burst of LLM requests on pages with many recipe cards.
2. Append the card index to all canonical slugs (e.g. "pasta-0", "curry-1") — prevents key collisions from titles that produce identical slugs due to punctuation/spacing differences.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
When a user clicks a search result that contains multiple recipes (e.g. "Top 10 Pasta Recipes"), the system now detects this during the preview step and expands the single result into individual recipe cards.
How it works: the preview endpoint returns {is_multi: true, recipes: [...cards...]} with title/image for each.
Components:
- MultiRecipeRegistry — in-memory tracking of extraction state per URL (prevents duplicate work)
- MultiRecipeResolver — orchestrates card extraction from JSON-LD and parallel background AI extraction
- PreviewFromURLWithMultiCheck — preview endpoint variant that detects multi-recipe pages on click
- fetchHTML() — dedicated HTML fetch with Firecrawl fallback
New endpoints:
- GET /v1/recipes/search/resolve/:multi_id — poll resolution status and individual card extraction progress
- POST /v1/recipes/search/check-multi — explicit multi-recipe check for a URL
Modified endpoint:
- POST /v1/recipes/preview/url — now returns {is_multi, multi_id, recipes} when multiple recipes detected
Test plan
- go build && go vet && go test ./internal/... all pass
- Preview of a multi-recipe page returns is_multi: true with individual cards
🤖 Generated with Claude Code