Add multi-recipe page detection and background extraction#73

Merged
windoze95 merged 8 commits into main from feat/multi-recipe-resolution
Apr 5, 2026
@windoze95 windoze95 commented Apr 5, 2026

Summary

When a user clicks a search result that contains multiple recipes (e.g. "Top 10 Pasta Recipes"), the system now detects this during the preview step and expands the single result into individual recipe cards.

How it works:

  1. User clicks a search result — no detection overhead during search itself
  2. Preview endpoint fetches the page and counts JSON-LD Recipe blocks
  3. If multiple recipes are found, the endpoint returns {is_multi: true, multi_id, recipes: [...cards...]} with a title and image for each card
  4. Background parallel extraction of each recipe starts immediately via Claude
  5. Extracted recipes cached in canonical repo with distinct per-card keys
  6. Clicking an already-extracting card uses the cached/in-progress result (no re-extraction)
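
Step 2 above (counting JSON-LD Recipe blocks) could look roughly like the sketch below. It is an illustration, not the code in this PR: the function names are hypothetical, a regex stands in for a real HTML parser, and only top-level nodes, arrays, and `@graph` containers are inspected.

```go
package main

import (
	"encoding/json"
	"regexp"
	"strings"
)

// ldJSONScript matches <script type="application/ld+json">...</script> blocks.
// A regex keeps the sketch short; a real HTML parser would be more robust.
var ldJSONScript = regexp.MustCompile(`(?is)<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>`)

// isRecipeType reports whether a JSON-LD "@type" value names a Recipe.
// "@type" may be a single string or an array of strings.
func isRecipeType(t any) bool {
	switch v := t.(type) {
	case string:
		return strings.EqualFold(v, "Recipe")
	case []any:
		for _, item := range v {
			if isRecipeType(item) {
				return true
			}
		}
	}
	return false
}

// countRecipes walks a decoded JSON-LD value and counts Recipe nodes,
// descending into arrays and "@graph" containers.
func countRecipes(node any) int {
	switch v := node.(type) {
	case []any:
		n := 0
		for _, item := range v {
			n += countRecipes(item)
		}
		return n
	case map[string]any:
		if isRecipeType(v["@type"]) {
			return 1
		}
		return countRecipes(v["@graph"])
	}
	return 0
}

// CountRecipeBlocks returns how many JSON-LD Recipe nodes appear in the page.
// A result greater than 1 would mark the page as multi-recipe.
func CountRecipeBlocks(html string) int {
	total := 0
	for _, m := range ldJSONScript.FindAllStringSubmatch(html, -1) {
		var doc any
		if err := json.Unmarshal([]byte(m[1]), &doc); err != nil {
			continue // skip malformed blocks rather than failing the preview
		}
		total += countRecipes(doc)
	}
	return total
}
```

A caller would treat `CountRecipeBlocks(pageHTML) > 1` as the multi-recipe signal and fall through to the normal single-recipe preview otherwise.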

Components:

  • MultiRecipeRegistry — in-memory tracking of extraction state per URL (prevents duplicate work)
  • MultiRecipeResolver — orchestrates card extraction from JSON-LD and parallel background AI extraction
  • PreviewFromURLWithMultiCheck — preview endpoint variant that detects multi-recipe pages on click
  • fetchHTML() — dedicated HTML fetch with Firecrawl fallback
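
To make the "prevents duplicate work" role of the registry concrete, here is a minimal sketch of in-memory state tracking per URL. The type and method names (`Registry`, `TryStart`, `Finish`, `StatusOf`) are assumptions for illustration; the actual `MultiRecipeRegistry` API is not shown in this PR description.

```go
package main

import "sync"

// Status values mirror the progress states named in the test plan.
type Status string

const (
	StatusPending    Status = "pending"
	StatusExtracting Status = "extracting"
	StatusDone       Status = "done"
)

// Registry tracks extraction state per URL in memory.
type Registry struct {
	mu      sync.Mutex
	entries map[string]Status
}

// NewRegistry returns an empty registry ready for use.
func NewRegistry() *Registry {
	return &Registry{entries: make(map[string]Status)}
}

// TryStart marks url as extracting and returns true only for the first caller.
// Later callers get false and should reuse the in-progress or cached result.
func (r *Registry) TryStart(url string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, tracked := r.entries[url]; tracked {
		return false
	}
	r.entries[url] = StatusExtracting
	return true
}

// Finish records that extraction for url has completed.
func (r *Registry) Finish(url string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.entries[url] = StatusDone
}

// StatusOf returns the current state for url and whether it is tracked at all.
func (r *Registry) StatusOf(url string) (Status, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.entries[url]
	return s, ok
}
```

The check and the write in `TryStart` happen under one lock, so two simultaneous clicks on the same card cannot both start an extraction.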

New endpoints:

  • GET /v1/recipes/search/resolve/:multi_id — poll resolution status and individual card extraction progress
  • POST /v1/recipes/search/check-multi — explicit multi-recipe check for a URL

Modified endpoint:

  • POST /v1/recipes/preview/url — now returns {is_multi, multi_id, recipes} when multiple recipes detected
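
For reference, the multi-recipe response shape could be modeled as below. Only the top-level keys `is_multi`, `multi_id`, and `recipes` come from this description; the card field names (`title`, `image`) and the Go type names are assumptions.

```go
package main

import "encoding/json"

// RecipeCard is a hypothetical card shape; the description only says each
// card carries a title and an image.
type RecipeCard struct {
	Title string `json:"title"`
	Image string `json:"image,omitempty"`
}

// PreviewResponse models the multi-recipe branch of POST /v1/recipes/preview/url.
type PreviewResponse struct {
	IsMulti bool         `json:"is_multi"`
	MultiID string       `json:"multi_id,omitempty"`
	Recipes []RecipeCard `json:"recipes,omitempty"`
}

// EncodeMultiPreview serializes a multi-recipe preview for the given cards.
func EncodeMultiPreview(multiID string, cards []RecipeCard) ([]byte, error) {
	return json.Marshal(PreviewResponse{IsMulti: true, MultiID: multiID, Recipes: cards})
}
```

A client would then poll `GET /v1/recipes/search/resolve/:multi_id` with the returned `multi_id` to follow per-card progress.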

Test plan

  • go build && go vet && go test ./internal/... all pass
  • Click a multi-recipe search result — preview returns is_multi: true with individual cards
  • Poll resolve endpoint — cards show extraction progress (pending → extracting → done)
  • Click an individual card that's already extracting — uses cached result, no restart
  • Click a single-recipe result — normal preview flow unchanged

🤖 Generated with Claude Code

When search results contain multi-recipe pages ("Top 10 Pasta Recipes"),
the system now:

1. Detects multi-recipe titles via regex heuristics (IsMultiRecipeTitle)
2. Marks them with is_multi=true and a multi_id in search results
3. Fetches the page in background and extracts all JSON-LD Recipe blocks
4. Creates individual recipe cards with title/image for each
5. Starts parallel background extraction of each recipe via Claude
6. Caches extracted recipes in the canonical repo
7. Never restarts extraction for already-tracked URLs

New endpoints:
- GET /v1/recipes/search/resolve/:multi_id — poll resolution status
- POST /v1/recipes/search/check-multi — late detection when a result
  is clicked and turns out to be multi-recipe

Components:
- MultiRecipeRegistry: in-memory tracking of extraction state per URL
- MultiRecipeResolver: orchestrates detection, card extraction, and
  parallel background full extraction
- fetchHTML(): dedicated HTML fetch with Firecrawl fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ceba9805da

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/service/multi_recipe.go Outdated
Comment thread internal/service/multi_recipe.go Outdated
Comment thread internal/service/import.go Outdated
windoze95 and others added 2 commits April 4, 2026 22:25
Detection now happens during PreviewFromURL (when user clicks a result),
not during search result post-processing. This eliminates latency from
background page fetching during search.

- Remove PostProcessSearchResults and search-time title detection
- Remove IsMulti/MultiID from SearchResult (not set at search time)
- Remove MultiResolver from SearchService (search is unaware of multi)
- Add PreviewFromURLWithMultiCheck on ImportService: fetches HTML,
  counts JSON-LD Recipe blocks, returns multi-recipe response if >1
- Update PreviewFromURL handler to return {is_multi, multi_id, recipes}
  when multiple recipes detected
- Wire MultiResolver into ImportHandler for click-time detection

Flow: user clicks result → preview endpoint fetches page → detects
multiple recipes → returns cards → frontend expands in place →
background extraction continues for each card.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Pass page HTML to ExtractRecipeFromText when extracting individual
   cards — previously only an instruction string was passed with no
   source content, causing hallucination/failure.

2. Use slug query param (_recipe=title-slug) instead of URL fragment
   for canonical cache keys — NormalizeURL strips fragments, so all
   cards from the same page were overwriting each other.

3. Clear extraction error in fetchAndExtractWithHTML when HTML fetch
   succeeds — callers need the HTML for multi-recipe card detection
   even if single-recipe extraction failed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
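
The cache-key collision in item 2 can be reproduced with the standard library alone. In this sketch `normalizeURL` is a stand-in for the project's `NormalizeURL`; the only behavior assumed is the one the commit message states, namely that fragments are stripped. The other function names are hypothetical.

```go
package main

import "net/url"

// normalizeURL is a stand-in for NormalizeURL: it drops the fragment and
// leaves everything else untouched.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Fragment, u.RawFragment = "", ""
	return u.String(), nil
}

// cardKeyWithFragment shows the old approach: the slug lives in the fragment,
// so normalization erases it and every card on a page shares one key.
func cardKeyWithFragment(pageURL, slug string) (string, error) {
	return normalizeURL(pageURL + "#" + slug)
}

// cardKeyWithQuery shows the fix: the slug is carried in a _recipe query
// parameter, which survives normalization and keeps keys distinct.
func cardKeyWithQuery(pageURL, slug string) (string, error) {
	u, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("_recipe", slug)
	u.RawQuery = q.Encode()
	return normalizeURL(u.String())
}
```

With `https://example.com/top-10` as the page, the fragment variant yields the same key for "pasta" and "curry", while the query variant yields `...?_recipe=pasta` and `...?_recipe=curry`.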
@windoze95 windoze95 marked this pull request as draft April 5, 2026 03:31
@windoze95 windoze95 marked this pull request as ready for review April 5, 2026 03:31

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3b3ce883cb

Comment thread internal/handlers/search.go
Comment thread internal/service/multi_recipe.go
Comment thread internal/service/multi_recipe.go
windoze95 and others added 2 commits April 4, 2026 22:43
1. Add ValidateExternalURL call in CheckMultiRecipe handler before
   any network fetches — closes SSRF hole where crafted URLs could
   reach internal/private targets. Export the validator for handler use.

2. Store distinct URL (with _recipe= slug) as OriginalURL in canonical
   entries — refreshStaleCanonicals was re-extracting from the shared
   listicle URL, overwriting all cards with the same recipe.

3. Clear pageHTML after extraction completes and add 30-minute TTL
   eviction loop for resolved/failed registry entries — prevents
   unbounded memory growth from retained HTML blobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
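
The SSRF check in item 1 is the kind of guard sketched below. This is not the project's `ValidateExternalURL`, whose body is not shown here; `validateExternalURL` is a hypothetical stand-in that only inspects the scheme and literal host. A production validator would also need to resolve hostnames and check the resolved addresses, which this sketch deliberately omits.

```go
package main

import (
	"errors"
	"net"
	"net/url"
	"strings"
)

var errBlockedURL = errors.New("url targets a non-public address")

// validateExternalURL rejects URLs that are not plain http(s) or that point
// at loopback, private, link-local, or unspecified IP literals.
// Assumption: no DNS resolution is performed, so a hostname that resolves to
// a private address would still pass this sketch.
func validateExternalURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return errors.New("only http and https URLs are allowed")
	}
	host := u.Hostname()
	if host == "" || strings.EqualFold(host, "localhost") {
		return errBlockedURL
	}
	if ip := net.ParseIP(host); ip != nil {
		if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() ||
			ip.IsLinkLocalMulticast() || ip.IsUnspecified() {
			return errBlockedURL
		}
	}
	return nil
}
```

Calling the validator in the handler before any fetch, as the commit describes, means a rejected URL never reaches `fetchHTML` or the Firecrawl fallback.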
1. Eliminate double-fetch in fetchAndExtractWithHTML — now calls
   fetchHTML once and extracts from the returned HTML via
   extractRecipeFromHTML, instead of calling extractFromURL
   (which fetches internally) then fetchHTML again.

2. Truncate HTML to 100KB before passing to Claude in
   extractSingleCard — prevents context window overflow on
   large listicle pages.

3. PreviewFromURLWithMultiCheck no longer falls back to
   PreviewFromURL (which would re-fetch) — extracts from the
   HTML already in hand, both JSON-LD and AI fallback paths.

4. Deep-copy RecipeDef pointer in GetCards — prevents data
   races between extractSingleCard writing and handlers reading.

5. Eviction loop no longer holds registry write lock while
   acquiring entry read locks — collects candidates under read
   lock first, then deletes under write lock.

6. Remove dead code in ResolveFromURL — the recipeDef != nil
   branch that returned nil unconditionally.

7. Remove unused IsMultiRecipeTitle and multiRecipePatterns —
   leftover from the removed search-time detection.

Also: fetchHTML now returns typed ExtractionError (not_found,
site_blocked, fetch_failed) matching extractFromURL behavior,
fixing the preview handler test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
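
The locking change in item 5 follows a collect-then-delete pattern. The sketch below shows one way to structure it; the type names, field names, and the `finishedAgo` helper are hypothetical, and only the 30-minute TTL and the two-phase locking order come from the commit messages.

```go
package main

import (
	"sync"
	"time"
)

// registryTTL matches the 30-minute eviction window from the earlier commit.
const registryTTL = 30 * time.Minute

// entry is a hypothetical registry entry guarded by its own lock.
type entry struct {
	mu         sync.RWMutex
	resolved   bool
	finishedAt time.Time
}

type registry struct {
	mu      sync.RWMutex
	entries map[string]*entry
}

// finishedAgo builds a resolved entry that finished `age` ago (demo helper).
func finishedAgo(age time.Duration) *entry {
	return &entry{resolved: true, finishedAt: time.Now().Add(-age)}
}

// evictExpired removes resolved entries older than ttl and returns the count.
func (r *registry) evictExpired(ttl time.Duration) int {
	now := time.Now()

	// Phase 1: collect candidates under the registry READ lock, taking each
	// entry's read lock only briefly. Writers to the map are not blocked on
	// entry locks held elsewhere.
	var expired []string
	r.mu.RLock()
	for key, e := range r.entries {
		e.mu.RLock()
		if e.resolved && now.Sub(e.finishedAt) > ttl {
			expired = append(expired, key)
		}
		e.mu.RUnlock()
	}
	r.mu.RUnlock()

	// Phase 2: delete under the registry WRITE lock without touching any
	// entry lock, so the write lock is never held while waiting on an entry.
	r.mu.Lock()
	for _, key := range expired {
		delete(r.entries, key)
	}
	r.mu.Unlock()
	return len(expired)
}
```

The trade-off is a small window between the two phases in which an entry could change; for entries that are already resolved or failed this is usually acceptable, but a stricter version could re-check under the write lock.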
@windoze95 windoze95 marked this pull request as draft April 5, 2026 03:54
@windoze95 windoze95 marked this pull request as ready for review April 5, 2026 03:54

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa638c23d4

Comment thread internal/service/import.go
Comment thread internal/service/import.go Outdated
Comment thread internal/service/multi_recipe.go Outdated
1. Skip canonical cache early-return when multi-recipe resolver is
   active — previously-cached listicle URLs now reach multi-recipe
   detection instead of always returning a single recipe.

2. Add typed ExtractionError (not_found, site_blocked) to fetchHTML's
   skip-direct-fetch path — was returning raw Firecrawl errors that
   didn't map to proper HTTP status codes in handlers.

3. ResolveFromURL now calls fetchHTML directly instead of
   fetchAndExtractWithHTML — avoids paying for an AI extraction call
   whose result is discarded (only JSON-LD card detection is needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
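
Item 2 relies on handlers translating a typed error into an HTTP status. A sketch of that translation follows. The error kinds (`not_found`, `site_blocked`, `fetch_failed`) come from the commit messages, but the struct layout and the specific status codes chosen here are assumptions, since the handlers' actual mapping is not shown.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ExtractionError is a hypothetical shape for the typed error; only the kind
// names are taken from the commit messages.
type ExtractionError struct {
	Kind string // "not_found", "site_blocked", or "fetch_failed"
	Err  error
}

func (e *ExtractionError) Error() string { return fmt.Sprintf("%s: %v", e.Kind, e.Err) }
func (e *ExtractionError) Unwrap() error { return e.Err }

// errUpstream stands in for a raw, untyped error from a fetch provider.
var errUpstream = errors.New("upstream fetch error")

// statusFor maps an error to an HTTP status. Untyped errors fall back to 500,
// which is the symptom the commit fixes for the skip-direct-fetch path.
// The codes below are illustrative choices, not the project's mapping.
func statusFor(err error) int {
	var ee *ExtractionError
	if !errors.As(err, &ee) {
		return http.StatusInternalServerError
	}
	switch ee.Kind {
	case "not_found":
		return http.StatusNotFound
	case "site_blocked":
		return http.StatusUnprocessableEntity
	case "fetch_failed":
		return http.StatusBadGateway
	}
	return http.StatusInternalServerError
}
```

Because `statusFor` uses `errors.As`, the typed error is still recognized when a caller wraps it with additional context.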
@windoze95 windoze95 marked this pull request as draft April 5, 2026 04:06
@windoze95 windoze95 marked this pull request as ready for review April 5, 2026 04:06

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5a7f612663

Comment thread internal/service/import.go Outdated
Comment thread internal/service/multi_recipe.go
Comment thread internal/service/multi_recipe.go
1. Re-enable canonical cache for confirmed single-recipe URLs — check
   registry for resolved/failed status with <=1 cards before using
   cache, instead of blanket-disabling when resolver is non-nil.

2. Skip multi-recipe card entries (those with _recipe= in OriginalURL)
   during refreshStaleCanonicals — extractFromURL grabs the first
   recipe from the page, which overwrites card-specific cache entries.

3. Fall back to "card-{idx}" when slug mapper produces empty string
   from non-ASCII titles, preventing canonical key collisions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
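
The empty-slug problem in item 3 is easy to see with an ASCII-only slug mapper. The sketch below is an assumed implementation (the project's real slug mapper is not shown); `slugify` and `cardSlug` are hypothetical names.

```go
package main

import (
	"strconv"
	"strings"
)

// slugify lowercases the title, keeps ASCII letters and digits, and joins
// runs of any other characters with a single hyphen. Non-ASCII titles
// therefore produce an empty string.
func slugify(title string) string {
	var b strings.Builder
	pendingDash := false
	for _, r := range strings.ToLower(title) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			if pendingDash && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingDash = false
			b.WriteRune(r)
		} else {
			pendingDash = true
		}
	}
	return b.String()
}

// cardSlug falls back to "card-{idx}" when the title yields no slug, so two
// non-ASCII titles on the same page still get distinct canonical keys.
func cardSlug(title string, idx int) string {
	if s := slugify(title); s != "" {
		return s
	}
	return "card-" + strconv.Itoa(idx)
}
```

For example, `cardSlug("Creamy Pasta!", 0)` gives `creamy-pasta`, while `cardSlug("麻婆豆腐", 2)` gives `card-2` instead of an empty key.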
@windoze95 windoze95 marked this pull request as draft April 5, 2026 04:24
@windoze95 windoze95 marked this pull request as ready for review April 5, 2026 04:24

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 875f0c5f4b

Comment thread internal/service/multi_recipe.go
Comment thread internal/service/multi_recipe.go
1. Add semaphore (max 3 concurrent) to extractAllRecipes — prevents
   unbounded burst of LLM requests on pages with many recipe cards.

2. Append card index to all canonical slugs (e.g. "pasta-0", "curry-1")
   — prevents key collisions from titles that produce identical slugs
   due to punctuation/spacing differences.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
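
The semaphore in item 1 can be built from a buffered channel. The sketch below shows the general pattern under assumed names (`extractAll`, `maxConcurrentExtractions`); the real `extractAllRecipes` signature is not shown, and the `extract` callback stands in for the per-card LLM call.

```go
package main

import "sync"

// maxConcurrentExtractions mirrors the "max 3 concurrent" limit from the commit.
const maxConcurrentExtractions = 3

// extractAll runs extract once per card with at most `limit` calls in flight,
// and returns after every card has been processed.
func extractAll(cards []string, limit int, extract func(card string)) {
	sem := make(chan struct{}, limit) // buffered channel used as a counting semaphore
	var wg sync.WaitGroup
	for _, card := range cards {
		wg.Add(1)
		sem <- struct{}{} // blocks the loop while `limit` extractions are running
		go func(card string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when this card finishes
			extract(card)
		}(card)
	}
	wg.Wait()
}
```

Acquiring the slot before starting the goroutine also bounds the number of goroutines, not just the number of concurrent requests, which matters on pages with many recipe cards.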
@windoze95 windoze95 merged commit 1064d2b into main Apr 5, 2026
1 check passed
@windoze95 windoze95 deleted the feat/multi-recipe-resolution branch April 5, 2026 05:09