Add multi-recipe page detection and background extraction #73
When search results contain multi-recipe pages ("Top 10 Pasta Recipes"),
the system now:
1. Detects multi-recipe titles via regex heuristics (IsMultiRecipeTitle)
2. Marks them with is_multi=true and a multi_id in search results
3. Fetches the page in background and extracts all JSON-LD Recipe blocks
4. Creates individual recipe cards with title/image for each
5. Starts parallel background extraction of each recipe via Claude
6. Caches extracted recipes in the canonical repo
7. Never restarts extraction for already-tracked URLs
New endpoints:
- GET /v1/recipes/search/resolve/:multi_id — poll resolution status
- POST /v1/recipes/search/check-multi — late detection when a result
is clicked and turns out to be multi-recipe
Components:
- MultiRecipeRegistry: in-memory tracking of extraction state per URL
- MultiRecipeResolver: orchestrates detection, card extraction, and
parallel background full extraction
- fetchHTML(): dedicated HTML fetch with Firecrawl fallback
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
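The "never restarts extraction for already-tracked URLs" behavior can be sketched as a mutex-guarded map whose Track call reports whether a URL is new. Names, states, and fields below are illustrative, not the actual MultiRecipeRegistry implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// extractionState mirrors the lifecycle a registry entry moves through.
type extractionState int

const (
	statePending extractionState = iota
	stateResolved
	stateFailed
)

// Registry is a minimal in-memory sketch of per-URL extraction tracking.
type Registry struct {
	mu      sync.Mutex
	entries map[string]extractionState
}

func NewRegistry() *Registry {
	return &Registry{entries: make(map[string]extractionState)}
}

// Track registers a URL and returns false if it was already tracked,
// so callers skip re-starting extraction for known URLs.
func (r *Registry) Track(url string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.entries[url]; ok {
		return false
	}
	r.entries[url] = statePending
	return true
}

func main() {
	r := NewRegistry()
	fmt.Println(r.Track("https://example.com/top-10-pasta")) // first time: true
	fmt.Println(r.Track("https://example.com/top-10-pasta")) // already tracked: false
}
```

The second Track call returning false is what keeps background extraction idempotent when a user clicks the same result twice.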
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ceba9805da
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Detection now happens during PreviewFromURL (when user clicks a result),
not during search result post-processing. This eliminates latency from
background page fetching during search.
- Remove PostProcessSearchResults and search-time title detection
- Remove IsMulti/MultiID from SearchResult (not set at search time)
- Remove MultiResolver from SearchService (search is unaware of multi)
- Add PreviewFromURLWithMultiCheck on ImportService: fetches HTML,
counts JSON-LD Recipe blocks, returns multi-recipe response if >1
- Update PreviewFromURL handler to return {is_multi, multi_id, recipes}
when multiple recipes detected
- Wire MultiResolver into ImportHandler for click-time detection
Flow: user clicks result → preview endpoint fetches page → detects
multiple recipes → returns cards → frontend expands in place →
background extraction continues for each card.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
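The click-time detection step — counting JSON-LD Recipe blocks in the fetched HTML — might look roughly like the sketch below. It is deliberately simplified: real pages also nest recipes inside @graph arrays and top-level JSON arrays, which this version skips.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// ldJSONRe grabs the body of each JSON-LD script block.
var ldJSONRe = regexp.MustCompile(`(?is)<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>`)

// countRecipeBlocks counts top-level JSON-LD objects whose @type is
// "Recipe"; more than one means the page is treated as multi-recipe.
func countRecipeBlocks(html string) int {
	n := 0
	for _, m := range ldJSONRe.FindAllStringSubmatch(html, -1) {
		var obj map[string]any
		if err := json.Unmarshal([]byte(m[1]), &obj); err != nil {
			continue // skip arrays/malformed blocks in this sketch
		}
		if t, _ := obj["@type"].(string); t == "Recipe" {
			n++
		}
	}
	return n
}

func main() {
	page := `<script type="application/ld+json">{"@type":"Recipe","name":"Carbonara"}</script>
<script type="application/ld+json">{"@type":"Recipe","name":"Pesto"}</script>`
	fmt.Println(countRecipeBlocks(page) > 1) // multi-recipe page
}
```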
1. Pass page HTML to ExtractRecipeFromText when extracting individual cards — previously only an instruction string was passed with no source content, causing hallucination/failure.
2. Use slug query param (_recipe=title-slug) instead of URL fragment for canonical cache keys — NormalizeURL strips fragments, so all cards from the same page were overwriting each other.
3. Clear extraction error in fetchAndExtractWithHTML when HTML fetch succeeds — callers need the HTML for multi-recipe card detection even if single-recipe extraction failed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
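The cache-key fix hinges on where the per-card marker lives in the URL. A minimal sketch (normalize here is a stand-in for the repo's NormalizeURL; cardKey is illustrative) shows why fragment keys collide and query-param slugs do not:

```go
package main

import (
	"fmt"
	"net/url"
)

// normalize strips the fragment, as URL normalizers typically do —
// which is why fragment-based card keys collapsed into one entry.
func normalize(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return raw
	}
	u.Fragment = ""
	return u.String()
}

// cardKey appends a _recipe= slug as a query parameter, which survives
// normalization and keeps each card's canonical key distinct.
func cardKey(pageURL, slug string) string {
	u, err := url.Parse(pageURL)
	if err != nil {
		return pageURL
	}
	q := u.Query()
	q.Set("_recipe", slug)
	u.RawQuery = q.Encode()
	return u.String()
}

func main() {
	page := "https://example.com/top-10-pasta"
	// Fragments collapse to the same key after normalization...
	fmt.Println(normalize(page+"#carbonara") == normalize(page+"#pesto")) // true
	// ...while query-param slugs stay distinct.
	fmt.Println(cardKey(page, "carbonara") == cardKey(page, "pesto")) // false
}
```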
💡 Codex Review
Reviewed commit: 3b3ce883cb
1. Add ValidateExternalURL call in CheckMultiRecipe handler before any network fetches — closes an SSRF hole where crafted URLs could reach internal/private targets. Export the validator for handler use.
2. Store the distinct URL (with the _recipe= slug) as OriginalURL in canonical entries — refreshStaleCanonicals was re-extracting from the shared listicle URL, overwriting all cards with the same recipe.
3. Clear pageHTML after extraction completes and add a 30-minute TTL eviction loop for resolved/failed registry entries — prevents unbounded memory growth from retained HTML blobs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Eliminate double-fetch in fetchAndExtractWithHTML — now calls fetchHTML once and extracts from the returned HTML via extractRecipeFromHTML, instead of calling extractFromURL (which fetches internally) and then fetchHTML again.
2. Truncate HTML to 100KB before passing to Claude in extractSingleCard — prevents context-window overflow on large listicle pages.
3. PreviewFromURLWithMultiCheck no longer falls back to PreviewFromURL (which would re-fetch) — it extracts from the HTML already in hand, on both the JSON-LD and AI fallback paths.
4. Deep-copy the RecipeDef pointer in GetCards — prevents data races between extractSingleCard writing and handlers reading.
5. The eviction loop no longer holds the registry write lock while acquiring entry read locks — it collects candidates under the read lock first, then deletes under the write lock.
6. Remove dead code in ResolveFromURL — the recipeDef != nil branch that returned nil unconditionally.
7. Remove the unused IsMultiRecipeTitle and multiRecipePatterns — leftovers from the removed search-time detection.
Also: fetchHTML now returns a typed ExtractionError (not_found, site_blocked, fetch_failed) matching extractFromURL behavior, fixing the preview handler test.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
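The deep copy in item 4 is the standard Go pattern of copying the struct value and re-slicing any shared backing arrays. Assuming a RecipeDef shaped roughly like this (the real struct surely has more fields):

```go
package main

import "fmt"

type RecipeDef struct {
	Title       string
	Ingredients []string
}

// deepCopy returns an independent RecipeDef so a reader never shares
// backing storage with the extractor goroutine that keeps writing.
func deepCopy(r *RecipeDef) *RecipeDef {
	if r == nil {
		return nil
	}
	cp := *r // copies Title and the slice header
	cp.Ingredients = append([]string(nil), r.Ingredients...)
	return &cp
}

func main() {
	orig := &RecipeDef{Title: "Pesto", Ingredients: []string{"basil"}}
	snap := deepCopy(orig)
	orig.Ingredients[0] = "garlic" // writer keeps mutating the original
	fmt.Println(snap.Ingredients[0]) // snapshot is unaffected: "basil"
}
```

Without the re-slice, `cp := *r` alone would still share the Ingredients backing array, and the race would remain.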
💡 Codex Review
Reviewed commit: aa638c23d4
1. Skip the canonical-cache early return when the multi-recipe resolver is active — previously cached listicle URLs now reach multi-recipe detection instead of always returning a single recipe.
2. Add a typed ExtractionError (not_found, site_blocked) to fetchHTML's skip-direct-fetch path — it was returning raw Firecrawl errors that didn't map to proper HTTP status codes in handlers.
3. ResolveFromURL now calls fetchHTML directly instead of fetchAndExtractWithHTML — avoids paying for an AI extraction call whose result is discarded (only JSON-LD card detection is needed).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: 5a7f612663
1. Re-enable canonical cache for confirmed single-recipe URLs — check
registry for resolved/failed status with <=1 cards before using
cache, instead of blanket-disabling when resolver is non-nil.
2. Skip multi-recipe card entries (those with _recipe= in OriginalURL)
during refreshStaleCanonicals — extractFromURL grabs the first
recipe from the page, which overwrites card-specific cache entries.
3. Fall back to "card-{idx}" when slug mapper produces empty string
from non-ASCII titles, preventing canonical key collisions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
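The slug fallback in item 3 can be sketched with an ASCII-only slugifier (the repo's actual slug mapper may differ): a title made entirely of non-ASCII characters slugs to the empty string, so the card index steps in.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonSlugRe matches every run of characters that can't appear in a slug.
var nonSlugRe = regexp.MustCompile(`[^a-z0-9]+`)

// cardSlug lowercases, replaces non-slug runs with "-", and falls back
// to "card-{idx}" when nothing slug-safe survives, keeping canonical
// cache keys unique per card.
func cardSlug(title string, idx int) string {
	s := strings.Trim(nonSlugRe.ReplaceAllString(strings.ToLower(title), "-"), "-")
	if s == "" {
		return fmt.Sprintf("card-%d", idx)
	}
	return s
}

func main() {
	fmt.Println(cardSlug("Spicy Pasta!", 0)) // "spicy-pasta"
	fmt.Println(cardSlug("ラーメン", 3))        // all non-ASCII → "card-3"
}
```

A later commit in this PR goes further and appends the index to every slug, not just empty ones, so near-identical titles can't collide either.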
💡 Codex Review
Reviewed commit: 875f0c5f4b
1. Add a semaphore (max 3 concurrent) to extractAllRecipes — prevents an unbounded burst of LLM requests on pages with many recipe cards.
2. Append the card index to all canonical slugs (e.g. "pasta-0", "curry-1") — prevents key collisions from titles that produce identical slugs due to punctuation/spacing differences.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
When a user clicks a search result that contains multiple recipes (e.g. "Top 10 Pasta Recipes"), the system now detects this during the preview step and expands the single result into individual recipe cards.
How it works: the preview endpoint returns {is_multi: true, recipes: [...cards...]} with title/image for each.
Components:
- MultiRecipeRegistry — in-memory tracking of extraction state per URL (prevents duplicate work)
- MultiRecipeResolver — orchestrates card extraction from JSON-LD and parallel background AI extraction
- PreviewFromURLWithMultiCheck — preview endpoint variant that detects multi-recipe pages on click
- fetchHTML() — dedicated HTML fetch with Firecrawl fallback
New endpoints:
- GET /v1/recipes/search/resolve/:multi_id — poll resolution status and individual card extraction progress
- POST /v1/recipes/search/check-multi — explicit multi-recipe check for a URL
Modified endpoint:
- POST /v1/recipes/preview/url — now returns {is_multi, multi_id, recipes} when multiple recipes detected
Test plan
- go build && go vet && go test ./internal/... all pass
- Preview of a multi-recipe page returns is_multi: true with individual cards
🤖 Generated with Claude Code