Strip HTML to text before AI multi-recipe detection #75

Merged
windoze95 merged 1 commit into main from fix/multi-recipe-html-stripping
Apr 5, 2026

Conversation

@windoze95
Owner

Summary

Fixes a production failure in which multi-recipe detection was:

  • Burning through Haiku's 50K input tokens/min rate limit (each request sent 80KB of raw HTML)
  • Taking 27+ seconds per detection (the AI had to parse CSS/JS/nav noise)
  • Failing to detect recipes on real listicle pages such as natashaskitchen.com

Fix: stripHTMLToText() removes script/style/nav/header/footer blocks, HTML comments, and all remaining tags before the page is sent to the AI. The resulting clean text is roughly 10-20x smaller than the raw HTML.

  • Detection input: 15KB stripped text (was 80KB raw HTML)
  • Per-card extraction input: 30KB stripped text (was 100KB raw HTML)
  • Both stay well under Haiku token limits
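
The stripping steps described above could be sketched roughly as follows. This is a minimal regexp-based illustration, not the code merged in this PR: only the name stripHTMLToText comes from the description, while the variable names (noiseTags, blockRes, commentRe, tagRe, spaceRe) and the choice of regular expressions over an HTML parser are assumptions made for the example.

```go
package main

import (
	"regexp"
	"strings"
)

// noiseTags lists the elements whose entire contents are dropped.
// The list mirrors the PR description; the variable name is hypothetical.
var noiseTags = []string{"script", "style", "nav", "header", "footer"}

// blockRes holds one pattern per noise tag. Go's regexp package (RE2) has no
// backreferences, so a separate pattern per tag keeps open/close tags paired.
var blockRes = func() []*regexp.Regexp {
	res := make([]*regexp.Regexp, 0, len(noiseTags))
	for _, tag := range noiseTags {
		// (?is): case-insensitive, and "." also matches newlines.
		res = append(res, regexp.MustCompile(`(?is)<`+tag+`\b[^>]*>.*?</`+tag+`\s*>`))
	}
	return res
}()

var (
	commentRe = regexp.MustCompile(`(?s)<!--.*?-->`) // HTML comments
	tagRe     = regexp.MustCompile(`<[^>]*>`)        // any remaining tag
	spaceRe   = regexp.MustCompile(`\s+`)            // runs of whitespace
)

// stripHTMLToText removes comments, noise blocks, and all remaining tags,
// then collapses whitespace. Sketch only: it does not decode HTML entities
// and does not handle malformed or nested noise blocks robustly.
func stripHTMLToText(raw string) string {
	s := commentRe.ReplaceAllString(raw, " ")
	for _, re := range blockRes {
		s = re.ReplaceAllString(s, " ")
	}
	s = tagRe.ReplaceAllString(s, " ")
	s = spaceRe.ReplaceAllString(s, " ")
	return strings.TrimSpace(s)
}
```

Replacing each removed span with a single space, rather than an empty string, keeps words on either side of a tag from being fused together before the whitespace is collapsed.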

Test plan

  • go build && go vet && go test ./internal/... pass
  • Click a multi-recipe result: detection completes in <2s instead of 27s
  • No more 429 rate limit errors on follow-up extraction

🤖 Generated with Claude Code

Root cause of production failure: sending 80KB of raw HTML (with CSS,
JS, nav, ads) to CookingQA for multi-recipe detection was:
1. Burning through the Haiku 50K input token/min rate limit
2. Producing unreliable detection results (AI couldn't find recipes
   buried in HTML noise)
3. Taking 27+ seconds for detection alone

Fix: add stripHTMLToText() that removes script/style/nav/header/footer
blocks, HTML comments, and tags, then collapses whitespace. Produces
clean text ~10-20x smaller than raw HTML.

- Detection: uses stripped text capped at 15KB (was 80KB raw HTML)
- Per-card extraction: uses stripped text capped at 30KB (was 100KB raw)
- Both paths now stay well under Haiku's token limits

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
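
The 15KB and 30KB caps mentioned in the commit message could be applied with a small helper like the one below. This is a sketch under stated assumptions: the names capBytes, detectionCap, and extractionCap are hypothetical, and it assumes that "KB" means 1024 bytes and that the cap is applied to the byte length of the stripped text. The PR does not show how the merged code truncates.

```go
package main

import "unicode/utf8"

// Hypothetical constants; the PR states only the sizes, not how they are defined.
const (
	detectionCap  = 15 * 1024 // cap for multi-recipe detection input
	extractionCap = 30 * 1024 // cap for per-card extraction input
)

// capBytes returns s truncated to at most limit bytes, backing up if needed
// so that the cut never falls in the middle of a multi-byte UTF-8 rune.
func capBytes(s string, limit int) string {
	if limit <= 0 {
		return ""
	}
	if len(s) <= limit {
		return s
	}
	cut := limit
	// s[cut] is the first byte that would be dropped; move left until that
	// byte starts a rune, so s[:cut] ends on a rune boundary.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}
```

A caller would then pass something like capBytes(text, detectionCap) for detection and capBytes(text, extractionCap) for per-card extraction, where text is the already stripped page.
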
@windoze95 windoze95 merged commit e20d778 into main Apr 5, 2026
1 check passed
@windoze95 windoze95 deleted the fix/multi-recipe-html-stripping branch April 5, 2026 06:08