[SCR-371] Fix crawl --extract-rules saving non-seed pages as .html #18
Merged
kostas-jakeliunas-sb merged 2 commits into main on Apr 17, 2026
Conversation
_preferred_extension_from_scrape_params didn't recognize extract_rules, ai_extract_rules, or ai_query, even though those params always return JSON. extension_for_crawl then fell through to its URL-path heuristic, which picks up '.html' from URLs like /catalogue/foo_123/index.html and wins before the body sniff would detect JSON. Result: only the seed URL (which has no path extension) was saved as .json; every discovered page was saved as .html despite a JSON body, breaking `crawl + extract + export → CSV`.

Fix: add the three params to the preferred-extension function so they force 'json' explicitly. This aligns with _requires_discovery_phase, which already treats the same set as "always JSON". ai_selector is excluded: it's a modifier for ai_query/ai_extract_rules, not a JSON producer on its own.

Tests: each param is covered in TestPreferredExtensionFromScrapeParams, plus two integration tests in TestSpiderSaveResponse verify that a JSON body is saved as .json even when the URL ends in .html.
dbulbukas-sbee approved these changes on Apr 17, 2026
Bump version 1.4.0 → 1.4.1 and document the SCR-371 fix in the CHANGELOG. The v1.4.0 "Crawl extension priority" entry only covered the seed page; this release extends the fix to every discovered page, so the full `crawl + --extract-rules + export` pipeline produces N-row CSVs instead of silently dropping to 1 row.

The version is bumped in pyproject.toml, src/scrapingbee_cli/__init__.py, the canonical .agents/skills/ SKILL.md tree, the claude-plugin manifest, the AGENTS.md upgrade hint, uv.lock, and the synced platform skill trees (.github, .kiro, .opencode, plugins) via scripts/sync-skills.sh. Also added `Changelog` and `Issues` entries to [project.urls] so PyPI surfaces direct links alongside Homepage / Documentation / Repository.
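The [project.urls] addition might look like the sketch below. The URLs are placeholders, not the project's actual links; only the label names (`Changelog`, `Issues`) come from this PR:

```toml
# pyproject.toml (sketch; placeholder URLs)
[project.urls]
Homepage = "https://example.com/scrapingbee-cli"
Documentation = "https://example.com/scrapingbee-cli/docs"
Repository = "https://example.com/scrapingbee-cli.git"
Changelog = "https://example.com/scrapingbee-cli/blob/main/CHANGELOG.md"  # new in 1.4.1
Issues = "https://example.com/scrapingbee-cli/issues"                     # new in 1.4.1
```

PyPI recognizes well-known labels like these and renders them as sidebar links on the project page.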