[SCR-371] Fix crawl --extract-rules saving non-seed pages as .html #18
Merged
kostas-jakeliunas-sb merged 2 commits into main on Apr 17, 2026
Conversation
_preferred_extension_from_scrape_params didn't recognize extract_rules, ai_extract_rules, or ai_query, even though those params always return JSON. extension_for_crawl then fell through to its URL-path heuristic, which picks up '.html' from URLs like /catalogue/foo_123/index.html and wins before the body sniff would detect JSON. Result: only the seed URL (which has no path extension) was saved as .json; every discovered page was saved as .html despite a JSON body, breaking `crawl + extract + export → CSV`.

Fix: add the three params to the preferred-extension function so they force 'json' explicitly. This aligns with _requires_discovery_phase, which already treats the same set as "always JSON". ai_selector is excluded: it's a modifier for ai_query/ai_extract_rules, not a JSON producer on its own.

Tests: each param is covered in TestPreferredExtensionFromScrapeParams, plus two integration tests in TestSpiderSaveResponse verify that a JSON body is saved as .json even when the URL ends in .html.
dbulbukas-sbee approved these changes on Apr 17, 2026
Bump version 1.4.0 → 1.4.1 and document the SCR-371 fix in the CHANGELOG. The v1.4.0 "Crawl extension priority" entry only covered the seed page; this release extends the fix to every discovered page, so the full `crawl + --extract-rules + export` pipeline produces N-row CSVs instead of silently dropping to 1 row.

The version is bumped in pyproject.toml, src/scrapingbee_cli/__init__.py, the canonical .agents/skills/ SKILL.md tree, the claude-plugin manifest, the AGENTS.md upgrade hint, uv.lock, and the synced platform skill trees (.github, .kiro, .opencode, plugins) via scripts/sync-skills.sh. Also added `Changelog` and `Issues` entries to [project.urls] so PyPI surfaces direct links alongside Homepage / Documentation / Repository.
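The [project.urls] addition might look like the sketch below. The URLs are placeholders, not the project's actual links; only the label names (`Changelog`, `Issues`) come from this PR:

```toml
# pyproject.toml (sketch; placeholder URLs)
[project.urls]
Homepage = "https://example.com/scrapingbee-cli"
Documentation = "https://example.com/scrapingbee-cli/docs"
Repository = "https://example.com/scrapingbee-cli.git"
Changelog = "https://example.com/scrapingbee-cli/blob/main/CHANGELOG.md"  # new in 1.4.1
Issues = "https://example.com/scrapingbee-cli/issues"                     # new in 1.4.1
```

PyPI recognizes well-known labels like these and renders them as sidebar links on the project page.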