
[SCR-371] Fix crawl --extract-rules saving non-seed pages as .html#18

Merged
kostas-jakeliunas-sb merged 2 commits into main from
SCR-371/crawl-extract-rules-json-extension
Apr 17, 2026

@kostas-jakeliunas-sb
Contributor

_preferred_extension_from_scrape_params didn't recognize extract_rules, ai_extract_rules, or ai_query, even though those params always return JSON. extension_for_crawl then fell through to its URL-path heuristic, which picks up '.html' from URLs like /catalogue/foo_123/index.html and wins before the body sniff gets a chance to detect JSON. Result: only the seed URL (which has no path extension) was saved as .json; every discovered page was saved as .html despite having a JSON body, which broke the `crawl + extract + export → CSV` pipeline.

Add the three params to the preferred-extension function so they force 'json' explicitly. This aligns with _requires_discovery_phase, which already treats the same set as 'always JSON'. ai_selector is excluded — it's a modifier for ai_query/ai_extract_rules, not a JSON producer on its own.
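For readers without the diff open, the priority order described above can be sketched roughly as follows. This is an illustrative stand-in, not the code in this PR: the function names `preferred_extension` and `extension_for`, their signatures, and the "html" fallback are all assumptions; only the three parameter names and the ordering (explicit preference, then URL path, then body sniff) come from the description.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Params the PR description says always yield JSON. ai_selector is left out
# on purpose: it only modifies ai_query / ai_extract_rules.
JSON_PRODUCING_PARAMS = ("extract_rules", "ai_extract_rules", "ai_query")


def preferred_extension(scrape_params):
    """Hypothetical helper: return "json" if a JSON-producing param is set."""
    if any(scrape_params.get(name) for name in JSON_PRODUCING_PARAMS):
        return "json"
    return None


def extension_for(url, body, scrape_params):
    """Hypothetical resolver: explicit preference, then URL path, then body sniff."""
    preferred = preferred_extension(scrape_params)
    if preferred:
        return preferred
    # URL-path heuristic: /catalogue/foo_123/index.html yields "html".
    suffix = PurePosixPath(urlparse(url).path).suffix.lstrip(".")
    if suffix:
        return suffix
    # Body sniff: only reached when the URL path carries no extension.
    try:
        json.loads(body)
        return "json"
    except ValueError:
        return "html"  # assumed fallback for this sketch
```

With the three params missing from the first step, a discovered URL ending in index.html would be resolved by the path heuristic and never reach the body sniff, which matches the behavior reported in SCR-371.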

Tests: cover each param in TestPreferredExtensionFromScrapeParams plus two integration tests in TestSpiderSaveResponse that verify a JSON body is saved as .json even when the URL ends in .html.

@kostas-jakeliunas-sb kostas-jakeliunas-sb self-assigned this Apr 17, 2026
@kostas-jakeliunas-sb kostas-jakeliunas-sb added the bug Something isn't working label Apr 17, 2026
@kostas-jakeliunas-sb kostas-jakeliunas-sb changed the title SCR-371: Fix crawl --extract-rules saving non-seed pages as .html [SCR-371] Fix crawl --extract-rules saving non-seed pages as .html Apr 17, 2026
@kostas-jakeliunas-sb
Contributor Author

Bump version 1.4.0 → 1.4.1 and document the SCR-371 fix in CHANGELOG.

The v1.4.0 "Crawl extension priority" entry only covered the seed page;
this release extends the fix to every discovered page so the full
`crawl + --extract-rules + export` pipeline produces N-row CSVs instead
of silently dropping to 1 row.
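A small simulation of the "N rows vs. 1 row" symptom, under a loud assumption: the exporter here is hypothetical and simply collects `*.json` files from the crawl directory, which may not be how the real export command selects its inputs. The file names and helper functions are invented for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_json_files_to_csv(crawl_dir, csv_path):
    """Hypothetical exporter: writes one CSV row per *.json file in crawl_dir."""
    rows = [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(crawl_dir).glob("*.json"))
    ]
    fieldnames = sorted({key for row in rows for key in row})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def simulate_crawl(crawl_dir, discovered_extension):
    """Write a seed page plus two discovered pages, all with JSON bodies."""
    crawl_dir = Path(crawl_dir)
    (crawl_dir / "seed.json").write_text(json.dumps({"title": "Seed"}), encoding="utf-8")
    for index in (1, 2):
        name = f"page_{index}.{discovered_extension}"
        (crawl_dir / name).write_text(json.dumps({"title": f"Page {index}"}), encoding="utf-8")


def count_exported_rows(discovered_extension):
    """Run the simulated crawl and export, returning the number of CSV rows."""
    with tempfile.TemporaryDirectory() as tmp:
        simulate_crawl(tmp, discovered_extension)
        return export_json_files_to_csv(tmp, Path(tmp) / "out.csv")
```

In this sketch, `count_exported_rows("html")` mimics the pre-fix layout, where only the seed file is picked up, and `count_exported_rows("json")` mimics the layout after the fix.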

Version bumped in pyproject.toml, src/scrapingbee_cli/__init__.py, the
canonical .agents/skills/ SKILL.md tree, the claude-plugin manifest, the
AGENTS.md upgrade-hint, uv.lock, and the synced platform skill trees
(.github, .kiro, .opencode, plugins) via scripts/sync-skills.sh.

Added `Changelog` and `Issues` entries to [project.urls] so PyPI
surfaces direct links alongside Homepage / Documentation / Repository.
@kostas-jakeliunas-sb kostas-jakeliunas-sb force-pushed the SCR-371/crawl-extract-rules-json-extension branch from 0d459e4 to 42ac828 Compare April 17, 2026 13:27
@kostas-jakeliunas-sb kostas-jakeliunas-sb merged commit bde6788 into main Apr 17, 2026
6 checks passed