
revert single-curl fetch, keep scanlinks optimizations#3

Open
caiot5 wants to merge 2 commits into molivil:master from caiot5:feature/performance-optimization

Conversation


@caiot5 caiot5 commented Mar 10, 2026

Turns out the Internet Archive handles big sites like microsoft.com, aol.com and so on differently from smaller ones, returning a direct 200 on bare domain roots instead of the usual redirect chain. The single-curl rewrite from the previous PR relies on the redirect chain to recover the real filename, so it silently fails on those sites and the crawl dies after the first page.
The original HEAD+redirect loop+wget pattern doesn't have this problem because wget always gets a fully-resolved URL with the real filename in it regardless of what IA does upstream.
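A minimal sketch of that resolve-then-fetch shape (function name and flags are my own, not quoted from the script): let curl chase whatever redirect chain IA returns and report only the final effective URL, then hand that fully-resolved URL to wget. If IA answers 200 directly on a bare root, the "chain" is just the original URL and nothing breaks.

```shell
#!/bin/sh
# Sketch, assuming a curl+wget toolchain like the script's; not the actual geturl code.
resolve_and_fetch() {
  url="$1"
  # -s silent, -I HEAD request, -L follow redirects;
  # -w '%{url_effective}' prints the final URL after all redirects.
  final=$(curl -sIL -o /dev/null -w '%{url_effective}' "$url") || return 1
  # wget now always receives a fully-resolved URL with the real filename,
  # regardless of whether IA redirected or answered 200 straight away.
  wget -q "$final"
}
```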
Went back to the original geturl from 2.1.4. Kept the scanlinks speed improvements from the previous PR, and also folded in the LC_ALL=C fix for sites with non-ASCII filenames, since it was already sitting in another open PR.
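For context on the LC_ALL=C part: in a UTF-8 locale, grep/sed can choke on filenames that aren't valid UTF-8; forcing the C locale makes them match byte-wise. A hedged illustration (the function name and the exact pipeline are assumptions, not the script's real scanlinks code):

```shell
#!/bin/sh
# Sketch only: byte-wise link extraction under LC_ALL=C, so non-ASCII or
# non-UTF-8 bytes in hrefs don't trip "invalid multibyte sequence" errors.
extract_links() {
  LC_ALL=C grep -o 'href="[^"]*"' "$1" | LC_ALL=C sed 's/^href="//; s/"$//'
}
```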

