fix: correct path parsing for URLs containing non-ASCII and latin-1 encoded filenames #2

Open
caiot5 wants to merge 2 commits into molivil:master from caiot5:fix/non-ascii-path-parsing

caiot5 commented Mar 9, 2026

On sites with non-ASCII characters in URLs or filenames, rev emits a "stdin: Invalid or incomplete multibyte or wide character" warning to stderr. This is common on older sites that use ISO-8859-1 (latin-1) encoding, particularly Brazilian and other non-English sites from the late 90s and early 2000s. rev may also produce incorrect results when a multibyte character sequence falls at or near a path separator boundary, so filenames and paths are parsed incorrectly. The result is files saved with mangled names or redundant download attempts.
Prepending LC_ALL=C to the two assignments in the main loop that use rev fixes this: rev then treats the string as raw bytes instead of attempting multibyte character interpretation. The / delimiter remains unambiguous in both encodings. UTF-8 guarantees that no multibyte sequence contains the byte 0x2F, and ISO-8859-1 is single-byte by definition, so byte-wise reversal is always safe.
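
For reviewers, here is a minimal sketch of the pattern described above. It is not the code from the patched script: the function names last_component and parent_path are hypothetical, and the rev | cut | rev pipeline is an assumption about how the main loop splits paths. It also assumes the installed rev treats its input as raw bytes under LC_ALL=C, as the description states.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; names and pipeline shape are
# assumptions, not copied from the script this PR changes.

# Print the last "/"-separated component of a URL or path.
# LC_ALL=C is set per rev invocation, so only rev sees the override.
last_component() {
    printf '%s\n' "$1" | LC_ALL=C rev | cut -d/ -f1 | LC_ALL=C rev
}

# Print everything before the last "/" of a URL or path.
parent_path() {
    printf '%s\n' "$1" | LC_ALL=C rev | cut -d/ -f2- | LC_ALL=C rev
}

# Example calls (the URL is a placeholder):
last_component 'http://example.com/dir/file.html'
parent_path 'http://example.com/dir/file.html'
```

Placing LC_ALL=C directly in front of rev scopes the override to that one process, so the locale of the rest of the script is left untouched. A call such as last_component 'http://example.com/docs/página.html' would exercise the non-ASCII case.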
