fix: correct path parsing for URLs containing non-ASCII and latin-1 encoded filenames by caiot5 · Pull Request #2 · molivil/warnick

caiot5 · 2026-03-09T21:14:45Z

On sites with non-ASCII characters in URLs or filenames (common on older sites using ISO-8859-1/latin-1 encoding, particularly Brazilian and other non-English sites from the late 90s and early 2000s), rev emits a "stdin: Invalid or incomplete multibyte or wide character" warning to stderr and may produce incorrect results when a multibyte character sequence falls at or near a path separator boundary. Filenames and paths are then parsed incorrectly, which results in files being saved with mangled names or in redundant download attempts.
Prepending LC_ALL=C to the two rev-involving assignments in the main loop fixes this by telling rev to treat the string as raw bytes rather than attempting multibyte character interpretation. The / delimiter remains unambiguous in all cases: UTF-8 guarantees that no multibyte sequence contains the byte 0x2F, and ISO-8859-1 is single-byte by definition, so byte-wise reversal is always safe.
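
For reviewers, here is a minimal sketch of the kind of pipeline this change affects. The helper names `last_component` and `parent_path` and the exact `rev | cut | rev` idiom are assumptions made for illustration, not copied from the script. It also assumes that rev processes its input byte by byte under LC_ALL=C, as described above.

```shell
#!/bin/sh
# Hypothetical helpers: names and pipeline shape are illustrative assumptions,
# not the actual assignments in the main loop.

# Print the last "/"-separated component of a path or URL.
# LC_ALL=C asks rev (and cut) to work on raw bytes instead of decoding
# multibyte characters in the current locale.
last_component() {
    printf '%s\n' "$1" | LC_ALL=C rev | LC_ALL=C cut -d/ -f1 | LC_ALL=C rev
}

# Print everything before the last "/".
parent_path() {
    printf '%s\n' "$1" | LC_ALL=C rev | LC_ALL=C cut -d/ -f2- | LC_ALL=C rev
}

last_component 'http://example.com/docs/index.html'   # prints index.html
parent_path 'http://example.com/docs/index.html'      # prints http://example.com/docs
```

Because the string is reversed twice, the bytes between separators come back in their original order, so a latin-1 or UTF-8 filename should pass through unchanged as long as the splitting happens only on the 0x2F byte.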

caiot5 added 2 commits March 9, 2026 19:28

fix: correct path parsing for URLs containing non-ASCII and latin-1 e…

62c1ff1

…ncoded filenames

Merge branch 'molivil:master' into fix/non-ascii-path-parsing

6227b49

caiot5 wants to merge 2 commits into molivil:master from caiot5:fix/non-ascii-path-parsing
