fix: correct path parsing for URLs containing non-ASCII and latin-1 encoded filenames #2
Open
caiot5 wants to merge 2 commits into molivil:master
On sites with non-ASCII characters in URLs or filenames, rev emits a "stdin: Invalid or incomplete multibyte or wide character" warning to stderr. Such URLs are common on older sites encoded in ISO-8859-1 (latin-1), particularly Brazilian and other non-English sites from the late 90s and early 2000s. rev may also produce incorrect results when a multibyte character sequence falls at or near a path separator boundary, so filenames and paths are parsed incorrectly. The result is files saved with mangled names or redundant download attempts.
Prepending LC_ALL=C to the two assignments in the main loop that call rev fixes this: it tells rev to treat the string as raw bytes rather than attempt multibyte character interpretation. The / delimiter remains unambiguous in all cases: UTF-8 guarantees that no multibyte sequence contains the byte 0x2F, and ISO-8859-1 is single-byte by definition, so byte reversal is always safe.
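For reviewers who want to try the idea in isolation, here is a minimal sketch of the reverse, cut, reverse idiom with LC_ALL=C applied to each rev call. The helper name `last_component`, the example URL, and the variable names are assumptions made for illustration; they are not the actual assignments touched by this PR.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the patched script): print the last
# '/'-separated component of a URL using the reverse, cut, reverse idiom.
last_component() {
    # LC_ALL=C is set only for each rev invocation, so rev handles its
    # input as single bytes instead of decoding multibyte characters.
    printf '%s\n' "$1" | LC_ALL=C rev | cut -d/ -f1 | LC_ALL=C rev
}

# Example URL (an assumption). \347\343 are the ISO-8859-1 bytes for the
# two accented letters in "coração"; they do not form valid UTF-8.
url=$(printf 'http://example.com/fotos/cora\347\343o.jpg')

# Extract the filename; the rest of the environment keeps its own locale.
filename=$(last_component "$url")
```

If rev behaves as described above, `filename` should hold the same bytes as the shell's own `${url##*/}`, which makes for a quick sanity check. Scoping LC_ALL=C to the rev commands alone, rather than exporting it, leaves the locale of every other command in the script untouched.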