Windows: Don't error on broken non UTF-8 output#134534
Windows: Don't error on broken non UTF-8 output#134534ChrisDenton wants to merge 3 commits intorust-lang:mainfrom
Conversation
This comment has been minimized.
This comment has been minimized.
Also update the tests to avoid testing implementation details.
| pub struct Stdout { | ||
| incomplete_utf8: IncompleteUtf8, | ||
| } | ||
| pub struct Stdout {} |
There was a problem hiding this comment.
The reason for the removal of incomplete UTF-8 handling at the end of the string is not clear to me from the commit description. Why was that removed?
There was a problem hiding this comment.
Because it now simply truncates the write to remove incomplete UTF-8 from the end and instead leaves the buffering to buffer types, i.e. LineWriter in this case.
There was a problem hiding this comment.
A LineWriter will flush an incomplete line if its buffer capacity is exceeded. If that happens, the output must support partial UTF-8 writes, or non-ASCII characters might get lost or replaced with the replacement character.
There was a problem hiding this comment.
That can only result in broken UTF-8 if the user writes incomplete UTF-8 to LineWriter themselves.
There was a problem hiding this comment.
I see, digging through the source code, BufWriter makes sure to not split writes that the user issued.
What is the motivation for truncating invalid UTF-8 at the end of the string?
All else being equal, I'd rather expect the previous behavior, that I can construct UTF-8 output byte-by-byte.
There was a problem hiding this comment.
Rather than having a secret stack buffer that can't be inspected or flushed, I'd strongly prefer buffering be done at a higher level. It's also a lot of added complexity for an edge case where the better solution is to set the console code page.
There was a problem hiding this comment.
In any case, if that behavior is wanted, it should probably be documented in the commit message so that it is clear to future readers that this change was on purpose.
Rather than ignoring trailing invalid UTF-8, I think it'd be better to replace it with a replacement character so that it becomes clear that something was removed.
There was a problem hiding this comment.
Rather than ignoring trailing invalid UTF-8, I think it'd be better to replace it with a replacement character so that it becomes clear that something was removed.
That's what happens in this code. No bytes are ever lost. Either the caller is informed that less bytes were written than were provided or, if there is only an incomplete code point, then that is written to the console (which will be converted to replacement characters when lossy translating to UTF-16).
There was a problem hiding this comment.
Ah yes, that makes sense. I didn't realize the caller would be informed by the return value of write.
What is the motivation for special casing trailing invalid UTF-8? It seems to increase the code complexity a little as well, and is not necessary for std's own use cases.
Is it for supporting a potential non-std buffered writer?
There was a problem hiding this comment.
Sure, it could be removed. Stderr is not buffered by us though and there have been proposals for unbuffered stdout.
| assert!(result != 0, "Unexpected error in MultiByteToWideChar"); | ||
|
|
||
| // The only way an error can happen here is if we've messed up. | ||
| debug_assert!(result != 0, "Unexpected error in MultiByteToWideChar"); |
There was a problem hiding this comment.
| debug_assert!(result != 0, "Unexpected error in MultiByteToWideChar"); | |
| assert!(result != 0, "Unexpected error in MultiByteToWideChar"); |
I think this should be an assert since this isn't performance critical — we've just done a syscall.
|
I'm going to split off the LineWriter and |
|
☔ The latest upstream changes (presumably #134822) made this pull request unmergeable. Please resolve the merge conflicts. |
There was a problem hiding this comment.
I like the idea of moving the buffering the the outer buffer, especially since we already wrap Stdout/Stderr in a LineWriter. trim_last_char_boundary looks sound. I haven't reviewed the rest as closely, but nothing stands out as wrong. I'm not yet convinced that LineWriter won't endlessly try to write the same incomplete single-char sequence, as I haven't reviewed that type with this invariant in mind.
Currently, on Windows, the standard library will error if you try to write invalid UTF-8. Whereas other platforms allow this. The issue arises because the console uses UTF-16 so Rust has to re-encode the output.
This PR fixes it in two ways:
Fixes #116871