[Fix]: Preserve UTF-8 codepoints across PTY 4 KiB read boundaries#173
Open
matiaspalmac wants to merge 2 commits intoTrixtyAI:mainfrom
Open
[Fix]: Preserve UTF-8 codepoints across PTY 4 KiB read boundaries#173matiaspalmac wants to merge 2 commits intoTrixtyAI:mainfrom
matiaspalmac wants to merge 2 commits intoTrixtyAI:mainfrom
Conversation
The PTY reader thread was calling `String::from_utf8_lossy` on each raw 4096-byte chunk straight from the OS. That works for ASCII but breaks on any multi-byte UTF-8 sequence that happens to straddle a chunk boundary: the last byte(s) of a codepoint end up decoded as the lead of an "incomplete" sequence and get replaced with U+FFFD, so the terminal shows `�` for accented letters, CJK text, emoji, box-drawing characters, and most non-English shell prompts whenever the read splits on a bad offset. The fix is a small leftover buffer: on every read we combine any held bytes with the new chunk, find the longest valid UTF-8 prefix, emit that, and keep the 1–3 trailing truncated bytes for the next read. Actually invalid sequences are still replaced with U+FFFD so one bad byte cannot stall the stream forever, and the leftover is flushed on EOF/error so the final prompt byte is never silently swallowed. Extracted the decode rule as `split_utf8_safely` and covered it with unit tests for ASCII, 4-byte codepoints split at every internal boundary, mixed multibyte runs with 1-byte reads, invalid bytes, empty input, and a truncated tail on close. Fixes TrixtyAI#70
|
Thanks for the contribution! I'll review it as soon as possible. If you have still changes, please mark this PR as draft and all reviews will be cancelled. Tests reviews will be re-run only when the PR is marked as ready for review. |
There was a problem hiding this comment.
Pull request overview
Fixes PTY output decoding so UTF-8 multi-byte codepoints aren’t corrupted when reads split across the fixed 4 KiB boundary (fixes #70).
Changes:
- Added
split_utf8_safelyhelper to identify an emit-safe UTF-8 prefix and retain truncated tails. - Updated the PTY reader thread to prepend any leftover bytes to the next read and flush leftovers on EOF/error.
- Added unit tests covering boundary splits, invalid bytes, empty input, and truncated tails.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Copilot review on TrixtyAI#173 flagged three issues on the reader thread, all valid: 1. Per-iteration allocations — `combined` was rebuilt with `std::mem::take(&mut leftover)` + `extend_from_slice` + a `combined[split..].to_vec()` on every read, so a busy shell paid two Vec allocations per 4 KiB chunk. 2. Post-invalid bytes delayed until the next read — when a chunk was `[valid][invalid][valid]`, the old split rule used `valid_up_to + error_len` as the emit boundary and kept the trailing valid suffix as `leftover`. That did not lose data but delayed it by a full read. 3. Docstring for `split_utf8_safely` said "truncated tail" but the code also branched on mid-buffer invalid sequences, which did not match the contract. Fix: - `split_utf8_safely` now returns `buf.len()` for any invalid UTF-8 that is not a truncated tail, so callers emit the whole chunk (`from_utf8_lossy` substitutes `U+FFFD` for the bad bytes) instead of holding the tail back. - The reader thread now keeps a single `Vec<u8>` assembly buffer with capacity `4096 + 4` and uses `drain(..split)` to consume emitted bytes in place — no allocation per read. - Docstring rewritten to match the new contract. - Added a regression test (`invalid_byte_mid_chunk_does_not_delay_valid_trailing_bytes`) that would fail under the old implementation.
Contributor
Author
|
Addressed all three Copilot observations:
Added |
Open
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
The PTY reader thread was calling
String::from_utf8_lossyon each raw 4096-byte chunk returned byreader.read. That works for pure ASCII, but breaks on any multi-byte UTF-8 sequence that happens to straddle a chunk boundary: the last byte(s) of a codepoint get decoded as the lead of an "incomplete" sequence and are substituted with U+FFFD, so the terminal shows�for accented letters, CJK text, emoji, box-drawing characters, and most non-English shell prompts whenever the read lands on a bad offset.The fix is a small leftover buffer: on every read we combine any held-back bytes with the new chunk, find the longest valid UTF-8 prefix, emit that, and keep the 1–3 trailing truncated bytes for the next read. Actually invalid sequences are still replaced with U+FFFD so one bad byte cannot stall the stream forever, and the leftover is flushed on EOF/error so the final prompt byte is never silently swallowed.
Related Issue
Fixes #70
Context / Problem
On Windows PowerShell a prompt containing a non-ASCII character such as
»,日, or the spinner emoji from e.g.npm/pnpmreliably produced�in the Trixty terminal whenever the output happened to flush at a 4096-byte boundary that landed inside the codepoint. It was intermittent and looked like a font issue, but it was the decoder discarding bytes that hadn't arrived yet.Change
split_utf8_safely(buf: &[u8]) -> usizethat returns how many bytes ofbufare safe to emit now.Verification
cargo test --lib→ 33 passed, 0 failed (includes the 6 new tests).cargo clippy --lib --no-deps -- -D warnings→ clean.Checklist