Skip to content

[Fix]: Preserve UTF-8 codepoints across PTY 4 KiB read boundaries#173

Open
matiaspalmac wants to merge 2 commits intoTrixtyAI:mainfrom
matiaspalmac:fix/pty-utf8-split
Open

[Fix]: Preserve UTF-8 codepoints across PTY 4 KiB read boundaries#173
matiaspalmac wants to merge 2 commits intoTrixtyAI:mainfrom
matiaspalmac:fix/pty-utf8-split

Conversation

@matiaspalmac
Copy link
Copy Markdown
Contributor

Description

The PTY reader thread was calling String::from_utf8_lossy on each raw 4096-byte chunk returned by reader.read. That works for pure ASCII, but breaks on any multi-byte UTF-8 sequence that happens to straddle a chunk boundary: the last byte(s) of a codepoint get decoded as the lead of an "incomplete" sequence and are substituted with U+FFFD, so the terminal shows for accented letters, CJK text, emoji, box-drawing characters, and most non-English shell prompts whenever the read lands on a bad offset.

The fix is a small leftover buffer: on every read we combine any held-back bytes with the new chunk, find the longest valid UTF-8 prefix, emit that, and keep the 1–3 trailing truncated bytes for the next read. Actually invalid sequences are still replaced with U+FFFD so one bad byte cannot stall the stream forever, and the leftover is flushed on EOF/error so the final prompt byte is never silently swallowed.

Related Issue

Fixes #70

Context / Problem

On Windows PowerShell a prompt containing a non-ASCII character such as », , or the spinner emoji from e.g. npm/pnpm reliably produced in the Trixty terminal whenever the output happened to flush at a 4096-byte boundary that landed inside the codepoint. It was intermittent and looked like a font issue, but it was the decoder discarding bytes that hadn't arrived yet.

Change

  • Extracted the decoding rule into a small pure helper split_utf8_safely(buf: &[u8]) -> usize that returns how many bytes of buf are safe to emit now.
  • Rewrote the reader loop to hold back the truncated tail and prepend it to the next read.
  • Flush any trailing leftover on EOF/error so the final output is never dropped.
  • Added six unit tests covering ASCII, 4-byte codepoints split at every internal boundary (1/2/3), mixed multi-byte runs with 1-byte reads, invalid bytes (U+FFFD recovery, no stall), empty input, and truncated tail on close.

Verification

  • cargo test --lib → 33 passed, 0 failed (includes the 6 new tests).
  • cargo clippy --lib --no-deps -- -D warnings → clean.
  • OS: Windows 11
  • Version: v1.0.10

Checklist

  • I have tested this fix thoroughly.
  • I have followed the project's coding guidelines.
  • My changes generate no new warnings or errors.
  • I have verified the fix on:
    • OS: Windows
    • Version: v1.0.10

The PTY reader thread was calling `String::from_utf8_lossy` on each raw
4096-byte chunk straight from the OS. That works for ASCII but breaks
on any multi-byte UTF-8 sequence that happens to straddle a chunk
boundary: the last byte(s) of a codepoint end up decoded as the lead
of an "incomplete" sequence and get replaced with U+FFFD, so the
terminal shows `�` for accented letters, CJK text, emoji, box-drawing
characters, and most non-English shell prompts whenever the read
splits on a bad offset.

The fix is a small leftover buffer: on every read we combine any held
bytes with the new chunk, find the longest valid UTF-8 prefix, emit
that, and keep the 1–3 trailing truncated bytes for the next read.
Actually invalid sequences are still replaced with U+FFFD so one bad
byte cannot stall the stream forever, and the leftover is flushed on
EOF/error so the final prompt byte is never silently swallowed.

Extracted the decode rule as `split_utf8_safely` and covered it with
unit tests for ASCII, 4-byte codepoints split at every internal
boundary, mixed multibyte runs with 1-byte reads, invalid bytes,
empty input, and a truncated tail on close.

Fixes TrixtyAI#70
Copilot AI review requested due to automatic review settings April 21, 2026 04:38
@github-actions github-actions bot added bug Something isn't working dependencies Pull requests that update a dependency file labels Apr 21, 2026
@github-actions
Copy link
Copy Markdown

Thanks for the contribution! I'll review it as soon as possible. If you have still changes, please mark this PR as draft and all reviews will be cancelled. Tests reviews will be re-run only when the PR is marked as ready for review.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes PTY output decoding so UTF-8 multi-byte codepoints aren’t corrupted when reads split across the fixed 4 KiB boundary (fixes #70).

Changes:

  • Added split_utf8_safely helper to identify an emit-safe UTF-8 prefix and retain truncated tails.
  • Updated the PTY reader thread to prepend any leftover bytes to the next read and flush leftovers on EOF/error.
  • Added unit tests covering boundary splits, invalid bytes, empty input, and truncated tails.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread apps/desktop/src-tauri/src/pty.rs Outdated
Comment thread apps/desktop/src-tauri/src/pty.rs Outdated
Comment thread apps/desktop/src-tauri/src/pty.rs Outdated
Copilot review on TrixtyAI#173 flagged three issues on the reader thread, all
valid:

1. Per-iteration allocations — `combined` was rebuilt with
   `std::mem::take(&mut leftover)` + `extend_from_slice` + a
   `combined[split..].to_vec()` on every read, so a busy shell paid
   two Vec allocations per 4 KiB chunk.

2. Post-invalid bytes delayed until the next read — when a chunk was
   `[valid][invalid][valid]`, the old split rule used
   `valid_up_to + error_len` as the emit boundary and kept the
   trailing valid suffix as `leftover`. That did not lose data but
   delayed it by a full read.

3. Docstring for `split_utf8_safely` said "truncated tail" but the
   code also branched on mid-buffer invalid sequences, which did not
   match the contract.

Fix:

- `split_utf8_safely` now returns `buf.len()` for any invalid UTF-8
  that is not a truncated tail, so callers emit the whole chunk
  (`from_utf8_lossy` substitutes `U+FFFD` for the bad bytes) instead
  of holding the tail back.
- The reader thread now keeps a single `Vec<u8>` assembly buffer with
  capacity `4096 + 4` and uses `drain(..split)` to consume emitted
  bytes in place — no allocation per read.
- Docstring rewritten to match the new contract.
- Added a regression test (`invalid_byte_mid_chunk_does_not_delay_valid_trailing_bytes`)
  that would fail under the old implementation.
@matiaspalmac
Copy link
Copy Markdown
Contributor Author

Addressed all three Copilot observations:

  1. Allocation per read — reader now keeps a single Vec<u8> assembly buffer with capacity 4096 + 4 and calls drain(..split) in place, so no std::mem::take + to_vec churn on each chunk.
  2. Post-invalid bytes delayed until the next readsplit_utf8_safely no longer uses valid_up_to + error_len as the boundary. On any mid-buffer invalid sequence it now returns buf.len() and the caller emits the whole chunk through from_utf8_lossy (one U+FFFD per bad sequence, no stalled suffix).
  3. Docstring vs implementation mismatch — rewrote the doc comment to reflect the new "only truncated tails are held back" contract.

Added invalid_byte_mid_chunk_does_not_delay_valid_trailing_bytes as a regression test that would fail under the old implementation. 7/7 pty tests pass, clippy clean. (5ca2037)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working dependencies Pull requests that update a dependency file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: PTY UTF-8 output is split mid-codepoint between 4KB reads, producing replacement characters

2 participants