[Fix]: Preserve UTF-8 codepoints across PTY 4 KiB read boundaries by matiaspalmac · Pull Request #173 · TrixtyAI/ide

matiaspalmac · 2026-04-21T04:38:33Z

Description

The PTY reader thread was calling String::from_utf8_lossy on each raw 4096-byte chunk returned by reader.read. That works for pure ASCII, but breaks on any multi-byte UTF-8 sequence that happens to straddle a chunk boundary: the last byte(s) of a codepoint get decoded as the lead of an "incomplete" sequence and are substituted with U+FFFD, so the terminal shows � for accented letters, CJK text, emoji, box-drawing characters, and most non-English shell prompts whenever the read lands on a bad offset.

The fix is a small leftover buffer: on every read we combine any held-back bytes with the new chunk, find the longest valid UTF-8 prefix, emit that, and keep the 1–3 trailing truncated bytes for the next read. Actually invalid sequences are still replaced with U+FFFD so one bad byte cannot stall the stream forever, and the leftover is flushed on EOF/error so the final prompt byte is never silently swallowed.

Related Issue

Fixes #70

Context / Problem

On Windows PowerShell a prompt containing a non-ASCII character such as », 日, or the spinner emoji from e.g. npm/pnpm reliably produced � in the Trixty terminal whenever the output happened to flush at a 4096-byte boundary that landed inside the codepoint. It was intermittent and looked like a font issue, but it was the decoder discarding bytes that hadn't arrived yet.

Change

Extracted the decoding rule into a small pure helper split_utf8_safely(buf: &[u8]) -> usize that returns how many bytes of buf are safe to emit now.
Rewrote the reader loop to hold back the truncated tail and prepend it to the next read.
Flush any trailing leftover on EOF/error so the final output is never dropped.
Added six unit tests covering ASCII, 4-byte codepoints split at every internal boundary (1/2/3), mixed multi-byte runs with 1-byte reads, invalid bytes (U+FFFD recovery, no stall), empty input, and truncated tail on close.

Verification

cargo test --lib → 33 passed, 0 failed (includes the 6 new tests).
cargo clippy --lib --no-deps -- -D warnings → clean.
OS: Windows 11
Version: v1.0.10

Checklist

I have tested this fix thoroughly.
I have followed the project's coding guidelines.
My changes generate no new warnings or errors.
I have verified the fix on:
- OS: Windows
- Version: v1.0.10

The PTY reader thread was calling `String::from_utf8_lossy` on each raw 4096-byte chunk straight from the OS. That works for ASCII but breaks on any multi-byte UTF-8 sequence that happens to straddle a chunk boundary: the last byte(s) of a codepoint end up decoded as the lead of an "incomplete" sequence and get replaced with U+FFFD, so the terminal shows `�` for accented letters, CJK text, emoji, box-drawing characters, and most non-English shell prompts whenever the read splits on a bad offset. The fix is a small leftover buffer: on every read we combine any held bytes with the new chunk, find the longest valid UTF-8 prefix, emit that, and keep the 1–3 trailing truncated bytes for the next read. Actually invalid sequences are still replaced with U+FFFD so one bad byte cannot stall the stream forever, and the leftover is flushed on EOF/error so the final prompt byte is never silently swallowed. Extracted the decode rule as `split_utf8_safely` and covered it with unit tests for ASCII, 4-byte codepoints split at every internal boundary, mixed multibyte runs with 1-byte reads, invalid bytes, empty input, and a truncated tail on close. Fixes TrixtyAI#70

github-actions · 2026-04-21T04:38:49Z

Thanks for the contribution! I'll review it as soon as possible. If you have still changes, please mark this PR as draft and all reviews will be cancelled. Tests reviews will be re-run only when the PR is marked as ready for review.

Copilot

Pull request overview

Fixes PTY output decoding so UTF-8 multi-byte codepoints aren’t corrupted when reads split across the fixed 4 KiB boundary (fixes #70).

Changes:

Added split_utf8_safely helper to identify an emit-safe UTF-8 prefix and retain truncated tails.
Updated the PTY reader thread to prepend any leftover bytes to the next read and flush leftovers on EOF/error.
Added unit tests covering boundary splits, invalid bytes, empty input, and truncated tails.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot review on TrixtyAI#173 flagged three issues on the reader thread, all valid: 1. Per-iteration allocations — `combined` was rebuilt with `std::mem::take(&mut leftover)` + `extend_from_slice` + a `combined[split..].to_vec()` on every read, so a busy shell paid two Vec allocations per 4 KiB chunk. 2. Post-invalid bytes delayed until the next read — when a chunk was `[valid][invalid][valid]`, the old split rule used `valid_up_to + error_len` as the emit boundary and kept the trailing valid suffix as `leftover`. That did not lose data but delayed it by a full read. 3. Docstring for `split_utf8_safely` said "truncated tail" but the code also branched on mid-buffer invalid sequences, which did not match the contract. Fix: - `split_utf8_safely` now returns `buf.len()` for any invalid UTF-8 that is not a truncated tail, so callers emit the whole chunk (`from_utf8_lossy` substitutes `U+FFFD` for the bad bytes) instead of holding the tail back. - The reader thread now keeps a single `Vec<u8>` assembly buffer with capacity `4096 + 4` and uses `drain(..split)` to consume emitted bytes in place — no allocation per read. - Docstring rewritten to match the new contract. - Added a regression test (`invalid_byte_mid_chunk_does_not_delay_valid_trailing_bytes`) that would fail under the old implementation.

matiaspalmac · 2026-04-21T05:18:19Z

Addressed all three Copilot observations:

Allocation per read — reader now keeps a single Vec<u8> assembly buffer with capacity 4096 + 4 and calls drain(..split) in place, so no std::mem::take + to_vec churn on each chunk.
Post-invalid bytes delayed until the next read — split_utf8_safely no longer uses valid_up_to + error_len as the boundary. On any mid-buffer invalid sequence it now returns buf.len() and the caller emits the whole chunk through from_utf8_lossy (one U+FFFD per bad sequence, no stalled suffix).
Docstring vs implementation mismatch — rewrote the doc comment to reflect the new "only truncated tails are held back" contract.

Added invalid_byte_mid_chunk_does_not_delay_valid_trailing_bytes as a regression test that would fail under the old implementation. 7/7 pty tests pass, clippy clean. (5ca2037)

Copilot AI review requested due to automatic review settings April 21, 2026 04:38

github-actions bot added bug Something isn't working dependencies Pull requests that update a dependency file labels Apr 21, 2026

Copilot started reviewing on behalf of matiaspalmac April 21, 2026 04:38 View session

Copilot AI reviewed Apr 21, 2026

View reviewed changes

Comment thread apps/desktop/src-tauri/src/pty.rs Outdated

Comment thread apps/desktop/src-tauri/src/pty.rs Outdated

Comment thread apps/desktop/src-tauri/src/pty.rs Outdated

matiaspalmac mentioned this pull request Apr 21, 2026

[Fix]: Make PTY reader thread joinable so old-session output cannot bleed into the next tab #178

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Fix]: Preserve UTF-8 codepoints across PTY 4 KiB read boundaries#173

[Fix]: Preserve UTF-8 codepoints across PTY 4 KiB read boundaries#173
matiaspalmac wants to merge 2 commits intoTrixtyAI:mainfrom
matiaspalmac:fix/pty-utf8-split

matiaspalmac commented Apr 21, 2026

Uh oh!

github-actions bot commented Apr 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

matiaspalmac commented Apr 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

matiaspalmac commented Apr 21, 2026

Description

Related Issue

Context / Problem

Change

Verification

Checklist

Uh oh!

github-actions bot commented Apr 21, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

matiaspalmac commented Apr 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants