Skip to content

fix(nfs): upgrade read-only handle on WRITE instead of returning STALE#178

Merged
XciD merged 3 commits into
mainfrom
fix/nfs-readonly-handle-upgrade
May 22, 2026
Merged

fix(nfs): upgrade read-only handle on WRITE instead of returning STALE#178
XciD merged 3 commits into
mainfrom
fix/nfs-readonly-handle-upgrade

Conversation

@XciD

@XciD XciD commented May 22, 2026

Copy link
Copy Markdown
Member

Summary

NFS WRITEs on macOS were silently dropping bytes when the file had been READ first. Root cause is server-side: a read-only pooled handle from a prior NFS READ returns EBADF on virtual_fs.write, which errno_to_nfs maps to NFS3ERR_STALE. macOS NFS treats STALE on WRITE as fatal and silently flushes its write buffer — dd reports success but the bytes never reach the server. fsync(2) later returns ESTALE.

Fix

nfs.rs::write():

  1. Fast path EBADF: handle was opened read-only. Remove the now-stale entry from the pool (guarded by peek == Some(fh) so a concurrent successful upgrader isn't clobbered), evict, fall through.
  2. Slow path: open(writable=true)pwrite → only THEN insert_handle. The freshly-opened fh stays private to this task until its write commits. No other task can release it via insert_handle's replaced eviction while we're between insert and pwrite — because we don't insert until after the pwrite is done.

Mirrors the existing EBADF retry pattern in read() (nfs.rs:181-200).

Reproducer (pre-fix)

hf-mount-nfs bucket X/y /mnt
ls /mnt                                              # triggers READ → pools Lazy handle
dd if=/dev/urandom of=/mnt/existing-file bs=1M count=1 seek=100 conv=notrunc,fsync
# dd reports success; file in bucket is unchanged.
python3 -c "import os; fd=os.open('/mnt/existing-file', os.O_RDWR); os.fsync(fd)"
# OSError: [Errno 70] Stale NFS file handle

Server log shows:

open: ino=2, writable=false   ← from the READ
read: fh=1, offset=...
write: ino=2, fh=1, offset=104857600, len=1048576   ← EBADF returned, mapped to STALE
(no further server activity; macOS dropped the write buffer)

End-to-end validation (post-fix, macOS NFS mount)

Sequential read+write:

write: ino=4, fh=1 → EBADF
release: fh=1                ← pool entry removed
open: writable=true           ← slow path
write: ino=4, fh=2, offset=0  ← succeeds on fresh fh

8 concurrent dd at distinct offsets on same file (was previously hangbait):

all 8 dd done in .009s
8 writes server-side on fh=2 (fast path reused after initial upgrade)
zero release intempestif, zero NFS3ERR_STALE

Python fsync:

fsync OK ✓     # before fix: [Errno 70] Stale NFS file handle

Concurrent-write race (worth calling out)

An earlier draft of this PR added a per-inode tokio::sync::Mutex around the slow path. Adversarial review (Codex) pointed out the correct, lighter defense: publish to the pool only after the first write succeeds. With insert_handle deferred until after pwrite, the fh is unreachable by other tasks during its critical window — no mutex needed, race closed by invariant.

Concretely, if two NFS WRITE RPCs on the same ino both peek a Lazy handle and both hit EBADF:

  • Both reach the slow path; virtual_fs.open(writable=true) is internally serialized by VirtualFs's per-ino staging lock, so they produce distinct fh_A and fh_B.
  • Pre-fix (insert-then-write): writer A inserts fh_A → writer B inserts fh_B, replaced=fh_A → evict_handle releases fh_A. Writer A then pwrites against fh_A → EBADF → STALE. Same silent-data-loss this PR is fixing.
  • Post-fix (write-then-insert): writer A pwrites to fh_A in private (no one else knows about it), then inserts. Writer B does the same with fh_B. Both pwrites succeed; the loser of the insert race has its fh released by the winner's insert, but its bytes are already in the staging file.

Relation to other PRs

Tests

  • write_after_read_upgrades_handle_instead_of_returning_stale — main regression test (READ then WRITE).
  • second_write_reuses_writable_handle — fast path is reused after upgrade.
  • write_without_prior_read_opens_writable_directly — slow path standalone.
  • Verified by reverting just the write() body to main's version: tests fail with NFS3ERR_STALE. Tests are load-bearing for the regression.
  • cargo test --features fuse,nfs --lib → 342/342 pass.
  • cargo clippy --features nfs --all-targets -- -D warnings → clean.
  • End-to-end validated against XciD/hf-mount-test bucket on macOS NFS as described above.

XciD added 2 commits May 22, 2026 18:17
If macOS (or any other NFSv3 client) issues a READ before a WRITE on the
same file — which is the common case via stat/`ls` populating the
attribute cache — `read()` opens a Lazy/read-only handle and pools it.
The subsequent WRITE handler peeks the pool, finds the read-only handle,
calls `virtual_fs.write()` which returns EBADF, and `errno_to_nfs` maps
that to NFS3ERR_STALE.

macOS NFS treats STALE on WRITE as a hard failure and silently discards
the pending writes from its buffer — `dd` reports success but the bytes
never reach the server. `fsync(2)` on the file later returns ESTALE.

Fix: on EBADF, evict the read-only handle and open writable, then retry
the write. Mirrors the existing EBADF retry already in `read()`.
Three tests that exercise the bug fix:

* `write_after_read_upgrades_handle_instead_of_returning_stale` — main
  regression test: prior READ pools a Lazy handle, then WRITE must
  upgrade rather than surface NFS3ERR_STALE.
* `second_write_reuses_writable_handle` — fast path remains the fast
  path after the upgrade (no churn on subsequent writes).
* `write_without_prior_read_opens_writable_directly` — slow path works
  standalone when nothing is pooled.

Verified by reverting `nfs.rs::write()` to the pre-fix version: tests
1 and 3 fail with NFS3ERR_STALE (test 2 panics before it can run).
@github-actions

Copy link
Copy Markdown
Contributor

POSIX Compliance (pjdfstest)

============================================================
  pjdfstest POSIX Compliance Results
------------------------------------------------------------
  Files: 130/130 passed    Tests: 832 total (0 subtests failed)
  Result: PASS
------------------------------------------------------------
  Category               Passed    Total   Status
  -------------------- -------- -------- --------
  chflags                     5        5       OK
  chmod                       8        8       OK
  chown                       6        6       OK
  ftruncate                  13       13       OK
  granular                    5        5       OK
  mkdir                       9        9       OK
  open                       19       19       OK
  posix_fallocate             1        1       OK
  rename                     10       10       OK
  rmdir                      11       11       OK
  symlink                    10       10       OK
  truncate                   13       13       OK
  unlink                     11       11       OK
  utimensat                   9        9       OK
============================================================

@github-actions

github-actions Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

Benchmark Results

============================================================
  Benchmark — 50MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    181.2 MB/s     114.1 MB/s
  Sequential re-read                2247.6 MB/s    2403.9 MB/s
  Range read (1MB@25MB)                0.4 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1440.1 MB/s
  Close latency (CAS+Hub)            0.146 s
  Write end-to-end                   277.4 MB/s
  Dedup write                       1825.0 MB/s
  Dedup close latency                0.107 s
  Dedup end-to-end                   371.3 MB/s
============================================================
============================================================
  Benchmark — 200MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    766.0 MB/s     776.4 MB/s
  Sequential re-read                2251.6 MB/s    2431.5 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1645.9 MB/s
  Close latency (CAS+Hub)            0.109 s
  Write end-to-end                   866.0 MB/s
  Dedup write                       1667.8 MB/s
  Dedup close latency                0.164 s
  Dedup end-to-end                   704.7 MB/s
============================================================
============================================================
  Benchmark — 500MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                   1365.8 MB/s    1298.2 MB/s
  Sequential re-read                2227.5 MB/s    2449.7 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1503.1 MB/s
  Close latency (CAS+Hub)            0.147 s
  Write end-to-end                  1043.3 MB/s
  Dedup write                       1502.8 MB/s
  Dedup close latency                0.131 s
  Dedup end-to-end                  1078.0 MB/s
============================================================
============================================================
  fio Benchmark Results
------------------------------------------------------------
  Job                        FUSE MB/s   NFS MB/s  FUSE IOPS   NFS IOPS
  ------------------------- ---------- ---------- ---------- ----------
  seq-read-100M                  328.9      490.2                      
  seq-reread-100M               2564.1       41.0                      
  rand-read-4k-100M                0.1        0.1         15         20
  seq-read-5x10M                 909.1      819.7                      
  rand-read-10x1M                  0.1        0.1         37         38
  Random Read Latency           FUSE avg      NFS avg
  ------------------------- ------------ ------------
  rand-read-4k-100M           67017.6 us   50171.6 us
  rand-read-10x1M             27281.2 us   26411.3 us
============================================================

@XciD XciD force-pushed the fix/nfs-readonly-handle-upgrade branch from e78465c to 9f6943b Compare May 22, 2026 17:44
Codex review pointed out a concrete race in the slow path of the EBADF
upgrade: two concurrent NFS WRITE RPCs on the same ino can both peek a
read-only handle, both get EBADF, and both fall through to the slow
path. They open distinct writable handles fh_A and fh_B (serialized by
virtual_fs's per-ino staging lock). Writer A inserts fh_A into the
pool; writer B's `insert_handle` then releases fh_A as `replaced`.
Writer A's subsequent pwrite hits EBADF on the now-released fh_A and
maps back to NFS3ERR_STALE — the very silent-data-loss this code path
exists to prevent.

Reorder the slow path so the fresh fh is used PRIVATELY for the write,
then published to the pool only on success. Nothing else can see the
fh until its pwrite has committed to the staging file, so no other
task can release it out from under us. Removes the race without
introducing a per-inode mutex (the simpler-than-locking alternative
suggested by Codex).

Also: in the fast-path EBADF branch, remove the stale pool entry
before re-entering, mirroring the analogous EBADF retry in `read()`.
Pre-fix the entry survived `evict_handle` (which only releases the VFS
handle), so a concurrent caller could peek and use a freshly-released
fh. Guarded by a `peek == Some(fh)` check to avoid removing a
different fh installed by a concurrent successful upgrader.
@XciD XciD marked this pull request as ready for review May 22, 2026 18:10
@XciD XciD merged commit 1fe503c into main May 22, 2026
6 checks passed
@XciD XciD deleted the fix/nfs-readonly-handle-upgrade branch May 22, 2026 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant