fix(nfs): upgrade read-only handle on WRITE instead of returning STALE by XciD · Pull Request #178 · huggingface/hf-mount

XciD · 2026-05-22T16:17:38Z

Summary

NFS WRITEs on macOS were silently dropping bytes when the file had been READ first. Root cause is server-side: a read-only pooled handle from a prior NFS READ returns EBADF on virtual_fs.write, which errno_to_nfs maps to NFS3ERR_STALE. macOS NFS treats STALE on WRITE as fatal and silently flushes its write buffer — dd reports success but the bytes never reach the server. fsync(2) later returns ESTALE.

Fix

nfs.rs::write():

Fast path EBADF: handle was opened read-only. Remove the now-stale entry from the pool (guarded by peek == Some(fh) so a concurrent successful upgrader isn't clobbered), evict, fall through.
Slow path: open(writable=true) → pwrite → only THEN insert_handle. The freshly-opened fh stays private to this task until its write commits. No other task can release it via insert_handle's replaced eviction while we're between insert and pwrite — because we don't insert until after the pwrite is done.

Mirrors the existing EBADF retry pattern in read() (nfs.rs:181-200).

Reproducer (pre-fix)

hf-mount-nfs bucket X/y /mnt
ls /mnt                                              # triggers READ → pools Lazy handle
dd if=/dev/urandom of=/mnt/existing-file bs=1M count=1 seek=100 conv=notrunc,fsync
# dd reports success; file in bucket is unchanged.
python3 -c "import os; fd=os.open('/mnt/existing-file', os.O_RDWR); os.fsync(fd)"
# OSError: [Errno 70] Stale NFS file handle

Server log shows:

open: ino=2, writable=false   ← from the READ
read: fh=1, offset=...
write: ino=2, fh=1, offset=104857600, len=1048576   ← EBADF returned, mapped to STALE
(no further server activity; macOS dropped the write buffer)

End-to-end validation (post-fix, macOS NFS mount)

Sequential read+write:

write: ino=4, fh=1 → EBADF
release: fh=1                ← pool entry removed
open: writable=true           ← slow path
write: ino=4, fh=2, offset=0  ← succeeds on fresh fh

8 concurrent dd at distinct offsets on same file (was previously hangbait):

all 8 dd done in .009s
8 writes server-side on fh=2 (fast path reused after initial upgrade)
zero release intempestif, zero NFS3ERR_STALE

Python fsync:

fsync OK ✓     # before fix: [Errno 70] Stale NFS file handle

Concurrent-write race (worth calling out)

An earlier draft of this PR added a per-inode tokio::sync::Mutex around the slow path. Adversarial review (Codex) pointed out the correct, lighter defense: publish to the pool only after the first write succeeds. With insert_handle deferred until after pwrite, the fh is unreachable by other tasks during its critical window — no mutex needed, race closed by invariant.

Concretely, if two NFS WRITE RPCs on the same ino both peek a Lazy handle and both hit EBADF:

Both reach the slow path; virtual_fs.open(writable=true) is internally serialized by VirtualFs's per-ino staging lock, so they produce distinct fh_A and fh_B.
Pre-fix (insert-then-write): writer A inserts fh_A → writer B inserts fh_B, replaced=fh_A → evict_handle releases fh_A. Writer A then pwrites against fh_A → EBADF → STALE. Same silent-data-loss this PR is fixing.
Post-fix (write-then-insert): writer A pwrites to fh_A in private (no one else knows about it), then inserts. Writer B does the same with fh_B. Both pwrites succeed; the loser of the insert race has its fh released by the winner's insert, but its bytes are already in the staging file.

Relation to other PRs

PR feat: sparse writes with range_upload, zero-download write path #41 (sparse writes) contains the same EBADF→upgrade fix in its nfs.rs. This PR isolates the change so it can land independently. When feat: sparse writes with range_upload, zero-download write path #41 merges, this PR becomes a no-op merge. The write-before-publish reordering here will need to be ported into feat: sparse writes with range_upload, zero-download write path #41's nfs.rs (the same race exists there).
PR fix(nfs): stable filehandles across server restarts #177 (stable filehandles) was an earlier misdiagnosis attributing the symptom to macOS client-side filehandle caching across umount. Closed.

Tests

write_after_read_upgrades_handle_instead_of_returning_stale — main regression test (READ then WRITE).
second_write_reuses_writable_handle — fast path is reused after upgrade.
write_without_prior_read_opens_writable_directly — slow path standalone.
Verified by reverting just the write() body to main's version: tests fail with NFS3ERR_STALE. Tests are load-bearing for the regression.
cargo test --features fuse,nfs --lib → 342/342 pass.
cargo clippy --features nfs --all-targets -- -D warnings → clean.
End-to-end validated against XciD/hf-mount-test bucket on macOS NFS as described above.

If macOS (or any other NFSv3 client) issues a READ before a WRITE on the same file — which is the common case via stat/`ls` populating the attribute cache — `read()` opens a Lazy/read-only handle and pools it. The subsequent WRITE handler peeks the pool, finds the read-only handle, calls `virtual_fs.write()` which returns EBADF, and `errno_to_nfs` maps that to NFS3ERR_STALE. macOS NFS treats STALE on WRITE as a hard failure and silently discards the pending writes from its buffer — `dd` reports success but the bytes never reach the server. `fsync(2)` on the file later returns ESTALE. Fix: on EBADF, evict the read-only handle and open writable, then retry the write. Mirrors the existing EBADF retry already in `read()`.

Three tests that exercise the bug fix: * `write_after_read_upgrades_handle_instead_of_returning_stale` — main regression test: prior READ pools a Lazy handle, then WRITE must upgrade rather than surface NFS3ERR_STALE. * `second_write_reuses_writable_handle` — fast path remains the fast path after the upgrade (no churn on subsequent writes). * `write_without_prior_read_opens_writable_directly` — slow path works standalone when nothing is pooled. Verified by reverting `nfs.rs::write()` to the pre-fix version: tests 1 and 3 fail with NFS3ERR_STALE (test 2 panics before it can run).

github-actions · 2026-05-22T16:26:54Z

POSIX Compliance (pjdfstest)

============================================================
  pjdfstest POSIX Compliance Results
------------------------------------------------------------
  Files: 130/130 passed    Tests: 832 total (0 subtests failed)
  Result: PASS
------------------------------------------------------------
  Category               Passed    Total   Status
  -------------------- -------- -------- --------
  chflags                     5        5       OK
  chmod                       8        8       OK
  chown                       6        6       OK
  ftruncate                  13       13       OK
  granular                    5        5       OK
  mkdir                       9        9       OK
  open                       19       19       OK
  posix_fallocate             1        1       OK
  rename                     10       10       OK
  rmdir                      11       11       OK
  symlink                    10       10       OK
  truncate                   13       13       OK
  unlink                     11       11       OK
  utimensat                   9        9       OK
============================================================

github-actions · 2026-05-22T16:28:04Z

Benchmark Results

============================================================
  Benchmark — 50MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    181.2 MB/s     114.1 MB/s
  Sequential re-read                2247.6 MB/s    2403.9 MB/s
  Range read (1MB@25MB)                0.4 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1440.1 MB/s
  Close latency (CAS+Hub)            0.146 s
  Write end-to-end                   277.4 MB/s
  Dedup write                       1825.0 MB/s
  Dedup close latency                0.107 s
  Dedup end-to-end                   371.3 MB/s
============================================================
============================================================
  Benchmark — 200MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    766.0 MB/s     776.4 MB/s
  Sequential re-read                2251.6 MB/s    2431.5 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1645.9 MB/s
  Close latency (CAS+Hub)            0.109 s
  Write end-to-end                   866.0 MB/s
  Dedup write                       1667.8 MB/s
  Dedup close latency                0.164 s
  Dedup end-to-end                   704.7 MB/s
============================================================
============================================================
  Benchmark — 500MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                   1365.8 MB/s    1298.2 MB/s
  Sequential re-read                2227.5 MB/s    2449.7 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1503.1 MB/s
  Close latency (CAS+Hub)            0.147 s
  Write end-to-end                  1043.3 MB/s
  Dedup write                       1502.8 MB/s
  Dedup close latency                0.131 s
  Dedup end-to-end                  1078.0 MB/s
============================================================
============================================================
  fio Benchmark Results
------------------------------------------------------------
  Job                        FUSE MB/s   NFS MB/s  FUSE IOPS   NFS IOPS
  ------------------------- ---------- ---------- ---------- ----------
  seq-read-100M                  328.9      490.2                      
  seq-reread-100M               2564.1       41.0                      
  rand-read-4k-100M                0.1        0.1         15         20
  seq-read-5x10M                 909.1      819.7                      
  rand-read-10x1M                  0.1        0.1         37         38
  Random Read Latency           FUSE avg      NFS avg
  ------------------------- ------------ ------------
  rand-read-4k-100M           67017.6 us   50171.6 us
  rand-read-10x1M             27281.2 us   26411.3 us
============================================================

Codex review pointed out a concrete race in the slow path of the EBADF upgrade: two concurrent NFS WRITE RPCs on the same ino can both peek a read-only handle, both get EBADF, and both fall through to the slow path. They open distinct writable handles fh_A and fh_B (serialized by virtual_fs's per-ino staging lock). Writer A inserts fh_A into the pool; writer B's `insert_handle` then releases fh_A as `replaced`. Writer A's subsequent pwrite hits EBADF on the now-released fh_A and maps back to NFS3ERR_STALE — the very silent-data-loss this code path exists to prevent. Reorder the slow path so the fresh fh is used PRIVATELY for the write, then published to the pool only on success. Nothing else can see the fh until its pwrite has committed to the staging file, so no other task can release it out from under us. Removes the race without introducing a per-inode mutex (the simpler-than-locking alternative suggested by Codex). Also: in the fast-path EBADF branch, remove the stale pool entry before re-entering, mirroring the analogous EBADF retry in `read()`. Pre-fix the entry survived `evict_handle` (which only releases the VFS handle), so a concurrent caller could peek and use a freshly-released fh. Guarded by a `peek == Some(fh)` check to avoid removing a different fh installed by a concurrent successful upgrader.

XciD added 2 commits May 22, 2026 18:17

XciD force-pushed the fix/nfs-readonly-handle-upgrade branch from e78465c to 9f6943b Compare May 22, 2026 17:44

XciD marked this pull request as ready for review May 22, 2026 18:10

XciD merged commit 1fe503c into main May 22, 2026
6 checks passed

XciD deleted the fix/nfs-readonly-handle-upgrade branch May 22, 2026 18:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(nfs): upgrade read-only handle on WRITE instead of returning STALE#178

fix(nfs): upgrade read-only handle on WRITE instead of returning STALE#178
XciD merged 3 commits into
mainfrom
fix/nfs-readonly-handle-upgrade

XciD commented May 22, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

github-actions Bot commented May 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

XciD commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fix

Reproducer (pre-fix)

End-to-end validation (post-fix, macOS NFS mount)

Concurrent-write race (worth calling out)

Relation to other PRs

Tests

Uh oh!

github-actions Bot commented May 22, 2026

POSIX Compliance (pjdfstest)

Uh oh!

github-actions Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Results

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

XciD commented May 22, 2026 •

edited

Loading

github-actions Bot commented May 22, 2026 •

edited

Loading