fix(nfs): stable filehandles across server restarts by XciD · Pull Request #177 · huggingface/hf-mount

XciD · 2026-05-22T05:37:32Z

Summary

Override `id_to_fh` / `fh_to_id` in `NFSAdapter` so the embedded generation number derives from the mount source identifier (`bucket/<id>` or `<type>/<repo>/<rev>`) instead of `nfsserve`'s default `SystemTime::now()` at startup. Same source = same generation across restarts, so cached client handles stay valid after umount+remount.

Bug

On macOS, this sequence silently loses writes:

```
hf-mount-nfs bucket X/y /tmp/m # session #1, write a file
umount /tmp/m
# ^C (interrupt) the hf-mount-nfs process
hf-mount-nfs bucket X/y /tmp/m # session #2, same mount point
dd of=/tmp/m/existing-file ... # reports success
```

But:

- No WRITE RPC reaches the server (verified via debug logging)
- The data never lands in the bucket
- An explicit `fsync(2)` reveals `[Errno 70] Stale NFS file handle`

Root cause

`nfsserve`'s default `id_to_fh` embeds `SystemTime::now()` at server startup as a generation number. Every restart of the server invalidates every previously-issued filehandle.

The macOS NFS kernel client caches filehandles across `umount` of the same mount point (as an attribute-cache optimization). When the server comes back with a new generation number, the cached handles get `NFS3ERR_STALE` on the next operation. The kernel then silently discards pending writes without surfacing an error to userspace.

By design, each end is defensible in isolation:

- `nfsserve` protects against fileid reuse across restarts
- macOS NFS caches aggressively for performance

Their combination produces silent data loss.

Fix

Derive the generation number from the source identifier rather than the startup time:

```rust
let fh_gen = fnv1a_64(virtual_fs.source_identifier().as_bytes());
```

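FNV-1a is used inline (6 lines, no new dep) because std's `DefaultHasher` does not guarantee stable output (its algorithm is unspecified and may change between Rust releases, and `RandomState` seeds it randomly per process), which would defeat the purpose.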
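The PR's own `fnv1a_64` body is not shown in this conversation, so the following is only a minimal sketch of what a 64-bit FNV-1a hash typically looks like, using the standard published offset basis and prime. The function name matches the snippet above; everything else is an assumption about how it might be written.

```rust
/// 64-bit FNV-1a over a byte slice (sketch; not copied from the PR).
/// The two constants are the standard FNV parameters for 64-bit hashes.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;

    let mut hash = OFFSET_BASIS;
    for &byte in bytes {
        // FNV-1a order: XOR the byte in first, then multiply by the prime.
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}
```

Because the function depends only on its input bytes, the same source identifier yields the same value in every process, which is the property the generation number needs.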

Trade-off

Inode numbers within a source are not guaranteed stable across restarts (hf-mount allocates them in tree-listing order). If the listing changes between mounts, a client-cached handle may resolve to a different file. The risk is bounded: clients re-LOOKUP on miss, and the issue only matters for handles held across a restart. We accept this slightly weaker guarantee compared with upstream's strict "every handle expires on restart" because the alternative is silent write loss.

Tests

- `fnv1a_is_deterministic_for_same_input`: locks the FNV-1a basis vector
- `fnv1a_distinguishes_sources`: different bucket/repo IDs → different gens
- `filehandle_survives_simulated_server_restart`: regression for the actual bug
- `filehandle_rejected_when_source_differs`: cross-source isolation
- `filehandle_rejects_wrong_length`: malformed handles still rejected
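The properties these tests check can be illustrated with a small stand-alone sketch. It does not use the real `nfsserve` trait or types: `Adapter`, `FhError`, and the 16-byte layout (8-byte generation followed by 8-byte file id, little-endian) are hypothetical stand-ins chosen for illustration, and the real handle layout may differ.

```rust
/// Hypothetical handle length: 8-byte generation + 8-byte file id.
const FH_LEN: usize = 16;

#[derive(Debug, PartialEq)]
pub enum FhError {
    BadHandle, // wrong length or otherwise malformed
    Stale,     // generation does not belong to this source
}

/// Standard 64-bit FNV-1a (same constants as the published algorithm).
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Stand-in for the adapter; `fh_gen` depends only on the source identifier.
pub struct Adapter {
    fh_gen: u64,
}

impl Adapter {
    pub fn for_source(source_identifier: &str) -> Self {
        Self { fh_gen: fnv1a_64(source_identifier.as_bytes()) }
    }

    /// Encode generation + id into an opaque handle.
    pub fn id_to_fh(&self, id: u64) -> Vec<u8> {
        let mut fh = Vec::with_capacity(FH_LEN);
        fh.extend_from_slice(&self.fh_gen.to_le_bytes());
        fh.extend_from_slice(&id.to_le_bytes());
        fh
    }

    /// Decode a handle, rejecting wrong lengths and foreign generations.
    pub fn fh_to_id(&self, fh: &[u8]) -> Result<u64, FhError> {
        if fh.len() != FH_LEN {
            return Err(FhError::BadHandle);
        }
        let mut generation = [0u8; 8];
        let mut id = [0u8; 8];
        generation.copy_from_slice(&fh[..8]);
        id.copy_from_slice(&fh[8..]);
        if u64::from_le_bytes(generation) != self.fh_gen {
            return Err(FhError::Stale);
        }
        Ok(u64::from_le_bytes(id))
    }
}
```

A "simulated restart" in this model is just constructing a second `Adapter` for the same source string: a handle minted by the first decodes on the second, while an adapter for a different source rejects it as stale.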

`cargo test --lib --features fuse,nfs` → 344/344 pass.

The default nfsserve `id_to_fh` embeds `SystemTime::now()` at startup as the generation number, so every restart invalidates every previously issued handle. The macOS NFS client caches filehandles across `umount` and remount of the same mount point, so after `umount → kill nfs server → restart → mount`, the kernel keeps using the cached handles. The server returns NFS3ERR_STALE, and the macOS client *silently drops the write* (WRITE RPC never reaches the server, dd reports success, an explicit fsync(2) on the file returns ESTALE).

Symptom in practice: opening an existing file for write after a remount on macOS appears to succeed but the data never lands in the bucket.

Override `id_to_fh` / `fh_to_id` on NFSAdapter to use a generation number derived from the mount source identifier (`bucket/<id>` or `<type>/<repo>/<rev>`) via FNV-1a. Same source → same gen across restarts, so cached client handles stay valid. Different sources → different gens, so a handle from bucket A is properly rejected when mounting bucket B.

Std's `DefaultHasher` does not guarantee stable output (its algorithm is unspecified across Rust releases, and `RandomState` seeds it randomly per process), which would defeat the whole point, hence the inline FNV-1a (no new deps).

Tests cover: determinism, cross-source rejection, malformed handles, and the simulated-restart round-trip that locks in the regression fix.

github-actions · 2026-05-22T05:45:01Z

POSIX Compliance (pjdfstest)

============================================================
  pjdfstest POSIX Compliance Results
------------------------------------------------------------
  Files: 130/130 passed    Tests: 832 total (0 subtests failed)
  Result: PASS
------------------------------------------------------------
  Category               Passed    Total   Status
  -------------------- -------- -------- --------
  chflags                     5        5       OK
  chmod                       8        8       OK
  chown                       6        6       OK
  ftruncate                  13       13       OK
  granular                    5        5       OK
  mkdir                       9        9       OK
  open                       19       19       OK
  posix_fallocate             1        1       OK
  rename                     10       10       OK
  rmdir                      11       11       OK
  symlink                    10       10       OK
  truncate                   13       13       OK
  unlink                     11       11       OK
  utimensat                   9        9       OK
============================================================

github-actions · 2026-05-22T05:45:58Z

Benchmark Results

============================================================
  Benchmark — 50MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    244.7 MB/s     216.1 MB/s
  Sequential re-read                2216.4 MB/s    2246.4 MB/s
  Range read (1MB@25MB)                0.5 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1472.5 MB/s
  Close latency (CAS+Hub)            0.118 s
  Write end-to-end                   330.1 MB/s
  Dedup write                       1794.8 MB/s
  Dedup close latency                0.103 s
  Dedup end-to-end                   382.2 MB/s
============================================================
============================================================
  Benchmark — 200MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    846.9 MB/s     915.7 MB/s
  Sequential re-read                2207.0 MB/s    2389.9 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1719.9 MB/s
  Close latency (CAS+Hub)            0.133 s
  Write end-to-end                   801.3 MB/s
  Dedup write                       1770.4 MB/s
  Dedup close latency                0.111 s
  Dedup end-to-end                   894.5 MB/s
============================================================
============================================================
  Benchmark — 500MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                   1459.8 MB/s    1303.3 MB/s
  Sequential re-read                2230.7 MB/s    2395.1 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1564.3 MB/s
  Close latency (CAS+Hub)            0.144 s
  Write end-to-end                  1079.5 MB/s
  Dedup write                       1570.2 MB/s
  Dedup close latency                0.144 s
  Dedup end-to-end                  1081.2 MB/s
============================================================
============================================================
  fio Benchmark Results
------------------------------------------------------------
  Job                        FUSE MB/s   NFS MB/s  FUSE IOPS   NFS IOPS
  ------------------------- ---------- ---------- ---------- ----------
  seq-read-100M                  520.8      476.2                      
  seq-reread-100M               2439.0       33.9                      
  rand-read-4k-100M                0.1        0.1         17         19
  seq-read-5x10M                 694.4      819.7                      
  rand-read-10x1M                  0.1        0.1         37         37
  Random Read Latency           FUSE avg      NFS avg
  ------------------------- ------------ ------------
  rand-read-4k-100M           58854.0 us   52080.1 us
  rand-read-10x1M             27031.8 us   27293.5 us
============================================================

XciD · 2026-05-22T16:06:52Z

Closing — root cause analysis was wrong.

I attributed the symptom to stale filehandles cached by macOS NFS client across umount. After actually testing this fix on macOS, the bug still reproduces. Reading the log carefully revealed the real mechanism:

1. macOS NFS sends a READ first (during `ls` / stat)
2. hf-mount opens a Lazy (read-only) handle, pools it as `fh=1`
3. macOS NFS sends a WRITE
4. `nfs.rs::write()` peeks the pool → finds `fh=1`
5. `virtual_fs.write(fh=1, ...)` returns `EBADF` (Lazy can't write)
6. `errno_to_nfs(EBADF)` maps to `NFS3ERR_STALE` (nfs.rs:833)
7. macOS receives ESTALE → drops the write silently

The bug is server-side: `nfs.rs::write()` on main doesn't upgrade a read-only pool handle when WRITE arrives. PR #41 already adds the upgrade path (`Err(EBADF) => evict + open(writable=true) + retry`). So this issue is fixed once #41 lands.
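To make the `Err(EBADF) => evict + open(writable=true) + retry` path concrete, here is a synchronous toy model. `VirtualFs`, `nfs_write`, and the `HashMap` pool are hypothetical stand-ins written for this sketch, not hf-mount's actual types or signatures; only the control flow is meant to mirror the description above.

```rust
use std::collections::HashMap;

const EBADF: i32 = 9; // errno value for "bad file descriptor"

/// Toy stand-in for the virtual filesystem (hypothetical API).
#[derive(Default)]
struct VirtualFs {
    next_fh: u64,
    handles: HashMap<u64, bool>, // fh -> was it opened writable?
    bytes_written: usize,
}

impl VirtualFs {
    fn open(&mut self, _ino: u64, writable: bool) -> u64 {
        self.next_fh += 1;
        self.handles.insert(self.next_fh, writable);
        self.next_fh
    }

    fn release(&mut self, fh: u64) {
        self.handles.remove(&fh);
    }

    fn write(&mut self, fh: u64, data: &[u8]) -> Result<usize, i32> {
        match self.handles.get(&fh) {
            Some(true) => {
                self.bytes_written += data.len();
                Ok(data.len())
            }
            // Read-only (Lazy) or unknown handle: cannot write.
            _ => Err(EBADF),
        }
    }
}

/// WRITE handler sketch: `pool` maps ino -> pooled fh.
fn nfs_write(
    vfs: &mut VirtualFs,
    pool: &mut HashMap<u64, u64>,
    ino: u64,
    data: &[u8],
) -> Result<usize, i32> {
    // Fast path: try the pooled handle, which may be read-only.
    if let Some(&fh) = pool.get(&ino) {
        match vfs.write(fh, data) {
            Err(EBADF) => {
                // Evict the read-only handle instead of reporting STALE.
                pool.remove(&ino);
                vfs.release(fh);
            }
            other => return other,
        }
    }
    // Slow path: open a writable handle, write, then pool it.
    let fh = vfs.open(ino, true);
    match vfs.write(fh, data) {
        Ok(n) => {
            if let Some(replaced) = pool.insert(ino, fh) {
                vfs.release(replaced);
            }
            Ok(n)
        }
        Err(e) => {
            vfs.release(fh);
            Err(e)
        }
    }
}
```

In this model a READ that pooled a read-only handle no longer poisons the next WRITE: the first write upgrades the handle and later writes reuse it on the fast path.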

The generation-number-from-source change in this PR is orthogonal and doesn't address anything currently observable. Could be useful defensively, but not as a fix for this bug — leaving it out.

#178)

## Summary

NFS WRITEs on macOS were silently dropping bytes when the file had been READ first. Root cause is server-side: a read-only pooled handle from a prior NFS READ returns `EBADF` on `virtual_fs.write`, which `errno_to_nfs` maps to `NFS3ERR_STALE`. macOS NFS treats STALE on WRITE as fatal and silently discards its write buffer: `dd` reports success but the bytes never reach the server. `fsync(2)` later returns ESTALE.

## Fix

`nfs.rs::write()`:

1. **Fast path EBADF**: handle was opened read-only. Remove the now-stale entry from the pool (guarded by `peek == Some(fh)` so a concurrent successful upgrader isn't clobbered), evict, fall through.
2. **Slow path**: `open(writable=true)` → `pwrite` → only THEN `insert_handle`. The freshly-opened fh stays private to this task until its write commits. No other task can release it via `insert_handle`'s `replaced` eviction before the pwrite runs, because we don't insert until after the pwrite is done.

Mirrors the existing EBADF retry pattern in `read()` (nfs.rs:181-200).

## Reproducer (pre-fix)

```bash
hf-mount-nfs bucket X/y /mnt
ls /mnt # triggers READ → pools Lazy handle
dd if=/dev/urandom of=/mnt/existing-file bs=1M count=1 seek=100 conv=notrunc,fsync
# dd reports success; file in bucket is unchanged.
python3 -c "import os; fd=os.open('/mnt/existing-file', os.O_RDWR); os.fsync(fd)"
# OSError: [Errno 70] Stale NFS file handle
```

Server log shows:

```
open: ino=2, writable=false ← from the READ
read: fh=1, offset=...
write: ino=2, fh=1, offset=104857600, len=1048576 ← EBADF returned, mapped to STALE
(no further server activity; macOS dropped the write buffer)
```

## End-to-end validation (post-fix, macOS NFS mount)

Sequential read+write:

```
write: ino=4, fh=1 → EBADF
release: fh=1 ← pool entry removed
open: writable=true ← slow path
write: ino=4, fh=2, offset=0 ← succeeds on fresh fh
```

8 concurrent dd at distinct offsets on same file (previously prone to hanging):

```
all 8 dd done in .009s
8 writes server-side on fh=2 (fast path reused after initial upgrade)
zero spurious releases, zero NFS3ERR_STALE
```

Python fsync:

```
fsync OK ✓ # before fix: [Errno 70] Stale NFS file handle
```

## Concurrent-write race (worth calling out)

An earlier draft of this PR added a per-inode `tokio::sync::Mutex` around the slow path. Adversarial review (Codex) pointed out the correct, lighter defense: **publish to the pool only after the first write succeeds**. With insert_handle deferred until after pwrite, the fh is unreachable by other tasks during its critical window: no mutex needed, race closed by invariant.

Concretely, if two NFS WRITE RPCs on the same ino both peek a Lazy handle and both hit EBADF:

- Both reach the slow path; `virtual_fs.open(writable=true)` is internally serialized by VirtualFs's per-ino staging lock, so they produce distinct fh_A and fh_B.
- Pre-fix (insert-then-write): writer A inserts fh_A → writer B inserts fh_B, `replaced=fh_A` → evict_handle releases fh_A. Writer A then pwrites against fh_A → EBADF → STALE. Same silent data loss this PR is fixing.
- Post-fix (write-then-insert): writer A pwrites to fh_A in private (no one else knows about it), then inserts. Writer B does the same with fh_B. Both pwrites succeed; the loser of the insert race has its fh released by the winner's insert, but its bytes are already in the staging file.

## Relation to other PRs

- **PR #41 (sparse writes)** contains the same EBADF→upgrade fix in its nfs.rs. This PR isolates the change so it can land independently. When #41 merges, this PR becomes a no-op merge. The write-before-publish reordering here will need to be ported into #41's nfs.rs (the same race exists there).
- **PR #177 (stable filehandles)** was an earlier misdiagnosis attributing the symptom to macOS client-side filehandle caching across umount. Closed.

## Tests

- `write_after_read_upgrades_handle_instead_of_returning_stale`: main regression test (READ then WRITE).
- `second_write_reuses_writable_handle`: fast path is reused after upgrade.
- `write_without_prior_read_opens_writable_directly`: slow path standalone.
- Verified by reverting just the `write()` body to main's version: tests fail with `NFS3ERR_STALE`. Tests are load-bearing for the regression.
- `cargo test --features fuse,nfs --lib` → 342/342 pass.
- `cargo clippy --features nfs --all-targets -- -D warnings` → clean.
- End-to-end validated against `XciD/hf-mount-test` bucket on macOS NFS as described above.
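The insert-then-write versus write-then-insert ordering described above can be replayed deterministically in a single thread. The sketch below is a toy model only: `Fs`, `insert_handle`, and the two scenario functions are hypothetical names that imitate the described behaviour (a pool insert releases whatever handle it replaces), not the real hf-mount code, and the interleaving is fixed by hand rather than produced by real concurrency.

```rust
use std::collections::{HashMap, HashSet};

const EBADF: i32 = 9; // errno value for "bad file descriptor"

/// Toy filesystem tracking which writable handles are still open.
#[derive(Default)]
struct Fs {
    next_fh: u64,
    open: HashSet<u64>,
}

impl Fs {
    fn open_writable(&mut self) -> u64 {
        self.next_fh += 1;
        self.open.insert(self.next_fh);
        self.next_fh
    }

    fn release(&mut self, fh: u64) {
        self.open.remove(&fh);
    }

    /// A write succeeds only if the handle has not been released.
    fn pwrite(&self, fh: u64) -> Result<(), i32> {
        if self.open.contains(&fh) { Ok(()) } else { Err(EBADF) }
    }
}

/// Publish `fh` for `ino`, releasing any handle it replaces.
fn insert_handle(fs: &mut Fs, pool: &mut HashMap<u64, u64>, ino: u64, fh: u64) {
    if let Some(replaced) = pool.insert(ino, fh) {
        if replaced != fh {
            fs.release(replaced);
        }
    }
}

/// Pre-fix ordering: both writers publish before either writes.
fn insert_then_write() -> (Result<(), i32>, Result<(), i32>) {
    let (mut fs, mut pool) = (Fs::default(), HashMap::new());
    let (fh_a, fh_b) = (fs.open_writable(), fs.open_writable());
    insert_handle(&mut fs, &mut pool, 4, fh_a); // writer A publishes
    insert_handle(&mut fs, &mut pool, 4, fh_b); // writer B publishes, releasing fh_a
    (fs.pwrite(fh_a), fs.pwrite(fh_b)) // A now writes to a released handle
}

/// Post-fix ordering: each writer writes privately, then publishes.
fn write_then_insert() -> (Result<(), i32>, Result<(), i32>) {
    let (mut fs, mut pool) = (Fs::default(), HashMap::new());
    let (fh_a, fh_b) = (fs.open_writable(), fs.open_writable());
    let results = (fs.pwrite(fh_a), fs.pwrite(fh_b)); // both writes land first
    insert_handle(&mut fs, &mut pool, 4, fh_a);
    insert_handle(&mut fs, &mut pool, 4, fh_b); // fh_a released only after its write
    results
}
```

In the first scenario writer A gets `EBADF`, which is the error that was being mapped to `NFS3ERR_STALE`; in the second both writes succeed even though one handle is still released by the later insert.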

XciD closed this May 22, 2026

XciD mentioned this pull request May 22, 2026

fix(nfs): upgrade read-only handle on WRITE instead of returning STALE #178

Merged

XciD deleted the fix/nfs-stable-filehandles branch June 28, 2026 09:46

XciD wants to merge 1 commit into main from fix/nfs-stable-filehandles
