fix(nfs): stable filehandles across server restarts#177

Closed
XciD wants to merge 1 commit into main from fix/nfs-stable-filehandles

@XciD XciD commented May 22, 2026

Member

Summary

Override id_to_fh / fh_to_id in NFSAdapter so the embedded generation number derives from the mount source identifier (bucket/<id> or <type>/<repo>/<rev>) instead of nfsserve's default SystemTime::now() at startup. Same source = same generation across restarts, so cached client handles stay valid after umount+remount.

Bug

On macOS, this sequence silently loses writes:

```
hf-mount-nfs bucket X/y /tmp/m # session #1, write a file
umount /tmp/m
# then Ctrl-C the hf-mount-nfs process
hf-mount-nfs bucket X/y /tmp/m # session #2, same mount point
dd of=/tmp/m/existing-file ... # reports success
```

But:

  • No WRITE RPC reaches the server (verified via debug logging)
  • The data never lands in the bucket
  • An explicit `fsync(2)` reveals `[Errno 70] Stale NFS file handle`

Root cause

`nfsserve`'s default `id_to_fh` embeds `SystemTime::now()` at server startup as a generation number. Every restart of the server invalidates every previously-issued filehandle.

The macOS NFS kernel client caches filehandles across `umount` of the same mount point (as an attribute-cache optimization). When the server comes back with a new generation number, the cached handles get `NFS3ERR_STALE` on the next operation. The kernel then silently discards pending writes without surfacing an error to userspace.

By design, each end is defensible in isolation:

  • nfsserve protects against fileid reuse across restarts
  • macOS NFS caches aggressively for performance

Their combination produces silent data loss.
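To make the handle layout concrete, here is a minimal sketch of a generation-stamped filehandle codec. It assumes a 16-byte handle holding the generation number followed by the file id, both little-endian. The `HandleError` type and the exact layout are illustrative assumptions rather than nfsserve's actual code; the function names are borrowed from the PR summary only for readability.

```rust
/// Hypothetical error type standing in for the two NFSv3 statuses involved.
#[derive(Debug, PartialEq, Eq)]
pub enum HandleError {
    /// Handle has the wrong length (compare NFS3ERR_BADHANDLE).
    BadHandle,
    /// Handle was issued under a different generation (compare NFS3ERR_STALE).
    Stale,
}

/// Encode a handle as generation (8 bytes) followed by file id (8 bytes).
/// The layout is an assumption made for this sketch.
pub fn id_to_fh(generation: u64, id: u64) -> [u8; 16] {
    let mut fh = [0u8; 16];
    fh[..8].copy_from_slice(&generation.to_le_bytes());
    fh[8..].copy_from_slice(&id.to_le_bytes());
    fh
}

/// Decode a handle, rejecting it if it was stamped with another generation.
pub fn fh_to_id(current_generation: u64, fh: &[u8]) -> Result<u64, HandleError> {
    if fh.len() != 16 {
        return Err(HandleError::BadHandle);
    }
    let mut generation = [0u8; 8];
    let mut id = [0u8; 8];
    generation.copy_from_slice(&fh[..8]);
    id.copy_from_slice(&fh[8..]);
    if u64::from_le_bytes(generation) != current_generation {
        return Err(HandleError::Stale);
    }
    Ok(u64::from_le_bytes(id))
}
```

With a startup-time generation, a handle encoded before a restart would decode to `Stale` afterwards, because `current_generation` changes on every start.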

Fix

Derive the generation number from the source identifier rather than the startup time:

```rust
let fh_gen = fnv1a_64(virtual_fs.source_identifier().as_bytes());
```

FNV-1a is used inline (6 lines, no new dep) because std's `DefaultHasher` is randomized per-process, which would defeat the purpose.
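For reference, a 64-bit FNV-1a hash can be written in a few lines. The sketch below uses the standard FNV offset basis and prime; it is not copied from the PR, so the actual inline helper may differ in naming and style.

```rust
/// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
/// Deterministic across processes, unlike std's randomly seeded `DefaultHasher`.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}
```

Because the function has no per-process seed, hashing the same source identifier (for example a `bucket/<id>` string) yields the same generation number on every start.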

Trade-off

Inode numbers within a source are not guaranteed stable across restarts (hf-mount allocates them in tree-listing order). If the listing changes between mounts, a client's cached handle may resolve to a different file. The risk is bounded: clients re-LOOKUP on miss, and the issue only matters for handles held across a restart. We accept this slight loss of safety relative to upstream's strict "every handle expires on restart" behavior because the alternative is silent write loss.

Tests

  • `fnv1a_is_deterministic_for_same_input` — locks the FNV-1a basis vector
  • `fnv1a_distinguishes_sources` — different bucket/repo IDs → different gens
  • `filehandle_survives_simulated_server_restart` — regression for the actual bug
  • `filehandle_rejected_when_source_differs` — cross-source isolation
  • `filehandle_rejects_wrong_length` — malformed handles still rejected

`cargo test --lib --features fuse,nfs` → 344/344 pass.

The default nfsserve `id_to_fh` embeds `SystemTime::now()` at startup as
the generation number, so every restart invalidates every previously-
issued handle. macOS NFS client caches filehandles across `umount` and
remount of the same mount point, so after `umount → kill nfs server →
restart → mount`, the kernel keeps using the cached handles. The server
returns NFS3ERR_STALE, and the macOS client *silently drops the write*
(WRITE RPC never reaches the server, dd reports success, an explicit
fsync(2) on the file returns ESTALE).

Symptom in practice: opening an existing file for write after a remount
on macOS appears to succeed but the data never lands in the bucket.

Override `id_to_fh` / `fh_to_id` on NFSAdapter to use a generation
number derived from the mount source identifier (`bucket/<id>` or
`<type>/<repo>/<rev>`) via FNV-1a. Same source → same gen across
restarts, so cached client handles stay valid. Different sources →
different gens, so a handle from bucket A is properly rejected when
mounting bucket B.

Std's `DefaultHasher` is per-process randomized, which would defeat the
whole point — hence the inline FNV-1a (no new deps).

Tests cover: determinism, cross-source rejection, malformed handles,
and the simulated-restart round-trip that locks in the regression fix.
@github-actions

Contributor

POSIX Compliance (pjdfstest)

============================================================
  pjdfstest POSIX Compliance Results
------------------------------------------------------------
  Files: 130/130 passed    Tests: 832 total (0 subtests failed)
  Result: PASS
------------------------------------------------------------
  Category               Passed    Total   Status
  -------------------- -------- -------- --------
  chflags                     5        5       OK
  chmod                       8        8       OK
  chown                       6        6       OK
  ftruncate                  13       13       OK
  granular                    5        5       OK
  mkdir                       9        9       OK
  open                       19       19       OK
  posix_fallocate             1        1       OK
  rename                     10       10       OK
  rmdir                      11       11       OK
  symlink                    10       10       OK
  truncate                   13       13       OK
  unlink                     11       11       OK
  utimensat                   9        9       OK
============================================================

@github-actions

Contributor

Benchmark Results

============================================================
  Benchmark — 50MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    244.7 MB/s     216.1 MB/s
  Sequential re-read                2216.4 MB/s    2246.4 MB/s
  Range read (1MB@25MB)                0.5 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1472.5 MB/s
  Close latency (CAS+Hub)            0.118 s
  Write end-to-end                   330.1 MB/s
  Dedup write                       1794.8 MB/s
  Dedup close latency                0.103 s
  Dedup end-to-end                   382.2 MB/s
============================================================
============================================================
  Benchmark — 200MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    846.9 MB/s     915.7 MB/s
  Sequential re-read                2207.0 MB/s    2389.9 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1719.9 MB/s
  Close latency (CAS+Hub)            0.133 s
  Write end-to-end                   801.3 MB/s
  Dedup write                       1770.4 MB/s
  Dedup close latency                0.111 s
  Dedup end-to-end                   894.5 MB/s
============================================================
============================================================
  Benchmark — 500MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                   1459.8 MB/s    1303.3 MB/s
  Sequential re-read                2230.7 MB/s    2395.1 MB/s
  Range read (1MB@25MB)                0.2 ms         0.2 ms
  Random reads (100x4KB avg)           0.0 ms         0.0 ms
  Sequential write (FUSE)           1564.3 MB/s
  Close latency (CAS+Hub)            0.144 s
  Write end-to-end                  1079.5 MB/s
  Dedup write                       1570.2 MB/s
  Dedup close latency                0.144 s
  Dedup end-to-end                  1081.2 MB/s
============================================================
============================================================
  fio Benchmark Results
------------------------------------------------------------
  Job                        FUSE MB/s   NFS MB/s  FUSE IOPS   NFS IOPS
  ------------------------- ---------- ---------- ---------- ----------
  seq-read-100M                  520.8      476.2                      
  seq-reread-100M               2439.0       33.9                      
  rand-read-4k-100M                0.1        0.1         17         19
  seq-read-5x10M                 694.4      819.7                      
  rand-read-10x1M                  0.1        0.1         37         37
  Random Read Latency           FUSE avg      NFS avg
  ------------------------- ------------ ------------
  rand-read-4k-100M           58854.0 us   52080.1 us
  rand-read-10x1M             27031.8 us   27293.5 us
============================================================

@XciD

XciD commented May 22, 2026

Member Author

Closing — root cause analysis was wrong.

I attributed the symptom to stale filehandles cached by macOS NFS client across umount. After actually testing this fix on macOS, the bug still reproduces. Reading the log carefully revealed the real mechanism:

  1. macOS NFS sends a READ first (during `ls` / stat)
  2. hf-mount opens a Lazy (read-only) handle, pools it as `fh=1`
  3. macOS NFS sends a WRITE
  4. `nfs.rs::write()` peeks the pool → finds `fh=1`
  5. `virtual_fs.write(fh=1, ...)` returns `EBADF` (Lazy can't write)
  6. `errno_to_nfs(EBADF)` maps to `NFS3ERR_STALE` (nfs.rs:833)
  7. macOS receives ESTALE → drops the write silently

The bug is server-side: `nfs.rs::write()` on main doesn't upgrade a read-only pool handle when WRITE arrives. PR #41 already adds the upgrade path (`Err(EBADF) => evict + open(writable=true) + retry`). So this issue is fixed once #41 lands.
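The failure chain in steps 4 to 7 can be sketched as follows. This is a simplified model, not the code in `nfs.rs`: the `Nfsstat3` enum is a hypothetical two-variant subset (numeric values from RFC 1813), `write_pre_fix` is an invented name, and collapsing every other errno to an I/O error is an assumption made for brevity.

```rust
/// Hypothetical subset of NFSv3 status codes (values per RFC 1813).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Nfsstat3 {
    Io = 5,
    Stale = 70,
}

/// POSIX "bad file descriptor" errno.
const EBADF: i32 = 9;

/// Sketch of the mapping described above: EBADF surfaces as NFS3ERR_STALE.
/// Assumption: all other errnos are collapsed to Io in this model.
pub fn errno_to_nfs(errno: i32) -> Nfsstat3 {
    match errno {
        EBADF => Nfsstat3::Stale,
        _ => Nfsstat3::Io,
    }
}

/// Pre-fix behavior: WRITE is forwarded to whatever handle is pooled,
/// so a read-only (Lazy) handle turns into a STALE reply to the client.
pub fn write_pre_fix(pooled_handle_writable: bool) -> Result<usize, Nfsstat3> {
    let vfs_result: Result<usize, i32> = if pooled_handle_writable {
        Ok(1) // pretend one byte was written
    } else {
        Err(EBADF) // a Lazy handle cannot write
    };
    vfs_result.map_err(errno_to_nfs)
}
```

In this model, `write_pre_fix(false)` returns `Err(Nfsstat3::Stale)`, which is the reply that leads the macOS client to drop the write.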

The generation-number-from-source change in this PR is orthogonal and doesn't address anything currently observable. Could be useful defensively, but not as a fix for this bug — leaving it out.

@XciD XciD closed this May 22, 2026
XciD added a commit that referenced this pull request May 22, 2026
#178)

## Summary

NFS WRITEs on macOS were silently dropping bytes when the file had been
READ first. Root cause is server-side: a read-only pooled handle from a
prior NFS READ returns `EBADF` on `virtual_fs.write`, which
`errno_to_nfs` maps to `NFS3ERR_STALE`. macOS NFS treats STALE on WRITE
as fatal and silently flushes its write buffer — `dd` reports success
but the bytes never reach the server. `fsync(2)` later returns ESTALE.

## Fix

`nfs.rs::write()`:

1. **Fast path EBADF**: handle was opened read-only. Remove the
now-stale entry from the pool (guarded by `peek == Some(fh)` so a
concurrent successful upgrader isn't clobbered), evict, fall through.
2. **Slow path**: `open(writable=true)` → `pwrite` → only THEN
`insert_handle`. The freshly-opened fh stays private to this task until
its write commits. No other task can release it via `insert_handle`'s
`replaced` eviction while we're between insert and pwrite — because we
don't insert until after the pwrite is done.

Mirrors the existing EBADF retry pattern in `read()` (nfs.rs:181-200).
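A single-threaded sketch of the two paths, under stated assumptions: `FakeVfs`, `nfs_write`, and the plain `HashMap` pool are hypothetical stand-ins for the real `VirtualFs`, `nfs.rs::write()`, and handle pool, and the concurrency guard is trivially true here because nothing else runs in between.

```rust
use std::collections::HashMap;

const EBADF: i32 = 9;

/// Hypothetical stand-in for the virtual filesystem: fh -> writable flag.
#[derive(Default)]
pub struct FakeVfs {
    next_fh: u64,
    open: HashMap<u64, bool>,
    pub bytes_written: usize,
}

impl FakeVfs {
    pub fn open(&mut self, writable: bool) -> u64 {
        self.next_fh += 1;
        self.open.insert(self.next_fh, writable);
        self.next_fh
    }
    pub fn release(&mut self, fh: u64) {
        self.open.remove(&fh);
    }
    /// Fails with EBADF for unknown or read-only handles.
    pub fn write(&mut self, fh: u64, data: &[u8]) -> Result<usize, i32> {
        if self.open.get(&fh).copied() == Some(true) {
            self.bytes_written += data.len();
            Ok(data.len())
        } else {
            Err(EBADF)
        }
    }
}

/// `pool` maps inode -> pooled fh.
pub fn nfs_write(
    vfs: &mut FakeVfs,
    pool: &mut HashMap<u64, u64>,
    ino: u64,
    data: &[u8],
) -> Result<usize, i32> {
    // Fast path: try the pooled handle first.
    if let Some(fh) = pool.get(&ino).copied() {
        match vfs.write(fh, data) {
            Ok(n) => return Ok(n),
            Err(EBADF) => {
                // Read-only handle: drop the pool entry only if it still
                // points at this fh, then release it and fall through.
                if pool.get(&ino) == Some(&fh) {
                    pool.remove(&ino);
                }
                vfs.release(fh);
            }
            Err(e) => return Err(e),
        }
    }
    // Slow path: open writable, write, and only then publish to the pool.
    let fh = vfs.open(true);
    let n = match vfs.write(fh, data) {
        Ok(n) => n,
        Err(e) => {
            vfs.release(fh);
            return Err(e);
        }
    };
    if let Some(replaced) = pool.insert(ino, fh) {
        vfs.release(replaced);
    }
    Ok(n)
}
```

A first write against a pooled read-only handle goes through the slow path and replaces the pool entry; later writes on the same inode hit the fast path with the upgraded handle.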

## Reproducer (pre-fix)

```bash
hf-mount-nfs bucket X/y /mnt
ls /mnt                                              # triggers READ → pools Lazy handle
dd if=/dev/urandom of=/mnt/existing-file bs=1M count=1 seek=100 conv=notrunc,fsync
# dd reports success; file in bucket is unchanged.
python3 -c "import os; fd=os.open('/mnt/existing-file', os.O_RDWR); os.fsync(fd)"
# OSError: [Errno 70] Stale NFS file handle
```

Server log shows:
```
open: ino=2, writable=false   ← from the READ
read: fh=1, offset=...
write: ino=2, fh=1, offset=104857600, len=1048576   ← EBADF returned, mapped to STALE
(no further server activity; macOS dropped the write buffer)
```

## End-to-end validation (post-fix, macOS NFS mount)

Sequential read+write:
```
write: ino=4, fh=1 → EBADF
release: fh=1                ← pool entry removed
open: writable=true           ← slow path
write: ino=4, fh=2, offset=0  ← succeeds on fresh fh
```

8 concurrent `dd` runs at distinct offsets on the same file (previously
prone to hanging):
```
all 8 dd done in .009s
8 writes server-side on fh=2 (fast path reused after initial upgrade)
zero premature releases, zero NFS3ERR_STALE
```

Python fsync:
```
fsync OK ✓     # before fix: [Errno 70] Stale NFS file handle
```

## Concurrent-write race (worth calling out)

An earlier draft of this PR added a per-inode `tokio::sync::Mutex`
around the slow path. Adversarial review (Codex) pointed out the
correct, lighter defense: **publish to the pool only after the first
write succeeds**. With insert_handle deferred until after pwrite, the fh
is unreachable by other tasks during its critical window — no mutex
needed, race closed by invariant.

Concretely, if two NFS WRITE RPCs on the same ino both peek a Lazy
handle and both hit EBADF:
- Both reach the slow path; `virtual_fs.open(writable=true)` is
internally serialized by VirtualFs's per-ino staging lock, so they
produce distinct fh_A and fh_B.
- Pre-fix (insert-then-write): writer A inserts fh_A → writer B inserts
fh_B, `replaced=fh_A` → evict_handle releases fh_A. Writer A then
pwrites against fh_A → EBADF → STALE. Same silent-data-loss this PR is
fixing.
- Post-fix (write-then-insert): writer A pwrites to fh_A in private (no
one else knows about it), then inserts. Writer B does the same with
fh_B. Both pwrites succeed; the loser of the insert race has its fh
released by the winner's insert, but its bytes are already in the
staging file.
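The two orderings can be replayed deterministically with a small model. This sketch is an illustration only: `Model`, `open_writable`, and `pwrite` are hypothetical stand-ins, the interleaving is hard-coded instead of using real tasks, and `insert_handle` is reduced to "replace the pool entry and release the handle it displaced".

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical model: live writable handles plus one pooled fh per inode.
#[derive(Default)]
pub struct Model {
    next_fh: u64,
    live: HashSet<u64>,
    pool: HashMap<u64, u64>,
}

impl Model {
    pub fn open_writable(&mut self) -> u64 {
        self.next_fh += 1;
        self.live.insert(self.next_fh);
        self.next_fh
    }
    /// A write succeeds only while the handle has not been released.
    pub fn pwrite(&self, fh: u64) -> Result<(), &'static str> {
        if self.live.contains(&fh) {
            Ok(())
        } else {
            Err("EBADF")
        }
    }
    /// Publish `fh` for `ino`; the handle it replaces is released.
    pub fn insert_handle(&mut self, ino: u64, fh: u64) {
        if let Some(replaced) = self.pool.insert(ino, fh) {
            self.live.remove(&replaced);
        }
    }
}

/// Pre-fix ordering: both writers publish before either one writes.
pub fn insert_then_write() -> (Result<(), &'static str>, Result<(), &'static str>) {
    let mut m = Model::default();
    let (fh_a, fh_b) = (m.open_writable(), m.open_writable());
    m.insert_handle(1, fh_a);
    m.insert_handle(1, fh_b); // displaces and releases fh_a
    (m.pwrite(fh_a), m.pwrite(fh_b))
}

/// Post-fix ordering: each writer writes privately, then publishes.
pub fn write_then_insert() -> (Result<(), &'static str>, Result<(), &'static str>) {
    let mut m = Model::default();
    let (fh_a, fh_b) = (m.open_writable(), m.open_writable());
    let results = (m.pwrite(fh_a), m.pwrite(fh_b));
    m.insert_handle(1, fh_a);
    m.insert_handle(1, fh_b); // fh_a is released, but its write already landed
    results
}
```

In the first ordering writer A's `pwrite` fails because its handle was released by B's insert; in the second, both writes complete before any handle becomes reachable by the other writer.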

## Relation to other PRs

- **PR #41 (sparse writes)** contains the same EBADF→upgrade fix in its
nfs.rs. This PR isolates the change so it can land independently. When
#41 merges, this PR becomes a no-op merge. The write-before-publish
reordering here will need to be ported into #41's nfs.rs (the same race
exists there).
- **PR #177 (stable filehandles)** was an earlier misdiagnosis
attributing the symptom to macOS client-side filehandle caching across
umount. Closed.

## Tests

- `write_after_read_upgrades_handle_instead_of_returning_stale` — main
regression test (READ then WRITE).
- `second_write_reuses_writable_handle` — fast path is reused after
upgrade.
- `write_without_prior_read_opens_writable_directly` — slow path
standalone.
- Verified by reverting just the `write()` body to main's version: tests
fail with `NFS3ERR_STALE`. Tests are load-bearing for the regression.
- `cargo test --features fuse,nfs --lib` → 342/342 pass.
- `cargo clippy --features nfs --all-targets -- -D warnings` → clean.
- End-to-end validated against `XciD/hf-mount-test` bucket on macOS NFS
as described above.
@XciD XciD deleted the fix/nfs-stable-filehandles branch June 28, 2026 09:46