WAL & recovery robustness — short-write handling, reader recursion, decode panics #42

@hollanf

Description

Three distinct correctness hazards in the WAL write, read, and columnar recovery paths. Each can corrupt data, crash the process, or wedge recovery; they are grouped as an epic because a single defensive-coding sweep covers all three.


1. WalWriter::flush_buffer ignores short writes and mis-advances file_offset under O_DIRECT

File: nodedb-wal/src/writer.rs:305-340

fn flush_buffer(&mut self) -> Result<()> {
    if self.buffer.is_empty() { return Ok(()); }

    let data = if self.config.use_direct_io {
        // O_DIRECT requires aligned I/O size.
        self.buffer.as_aligned_slice()
    } else {
        self.buffer.as_slice()
    };

    #[cfg(unix)]
    {
        use std::os::unix::io::AsRawFd;
        let fd = self.file.as_raw_fd();
        let written = unsafe {
            libc::pwrite(fd, data.as_ptr() as *const libc::c_void,
                         data.len(), self.file_offset as libc::off_t)
        };
        if written < 0 {                                   // ← only negative checked
            return Err(WalError::Io(std::io::Error::last_os_error()));
        }
    }

    self.file_offset += self.buffer.len() as u64;          // ← buffer.len(), not data.len()
    self.buffer.clear();
    Ok(())
}

Three related defects in the durability path:

  1. Short writes not detected. pwrite can return a positive value less than data.len() (ENOSPC, signal interruption, etc.). The code only checks < 0, so a short write silently loses the tail of the buffer while the writer proceeds as if everything succeeded.
  2. file_offset uses buffer.len() instead of data.len(). When use_direct_io = true, data = self.buffer.as_aligned_slice() which is padded up to the alignment boundary (typically 512/4096 bytes). The pwrite writes the aligned length, but the offset advances by the unaligned buffer length. The next flush_buffer then issues pwrite(..., file_offset=unaligned), which O_DIRECT rejects with EINVAL on any real filesystem (ext4 / xfs on NVMe). Even absent the error, the record stream on disk becomes unreadable because the following record overlaps the previous padding.
  3. No retry on partial write. Even in the well-behaved case, a single partial pwrite leaves the writer wedged at an inconsistent offset.
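A retry loop of the shape checklist item 1 asks for might look like the sketch below. This is illustrative, not the crate's code: `pwrite_all` is a hypothetical helper built on std's safe `FileExt::write_at` (the counterpart of the raw `libc::pwrite` call above), and the caller would advance `file_offset` by the aligned `data.len()` only after it returns `Ok`.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Write all of `data` at `offset`, retrying on short writes and EINTR.
/// Hypothetical helper; in flush_buffer the io::Error would be wrapped
/// in the crate's typed WalError instead.
fn pwrite_all(file: &File, mut data: &[u8], mut offset: u64) -> io::Result<()> {
    while !data.is_empty() {
        match file.write_at(data, offset) {
            // A zero-byte write would loop forever; surface it as an error.
            Ok(0) => {
                return Err(io::Error::new(
                    io::ErrorKind::WriteZero,
                    "pwrite wrote 0 bytes",
                ))
            }
            // Short write: advance past the written prefix and retry the tail.
            Ok(n) => {
                data = &data[n..];
                offset += n as u64;
            }
            // Interrupted by a signal before any bytes were written: retry.
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```

Under O_DIRECT, each retried chunk must itself stay aligned, so in practice the retry would re-issue the whole aligned slice or fail fast; the loop above shows the buffered-I/O case.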

2. WalReader::next_record / MmapRecordIter::next_record recurse on unknown-optional records — stack overflow DoS

Files:

// reader.rs
if RecordType::from_raw(logical_type).is_none() {
    if RecordType::is_required(logical_type) {
        return Err(WalError::UnknownRequiredRecordType { ... });
    }
    // Unknown optional record — skip it and continue.
    // (The record is already consumed, so just recurse.)
    return self.next_record();
}

The comment explicitly acknowledges self-recursion. Rust does not guarantee tail-call optimization; debug builds reliably do not TCO and release builds do so only opportunistically. A WAL with many consecutive unknown-optional records (straightforward after a forward-compat rollback: a newer binary introduced a new non-required record type, then got rolled back) exhausts the thread stack during recover() at process startup — the server panics and refuses to boot.

A crafted segment file with N records typed as an unknown-but-optional type (CRC valid, bit 15 of record_type clear for "optional") reliably reproduces this. N = ~30–50 k on Linux's 8 MB stack; ~2–4 k on macOS non-main threads.

Same pattern exists in both the streaming WalReader and the mmap-based reader.
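The fix in checklist item 2 is mechanical: replace the tail call with a loop. The self-contained model below shows the shape with stand-in types (the real code would use the crate's `RecordType` / `WalError`); the point is that skipping N consecutive unknown-optional records now uses O(1) stack.

```rust
// Bit 15 set = required, clear = optional (per the repro note above).
const REQUIRED_BIT: u16 = 0x8000;

#[derive(Debug, PartialEq)]
enum ReadError {
    UnknownRequired(u16),
}

// Stand-in for WalReader / MmapRecordIter over already-decoded raw types.
struct Reader {
    records: Vec<u16>,
    pos: usize,
}

impl Reader {
    // Pretend logical types 0..4 are known; everything else is unknown.
    fn is_known(t: u16) -> bool {
        (t & !REQUIRED_BIT) < 4
    }

    /// Loop, not self-recursion: unknown-optional records are consumed
    /// and skipped without growing the stack.
    fn next_record(&mut self) -> Option<Result<u16, ReadError>> {
        loop {
            let raw = *self.records.get(self.pos)?;
            self.pos += 1;
            if Self::is_known(raw) {
                return Some(Ok(raw));
            }
            if raw & REQUIRED_BIT != 0 {
                return Some(Err(ReadError::UnknownRequired(raw)));
            }
            // Unknown optional: already consumed, continue the loop.
        }
    }
}
```

With the recursive version, a segment of a million unknown-optional records overflows the stack in a debug build; the loop handles it in constant space.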


3. decode_row_from_wal panics on crafted / truncated WAL payloads (unchecked slice indexing)

File: nodedb-columnar/src/wal_record.rs:175-252

let tag = data[cursor];                                          // ← panic on empty slice
cursor += 1;
let value = match tag {
    ...
    4 | 5 | 8 => {
        let len = u32::from_le_bytes(data[cursor..cursor + 4].try_into().map_err(...)?) as usize;
        cursor += 4;
        let bytes = &data[cursor..cursor + len];                 // ← panic if cursor+len > data.len()
        cursor += len;
        ...
    }
    6 => {
        let micros = i64::from_le_bytes(data[cursor..cursor + 8].try_into().map_err(...)?);
        cursor += 8;
        ...
    }
    7 => {
        let mut bytes = [0u8; 16];
        bytes.copy_from_slice(&data[cursor..cursor + 16]);       // ← panic on truncation
        cursor += 16;
        Value::Decimal(rust_decimal::Decimal::deserialize(bytes))
    }
    9 => {
        let count = u32::from_le_bytes(data[cursor..cursor + 4].try_into().map_err(...)?) as usize;
        cursor += 4;
        let mut arr = Vec::with_capacity(count);                 // ← OOM on count = 0xFFFFFFFF
        for _ in 0..count {
            let f = f32::from_le_bytes(data[cursor..cursor + 4].try_into().map_err(...)?);
            cursor += 4;
            arr.push(Value::Float(f as f64));
        }
        ...
    }
    10 => {
        let len = u32::from_le_bytes(data[cursor..cursor + 4].try_into().map_err(...)?) as usize;
        cursor += 4;
        let json_bytes = &data[cursor..cursor + len];            // ← panic if len huge
        ...
    }
    ...
};

The try_into calls on fixed-length headers are bounds-checked (they return Err on a slice-length mismatch), but the subsequent payload reads are raw slice indexing. A record with tag = 4 and len = 0xFFFFFFFF, a truncated record of any variable-length type, or a tag-9 Array with count = 0xFFFFFFFF produces either an index-out-of-bounds panic or a multi-GB Vec allocation.

Since WAL files can be partially written at crash time, a normal crash-recovery scenario becomes a crash loop — the process panics during replay, restarts, and panics again on the same segment.
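The `read_slice(data, &mut cursor, n)` helper from checklist item 3 could be sketched as below. The error type and the `MAX_FIELD_LEN` cap are assumptions for illustration; the crate would use its own error enum and pick its own limit.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    Truncated { need: usize, have: usize },
    LengthTooLarge(usize),
}

/// Cap on any single variable-length field before allocating.
/// The exact value is an assumption, not a number from the crate.
const MAX_FIELD_LEN: usize = 16 * 1024 * 1024;

/// Bounds-checked cursor read: returns Err instead of panicking when the
/// requested range runs past the end of the payload.
fn read_slice<'a>(
    data: &'a [u8],
    cursor: &mut usize,
    n: usize,
) -> Result<&'a [u8], DecodeError> {
    if n > MAX_FIELD_LEN {
        return Err(DecodeError::LengthTooLarge(n));
    }
    let end = cursor
        .checked_add(n)
        .filter(|&e| e <= data.len())
        .ok_or(DecodeError::Truncated {
            need: n,
            have: data.len().saturating_sub(*cursor),
        })?;
    let s = &data[*cursor..end];
    *cursor = end;
    Ok(s)
}

/// How a length-prefixed field (tags 4/5/8/10) would route through it.
fn read_len_prefixed<'a>(
    data: &'a [u8],
    cursor: &mut usize,
) -> Result<&'a [u8], DecodeError> {
    let len_bytes = read_slice(data, cursor, 4)?;
    let len = u32::from_le_bytes(len_bytes.try_into().unwrap()) as usize;
    read_slice(data, cursor, len)
}
```

For tag 9, the same cap applies to `count` before the `Vec::with_capacity` call, so a crafted count of 0xFFFFFFFF fails decoding instead of attempting a multi-GB allocation.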


Checklist

  • 1. Change pwrite to a retry loop that handles short writes; advance file_offset by data.len() (actual amount issued), not buffer.len(). Surface unrecoverable short-write as a typed error.
  • 2. Replace self-recursion in reader.rs and mmap_reader.rs with a loop. (Both files.)
  • 3. Introduce a read_slice(data, &mut cursor, n) -> Result<&[u8], _> helper in wal_record.rs and route every payload read through it; also cap variable-length fields (len, count) at a sane maximum before allocating.

Notes

  • Found during a CPU/memory + correctness audit sweep of nodedb-wal/ and nodedb-columnar/.
  • All three items are independently reproducible with small hand-crafted test inputs; checkboxes let PRs close them one-by-one.
