WAL & recovery robustness — short-write handling, reader recursion, decode panics #42

@hollanf

Description

Three distinct correctness hazards in the WAL write, read, and columnar recovery paths. Each can corrupt data, crash the process, or wedge recovery; they are grouped as an epic because a single defensive-coding sweep covers all three.


1. WalWriter::flush_buffer ignores short writes and mis-advances file_offset under O_DIRECT

File: nodedb-wal/src/writer.rs:305-340

fn flush_buffer(&mut self) -> Result<()> {
    if self.buffer.is_empty() { return Ok(()); }

    let data = if self.config.use_direct_io {
        // O_DIRECT requires aligned I/O size.
        self.buffer.as_aligned_slice()
    } else {
        self.buffer.as_slice()
    };

    #[cfg(unix)]
    {
        use std::os::unix::io::AsRawFd;
        let fd = self.file.as_raw_fd();
        let written = unsafe {
            libc::pwrite(fd, data.as_ptr() as *const libc::c_void,
                         data.len(), self.file_offset as libc::off_t)
        };
        if written < 0 {                                   // ← only negative checked
            return Err(WalError::Io(std::io::Error::last_os_error()));
        }
    }

    self.file_offset += self.buffer.len() as u64;          // ← buffer.len(), not data.len()
    self.buffer.clear();
    Ok(())
}

Three related defects in the durability path:

  1. Short writes not detected. pwrite can return a positive value less than data.len() (ENOSPC, signal interruption, etc.). The code only checks < 0, so a short write silently loses the tail of the buffer while the writer proceeds as if everything succeeded.
  2. file_offset uses buffer.len() instead of data.len(). When use_direct_io = true, data = self.buffer.as_aligned_slice() which is padded up to the alignment boundary (typically 512/4096 bytes). The pwrite writes the aligned length, but the offset advances by the unaligned buffer length. The next flush_buffer then issues pwrite(..., file_offset=unaligned), which O_DIRECT rejects with EINVAL on any real filesystem (ext4 / xfs on NVMe). Even absent the error, the record stream on disk becomes unreadable because the following record overlaps the previous padding.
  3. No retry on partial write. Even in the well-behaved case, a single partial pwrite leaves the writer wedged at an inconsistent offset.
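A retry loop of the shape checklist item 1 asks for might look like the sketch below. This is illustrative, not the crate's code: `pwrite_all` is a hypothetical helper built on std's safe `FileExt::write_at` (the counterpart of the raw `libc::pwrite` call above), and the caller would advance `file_offset` by the aligned `data.len()` only after it returns `Ok`.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Write all of `data` at `offset`, retrying on short writes and EINTR.
/// Hypothetical helper; in flush_buffer the io::Error would be wrapped
/// in the crate's typed WalError instead.
fn pwrite_all(file: &File, mut data: &[u8], mut offset: u64) -> io::Result<()> {
    while !data.is_empty() {
        match file.write_at(data, offset) {
            // A zero-byte write would loop forever; surface it as an error.
            Ok(0) => {
                return Err(io::Error::new(
                    io::ErrorKind::WriteZero,
                    "pwrite wrote 0 bytes",
                ))
            }
            // Short write: advance past the written prefix and retry the tail.
            Ok(n) => {
                data = &data[n..];
                offset += n as u64;
            }
            // Interrupted by a signal before any bytes were written: retry.
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```

Under O_DIRECT, each retried chunk must itself stay aligned, so in practice the retry would re-issue the whole aligned slice or fail fast; the loop above shows the buffered-I/O case.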

2. WalReader::next_record / MmapRecordIter::next_record recurse on unknown-optional records — stack overflow DoS

Files:

// reader.rs
if RecordType::from_raw(logical_type).is_none() {
    if RecordType::is_required(logical_type) {
        return Err(WalError::UnknownRequiredRecordType { ... });
    }
    // Unknown optional record — skip it and continue.
    // (The record is already consumed, so just recurse.)
    return self.next_record();
}

The comment explicitly acknowledges self-recursion. Rust does not guarantee tail-call optimization; debug builds reliably do not TCO and release builds do so only opportunistically. A WAL with many consecutive unknown-optional records (straightforward after a forward-compat rollback: a newer binary introduced a new non-required record type, then got rolled back) exhausts the thread stack during recover() at process startup — the server panics and refuses to boot.

A crafted segment file with N records typed as an unknown-but-optional type (CRC valid, bit 15 of record_type clear for "optional") reliably reproduces this. N = ~30–50 k on Linux's 8 MB stack; ~2–4 k on macOS non-main threads.

Same pattern exists in both the streaming WalReader and the mmap-based reader.
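The fix in checklist item 2 is mechanical: replace the tail call with a loop. The self-contained model below shows the shape with stand-in types (the real code would use the crate's `RecordType` / `WalError`); the point is that skipping N consecutive unknown-optional records now uses O(1) stack.

```rust
// Bit 15 set = required, clear = optional (per the repro note above).
const REQUIRED_BIT: u16 = 0x8000;

#[derive(Debug, PartialEq)]
enum ReadError {
    UnknownRequired(u16),
}

// Stand-in for WalReader / MmapRecordIter over already-decoded raw types.
struct Reader {
    records: Vec<u16>,
    pos: usize,
}

impl Reader {
    // Pretend logical types 0..4 are known; everything else is unknown.
    fn is_known(t: u16) -> bool {
        (t & !REQUIRED_BIT) < 4
    }

    /// Loop, not self-recursion: unknown-optional records are consumed
    /// and skipped without growing the stack.
    fn next_record(&mut self) -> Option<Result<u16, ReadError>> {
        loop {
            let raw = *self.records.get(self.pos)?;
            self.pos += 1;
            if Self::is_known(raw) {
                return Some(Ok(raw));
            }
            if raw & REQUIRED_BIT != 0 {
                return Some(Err(ReadError::UnknownRequired(raw)));
            }
            // Unknown optional: already consumed, continue the loop.
        }
    }
}
```

With the recursive version, a segment of a million unknown-optional records overflows the stack in a debug build; the loop handles it in constant space.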


3. decode_row_from_wal panics on crafted / truncated WAL payloads (unchecked slice indexing)

File: nodedb-columnar/src/wal_record.rs:175-252

let tag = data[cursor];                                          // ← panic on empty slice
cursor += 1;
let value = match tag {
    ...
    4 | 5 | 8 => {
        let len = u32::from_le_bytes(data[cursor..cursor + 4].try_into().map_err(...)?) as usize;
        cursor += 4;
        let bytes = &data[cursor..cursor + len];                 // ← panic if cursor+len > data.len()
        cursor += len;
        ...
    }
    6 => {
        let micros = i64::from_le_bytes(data[cursor..cursor + 8].try_into().map_err(...)?);
        cursor += 8;
        ...
    }
    7 => {
        let mut bytes = [0u8; 16];
        bytes.copy_from_slice(&data[cursor..cursor + 16]);       // ← panic on truncation
        cursor += 16;
        Value::Decimal(rust_decimal::Decimal::deserialize(bytes))
    }
    9 => {
        let count = u32::from_le_bytes(data[cursor..cursor + 4].try_into().map_err(...)?) as usize;
        cursor += 4;
        let mut arr = Vec::with_capacity(count);                 // ← OOM on count = 0xFFFFFFFF
        for _ in 0..count {
            let f = f32::from_le_bytes(data[cursor..cursor + 4].try_into().map_err(...)?);
            cursor += 4;
            arr.push(Value::Float(f as f64));
        }
        ...
    }
    10 => {
        let len = u32::from_le_bytes(data[cursor..cursor + 4].try_into().map_err(...)?) as usize;
        cursor += 4;
        let json_bytes = &data[cursor..cursor + len];            // ← panic if len huge
        ...
    }
    ...
};

The try_into calls on fixed-length headers are bounds-checked (they return Err on a slice-length mismatch), but the subsequent payload reads are raw slice indexing. A record with tag = 4 and len = 0xFFFFFFFF, a truncated record of any variable-length type, or a tag-9 Array with count = 0xFFFFFFFF produces either an index-out-of-bounds panic or a multi-GB Vec allocation.

Since WAL files can be partially written at crash time, a normal crash-recovery scenario becomes a crash loop — the process panics during replay, restarts, and panics again on the same segment.
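The `read_slice(data, &mut cursor, n)` helper from checklist item 3 could be sketched as below. The error type and the `MAX_FIELD_LEN` cap are assumptions for illustration; the crate would use its own error enum and pick its own limit.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    Truncated { need: usize, have: usize },
    LengthTooLarge(usize),
}

/// Cap on any single variable-length field before allocating.
/// The exact value is an assumption, not a number from the crate.
const MAX_FIELD_LEN: usize = 16 * 1024 * 1024;

/// Bounds-checked cursor read: returns Err instead of panicking when the
/// requested range runs past the end of the payload.
fn read_slice<'a>(
    data: &'a [u8],
    cursor: &mut usize,
    n: usize,
) -> Result<&'a [u8], DecodeError> {
    if n > MAX_FIELD_LEN {
        return Err(DecodeError::LengthTooLarge(n));
    }
    let end = cursor
        .checked_add(n)
        .filter(|&e| e <= data.len())
        .ok_or(DecodeError::Truncated {
            need: n,
            have: data.len().saturating_sub(*cursor),
        })?;
    let s = &data[*cursor..end];
    *cursor = end;
    Ok(s)
}

/// How a length-prefixed field (tags 4/5/8/10) would route through it.
fn read_len_prefixed<'a>(
    data: &'a [u8],
    cursor: &mut usize,
) -> Result<&'a [u8], DecodeError> {
    let len_bytes = read_slice(data, cursor, 4)?;
    let len = u32::from_le_bytes(len_bytes.try_into().unwrap()) as usize;
    read_slice(data, cursor, len)
}
```

For tag 9, the same cap applies to `count` before the `Vec::with_capacity` call, so a crafted count of 0xFFFFFFFF fails decoding instead of attempting a multi-GB allocation.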


Checklist

  • 1. Change pwrite to a retry loop that handles short writes; advance file_offset by data.len() (actual amount issued), not buffer.len(). Surface unrecoverable short-write as a typed error.
  • 2. Replace self-recursion in reader.rs and mmap_reader.rs with a loop. (Both files.)
  • 3. Introduce a read_slice(data, &mut cursor, n) -> Result<&[u8], _> helper in wal_record.rs and route every payload read through it; also cap variable-length fields (len, count) at a sane maximum before allocating.

Notes

  • Found during a CPU/memory + correctness audit sweep of nodedb-wal/ and nodedb-columnar/.
  • All three items are independently reproducible with small hand-crafted test inputs; checkboxes let PRs close them one-by-one.
