Three distinct correctness hazards in the WAL write, read, and columnar recovery paths. Each can corrupt, crash, or wedge the recovery step; grouped as an epic because a single defensive-coding sweep can cover all three.
1. `WalWriter::flush_buffer` ignores short writes and mis-advances `file_offset` under `O_DIRECT`

File: `nodedb-wal/src/writer.rs:305-340`
```rust
fn flush_buffer(&mut self) -> Result<()> {
    if self.buffer.is_empty() { return Ok(()); }
    let data = if self.config.use_direct_io {
        // O_DIRECT requires aligned I/O size.
        self.buffer.as_aligned_slice()
    } else {
        self.buffer.as_slice()
    };
    #[cfg(unix)]
    {
        use std::os::unix::io::AsRawFd;
        let fd = self.file.as_raw_fd();
        let written = unsafe {
            libc::pwrite(fd, data.as_ptr() as *const libc::c_void,
                         data.len(), self.file_offset as libc::off_t)
        };
        if written < 0 { // ← only negative checked
            return Err(WalError::Io(std::io::Error::last_os_error()));
        }
    }
    self.file_offset += self.buffer.len() as u64; // ← buffer.len(), not data.len()
    self.buffer.clear();
    Ok(())
}
```
Three related defects in the durability path:
- **Short writes not detected.** `pwrite` can return a positive value less than `data.len()` (ENOSPC, signal interruption, etc.). The code only checks `< 0`, so a short write silently loses the tail of the buffer while the writer proceeds as if everything succeeded.
- **`file_offset` advances by `buffer.len()` instead of `data.len()`.** When `use_direct_io = true`, `data = self.buffer.as_aligned_slice()`, which is padded up to the alignment boundary (typically 512/4096 bytes). The `pwrite` writes the aligned length, but the offset advances by the unaligned buffer length. The next `flush_buffer` then issues `pwrite(..., file_offset=unaligned)`, which `O_DIRECT` rejects with `EINVAL` on any real filesystem (ext4/xfs on NVMe). Even absent the error, the record stream on disk becomes unreadable because the following record overlaps the previous padding.
- **No retry on partial write.** Even in the well-behaved case, a single partial `pwrite` leaves the writer wedged at an inconsistent offset.
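All three defects reduce to the same defensive pattern: loop until the whole slice is on disk, and advance the offset by what was actually issued. A minimal sketch, assuming std's `FileExt::write_at` (which wraps `pwrite`) in place of the raw `libc` call; `write_all_at` is an illustrative helper, not the real `WalWriter` API:

```rust
use std::fs::File;
use std::io::{Error, ErrorKind, Result};
use std::os::unix::fs::FileExt;

/// Write all of `data` at `offset`, retrying short writes and EINTR.
/// Returns how many bytes the caller's file offset must advance by:
/// the length actually issued (`data.len()` on success), never the
/// unpadded buffer length.
fn write_all_at(file: &File, mut data: &[u8], mut offset: u64) -> Result<u64> {
    let start = offset;
    while !data.is_empty() {
        match file.write_at(data, offset) {
            // A zero-byte write would loop forever; surface it instead.
            Ok(0) => return Err(Error::new(ErrorKind::WriteZero, "pwrite returned 0")),
            // Short write: drop the prefix that landed and retry the tail.
            Ok(n) => {
                data = &data[n..];
                offset += n as u64;
            }
            // Interrupted before any bytes were written: retry.
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(offset - start)
}
```

In the `O_DIRECT` path, the returned count corresponds to the padded `data.len()`, which is exactly what `file_offset` must advance by so the next flush stays aligned.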
2. `WalReader::next_record` / `MmapRecordIter::next_record` recurse on unknown-optional records — stack overflow DoS
Files:
- `nodedb-wal/src/reader.rs:107-119`
- `nodedb-wal/src/mmap_reader.rs:97-108`

```rust
// reader.rs
if RecordType::from_raw(logical_type).is_none() {
    if RecordType::is_required(logical_type) {
        return Err(WalError::UnknownRequiredRecordType { ... });
    }
    // Unknown optional record — skip it and continue.
    // (The record is already consumed, so just recurse.)
    return self.next_record();
}
```
The comment explicitly acknowledges self-recursion. Rust does not guarantee tail-call optimization; debug builds reliably do not TCO and release builds do so only opportunistically. A WAL with many consecutive unknown-optional records (straightforward after a forward-compat rollback: a newer binary introduced a new non-required record type, then got rolled back) exhausts the thread stack during recover() at process startup — the server panics and refuses to boot.
A crafted segment file with N records typed as an unknown-but-optional type (CRC valid, bit 15 of record_type clear for "optional") reliably reproduces this. N = ~30–50 k on Linux's 8 MB stack; ~2–4 k on macOS non-main threads.
Same pattern exists in both the streaming WalReader and the mmap-based reader.
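Because Rust never guarantees TCO, the fix is mechanical: replace the tail call with a `loop`. The sketch below simulates the pattern with stand-in types (`Reader`, `RecordType`, and the bit-15 convention mirror the description above, not the actual nodedb-wal definitions); a run of 100,000 unknown-optional records now costs constant stack:

```rust
const REQUIRED_BIT: u16 = 0x8000; // bit 15 set ⇒ required, clear ⇒ optional

#[derive(Debug, PartialEq, Clone)]
enum RecordType { Put, Commit }

impl RecordType {
    fn from_raw(raw: u16) -> Option<RecordType> {
        match raw {
            1 => Some(RecordType::Put),
            2 => Some(RecordType::Commit),
            _ => None, // unknown to this binary
        }
    }
}

struct Reader {
    records: Vec<(u16, Vec<u8>)>, // (raw type, payload) already CRC-checked
    pos: usize,
}

impl Reader {
    fn next_record(&mut self) -> Result<Option<(RecordType, Vec<u8>)>, String> {
        // A loop, not self-recursion: skipping N unknown-optional records
        // never grows the call stack.
        loop {
            let Some((raw, payload)) = self.records.get(self.pos).cloned() else {
                return Ok(None); // end of segment
            };
            self.pos += 1;
            match RecordType::from_raw(raw) {
                Some(t) => return Ok(Some((t, payload))),
                None if raw & REQUIRED_BIT != 0 => {
                    return Err(format!("unknown required record type {raw:#06x}"))
                }
                None => continue, // unknown optional: skip and keep scanning
            }
        }
    }
}
```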
3. `decode_row_from_wal` panics on crafted / truncated WAL payloads (unchecked slice indexing)

File: `nodedb-columnar/src/wal_record.rs:175-252`
```rust
let tag = data[cursor]; // ← panic on empty slice
cursor += 1;
let value = match tag {
    ...
    4 | 5 | 8 => {
        let len = u32::from_le_bytes(data[cursor..cursor + 4].try_into().map_err(...)?) as usize;
        cursor += 4;
        let bytes = &data[cursor..cursor + len]; // ← panic if cursor + len > data.len()
        cursor += len;
        ...
    }
    6 => {
        let micros = i64::from_le_bytes(data[cursor..cursor + 8].try_into().map_err(...)?);
        cursor += 8;
        ...
    }
    7 => {
        let mut bytes = [0u8; 16];
        bytes.copy_from_slice(&data[cursor..cursor + 16]); // ← panic on truncation
        cursor += 16;
        Value::Decimal(rust_decimal::Decimal::deserialize(bytes))
    }
    9 => {
        let count = u32::from_le_bytes(data[cursor..cursor + 4].try_into().map_err(...)?) as usize;
        cursor += 4;
        let mut arr = Vec::with_capacity(count); // ← OOM on count = 0xFFFFFFFF
        for _ in 0..count {
            let f = f32::from_le_bytes(data[cursor..cursor + 4].try_into().map_err(...)?);
            cursor += 4;
            arr.push(Value::Float(f as f64));
        }
        ...
    }
    10 => {
        let len = u32::from_le_bytes(data[cursor..cursor + 4].try_into().map_err(...)?) as usize;
        cursor += 4;
        let json_bytes = &data[cursor..cursor + len]; // ← panic if len huge
        ...
    }
    ...
};
```
The `try_into` on fixed-length headers is bounds-checked (it returns `Err` on a slice-length mismatch), but the subsequent payload reads are raw slice indexing. A record with `tag = 4` and `len = 0xFFFFFFFF`, a truncated record of any variable-length type, or a tag-9 Array with `count = 0xFFFFFFFF` produces either an index-out-of-bounds panic or a multi-GB `Vec` allocation.
Since WAL files can be partially written at crash time, a normal crash-recovery scenario becomes a crash loop — the process panics during replay, restarts, and panics again on the same segment.
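The defensive rewrite funnels every payload read through one bounds-checked helper, so truncation surfaces as `Err` instead of a panic and oversized declared lengths are rejected before allocating. A sketch, with `DecodeError`, the helper signature, and the 16 MiB cap all assumed rather than taken from the crate:

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    Truncated { need: usize, have: usize },
    TooLarge(usize),
}

// Assumed sanity cap applied to any decoded len/count before allocation.
const MAX_FIELD_LEN: usize = 16 * 1024 * 1024;

/// Bounds-checked cursor advance: returns the next `n` bytes or an error.
/// The cursor only moves on success, so the caller's position stays valid.
fn read_slice<'a>(data: &'a [u8], cursor: &mut usize, n: usize) -> Result<&'a [u8], DecodeError> {
    if n > MAX_FIELD_LEN {
        return Err(DecodeError::TooLarge(n));
    }
    let end = cursor.checked_add(n).ok_or(DecodeError::TooLarge(n))?;
    let s = data
        .get(*cursor..end) // get(), not [..]: no panic on truncated input
        .ok_or(DecodeError::Truncated { need: end, have: data.len() })?;
    *cursor = end;
    Ok(s)
}

fn read_u32_le(data: &[u8], cursor: &mut usize) -> Result<u32, DecodeError> {
    // read_slice guarantees exactly 4 bytes here, so try_into cannot fail.
    Ok(u32::from_le_bytes(read_slice(data, cursor, 4)?.try_into().unwrap()))
}
```

Each `data[cursor..cursor + len]` in the decoder then becomes `read_slice(data, &mut cursor, len)?`, and the tag-9 `Vec::with_capacity(count)` is only reached after `count` passes the same cap.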
Checklist

- [ ] Convert the single `pwrite` into a retry loop that handles short writes; advance `file_offset` by `data.len()` (the amount actually issued), not `buffer.len()`. Surface an unrecoverable short write as a typed error.
- [ ] Replace the self-recursion on unknown-optional records in `reader.rs` and `mmap_reader.rs` with a `loop`. (Both files.)
- [ ] Add a bounds-checked `read_slice(data, &mut cursor, n) -> Result<&[u8], _>` helper in `wal_record.rs` and route every payload read through it; also cap variable-length fields (`len`, `count`) at a sane maximum before allocating.

Notes

- Found during a CPU/memory + correctness audit sweep of `nodedb-wal/` and `nodedb-columnar/`.
- All three items are independently reproducible with small hand-crafted test inputs; checkboxes let PRs close them one-by-one.