Verify CPU corruption after op via --verify_cpu_corruption_dir (#14852) by hx235 · Pull Request #14852 · facebook/rocksdb

hx235 · 2026-06-15T04:07:05Z

Summary:

Detection layer of the CPU corruption injector (#14858). With --verify_cpu_corruption_dir=<dir>, db_stress reads back the full keyspace after every write/manual flush/manual compaction op and compares it to the expected-values model, classifying any mismatch by kind: lost / resurrected / wrong-value (silent data corruption) or detected-corruption (a status/checksum-caught error). Each finding is written to <dir>/data_corruption.<tid>.json ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard VerificationAbort for a clean exit-1. A startup guard requires --threads=1 and all fault injection off so the read-back is single-writer and the only corruption present is the injected one

Bonus: a minor refactoring into the surrounding error handling code in these ops

Test plan:

1.Startup guard rejects misconfiguration:

--threads=2           -> exit 1: "--verify_cpu_corruption_dir requires --threads=1"
--read_fault_one_in=5 -> exit 1: "requires all fault injection off"

2.No false positive (clean CORE preset run, no injection):

$ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0) ...
exit 0; no data_corruption.<tid>.json produced; "Verification successful"

3.Write-path cpu corruption injection (coming up, e.g, gdb flips a register inside MemTable::Add), then the immediate post-op read-back catches it. Real <dir>/data_corruption.<tid>.json:

silent data corruption -- write returned OK but the key is gone on read-back:

{"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"}

detected corruption -- read-back Get returns Corruption via the memtable per-key checksum:

{"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed."

4.See PR #14866 test plan's spread in the outcome for verification of detection

Differential Revision: D107999834

meta-codesync · 2026-06-15T04:07:15Z

@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107999834.

github-actions · 2026-06-15T04:10:05Z

✅ clang-tidy: No findings on changed lines

Completed in 91.2s.

github-actions · 2026-06-15T04:26:15Z

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 9aebab1

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-06-15T04:39:07Z

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 9aebab1

Summary

This PR adds a useful CPU corruption detection layer to db_stress via --verify_cpu_corruption_dir, performing full-keyspace readback after each write/flush/compaction op. The API changes (void->Status for TestCompactRange/TestCompactFiles) are clean. However, the verification logic has several correctness gaps that would cause crashes in debug builds and false results with certain configurations.

High-severity findings (2):

[db_stress_test_base.cc:952] ExpectedValue::Exists() called without checking pending state -- will assert-crash in debug builds if any key has PendingWrite/PendingDelete flags set.
[db_stress_test_base.cc:1007] Missing ReadOptions.timestamp when FLAGS_user_timestamp_size > 0 -- all readback Get() calls will fail or return wrong results with user-defined timestamps enabled.

Full review (click to expand)

Findings

🔴 HIGH

H1. `ExpectedValue::Exists()` assert crash -- `db_stress_test_base.cc:952`

Issue: ClassifyReadBack() calls expected_value.Exists() which contains assert(!PendingWrite() && !PendingDelete()) (expected_value.h:41). If any key in the full-keyspace scan has a pending flag, this crashes in debug builds. In release builds, the assert is compiled out and behavior is undefined (IsDeleted may return incorrect results for pending-delete keys).
Root cause: The PR does not check for pending state before calling Exists(). While --threads=1 ensures the just-completed op's expected state is committed, background operations or edge cases could leave other keys in pending state. More importantly, this diverges from the established pattern in VerifyOrSyncValue() (no_batched_ops_stress.cc:3268) which explicitly checks and handles pending state.
Suggested fix: Before calling Exists(), check expected_value.PendingWrite() || expected_value.PendingDelete() and either skip the key or sync the state from the DB, following the VerifyOrSyncValue() pattern.

H2. Missing user-defined timestamp support -- `db_stress_test_base.cc:1007`

Issue: MaybeVerifyCpuCorruption() creates ReadOptions read_opts with verify_checksums = true but does NOT set read_opts.timestamp when FLAGS_user_timestamp_size > 0. Every db_->Get() call will either fail or return incorrect results when UDT is enabled.
Root cause: The canonical verification pattern (no_batched_ops_stress.cc:40-44) sets options.timestamp to GetNowNanos(). The PR omits this.

Suggested fix: Add timestamp handling:

std::string ts_str;
Slice ts;
if (FLAGS_user_timestamp_size > 0) {
  ts_str = GetNowNanos();
  ts = ts_str;
  read_opts.timestamp = &ts;
}

🟡 MEDIUM

M1. No guard for non-state-tracked stress tests -- `db_stress_test_base.cc:998`

Issue: MaybeVerifyCpuCorruption() uses shared->Get(cf, key) which relies on expected state tracking. Three of four stress test subclasses set IsStateTracked() = false: BatchedOpsStressTest, MultiOpsTxnsStressTest, CfConsistencyStressTest. These tests don't call PreparePut/PrepareDelete, so expected state is never populated -- every key appears deleted, producing false "lost" reports for every existing key.
Root cause: The method is in the base class but only works with NonBatchedOpsStressTest.
Suggested fix: Add early return: if (!IsStateTracked()) return; or validate at startup that verify_cpu_corruption_dir is incompatible with test_batches_snapshots, test_cf_consistency, and test_multi_ops_txns.

M2. Incomplete fault injection validation -- `db_stress_tool.cc:196`

Issue: Startup validation checks most *_fault_one_in flags but misses secondary_cache_fault_one_in (declared in db_stress_gflags.cc:1301). Secondary cache faults could inject errors during readback Get() calls, tainting results.
Suggested fix: Add FLAGS_secondary_cache_fault_one_in != 0 to the validation block.

M3. Insufficient error status handling in ClassifyReadBack -- `db_stress_test_base.cc:949`

Issue: After checking IsCorruption(), the code asserts read_status.ok() || read_status.IsNotFound(). Even with fault injection off, other statuses are possible: IOError (disk issues), Incomplete (prefix bloom), TimedOut. The assert would crash in debug; in release, the status would be silently misclassified.
Suggested fix: Handle unexpected statuses gracefully -- either skip the key with a warning or treat non-OK/non-NotFound as "detected-corruption" with the actual status.

M4. Missing verification for ingest operations -- `db_stress_test_base.cc` (OperateDb)

Issue: TestIngestExternalFile and ingest_wbwi_one_in operations modify the DB but are NOT followed by MaybeVerifyCpuCorruption(). These operations can significantly change DB state, creating a gap in corruption detection coverage.
Suggested fix: Either add MaybeVerifyCpuCorruption() after ingest operations or document the exclusion.

🟢 LOW / NIT

L1. Code duplication with VerifyDb -- `db_stress_test_base.cc:1003`

Issue: The PR's full-keyspace readback duplicates logic from VerifyOrSyncValue() (no_batched_ops_stress.cc:3260). The TODO comment at line 1003 acknowledges this. The canonical verifier handles more edge cases (pending state, timestamps, wide columns, secondary DB). Divergence over time is a maintenance risk.
Suggested fix: Follow through on the TODO to extract a shared per-key verification helper.

L2. Wide column validation gap -- `db_stress_test_base.cc:1013`

Issue: When FLAGS_use_put_entity_one_in > 0, the PR uses db_->Get() which returns the default column value. This is functionally correct for value comparison, but does not validate wide column consistency (non-default columns could be corrupted without detection). The canonical VerifyDb calls VerifyWideColumns().
Suggested fix: Consider using GetEntity() with VerifyWideColumns() for comprehensive coverage when PutEntity is in use.

L3. fopen without directory existence check -- `db_stress_test_base.cc:977`

Issue: WriteDataCorruption() calls fopen() on a path under FLAGS_verify_cpu_corruption_dir without verifying the directory exists. If the directory doesn't exist, fopen fails and the error is only printed to stderr -- the corruption finding is lost.
Suggested fix: Validate directory existence at startup or create it if missing.

L4. Performance documentation -- `db_stress_gflags.cc:133`

Issue: The flag documentation doesn't mention the extreme performance cost of full-keyspace readback after every op (potentially millions of Get calls per write). Users should understand this is designed for single-op CPU-fault-injection debugging, not general stress testing.
Suggested fix: Add a note about performance implications to the flag description.

Cross-Component Analysis

Context	Compatible?	Notes
NonBatchedOpsStressTest	YES	Primary target, state tracking works
BatchedOpsStressTest	NO	IsStateTracked()=false, expected state not populated
MultiOpsTxnsStressTest	NO	IsStateTracked()=false, different data model
CfConsistencyStressTest	NO	IsStateTracked()=false
User-defined timestamps	NO	ReadOptions.timestamp not set (H2)
Wide columns / PutEntity	PARTIAL	Default column correct, non-default not validated
Merge operations	YES	db_->Get() returns merged result correctly
Background compaction	YES	Compaction preserves data visibility; compaction filter is already mutually exclusive with most features

Assumption stress-test results:

"Single writer": Valid with --threads=1. Background compaction doesn't change logical data visibility (only reorganizes files). Compaction filters are already excluded by existing flag validation.
"Fault injection off": Mostly valid but secondary_cache_fault_one_in is missed (M2).
"Only OK/NotFound statuses": False -- IOError and other statuses possible even without injection (M3).

Positive Observations

The void-to-Status refactor for TestCompactRange and TestCompactFiles is clean and correctly propagates through the only override (MultiOpsTxnsStressTest).
The DataCorruption struct with Slice fields has correct lifetime management -- Slices reference caller-owned buffers (db_value string and expected_scratch array) that outlive the struct usage.
JSON output uses std::quoted() for op_status field escaping and hex encoding for values -- safe against injection.
The early-return pattern in MaybeVerifyCpuCorruption() (checking FLAGS_verify_cpu_corruption_dir.empty()) has negligible overhead on the normal (disabled) path.
Placing verification before re-enabling fault injection (see "put" case comments) is correct and well-documented.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

…acebook#14852) Summary: Detection layer of the CPU corruption injector (coming up). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/flush/compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (a status/checksum-caught error). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit-1. A startup guard requires `--threads=1` and all fault injection off so the read-back is single-writer and the only corruption present is the injected one Differential Revision: D107999834

hx235 · 2026-06-16T05:06:01Z

Fixed meaningful AI feedback

…acebook#14852) Summary: Detection layer of the CPU corruption injector (coming up). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/flush/compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (a status/checksum-caught error). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit-1. A startup guard requires `--threads=1` and all fault injection off so the read-back is single-writer and the only corruption present is the injected one **Test plan:** 1.Startup guard rejects misconfiguration: ``` --threads=2 -> exit 1: "--verify_cpu_corruption_dir requires --threads=1" --read_fault_one_in=5 -> exit 1: "requires all fault injection off" ``` 2.No false positive (clean CORE preset run, no injection): ``` $ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0) ... exit 0; no data_corruption.<tid>.json produced; "Verification successful" ``` 3.Write-path cpu corruption injection (coming up, e.g, gdb flips a register inside MemTable::Add), then the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`: silent data corruption -- write returned OK but the key is gone on read-back: ``` {"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"} ``` detected corruption -- read-back Get returns Corruption via the memtable per-key checksum: ``` {"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed." ``` Differential Revision: D107999834

hx235 · 2026-06-16T05:22:03Z

Fix format

github-actions · 2026-06-16T05:40:02Z

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit aafd57d

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-06-16T05:48:10Z

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit aafd57d

Summary

Well-structured, focused PR that adds CPU corruption detection to db_stress with appropriate safety guards. The design is sound: full-keyspace read-back after each op under single-writer constraints. The void-to-Status refactoring of TestCompactRange/TestCompactFiles is clean and correctly updates the only override (MultiOpsTxnsStressTest).

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. Missing startup guard for `--num_dbs > 1` with `--verify_cpu_corruption_dir` -- `db_stress_tool/db_stress_tool.cc`

Issue: The startup guard validates --threads=1 but does not check FLAGS_num_dbs. With --num_dbs=2, two StressTest instances run on separate threads, each with its own DB and expected state. The output files use thread->tid which resets per DB instance (thread IDs are 0-based per DB). With two DB instances both having tid=0, the output file data_corruption.0.json would be written by both instances to the same directory, causing a race.
Root cause: FLAGS_verify_cpu_corruption_dir is a global (per-process) flag, but thread IDs are per-DB.
Suggested fix: Either reject --num_dbs > 1 in the startup guard, or include the DB index in the output filename (e.g., data_corruption.<db_index>.<tid>.json).

M2. No snapshot used during full-keyspace read-back -- `db_stress_test_base.cc:1044-1059`

Issue: MaybeVerifyCpuCorruption issues individual db_->Get() calls without a snapshot. Between successive Get() calls, background compaction/flush could run. There is a theoretical window where TTL expiration (if FLAGS_ttl > 0) could cause a key to disappear between the write and a later read-back Get(), leading to a false "lost" classification.
Root cause: The canonical VerifyDb also reads without a snapshot in its Get path, so this follows existing convention. However, VerifyDb runs at quiescent points, while this runs mid-operation-loop.
Suggested fix: Consider rejecting --ttl > 0 in the startup guard, or use a snapshot for the read-back.

M3. `ClassifyReadBack` does not handle `PendingWrite`/`PendingDelete` state -- `db_stress_test_base.cc:949-981`

Issue: Unlike the canonical VerifyOrSyncValue (no_batched_ops_stress.cc:3301) which checks for pending state and syncs rather than asserting, ClassifyReadBack calls expected_value.Exists() which asserts !PendingWrite() && !PendingDelete(). Under --threads=1, this is safe because TestPut/TestDelete/TestDeleteRange commit/rollback before returning. However, this relies on a non-locally-obvious invariant.
Suggested fix: Add a comment at the Exists() call explaining WHY it's safe (the commit/rollback-before-return guarantee under --threads=1).

🟢 LOW / NIT

L1. Output directory not validated at startup -- `db_stress_tool/db_stress_tool.cc`

Issue: The startup guard does not verify that FLAGS_verify_cpu_corruption_dir exists or is writable. If the directory doesn't exist, fopen returns NULL, stderr message prints, but VerificationAbort still triggers. JSON file is supplementary.
Suggested fix: Consider validating the directory at startup or documenting it must exist.

L2. Full-keyspace scan after no-op compaction -- `db_stress_test_base.cc:1549`

Issue: TestCompactFiles returns Status::OK() when no compaction is performed. Full read-back still runs. Wasteful but harmless for a debugging tool.

Cross-Component Analysis

Context	Assumptions hold?	Action needed?
`--num_dbs > 1`	File collision risk (M1)	Guard or fix filename
`--enable_compaction_filter`	Safe - only removes already-deleted keys	None
`--use_txn` + `commit_bypass_memtable`	Safe - committed before read-back	None
`--ttl > 0`	Potential false positive (M2)	Consider guarding
User-defined timestamps	Handled correctly	None
BlobDB	Safe - Get() handles blob indirection	None
Background compaction/flush	Safe - doesn't modify expected state	None

Positive Observations

Clean refactoring: The void -> Status change for TestCompactRange/TestCompactFiles is minimal and correctly updates the only override.
Proper fault injection window: MaybeVerifyCpuCorruption is called BEFORE re-enabling fault injection, ensuring clean read-back.
Defensive unexpected-status handling: ImmediateExit(1) for non-OK/non-NotFound/non-Corruption statuses is appropriate.
Good startup guards: Catches test mode incompatibilities and all fault injection flags.
Tech debt acknowledgment: The TODO noting duplication with VerifyDb and the refactoring plan is good practice.
Correct user-timestamp handling: Read-back correctly sets a read timestamp when FLAGS_user_timestamp_size > 0.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

…14852) Summary: Detection layer of the CPU corruption injector (coming up). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/flush/compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (a status/checksum-caught error). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit-1. A startup guard requires `--threads=1` and all fault injection off so the read-back is single-writer and the only corruption present is the injected one **Test plan:** 1.Startup guard rejects misconfiguration: ``` --threads=2 -> exit 1: "--verify_cpu_corruption_dir requires --threads=1" --read_fault_one_in=5 -> exit 1: "requires all fault injection off" ``` 2.No false positive (clean CORE preset run, no injection): ``` $ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0) ... exit 0; no data_corruption.<tid>.json produced; "Verification successful" ``` 3.Write-path cpu corruption injection (coming up, e.g, gdb flips a register inside MemTable::Add), then the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`: silent data corruption -- write returned OK but the key is gone on read-back: ``` {"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"} ``` detected corruption -- read-back Get returns Corruption via the memtable per-key checksum: ``` {"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed." ``` Differential Revision: D107999834

) Summary: Injection layer of the CPU corruption injector (tools/cpu_corruption_injector/injector.py), runs inside gdb and corrupt a register by bit flip in exactly one db_stress op (i.e, write, foreground compaction and flush) per stress test run. Detection is at db_stress (#14852); orchestration is coming up. How one run works - The orchestration layer, coming up, randomly picks which op instance (so corruption lands at different points in the LSM's life) and which target_fn per run (so it has a reasonable number of instructions to step under a reasonable time limit); injector.py picks which instruction within target_fn. - Attach: gdb starts with injector.py's parameters passed via -iex and the db_stress command after --args, so db_stress runs unmodified. Example: ``` gdb --batch --nx \ -iex "py import sys; sys.argv=['injector.py','--op','write','--op_index','42','--entry_fn','rocksdb::MemTable::Add','--target_fn','rocksdb::MemTable::Add','--corruptions_per_op','1','--seed','7','--dir','<rundir>']" \ -x tools/cpu_corruption_injector/injector.py \ --args <db_stress> --threads=1 --verify_cpu_corruption_dir=<rundir> ... ``` - Reach the op: entry_fn is called exactly once per stress test run's op so the op_index-th op is its op_index-th call. The orchestration layer picks op_index . `injector_navigate.py` breaks on entry_fn and set a gdb ignore-count of op_index-1 to fast-forward to op_index-th one. - Warm up: `injector_critical_instruction.py` will choose "critical instruction" (those that move key/value bytes with general-purpose or vector registers or set a branch flag) uniformly within the chosen `target_fn` (within `entry_fn`) by the orchestration layer. In order to do that, it needs to approximate how many such instructions within `target_fn`. Hence we have this warm-up phase. It single-steps the first call of target_fn to count and pick the critical instruction index, then corrupt that index at a later call. - Corrupt: on a later call of target_fn, `injector_critical_instruction.py` single-step to the m-th critical instruction and bit-flip the register through `injector_register_corruption.py`. The way to corrupt register depends on what instruction it is. - Record: `injector_telemetry.py` provides telemetry to capture the corruption for later analysis. **Test plan:** 1. Isolated tests (real gdb-captured x/i fixtures): test_inject_critical_instruction 6/6 2. E2E test on navigation, inject, telemetry will be done in the later orchestration PR. Below is inject.json from such run ``` { "injection_result": "injected", "db_stress_crash_signal": null, "op": "write", "op_index": 279, "entry_fn": "rocksdb::MemTable::Add", "target_fn": "rocksdb::MemTable::Add", "critical_instruction_index": 37, "corruptions": [ { "instruction": "mov %rsi,0x8c8(%rbx)", "register": "rsi", "corruption_type": "bit_flip", "before": "0x7fffee4c64d8", "after": "0x7fffee4c64c8", "details": { "source": "rocksdb::Arena::AllocateAligned @ ./fbcode/internal_repo_rocksdb/repo/memory/arena.cc:135", "call_chain": [ "rocksdb::Arena::AllocateAligned @ ./fbcode/internal_repo_rocksdb/repo/memory/arena.cc:135", "rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)::{lambda()#1}::operator()() const @ fbcode/internal_repo_rocksdb/repo/memory/concurrent_arena.h:65", "rocksdb::ConcurrentArena::AllocateImpl<rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)::{lambda()#1}>(unsigned long, bool, rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)::{lambda()#1} const&) @ fbcode/internal_repo_rocksdb/repo/memory/concurrent_arena.h:145", "rocksdb::ConcurrentArena::AllocateAligned @ fbcode/internal_repo_rocksdb/repo/memory/concurrent_arena.h:63", "rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::AllocateNode @ fbcode/internal_repo_rocksdb/repo/memtable/inlineskiplist.h:868", "rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::AllocateKey @ fbcode/internal_repo_rocksdb/repo/memtable/inlineskiplist.h:855", "rocksdb::(anonymous namespace)::SkipListRep::Allocate @ ./fbcode/internal_repo_rocksdb/repo/memtable/skiplistrep.cc:36", "rocksdb::MemTable::Add @ ./fbcode/internal_repo_rocksdb/repo/db/memtable.cc:1157" ] } } ], "ops_seen": 279, "critical_instructions_seen": 38 } ``` Differential Revision: D107999835

…ook#14852) Summary: Detection layer of the CPU corruption injector (facebook#14858). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/manual flush/manual compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (a status/checksum-caught error). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit-1. A startup guard requires `--threads=1` and all fault injection off so the read-back is single-writer and the only corruption present is the injected one Bonus: a minor refactoring into the surrounding error handling code in these ops **Test plan:** 1.Startup guard rejects misconfiguration: ``` --threads=2 -> exit 1: "--verify_cpu_corruption_dir requires --threads=1" --read_fault_one_in=5 -> exit 1: "requires all fault injection off" ``` 2.No false positive (clean CORE preset run, no injection): ``` $ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0) ... exit 0; no data_corruption.<tid>.json produced; "Verification successful" ``` 3.Write-path cpu corruption injection (coming up, e.g, gdb flips a register inside MemTable::Add), then the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`: silent data corruption -- write returned OK but the key is gone on read-back: ``` {"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"} ``` detected corruption -- read-back Get returns Corruption via the memtable per-key checksum: ``` {"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed." ``` 4.See PR [todo]'s spread in the outcome for verification of detection Differential Revision: D107999834

Summary: Detection layer of the CPU corruption injector (#14858). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/manual flush/manual compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (a status/checksum-caught error). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit-1. A startup guard requires `--threads=1` and all fault injection off so the read-back is single-writer and the only corruption present is the injected one Bonus: a minor refactoring into the surrounding error handling code in these ops **Test plan:** 1.Startup guard rejects misconfiguration: ``` --threads=2 -> exit 1: "--verify_cpu_corruption_dir requires --threads=1" --read_fault_one_in=5 -> exit 1: "requires all fault injection off" ``` 2.No false positive (clean CORE preset run, no injection): ``` $ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0) ... exit 0; no data_corruption.<tid>.json produced; "Verification successful" ``` 3.Write-path cpu corruption injection (coming up, e.g, gdb flips a register inside MemTable::Add), then the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`: silent data corruption -- write returned OK but the key is gone on read-back: ``` {"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"} ``` detected corruption -- read-back Get returns Corruption via the memtable per-key checksum: ``` {"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed." ``` 4.See PR [todo]'s spread in the outcome for verification of detection Differential Revision: D107999834

) Summary: This PR is the injection layer of the CPU corruption injector, runs inside gdb and randomly corrupts a register by bit flip in exactly one db_stress op (i.e, write, foreground compaction and flush) per stress test run. Detection layer is at db_stress (#14852); orchestration layer is coming up. __How one run works__ - The orchestration layer, coming up, randomly picks which stress test `op` instance (so corruption can land at different points in the LSM shape journey) and which `target_fn` of that `op` (so to cap instructions to step under a reasonable limit; `injector.py` in this PR randomly picks which instruction within the `target_fn` to inject (so corruption can land at different points of a `target_fn`). - Attach: gdb starts with injector.py's parameters passed via -iex and the db_stress command after --args, so db_stress runs unmodified. Example: ``` gdb --batch --nx \ -iex "py import sys; sys.argv=['injector.py','--op','write','--op_index','42','--entry_fn','rocksdb::MemTable::Add','--target_fn','rocksdb::MemTable::Add','--corruptions_per_op','1','--seed','7','--dir','<rundir>']" \ -x tools/cpu_corruption_injector/injector.py \ --args <db_stress> --threads=1 --verify_cpu_corruption_dir=<rundir> ... ``` - Navigate: The orchestration layer will pick op_index. `entry_fn` is called exactly once per stress test run's op so the op_index-th op is its op_index-th call. `injector_navigate.py` breaks on `entry_fn` and set a gdb ignore-count of op_index-1 to fast-forward to op_index-th one. It also breaks at the first `target_fn` within that `entry_fn`. - Warm up: `injector_critical_instruction.py` will choose "critical instruction" (those that move key/value bytes with general-purpose or vector registers or set a branch flag) uniformly within the chosen `target_fn` by the orchestration layer. In order to do that, it needs to approximate how many such instructions within the `target_fn`. Hence we have this warm-up phase. It single-steps the instruction within the first encoutering of `target_fn` to count and draw the critical instruction index, then corrupt that index at a later call. - Corrupt: on a later call of `target_fn`, `injector_critical_instruction.py` single-step to the m-th critical instruction and bit-flip the register through `injector_register_corruption.py`. The way to corrupt register depends on what instruction it is. If the current call of `target_fn`'s m-th instruction is not a critical instruction, we will try next `target_fn` till running out of `target_fn`. - Record: `injector_telemetry.py` provides telemetry to capture the corruption for later analysis. **Test plan:** 1. Isolated tests (real gdb-captured x/i fixtures): test_inject_critical_instruction 2. E2E test on navigation, inject, telemetry will be done in the later orchestration PR. Below is inject.json from such run ``` { "injection_result": "injected", "db_stress_crash_signal": null, "op": "write", "op_index": 279, "entry_fn": "rocksdb::MemTable::Add", "target_fn": "rocksdb::MemTable::Add", "critical_instruction_index": 37, "corruptions": [ { "instruction": "mov %rsi,0x8c8(%rbx)", "register": "rsi", "corruption_type": "bit_flip", "before": "0x7fffee4c64d8", "after": "0x7fffee4c64c8", "details": { "source": "rocksdb::Arena::AllocateAligned @ ./fbcode/internal_repo_rocksdb/repo/memory/arena.cc:135", "call_chain": [ "rocksdb::Arena::AllocateAligned @ ./fbcode/internal_repo_rocksdb/repo/memory/arena.cc:135", "rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)::{lambda()#1}::operator()() const @ fbcode/internal_repo_rocksdb/repo/memory/concurrent_arena.h:65", "rocksdb::ConcurrentArena::AllocateImpl<rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)::{lambda()#1}>(unsigned long, bool, rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)::{lambda()#1} const&) @ fbcode/internal_repo_rocksdb/repo/memory/concurrent_arena.h:145", "rocksdb::ConcurrentArena::AllocateAligned @ fbcode/internal_repo_rocksdb/repo/memory/concurrent_arena.h:63", "rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::AllocateNode @ fbcode/internal_repo_rocksdb/repo/memtable/inlineskiplist.h:868", "rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::AllocateKey @ fbcode/internal_repo_rocksdb/repo/memtable/inlineskiplist.h:855", "rocksdb::(anonymous namespace)::SkipListRep::Allocate @ ./fbcode/internal_repo_rocksdb/repo/memtable/skiplistrep.cc:36", "rocksdb::MemTable::Add @ ./fbcode/internal_repo_rocksdb/repo/db/memtable.cc:1157" ] } } ], "ops_seen": 279, "critical_instructions_seen": 38 } ``` Differential Revision: D107999835

Summary: Detection layer of the CPU corruption injector (#14858). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/manual flush/manual compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (a status/checksum-caught error). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit-1. A startup guard requires `--threads=1` and all fault injection off so the read-back is single-writer and the only corruption present is the injected one Bonus: a minor refactoring into the surrounding error handling code in these ops **Test plan:** 1.Startup guard rejects misconfiguration: ``` --threads=2 -> exit 1: "--verify_cpu_corruption_dir requires --threads=1" --read_fault_one_in=5 -> exit 1: "requires all fault injection off" ``` 2.No false positive (clean CORE preset run, no injection): ``` $ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0) ... exit 0; no data_corruption.<tid>.json produced; "Verification successful" ``` 3.Write-path cpu corruption injection (coming up, e.g, gdb flips a register inside MemTable::Add), then the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`: silent data corruption -- write returned OK but the key is gone on read-back: ``` {"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"} ``` detected corruption -- read-back Get returns Corruption via the memtable per-key checksum: ``` {"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed." ``` 4.See PR [todo]'s spread in the outcome for verification of detection Differential Revision: D107999834

) Summary: This PR is the injection layer of the CPU corruption injector, runs inside gdb and randomly corrupts a register by bit flip in exactly one db_stress op (i.e, write, foreground compaction and flush) per stress test run. Detection layer is at db_stress (#14852); orchestration layer is coming up. __How one run works__ - The orchestration layer, coming up, randomly picks which stress test `op` instance (so corruption can land at different points in the LSM shape journey) and which `target_fn` of that `op` (so to cap instructions to step under a reasonable limit; `injector.py` in this PR randomly picks which instruction within the `target_fn` to inject (so corruption can land at different points of a `target_fn`). - Attach: gdb starts with injector.py's parameters passed via -iex and the db_stress command after --args, so db_stress runs unmodified. Example: ``` gdb --batch --nx \ -iex "py import sys; sys.argv=['injector.py','--op','write','--op_index','42','--entry_fn','rocksdb::MemTable::Add','--target_fn','rocksdb::MemTable::Add','--corruptions_per_op','1','--seed','7','--dir','<rundir>']" \ -x tools/cpu_corruption_injector/injector.py \ --args <db_stress> --threads=1 --verify_cpu_corruption_dir=<rundir> ... ``` - Navigate: The orchestration layer will pick op_index. `entry_fn` is called exactly once per stress test run's op so the op_index-th op is its op_index-th call. `injector_navigate.py` breaks on `entry_fn` and set a gdb ignore-count of op_index-1 to fast-forward to op_index-th one. It also breaks at the first `target_fn` within that `entry_fn`. - Warm up: `injector_critical_instruction.py` will choose "critical instruction" (those that move key/value bytes with general-purpose or vector registers or set a branch flag) uniformly within the chosen `target_fn` by the orchestration layer. In order to do that, it needs to approximate how many such instructions within the `target_fn`. Hence we have this warm-up phase. It single-steps the instruction within the first encoutering of `target_fn` to count and draw the critical instruction index, then corrupt that index at a later call. - Corrupt: on a later call of `target_fn`, `injector_critical_instruction.py` single-step to the m-th critical instruction and bit-flip the register through `injector_register_corruption.py`. The way to corrupt register depends on what instruction it is. If the current call of `target_fn`'s m-th instruction is not a critical instruction, we will try next `target_fn` till running out of `target_fn`. - Record: `injector_telemetry.py` provides telemetry to capture the corruption for later analysis. **Test plan:** 1. Isolated tests (real gdb-captured x/i fixtures): test_inject_critical_instruction 2. E2E test on navigation, inject, telemetry will be done in the later orchestration PR. Below is inject.json from such run ``` { "injection_result": "injected", "db_stress_crash_signal": null, "op": "write", "op_index": 279, "entry_fn": "rocksdb::MemTable::Add", "target_fn": "rocksdb::MemTable::Add", "critical_instruction_index": 37, "corruptions": [ { "instruction": "mov %rsi,0x8c8(%rbx)", "register": "rsi", "corruption_type": "bit_flip", "before": "0x7fffee4c64d8", "after": "0x7fffee4c64c8", "details": { "source": "rocksdb::Arena::AllocateAligned @ ./fbcode/internal_repo_rocksdb/repo/memory/arena.cc:135", "call_chain": [ "rocksdb::Arena::AllocateAligned @ ./fbcode/internal_repo_rocksdb/repo/memory/arena.cc:135", "rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)::{lambda()#1}::operator()() const @ fbcode/internal_repo_rocksdb/repo/memory/concurrent_arena.h:65", "rocksdb::ConcurrentArena::AllocateImpl<rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)::{lambda()#1}>(unsigned long, bool, rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)::{lambda()#1} const&) @ fbcode/internal_repo_rocksdb/repo/memory/concurrent_arena.h:145", "rocksdb::ConcurrentArena::AllocateAligned @ fbcode/internal_repo_rocksdb/repo/memory/concurrent_arena.h:63", "rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::AllocateNode @ fbcode/internal_repo_rocksdb/repo/memtable/inlineskiplist.h:868", "rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::AllocateKey @ fbcode/internal_repo_rocksdb/repo/memtable/inlineskiplist.h:855", "rocksdb::(anonymous namespace)::SkipListRep::Allocate @ ./fbcode/internal_repo_rocksdb/repo/memtable/skiplistrep.cc:36", "rocksdb::MemTable::Add @ ./fbcode/internal_repo_rocksdb/repo/db/memtable.cc:1157" ] } } ], "ops_seen": 279, "critical_instructions_seen": 38 } ``` Differential Revision: D107999835

…ook#14852) Summary: Detection layer of the CPU corruption injector (facebook#14858). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/manual flush/manual compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (a status/checksum-caught error). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit-1. A startup guard requires `--threads=1` and all fault injection off so the read-back is single-writer and the only corruption present is the injected one Bonus: a minor refactoring into the surrounding error handling code in these ops **Test plan:** 1.Startup guard rejects misconfiguration: ``` --threads=2 -> exit 1: "--verify_cpu_corruption_dir requires --threads=1" --read_fault_one_in=5 -> exit 1: "requires all fault injection off" ``` 2.No false positive (clean CORE preset run, no injection): ``` $ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0) ... exit 0; no data_corruption.<tid>.json produced; "Verification successful" ``` 3.Write-path cpu corruption injection (coming up, e.g, gdb flips a register inside MemTable::Add), then the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`: silent data corruption -- write returned OK but the key is gone on read-back: ``` {"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"} ``` detected corruption -- read-back Get returns Corruption via the memtable per-key checksum: ``` {"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed." ``` 4.See PR facebook#14866 test plan's spread in the outcome for verification of detection Differential Revision: D107999834

meta-cla Bot added the CLA Signed label Jun 15, 2026

meta-codesync Bot added the meta-exported label Jun 15, 2026

meta-codesync Bot changed the title ~~Verify CPU corruption after each op via --verify_cpu_corruption_dir~~ Verify CPU corruption after each op via --verify_cpu_corruption_dir (#14852) Jun 16, 2026

hx235 force-pushed the export-D107999834 branch from 9aebab1 to 9e499d3 Compare June 16, 2026 05:03

hx235 force-pushed the export-D107999834 branch from 9e499d3 to aafd57d Compare June 16, 2026 05:21

This was referenced Jun 16, 2026

CPU corruption injector: gdb register flip into one db_stress op #14857

Closed

CPU corruption injector: gdb register flip into one db_stress op (#14858) #14858

Open

hx235 changed the title ~~Verify CPU corruption after each op via --verify_cpu_corruption_dir (#14852)~~ Verify CPU corruption after op via --verify_cpu_corruption_dir (#14852) Jun 16, 2026

hx235 requested a review from pdillinger June 16, 2026 20:54

hx235 force-pushed the export-D107999834 branch from aafd57d to 5cad4a7 Compare June 18, 2026 22:13

hx235 mentioned this pull request Jun 18, 2026

CPU corruption injector: runner + db_stress preset flags #14866

Open

hx235 force-pushed the export-D107999834 branch from 5cad4a7 to dbc0912 Compare June 18, 2026 22:29

Uh oh!

Conversation

hx235 commented Jun 15, 2026 • edited by meta-codesync Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

meta-codesync Bot commented Jun 15, 2026

Uh oh!

github-actions Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ clang-tidy: No findings on changed lines

Uh oh!

github-actions Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codex Code Review - OBSOLETE

🟡 Codex Code Review

Uh oh!

github-actions Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review - OBSOLETE

✅ Claude Code Review

Summary

Findings

🔴 HIGH

H1. ExpectedValue::Exists() assert crash -- db_stress_test_base.cc:952

H2. Missing user-defined timestamp support -- db_stress_test_base.cc:1007

🟡 MEDIUM

M1. No guard for non-state-tracked stress tests -- db_stress_test_base.cc:998

M2. Incomplete fault injection validation -- db_stress_tool.cc:196

M3. Insufficient error status handling in ClassifyReadBack -- db_stress_test_base.cc:949

M4. Missing verification for ingest operations -- db_stress_test_base.cc (OperateDb)

🟢 LOW / NIT

L1. Code duplication with VerifyDb -- db_stress_test_base.cc:1003

L2. Wide column validation gap -- db_stress_test_base.cc:1013

L3. fopen without directory existence check -- db_stress_test_base.cc:977

L4. Performance documentation -- db_stress_gflags.cc:133

Cross-Component Analysis

Positive Observations

Uh oh!

hx235 commented Jun 16, 2026

Uh oh!

hx235 commented Jun 16, 2026

Uh oh!

github-actions Bot commented Jun 16, 2026

🟡 Codex Code Review

Uh oh!

github-actions Bot commented Jun 16, 2026

✅ Claude Code Review

Summary

Findings

🔴 HIGH

🟡 MEDIUM

M1. Missing startup guard for --num_dbs > 1 with --verify_cpu_corruption_dir -- db_stress_tool/db_stress_tool.cc

M2. No snapshot used during full-keyspace read-back -- db_stress_test_base.cc:1044-1059

M3. ClassifyReadBack does not handle PendingWrite/PendingDelete state -- db_stress_test_base.cc:949-981

🟢 LOW / NIT

L1. Output directory not validated at startup -- db_stress_tool/db_stress_tool.cc

L2. Full-keyspace scan after no-op compaction -- db_stress_test_base.cc:1549

Cross-Component Analysis

Positive Observations

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hx235 commented Jun 15, 2026 •

edited by meta-codesync Bot

Loading

github-actions Bot commented Jun 15, 2026 •

edited

Loading

github-actions Bot commented Jun 15, 2026 •

edited

Loading

github-actions Bot commented Jun 15, 2026 •

edited

Loading

H1. `ExpectedValue::Exists()` assert crash -- `db_stress_test_base.cc:952`

H2. Missing user-defined timestamp support -- `db_stress_test_base.cc:1007`

M1. No guard for non-state-tracked stress tests -- `db_stress_test_base.cc:998`

M2. Incomplete fault injection validation -- `db_stress_tool.cc:196`

M3. Insufficient error status handling in ClassifyReadBack -- `db_stress_test_base.cc:949`

M4. Missing verification for ingest operations -- `db_stress_test_base.cc` (OperateDb)

L1. Code duplication with VerifyDb -- `db_stress_test_base.cc:1003`

L2. Wide column validation gap -- `db_stress_test_base.cc:1013`

L3. fopen without directory existence check -- `db_stress_test_base.cc:977`

L4. Performance documentation -- `db_stress_gflags.cc:133`

M1. Missing startup guard for `--num_dbs > 1` with `--verify_cpu_corruption_dir` -- `db_stress_tool/db_stress_tool.cc`

M2. No snapshot used during full-keyspace read-back -- `db_stress_test_base.cc:1044-1059`

M3. `ClassifyReadBack` does not handle `PendingWrite`/`PendingDelete` state -- `db_stress_test_base.cc:949-981`

L1. Output directory not validated at startup -- `db_stress_tool/db_stress_tool.cc`

L2. Full-keyspace scan after no-op compaction -- `db_stress_test_base.cc:1549`