CPU corruption injector: gdb register flip into one db_stress op by hx235 · Pull Request #14857 · facebook/rocksdb

hx235 · 2026-06-16T06:46:49Z

Summary:
Injection layer of the CPU corruption injector (tools/cpu_corruption_injector/injector.py). It runs inside gdb and corrupts a register by a bit flip in exactly one db_stress op (i.e., a write, foreground compaction, or flush) per stress test run. Detection lives in db_stress (#14852); orchestration is coming up.

How one run works

The orchestration layer (coming up) randomly picks, per run, which op instance to corrupt (so corruption lands at different points in the LSM's life) and which target_fn to use (so there is a reasonable number of instructions to step through within a reasonable time limit); injector.py picks which instruction within target_fn.
Attach: gdb starts with injector.py's parameters passed via -iex and the db_stress command after --args, so db_stress runs unmodified. Example:

gdb --batch --nx \
  -iex "py import sys; sys.argv=['injector.py','--op','write','--op_index','42','--entry_fn','rocksdb::MemTable::Add','--target_fn','rocksdb::MemTable::Add','--corruptions_per_op','1','--seed','7','--dir','<rundir>']" \
  -x tools/cpu_corruption_injector/injector.py \
  --args <db_stress> --threads=1 --verify_cpu_corruption_dir=<rundir>  ...

Reach the op: entry_fn is called exactly once per op in a stress test run, so the op_index-th op is its op_index-th call. The orchestration layer picks op_index. injector_navigate.py breaks on entry_fn and sets a gdb ignore-count of op_index-1 to fast-forward to the op_index-th call.
Warm up: injector_critical_instruction.py chooses a "critical instruction" (one that moves key/value bytes through general-purpose or vector registers, or sets a branch flag) uniformly within the target_fn (within entry_fn) chosen by the orchestration layer. To do that, it needs to approximate how many such instructions there are within target_fn, hence the warm-up phase: it single-steps the first call of target_fn to count the critical instructions and pick an index, then corrupts that index on a later call.
Corrupt: on a later call of target_fn, injector_critical_instruction.py single-steps to the m-th critical instruction and bit-flips the register through injector_register_corruption.py. How the register is corrupted depends on the instruction.
Record: injector_telemetry.py provides telemetry to capture the corruption for later analysis.
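
The warm-up and corrupt steps above can be sketched in plain Python, without gdb. This is only a minimal illustration under stated assumptions, not the code in injector.py: the helper names `pick_critical_index` and `flip_bit`, the 64-bit register width, the example count of 120, and the use of `random.Random` seeded from `--seed` are all hypothetical. The description above only states that the choice is uniform over critical instructions and that a register bit is flipped.

```python
import random


def pick_critical_index(num_critical: int, rng: random.Random) -> int:
    """Pick a 1-based index m uniformly among the counted critical instructions.

    num_critical stands in for the count gathered during the warm-up call.
    """
    if num_critical <= 0:
        raise ValueError("warm-up found no critical instructions")
    return rng.randint(1, num_critical)  # inclusive on both ends


def flip_bit(value: int, bit: int, width: int = 64) -> int:
    """Return value with exactly one bit flipped, kept within the register width."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside register width")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


# Hypothetical values: seed 7 mirrors the --seed in the example command,
# 120 is a made-up warm-up count, 0x1234 a made-up register value.
rng = random.Random(7)
m = pick_critical_index(120, rng)
original = 0x1234
corrupted = flip_bit(original, rng.randrange(64))
```

Because XOR with a single-bit mask is its own inverse, applying `flip_bit` twice with the same bit restores the original value, which makes the corruption easy to reason about when reading telemetry later.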

Differential Revision: D107999835

…acebook#14852)

Summary:
Detection layer of the CPU corruption injector (coming up). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/flush/compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (a status/checksum-caught error). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit-1. A startup guard requires `--threads=1` and all fault injection off, so the read-back is single-writer and the only corruption present is the injected one.

**Test plan:**

1. Startup guard rejects misconfiguration:
```
--threads=2 -> exit 1: "--verify_cpu_corruption_dir requires --threads=1"
--read_fault_one_in=5 -> exit 1: "requires all fault injection off"
```
2. No false positive (clean CORE preset run, no injection):
```
$ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0)
...
exit 0; no data_corruption.<tid>.json produced; "Verification successful"
```
3. Write-path CPU corruption injection (coming up, e.g., gdb flips a register inside MemTable::Add), then the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`:

silent data corruption -- write returned OK but the key is gone on read-back:
```
{"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"}
```
detected corruption -- read-back Get returns Corruption via the memtable per-key checksum:
```
{"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed."}
```

Differential Revision: D107999834
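
The `kind` classification described in this commit message can be illustrated with a short Python sketch. It is a guess at the decision logic, not db_stress's implementation (which is C++): the function name `classify`, the use of `None` for an absent key, and the reading of `resurrected` as "present in the DB but absent in the expected model" are assumptions made for illustration.

```python
from typing import Optional


def classify(db_value: Optional[bytes],
             expected: Optional[bytes],
             status_ok: bool = True) -> Optional[str]:
    """Classify one read-back result against the expected-values model.

    None means "key absent". status_ok=False models a read that failed with
    an error status such as Corruption (NotFound is treated as absence, not
    as an error). Returns None when the DB and the model agree.
    """
    if not status_ok:
        return "detected-corruption"  # caught by a status or checksum
    if db_value == expected:
        return None  # no finding
    if db_value is None:
        return "lost"  # expected a value, key is gone
    if expected is None:
        return "resurrected"  # assumed meaning: key should not exist but does
    return "wrong-value"  # both present, bytes differ
```

Under these assumptions, the first three kinds are the silent cases (the operation reported success but the data disagrees with the model), while `detected-corruption` covers reads where the DB itself reported the problem.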

meta-codesync · 2026-06-16T06:46:57Z

@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107999835.

github-actions · 2026-06-16T06:50:08Z

✅ clang-tidy: No findings on changed lines

Completed in 98.2s.

Hui Xiao added 2 commits June 15, 2026 23:46

meta-cla Bot added the CLA Signed label Jun 16, 2026

meta-codesync Bot added the meta-exported label Jun 16, 2026

hx235 closed this Jun 16, 2026

hx235 wants to merge 2 commits into facebook:main from hx235:export-D107999835
