CPU corruption injector: gdb register flip into one db_stress op #14857

Closed
hx235 wants to merge 2 commits into facebook:main from hx235:export-D107999835
@hx235 hx235 commented Jun 16, 2026

Summary:
Injection layer of the CPU corruption injector (tools/cpu_corruption_injector/injector.py). It runs inside gdb and corrupts a register with a bit flip in exactly one db_stress op (i.e., a write, foreground compaction, or flush) per stress test run. Detection lives in db_stress (#14852); orchestration is coming up.

How one run works

  • The orchestration layer (coming up) randomly picks, per run, which op instance to corrupt (so corruption lands at different points in the LSM's life) and which target_fn to step through (so the number of instructions to single-step fits within a reasonable time limit); injector.py picks which instruction within target_fn.
  • Attach: gdb starts with injector.py's parameters passed via -iex and the db_stress command after --args, so db_stress runs unmodified. Example:
gdb --batch --nx \
  -iex "py import sys; sys.argv=['injector.py','--op','write','--op_index','42','--entry_fn','rocksdb::MemTable::Add','--target_fn','rocksdb::MemTable::Add','--corruptions_per_op','1','--seed','7','--dir','<rundir>']" \
  -x tools/cpu_corruption_injector/injector.py \
  --args <db_stress> --threads=1 --verify_cpu_corruption_dir=<rundir>  ...
  • Reach the op: entry_fn is called exactly once per op, so the op_index-th op of a stress test run is the op_index-th call of entry_fn. The orchestration layer picks op_index. injector_navigate.py breaks on entry_fn and sets a gdb ignore count of op_index-1 to fast-forward to the op_index-th call.

  • Warm up: injector_critical_instruction.py chooses a "critical instruction" (one that moves key/value bytes through general-purpose or vector registers, or that sets a branch flag) uniformly at random within the target_fn (itself within entry_fn) chosen by the orchestration layer. To do that, it needs an approximate count of such instructions in target_fn, hence this warm-up phase: it single-steps the first call of target_fn to count the critical instructions and pick an index, then corrupts the instruction at that index on a later call.

  • Corrupt: on a later call of target_fn, injector_critical_instruction.py single-steps to the m-th critical instruction and bit-flips the register through injector_register_corruption.py. How the register is corrupted depends on the kind of instruction.

  • Record: injector_telemetry.py provides telemetry to capture the corruption for later analysis.
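The warm-up and corrupt steps above can be sketched in plain Python, outside gdb. This is a minimal illustration, not the code in this PR: the mnemonic set `CRITICAL_MNEMONICS`, the helper names, and the list-of-mnemonics "trace" standing in for gdb single-stepping are all assumptions made for the example.

```python
import random

# Assumption: an illustrative subset of x86-64 mnemonics. The real classifier
# in injector_critical_instruction.py is not shown in this description.
CRITICAL_MNEMONICS = {"mov", "movzx", "movdqu", "vmovdqu", "cmp", "test"}


def count_critical(trace):
    """Warm-up: count critical instructions seen while stepping one call."""
    return sum(1 for mnemonic in trace if mnemonic in CRITICAL_MNEMONICS)


def pick_critical_index(critical_count, rng):
    """Pick a 1-based index m uniformly among the counted instructions."""
    if critical_count <= 0:
        raise ValueError("no critical instructions were counted")
    return rng.randint(1, critical_count)


def flip_bit(value, bit, width=64):
    """Return value with exactly one bit flipped, truncated to the width."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def corrupt_mth_critical(trace, m, reg_value, rng, width=64):
    """Walk a later call's trace to the m-th critical instruction and flip
    one randomly chosen bit of the (hypothetical) register value there.

    Returns (step_index, bit, corrupted_value), or None when this call has
    fewer critical instructions than the warm-up call did.
    """
    seen = 0
    for step, mnemonic in enumerate(trace):
        if mnemonic in CRITICAL_MNEMONICS:
            seen += 1
            if seen == m:
                bit = rng.randrange(width)
                return step, bit, flip_bit(reg_value, bit, width)
    return None


rng = random.Random(7)  # seeded, mirroring the --seed parameter
warmup_trace = ["push", "mov", "add", "cmp", "jne", "movdqu", "ret"]
m = pick_critical_index(count_critical(warmup_trace), rng)
result = corrupt_mth_critical(warmup_trace, m, 0x1234, rng)
```

The `None` return reflects why the warm-up count is only an approximation: a later call of target_fn may take a different path and execute fewer critical instructions than the first call.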

Differential Revision: D107999835

Hui Xiao added 2 commits June 15, 2026 23:46
…acebook#14852)

Summary:

Detection layer of the CPU corruption injector (coming up). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/flush/compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (an error caught by a status or checksum). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit with status 1. A startup guard requires `--threads=1` and all fault injection off, so that the read-back is single-writer and the only corruption present is the injected one.
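The classification described above can be sketched as a small decision function. This is a hedged illustration in Python, not db_stress's C++ implementation: the function name, the `None`-for-absent convention, and the exact meaning given to `resurrected` and `wrong-value` (inferred from their names) are assumptions.

```python
from typing import Optional


def classify(expected: Optional[bytes],
             from_db: Optional[bytes],
             status: str) -> Optional[str]:
    """Classify one read-back result against the expected-values model.

    expected / from_db are None when the key should be / is absent.
    status is "OK", "NotFound", or any other string, which is treated as an
    error surfaced by the read (for example a checksum failure).
    Returns the finding kind, or None when the read-back matches.
    """
    if status not in ("OK", "NotFound"):
        # The read itself reported the problem: not silent.
        return "detected-corruption"
    if status == "NotFound":
        from_db = None
    if expected == from_db:
        return None
    if from_db is None:
        return "lost"          # expected a value, DB has none
    if expected is None:
        return "resurrected"   # expected absence, DB returned a value
    return "wrong-value"       # both present but different
```

For instance, `classify(b"v1", None, "NotFound")` yields `"lost"`, which corresponds to the first sample finding in the test plan below.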

**Test plan:**
1. Startup guard rejects misconfiguration:
```
--threads=2           -> exit 1: "--verify_cpu_corruption_dir requires --threads=1"
--read_fault_one_in=5 -> exit 1: "requires all fault injection off"
```
2. No false positive (clean CORE preset run, no injection):
```
$ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0) ...
exit 0; no data_corruption.<tid>.json produced; "Verification successful"
```
3. Write-path CPU corruption injection (coming up; e.g., gdb flips a register inside MemTable::Add), after which the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`:

silent data corruption -- write returned OK but the key is gone on read-back:
```
{"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"}
```
detected corruption -- read-back Get returns Corruption via the memtable per-key checksum:
```
{"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed."}
```
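For later analysis, findings like the two above could be tallied per run directory. The following is a rough sketch only: it assumes each `data_corruption.<tid>.json` file holds one JSON object per non-empty line (a single-object file is the one-line case), and `load_findings` and `summarize` are hypothetical helper names, not part of this PR.

```python
import json
from collections import Counter
from pathlib import Path

# Kinds the summary above groups under silent data corruption.
SILENT_KINDS = {"lost", "resurrected", "wrong-value"}


def load_findings(run_dir):
    """Read every data_corruption.<tid>.json under run_dir.

    Assumption: one JSON object per non-empty line.
    """
    findings = []
    for path in sorted(Path(run_dir).glob("data_corruption.*.json")):
        for line in path.read_text().splitlines():
            line = line.strip()
            if line:
                findings.append(json.loads(line))
    return findings


def summarize(findings):
    """Count findings by kind and split silent from detected corruption."""
    by_kind = Counter(finding["kind"] for finding in findings)
    silent = sum(n for kind, n in by_kind.items() if kind in SILENT_KINDS)
    return {
        "by_kind": dict(by_kind),
        "silent": silent,
        "detected": by_kind.get("detected-corruption", 0),
    }
```

Calling `summarize(load_findings("<rundir>"))` on an empty or clean run directory returns zero counts, matching the "no file produced" outcome of the no-false-positive case.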

Differential Revision: D107999834
@meta-cla meta-cla Bot added the CLA Signed label Jun 16, 2026
meta-codesync Bot commented Jun 16, 2026

@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107999835.

@github-actions

✅ clang-tidy: No findings on changed lines

Completed in 98.2s.
