Add db_stress liveness profile and fix lost flush request by xingbowang · Pull Request #14830 · facebook/rocksdb

xingbowang · 2026-06-07T22:20:11Z

Summary

Add a db_crashtest liveness mode that runs db_stress with a watchdog for
foreground progress. The profile disables fault injection and uses longer runs
so failures are more likely to represent real forward-progress bugs instead of
expected injected I/O errors. The watchdog tracks completed operations, active
operation type/start time, and per-operation completion counts to distinguish
stuck writes from long admin operations.

Bug: The new mode exposed a latent flush scheduling bug. A flush request can be queued
with an effective max memtable id, then a newer immutable memtable can be added
before the background flush picks work. PickMemtablesToFlush() could pick only
the older memtables and still clear flush_requested_. A later queued flush for
the newer memtable would then be skipped because the column family no longer
looked flush-pending, leaving the immutable memtable permanently unflushed. In
stress this surfaced as IngestExternalFile() waiting in WaitForFlushMemTables(),
with writes eventually blocked behind the same unflushed memtable.

Fix: Keep flush_requested_ set until all unstarted immutable memtables have been
picked. Add a direct MemTableList regression for the partial-pick case and a
DB-level regression covering the ingest/get-live-files ordering that reproduced
the lost request.

Testing

CI
Local liveness mode 32 full 20-minute runs passed, 0 failures

A flush request can be capped at an older memtable while newer immutable memtables are added. PickMemtablesToFlush() could pick only part of the pending list and clear flush_requested_, causing the newer queued flush request to be skipped and leaving an immutable memtable unflushed. Keep flush_requested_ until all unstarted immutable memtables have been picked, and add unit and DB-level regressions for the lost-request case.

github-actions · 2026-06-07T22:25:28Z

⚠️ clang-tidy: 2 warning(s) on changed lines

Completed in 1640.4s.

Summary by check

Check	Count
`cppcoreguidelines-pro-type-member-init`	1
`cppcoreguidelines-special-member-functions`	1
Total	2

Details

db/blob/blob_gen2_format.cc (1 warning(s))

db/blob/blob_gen2_format.cc:90:3: warning: uninitialized record type: 'trailer' [cppcoreguidelines-pro-type-member-init]

db_stress_tool/db_stress_test_base.cc (1 warning(s))

db_stress_tool/db_stress_test_base.cc:61:7: warning: class 'ScopedThreadOperation' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator [cppcoreguidelines-special-member-functions]

github-actions · 2026-06-07T23:03:52Z

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 00526b6

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-06-07T23:19:57Z

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 00526b6

Summary

This PR adds an opt-in liveness watchdog to db_stress and fixes a real bug where flush_requested_ was prematurely cleared after partial memtable picks, causing lost flush requests. The flush fix is correct and well-tested. The watchdog is a useful stress-test diagnostic addition with a sound design. Two medium-severity issues were identified around torn-read false positives in the watchdog and potential flakiness in one integration test.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

No high-severity findings.

🟡 MEDIUM

M1. Torn read in watchdog could produce false timeout on `started_micros == 0` -- `db_stress_common.cc:HasStuckWriteOperation`

Issue: GetThreadOperationSnapshot reads active_type (acquire) then started_micros (relaxed) as two separate atomic loads. A concurrent ClearOperation stores active_type = kNone (release) then started_micros = 0 (relaxed). If the watchdog reads a stale active_type (e.g., kWrite) but the NEW started_micros = 0, then ElapsedMicros(now, 0) produces a value equal to now (potentially billions of microseconds), which would exceed any timeout.
Root cause: The code in HasStuckWriteOperation checks snapshot.started_micros != 0 before comparing elapsed time, which guards against the started_micros == 0 case. Similarly, HasTimedOutNonWriteOperation checks snapshot.started_micros == 0 via the continue guard. The torn read is safe for timeout detection due to these guards.
Suggested fix: No code change needed. Consider adding a brief comment explaining why the != 0 check is the critical safety guard against torn reads.

M2. Integration test uses sleep-based polling -- `external_sst_file_test.cc:IngestFlushRequestNotLostBehindQueuedFlush`

Issue: The test uses env_->SleepForMicroseconds in busy-wait loops. CLAUDE.md states: "Don't use sleep to wait for certain events to happen."
Root cause: SyncPoints are already used for the critical coordination. The sleeps are polling between syncpoint-signaled states with a 5-second timeout + FAIL() safety net.
Suggested fix: Borderline acceptable but could be made more robust with additional SyncPoint dependencies instead of sleep polling.

M3. `abort_and_resume_compactions_one_in` default more aggressive -- `db_crashtest.py`

Issue: Changed from [10000, 1000000] to [0, 1000, 10000]. Value 1000 is quite aggressive.
Root cause: New gating logic (TryBeginAbortAndResumeCompactions, ShouldAbortAndResumeCompactions checking write stall) prevents starvation.
Suggested fix: Monitor CI for write stall regressions. No code change needed.

🟢 LOW / NIT

L1. `TerminateWithoutMutex()` -- acceptable for stress test diagnostic tool

L2. 3 atomic ops per operation in `ScopedThreadOperation` destructor -- negligible for stress test

L3. `LivenessTrackingEnabled()` reads FLAGS globals per call -- could cache but negligible

L4. False sharing in `ThreadOperationState` array -- negligible for typical thread counts

Cross-Component Analysis

The flush_requested_ persistence change was initially flagged as HIGH severity by multiple review agents due to concerns about infinite scheduling loops and stuck state. After tracing the full call chain through EnqueueFlushRequest (guarded by queued_for_flush()), BackgroundFlush (filters by IsFlushPending()), and PickMemtablesToFlush (guarded by flush_in_progress_), the concerns were disproven. The change is correct and the simplified condition num_flush_not_started_ == 0 is the natural termination criterion.

The watchdog's TryLock-based shutdown is safe: either progress resumes and the lock is acquired, or the watchdog terminates the process via TerminateWithoutMutex().

Positive Observations

flush_requested_ fix is elegant -- removes tracking variable while fixing a real concurrency bug
ScopedThreadOperation RAII is well-designed with correct FinishAction semantics
Abort/resume compaction gating prevents starvation with CAS + success counting
Comprehensive Python test coverage for the new liveness mode
Good regression tests (FlushRequestPersistsAfterPartialPick, IngestFlushRequestNotLostBehindQueuedFlush)

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

meta-codesync · 2026-06-08T16:14:59Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D107895052.

github-actions · 2026-06-08T18:17:31Z

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit e4d90c4

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-06-08T18:29:58Z

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit e4d90c4

Summary

This PR fixes a real and well-diagnosed flush scheduling bug where flush_requested_ was prematurely cleared on the non-atomic-flush path, and adds a liveness watchdog to db_stress. The flush fix is correct and well-tested. The liveness watchdog is a substantial but well-structured addition to the stress test infrastructure. No high-severity issues found.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. Integration test uses sleep-based polling -- `db/external_sst_file_test.cc:1830-1857`

Issue: IngestFlushRequestNotLostBehindQueuedFlush uses env_->SleepForMicroseconds(1000) in polling loops and a 500-iteration * 10ms wait loop. CLAUDE.md guidelines say "Don't use sleep to wait for certain events to happen. This will cause test to be flaky. Instead, use sync point to synchronize thread progress."
Root cause: The test needs to wait for ingest to reach its flush-wait point, then wait for ingest to complete after unblocking. The first wait uses a sync point (FlushMemTable:BeforeWaitForBgFlush), which is good. The second wait (lines 1849-1855) polls ingest_done with sleep -- this is the flaky pattern.
Suggested fix: Use a condition variable or sync point callback to signal ingest_done rather than polling with sleep. If polling is truly needed, the current 5-second total timeout (500 * 10ms) seems reasonable for CI but could be tightened with a sync point.

M2. No unit tests for PushOperation/PopOperation/GetThreadOperationSnapshot -- `db_stress_tool/db_stress_shared_state.h`

Issue: The lock-free per-thread operation stack with atomic depth/frame tracking has no dedicated unit tests. The only coverage is through liveness mode integration in the Python harness.
Root cause: This is test infrastructure code, so it gets less test attention, but the atomic memory ordering is non-trivial.
Suggested fix: Consider adding a unit test that exercises concurrent push/pop/snapshot from multiple threads to validate the memory ordering guarantees, especially the torn-read scenario documented in the GetThreadOperationSnapshot comment.

M3. No unit tests for abort/resume compaction serialization -- `db_stress_tool/db_stress_test_base.cc:3588-3666`

Issue: The new TryBeginAbortAndResumeCompactions, ShouldAbortAndResumeCompactions, and MarkAbortAndResumeCompactions logic adds non-trivial gating with compaction count tracking and write stall avoidance, but has no dedicated test coverage.
Root cause: This is stress test infrastructure.
Suggested fix: At minimum, add a comment explaining the invariants. Ideally add a simple unit test for the CAS serialization and the backoff logic.

M4. `IncFinishedOps` contention on multi-threaded workloads -- `db_stress_tool/db_stress_shared_state.h`

Issue: Every operation completion increments a single std::atomic<uint64_t> finished_ops_ with fetch_add. In a 32+ thread stress test, this is a contended cache line. While relaxed ordering helps, this is still a potential scalability bottleneck.
Root cause: Simple counter design for diagnostic purposes.
Suggested fix: This is acceptable for a test tool. If contention becomes visible in profiling, consider per-thread counters aggregated lazily. No action needed now.

🟢 LOW / NIT

L1. `ScopedThreadOperation` always calls `CompletedOpForDiagnostics` in destructor -- `db_stress_tool/db_stress_test_base.cc:67`

Issue: The destructor calls CompletedOpForDiagnostics(type_) for every scope exit, including the kPrepare type which wraps the entire iteration. The IncCompletedOpForDiagnostics function explicitly skips kPrepare (line 307 in shared_state.h), so this is harmless, but the unconditional call is slightly misleading.
Suggested fix: NIT -- no action needed, the filter is correct.

L2. `StressOperationType::kSnapshot` used for both acquire and release -- `db_stress_tool/db_stress_test_base.cc:1660,1665`

Issue: Both TestAcquireSnapshot and MaybeReleaseSnapshots use StressOperationType::kSnapshot. This means the watchdog can't distinguish between a stuck acquire and a stuck release. Minor diagnostic clarity issue.
Suggested fix: Consider splitting into kAcquireSnapshot and kReleaseSnapshot if finer diagnostics are needed.

L3. Updated test assertion in `FlushPendingTest` is correct but comment is stale -- `db/memtable_list_test.cc:889-894`

Issue: Lines 893-894 changed from ASSERT_FALSE to ASSERT_TRUE, correctly reflecting the new behavior. However, the preceding comment at line 885 ("Pick tables to flush... no table will be selected in this case") doesn't mention that flush_requested_ now persists.
Suggested fix: Add a brief comment explaining why IsFlushPending() and HasFlushRequested() remain true after the empty pick.

L4. Python `execute_cmd` signature change adds `expected_to_timeout=True` default -- `tools/db_crashtest.py:1979`

Issue: The new parameter expected_to_timeout=True preserves backwards compatibility for existing callers (blackbox/whitebox), which pass only 2-3 args. The liveness caller passes False. This is clean.
Suggested fix: None needed.

L5. `TerminateWithoutMutex()` should document when it's safe to use -- `db_stress_tool/db_stress_shared_state.h:631-635`

Issue: The method comments explain WHY it exists but not the safety contract. Since std::terminate() is thread-safe by spec, this is fine, but a note that it's safe to call from any thread context would help future readers.
Suggested fix: NIT -- add a brief note about thread safety.

Cross-Component Analysis

Flush Fix: Alternative Execution Contexts

Context	Behavior Changed?	Analysis
Atomic flush (`atomic_flush_seqno_ != kMaxSequenceNumber`)	No	Old: `if (false \|\| num_flush_not_started_ == 0)`. New: same condition. Identical.
Non-atomic flush	Yes (fix)	Old: always cleared `flush_requested_`. New: only clears when `num_flush_not_started_ == 0`. This is the bug fix.
MemPurge	Compatible	`BackgroundFlush:3730` calls `FlushRequested()` for mempurge CFs, which re-sets the flag. The fix doesn't interfere.
RollbackMemtableFlush	Compatible	Rollback increments `num_flush_not_started_`. With fix, `flush_requested_` may already be true, ensuring re-scheduling.
WritePreparedTxnDB	No interaction	`flush_requested_` is a scheduling flag, not a visibility control.
ReadOnly DB	No interaction	No flushes scheduled.
User-defined timestamps	Compatible	UDT reschedule logic (line 3692) checks separately.

Busy-Loop Analysis

Claim: "flush_requested_ staying true won't cause busy-looping"

Verified correct:

PickMemtablesToFlush only picks memtables with !flush_in_progress_. Each pick decrements num_flush_not_started_.
When num_flush_not_started_ == 0, flush_requested_ is cleared and imm_flush_needed is set to false.
EnqueuePendingFlush for non-atomic flush checks !cfd->queued_for_flush(), preventing duplicate enqueue while a flush is in-flight.
After flush completes, PopFirstFromFlushQueue clears queued_for_flush. Then MaybeScheduleFlushOrCompaction processes the queue.
The remaining unstarted memtable will be picked up when a new flush is enqueued (e.g., by the write path or another explicit flush request that sees IsFlushPending() == true).

No infinite loop is possible because each flush iteration makes progress (picks at least one memtable or finds nothing to pick when all are in-progress/done).

Re-scheduling Path for Remaining Memtable

After the first partial flush completes:

queued_for_flush is cleared for the CF (line 3466)
flush_requested_ is still true (the fix)
Any subsequent call to EnqueuePendingFlush for this CF will succeed because !queued_for_flush() and IsFlushPending() both return appropriate values
The remaining memtable gets flushed when the next flush trigger fires (write path switch, explicit flush, or another GetLiveFilesStorageInfo call)

Positive Observations

Well-diagnosed bug: The PR clearly explains the race condition with a concrete reproduction scenario (IngestExternalFile + GetLiveFilesStorageInfo).
Minimal flush fix: The fix is a 3-line deletion + 5-line comment change. The minimal diff reduces risk.
Both unit and integration tests: The flush fix has both a direct MemTableList unit test (FlushRequestPersistsAfterPartialPick) and a DB-level integration test (IngestFlushRequestNotLostBehindQueuedFlush).
Liveness watchdog is well-gated: Zero overhead when disabled (default). The LivenessTrackingEnabled() check prevents any performance impact on normal stress test runs.
RAII operation tracking: ScopedThreadOperation ensures operations are always popped, preventing stuck-operation false positives from code that returns early.
Abort/resume compaction hardening: The new gating logic (serialization via CAS, compaction count tracking, write stall avoidance) addresses real issues where aggressive random abort/resume could starve compactions and cause secondary write stalls.
Comprehensive Python test coverage: The db_crashtest_test.py additions thoroughly test parameter generation, command flag passing, timeout handling, and mode composability.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

…est_liveness_bug # Conflicts: # db_stress_tool/db_stress_test_base.cc # tools/db_crashtest.py # tools/db_crashtest_test.py

meta-codesync · 2026-06-24T17:05:22Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D107895052.

meta-codesync · 2026-06-26T12:36:39Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D107895052.

github-actions · 2026-06-26T16:42:56Z

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 5b27079

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-06-26T16:48:39Z

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 5b27079

Summary

This PR has two parts: (1) a correctness fix for a lost flush request bug in MemTableList::PickMemtablesToFlush, and (2) a new liveness watchdog for db_stress. The flush fix is minimal and correct -- it removes a flawed condition that prematurely cleared flush_requested_ during partial picks, which could leave immutable memtables permanently unflushed. The liveness watchdog is well-structured with proper atomic ordering and a clean RAII-based operation tracking design. The TestAbortAndResumeCompactions hardening (compaction-progress gating, sleep cap, write-stall awareness) is a thoughtful improvement.

High-severity findings (0):
No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. `ScopedThreadOperation` is not move/copy-safe -- `db_stress_test_base.cc`

Issue: ScopedThreadOperation stores a raw pointer to ThreadState and a pushed_ flag, but has no deleted copy/move constructors. If accidentally copied or moved (e.g., stored in a container, returned from a function, or captured by value in a lambda), the destructor could double-pop or pop from a destroyed object.
Root cause: Missing = delete for copy/move constructors and assignment operators.
Suggested fix: Add ScopedThreadOperation(const ScopedThreadOperation&) = delete; ScopedThreadOperation& operator=(const ScopedThreadOperation&) = delete; to the class.

M2. `FlushPendingTest` changed assertions may mask future regressions -- `db/memtable_list_test.cc:893-894`

Issue: The existing FlushPendingTest changed two assertions from ASSERT_FALSE to ASSERT_TRUE for IsFlushPending() and HasFlushRequested() after an empty pick (max_memtable_id=4, but only memtable with ID=5 exists). This is correct for the new behavior, but the test comment at line 885 ("Pick tables to flush. The tables to pick must have ID smaller than or equal to 4. Therefore, no table will be selected in this case.") doesn't explain WHY flush_requested_ should remain set.
Root cause: Comment doesn't reflect the new semantic.
Suggested fix: Add a brief comment explaining that flush_requested_ persists because memtable ID=5 is still unstarted.

🟢 LOW / NIT

L1. `abort_and_resume_compactions_one_in` default changed for all modes -- `tools/db_crashtest.py:254`

Issue: The default for abort_and_resume_compactions_one_in was changed from [10000, 1000000] to [0, 1000, 10000] in default_params, which affects ALL test modes (blackbox, whitebox), not just liveness. The value 0 means 33% chance of disabling abort/resume entirely, and 1000 makes it much more aggressive. This is a deliberate change per the commit message ("C++ abort/resume path is additionally gated on completed compaction progress"), but it's a significant behavioral change for existing test modes.
Suggested fix: Verify this is intentional for all modes, not just liveness. If only liveness needs the more aggressive setting, move it to liveness_default_params.

L2. `TerminateWithoutMutex` bypasses stack-trace serialization -- `db_stress_shared_state.h:638`

Issue: SafeTerminate holds the mutex to prevent interleaved stack traces from multiple threads calling std::terminate simultaneously. TerminateWithoutMutex bypasses this, which could produce garbled output. The comment explains the tradeoff (watchdog must terminate even when the mutex holder is stuck), which is reasonable.
Suggested fix: No change needed -- the comment documents the tradeoff. Consider adding a brief note that output may be garbled.

L3. `kSnapshot` operation type used for both acquire and release -- `db_stress_test_base.cc:1811,1817`

Issue: Both TestAcquireSnapshot and MaybeReleaseSnapshots are wrapped with StressOperationType::kSnapshot. This means the watchdog cannot distinguish between a stuck snapshot acquisition and a stuck release. For diagnostics this is minor but could be improved.
Suggested fix: Consider separate operation types, or leave as-is since both are fast operations.

L4. `liveness_watchdog_thread` local variable lifetime -- `db_stress_driver.cc:141-143`

Issue: liveness_watchdog_thread is a local ThreadState on the stack. The thread is started with raw_env->StartThread(LivenessWatchdogThread, &liveness_watchdog_thread). The local variable must remain alive until the thread completes. This is safe because the bg-thread wait loop at the end of RunStressTestImpl ensures all bg threads finish before the function returns, but it follows the same pattern as continuous_verification_thread and bg_thread.
Suggested fix: No change needed -- follows existing pattern.

L5. Missing `ScopedThreadOperation` for `TestAbortAndResumeCompactions` at call site -- `db_stress_test_base.cc:1637-1639`

Issue: Unlike other admin operations that get a ScopedThreadOperation at the call site, TestAbortAndResumeCompactions creates the ScopedThreadOperation inside the function body (after gating checks). This means if the gating check returns early, no operation is tracked. This is intentional (the gated-out path is a no-op), but inconsistent with other admin ops.
Suggested fix: No change needed -- the internal placement is actually better since it avoids tracking no-op calls.

Cross-Component Analysis

Context	Does code execute?	Assumptions hold?	Action needed?
Atomic flush	YES	YES -- the fix makes `flush_requested_` clearing stricter, which is safe for atomic flush	None
Non-atomic flush	YES	YES -- this was the buggy path; now fixed	None
MemPurge	YES	YES -- mempurge adds memtables with older IDs; `num_flush_not_started_` tracking is independent of ID ordering	None
WritePreparedTxnDB	N/A (flush scheduling is orthogonal)	N/A	None
ReadOnly DB	N/A (no flush)	N/A	None

Flush fix assumption stress test:

Claim: "flush_requested_ should persist until all unstarted memtables are picked"
Precondition 1: num_flush_not_started_ accurately tracks unstarted memtables. Verified: incremented in Add(), decremented in PickMemtablesToFlush() and RollbackMemtableFlush(), with assertions guarding consistency.
Precondition 2: PickMemtablesToFlush will eventually be called with a high enough max_memtable_id. Verified: IsFlushPending() returns true while flush_requested_ && num_flush_not_started_ > 0, causing SchedulePendingFlush to enqueue work, which calls FlushJob::PickMemTable().
No counterexample found where flush_requested_ would be stuck indefinitely -- the feedback loop through IsFlushPending() -> scheduling -> picking ensures progress.

Positive Observations

The flush fix is minimal and surgical -- only the condition for clearing flush_requested_ changed, with clear comments explaining the rationale.
The new FlushRequestPersistsAfterPartialPick unit test directly covers the bug scenario with clean setup and assertions.
The IngestFlushRequestNotLostBehindQueuedFlush integration test is thorough with proper timeout guards and cleanup paths to avoid hanging CI.
The liveness watchdog uses a stack-based operation tracking design that handles nested operations gracefully.
The TryBeginAbortAndResumeCompactions gating pattern (CAS + compaction progress check + write stall awareness) is a well-designed solution to prevent compaction starvation from aggressive abort/resume cycling.
The Python test infrastructure changes are well-tested with dedicated unit tests for parameter layering, timeout handling, and fault injection override behavior.
Memory ordering on atomics is appropriate throughout -- relaxed for counters, release/acquire for operation stack depth and type.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

meta-codesync · 2026-06-26T17:37:32Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D107895052.

xingbowang added 2 commits June 6, 2026 06:18

Add db_stress liveness watchdog

29b4921

meta-cla Bot added the CLA Signed label Jun 7, 2026

xingbowang marked this pull request as draft June 7, 2026 22:20

xingbowang changed the title ~~Add db_stress liveness watchdog and fix lost flush request~~ Add db_stress liveness profile and fix lost flush request Jun 8, 2026

refine scope tracking

e4d90c4

xingbowang marked this pull request as ready for review June 8, 2026 16:14

Merge remote-tracking branch 'upstream/main' into 2026_05_28_stress_t…

c948595

…est_liveness_bug # Conflicts: # db_stress_tool/db_stress_test_base.cc # tools/db_crashtest.py # tools/db_crashtest_test.py

address feedback

1937126

fix unit test status check

5b27079

Uh oh!

Conversation

xingbowang commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Uh oh!

github-actions Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ clang-tidy: 2 warning(s) on changed lines

Summary by check

Details

Uh oh!

github-actions Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codex Code Review - OBSOLETE

🟡 Codex Code Review

Uh oh!

github-actions Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review - OBSOLETE

✅ Claude Code Review

Summary

Findings

🔴 HIGH

🟡 MEDIUM

M1. Torn read in watchdog could produce false timeout on started_micros == 0 -- db_stress_common.cc:HasStuckWriteOperation

M2. Integration test uses sleep-based polling -- external_sst_file_test.cc:IngestFlushRequestNotLostBehindQueuedFlush

M3. abort_and_resume_compactions_one_in default more aggressive -- db_crashtest.py

🟢 LOW / NIT

L1. TerminateWithoutMutex() -- acceptable for stress test diagnostic tool

L2. 3 atomic ops per operation in ScopedThreadOperation destructor -- negligible for stress test

L3. LivenessTrackingEnabled() reads FLAGS globals per call -- could cache but negligible

L4. False sharing in ThreadOperationState array -- negligible for typical thread counts

Cross-Component Analysis

Positive Observations

Uh oh!

meta-codesync Bot commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codex Code Review - OBSOLETE

🟡 Codex Code Review

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review - OBSOLETE

✅ Claude Code Review

Summary

Findings

🔴 HIGH

🟡 MEDIUM

M1. Integration test uses sleep-based polling -- db/external_sst_file_test.cc:1830-1857

M2. No unit tests for PushOperation/PopOperation/GetThreadOperationSnapshot -- db_stress_tool/db_stress_shared_state.h

M3. No unit tests for abort/resume compaction serialization -- db_stress_tool/db_stress_test_base.cc:3588-3666

M4. IncFinishedOps contention on multi-threaded workloads -- db_stress_tool/db_stress_shared_state.h

🟢 LOW / NIT

L1. ScopedThreadOperation always calls CompletedOpForDiagnostics in destructor -- db_stress_tool/db_stress_test_base.cc:67

L2. StressOperationType::kSnapshot used for both acquire and release -- db_stress_tool/db_stress_test_base.cc:1660,1665

L3. Updated test assertion in FlushPendingTest is correct but comment is stale -- db/memtable_list_test.cc:889-894

L4. Python execute_cmd signature change adds expected_to_timeout=True default -- tools/db_crashtest.py:1979

L5. TerminateWithoutMutex() should document when it's safe to use -- db_stress_tool/db_stress_shared_state.h:631-635

Cross-Component Analysis

Flush Fix: Alternative Execution Contexts

Busy-Loop Analysis

Re-scheduling Path for Remaining Memtable

Positive Observations

Uh oh!

meta-codesync Bot commented Jun 24, 2026

Uh oh!

meta-codesync Bot commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

🟡 Codex Code Review

Uh oh!

github-actions Bot commented Jun 26, 2026

✅ Claude Code Review

Summary

Findings

🔴 HIGH

xingbowang commented Jun 7, 2026 •

edited

Loading

github-actions Bot commented Jun 7, 2026 •

edited

Loading

github-actions Bot commented Jun 7, 2026 •

edited

Loading

github-actions Bot commented Jun 7, 2026 •

edited

Loading

M1. Torn read in watchdog could produce false timeout on `started_micros == 0` -- `db_stress_common.cc:HasStuckWriteOperation`

M2. Integration test uses sleep-based polling -- `external_sst_file_test.cc:IngestFlushRequestNotLostBehindQueuedFlush`

M3. `abort_and_resume_compactions_one_in` default more aggressive -- `db_crashtest.py`

L1. `TerminateWithoutMutex()` -- acceptable for stress test diagnostic tool

L2. 3 atomic ops per operation in `ScopedThreadOperation` destructor -- negligible for stress test

L3. `LivenessTrackingEnabled()` reads FLAGS globals per call -- could cache but negligible

L4. False sharing in `ThreadOperationState` array -- negligible for typical thread counts

github-actions Bot commented Jun 8, 2026 •

edited

Loading

github-actions Bot commented Jun 8, 2026 •

edited

Loading

M1. Integration test uses sleep-based polling -- `db/external_sst_file_test.cc:1830-1857`

M2. No unit tests for PushOperation/PopOperation/GetThreadOperationSnapshot -- `db_stress_tool/db_stress_shared_state.h`

M3. No unit tests for abort/resume compaction serialization -- `db_stress_tool/db_stress_test_base.cc:3588-3666`

M4. `IncFinishedOps` contention on multi-threaded workloads -- `db_stress_tool/db_stress_shared_state.h`

L1. `ScopedThreadOperation` always calls `CompletedOpForDiagnostics` in destructor -- `db_stress_tool/db_stress_test_base.cc:67`

L2. `StressOperationType::kSnapshot` used for both acquire and release -- `db_stress_tool/db_stress_test_base.cc:1660,1665`

L3. Updated test assertion in `FlushPendingTest` is correct but comment is stale -- `db/memtable_list_test.cc:889-894`

L4. Python `execute_cmd` signature change adds `expected_to_timeout=True` default -- `tools/db_crashtest.py:1979`

L5. `TerminateWithoutMutex()` should document when it's safe to use -- `db_stress_tool/db_stress_shared_state.h:631-635`

M1. `ScopedThreadOperation` is not move/copy-safe -- `db_stress_test_base.cc`

M2. `FlushPendingTest` changed assertions may mask future regressions -- `db/memtable_list_test.cc:893-894`

L1. `abort_and_resume_compactions_one_in` default changed for all modes -- `tools/db_crashtest.py:254`

L2. `TerminateWithoutMutex` bypasses stack-trace serialization -- `db_stress_shared_state.h:638`

L3. `kSnapshot` operation type used for both acquire and release -- `db_stress_test_base.cc:1811,1817`

L4. `liveness_watchdog_thread` local variable lifetime -- `db_stress_driver.cc:141-143`

L5. Missing `ScopedThreadOperation` for `TestAbortAndResumeCompactions` at call site -- `db_stress_test_base.cc:1637-1639`