Skip to content

Fix crash test Restore() failure with empty trace when replay_write_ops == 0 (#14840)#14840

Open
anand1976 wants to merge 1 commit into
facebook:mainfrom
anand1976:export-D107968797
Open

Fix crash test Restore() failure with empty trace when replay_write_ops == 0 (#14840)#14840
anand1976 wants to merge 1 commit into
facebook:mainfrom
anand1976:export-D107968797

Conversation

@anand1976

@anand1976 anand1976 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary:
ExpectedState::Restore() always attempted to open and replay the post-saved_seqno_ trace. In the crash-test failure, recovery had already returned the DB exactly to saved_seqno_, so replay_write_ops == 0 and the trace was legitimately empty. FileTraceReader reported Status::Incomplete() on that empty trace, which incorrectly failed restore.

Skip the trace replay work when replay_write_ops == 0, but still run the normal state promotion, cleanup, and saved_seqno_ reset path so later saves continue to work.

Differential Revision: D107968797

@meta-cla meta-cla Bot added the CLA Signed label Jun 9, 2026
@meta-codesync

meta-codesync Bot commented Jun 9, 2026

Copy link
Copy Markdown

@anand1976 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107968797.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 44.0s.

@meta-codesync meta-codesync Bot changed the title Fix crash test Restore() failure with empty trace when replay_write_ops == 0 Fix crash test Restore() failure with empty trace when replay_write_ops == 0 (#14840) Jun 9, 2026
anand1976 added a commit to anand1976/rocksdb that referenced this pull request Jun 9, 2026
…ps == 0 (facebook#14840)

Summary: Pull Request resolved: facebook#14840

Differential Revision: D107968797
@anand1976 anand1976 force-pushed the export-D107968797 branch from a4e4ab3 to 88a1016 Compare June 9, 2026 17:15
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 88a1016


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

…ps == 0 (facebook#14840)

Summary:
`ExpectedState::Restore()` always attempted to open and replay the post-`saved_seqno_` trace. In the crash-test failure, recovery had already returned the DB exactly to `saved_seqno_`, so `replay_write_ops == 0` and the trace was legitimately empty. `FileTraceReader` reported `Status::Incomplete()` on that empty trace, which incorrectly failed restore.

Skip the trace replay work when `replay_write_ops == 0`, but still run the normal state promotion, cleanup, and `saved_seqno_` reset path so later saves continue to work.


Differential Revision: D107968797
@anand1976 anand1976 force-pushed the export-D107968797 branch from 88a1016 to b00624a Compare June 9, 2026 18:54
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit b00624a


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit b00624a


Summary

Clean, well-scoped fix for a legitimate crash test failure. The change correctly skips trace replay when replay_write_ops == 0 (DB recovered exactly to saved_seqno_), while preserving all state promotion, cleanup, and saved_seqno_ reset logic. No correctness issues found.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Scattered guards could be simplified with early block skip -- expected_state.cc:1114
  • Issue: The fix adds replay_write_ops > 0 guards at 5 separate locations throughout the replay section. This works correctly but increases cognitive load and maintenance burden. A single if (replay_write_ops > 0) { ... } block wrapping the entire replay section (handler creation, replayer setup, replay loop, completion check) would be cleaner while preserving the same state-promotion/cleanup flow below.
  • Root cause: The replay logic and state-promotion logic are interleaved in a single scope rather than being separated into distinct blocks.
  • Suggested fix: Wrap lines 1147-1193 in a single if (replay_write_ops > 0) { ... } block instead of adding the condition to each individual if statement. The trace_reader conditional on line 1116 is already naturally separate and can remain as-is.
M2. No dedicated test coverage for replay_write_ops == 0 path -- expected_state.cc:1081
  • Issue: There are no unit tests for FileExpectedStateManager::Restore(), and the replay_write_ops == 0 edge case is only exercised by db_stress under specific crash-recovery timing. The new code path introduces null handler/replayer/trace_reader pointers that are handled correctly via short-circuit evaluation and null checks, but should have explicit test coverage to prevent regressions.
  • Root cause: FileExpectedStateManager is test infrastructure that itself lacks unit tests.
  • Suggested fix: Consider adding a test that sets up a saved state at seqno N, opens a DB at seqno N, and verifies Restore() succeeds without trace replay. This is a suggestion for follow-up, not a merge blocker for a bug fix.

🟢 LOW / NIT

L1. Comment could be more precise -- expected_state.cc:1116
  • Issue: The comment says "skip opening/replaying the trace and reuse the normal state promotion and cleanup flow below." This is accurate but could note why it's safe: because replay_write_ops == 0 means no writes survived recovery past saved_seqno_, so the saved state file already matches the DB.
  • Suggested fix: Optional: extend comment to say "...the saved state already matches the DB, so no replay is needed."

Cross-Component Analysis

Context Does code execute? Assumptions hold? Action needed?
WritePreparedTxnDB YES (db_stress supports --use_txn) YES - recovered prepared txns are committed/rolled back in FinishInitDb() before SaveAtAndAfter() is called Safe
WriteUnpreparedTxnDB YES YES - same resolution path as WritePrepared Safe
ReadOnly DB NO - db_stress doesn't use ReadOnly for Restore N/A N/A
Multiple column families YES YES - sequence numbers are global Safe
User-defined timestamps YES (if configured) YES - timestamps don't affect seqno arithmetic Safe
Sequence number underflow Guarded by existing check at line 1077-1078 YES Pre-existing protection, not affected by this PR

Assumption stress-test results:

Claim Preconditions Counterexample attempted Result
"replay_write_ops == 0 means no writes survived" SaveAtAndAfter() called with DB quiesced (API contract) Race condition during SaveAtAndAfter INVALID - API contract requires no concurrent mutation
"Trace file is legitimately empty" Writes traced after save are lost if DB recovers to saved_seqno_ WAL replay bringing DB to saved_seqno_ with traced-but-lost writes SAFE - lost writes shouldn't be replayed against a DB that doesn't have them
"2PC transactions create false replay_write_ops == 0" Prepared txns resolved before SaveAtAndAfter Prepared txn auto-committed during recovery INVALID - db_stress resolves all prepared txns in FinishInitDb() before SaveAtAndAfter()
"Skipping Prepare() loses validation" Trace file not used when replay_write_ops == 0 Corrupted trace file IRRELEVANT - trace is deleted during cleanup, validation serves no purpose

Positive Observations

  • The fix correctly preserves the entire state-promotion and cleanup flow (CopyFile, RenameFile, Open, DeleteFile, saved_seqno_ reset), which is the most important invariant.
  • Short-circuit evaluation in s.ok() && replay_write_ops > 0 && !handler->IsDone() is idiomatic C++ and prevents null dereference.
  • The existing if (handler) null check at line 1195 naturally handles the no-replay case for statistics extraction.
  • The fix eliminates unnecessary file I/O when no replay is needed.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant