CORE-15146: stop re-scrubbing completed partitions in wait_for_internal_scrub #30633
travisdowns wants to merge 2 commits into
Pull request overview
This PR addresses a test timeout caused by internal scrubbing repeatedly re-enqueuing partitions that already completed a full scrub, which could starve “tail” partitions of scrub quota. It also improves cloud-storage scrub diagnostics by adding more detailed debug/trace logging in the anomalies detector.
Changes:
- Increase `cloud_storage_full_scrub_interval_ms` in `wait_for_internal_scrub` so fully scrubbed partitions don’t re-enter the rotation during the helper’s wait window.
- Add additional debug/trace logging in `cloud_storage::anomalies_detector` to capture quota inputs, progress, and exit reasons.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/rptest/services/redpanda.py | Adjusts internal-scrub test helper configuration to avoid re-scrubbing completed partitions during the wait period. |
| src/v/cloud_storage/anomalies_detector.cc | Adds instrumentation logs to help diagnose scrub quota consumption and stopping conditions. |

    # CORE-15146 experiment: set full-scrub interval well beyond
    # the helper's wait window so a partition that completes a
    # full scrub stays "done" and doesn't get re-enqueued every
    # housekeeping cycle, starving the laggards of quota.
    "cloud_storage_full_scrub_interval_ms": 86_400_000,

Reworded to drop the "experiment" framing and describe it as effectively disabling periodic re-scrubs for the helper's lifetime. Kept the value but expressed it as 24 * 60 * 60 * 1000 so it reads as "1 day" at a glance, with a note in the comment that 24h is well past any plausible test runtime so the value is effectively infinite here.
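
A minimal sketch of how the reworded setting might read. The constant name `FULL_SCRUB_INTERVAL_MS`, the `HELPER_TIMEOUT_MS` constant, and the `scrub_wait_config` function are hypothetical names used for illustration only; the config key and the 180s helper timeout come from this PR.

```python
# Hypothetical names for illustration; only the config key and the
# numeric values are taken from the PR discussion.
FULL_SCRUB_INTERVAL_MS = 24 * 60 * 60 * 1000  # one day, in milliseconds
HELPER_TIMEOUT_MS = 180 * 1000  # the helper's 180s wait window


def scrub_wait_config(full_scrub_interval_ms=FULL_SCRUB_INTERVAL_MS):
    """Return the cluster config override used while waiting for a scrub.

    A full-scrub interval far beyond the wait window effectively disables
    periodic re-scrubs for the lifetime of the helper.
    """
    return {"cloud_storage_full_scrub_interval_ms": full_scrub_interval_ms}
```

Writing the value as a product keeps the unit conversion visible, and it evaluates to the same 86_400_000 that the original diff used.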

    vlog(
      _logger.debug,
      "Failed downloading partition manifest, exiting scrub: ops={} "
      "segs={}",
      _result.ops,
      _result.segments_visited);

Good catch - added the error_outcome to the failed-manifest log line via dl_result.error(). cloud_storage::error_outcome already has an fmt::formatter (defined in src/v/cloud_storage/types.h), so it logs cleanly.
…crub
CORE-15146. The helper set cloud_storage_full_scrub_interval_ms to 10s,
which is shorter than the housekeeping workflow's wall-clock cycle for
clusters with many partitions. As a result every partition that completed
a full scrub became eligible for re-scrubbing before the next cycle
started, the workflow re-enqueued the full partition set each cycle, the
per-cycle ops quota (5000) ran out partway through, and the 2-3
partitions at the tail received quota=0. Those tail partitions advanced
only ~1 segment per cycle and never finished within the helper's 180s
timeout.
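
The starvation described above can be reproduced with a toy model. This is a sketch under simplifying assumptions, not the scrubber's real accounting: partitions are visited in a fixed order, every segment costs the same number of ops, and a partition that arrives with no quota left still advances by one segment. The function name and all parameters are hypothetical.

```python
def cycles_until_all_scrubbed(num_partitions, segments, ops_per_segment,
                              budget, rescrub_completed, max_cycles=500):
    """Count housekeeping cycles until every partition has finished one
    full scrub. Returns None if that does not happen within max_cycles."""
    progress = [0] * num_partitions       # segments scrubbed in current pass
    done_once = [False] * num_partitions  # has finished at least one pass
    for cycle in range(1, max_cycles + 1):
        remaining = budget  # shared per-cycle ops quota
        for p in range(num_partitions):
            if progress[p] == segments:
                if not rescrub_completed:
                    continue      # long interval: stays out of the queue
                progress[p] = 0   # short interval: re-enqueued from scratch
            affordable = remaining // ops_per_segment
            # Assumed "make some progress" rule: at least one segment,
            # even when the shared quota is already exhausted.
            step = min(max(1, affordable), segments - progress[p])
            progress[p] += step
            remaining = max(0, remaining - step * ops_per_segment)
            if progress[p] == segments:
                done_once[p] = True
        if all(done_once):
            return cycle
    return None


# 10 partitions, each needing 1000 ops, against a 5000-op budget.
short_interval = cycles_until_all_scrubbed(10, 100, 10, 5000, True)
long_interval = cycles_until_all_scrubbed(10, 100, 10, 5000, False)
```

In this model the head partitions consume the whole budget every cycle when completed partitions are re-enqueued, so the tail crawls one segment at a time; when completed partitions stay out of the queue, the tail gets the full budget on the second cycle.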
Setting the full-scrub interval to 24h keeps completed partitions out of
the queue for the remainder of the wait, so the outstanding set
monotonically shrinks. The reset_scrubbing_metadata path bypasses the
configured interval via the first_scrub branch in
scrubber_scheduler::pick_next_scrub_time, so the initial scrub still
fires within jitter (100ms) after reset, regardless of the configured
base interval.
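
A rough Python model of the scheduling decision described above. Only the first-scrub behaviour (now plus jitter, ignoring the configured interval) is taken from this PR; the shape of the other branch, the function signature, and the use of `None` for a missing last-scrub timestamp are assumptions made for illustration, not the C++ implementation.

```python
import random


def pick_next_scrub_time(now_ms, last_scrub_ms, full_scrub_interval_ms,
                         jitter_ms=100, rng=random):
    """Return the time (ms) at which the next scrub should start."""
    jitter = rng.uniform(0, jitter_ms)
    if last_scrub_ms is None:
        # first_scrub branch: metadata was reset, so the configured
        # interval is bypassed and the scrub fires within jitter of now.
        return now_ms + jitter
    # Assumed shape of the regular branch: wait a full interval after
    # the previous scrub before scrubbing again.
    return last_scrub_ms + full_scrub_interval_ms + jitter
```

With a 24h interval, a partition whose metadata was just reset is still scheduled within 100ms, while a partition that has already completed a scrub is pushed out by a day.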
Validated on a 3-broker AWS CDT cluster running
streaming_cache_test@{limit_mode=objects,log_segment_size=1048576}: same
parametrization fails reliably without the change (180s scrub timeout)
and passes in 11m27s with it.
Augment existing _logger.debug lines in anomalies_detector::run and
check_manifest, and add new _logger.debug + _logger.trace lines on
previously-silent paths, to capture:
- per-call quota inputs (max_num_operations, max_num_segments,
scrub_from)
- manifest range, segment count, number of spillovers
- per-call segment visit count and final last_scrubbed_offset
- which should_stop branch fired (abort_requested / ops_quota /
segs_quota)
Used to diagnose CORE-15146 (laggard partitions getting quota=0 at the
tail of housekeeping workflow runs). Mostly augmentation of existing
log lines; the new lines are at debug or trace, so they don't
materially change the default log volume.
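
To illustrate what "which should_stop branch fired" means in the logs, here is a small Python sketch of a stop check that reports its reason. The three reason names come from the commit message; the ordering of the checks, the parameter names, and the rule that a call always visits at least one segment before a quota can stop it are assumptions, not the actual C++ logic.

```python
from typing import Optional


def should_stop_reason(abort_requested: bool, ops_used: int, max_ops: int,
                       segments_visited: int,
                       max_segments: int) -> Optional[str]:
    """Return the name of the stop branch that fired, or None to continue."""
    if abort_requested:
        return "abort_requested"
    if segments_visited == 0:
        # Assumed "make some progress" rule: never stop on quota before
        # at least one segment has been visited in this call.
        return None
    if ops_used >= max_ops:
        return "ops_quota"
    if segments_visited >= max_segments:
        return "segs_quota"
    return None
```

Under these assumptions a partition that is handed `max_ops=0` visits exactly one segment and then stops with `ops_quota`, which matches the one-segment-per-cycle crawl seen on the tail partitions.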
CORE-15146. The `wait_for_internal_scrub` helper set
`cloud_storage_full_scrub_interval_ms` to 10 seconds, shorter than one
housekeeping workflow cycle's wall-clock duration on clusters with many
partitions. As a result every partition that completed a full scrub became
eligible for re-scrubbing before the next cycle started; the workflow
re-enqueued the full partition set every cycle; the 5000-op shared budget
ran out partway through; and the 2-3 partitions at the tail of each cycle
received `quota=0`, advancing only ~1 segment apiece via the "make some
progress" clause in `anomalies_detector::should_stop`. Those tail
partitions never reached `last_complete_scrub_at` within the helper's 180s
`wait_until` budget, and the test timed out.

Instead, let us set the full scrub interval very high, so we don't rescrub
partitions more than once: the scrubber will still kick off as expected
because we reset the scrubbing metadata, which makes it kick off "now".

Two commits:
- `rptest: stop re-scrubbing completed partitions in wait_for_internal_scrub` bumps the full-scrub interval to 24h while the helper waits, so full-scrubbed partitions drop out of the rotation and the outstanding set monotonically shrinks. The `reset_scrubbing_metadata` path is unaffected because `scrubber_scheduler::pick_next_scrub_time`'s `first_scrub` branch uses `now + jitter` only - the configured base interval is bypassed when `_last_partition_scrub == missing()`.
- `cloud_storage: instrument anomalies_detector for scrub-quota diagnosis` augments existing debug log lines and adds new debug/trace lines in `anomalies_detector::run`, `check_manifest`, and `should_stop`. Captures per-call quota inputs, segment-loop progress, and which `should_stop()` exit branch fired. Used to diagnose this bug; left in for future regressions.

Validated end-to-end on a 4-node AWS CDT cluster: the same
`streaming_cache_test` parametrization that reliably fails the 180s
`wait_for_internal_scrub` timeout without the fix passes in 11m27s with
it.

Fixes CORE-15146.
Backports Required
Release Notes