CORE-15146: stop re-scrubbing completed partitions in wait_for_internal_scrub by travisdowns · Pull Request #30633 · redpanda-data/redpanda

travisdowns · 2026-05-28T18:56:59Z

CORE-15146. The wait_for_internal_scrub helper set
cloud_storage_full_scrub_interval_ms to 10 seconds, shorter than one
housekeeping workflow cycle's wall-clock duration on clusters with many
partitions. As a result every partition that completed a full scrub became
eligible for re-scrubbing before the next cycle started; the workflow
re-enqueued the full partition set every cycle; the 5000-op shared budget
ran out partway through; and the 2-3 partitions at the tail of each cycle
received quota=0, advancing only ~1 segment apiece via the "make some
progress" clause in anomalies_detector::should_stop. Those tail
partitions never reached last_complete_scrub_at within the helper's
180s wait_until budget, and the test timed out.

Instead, let use set the full scrub interval very high, so we don't rescrub
partitions more than once: the scrubber will still kick off as expected
because we reset the scrubbing metadata which makes it kick off "now".

Two commits:

rptest: stop re-scrubbing completed partitions in wait_for_internal_scrub bumps the full-scrub interval to 24h while the
helper waits, so full-scrubbed partitions drop out of the rotation and
the outstanding set monotonically shrinks. The
reset_scrubbing_metadata path is unaffected because
scrubber_scheduler::pick_next_scrub_time's first_scrub branch uses
now + jitter only - the configured base interval is bypassed when
_last_partition_scrub == missing().
cloud_storage: instrument anomalies_detector for scrub-quota diagnosis augments existing debug log lines and adds new debug/trace
lines in anomalies_detector::run, check_manifest, and
should_stop. Captures per-call quota inputs, segment-loop progress,
and which should_stop() exit branch fired. Used to diagnose this
bug; left in for future regressions.

Validated end-to-end on a 4-node AWS CDT cluster: the same
streaming_cache_test parametrization that reliably fails the 180s
wait_for_internal_scrub timeout without the fix passes in 11m27s with
it.

Fixes CORE-15146.

Backports Required

Release Notes

none

travisdowns · 2026-05-28T18:57:10Z

/ci-repeat 1 release

vbotbuildovich · 2026-05-28T20:27:03Z

CI test results

test results on build#85073

test_status	test_class	test_method	test_arguments	test_kind	job_url	passed	reason	test_history
FLAKY(PASS)	ShadowLinkConsumeGroupsMirroringTest	test_continuous_group_sync	{"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "tiered_cloud", "with_failures": true}	integration	https://buildkite.com/redpanda/redpanda/builds/85073#019e7002-106e-4332-990c-77b07073fe8f	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0002, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkConsumeGroupsMirroringTest&test_method=test_continuous_group_sync

travisdowns · 2026-05-28T21:53:42Z

/cdt
tests/rptest/scale_tests/tiered_storage_cache_stress_test.py::TieredStorageCacheStressTest.streaming_cache_test

Copilot

Pull request overview

This PR addresses a test timeout caused by internal scrubbing repeatedly re-enqueuing partitions that already completed a full scrub, which could starve “tail” partitions of scrub quota. It also improves cloud-storage scrub diagnostics by adding more detailed debug/trace logging in the anomalies detector.

Changes:

Increase cloud_storage_full_scrub_interval_ms in wait_for_internal_scrub so fully scrubbed partitions don’t re-enter the rotation during the helper’s wait window.
Add additional debug/trace logging in cloud_storage::anomalies_detector to capture quota inputs, progress, and exit reasons.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
tests/rptest/services/redpanda.py	Adjusts internal-scrub test helper configuration to avoid re-scrubbing completed partitions during the wait period.
src/v/cloud_storage/anomalies_detector.cc	Adds instrumentation logs to help diagnose scrub quota consumption and stopping conditions.

travisdowns · 2026-05-29T20:35:31Z

+                # CORE-15146 experiment: set full-scrub interval well beyond
+                # the helper's wait window so a partition that completes a
+                # full scrub stays "done" and doesn't get re-enqueued every
+                # housekeeping cycle, starving the laggards of quota.
+                "cloud_storage_full_scrub_interval_ms": 86_400_000,


Reworded to drop the "experiment" framing and describe it as effectively disabling periodic re-scrubs for the helper's lifetime. Kept the value but expressed it as 24 * 60 * 60 * 1000 so it reads as "1 day" at a glance, with a note in the comment that 24h is well past any plausible test runtime so the value is effectively infinite here.

travisdowns · 2026-05-29T20:35:39Z

+        vlog(
+          _logger.debug,
+          "Failed downloading partition manifest, exiting scrub: ops={} "
+          "segs={}",
+          _result.ops,
+          _result.segments_visited);


Good catch - added the error_outcome to the failed-manifest log line via dl_result.error(). cloud_storage::error_outcome already has an fmt::formatter (defined in src/v/cloud_storage/types.h), so it logs cleanly.

…crub CORE-15146. The helper set cloud_storage_full_scrub_interval_ms to 10s, which is shorter than the housekeeping workflow's wall-clock cycle for clusters with many partitions. As a result every partition that completed a full scrub became eligible for re-scrubbing before the next cycle started, the workflow re-enqueued the full partition set each cycle, the per-cycle ops quota (5000) ran out partway through, and the 2-3 partitions at the tail received quota=0. Those tail partitions advanced only ~1 segment per cycle and never finished within the helper's 180s timeout. Setting the full-scrub interval to 24h keeps completed partitions out of the queue for the remainder of the wait, so the outstanding set monotonically shrinks. The reset_scrubbing_metadata path bypasses the configured interval via the first_scrub branch in scrubber_scheduler::pick_next_scrub_time, so the initial scrub still fires within jitter (100ms) after reset, regardless of the configured base interval. Validated on a 3-broker AWS CDT cluster running streaming_cache_test@{limit_mode=objects,log_segment_size=1048576}: same parametrization fails reliably without the change (180s scrub timeout) and passes in 11m27s with it.

Augment existing _logger.debug lines in anomalies_detector::run and check_manifest, and add new _logger.debug + _logger.trace lines on previously-silent paths, to capture: - per-call quota inputs (max_num_operations, max_num_segments, scrub_from) - manifest range, segment count, number of spillovers - per-call segment visit count and final last_scrubbed_offset - which should_stop branch fired (abort_requested / ops_quota / segs_quota) Used to diagnose CORE-15146 (laggard partitions getting quota=0 at the tail of housekeeping workflow runs). Mostly augmentation of existing log lines; the new lines are at debug or trace, so they don't materially change the default log volume.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

github-actions Bot added the area/redpanda label May 28, 2026

travisdowns marked this pull request as ready for review May 29, 2026 14:36

Copilot AI review requested due to automatic review settings May 29, 2026 14:36

Copilot started reviewing on behalf of travisdowns May 29, 2026 14:37 View session

travisdowns requested review from Lazin and WillemKauf May 29, 2026 14:37

Copilot AI reviewed May 29, 2026

View reviewed changes

WillemKauf requested a review from oleiman May 29, 2026 16:28

travisdowns added 2 commits May 29, 2026 16:34

travisdowns force-pushed the td-scrub-issue branch from 4f510d2 to 162784c Compare May 29, 2026 20:35

travisdowns requested a review from Copilot May 29, 2026 20:38

Copilot started reviewing on behalf of travisdowns May 29, 2026 20:38 View session

Copilot AI reviewed May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CORE-15146: stop re-scrubbing completed partitions in wait_for_internal_scrub#30633

CORE-15146: stop re-scrubbing completed partitions in wait_for_internal_scrub#30633
travisdowns wants to merge 2 commits into
redpanda-data:devfrom
travisdowns:td-scrub-issue

travisdowns commented May 28, 2026 •

edited by atlassian Bot

Loading

Uh oh!

travisdowns commented May 28, 2026

Uh oh!

vbotbuildovich commented May 28, 2026

Uh oh!

travisdowns commented May 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

travisdowns May 29, 2026

Uh oh!

travisdowns May 29, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

travisdowns commented May 28, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Backports Required

Release Notes

Uh oh!

travisdowns commented May 28, 2026

Uh oh!

vbotbuildovich commented May 28, 2026

CI test results

Uh oh!

travisdowns commented May 28, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

travisdowns May 29, 2026

Choose a reason for hiding this comment

Uh oh!

travisdowns May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

travisdowns commented May 28, 2026 •

edited by atlassian Bot

Loading