Skip to content

CORE-15146: stop re-scrubbing completed partitions in wait_for_internal_scrub#30633

Open
travisdowns wants to merge 2 commits into
redpanda-data:devfrom
travisdowns:td-scrub-issue
Open

CORE-15146: stop re-scrubbing completed partitions in wait_for_internal_scrub#30633
travisdowns wants to merge 2 commits into
redpanda-data:devfrom
travisdowns:td-scrub-issue

Conversation

@travisdowns
Copy link
Copy Markdown
Member

@travisdowns travisdowns commented May 28, 2026

CORE-15146. The wait_for_internal_scrub helper set
cloud_storage_full_scrub_interval_ms to 10 seconds, shorter than one
housekeeping workflow cycle's wall-clock duration on clusters with many
partitions. As a result every partition that completed a full scrub became
eligible for re-scrubbing before the next cycle started; the workflow
re-enqueued the full partition set every cycle; the 5000-op shared budget
ran out partway through; and the 2-3 partitions at the tail of each cycle
received quota=0, advancing only ~1 segment apiece via the "make some
progress" clause in anomalies_detector::should_stop. Those tail
partitions never reached last_complete_scrub_at within the helper's
180s wait_until budget, and the test timed out.

Instead, let use set the full scrub interval very high, so we don't rescrub
partitions more than once: the scrubber will still kick off as expected
because we reset the scrubbing metadata which makes it kick off "now".

Two commits:

  1. rptest: stop re-scrubbing completed partitions in wait_for_internal_scrub bumps the full-scrub interval to 24h while the
    helper waits, so full-scrubbed partitions drop out of the rotation and
    the outstanding set monotonically shrinks. The
    reset_scrubbing_metadata path is unaffected because
    scrubber_scheduler::pick_next_scrub_time's first_scrub branch uses
    now + jitter only - the configured base interval is bypassed when
    _last_partition_scrub == missing().
  2. cloud_storage: instrument anomalies_detector for scrub-quota diagnosis augments existing debug log lines and adds new debug/trace
    lines in anomalies_detector::run, check_manifest, and
    should_stop. Captures per-call quota inputs, segment-loop progress,
    and which should_stop() exit branch fired. Used to diagnose this
    bug; left in for future regressions.

Validated end-to-end on a 4-node AWS CDT cluster: the same
streaming_cache_test parametrization that reliably fails the 180s
wait_for_internal_scrub timeout without the fix passes in 11m27s with
it.

Fixes CORE-15146.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

@travisdowns
Copy link
Copy Markdown
Member Author

/ci-repeat 1 release

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

CI test results

test results on build#85073
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FLAKY(PASS) ShadowLinkConsumeGroupsMirroringTest test_continuous_group_sync {"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "tiered_cloud", "with_failures": true} integration https://buildkite.com/redpanda/redpanda/builds/85073#019e7002-106e-4332-990c-77b07073fe8f 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0002, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkConsumeGroupsMirroringTest&test_method=test_continuous_group_sync

@travisdowns
Copy link
Copy Markdown
Member Author

/cdt
tests/rptest/scale_tests/tiered_storage_cache_stress_test.py::TieredStorageCacheStressTest.streaming_cache_test

@travisdowns travisdowns marked this pull request as ready for review May 29, 2026 14:36
Copilot AI review requested due to automatic review settings May 29, 2026 14:36
@travisdowns travisdowns requested review from Lazin and WillemKauf May 29, 2026 14:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a test timeout caused by internal scrubbing repeatedly re-enqueuing partitions that already completed a full scrub, which could starve “tail” partitions of scrub quota. It also improves cloud-storage scrub diagnostics by adding more detailed debug/trace logging in the anomalies detector.

Changes:

  • Increase cloud_storage_full_scrub_interval_ms in wait_for_internal_scrub so fully scrubbed partitions don’t re-enter the rotation during the helper’s wait window.
  • Add additional debug/trace logging in cloud_storage::anomalies_detector to capture quota inputs, progress, and exit reasons.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
tests/rptest/services/redpanda.py Adjusts internal-scrub test helper configuration to avoid re-scrubbing completed partitions during the wait period.
src/v/cloud_storage/anomalies_detector.cc Adds instrumentation logs to help diagnose scrub quota consumption and stopping conditions.

Comment thread tests/rptest/services/redpanda.py Outdated
Comment on lines +6331 to +6335
# CORE-15146 experiment: set full-scrub interval well beyond
# the helper's wait window so a partition that completes a
# full scrub stays "done" and doesn't get re-enqueued every
# housekeeping cycle, starving the laggards of quota.
"cloud_storage_full_scrub_interval_ms": 86_400_000,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded to drop the "experiment" framing and describe it as effectively disabling periodic re-scrubs for the helper's lifetime. Kept the value but expressed it as 24 * 60 * 60 * 1000 so it reads as "1 day" at a glance, with a note in the comment that 24h is well past any plausible test runtime so the value is effectively infinite here.

Comment on lines +62 to +67
vlog(
_logger.debug,
"Failed downloading partition manifest, exiting scrub: ops={} "
"segs={}",
_result.ops,
_result.segments_visited);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - added the error_outcome to the failed-manifest log line via dl_result.error(). cloud_storage::error_outcome already has an fmt::formatter (defined in src/v/cloud_storage/types.h), so it logs cleanly.

Comment thread src/v/cloud_storage/anomalies_detector.cc
@WillemKauf WillemKauf requested a review from oleiman May 29, 2026 16:28
…crub

CORE-15146. The helper set cloud_storage_full_scrub_interval_ms to 10s,
which is shorter than the housekeeping workflow's wall-clock cycle for
clusters with many partitions. As a result every partition that completed
a full scrub became eligible for re-scrubbing before the next cycle
started, the workflow re-enqueued the full partition set each cycle, the
per-cycle ops quota (5000) ran out partway through, and the 2-3
partitions at the tail received quota=0. Those tail partitions advanced
only ~1 segment per cycle and never finished within the helper's 180s
timeout.

Setting the full-scrub interval to 24h keeps completed partitions out of
the queue for the remainder of the wait, so the outstanding set
monotonically shrinks. The reset_scrubbing_metadata path bypasses the
configured interval via the first_scrub branch in
scrubber_scheduler::pick_next_scrub_time, so the initial scrub still
fires within jitter (100ms) after reset, regardless of the configured
base interval.

Validated on a 3-broker AWS CDT cluster running
streaming_cache_test@{limit_mode=objects,log_segment_size=1048576}: same
parametrization fails reliably without the change (180s scrub timeout)
and passes in 11m27s with it.
Augment existing _logger.debug lines in anomalies_detector::run and
check_manifest, and add new _logger.debug + _logger.trace lines on
previously-silent paths, to capture:

  - per-call quota inputs (max_num_operations, max_num_segments,
    scrub_from)
  - manifest range, segment count, number of spillovers
  - per-call segment visit count and final last_scrubbed_offset
  - which should_stop branch fired (abort_requested / ops_quota /
    segs_quota)

Used to diagnose CORE-15146 (laggard partitions getting quota=0 at the
tail of housekeeping workflow runs). Mostly augmentation of existing
log lines; the new lines are at debug or trace, so they don't
materially change the default log volume.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants