[fix](cloud) skip stale tablet cache check for STOP_TOKEN#63520
Open
gavinchou wants to merge 1 commit into
Open
[fix](cloud) skip stale tablet cache check for STOP_TOKEN#63520gavinchou wants to merge 1 commit into
gavinchou wants to merge 1 commit into
Conversation
…ious schema change failure
In cloud mode, schema change on MOW (Merge-on-Write) tables fails with:
```
task type: ALTER, status_code: INTERNAL_ERROR, status_message:
[(BE_IP)[INTERNAL_ERROR]failed to start tablet job:
meta_service_job.cpp:143 could not perform compaction on expired tablet cache.
req_base_compaction_cnt=0, base_compaction_cnt=0,
req_cumulative_compaction_cnt=8, cumulative_compaction_cnt=9]
```
Jira: CORE-5964
Schema change on a MOW table calls `_process_delete_bitmap()`, which
registers a `STOP_TOKEN` compaction job via
`CloudCompactionStopToken::do_register()`. The `STOP_TOKEN` is not a
real compaction — it is a lock marker that blocks concurrent compactions
during delete bitmap recalculation.
However, `start_compaction_job()` in the meta-service applies the stale
tablet cache check (lines 140-149) **unconditionally to all compaction
types**, including `STOP_TOKEN`. If a concurrent compaction on another
BE node has already advanced `cumulative_compaction_cnt` in the
meta-service (e.g. from 8 → 9) while the schema change BE still holds
its old cached value (8), the STOP_TOKEN registration is rejected with
`STALE_TABLET_CACHE`. This error propagates all the way back to the FE
as a fatal ALTER task failure.
Skip the stale tablet cache check when the compaction job type is
`STOP_TOKEN`. Since `STOP_TOKEN` does not read or compact any rowsets,
verifying the freshness of cached compaction counts is meaningless for
it.
```cpp
// Before
if (compaction.base_compaction_cnt() < stats.base_compaction_cnt() ||
compaction.cumulative_compaction_cnt() < stats.cumulative_compaction_cnt()) {
// After
if (compaction.type() != TabletCompactionJobPB::STOP_TOKEN &&
(compaction.base_compaction_cnt() < stats.base_compaction_cnt() ||
compaction.cumulative_compaction_cnt() < stats.cumulative_compaction_cnt())) {
```
Added regression test `StopTokenSkipsStaleTabletCacheCheck` in
`cloud/test/meta_service_job_test.cpp` that:
1. Sets up a tablet with `cumulative_compaction_cnt=9` on the
meta-service side
2. Verifies that a regular `CUMULATIVE` compaction with stale count=8 is
still correctly rejected with `STALE_TABLET_CACHE`
3. Verifies that a `STOP_TOKEN` with the same stale count=8 succeeds
with `OK`
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Contributor
Author
|
run buildall |
Contributor
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
Contributor
Author
|
/review |
Contributor
There was a problem hiding this comment.
Reviewed the PR with the Doris code-review checklist.
Critical checkpoint conclusions:
- Goal and proof: The change targets STOP_TOKEN registration failures caused by stale compaction counters. The implementation accomplishes this by skipping only the stale tablet cache comparison for STOP_TOKEN, and the added unit test covers both the preserved CUMULATIVE rejection and the STOP_TOKEN success case.
- Scope: The modification is small and focused; no unrelated compaction behavior is changed.
- Concurrency: The path remains protected by the existing meta-service transaction/job-key conflict checks. STOP_TOKEN still blocks or clears compaction jobs through the existing logic, and the skipped counter check does not introduce a new shared-memory race.
- Lifecycle/static initialization: No new static/global lifecycle behavior is introduced.
- Configuration: No new configuration items are added.
- Compatibility/storage format: No protocol fields or persistent formats are changed; existing STOP_TOKEN fields are reused.
- Parallel paths: The relevant versioned and non-versioned stats-read paths still feed the same compaction type check; real compaction types keep the stale-cache guard.
- Conditional checks: The new STOP_TOKEN exception is documented locally and matches existing semantics that STOP_TOKEN is a lock marker, not a rowset-compacting job.
- Test coverage: A focused regression test was added. I did not run the cloud unit test in this runner.
- Observability: Existing error messages/logging are unchanged and sufficient for this narrow behavior change.
- Transaction/persistence/data writes: The change does not alter job persistence or commit/abort handling; STOP_TOKEN registration still uses the existing transactional job write path.
- Performance: No meaningful added overhead; the change removes an invalid rejection for STOP_TOKEN only.
User focus: No additional user-provided review focus was present.
No blocking issues found.
Contributor
Author
|
skip code-review |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: In cloud mode, schema change on MOW tables can fail when registering a STOP_TOKEN compaction job. STOP_TOKEN is a lock marker and does not compact rowsets, but the meta-service previously applied the stale tablet cache check to all compaction types. If another BE advanced the tablet compaction counters while the schema change BE still held older cached counters, STOP_TOKEN registration could be rejected as STALE_TABLET_CACHE and the ALTER task failed. This change skips the stale tablet cache check for STOP_TOKEN while preserving the check for real compaction jobs.
Release note
Fix cloud mode schema change failures caused by STOP_TOKEN registration being rejected by stale tablet cache checks.
Check List (For Author)
StopTokenSkipsStaleTabletCacheCheckincloud/test/meta_service_job_test.cpp; not run in this session.