Skip to content

[fix](cloud) skip stale tablet cache check for STOP_TOKEN#63520

Open
gavinchou wants to merge 1 commit into
apache:masterfrom
gavinchou:gavin-fix-stop-token
Open

[fix](cloud) skip stale tablet cache check for STOP_TOKEN#63520
gavinchou wants to merge 1 commit into
apache:masterfrom
gavinchou:gavin-fix-stop-token

Conversation

@gavinchou
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: In cloud mode, schema change on MOW tables can fail when registering a STOP_TOKEN compaction job. STOP_TOKEN is a lock marker and does not compact rowsets, but the meta-service previously applied the stale tablet cache check to all compaction types. If another BE advanced the tablet compaction counters while the schema change BE still held older cached counters, STOP_TOKEN registration could be rejected as STALE_TABLET_CACHE and the ALTER task failed. This change skips the stale tablet cache check for STOP_TOKEN while preserving the check for real compaction jobs.

Release note

Fix cloud mode schema change failures caused by STOP_TOKEN registration being rejected by stale tablet cache checks.

Check List (For Author)

  • Test: Unit Test
    • Added StopTokenSkipsStaleTabletCacheCheck in cloud/test/meta_service_job_test.cpp; not run in this session.
  • Behavior changed: Yes. STOP_TOKEN compaction jobs no longer fail stale tablet cache validation because they do not read or compact rowsets.
  • Does this need documentation: No

…ious schema change failure

In cloud mode, schema change on MOW (Merge-on-Write) tables fails with:

```
task type: ALTER, status_code: INTERNAL_ERROR, status_message:
[(BE_IP)[INTERNAL_ERROR]failed to start tablet job:
meta_service_job.cpp:143 could not perform compaction on expired tablet cache.
req_base_compaction_cnt=0, base_compaction_cnt=0,
req_cumulative_compaction_cnt=8, cumulative_compaction_cnt=9]
```

Jira: CORE-5964

Schema change on a MOW table calls `_process_delete_bitmap()`, which
registers a `STOP_TOKEN` compaction job via
`CloudCompactionStopToken::do_register()`. The `STOP_TOKEN` is not a
real compaction — it is a lock marker that blocks concurrent compactions
during delete bitmap recalculation.

However, `start_compaction_job()` in the meta-service applies the stale
tablet cache check (lines 140-149) **unconditionally to all compaction
types**, including `STOP_TOKEN`. If a concurrent compaction on another
BE node has already advanced `cumulative_compaction_cnt` in the
meta-service (e.g. from 8 → 9) while the schema change BE still holds
its old cached value (8), the STOP_TOKEN registration is rejected with
`STALE_TABLET_CACHE`. This error propagates all the way back to the FE
as a fatal ALTER task failure.

Skip the stale tablet cache check when the compaction job type is
`STOP_TOKEN`. Since `STOP_TOKEN` does not read or compact any rowsets,
verifying the freshness of cached compaction counts is meaningless for
it.

```cpp
// Before
if (compaction.base_compaction_cnt() < stats.base_compaction_cnt() ||
    compaction.cumulative_compaction_cnt() < stats.cumulative_compaction_cnt()) {

// After
if (compaction.type() != TabletCompactionJobPB::STOP_TOKEN &&
    (compaction.base_compaction_cnt() < stats.base_compaction_cnt() ||
     compaction.cumulative_compaction_cnt() < stats.cumulative_compaction_cnt())) {
```

Added regression test `StopTokenSkipsStaleTabletCacheCheck` in
`cloud/test/meta_service_job_test.cpp` that:
1. Sets up a tablet with `cumulative_compaction_cnt=9` on the
meta-service side
2. Verifies that a regular `CUMULATIVE` compaction with stale count=8 is
still correctly rejected with `STALE_TABLET_CACHE`
3. Verifies that a `STOP_TOKEN` with the same stale count=8 succeeds
with `OK`
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@gavinchou
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.06% (1854/2375)
Line Coverage 64.53% (33338/51663)
Region Coverage 65.25% (16536/25343)
Branch Coverage 55.72% (8834/15854)

@gavinchou
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the PR with the Doris code-review checklist.

Critical checkpoint conclusions:

  • Goal and proof: The change targets STOP_TOKEN registration failures caused by stale compaction counters. The implementation accomplishes this by skipping only the stale tablet cache comparison for STOP_TOKEN, and the added unit test covers both the preserved CUMULATIVE rejection and the STOP_TOKEN success case.
  • Scope: The modification is small and focused; no unrelated compaction behavior is changed.
  • Concurrency: The path remains protected by the existing meta-service transaction/job-key conflict checks. STOP_TOKEN still blocks or clears compaction jobs through the existing logic, and the skipped counter check does not introduce a new shared-memory race.
  • Lifecycle/static initialization: No new static/global lifecycle behavior is introduced.
  • Configuration: No new configuration items are added.
  • Compatibility/storage format: No protocol fields or persistent formats are changed; existing STOP_TOKEN fields are reused.
  • Parallel paths: The relevant versioned and non-versioned stats-read paths still feed the same compaction type check; real compaction types keep the stale-cache guard.
  • Conditional checks: The new STOP_TOKEN exception is documented locally and matches existing semantics that STOP_TOKEN is a lock marker, not a rowset-compacting job.
  • Test coverage: A focused regression test was added. I did not run the cloud unit test in this runner.
  • Observability: Existing error messages/logging are unchanged and sufficient for this narrow behavior change.
  • Transaction/persistence/data writes: The change does not alter job persistence or commit/abort handling; STOP_TOKEN registration still uses the existing transactional job write path.
  • Performance: No meaningful added overhead; the change removes an invalid rejection for STOP_TOKEN only.

User focus: No additional user-provided review focus was present.

No blocking issues found.

@gavinchou
Copy link
Copy Markdown
Contributor Author

skip code-review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants