Skip to content

tso: fix fallback group 0 return and add metrics#10602

Open
bufferflies wants to merge 4 commits intotikv:masterfrom
bufferflies:pr-merge/7786830e-fix-tso-fallback-group0
Open

tso: fix fallback group 0 return and add metrics#10602
bufferflies wants to merge 4 commits intotikv:masterfrom
bufferflies:pr-merge/7786830e-fix-tso-fallback-group0

Conversation

@bufferflies
Copy link
Copy Markdown
Contributor

@bufferflies bufferflies commented Apr 15, 2026

Summary

  • avoid falling back to keyspace group 0 when a keyspace already has a configured TSO group but lookup state is stale
  • add fallback outcome metrics for rejected fallback and fallback-to-default cases
  • keep local make check green on current master by including the required gofmt cleanup in server/cluster/cluster.go

Issue Number: ref #10516, close #10554, close #10601

Author

@ystaticy

CP

Summary by CodeRabbit

  • Bug Fixes

    • Fixed keyspace group assignment fallback behavior to properly reject fallback when a keyspace has a configured group instead of defaulting to the default group.
  • New Features

    • Added metrics to monitor keyspace fallback events, tracking both rejected fallbacks and fallbacks to the default group.

ystaticy and others added 3 commits April 15, 2026 15:45
* fix fallback

Signed-off-by: ystaticy <y_static_y@sina.com>

* add condition

Signed-off-by: ystaticy <y_static_y@sina.com>

* fix call getKeyspaceGroupMetaWithCheck

Signed-off-by: ystaticy <y_static_y@sina.com>

* fix lint

Signed-off-by: ystaticy <y_static_y@sina.com>

* add UT

Signed-off-by: ystaticy <y_static_y@sina.com>

* fix UT

Signed-off-by: ystaticy <y_static_y@sina.com>

---------

Signed-off-by: ystaticy <y_static_y@sina.com>
(cherry picked from commit 7786830)
Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 15, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot Bot added dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 15, 2026
@bufferflies
Copy link
Copy Markdown
Contributor Author

/ping @ystaticy

@ti-chi-bot ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 15, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 15, 2026

📝 Walkthrough

Walkthrough

Adds storage-backed checks and metrics to TSO keyspace group fallback logic: the manager now verifies if a keyspace has a configured group in persistent storage and rejects fallback to the default group when a configured group exists; new Prometheus counters record rejected vs. defaulted fallback events.

Changes

Cohort / File(s) Summary
Manager Logic
pkg/tso/keyspace_group_manager.go
Added checkKeyspaceHasConfiguredGroup helper and a manager-bound getKeyspaceGroupMetaWithCheck usage; fallback now consults storage and rejects fallback when a configured group exists; metrics counters incremented accordingly.
Tests
pkg/tso/keyspace_group_manager_test.go
Updated test call sites to pass manager into getKeyspaceGroupMetaWithCheck; added TestCheckKeyspaceGroupFallback to validate fallback rejection and acceptance depending on saved keyspace metadata.
Metrics
pkg/tso/metrics.go
Introduced tso_keyspace_fallback_events CounterVec and per-manager counters keyspaceFallbackRejectedCounter and keyspaceFallbackToDefaultCounter; initialized and registered.
Formatting
server/cluster/cluster.go
Whitespace/field alignment changes only in RaftCluster struct (no semantic changes).

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant KGM as KeyspaceGroupManager
    participant Store as Storage
    participant Metrics as Metrics
    rect rgba(200,200,255,0.5)
    Client->>KGM: TSO request / lookup keyspaceID
    KGM->>KGM: Check in-memory keyspace-group map
    alt not present in-memory
        KGM->>Store: Load KeyspaceMeta(keyspaceID)
        Store-->>KGM: KeyspaceMeta (exists / not exists)
        alt KeyspaceMeta has configured group
            KGM->>Metrics: Increment "rejected" counter
            KGM-->>Client: Return ErrKeyspaceNotAssigned
        else no configured group
            KGM->>Metrics: Increment "to_default" counter
            KGM-->>Client: Fallback to default keyspace group
        end
    else present in-memory
        KGM-->>Client: Return assigned keyspace-group
    end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

contribution, release-note-none, ok-to-test

Suggested reviewers

  • rleungx
  • JmPotato

Poem

🐰 I hopped through metadata, sniffed the trail,
Found keyspaces set where defaults would fail.
Counters tick-tock for rejected and stray,
I guard the groups as I nibble all day. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main changes: fixing fallback to group 0 and adding metrics for TSO keyspace operations.
Description check ✅ Passed The PR description covers the problem being solved and relates it to linked issues, but lacks detailed explanation of implementation changes and testing strategy.
Linked Issues check ✅ Passed The PR successfully addresses both linked issues: implementing the fallback-to-group-0 fix [#10554] and adding TSO fallback metrics [#10601].
Out of Scope Changes check ✅ Passed All changes are aligned with the linked issues. The gofmt cleanup in server/cluster/cluster.go is explicitly mentioned as necessary for keeping local make check passing.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
pkg/tso/keyspace_group_manager.go (2)

282-286: Consider accepting a context parameter for graceful shutdown.

Using context.Background() ignores any cancellation signals from the caller's context. During graceful shutdown, this could delay termination if the storage call hangs or is slow.

♻️ Suggested change
-func checkKeyspaceHasConfiguredGroup(storage *endpoint.StorageEndpoint, keyspaceID uint32) bool {
+func checkKeyspaceHasConfiguredGroup(ctx context.Context, storage *endpoint.StorageEndpoint, keyspaceID uint32) bool {
 	// Null keyspace is always allowed to fallback to default group
 	if keyspaceID == constant.NullKeyspaceID {
 		// ...
 	}
 
 	// Load keyspace metadata from storage
 	var meta *keyspacepb.KeyspaceMeta
-	err := storage.RunInTxn(context.Background(), func(txn kv.Txn) error {
+	err := storage.RunInTxn(ctx, func(txn kv.Txn) error {
 		var err error
 		meta, err = storage.LoadKeyspaceMeta(txn, keyspaceID)
 		return err
 	})

Then update the call site at line 353 to pass a context from kgm.ctx or the caller.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tso/keyspace_group_manager.go` around lines 282 - 286, Replace the use of
context.Background() in the storage.RunInTxn call so the method accepts and uses
a caller-provided context: change the function signature that contains this call
to accept a ctx parameter (or use the existing kgm.ctx), pass that ctx into
storage.RunInTxn instead of context.Background(), and propagate the same ctx
into the inner call to storage.LoadKeyspaceMeta; update the call site that
invokes this function to pass kgm.ctx (or the caller's context) so cancellation
and graceful shutdown signals are honored by RunInTxn and LoadKeyspaceMeta.

277-278: Consider reducing log level to Debug for routine checks.

This log.Info will be emitted for every TSO request that triggers the fallback check path. In high-traffic scenarios, this could generate excessive log volume. The successful/rejected outcomes at lines 298-303 also use log.Info.

Consider using log.Debug for these messages to reduce noise while still allowing them to be enabled when troubleshooting.

♻️ Suggested change
-	log.Info("[tso] checking keyspace meta for group configuration before fallback",
+	log.Debug("[tso] checking keyspace meta for group configuration before fallback",
 		zap.Uint32("keyspace-id", keyspaceID))

Similarly for lines 290, 298, and 302.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tso/keyspace_group_manager.go` around lines 277 - 278, Change the routine
informational logs used during the keyspace fallback check to debug-level so
they don't flood logs under normal load: replace the log.Info call that emits
"[tso] checking keyspace meta for group configuration before fallback" and the
subsequent Info logs that report the successful/rejected outcomes with log.Debug
while keeping the same message text and zap fields (e.g., the "keyspace-id"
field); locate and update the logging calls in the keyspace fallback/checking
code paths (the log statements shown in the diff and the related success/reject
logs) to use Debug instead of Info.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/tso/keyspace_group_manager.go`:
- Around line 282-286: Replace the use of context.Background() in the
storage.RunInTxn call so the method accepts and uses a caller-provided context:
change the function signature that contains this call to accept a ctx parameter
(or use the existing kgm.ctx), pass that ctx into storage.RunInTxn instead of
context.Background(), and propagate the same ctx into the inner call to
storage.LoadKeyspaceMeta; update the call site that invokes this function to
pass kgm.ctx (or the caller's context) so cancellation and graceful shutdown
signals are honored by RunInTxn and LoadKeyspaceMeta.
- Around line 277-278: Change the routine informational logs used during the
keyspace fallback check to debug-level so they don't flood logs under normal
load: replace the log.Info call that emits "[tso] checking keyspace meta for
group configuration before fallback" and the subsequent Info logs that report
the successful/rejected outcomes with log.Debug while keeping the same message
text and zap fields (e.g., the "keyspace-id" field); locate and update the
logging calls in the keyspace fallback/checking code paths (the log statements
shown in the diff and the related success/reject logs) to use Debug instead of
Info.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: af0a1a2f-3970-446f-b1ef-04676ef12e99

📥 Commits

Reviewing files that changed from the base of the PR and between b0a3c90 and f4022b3.

📒 Files selected for processing (4)
  • pkg/tso/keyspace_group_manager.go
  • pkg/tso/keyspace_group_manager_test.go
  • pkg/tso/metrics.go
  • server/cluster/cluster.go

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 15, 2026

@liyishuai: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 15, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liyishuai
Once this PR has been reviewed and has the lgtm label, please assign binshi-bing for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

return err
})

if err != nil || meta == nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the error happens when loading from storage, but the keyspace has a configured group?

Comment thread pkg/tso/keyspace_group_manager.go Outdated

// Load keyspace metadata from storage
var meta *keyspacepb.KeyspaceMeta
err := storage.RunInTxn(context.Background(), func(txn kv.Txn) error {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a context instead of using Background()?

Comment thread pkg/tso/keyspace_group_manager.go Outdated
// Check if Config has TSOKeyspaceGroupIDKey
_, hasGroup := meta.GetConfig()[keyspace.TSOKeyspaceGroupIDKey]
if hasGroup {
log.Info("[tso] keyspace has configured group in meta, should not fallback to default group",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it print a lot?

Signed-off-by: tongjian <1045931706@qq.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/tso/keyspace_group_manager.go (1)

311-355: ⚠️ Potential issue | 🟠 Major

Etcd read performed while holding state.RLock().

getKeyspaceGroupMetaWithCheck takes s.RLock() at line 315-316 and, on the miss path, calls kgm.checkKeyspaceHasConfiguredGroup at line 350 which runs kgm.storage.RunInTxn → etcd load (lines 593-596). This is on the TSO request hot path (HandleTSORequest, GetMember, FindGroupByKeyspaceID), and it blocks every other reader/writer of the keyspace-group state (including concurrent TSO dispatch for other keyspaces) for the duration of an etcd round trip. Under an etcd stall this can serialize/stall TSO service-wide.

Please move the storage check outside the lock: release s.RLock(), perform checkKeyspaceHasConfiguredGroup, then re-acquire the lock if you need to re-validate state. Alternatively, cache the "has configured group" bit per keyspace (the keyspace→group config is rarely changed) and invalidate on keyspace meta updates, so the hot path stays in-memory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tso/keyspace_group_manager.go` around lines 311 - 355,
getKeyspaceGroupMetaWithCheck currently holds s.RLock() while calling
kgm.checkKeyspaceHasConfiguredGroup (which calls kgm.storage.RunInTxn and
performs an etcd read), blocking all other readers/writers; fix by releasing the
RLock before invoking kgm.checkKeyspaceHasConfiguredGroup and then re-acquiring
the RLock to re-validate the in-memory state (re-check allocators,
s.keyspaceLookupTable or any conditions you relied on), or implement a cached
boolean for "keyspace has configured group" (maintained/invalidated on keyspace
meta updates) and use that cache under the RLock to avoid any storage call on
the hot path; specifically modify getKeyspaceGroupMetaWithCheck to: 1) exit the
RLock before calling kgm.checkKeyspaceHasConfiguredGroup (which invokes
kgm.storage.RunInTxn), 2) call the check, and 3) re-lock and re-check the same
state used earlier (allocators, s.kgs, s.keyspaceLookupTable) before returning
to ensure no race.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/tso/keyspace_group_manager.go`:
- Around line 598-603: The info-level logging in the branch that handles a
failed/missing keyspace meta (the log call emitting "keyspace meta not found or
failed to load, assume legacy keyspace" with zap.Uint32("keyspace-id",
keyspaceID) and zap.Error(err)) is on the hot path and should be downgraded to
debug to avoid log spam; change that log.Info to log.Debug, and make the same
change in the duplicate helper that emits the same message (replace its log.Info
with log.Debug) while preserving the structured fields
(zap.Uint32("keyspace-id", keyspaceID), zap.Error(err)); keep behavior otherwise
consistent with the existing log.Debug used for the related "skip fallback"
message.
- Around line 593-611: When RunInTxn/LoadKeyspaceMeta returns an error, do not
assume legacy keyspace; treat this as a load failure and reject the fallback to
group 0. In getKeyspaceGroupMetaWithCheck, distinguish the three cases: (1) err
!= nil → fail-safe by returning hasGroup=true (reject fallback) and increment
kgm.metrics.keyspaceFallbackRejectedCounter if metrics present; (2) err == nil
&& meta == nil → genuine not-found, allow fallback (return false); (3) err ==
nil && meta != nil → check meta.GetConfig()[keyspace.TSOKeyspaceGroupIDKey] as
before. Use kgm.storage.RunInTxn / LoadKeyspaceMeta /
keyspace.TSOKeyspaceGroupIDKey / kgm.metrics.keyspaceFallbackRejectedCounter as
reference points.
- Around line 266-303: Remove the dead package-level function
checkKeyspaceHasConfiguredGroup entirely and keep only the
kgm.checkKeyspaceHasConfiguredGroup method; move the invocation of
kgm.checkKeyspaceHasConfiguredGroup out of the s.RLock() region in
getKeyspaceGroupMetaWithCheck so the etcd/storage call (storage.RunInTxn /
LoadKeyspaceMeta) does not execute while holding the RLock; change the storage
error-path in the method so that if storage.RunInTxn returns an error you treat
it as “reject fallback” (return true or propagate error) instead of returning
false which enables fallback; and reduce logging noise by replacing the hot-path
log.Info calls related to missing/legacy/fallback keyspace cases with log.Debug
in both the standalone logic and the kgm method.

---

Outside diff comments:
In `@pkg/tso/keyspace_group_manager.go`:
- Around line 311-355: getKeyspaceGroupMetaWithCheck currently holds s.RLock()
while calling kgm.checkKeyspaceHasConfiguredGroup (which calls
kgm.storage.RunInTxn and performs an etcd read), blocking all other
readers/writers; fix by releasing the RLock before invoking
kgm.checkKeyspaceHasConfiguredGroup and then re-acquiring the RLock to
re-validate the in-memory state (re-check allocators, s.keyspaceLookupTable or
any conditions you relied on), or implement a cached boolean for "keyspace has
configured group" (maintained/invalidated on keyspace meta updates) and use that
cache under the RLock to avoid any storage call on the hot path; specifically
modify getKeyspaceGroupMetaWithCheck to: 1) exit the RLock before calling
kgm.checkKeyspaceHasConfiguredGroup (which invokes kgm.storage.RunInTxn), 2)
call the check, and 3) re-lock and re-check the same state used earlier
(allocators, s.kgs, s.keyspaceLookupTable) before returning to ensure no race.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ce017cbe-77a7-4b60-86fd-eab4dd2a8cc1

📥 Commits

Reviewing files that changed from the base of the PR and between f4022b3 and 20b2fb6.

📒 Files selected for processing (1)
  • pkg/tso/keyspace_group_manager.go

Comment on lines +266 to +303
// checkKeyspaceHasConfiguredGroup checks if the keyspace has configured a group in its metadata.
// This is used to determine if we should allow fallback to default group.
func checkKeyspaceHasConfiguredGroup(ctx context.Context, storage *endpoint.StorageEndpoint, keyspaceID uint32) bool {
// Null keyspace is always allowed to fallback to default group
// NullKeyspaceID is used for api v1 or legacy path where is keyspace agnostic
if keyspaceID == constant.NullKeyspaceID {
log.Debug("null keyspace, allow fallback to default group",
zap.Uint32("keyspace-id", keyspaceID))
return false
}

// Load keyspace metadata from storage
var meta *keyspacepb.KeyspaceMeta
var err error
err = storage.RunInTxn(ctx, func(txn kv.Txn) error {
meta, err = storage.LoadKeyspaceMeta(txn, keyspaceID)
return err
})

if err != nil || meta == nil {
// If can't load or not found, assume no group configured (legacy keyspace)
log.Info("[tso] keyspace meta not found or failed to load, assume legacy keyspace",
zap.Uint32("keyspace-id", keyspaceID), zap.Error(err))
return false
}

// Check if Config has TSOKeyspaceGroupIDKey
_, hasGroup := meta.GetConfig()[keyspace.TSOKeyspaceGroupIDKey]
if hasGroup {
log.Info("keyspace has configured group in meta, should not fallback to default group",
zap.Uint32("keyspace-id", keyspaceID),
zap.String("configured-group-id", meta.GetConfig()[keyspace.TSOKeyspaceGroupIDKey]))
} else {
log.Info("keyspace has no group configuration in meta, can fallback to default group",
zap.Uint32("keyspace-id", keyspaceID))
}
return hasGroup
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm the package-level function has no callers.
rg -nP '\bcheckKeyspaceHasConfiguredGroup\s*\(' --type=go -C2

Repository: tikv/pd

Length of output: 1882


🏁 Script executed:

#!/bin/bash
# Read the package-level function and the method implementation to compare
cat -n pkg/tso/keyspace_group_manager.go | sed -n '266,303p'
echo "---"
cat -n pkg/tso/keyspace_group_manager.go | sed -n '579,620p'

Repository: tikv/pd

Length of output: 3920


🏁 Script executed:

#!/bin/bash
# Check the context of the call at line 350 to see if it's under RLock
cat -n pkg/tso/keyspace_group_manager.go | sed -n '310,365p'

Repository: tikv/pd

Length of output: 3094


🏁 Script executed:

#!/bin/bash
# Check if storage.RunInTxn is a blocking etcd IO operation
rg -n 'RunInTxn' --type=go -A5 pkg/tso/ | head -40

Repository: tikv/pd

Length of output: 1770


Remove the package-level checkKeyspaceHasConfiguredGroup function (lines 266–303) — it is dead code with no callers.

Only the method version at line 581 is used (at line 350 in getKeyspaceGroupMetaWithCheck). Keeping both creates maintenance burden and risk of drift.

Move the kgm.checkKeyspaceHasConfiguredGroup call outside the RLock (critical performance issue).

The call at line 350 executes inside s.RLock() (line 315). The method performs a blocking etcd transactional read via storage.RunInTxn() (line 593), which blocks all TSO requests for every keyspace/group in the hot path. Extract this check before acquiring the lock, or defer it to the caller's non-locked section.

Fix error handling: on storage load failure, reject fallback instead of enabling it.

When storage.RunInTxn() fails at line 598, the method returns false (no group configured), allowing fallback to the default group. If the keyspace actually has a configured group but the read failed, this is incorrect. On error, return true (reject fallback) or propagate the error up the stack.

Change log.Info to log.Debug on the hot path.

Lines 287 (standalone), 600 (method), and 295–299 (standalone) use log.Info for fallback/error cases. These are called on every TSO request and will spam logs. Use log.Debug instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tso/keyspace_group_manager.go` around lines 266 - 303, Remove the dead
package-level function checkKeyspaceHasConfiguredGroup entirely and keep only
the kgm.checkKeyspaceHasConfiguredGroup method; move the invocation of
kgm.checkKeyspaceHasConfiguredGroup out of the s.RLock() region in
getKeyspaceGroupMetaWithCheck so the etcd/storage call (storage.RunInTxn /
LoadKeyspaceMeta) does not execute while holding the RLock; change the storage
error-path in the method so that if storage.RunInTxn returns an error you treat
it as “reject fallback” (return true or propagate error) instead of returning
false which enables fallback; and reduce logging noise by replacing the hot-path
log.Info calls related to missing/legacy/fallback keyspace cases with log.Debug
in both the standalone logic and the kgm method.

Comment on lines +593 to +611
err = kgm.storage.RunInTxn(kgm.ctx, func(txn kv.Txn) error {
meta, err = kgm.storage.LoadKeyspaceMeta(txn, keyspaceID)
return err
})

if err != nil {
// If can't load or not found, assume no group configured (legacy keyspace)
log.Info("[tso] keyspace meta not found or failed to load, assume legacy keyspace",
zap.Uint32("keyspace-id", keyspaceID), zap.Error(err))
return false
}

// Check if Config has TSOKeyspaceGroupIDKey
_, hasGroup := meta.GetConfig()[keyspace.TSOKeyspaceGroupIDKey]
if kgm.metrics != nil && hasGroup {
kgm.metrics.keyspaceFallbackRejectedCounter.Inc()
}
return hasGroup
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Storage error silently allows fallback to group 0 — defeats the purpose of this fix.

When LoadKeyspaceMeta (or RunInTxn) returns an error for a real keyspace, this method returns false, so getKeyspaceGroupMetaWithCheck proceeds to fall back to the default keyspace group. That is exactly the "stale state → wrong fallback to group 0" scenario the PR is meant to prevent: a keyspace that in fact has a configured group can still be served TSO from group 0 during a transient etcd failure, producing a TSO regression.

Please fail-safe here: on error, either reject the fallback (return true) or propagate the error up so the caller returns ErrKeyspaceNotAssigned instead of switching to the default group. Note that per pkg/storage/endpoint/keyspace.go, LoadKeyspaceMeta returns (nil, nil) when the keyspace genuinely does not exist, so it is safe to distinguish "not found" (err == nil && meta == nil → allow fallback) from "load failed" (err != nil → reject/propagate).

🛡️ Suggested direction
-	err = kgm.storage.RunInTxn(kgm.ctx, func(txn kv.Txn) error {
-		meta, err = kgm.storage.LoadKeyspaceMeta(txn, keyspaceID)
-		return err
-	})
-
-	if err != nil {
-		// If can't load or not found, assume no group configured (legacy keyspace)
-		log.Info("[tso] keyspace meta not found or failed to load, assume legacy keyspace",
-			zap.Uint32("keyspace-id", keyspaceID), zap.Error(err))
-		return false
-	}
+	err = kgm.storage.RunInTxn(kgm.ctx, func(txn kv.Txn) error {
+		meta, err = kgm.storage.LoadKeyspaceMeta(txn, keyspaceID)
+		return err
+	})
+	if err != nil {
+		// Be conservative on load failure: reject fallback to the default group
+		// to avoid serving TSO from group 0 for a keyspace that may have a
+		// configured group.
+		log.Warn("[tso] failed to load keyspace meta, reject fallback to default group",
+			zap.Uint32("keyspace-id", keyspaceID), zap.Error(err))
+		if kgm.metrics != nil {
+			kgm.metrics.keyspaceFallbackRejectedCounter.Inc()
+		}
+		return true
+	}
+	if meta == nil {
+		// Keyspace really does not exist in storage — legacy/null path, allow fallback.
+		return false
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tso/keyspace_group_manager.go` around lines 593 - 611, When
RunInTxn/LoadKeyspaceMeta returns an error, do not assume legacy keyspace; treat
this as a load failure and reject the fallback to group 0. In
getKeyspaceGroupMetaWithCheck, distinguish the three cases: (1) err != nil →
fail-safe by returning hasGroup=true (reject fallback) and increment
kgm.metrics.keyspaceFallbackRejectedCounter if metrics present; (2) err == nil
&& meta == nil → genuine not-found, allow fallback (return false); (3) err ==
nil && meta != nil → check meta.GetConfig()[keyspace.TSOKeyspaceGroupIDKey] as
before. Use kgm.storage.RunInTxn / LoadKeyspaceMeta /
keyspace.TSOKeyspaceGroupIDKey / kgm.metrics.keyspaceFallbackRejectedCounter as
reference points.

Comment on lines +598 to +603
if err != nil {
// If can't load or not found, assume no group configured (legacy keyspace)
log.Info("[tso] keyspace meta not found or failed to load, assume legacy keyspace",
zap.Uint32("keyspace-id", keyspaceID), zap.Error(err))
return false
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Log level will spam on the hot path.

This branch is hit on every TSO request whose keyspace isn't yet in the in-memory lookup table (e.g., newly created keyspace, transient etcd blip). log.Info here — and the log.Info pair at lines 295-301 in the duplicate helper — will flood logs at request QPS. Please downgrade to log.Debug (or rate-limit), consistent with the log.Debug already used at line 351 for the related "skip fallback" message.

-		log.Info("[tso] keyspace meta not found or failed to load, assume legacy keyspace",
+		log.Debug("[tso] keyspace meta not found or failed to load, assume legacy keyspace",
 			zap.Uint32("keyspace-id", keyspaceID), zap.Error(err))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if err != nil {
// If can't load or not found, assume no group configured (legacy keyspace)
log.Info("[tso] keyspace meta not found or failed to load, assume legacy keyspace",
zap.Uint32("keyspace-id", keyspaceID), zap.Error(err))
return false
}
if err != nil {
// If can't load or not found, assume no group configured (legacy keyspace)
log.Debug("[tso] keyspace meta not found or failed to load, assume legacy keyspace",
zap.Uint32("keyspace-id", keyspaceID), zap.Error(err))
return false
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tso/keyspace_group_manager.go` around lines 598 - 603, The info-level
logging in the branch that handles a failed/missing keyspace meta (the log call
emitting "keyspace meta not found or failed to load, assume legacy keyspace"
with zap.Uint32("keyspace-id", keyspaceID) and zap.Error(err)) is on the hot
path and should be downgraded to debug to avoid log spam; change that log.Info
to log.Debug, and make the same change in the duplicate helper that emits the
same message (replace its log.Info with log.Debug) while preserving the
structured fields (zap.Uint32("keyspace-id", keyspaceID), zap.Error(err)); keep
behavior otherwise consistent with the existing log.Debug used for the related
"skip fallback" message.

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 20, 2026

@bufferflies: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-next-gen-1 20b2fb6 link true /test pull-unit-test-next-gen-1

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +348 to +355
// Before fallback to default group, check if the keyspace has configured a group in its metadata.
// If it has, don't fallback to avoid incorrect switching when state is inconsistent.
if kgm != nil && kgm.checkKeyspaceHasConfiguredGroup(keyspaceID) {
log.Debug("[tso] keyspace has configured group but not found in lookup table, skip fallback to default group",
zap.Uint32("keyspace-id", keyspaceID),
zap.Uint32("requested-group-id", keyspaceGroupID))
return nil, nil, keyspaceGroupID, s.modRevision, errs.ErrKeyspaceNotAssigned.FastGenByArgs(keyspaceID)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug here should be: a keyspace already assigned to a non-default group is wrongly falling back to group 0 when the lookup state is stale.

But this check blocks fallback for any keyspace that has tso_keyspace_group_id, including "0". That also changes the behavior of default-group keyspaces.

Should we only block fallback when the configured group ID is non-zero, and add a test for "0"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

271ac93b add TSO fallback metrics 7786830e fix TSO service fallback group 0 return

5 participants