tso: fix fallback group 0 return and add metrics#10602
tso: fix fallback group 0 return and add metrics#10602bufferflies wants to merge 4 commits intotikv:masterfrom
Conversation
* fix fallback Signed-off-by: ystaticy <y_static_y@sina.com> * add condition Signed-off-by: ystaticy <y_static_y@sina.com> * fix call getKeyspaceGroupMetaWithCheck Signed-off-by: ystaticy <y_static_y@sina.com> * fix lint Signed-off-by: ystaticy <y_static_y@sina.com> * add UT Signed-off-by: ystaticy <y_static_y@sina.com> * fix UT Signed-off-by: ystaticy <y_static_y@sina.com> --------- Signed-off-by: ystaticy <y_static_y@sina.com> (cherry picked from commit 7786830)
Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
|
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ping @ystaticy |
📝 WalkthroughWalkthroughAdds storage-backed checks and metrics to TSO keyspace group fallback logic: the manager now verifies if a keyspace has a configured group in persistent storage and rejects fallback to the default group when a configured group exists; new Prometheus counters record rejected vs. defaulted fallback events. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client
participant KGM as KeyspaceGroupManager
participant Store as Storage
participant Metrics as Metrics
rect rgba(200,200,255,0.5)
Client->>KGM: TSO request / lookup keyspaceID
KGM->>KGM: Check in-memory keyspace-group map
alt not present in-memory
KGM->>Store: Load KeyspaceMeta(keyspaceID)
Store-->>KGM: KeyspaceMeta (exists / not exists)
alt KeyspaceMeta has configured group
KGM->>Metrics: Increment "rejected" counter
KGM-->>Client: Return ErrKeyspaceNotAssigned
else no configured group
KGM->>Metrics: Increment "to_default" counter
KGM-->>Client: Fallback to default keyspace group
end
else present in-memory
KGM-->>Client: Return assigned keyspace-group
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/tso/keyspace_group_manager.go (2)
282-286: Consider accepting a context parameter for graceful shutdown.Using
context.Background()ignores any cancellation signals from the caller's context. During graceful shutdown, this could delay termination if the storage call hangs or is slow.♻️ Suggested change
-func checkKeyspaceHasConfiguredGroup(storage *endpoint.StorageEndpoint, keyspaceID uint32) bool { +func checkKeyspaceHasConfiguredGroup(ctx context.Context, storage *endpoint.StorageEndpoint, keyspaceID uint32) bool { // Null keyspace is always allowed to fallback to default group if keyspaceID == constant.NullKeyspaceID { // ... } // Load keyspace metadata from storage var meta *keyspacepb.KeyspaceMeta - err := storage.RunInTxn(context.Background(), func(txn kv.Txn) error { + err := storage.RunInTxn(ctx, func(txn kv.Txn) error { var err error meta, err = storage.LoadKeyspaceMeta(txn, keyspaceID) return err })Then update the call site at line 353 to pass a context from
kgm.ctxor the caller.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tso/keyspace_group_manager.go` around lines 282 - 286, Replace the use of context.Background() in the storage.RunInTxn call so the method accepts and uses a caller-provided context: change the function signature that contains this call to accept a ctx parameter (or use the existing kgm.ctx), pass that ctx into storage.RunInTxn instead of context.Background(), and propagate the same ctx into the inner call to storage.LoadKeyspaceMeta; update the call site that invokes this function to pass kgm.ctx (or the caller's context) so cancellation and graceful shutdown signals are honored by RunInTxn and LoadKeyspaceMeta.
277-278: Consider reducing log level to Debug for routine checks.This
log.Infowill be emitted for every TSO request that triggers the fallback check path. In high-traffic scenarios, this could generate excessive log volume. The successful/rejected outcomes at lines 298-303 also uselog.Info.Consider using
log.Debugfor these messages to reduce noise while still allowing them to be enabled when troubleshooting.♻️ Suggested change
- log.Info("[tso] checking keyspace meta for group configuration before fallback", + log.Debug("[tso] checking keyspace meta for group configuration before fallback", zap.Uint32("keyspace-id", keyspaceID))Similarly for lines 290, 298, and 302.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tso/keyspace_group_manager.go` around lines 277 - 278, Change the routine informational logs used during the keyspace fallback check to debug-level so they don't flood logs under normal load: replace the log.Info call that emits "[tso] checking keyspace meta for group configuration before fallback" and the subsequent Info logs that report the successful/rejected outcomes with log.Debug while keeping the same message text and zap fields (e.g., the "keyspace-id" field); locate and update the logging calls in the keyspace fallback/checking code paths (the log statements shown in the diff and the related success/reject logs) to use Debug instead of Info.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/tso/keyspace_group_manager.go`:
- Around line 282-286: Replace the use of context.Background() in the
storage.RunInTxn call so the method accepts and uses a caller-provided context:
change the function signature that contains this call to accept a ctx parameter
(or use the existing kgm.ctx), pass that ctx into storage.RunInTxn instead of
context.Background(), and propagate the same ctx into the inner call to
storage.LoadKeyspaceMeta; update the call site that invokes this function to
pass kgm.ctx (or the caller's context) so cancellation and graceful shutdown
signals are honored by RunInTxn and LoadKeyspaceMeta.
- Around line 277-278: Change the routine informational logs used during the
keyspace fallback check to debug-level so they don't flood logs under normal
load: replace the log.Info call that emits "[tso] checking keyspace meta for
group configuration before fallback" and the subsequent Info logs that report
the successful/rejected outcomes with log.Debug while keeping the same message
text and zap fields (e.g., the "keyspace-id" field); locate and update the
logging calls in the keyspace fallback/checking code paths (the log statements
shown in the diff and the related success/reject logs) to use Debug instead of
Info.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: af0a1a2f-3970-446f-b1ef-04676ef12e99
📒 Files selected for processing (4)
pkg/tso/keyspace_group_manager.gopkg/tso/keyspace_group_manager_test.gopkg/tso/metrics.goserver/cluster/cluster.go
|
@liyishuai: adding LGTM is restricted to approvers and reviewers in OWNERS files. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: liyishuai The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| return err | ||
| }) | ||
|
|
||
| if err != nil || meta == nil { |
There was a problem hiding this comment.
What if the error happens when loading from storage, but the keyspace has a configured group?
|
|
||
| // Load keyspace metadata from storage | ||
| var meta *keyspacepb.KeyspaceMeta | ||
| err := storage.RunInTxn(context.Background(), func(txn kv.Txn) error { |
There was a problem hiding this comment.
Do we need a context instead of using Background()?
| // Check if Config has TSOKeyspaceGroupIDKey | ||
| _, hasGroup := meta.GetConfig()[keyspace.TSOKeyspaceGroupIDKey] | ||
| if hasGroup { | ||
| log.Info("[tso] keyspace has configured group in meta, should not fallback to default group", |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tso/keyspace_group_manager.go (1)
311-355:⚠️ Potential issue | 🟠 MajorEtcd read performed while holding
state.RLock().
getKeyspaceGroupMetaWithChecktakess.RLock()at line 315-316 and, on the miss path, callskgm.checkKeyspaceHasConfiguredGroupat line 350 which runskgm.storage.RunInTxn→ etcd load (lines 593-596). This is on the TSO request hot path (HandleTSORequest,GetMember,FindGroupByKeyspaceID), and it blocks every other reader/writer of the keyspace-group state (including concurrent TSO dispatch for other keyspaces) for the duration of an etcd round trip. Under an etcd stall this can serialize/stall TSO service-wide.Please move the storage check outside the lock: release
s.RLock(), performcheckKeyspaceHasConfiguredGroup, then re-acquire the lock if you need to re-validate state. Alternatively, cache the "has configured group" bit per keyspace (the keyspace→group config is rarely changed) and invalidate on keyspace meta updates, so the hot path stays in-memory.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tso/keyspace_group_manager.go` around lines 311 - 355, getKeyspaceGroupMetaWithCheck currently holds s.RLock() while calling kgm.checkKeyspaceHasConfiguredGroup (which calls kgm.storage.RunInTxn and performs an etcd read), blocking all other readers/writers; fix by releasing the RLock before invoking kgm.checkKeyspaceHasConfiguredGroup and then re-acquiring the RLock to re-validate the in-memory state (re-check allocators, s.keyspaceLookupTable or any conditions you relied on), or implement a cached boolean for "keyspace has configured group" (maintained/invalidated on keyspace meta updates) and use that cache under the RLock to avoid any storage call on the hot path; specifically modify getKeyspaceGroupMetaWithCheck to: 1) exit the RLock before calling kgm.checkKeyspaceHasConfiguredGroup (which invokes kgm.storage.RunInTxn), 2) call the check, and 3) re-lock and re-check the same state used earlier (allocators, s.kgs, s.keyspaceLookupTable) before returning to ensure no race.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tso/keyspace_group_manager.go`:
- Around line 598-603: The info-level logging in the branch that handles a
failed/missing keyspace meta (the log call emitting "keyspace meta not found or
failed to load, assume legacy keyspace" with zap.Uint32("keyspace-id",
keyspaceID) and zap.Error(err)) is on the hot path and should be downgraded to
debug to avoid log spam; change that log.Info to log.Debug, and make the same
change in the duplicate helper that emits the same message (replace its log.Info
with log.Debug) while preserving the structured fields
(zap.Uint32("keyspace-id", keyspaceID), zap.Error(err)); keep behavior otherwise
consistent with the existing log.Debug used for the related "skip fallback"
message.
- Around line 593-611: When RunInTxn/LoadKeyspaceMeta returns an error, do not
assume legacy keyspace; treat this as a load failure and reject the fallback to
group 0. In getKeyspaceGroupMetaWithCheck, distinguish the three cases: (1) err
!= nil → fail-safe by returning hasGroup=true (reject fallback) and increment
kgm.metrics.keyspaceFallbackRejectedCounter if metrics present; (2) err == nil
&& meta == nil → genuine not-found, allow fallback (return false); (3) err ==
nil && meta != nil → check meta.GetConfig()[keyspace.TSOKeyspaceGroupIDKey] as
before. Use kgm.storage.RunInTxn / LoadKeyspaceMeta /
keyspace.TSOKeyspaceGroupIDKey / kgm.metrics.keyspaceFallbackRejectedCounter as
reference points.
- Around line 266-303: Remove the dead package-level function
checkKeyspaceHasConfiguredGroup entirely and keep only the
kgm.checkKeyspaceHasConfiguredGroup method; move the invocation of
kgm.checkKeyspaceHasConfiguredGroup out of the s.RLock() region in
getKeyspaceGroupMetaWithCheck so the etcd/storage call (storage.RunInTxn /
LoadKeyspaceMeta) does not execute while holding the RLock; change the storage
error-path in the method so that if storage.RunInTxn returns an error you treat
it as “reject fallback” (return true or propagate error) instead of returning
false which enables fallback; and reduce logging noise by replacing the hot-path
log.Info calls related to missing/legacy/fallback keyspace cases with log.Debug
in both the standalone logic and the kgm method.
---
Outside diff comments:
In `@pkg/tso/keyspace_group_manager.go`:
- Around line 311-355: getKeyspaceGroupMetaWithCheck currently holds s.RLock()
while calling kgm.checkKeyspaceHasConfiguredGroup (which calls
kgm.storage.RunInTxn and performs an etcd read), blocking all other
readers/writers; fix by releasing the RLock before invoking
kgm.checkKeyspaceHasConfiguredGroup and then re-acquiring the RLock to
re-validate the in-memory state (re-check allocators, s.keyspaceLookupTable or
any conditions you relied on), or implement a cached boolean for "keyspace has
configured group" (maintained/invalidated on keyspace meta updates) and use that
cache under the RLock to avoid any storage call on the hot path; specifically
modify getKeyspaceGroupMetaWithCheck to: 1) exit the RLock before calling
kgm.checkKeyspaceHasConfiguredGroup (which invokes kgm.storage.RunInTxn), 2)
call the check, and 3) re-lock and re-check the same state used earlier
(allocators, s.kgs, s.keyspaceLookupTable) before returning to ensure no race.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ce017cbe-77a7-4b60-86fd-eab4dd2a8cc1
📒 Files selected for processing (1)
pkg/tso/keyspace_group_manager.go
| // checkKeyspaceHasConfiguredGroup checks if the keyspace has configured a group in its metadata. | ||
| // This is used to determine if we should allow fallback to default group. | ||
| func checkKeyspaceHasConfiguredGroup(ctx context.Context, storage *endpoint.StorageEndpoint, keyspaceID uint32) bool { | ||
| // Null keyspace is always allowed to fallback to default group | ||
| // NullKeyspaceID is used for api v1 or legacy path where is keyspace agnostic | ||
| if keyspaceID == constant.NullKeyspaceID { | ||
| log.Debug("null keyspace, allow fallback to default group", | ||
| zap.Uint32("keyspace-id", keyspaceID)) | ||
| return false | ||
| } | ||
|
|
||
| // Load keyspace metadata from storage | ||
| var meta *keyspacepb.KeyspaceMeta | ||
| var err error | ||
| err = storage.RunInTxn(ctx, func(txn kv.Txn) error { | ||
| meta, err = storage.LoadKeyspaceMeta(txn, keyspaceID) | ||
| return err | ||
| }) | ||
|
|
||
| if err != nil || meta == nil { | ||
| // If can't load or not found, assume no group configured (legacy keyspace) | ||
| log.Info("[tso] keyspace meta not found or failed to load, assume legacy keyspace", | ||
| zap.Uint32("keyspace-id", keyspaceID), zap.Error(err)) | ||
| return false | ||
| } | ||
|
|
||
| // Check if Config has TSOKeyspaceGroupIDKey | ||
| _, hasGroup := meta.GetConfig()[keyspace.TSOKeyspaceGroupIDKey] | ||
| if hasGroup { | ||
| log.Info("keyspace has configured group in meta, should not fallback to default group", | ||
| zap.Uint32("keyspace-id", keyspaceID), | ||
| zap.String("configured-group-id", meta.GetConfig()[keyspace.TSOKeyspaceGroupIDKey])) | ||
| } else { | ||
| log.Info("keyspace has no group configuration in meta, can fallback to default group", | ||
| zap.Uint32("keyspace-id", keyspaceID)) | ||
| } | ||
| return hasGroup | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm the package-level function has no callers.
rg -nP '\bcheckKeyspaceHasConfiguredGroup\s*\(' --type=go -C2Repository: tikv/pd
Length of output: 1882
🏁 Script executed:
#!/bin/bash
# Read the package-level function and the method implementation to compare
cat -n pkg/tso/keyspace_group_manager.go | sed -n '266,303p'
echo "---"
cat -n pkg/tso/keyspace_group_manager.go | sed -n '579,620p'Repository: tikv/pd
Length of output: 3920
🏁 Script executed:
#!/bin/bash
# Check the context of the call at line 350 to see if it's under RLock
cat -n pkg/tso/keyspace_group_manager.go | sed -n '310,365p'Repository: tikv/pd
Length of output: 3094
🏁 Script executed:
#!/bin/bash
# Check if storage.RunInTxn is a blocking etcd IO operation
rg -n 'RunInTxn' --type=go -A5 pkg/tso/ | head -40Repository: tikv/pd
Length of output: 1770
Remove the package-level checkKeyspaceHasConfiguredGroup function (lines 266–303) — it is dead code with no callers.
Only the method version at line 581 is used (at line 350 in getKeyspaceGroupMetaWithCheck). Keeping both creates maintenance burden and risk of drift.
Move the kgm.checkKeyspaceHasConfiguredGroup call outside the RLock (critical performance issue).
The call at line 350 executes inside s.RLock() (line 315). The method performs a blocking etcd transactional read via storage.RunInTxn() (line 593), which blocks all TSO requests for every keyspace/group in the hot path. Extract this check before acquiring the lock, or defer it to the caller's non-locked section.
Fix error handling: on storage load failure, reject fallback instead of enabling it.
When storage.RunInTxn() fails at line 598, the method returns false (no group configured), allowing fallback to the default group. If the keyspace actually has a configured group but the read failed, this is incorrect. On error, return true (reject fallback) or propagate the error up the stack.
Change log.Info to log.Debug on the hot path.
Lines 287 (standalone), 600 (method), and 295–299 (standalone) use log.Info for fallback/error cases. These are called on every TSO request and will spam logs. Use log.Debug instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/tso/keyspace_group_manager.go` around lines 266 - 303, Remove the dead
package-level function checkKeyspaceHasConfiguredGroup entirely and keep only
the kgm.checkKeyspaceHasConfiguredGroup method; move the invocation of
kgm.checkKeyspaceHasConfiguredGroup out of the s.RLock() region in
getKeyspaceGroupMetaWithCheck so the etcd/storage call (storage.RunInTxn /
LoadKeyspaceMeta) does not execute while holding the RLock; change the storage
error-path in the method so that if storage.RunInTxn returns an error you treat
it as “reject fallback” (return true or propagate error) instead of returning
false which enables fallback; and reduce logging noise by replacing the hot-path
log.Info calls related to missing/legacy/fallback keyspace cases with log.Debug
in both the standalone logic and the kgm method.
| err = kgm.storage.RunInTxn(kgm.ctx, func(txn kv.Txn) error { | ||
| meta, err = kgm.storage.LoadKeyspaceMeta(txn, keyspaceID) | ||
| return err | ||
| }) | ||
|
|
||
| if err != nil { | ||
| // If can't load or not found, assume no group configured (legacy keyspace) | ||
| log.Info("[tso] keyspace meta not found or failed to load, assume legacy keyspace", | ||
| zap.Uint32("keyspace-id", keyspaceID), zap.Error(err)) | ||
| return false | ||
| } | ||
|
|
||
| // Check if Config has TSOKeyspaceGroupIDKey | ||
| _, hasGroup := meta.GetConfig()[keyspace.TSOKeyspaceGroupIDKey] | ||
| if kgm.metrics != nil && hasGroup { | ||
| kgm.metrics.keyspaceFallbackRejectedCounter.Inc() | ||
| } | ||
| return hasGroup | ||
| } |
There was a problem hiding this comment.
Storage error silently allows fallback to group 0 — defeats the purpose of this fix.
When LoadKeyspaceMeta (or RunInTxn) returns an error for a real keyspace, this method returns false, so getKeyspaceGroupMetaWithCheck proceeds to fall back to the default keyspace group. That is exactly the "stale state → wrong fallback to group 0" scenario the PR is meant to prevent: a keyspace that in fact has a configured group can still be served TSO from group 0 during a transient etcd failure, producing a TSO regression.
Please fail-safe here: on error, either reject the fallback (return true) or propagate the error up so the caller returns ErrKeyspaceNotAssigned instead of switching to the default group. Note that per pkg/storage/endpoint/keyspace.go, LoadKeyspaceMeta returns (nil, nil) when the keyspace genuinely does not exist, so it is safe to distinguish "not found" (err == nil && meta == nil → allow fallback) from "load failed" (err != nil → reject/propagate).
🛡️ Suggested direction
- err = kgm.storage.RunInTxn(kgm.ctx, func(txn kv.Txn) error {
- meta, err = kgm.storage.LoadKeyspaceMeta(txn, keyspaceID)
- return err
- })
-
- if err != nil {
- // If can't load or not found, assume no group configured (legacy keyspace)
- log.Info("[tso] keyspace meta not found or failed to load, assume legacy keyspace",
- zap.Uint32("keyspace-id", keyspaceID), zap.Error(err))
- return false
- }
+ err = kgm.storage.RunInTxn(kgm.ctx, func(txn kv.Txn) error {
+ meta, err = kgm.storage.LoadKeyspaceMeta(txn, keyspaceID)
+ return err
+ })
+ if err != nil {
+ // Be conservative on load failure: reject fallback to the default group
+ // to avoid serving TSO from group 0 for a keyspace that may have a
+ // configured group.
+ log.Warn("[tso] failed to load keyspace meta, reject fallback to default group",
+ zap.Uint32("keyspace-id", keyspaceID), zap.Error(err))
+ if kgm.metrics != nil {
+ kgm.metrics.keyspaceFallbackRejectedCounter.Inc()
+ }
+ return true
+ }
+ if meta == nil {
+ // Keyspace really does not exist in storage — legacy/null path, allow fallback.
+ return false
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/tso/keyspace_group_manager.go` around lines 593 - 611, When
RunInTxn/LoadKeyspaceMeta returns an error, do not assume legacy keyspace; treat
this as a load failure and reject the fallback to group 0. In
getKeyspaceGroupMetaWithCheck, distinguish the three cases: (1) err != nil →
fail-safe by returning hasGroup=true (reject fallback) and increment
kgm.metrics.keyspaceFallbackRejectedCounter if metrics present; (2) err == nil
&& meta == nil → genuine not-found, allow fallback (return false); (3) err ==
nil && meta != nil → check meta.GetConfig()[keyspace.TSOKeyspaceGroupIDKey] as
before. Use kgm.storage.RunInTxn / LoadKeyspaceMeta /
keyspace.TSOKeyspaceGroupIDKey / kgm.metrics.keyspaceFallbackRejectedCounter as
reference points.
| if err != nil { | ||
| // If can't load or not found, assume no group configured (legacy keyspace) | ||
| log.Info("[tso] keyspace meta not found or failed to load, assume legacy keyspace", | ||
| zap.Uint32("keyspace-id", keyspaceID), zap.Error(err)) | ||
| return false | ||
| } |
There was a problem hiding this comment.
Log level will spam on the hot path.
This branch is hit on every TSO request whose keyspace isn't yet in the in-memory lookup table (e.g., newly created keyspace, transient etcd blip). log.Info here — and the log.Info pair at lines 295-301 in the duplicate helper — will flood logs at request QPS. Please downgrade to log.Debug (or rate-limit), consistent with the log.Debug already used at line 351 for the related "skip fallback" message.
- log.Info("[tso] keyspace meta not found or failed to load, assume legacy keyspace",
+ log.Debug("[tso] keyspace meta not found or failed to load, assume legacy keyspace",
zap.Uint32("keyspace-id", keyspaceID), zap.Error(err))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if err != nil { | |
| // If can't load or not found, assume no group configured (legacy keyspace) | |
| log.Info("[tso] keyspace meta not found or failed to load, assume legacy keyspace", | |
| zap.Uint32("keyspace-id", keyspaceID), zap.Error(err)) | |
| return false | |
| } | |
| if err != nil { | |
| // If can't load or not found, assume no group configured (legacy keyspace) | |
| log.Debug("[tso] keyspace meta not found or failed to load, assume legacy keyspace", | |
| zap.Uint32("keyspace-id", keyspaceID), zap.Error(err)) | |
| return false | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/tso/keyspace_group_manager.go` around lines 598 - 603, The info-level
logging in the branch that handles a failed/missing keyspace meta (the log call
emitting "keyspace meta not found or failed to load, assume legacy keyspace"
with zap.Uint32("keyspace-id", keyspaceID) and zap.Error(err)) is on the hot
path and should be downgraded to debug to avoid log spam; change that log.Info
to log.Debug, and make the same change in the duplicate helper that emits the
same message (replace its log.Info with log.Debug) while preserving the
structured fields (zap.Uint32("keyspace-id", keyspaceID), zap.Error(err)); keep
behavior otherwise consistent with the existing log.Debug used for the related
"skip fallback" message.
|
@bufferflies: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| // Before fallback to default group, check if the keyspace has configured a group in its metadata. | ||
| // If it has, don't fallback to avoid incorrect switching when state is inconsistent. | ||
| if kgm != nil && kgm.checkKeyspaceHasConfiguredGroup(keyspaceID) { | ||
| log.Debug("[tso] keyspace has configured group but not found in lookup table, skip fallback to default group", | ||
| zap.Uint32("keyspace-id", keyspaceID), | ||
| zap.Uint32("requested-group-id", keyspaceGroupID)) | ||
| return nil, nil, keyspaceGroupID, s.modRevision, errs.ErrKeyspaceNotAssigned.FastGenByArgs(keyspaceID) | ||
| } |
There was a problem hiding this comment.
The bug here should be: a keyspace already assigned to a non-default group is wrongly falling back to group 0 when the lookup state is stale.
But this check blocks fallback for any keyspace that has tso_keyspace_group_id, including "0". That also changes the behavior of default-group keyspaces.
Should we only block fallback when the configured group ID is non-zero, and add a test for "0"?
Summary
0when a keyspace already has a configured TSO group but lookup state is stalemake checkgreen on currentmasterby including the requiredgofmtcleanup inserver/cluster/cluster.goIssue Number: ref #10516, close #10554, close #10601
Author
@ystaticy
CP
7786830e271ac93bSummary by CodeRabbit
Bug Fixes
New Features