Skip to content

tso: validate callee id for tso requests#10600

Open
bufferflies wants to merge 12 commits intotikv:masterfrom
bufferflies:pr-merge/88e4f7d1-validate-callee-id
Open

tso: validate callee id for tso requests#10600
bufferflies wants to merge 12 commits intotikv:masterfrom
bufferflies:pr-merge/88e4f7d1-validate-callee-id

Conversation

@bufferflies
Copy link
Copy Markdown
Contributor

@bufferflies bufferflies commented Apr 14, 2026

Issue Number

ref #10516, close #10552

What problem does this PR solve?

The TSO client can keep using a stale gRPC connection after service endpoint changes. This patch carries a callee ID in TSO requests and lets the server reject requests that land on the wrong endpoint, so the client can drop the stale connection and reconnect.

What is changed and how does it work?

  • attach CalleeId to TSO request headers
  • validate callee ID in the TSO gRPC service against the advertised listen address
  • treat callee-ID mismatch as a reconnect signal in the TSO client
  • add RemoveClientConn to service discovery implementations so stale gRPC connections can be closed and recreated
  • upgrade github.com/pingcap/kvproto to latest v0.0.0-20260414083400-4388bfaaedab
  • tidy affected submodules after the kvproto upgrade
  • include the minimal gofmt/test-stub follow-up required for make check

Check List

  • Tests
  • No release note

Validation

  • go test ./pkg/mcs/tso/server -count=1
  • cd client && go test ./errs ./servicediscovery ./clients/tso -run TestDoesNotExist -count=1
  • make check

author: @iosmanthus
cp 88e4f7d1

Summary by CodeRabbit

  • New Features

    • Requests now include a callee ID and servers validate it, rejecting mismatches.
  • Bug Fixes

    • Stale gRPC connections are proactively removed on callee mismatches; leader-change handling remains distinct and check scheduling refined.
    • Connection cleanup now attempts proper close and logs failures.
  • Tests

    • Added tests for callee-mismatch handling and connection-removal behavior.
  • Chores

    • Updated kvproto dependency.

Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 14, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot Bot added dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 14, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Clients now attach a CalleeId header to TSO requests; servers validate it against their advertise listen host and return FailedPrecondition on mismatch. Clients remove stale per-URL gRPC connections via ServiceDiscovery.RemoveClientConn and retry with a new connection. ServiceDiscovery gained RemoveClientConn; callee-mismatch error detection was added.

Changes

Cohort / File(s) Summary
TSO client connection & dispatcher
client/clients/tso/client.go, client/clients/tso/dispatcher.go, client/clients/tso/dispatcher_test.go
Switch to GetOrCreateGRPCConn(url) (error-returning), simplify dispatcher ctor, and update tests to inject svcDiscovery. Handle callee-mismatch by removing stale conn via RemoveClientConn.
TSO stream / callee ID
client/clients/tso/stream.go
Add calleeID (parsed from server URL) and include it in outgoing tsopb.TsoRequest header.
Error defs & helpers
client/errs/errno.go, client/errs/errs.go
Add MismatchCalleeIDErr constant and IsCalleeMismatch(err error) bool helper.
ServiceDiscovery interface & impls
client/servicediscovery/service_discovery.go, client/servicediscovery/tso_service_discovery.go, client/servicediscovery/router_service_discovery.go, client/servicediscovery/mock_service_discovery.go
Extend interface with RemoveClientConn(url string) and implement eviction+close in concrete types; add no-op/mock method for tests.
Tests & test doubles
client/resource_manager_client_test.go, client/clients/tso/dispatcher_test.go
Add test helper removal method, countingServiceDiscovery, and test asserting RemoveClientConn is called on callee mismatch.
Server-side validation & caching
pkg/mcs/tso/server/grpc_service.go, pkg/mcs/tso/server/server.go, pkg/mcs/tso/server/config_test.go
Server caches advertise listen host, validates incoming CalleeId against it, and returns gRPC FailedPrecondition on mismatch; add tests for host extraction and caching.
Module bumps
client/go.mod, go.mod, tests/integrations/go.mod, tools/go.mod
Bump github.com/pingcap/kvproto version across modules.
Misc / formatting
server/cluster/cluster.go
Whitespace/field alignment reformat only.

Sequence Diagram(s)

sequenceDiagram
    participant Client as TSO Client
    participant SD as ServiceDiscovery
    participant Conn as gRPC Conn
    participant Server as TSO Server

    rect rgba(100,200,100,0.5)
    Note over Client,Server: Initial request with CalleeId
    Client->>Client: extract calleeID from serverURL
    Client->>SD: GetOrCreateGRPCConn(url)
    SD-->>Client: *grpc.ClientConn
    Client->>Server: TsoRequest (CalleeId header)
    Server->>Server: parse advertise host & compare CalleeId
    alt match
        Server-->>Client: Response
    else mismatch
        Server-->>Client: FailedPrecondition (mismatch)
    end
    end

    rect rgba(100,150,200,0.5)
    Note over Client,SD: Recovery on mismatch
    Client->>Client: IsCalleeMismatch(err)
    Client->>SD: RemoveClientConn(url)
    SD->>Conn: Close() and delete cache entry
    Client->>SD: GetOrCreateGRPCConn(url) [retry]
    SD-->>Client: new *grpc.ClientConn
    Client->>Server: TsoRequest (retry)
    Server-->>Client: Response
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

type/development, lgtm, approved

Suggested reviewers

  • JmPotato

Poem

🐰 I sniffed the server URL today,
CalleeId in paw, I hop away.
When mismatch thumps the old connection's door,
I nudge the cache and close it to the floor—
New hop, new call, the TSO trail restored.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'tso: validate callee id for tso requests' directly and clearly summarizes the main objective of the PR: validating callee IDs for TSO requests.
Description check ✅ Passed The PR description includes issue references, a clear problem statement, detailed explanation of what changed and how it works, and a comprehensive checklist with test validation notes.
Linked Issues check ✅ Passed The PR implements all key objectives from linked issue #10552: attaching CalleeId to TSO requests, validating callee ID on server side, treating mismatches as reconnect signals, adding RemoveClientConn to service discovery, and upgrading kvproto with comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes are directly related to the stated objectives: TSO callee ID validation logic, service discovery enhancements, kvproto dependency updates, and associated test/formatting changes. Struct field formatting in cluster.go and test helper additions are minimal and necessary for code quality.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bufferflies
Copy link
Copy Markdown
Contributor Author

/ping @iosmanthus

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
client/errs/errs.go (1)

42-48: Harden callee-mismatch detection to include gRPC status code check.

Line 47 currently relies only on substring matching against error messages. This can produce false positives if other gRPC errors coincidentally contain "mismatch callee id". Adding a status code check first narrows the scope to only codes.FailedPrecondition errors that match the message.

♻️ Proposed refactor
 import (
 	"strings"
 
 	"go.uber.org/zap"
 	"go.uber.org/zap/zapcore"
 	"google.golang.org/grpc/codes"
+	"google.golang.org/grpc/status"
 
 	"github.com/pingcap/errors"
 )
@@
 func IsCalleeMismatch(err error) bool {
 	if err == nil {
 		return false
 	}
-	return strings.Contains(err.Error(), MismatchCalleeIDErr)
+	cause := errors.Cause(err)
+	if status.Code(cause) != codes.FailedPrecondition {
+		return false
+	}
+	return strings.Contains(status.Convert(cause).Message(), MismatchCalleeIDErr)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/errs/errs.go` around lines 42 - 48, IsCalleeMismatch currently only
does substring matching on err.Error(), which can false-positive; update
IsCalleeMismatch to first extract a gRPC status via status.FromError(err),
verify the status.Code() == codes.FailedPrecondition, and only then check that
st.Message() (or err.Error()) contains the MismatchCalleeIDErr constant;
reference the IsCalleeMismatch function, MismatchCalleeIDErr constant, and use
status.FromError and codes.FailedPrecondition in the check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@client/errs/errs.go`:
- Around line 42-48: IsCalleeMismatch currently only does substring matching on
err.Error(), which can false-positive; update IsCalleeMismatch to first extract
a gRPC status via status.FromError(err), verify the status.Code() ==
codes.FailedPrecondition, and only then check that st.Message() (or err.Error())
contains the MismatchCalleeIDErr constant; reference the IsCalleeMismatch
function, MismatchCalleeIDErr constant, and use status.FromError and
codes.FailedPrecondition in the check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 96598a2f-f433-46f3-8eb3-101e98530b4f

📥 Commits

Reviewing files that changed from the base of the PR and between b0a3c90 and eeaa902.

⛔ Files ignored due to path filters (4)
  • client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • tests/integrations/go.sum is excluded by !**/*.sum
  • tools/go.sum is excluded by !**/*.sum
📒 Files selected for processing (16)
  • client/clients/tso/client.go
  • client/clients/tso/dispatcher.go
  • client/clients/tso/stream.go
  • client/errs/errno.go
  • client/errs/errs.go
  • client/go.mod
  • client/resource_manager_client_test.go
  • client/servicediscovery/mock_service_discovery.go
  • client/servicediscovery/router_service_discovery.go
  • client/servicediscovery/service_discovery.go
  • client/servicediscovery/tso_service_discovery.go
  • go.mod
  • pkg/mcs/tso/server/grpc_service.go
  • server/cluster/cluster.go
  • tests/integrations/go.mod
  • tools/go.mod

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 14, 2026

@iosmanthus: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 14, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: iosmanthus
Once this PR has been reviewed and has the lgtm label, please assign rleungx for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread client/clients/tso/dispatcher.go
Signed-off-by: bufferflies <1045931706@qq.com>
Comment thread pkg/mcs/tso/server/grpc_service.go Outdated
@bufferflies bufferflies requested a review from lhy1024 April 15, 2026 07:15
Signed-off-by: tongjian <1045931706@qq.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 53.22581% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.92%. Comparing base (b21a183) to head (418c916).
⚠️ Report is 1 commits behind head on master.

❌ Your patch check has failed because the patch coverage (53.22%) is below the target coverage (74.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10600      +/-   ##
==========================================
- Coverage   78.96%   78.92%   -0.05%     
==========================================
  Files         532      532              
  Lines       71883    72018     +135     
==========================================
+ Hits        56766    56838      +72     
- Misses      11093    11137      +44     
- Partials     4024     4043      +19     
Flag Coverage Δ
unittests 78.92% <53.22%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@lhy1024 lhy1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rest LGTM

}
cc, ok := c.svcDiscovery.GetClientConns().Load(url)
if !ok {
cc, err := c.svcDiscovery.GetOrCreateGRPCConn(url)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to cover discovery or unary? They also reuse the same gRPC connection, such as FindGroupByKeyspaceID / GetMinTS.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FindGroupByKeyspaceID / GetMinTS -> all the tso services can handle this, it doesn't matter if the node mismatches.

}

// RemoveClientConn removes and closes the gRPC connection of the given URL.
func (c *tsoServiceDiscovery) RemoveClientConn(url string) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to clear the cached *grpc.ClientConn in serviceClient?

Signed-off-by: tongjian <1045931706@qq.com>
@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 16, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
pkg/mcs/tso/server/grpc_service.go (1)

104-112: Callee ID validation logic is correct.

The validation correctly:

  • Skips when either calleeID or host is empty (graceful degradation)
  • Returns FailedPrecondition which triggers client-side reconnection via IsCalleeMismatch()
  • Contains "mismatch callee id" in error message matching client detection logic

One minor observation: The error message uses s.GetAdvertiseListenAddr() (full URL like http://127.0.0.1:2379) but calleeID is just host:port (127.0.0.1:2379). Consider using host instead for consistency:

♻️ Optional: Use consistent format in error message
 		if calleeID := header.GetCalleeId(); calleeID != "" && host != "" {
 			if calleeID != host {
 				return status.Errorf(
 					codes.FailedPrecondition, "mismatch callee id, need %s but got %s",
-					s.GetAdvertiseListenAddr(), calleeID,
+					host, calleeID,
 				)
 			}
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/tso/server/grpc_service.go` around lines 104 - 112, The error message
uses the full advertise address while the callee ID is a host:port string, so
update the returned error in the callee ID check (the block using
s.getAdvertiseListenHost(), header.GetCalleeId(), and
s.GetAdvertiseListenAddr()) to include the host variable instead of
s.GetAdvertiseListenAddr() for consistent formatting; keep the same
codes.FailedPrecondition return and message text ("mismatch callee id, need %s
but got %s") but pass host as the expected value to match calleeID's host:port
form.
pkg/mcs/tso/server/config_test.go (1)

131-140: Use subtest's t for require instance in table-driven subtests.

The re instance is created from the outer t (line 102), but it's used inside subtests. If an assertion fails, the error reporting may not correctly associate with the specific subtest.

♻️ Suggested fix
 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
+			re := require.New(t)
 			s := &Server{
 				cfg: &Config{
 					AdvertiseListenAddr: tt.advertiseAddr,
 				},
 			}
 			re.Equal(tt.expected, s.getAdvertiseListenHost())
 		})
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/tso/server/config_test.go` around lines 131 - 140, The table-driven
subtests currently use the require instance "re" created from the outer test, so
assertion failures won't be associated with the specific subtest; inside the
t.Run anonymous function create a new require tied to the subtest (e.g., req :=
require.New(t)) and use that to assert the expected value from
s.getAdvertiseListenHost() for each tt in tests, ensuring failures are reported
under the correct subtest.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/mcs/tso/server/config_test.go`:
- Around line 131-140: The table-driven subtests currently use the require
instance "re" created from the outer test, so assertion failures won't be
associated with the specific subtest; inside the t.Run anonymous function create
a new require tied to the subtest (e.g., req := require.New(t)) and use that to
assert the expected value from s.getAdvertiseListenHost() for each tt in tests,
ensuring failures are reported under the correct subtest.

In `@pkg/mcs/tso/server/grpc_service.go`:
- Around line 104-112: The error message uses the full advertise address while
the callee ID is a host:port string, so update the returned error in the callee
ID check (the block using s.getAdvertiseListenHost(), header.GetCalleeId(), and
s.GetAdvertiseListenAddr()) to include the host variable instead of
s.GetAdvertiseListenAddr() for consistent formatting; keep the same
codes.FailedPrecondition return and message text ("mismatch callee id, need %s
but got %s") but pass host as the expected value to match calleeID's host:port
form.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9056f393-2e32-42cd-a137-bf57323492de

📥 Commits

Reviewing files that changed from the base of the PR and between 418c916 and eb187b9.

📒 Files selected for processing (3)
  • pkg/mcs/tso/server/config_test.go
  • pkg/mcs/tso/server/grpc_service.go
  • pkg/mcs/tso/server/server.go

Comment thread pkg/mcs/tso/server/server.go Outdated
parsed, err := url.Parse(advertiseListenAddr)
if err != nil || parsed.Host == "" {
return ""
func (s *Server) getAdvertiseListenHost() string {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to use cas here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is called by many threads, so it has a concurrent risk. I think the store is also ok for the last win.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?The address won't be changed after the server is started.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: bufferflies <1045931706@qq.com>
@bufferflies bufferflies force-pushed the pr-merge/88e4f7d1-validate-callee-id branch from 206d15c to 4470418 Compare April 20, 2026 06:36
Signed-off-by: tongjian <1045931706@qq.com>
Signed-off-by: tongjian <1045931706@qq.com>
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 20, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456, multiple issues should use full syntax for each issue and be separated by a comma, like: Issue Number: close #123, ref #456.

📖 For more info, you can check the "Linking issues" section in the CONTRIBUTING.md.

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 20, 2026

@bufferflies: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-error-log-review 3d6930b link true /test pull-error-log-review

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

addr := cfg.GetAdvertiseListenAddr()
parsed, err := url.Parse(addr)
if err != nil {
panic(fmt.Sprintf("invalid advertise listen address: %s", addr))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about returning an error?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/needs-linked-issue do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

88e4f7d1 validate callee ID for TSO requests

4 participants