Skip to content

refactor(client): allow specifying RPC deadlines and streamline retry logic#162

Open
cameronmeissner wants to merge 14 commits intomainfrom
cameissner/deadlines
Open

refactor(client): allow specifying RPC deadlines and streamline retry logic#162
cameronmeissner wants to merge 14 commits intomainfrom
cameissner/deadlines

Conversation

@cameronmeissner
Copy link
Copy Markdown
Contributor

No description provided.

@cameronmeissner cameronmeissner marked this pull request as draft April 6, 2026 23:56
@cameronmeissner cameronmeissner changed the title refactor(client): allow specifying bootstrap task deadlines and streamline retry logic refactor(client): allow specifying RPC deadlines and streamline retry logic Apr 6, 2026
@cameronmeissner cameronmeissner marked this pull request as ready for review April 7, 2026 19:14
@cameronmeissner cameronmeissner requested a review from Copilot April 7, 2026 19:14
@cameronmeissner
Copy link
Copy Markdown
Contributor Author

@copilot please review this PR

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Refactors the bootstrap client to support per-operation timeouts (RPC “deadlines”) and simplifies retry/telemetry reporting by removing the old cross-retry trace aggregation.

Changes:

  • Introduces per-step timeout fields in bootstrap.Config and wires them into IMDS + gRPC calls.
  • Moves access-token retry logic into a dedicated getAccessToken flow and removes TraceStore / multi-retry error aggregation.
  • Adjusts CLI + guest-agent event payload to emit a single trace instead of per-retry traces + summary.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
client/internal/telemetry/telemetry.go Changes trace collection semantics (no longer clears spans) and removes TraceStore.
client/internal/telemetry/telemetry_test.go Updates tests to match new tracing/TraceStore removal.
client/internal/imds/retry.go Refactors IMDS retryable status logic into a switch.
client/internal/bootstrap/event.go Simplifies Result telemetry payload (single Trace).
client/internal/bootstrap/event_test.go Updates event serialization expectations for new Result shape.
client/internal/bootstrap/errors.go Removes retryability from bootstrapError.
client/internal/bootstrap/errors_test.go Updates tests to align with bootstrapError changes.
client/internal/bootstrap/config.go Adds per-step timeouts, deprecates Deadline, and sets new defaults.
client/internal/bootstrap/config_test.go Updates config tests after deadline behavior change.
client/internal/bootstrap/client.go Implements kubeconfig pre-validation, adds per-step timeouts, and access-token retry loop.
client/internal/bootstrap/client_test.go Expands bootstrap tests for validator + timeout-based access token behavior.
client/internal/bootstrap/bootstrap.go Removes outer retry loop; Bootstrap now runs once and writes kubeconfig.
client/internal/bootstrap/auth.go Renames/refactors token acquisition + retryability classification.
client/internal/bootstrap/auth_test.go Renames/refactors tests for token acquisition + retryability predicate.
client/cmd/client/main.go Updates CLI flags for per-step timeouts and changes emitted telemetry to a single trace.
Comments suppressed due to low confidence (1)

client/internal/bootstrap/config_test.go:1

  • Config.applyDefaults() now sets multiple new per-RPC timeout defaults (and no longer defaults Deadline), but the config tests no longer assert any of these new default timeout values. Add assertions for GetAccessTokenTimeout, GetInstanceDataTimeout, GetNonceTimeout, GetAttestedDataTimeout, and GetCredentialTimeout to ensure defaulting behavior is covered and doesn’t regress.
// Copyright (c) Microsoft Corporation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +34 to +36
// Deadline is now deprecated and will not be respected.
// Use per-RPC timeouts instead.
Deadline time.Duration `json:"deadline"`
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking Deadline as “will not be respected” is a behavior-breaking change for any existing configs/automation still setting deadline. To preserve backwards compatibility while still encouraging per-RPC timeouts, consider honoring Deadline as a fallback (e.g., if any per-RPC timeout is 0, derive reasonable per-RPC values from Deadline, or at least apply Deadline as an overall context deadline at the top-level entrypoint).

Copilot uses AI. Check for mistakes.
Comment on lines 51 to 57
func (r *tracer) getTrace() Trace {
trace := make(Trace, len(r.spans))
for spanName, span := range r.spans {
trace[spanName] = span.end.Sub(span.start)
}
r.spans = make(map[string]*span)
return trace
}
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

span.end.Sub(span.start) will produce an incorrect (potentially huge/negative) duration if a span was started but never ended (span.end is the zero time). Since GetTrace is now used as a final “collect everything” call (and no longer clears spans), it’s safer to either omit unfinished spans or compute their duration using time.Since(span.start) when span.end.IsZero().

Copilot uses AI. Check for mistakes.
…ady valid

Agent-Logs-Url: https://github.com/Azure/aks-secure-tls-bootstrap/sessions/8da8e76c-35df-44b9-ad6b-94c68510c1a3

Co-authored-by: cameronmeissner <24923771+cameronmeissner@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI commented Apr 7, 2026

@copilot please review this PR

Here are my review findings:

Bug fixed (1c64ba1): When the kubeconfig was already valid, client.bootstrap() returned an empty clientcmdapi.Config{} with nil error. Bootstrap() then called clientcmd.WriteToFile() with that empty config, overwriting the valid kubeconfig. Fixed by changing client.bootstrap() to return *clientcmdapi.Config — nil when kubeconfig is already valid — so Bootstrap() can skip the write.

Other observations (no changes made):

  1. getServiceClient has no timeout — unlike all other RPC methods, there's no WithTimeout wrapping this call. Consider whether this could hang indefinitely.

  2. TLSMinVersion default is newapplyDefaults() now sets TLSMinVersion = "1.3" when unset. Previously, empty string was left as-is. This is a behavior change beyond the PR's scope that could affect callers relying on an empty value.

  3. Short IMDS timeouts without retriesGetInstanceDataTimeout (2s) and GetAttestedDataTimeout (2s) are quite tight. The old code allowed the full 2-minute deadline for retries. Since only getAccessToken has retry logic now, if an IMDS call transiently fails, it won't be retried. The IMDS retryablehttp client handles retries internally, but those are bounded by the 2-second context timeout.

  4. isMSI nil safetyisMSI() accesses config.CloudProviderConfig.UserAssignedIdentityID without nil-checking CloudProviderConfig. This is fine after validate() runs, but could panic in unit tests or edge cases where validation is skipped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants