refactor(client): allow specifying RPC deadlines and streamline retry logic#162
refactor(client): allow specifying RPC deadlines and streamline retry logic#162cameronmeissner wants to merge 14 commits intomainfrom
Conversation
|
@copilot please review this PR |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Refactors the bootstrap client to support per-operation timeouts (RPC “deadlines”) and simplifies retry/telemetry reporting by removing the old cross-retry trace aggregation.
Changes:
- Introduces per-step timeout fields in
bootstrap.Configand wires them into IMDS + gRPC calls. - Moves access-token retry logic into a dedicated
getAccessTokenflow and removesTraceStore/ multi-retry error aggregation. - Adjusts CLI + guest-agent event payload to emit a single trace instead of per-retry traces + summary.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| client/internal/telemetry/telemetry.go | Changes trace collection semantics (no longer clears spans) and removes TraceStore. |
| client/internal/telemetry/telemetry_test.go | Updates tests to match new tracing/TraceStore removal. |
| client/internal/imds/retry.go | Refactors IMDS retryable status logic into a switch. |
| client/internal/bootstrap/event.go | Simplifies Result telemetry payload (single Trace). |
| client/internal/bootstrap/event_test.go | Updates event serialization expectations for new Result shape. |
| client/internal/bootstrap/errors.go | Removes retryability from bootstrapError. |
| client/internal/bootstrap/errors_test.go | Updates tests to align with bootstrapError changes. |
| client/internal/bootstrap/config.go | Adds per-step timeouts, deprecates Deadline, and sets new defaults. |
| client/internal/bootstrap/config_test.go | Updates config tests after deadline behavior change. |
| client/internal/bootstrap/client.go | Implements kubeconfig pre-validation, adds per-step timeouts, and access-token retry loop. |
| client/internal/bootstrap/client_test.go | Expands bootstrap tests for validator + timeout-based access token behavior. |
| client/internal/bootstrap/bootstrap.go | Removes outer retry loop; Bootstrap now runs once and writes kubeconfig. |
| client/internal/bootstrap/auth.go | Renames/refactors token acquisition + retryability classification. |
| client/internal/bootstrap/auth_test.go | Renames/refactors tests for token acquisition + retryability predicate. |
| client/cmd/client/main.go | Updates CLI flags for per-step timeouts and changes emitted telemetry to a single trace. |
Comments suppressed due to low confidence (1)
client/internal/bootstrap/config_test.go:1
Config.applyDefaults()now sets multiple new per-RPC timeout defaults (and no longer defaultsDeadline), but the config tests no longer assert any of these new default timeout values. Add assertions forGetAccessTokenTimeout,GetInstanceDataTimeout,GetNonceTimeout,GetAttestedDataTimeout, andGetCredentialTimeoutto ensure defaulting behavior is covered and doesn’t regress.
// Copyright (c) Microsoft Corporation.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Deadline is now deprecated and will not be respected. | ||
| // Use per-RPC timeouts instead. | ||
| Deadline time.Duration `json:"deadline"` |
There was a problem hiding this comment.
Marking Deadline as “will not be respected” is a behavior-breaking change for any existing configs/automation still setting deadline. To preserve backwards compatibility while still encouraging per-RPC timeouts, consider honoring Deadline as a fallback (e.g., if any per-RPC timeout is 0, derive reasonable per-RPC values from Deadline, or at least apply Deadline as an overall context deadline at the top-level entrypoint).
| func (r *tracer) getTrace() Trace { | ||
| trace := make(Trace, len(r.spans)) | ||
| for spanName, span := range r.spans { | ||
| trace[spanName] = span.end.Sub(span.start) | ||
| } | ||
| r.spans = make(map[string]*span) | ||
| return trace | ||
| } |
There was a problem hiding this comment.
span.end.Sub(span.start) will produce an incorrect (potentially huge/negative) duration if a span was started but never ended (span.end is the zero time). Since GetTrace is now used as a final “collect everything” call (and no longer clears spans), it’s safer to either omit unfinished spans or compute their duration using time.Since(span.start) when span.end.IsZero().
…ady valid Agent-Logs-Url: https://github.com/Azure/aks-secure-tls-bootstrap/sessions/8da8e76c-35df-44b9-ad6b-94c68510c1a3 Co-authored-by: cameronmeissner <24923771+cameronmeissner@users.noreply.github.com>
Here are my review findings: Bug fixed (1c64ba1): When the kubeconfig was already valid, Other observations (no changes made):
|
No description provided.