Speed up guest-agent exec readiness retries#242

Merged
sjmiller609 merged 5 commits into main from hypeship/instrument-guest-network-init
May 27, 2026

@sjmiller609
Collaborator

@sjmiller609 sjmiller609 commented May 26, 2026

Summary

  • add guest exec tracing for WaitForAgent readiness paths so restore/fork hot paths expose total wait time, retry counts, retry interval config, and first/last retryable error type
  • keep normal no-wait execs out of this new tracing path because API exec already has its own session span
  • keep one retry behavior; no fast/slow mode
  • replace the old fixed 500ms wait-for-agent retry sleep with 25ms retries for the first 2s, then 250ms retries until the existing timeout
  • drop the pooled gRPC connection after retryable guest-agent connection failures so retries create a fresh vsock/gRPC path instead of reusing a connection that is stuck in gRPC backoff
  • aggregate retry data on the parent guest.exec span instead of emitting a child span per retry attempt
  • keep the shared test exec-readiness helper on explicit test-level polling instead of using WaitForAgent internally
  • update the VZ standby test to wait for instance state Running before calling standby, instead of relying on exec readiness as a lifecycle-state proxy
  • record only the command basename in trace attributes to avoid capturing shell arguments or env content
  • add coverage for retry interval selection, command-name sanitization, and fresh-connection retry behavior
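
The command-name sanitization mentioned in the list above could look roughly like the following. This is a minimal sketch, not the code from this PR: `traceCommandName` and the `"unknown"` fallback are hypothetical names chosen for illustration, and the real helper may treat edge cases differently.

```go
package main

import (
	"fmt"
	"path"
)

// traceCommandName is a hypothetical helper. It returns only the basename of
// the executable (argv[0]) so that shell arguments and environment content
// never end up in a trace attribute.
func traceCommandName(argv []string) string {
	if len(argv) == 0 || argv[0] == "" {
		// Assumed fallback label; the real implementation may differ.
		return "unknown"
	}
	// Guest commands are Unix-style paths, so path.Base is sufficient here.
	return path.Base(argv[0])
}

func main() {
	// The "-c" script and everything after it is dropped from the result.
	fmt.Println(traceCommandName([]string{"/bin/sh", "-c", "export TOKEN=secret; ip addr"}))
	fmt.Println(traceCommandName([]string{"ip", "addr", "show"}))
	fmt.Println(traceCommandName(nil))
}
```

Only the first element of the argument vector is inspected, so a secret embedded in a `sh -c` script cannot leak through this attribute.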

Testing

  • go test ./lib/guest ./lib/hypervisor
  • git diff --check

Note

Medium Risk
Touches core guest vsock/gRPC exec and connection pooling on boot/restore hot paths; behavior change is intentional but affects all WaitForAgent callers.

Overview
Guest exec readiness is faster and easier to observe when WaitForAgent is set (restore/fork/API paths that wait for the agent).

Retry backoff replaces a fixed 500ms sleep with 25ms for the first 2s, then 250ms until the existing deadline. On retryable vsock/gRPC-unavailable errors, the pooled connection is removed and closed so the next attempt dials fresh instead of sitting in gRPC backoff.

OpenTelemetry adds a guest.exec span only for WaitForAgent > 0 (single-attempt exec stays untraced to avoid duplicating API exec.session spans). The span records wait time, attempt counts, retry intervals, first/last retryable error types, and command basename only (no argv/env).

Tests: new unit tests for retry timing, sanitized command names, fresh-connection retries, and CloseConn behavior; integration helpers stop nesting WaitForAgent inside waitForExecAgent, and the VZ standby test waits for Running before standby.

Reviewed by Cursor Bugbot for commit d53eb85. Bugbot is set up for automated code reviews on this repo.

@sjmiller609 sjmiller609 force-pushed the hypeship/instrument-guest-network-init branch 2 times, most recently from cc17b29 to ce11fd6 Compare May 26, 2026 20:55
@sjmiller609 sjmiller609 changed the title from "Instrument guest exec retries" to "Speed up guest-agent exec readiness retries" May 26, 2026
@sjmiller609 sjmiller609 force-pushed the hypeship/instrument-guest-network-init branch from ce11fd6 to 9fb95cc Compare May 26, 2026 21:04
@sjmiller609 sjmiller609 changed the title from "Speed up guest-agent exec readiness retries" to "Speed up restore network exec readiness retries" May 26, 2026
@sjmiller609 sjmiller609 force-pushed the hypeship/instrument-guest-network-init branch from 9fb95cc to 8e1bc54 Compare May 26, 2026 21:21
@sjmiller609 sjmiller609 changed the title from "Speed up restore network exec readiness retries" to "Speed up guest-agent exec readiness retries" May 26, 2026
@sjmiller609 sjmiller609 force-pushed the hypeship/instrument-guest-network-init branch 2 times, most recently from 2fa30ed to fa97c17 Compare May 26, 2026 21:37
Comment thread lib/guest/client.go
@sjmiller609 sjmiller609 force-pushed the hypeship/instrument-guest-network-init branch from fa97c17 to 1d696bc Compare May 26, 2026 22:04
@sjmiller609 sjmiller609 marked this pull request as ready for review May 26, 2026 22:14
@sjmiller609 sjmiller609 requested a review from hiroTamada May 26, 2026 22:14
@firetiger-agent

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: PR modifies guest-agent execution logic and tracing in packages/api/lib/guest and packages/api/lib/hypervisor, not the kernel API endpoints or Temporal workflows specified in the filter.

To monitor this PR anyway, reply with @firetiger monitor this.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 8b5f21d.

Comment thread lib/guest/client.go
Comment thread lib/guest/client.go
@sjmiller609 sjmiller609 merged commit aa65a64 into main May 27, 2026
11 checks passed
@sjmiller609 sjmiller609 deleted the hypeship/instrument-guest-network-init branch May 27, 2026 15:20

2 participants