fix(auth): self-heal agent tokens after reset-auth (signing-key rotation recovery)#164
Closed
scion-gteam[bot] wants to merge 2 commits into
Closed
fix(auth): self-heal agent tokens after reset-auth (signing-key rotation recovery)#164scion-gteam[bot] wants to merge 2 commits into
scion-gteam[bot] wants to merge 2 commits into
Conversation
Improve the reset-auth signal failure path: refine the log message and response text to clarify that the token was written and the file poller will reload it, even when SIGUSR2 fails (common EPERM in rootless containers). Add dedicated tests for signal-failure, signal-success, and missing-token scenarios.
60186ac to
f034523
Compare
Cache the resolved token home directory with sync.Once in
resolveTokenHome — user.Lookup("scion") is expensive and the result
never changes at runtime. This avoids repeated lookups on every
ReadTokenFile/TokenFilePath call (e.g. metadata server TokenFunc).
In handlers_reset_auth_test.go: alias the runtime import as scionrt
to avoid shadowing Go's stdlib runtime package, add *testing.T param
and t.Helper() to the doResetAuth test helper.
Owner
|
Merged upstream — PR GoogleCloudPlatform#337 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the auth-token self-heal bug where, after a hub signing-key rotation invalidates an agent's JWT,
reset-authwrites a fresh valid token to~/.scion/scion-tokenbut the agent never recovers — the refresh loop keeps 401-ing withgo-jose/go-jose: error in cryptographic primitiveand repeatedreset-authruns make no difference.Root cause
Recovery depended entirely on a single
kill -USR2 1that the broker fires as the non-rootscionuser against PID 1 (sciontool init), which runs as root. A non-root process can't signal a root-owned one → thekillfails with EPERM, the SIGUSR2 handler never runs, and the in-memory token is never reloaded. The refresh loop only ever sends its in-memory token (RefreshToken), nothing else re-reads the token file, and there is no file watcher — so the valid on-disk token just sits unused. The broker compounded this by returning HTTP 200 even when the signal failed, with a comment falsely claiming the loop would pick up the token without the signal.This is the same cross-UID signal limitation the codebase already worked around for the limits subsystem (
watchLimitsTriggerFile), which the auth-reset path never got.Full investigation:
/scion-volumes/scratchpad/auth-reset-bug-investigation.md.Changes (three layers of defense)
pkg/sciontool/hub/client.go) — inStartTokenRefresh, onErrTokenRefreshUnauthorized, re-readReadTokenFile(); if it differs from the in-memory token, adopt it and retry immediately. Recovers regardless of how the disk token got there.cmd/sciontool/commands/init.go) —watchTokenFilepolls the token file and reloads (via the existinghandleAuthResetpath) when the on-disk token diverges from the in-memory token. Compares content, not mtime, so the refresh loop's own writes don't re-trigger it. Mirrors the existingwatchLimitsTriggerFilepattern.pkg/runtimebroker/handlers.go) —resetAuthnow returns a 500 (instead of a misleading 200) whenkill -USR2 1fails, with an explanatory message; removed the false comment.Tests
pkg/sciontool/hub:StartTokenRefreshself-heals from an on-disk token after a 401.cmd/sciontool/commands:watchTokenFilesignals on divergence and stays quiet when disk matches memory.pkg/runtimebroker: reset-auth returns an error on signal failure (token still written), 200 on success, validation error on missing token.All touched packages:
go vetclean,go testgreen.Notes
🤖 Generated with Claude Code