fix(auth): self-heal agent tokens after reset-auth (signing-key rotation recovery)#164

Closed
scion-gteam[bot] wants to merge 2 commits into main from scion/fix-auth-reset-self-heal

scion-gteam (bot) commented Jun 7, 2026

Summary

Fixes the auth-token self-heal bug: after a hub signing-key rotation invalidates an agent's JWT, reset-auth writes a fresh, valid token to ~/.scion/scion-token, but the agent never recovers. The refresh loop keeps getting 401s with "go-jose/go-jose: error in cryptographic primitive", and repeated reset-auth runs make no difference.

Root cause

Recovery depended entirely on a single kill -USR2 1 that the broker fires as the non-root scion user against PID 1 (sciontool init), which runs as root. A non-root process (absent CAP_KILL) cannot signal a root-owned one, so the kill fails with EPERM, the SIGUSR2 handler never runs, and the in-memory token is never reloaded. The refresh loop only ever sends its in-memory token (RefreshToken), nothing else re-reads the token file, and there is no file watcher, so the valid on-disk token sits unused. The broker compounded this by returning HTTP 200 even when the signal failed, alongside a code comment that wrongly claimed the loop would pick up the token without the signal.
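
For reviewers who want to see the failure mode in isolation, here is a minimal sketch of how a kill(2) error can be classified so that EPERM is not mistaken for success. It assumes a Linux container; signalOutcome and examples are hypothetical names for illustration and do not appear in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// signalOutcome classifies the error returned by a kill(2) attempt.
// Hypothetical helper, not taken from the PR.
func signalOutcome(err error) string {
	switch {
	case err == nil:
		return "delivered"
	case errors.Is(err, syscall.EPERM):
		return "denied (EPERM)"
	case errors.Is(err, syscall.ESRCH):
		return "no such process (ESRCH)"
	default:
		return "failed: " + err.Error()
	}
}

// examples runs signalOutcome over the three interesting cases,
// including an EPERM that has been wrapped by a caller.
func examples() []string {
	return []string{
		signalOutcome(nil),
		signalOutcome(fmt.Errorf("kill -USR2 1: %w", syscall.EPERM)),
		signalOutcome(syscall.ESRCH),
	}
}

func main() {
	for _, s := range examples() {
		fmt.Println(s)
	}
	// Signal 0 performs the existence and permission checks without
	// delivering a signal. The result depends on which UID runs this.
	fmt.Println("probe PID 1:", signalOutcome(syscall.Kill(1, 0)))
}
```

Run as the non-root user inside the container, the final probe should show whether PID 1 is reachable at all before any SIGUSR2 is attempted.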

This is the same cross-UID signal limitation the codebase already worked around for the limits subsystem (watchLimitsTriggerFile), which the auth-reset path never got.

Full investigation: /scion-volumes/scratchpad/auth-reset-bug-investigation.md.

Changes (three layers of defense)

  1. Signal-independent self-heal (pkg/sciontool/hub/client.go) — in StartTokenRefresh, on ErrTokenRefreshUnauthorized, re-read ReadTokenFile(); if it differs from the in-memory token, adopt it and retry immediately. Recovers regardless of how the disk token got there.
  2. UID-safe token-file poller (cmd/sciontool/commands/init.go) — watchTokenFile polls the token file and reloads (via the existing handleAuthReset path) when the on-disk token diverges from the in-memory token. Compares content, not mtime, so the refresh loop's own writes don't re-trigger it. Mirrors the existing watchLimitsTriggerFile pattern.
  3. Honest broker contract (pkg/runtimebroker/handlers.go) — resetAuth now returns a 500 (instead of a misleading 200) when kill -USR2 1 fails, with an explanatory message; removed the false comment.
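
The first layer can be sketched roughly as follows. This is an illustration under assumptions, not the code in pkg/sciontool/hub/client.go: the refresher type, its injected refresh and readFile functions, errUnauthorized (standing in for ErrTokenRefreshUnauthorized), and demo are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnauthorized stands in for the PR's ErrTokenRefreshUnauthorized.
var errUnauthorized = errors.New("token refresh unauthorized")

// refresher holds the in-memory token plus two injected dependencies
// (hypothetical shape, chosen to keep the sketch self-contained).
type refresher struct {
	token    string
	refresh  func(token string) (string, error) // asks the hub for a new token
	readFile func() (string, error)             // reads the on-disk token
}

// refreshOnce attempts one refresh. On an unauthorized error it re-reads
// the token file and, if the on-disk token differs from the in-memory one,
// adopts it and retries once.
func (r *refresher) refreshOnce() error {
	next, err := r.refresh(r.token)
	if err == nil {
		r.token = next
		return nil
	}
	if !errors.Is(err, errUnauthorized) {
		return err
	}
	disk, readErr := r.readFile()
	if readErr != nil || disk == "" || disk == r.token {
		return err // nothing new on disk; surface the original failure
	}
	r.token = disk
	next, err = r.refresh(r.token)
	if err != nil {
		return err
	}
	r.token = next
	return nil
}

// demo simulates a hub that accepts only the token "fresh" while the
// agent starts with "stale" in memory and diskToken on disk.
func demo(diskToken string) (string, error) {
	r := &refresher{
		token: "stale",
		refresh: func(tok string) (string, error) {
			if tok != "fresh" {
				return "", errUnauthorized
			}
			return "renewed", nil
		},
		readFile: func() (string, error) { return diskToken, nil },
	}
	err := r.refreshOnce()
	return r.token, err
}

func main() {
	fmt.Println(demo("fresh")) // disk holds a valid token
	fmt.Println(demo("stale")) // disk matches memory, nothing to adopt
}
```

The retry is bounded to a single attempt so that a bad on-disk token cannot turn one 401 into a tight loop.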

Tests

  • pkg/sciontool/hub: StartTokenRefresh self-heals from an on-disk token after a 401.
  • cmd/sciontool/commands: watchTokenFile signals on divergence and stays quiet when disk matches memory.
  • pkg/runtimebroker: reset-auth returns an error on signal failure (token still written), 200 on success, validation error on missing token.
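
The poller behaviour exercised by the second test (signal on divergence, stay quiet when disk matches memory) could look something like the sketch below. pollTokenFile is a stand-in for the PR's watchTokenFile; its signature, the diverged helper, the polling interval, and demo are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// diverged returns the on-disk token when it is non-empty and differs
// from inMemory. It compares content, not mtime.
func diverged(path, inMemory string) (string, bool) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", false
	}
	tok := strings.TrimSpace(string(data))
	if tok == "" || tok == inMemory {
		return "", false
	}
	return tok, true
}

// pollTokenFile checks path every interval and calls onDiverge when the
// file content differs from current(). It returns when ctx is done.
func pollTokenFile(ctx context.Context, path string, interval time.Duration,
	current func() string, onDiverge func(string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if tok, ok := diverged(path, current()); ok {
				onDiverge(tok)
			}
		}
	}
}

// demo writes a token file, checks the quiet case, then lets the poller
// reload a divergent in-memory token.
func demo() (bool, string, error) {
	dir, err := os.MkdirTemp("", "token-poll")
	if err != nil {
		return false, "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "scion-token")
	if err := os.WriteFile(path, []byte("tok-b\n"), 0o600); err != nil {
		return false, "", err
	}

	_, changed := diverged(path, "tok-b")
	quiet := !changed

	mem := "tok-a"
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	pollTokenFile(ctx, path, 10*time.Millisecond,
		func() string { return mem },
		func(tok string) { mem = tok; cancel() })
	return quiet, mem, nil
}

func main() {
	fmt.Println(demo())
}
```

Because the comparison is on content, a write from the refresh loop that leaves disk and memory equal does not trigger a reload.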

All touched packages: go vet clean, go test green.

Notes

  • Do not merge — opened for review per coordinator request.

🤖 Generated with Claude Code

Improve the reset-auth signal failure path: refine the log message and
response text to clarify that the token was written and the file poller
will reload it, even when SIGUSR2 fails (common EPERM in rootless
containers). Add dedicated tests for signal-failure, signal-success,
and missing-token scenarios.
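
As a rough illustration of the three scenarios named above, the sketch below models a reset-auth handler with injected write and signal steps. The form-encoded request, the 400 status for a missing token, the response wording, and the names resetAuthHandler and status are assumptions; the PR text does not specify them.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// resetAuthHandler writes the token, then signals init. A signal failure
// is reported as 500 instead of being hidden behind a 200.
func resetAuthHandler(writeToken func(string) error, signalInit func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		token := strings.TrimSpace(r.FormValue("token"))
		if token == "" {
			http.Error(w, "token is required", http.StatusBadRequest)
			return
		}
		if err := writeToken(token); err != nil {
			http.Error(w, "writing token: "+err.Error(), http.StatusInternalServerError)
			return
		}
		if err := signalInit(); err != nil {
			msg := "token written; signal failed, file poller will reload it: " + err.Error()
			http.Error(w, msg, http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// status sends one request through the handler and reports the HTTP
// status code and whether the token was written.
func status(token string, signalFails bool) (code int, written bool) {
	h := resetAuthHandler(
		func(string) error { written = true; return nil },
		func() error {
			if signalFails {
				return errors.New("operation not permitted")
			}
			return nil
		},
	)
	body := strings.NewReader("token=" + token)
	req := httptest.NewRequest(http.MethodPost, "/reset-auth", body)
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, written
}

func main() {
	fmt.Println(status("abc", true))  // signal failure
	fmt.Println(status("abc", false)) // signal success
	fmt.Println(status("", false))    // missing token
}
```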
scion-gteam (bot) force-pushed the scion/fix-auth-reset-self-heal branch from 60186ac to f034523 on June 7, 2026 03:20
Cache the resolved token home directory with sync.Once in
resolveTokenHome — user.Lookup("scion") is expensive and the result
never changes at runtime. This avoids repeated lookups on every
ReadTokenFile/TokenFilePath call (e.g. metadata server TokenFunc).
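
A minimal sketch of the caching pattern described here, assuming a fallback to the current user's home directory when the scion user does not exist. The fallback, the package-level variable names, and the lookups counter (present only to make the caching observable) are assumptions, not the PR's code.

```go
package main

import (
	"fmt"
	"os"
	"os/user"
	"sync"
)

var (
	tokenHomeOnce sync.Once
	tokenHome     string
	lookups       int // demonstration only: counts how often the lookup runs
)

// resolveTokenHome resolves the token home directory on first use and
// returns the cached value on every later call.
func resolveTokenHome() string {
	tokenHomeOnce.Do(func() {
		lookups++
		if u, err := user.Lookup("scion"); err == nil && u.HomeDir != "" {
			tokenHome = u.HomeDir
			return
		}
		// Assumed fallback: the current user's home directory.
		if home, err := os.UserHomeDir(); err == nil {
			tokenHome = home
		}
	})
	return tokenHome
}

func main() {
	first := resolveTokenHome()
	second := resolveTokenHome()
	fmt.Println(first == second, lookups)
}
```

Note that sync.Once caches whatever the first call produced, including a fallback value, so this only suits results that genuinely cannot change at runtime.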

In handlers_reset_auth_test.go: alias the runtime import as scionrt
to avoid shadowing Go's stdlib runtime package, add *testing.T param
and t.Helper() to the doResetAuth test helper.
ptone closed this Jun 7, 2026

ptone (Owner) commented Jun 7, 2026

Merged upstream — PR GoogleCloudPlatform#337

scion-gteam (bot) deleted the scion/fix-auth-reset-self-heal branch on June 7, 2026 13:23