Skip to content

Scion/fix auth reset self heal#337

Merged
ptone merged 2 commits into
GoogleCloudPlatform:mainfrom
ptone:scion/fix-auth-reset-self-heal
Jun 7, 2026
Merged

Scion/fix auth reset self heal#337
ptone merged 2 commits into
GoogleCloudPlatform:mainfrom
ptone:scion/fix-auth-reset-self-heal

Conversation

@ptone
Copy link
Copy Markdown
Member

@ptone ptone commented Jun 7, 2026

Fix use of old token

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a file-polling fallback mechanism to watch and reload the token file for out-of-band updates, allowing the agent to self-heal from unauthorized errors (401/403) even if the SIGUSR2 signal fails. However, the reviewer notes that returning a 500 Internal Server Error when the SIGUSR2 signal fails in the resetAuth handler is a regression, as the token is still successfully written and will be picked up by the new poller; they suggest returning a 200 OK with a warning instead, along with updating the corresponding tests. Additionally, the reviewer recommends caching the resolved home directory to avoid redundant system user lookups during the periodic polling.

Comment on lines 1654 to 1673
if _, err := rt.Exec(ctx, target, signalCmd); err != nil {
s.agentLifecycleLog.Warn("reset-auth: failed to signal PID 1 (best-effort)", "agent_id", id, "error", err)
signaled = false
s.agentLifecycleLog.Error("reset-auth: failed to signal PID 1", "agent_id", id, "error", err)
// Still nudge heartbeats so the hub re-evaluates the agent promptly once
// its poller reloads the freshly-written token.
s.forceHeartbeatAll("reset-auth", id)
RuntimeError(w, fmt.Sprintf(
"token written but failed to signal agent init (PID 1): %v; the agent's "+
"token-file poller will reload it within a few seconds, but the immediate "+
"reset signal failed (commonly EPERM: broker runs as scion, PID 1 as root)",
err))
return
}

s.agentLifecycleLog.Info("Auth reset completed", "agent_id", id, "signaled", signaled)
s.agentLifecycleLog.Info("Auth reset completed", "agent_id", id, "signaled", true)

s.forceHeartbeatAll("reset-auth", id)

msg := "Auth reset: token written and init signaled"
if !signaled {
msg = "Auth reset: token written (signal skipped — refresh loop will pick it up)"
}
writeJSON(w, http.StatusOK, ResetAuthResponse{
Message: msg,
Message: "Auth reset: token written and init signaled",
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Returning a 500 Internal Server Error when the SIGUSR2 signal fails is a regression for standard rootful deployments. As noted in the comments, kill -USR2 1 commonly fails with EPERM because the broker runs as the non-root scion user while sciontool init (PID 1) runs as root.

Since a robust token-file polling fallback has been implemented specifically to handle this scenario, the auth reset will still succeed within a few seconds. Failing the entire HTTP request with a 500 error will mislead callers (such as the Hub or automated scripts) into thinking the auth reset failed completely, when in fact the token was successfully written and will be reloaded.

Instead, we should log the signal failure as a warning and return a 200 OK indicating that the token was written but the immediate signal failed (and the agent will reload it via the file poller).

Suggested change
if _, err := rt.Exec(ctx, target, signalCmd); err != nil {
s.agentLifecycleLog.Warn("reset-auth: failed to signal PID 1 (best-effort)", "agent_id", id, "error", err)
signaled = false
s.agentLifecycleLog.Error("reset-auth: failed to signal PID 1", "agent_id", id, "error", err)
// Still nudge heartbeats so the hub re-evaluates the agent promptly once
// its poller reloads the freshly-written token.
s.forceHeartbeatAll("reset-auth", id)
RuntimeError(w, fmt.Sprintf(
"token written but failed to signal agent init (PID 1): %v; the agent's "+
"token-file poller will reload it within a few seconds, but the immediate "+
"reset signal failed (commonly EPERM: broker runs as scion, PID 1 as root)",
err))
return
}
s.agentLifecycleLog.Info("Auth reset completed", "agent_id", id, "signaled", signaled)
s.agentLifecycleLog.Info("Auth reset completed", "agent_id", id, "signaled", true)
s.forceHeartbeatAll("reset-auth", id)
msg := "Auth reset: token written and init signaled"
if !signaled {
msg = "Auth reset: token written (signal skipped — refresh loop will pick it up)"
}
writeJSON(w, http.StatusOK, ResetAuthResponse{
Message: msg,
Message: "Auth reset: token written and init signaled",
})
signaled := true
if _, err := rt.Exec(ctx, target, signalCmd); err != nil {
s.agentLifecycleLog.Warn("reset-auth: failed to signal PID 1 (best-effort)", "agent_id", id, "error", err)
signaled = false
}
s.agentLifecycleLog.Info("Auth reset completed", "agent_id", id, "signaled", signaled)
s.forceHeartbeatAll("reset-auth", id)
msg := "Auth reset: token written and init signaled"
if !signaled {
msg = "Auth reset: token written (signal failed — poller will reload it within a few seconds)"
}
writeJSON(w, http.StatusOK, ResetAuthResponse{
Message: msg,
})

Comment on lines +54 to +86
// TestResetAuth_SignalFailureReturnsError verifies that when the SIGUSR2 signal
// to PID 1 fails (e.g. EPERM because the broker execs as the non-root scion user
// against the root-owned init), the handler reports an error instead of falsely
// returning 200 — while still having written the fresh token to disk.
func TestResetAuth_SignalFailureReturnsError(t *testing.T) {
mgr := resetAuthAgents()

var wroteToken bool
rt := &runtime.MockRuntime{
NameFunc: func() string { return "docker" },
ExecFunc: func(_ context.Context, _ string, cmd []string) (string, error) {
if len(cmd) > 0 && cmd[0] == "kill" {
// Simulate the cross-UID signal failure.
return "", fmt.Errorf("kill: (1) - Operation not permitted")
}
wroteToken = true
return "", nil
},
}
srv := New(DefaultServerConfig(), mgr, rt)

w := doResetAuth(srv, "fresh-token")

if w.Code != http.StatusInternalServerError {
t.Fatalf("expected 500 when the reset signal fails, got %d (%s)", w.Code, w.Body.String())
}
if !wroteToken {
t.Error("token should still be written to disk even when the signal fails")
}
if !strings.Contains(w.Body.String(), "failed to signal") {
t.Errorf("error body should explain the signal failure, got %q", w.Body.String())
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If we update the handler to return 200 OK on signal failure (since the token is successfully written and will be reloaded via the poller), we should update this test to verify that a signal failure returns 200 OK with the appropriate message instead of a 500 Internal Server Error.

// TestResetAuth_SignalFailureReturns200 verifies that when the SIGUSR2 signal
// to PID 1 fails (e.g. EPERM because the broker execs as the non-root scion user
// against the root-owned init), the handler still reports success (200 OK) since
// the token was successfully written to disk and will be picked up by the poller.
func TestResetAuth_SignalFailureReturns200(t *testing.T) {
	mgr := resetAuthAgents()

	var wroteToken bool
	rt := &runtime.MockRuntime{
		NameFunc: func() string { return "docker" },
		ExecFunc: func(_ context.Context, _ string, cmd []string) (string, error) {
			if len(cmd) > 0 && cmd[0] == "kill" {
				// Simulate the cross-UID signal failure.
				return "", fmt.Errorf("kill: (1) - Operation not permitted")
			}
			wroteToken = true
			return "", nil
		},
	}
	srv := New(DefaultServerConfig(), mgr, rt)

	w := doResetAuth(srv, "fresh-token")

	if w.Code != http.StatusOK {
		t.Fatalf("expected 200 when the reset signal fails, got %d (%s)", w.Code, w.Body.String())
	}
	if !wroteToken {
		t.Error("token should still be written to disk even when the signal fails")
	}
	if !strings.Contains(w.Body.String(), "signal failed") {
		t.Errorf("response body should explain the signal failure, got %q", w.Body.String())
	}
}

Comment thread cmd/sciontool/commands/init.go Outdated
Comment on lines +1073 to +1076
diskToken := hub.ReadTokenFile()
if diskToken == "" || diskToken == hubClient.GetToken() {
continue
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calling hub.ReadTokenFile() every 2 seconds inside the polling loop triggers user.Lookup("scion") on every single iteration under the hood (via resolveTokenHome()).

While we cannot modify resolveTokenHome directly in this PR as it is not part of the diff hunks, it is worth noting that repeated user.Lookup calls can be highly inefficient (especially if NSS or other lookup mechanisms are configured).

As a future improvement, we should cache the resolved home directory in the hub package (e.g., using a sync.Once or package-level variable) so that subsequent calls to TokenFilePath() do not perform redundant system lookups.

Improve the reset-auth signal failure path: refine the log message and
response text to clarify that the token was written and the file poller
will reload it, even when SIGUSR2 fails (common EPERM in rootless
containers). Add dedicated tests for signal-failure, signal-success,
and missing-token scenarios.
@scion-gteam scion-gteam Bot force-pushed the scion/fix-auth-reset-self-heal branch from 60186ac to f034523 Compare June 7, 2026 03:20
@ptone
Copy link
Copy Markdown
Member Author

ptone commented Jun 7, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the resetAuth handler to gracefully handle signal failures (such as EPERM in rootless containers) when notifying PID 1, updating logs and response messages to indicate that the token was successfully written and will be reloaded by the poller. It also introduces comprehensive unit tests for these scenarios. The review feedback suggests avoiding shadowing the standard library's runtime package by using an import alias, and refactoring the test helper doResetAuth to include proper error handling and use t.Helper().

"testing"

"github.com/GoogleCloudPlatform/scion/pkg/api"
"github.com/GoogleCloudPlatform/scion/pkg/runtime"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Importing the project's runtime package as runtime shadows the Go standard library's runtime package. This is inconsistent with handlers.go (which imports it as scionrt) and can lead to confusion or subtle bugs if the standard runtime package is ever needed in this test file.

Please use the scionrt alias for this import and update references to scionrt.MockRuntime accordingly.

Suggested change
"github.com/GoogleCloudPlatform/scion/pkg/runtime"
scionrt "github.com/GoogleCloudPlatform/scion/pkg/runtime"

Comment on lines +45 to +52
func doResetAuth(srv *Server, token string) *httptest.ResponseRecorder {
body, _ := json.Marshal(ResetAuthRequest{Token: token})
r := httptest.NewRequest(http.MethodPost,
"/api/v1/agents/coordinator/reset-auth?projectId=grove-A", bytes.NewReader(body))
w := httptest.NewRecorder()
srv.handleAgentByID(w, r)
return w
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In Go, test helper functions should ideally accept *testing.T and call t.Helper() so that any failures (such as JSON marshaling errors) are reported immediately and accurately at the call site. Ignoring the error from json.Marshal with _ can mask issues if the struct definition changes in the future.

Please refactor doResetAuth to accept t *testing.T and handle the error properly. Note that you will also need to update the callers of doResetAuth to pass t.

Suggested change
func doResetAuth(srv *Server, token string) *httptest.ResponseRecorder {
body, _ := json.Marshal(ResetAuthRequest{Token: token})
r := httptest.NewRequest(http.MethodPost,
"/api/v1/agents/coordinator/reset-auth?projectId=grove-A", bytes.NewReader(body))
w := httptest.NewRecorder()
srv.handleAgentByID(w, r)
return w
}
func doResetAuth(t *testing.T, srv *Server, token string) *httptest.ResponseRecorder {
t.Helper()
body, err := json.Marshal(ResetAuthRequest{Token: token})
if err != nil {
t.Fatalf("failed to marshal ResetAuthRequest: %v", err)
}
r := httptest.NewRequest(http.MethodPost,
"/api/v1/agents/coordinator/reset-auth?projectId=grove-A", bytes.NewReader(body))
w := httptest.NewRecorder()
srv.handleAgentByID(w, r)
return w
}

Cache the resolved token home directory with sync.Once in
resolveTokenHome — user.Lookup("scion") is expensive and the result
never changes at runtime. This avoids repeated lookups on every
ReadTokenFile/TokenFilePath call (e.g. metadata server TokenFunc).

In handlers_reset_auth_test.go: alias the runtime import as scionrt
to avoid shadowing Go's stdlib runtime package, add *testing.T param
and t.Helper() to the doResetAuth test helper.
@ptone ptone merged commit ce31734 into GoogleCloudPlatform:main Jun 7, 2026
5 checks passed
@scion-gteam scion-gteam Bot deleted the scion/fix-auth-reset-self-heal branch June 7, 2026 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant