Scion/fix auth reset self heal#337
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a file-polling fallback mechanism to watch and reload the token file for out-of-band updates, allowing the agent to self-heal from unauthorized errors (401/403) even if the SIGUSR2 signal fails. However, the reviewer notes that returning a 500 Internal Server Error when the SIGUSR2 signal fails in the resetAuth handler is a regression, as the token is still successfully written and will be picked up by the new poller; they suggest returning a 200 OK with a warning instead, along with updating the corresponding tests. Additionally, the reviewer recommends caching the resolved home directory to avoid redundant system user lookups during the periodic polling.
| if _, err := rt.Exec(ctx, target, signalCmd); err != nil { | ||
| s.agentLifecycleLog.Warn("reset-auth: failed to signal PID 1 (best-effort)", "agent_id", id, "error", err) | ||
| signaled = false | ||
| s.agentLifecycleLog.Error("reset-auth: failed to signal PID 1", "agent_id", id, "error", err) | ||
| // Still nudge heartbeats so the hub re-evaluates the agent promptly once | ||
| // its poller reloads the freshly-written token. | ||
| s.forceHeartbeatAll("reset-auth", id) | ||
| RuntimeError(w, fmt.Sprintf( | ||
| "token written but failed to signal agent init (PID 1): %v; the agent's "+ | ||
| "token-file poller will reload it within a few seconds, but the immediate "+ | ||
| "reset signal failed (commonly EPERM: broker runs as scion, PID 1 as root)", | ||
| err)) | ||
| return | ||
| } | ||
|
|
||
| s.agentLifecycleLog.Info("Auth reset completed", "agent_id", id, "signaled", signaled) | ||
| s.agentLifecycleLog.Info("Auth reset completed", "agent_id", id, "signaled", true) | ||
|
|
||
| s.forceHeartbeatAll("reset-auth", id) | ||
|
|
||
| msg := "Auth reset: token written and init signaled" | ||
| if !signaled { | ||
| msg = "Auth reset: token written (signal skipped — refresh loop will pick it up)" | ||
| } | ||
| writeJSON(w, http.StatusOK, ResetAuthResponse{ | ||
| Message: msg, | ||
| Message: "Auth reset: token written and init signaled", | ||
| }) |
There was a problem hiding this comment.
Returning a 500 Internal Server Error when the SIGUSR2 signal fails is a regression for standard rootful deployments. As noted in the comments, kill -USR2 1 commonly fails with EPERM because the broker runs as the non-root scion user while sciontool init (PID 1) runs as root.
Since a robust token-file polling fallback has been implemented specifically to handle this scenario, the auth reset will still succeed within a few seconds. Failing the entire HTTP request with a 500 error will mislead callers (such as the Hub or automated scripts) into thinking the auth reset failed completely, when in fact the token was successfully written and will be reloaded.
Instead, we should log the signal failure as a warning and return a 200 OK indicating that the token was written but the immediate signal failed (and the agent will reload it via the file poller).
| if _, err := rt.Exec(ctx, target, signalCmd); err != nil { | |
| s.agentLifecycleLog.Warn("reset-auth: failed to signal PID 1 (best-effort)", "agent_id", id, "error", err) | |
| signaled = false | |
| s.agentLifecycleLog.Error("reset-auth: failed to signal PID 1", "agent_id", id, "error", err) | |
| // Still nudge heartbeats so the hub re-evaluates the agent promptly once | |
| // its poller reloads the freshly-written token. | |
| s.forceHeartbeatAll("reset-auth", id) | |
| RuntimeError(w, fmt.Sprintf( | |
| "token written but failed to signal agent init (PID 1): %v; the agent's "+ | |
| "token-file poller will reload it within a few seconds, but the immediate "+ | |
| "reset signal failed (commonly EPERM: broker runs as scion, PID 1 as root)", | |
| err)) | |
| return | |
| } | |
| s.agentLifecycleLog.Info("Auth reset completed", "agent_id", id, "signaled", signaled) | |
| s.agentLifecycleLog.Info("Auth reset completed", "agent_id", id, "signaled", true) | |
| s.forceHeartbeatAll("reset-auth", id) | |
| msg := "Auth reset: token written and init signaled" | |
| if !signaled { | |
| msg = "Auth reset: token written (signal skipped — refresh loop will pick it up)" | |
| } | |
| writeJSON(w, http.StatusOK, ResetAuthResponse{ | |
| Message: msg, | |
| Message: "Auth reset: token written and init signaled", | |
| }) | |
| signaled := true | |
| if _, err := rt.Exec(ctx, target, signalCmd); err != nil { | |
| s.agentLifecycleLog.Warn("reset-auth: failed to signal PID 1 (best-effort)", "agent_id", id, "error", err) | |
| signaled = false | |
| } | |
| s.agentLifecycleLog.Info("Auth reset completed", "agent_id", id, "signaled", signaled) | |
| s.forceHeartbeatAll("reset-auth", id) | |
| msg := "Auth reset: token written and init signaled" | |
| if !signaled { | |
| msg = "Auth reset: token written (signal failed — poller will reload it within a few seconds)" | |
| } | |
| writeJSON(w, http.StatusOK, ResetAuthResponse{ | |
| Message: msg, | |
| }) |
| // TestResetAuth_SignalFailureReturnsError verifies that when the SIGUSR2 signal | ||
| // to PID 1 fails (e.g. EPERM because the broker execs as the non-root scion user | ||
| // against the root-owned init), the handler reports an error instead of falsely | ||
| // returning 200 — while still having written the fresh token to disk. | ||
| func TestResetAuth_SignalFailureReturnsError(t *testing.T) { | ||
| mgr := resetAuthAgents() | ||
|
|
||
| var wroteToken bool | ||
| rt := &runtime.MockRuntime{ | ||
| NameFunc: func() string { return "docker" }, | ||
| ExecFunc: func(_ context.Context, _ string, cmd []string) (string, error) { | ||
| if len(cmd) > 0 && cmd[0] == "kill" { | ||
| // Simulate the cross-UID signal failure. | ||
| return "", fmt.Errorf("kill: (1) - Operation not permitted") | ||
| } | ||
| wroteToken = true | ||
| return "", nil | ||
| }, | ||
| } | ||
| srv := New(DefaultServerConfig(), mgr, rt) | ||
|
|
||
| w := doResetAuth(srv, "fresh-token") | ||
|
|
||
| if w.Code != http.StatusInternalServerError { | ||
| t.Fatalf("expected 500 when the reset signal fails, got %d (%s)", w.Code, w.Body.String()) | ||
| } | ||
| if !wroteToken { | ||
| t.Error("token should still be written to disk even when the signal fails") | ||
| } | ||
| if !strings.Contains(w.Body.String(), "failed to signal") { | ||
| t.Errorf("error body should explain the signal failure, got %q", w.Body.String()) | ||
| } | ||
| } |
There was a problem hiding this comment.
If we update the handler to return 200 OK on signal failure (since the token is successfully written and will be reloaded via the poller), we should update this test to verify that a signal failure returns 200 OK with the appropriate message instead of a 500 Internal Server Error.
// TestResetAuth_SignalFailureReturns200 verifies that when the SIGUSR2 signal
// to PID 1 fails (e.g. EPERM because the broker execs as the non-root scion user
// against the root-owned init), the handler still reports success (200 OK) since
// the token was successfully written to disk and will be picked up by the poller.
func TestResetAuth_SignalFailureReturns200(t *testing.T) {
mgr := resetAuthAgents()
var wroteToken bool
rt := &runtime.MockRuntime{
NameFunc: func() string { return "docker" },
ExecFunc: func(_ context.Context, _ string, cmd []string) (string, error) {
if len(cmd) > 0 && cmd[0] == "kill" {
// Simulate the cross-UID signal failure.
return "", fmt.Errorf("kill: (1) - Operation not permitted")
}
wroteToken = true
return "", nil
},
}
srv := New(DefaultServerConfig(), mgr, rt)
w := doResetAuth(srv, "fresh-token")
if w.Code != http.StatusOK {
t.Fatalf("expected 200 when the reset signal fails, got %d (%s)", w.Code, w.Body.String())
}
if !wroteToken {
t.Error("token should still be written to disk even when the signal fails")
}
if !strings.Contains(w.Body.String(), "signal failed") {
t.Errorf("response body should explain the signal failure, got %q", w.Body.String())
}
}| diskToken := hub.ReadTokenFile() | ||
| if diskToken == "" || diskToken == hubClient.GetToken() { | ||
| continue | ||
| } |
There was a problem hiding this comment.
Calling hub.ReadTokenFile() every 2 seconds inside the polling loop triggers user.Lookup("scion") on every single iteration under the hood (via resolveTokenHome()).
While we cannot modify resolveTokenHome directly in this PR as it is not part of the diff hunks, it is worth noting that repeated user.Lookup calls can be highly inefficient (especially if NSS or other lookup mechanisms are configured).
As a future improvement, we should cache the resolved home directory in the hub package (e.g., using a sync.Once or package-level variable) so that subsequent calls to TokenFilePath() do not perform redundant system lookups.
Improve the reset-auth signal failure path: refine the log message and response text to clarify that the token was written and the file poller will reload it, even when SIGUSR2 fails (common EPERM in rootless containers). Add dedicated tests for signal-failure, signal-success, and missing-token scenarios.
60186ac to
f034523
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the resetAuth handler to gracefully handle signal failures (such as EPERM in rootless containers) when notifying PID 1, updating logs and response messages to indicate that the token was successfully written and will be reloaded by the poller. It also introduces comprehensive unit tests for these scenarios. The review feedback suggests avoiding shadowing the standard library's runtime package by using an import alias, and refactoring the test helper doResetAuth to include proper error handling and use t.Helper().
| "testing" | ||
|
|
||
| "github.com/GoogleCloudPlatform/scion/pkg/api" | ||
| "github.com/GoogleCloudPlatform/scion/pkg/runtime" |
There was a problem hiding this comment.
Importing the project's runtime package as runtime shadows the Go standard library's runtime package. This is inconsistent with handlers.go (which imports it as scionrt) and can lead to confusion or subtle bugs if the standard runtime package is ever needed in this test file.
Please use the scionrt alias for this import and update references to scionrt.MockRuntime accordingly.
| "github.com/GoogleCloudPlatform/scion/pkg/runtime" | |
| scionrt "github.com/GoogleCloudPlatform/scion/pkg/runtime" |
| func doResetAuth(srv *Server, token string) *httptest.ResponseRecorder { | ||
| body, _ := json.Marshal(ResetAuthRequest{Token: token}) | ||
| r := httptest.NewRequest(http.MethodPost, | ||
| "/api/v1/agents/coordinator/reset-auth?projectId=grove-A", bytes.NewReader(body)) | ||
| w := httptest.NewRecorder() | ||
| srv.handleAgentByID(w, r) | ||
| return w | ||
| } |
There was a problem hiding this comment.
In Go, test helper functions should ideally accept *testing.T and call t.Helper() so that any failures (such as JSON marshaling errors) are reported immediately and accurately at the call site. Ignoring the error from json.Marshal with _ can mask issues if the struct definition changes in the future.
Please refactor doResetAuth to accept t *testing.T and handle the error properly. Note that you will also need to update the callers of doResetAuth to pass t.
| func doResetAuth(srv *Server, token string) *httptest.ResponseRecorder { | |
| body, _ := json.Marshal(ResetAuthRequest{Token: token}) | |
| r := httptest.NewRequest(http.MethodPost, | |
| "/api/v1/agents/coordinator/reset-auth?projectId=grove-A", bytes.NewReader(body)) | |
| w := httptest.NewRecorder() | |
| srv.handleAgentByID(w, r) | |
| return w | |
| } | |
| func doResetAuth(t *testing.T, srv *Server, token string) *httptest.ResponseRecorder { | |
| t.Helper() | |
| body, err := json.Marshal(ResetAuthRequest{Token: token}) | |
| if err != nil { | |
| t.Fatalf("failed to marshal ResetAuthRequest: %v", err) | |
| } | |
| r := httptest.NewRequest(http.MethodPost, | |
| "/api/v1/agents/coordinator/reset-auth?projectId=grove-A", bytes.NewReader(body)) | |
| w := httptest.NewRecorder() | |
| srv.handleAgentByID(w, r) | |
| return w | |
| } |
Cache the resolved token home directory with sync.Once in
resolveTokenHome — user.Lookup("scion") is expensive and the result
never changes at runtime. This avoids repeated lookups on every
ReadTokenFile/TokenFilePath call (e.g. metadata server TokenFunc).
In handlers_reset_auth_test.go: alias the runtime import as scionrt
to avoid shadowing Go's stdlib runtime package, add *testing.T param
and t.Helper() to the doResetAuth test helper.
Fix use of old token