Skip to content

Fix K8s 401 after token expiry (#6918)#7093

Open
pditommaso wants to merge 1 commit intomasterfrom
fix-k8s-token-refresh-6918
Open

Fix K8s 401 after token expiry (#6918)#7093
pditommaso wants to merge 1 commit intomasterfrom
fix-k8s-token-refresh-6918

Conversation

@pditommaso
Copy link
Copy Markdown
Member

Summary

Closes #6918.

Fixes the remaining 401 Unauthorized failure modes on the K8s executor after token expiry, left open after #6742 and #6925.

Three caching layers contributed to the bug. After this PR, all three are self-healing:

Layer Before After
K8sConfig.getClient() Guava 50-min cache (#6742) unchanged
K8sExecutor.getClient() Guava cache wrapping the above (#6925) unchanged
K8sTaskHandler.client captured at construction, never refreshed replaced by getClient() delegating to executor
K8sClient request loop 401 propagated immediately retries with token re-read from disk when tokenPath is set

Why a retry alone wasn't enough

K8s service-account tokens aren't refreshed via an API call — the kubelet rotates them by overwriting the file at the mount path (/var/run/secrets/kubernetes.io/serviceaccount/token). The client just needs to re-read the file from disk on 401. This PR records the token file path on ClientConfig (tokenPath) for the three discovery paths (in-cluster, kubeconfig tokenFile, Nextflow-config tokenFile) and uses it in the retry handler.

For inline token: strings in kubeconfig there's no file to re-read; those still fail fast on 401, unchanged.

Files

  • K8sTaskHandler.groovy — drop cached client field, add getClient() delegating to executor
  • ClientConfig.groovy — add Path tokenPath; populate it from fromNextflowConfig/fromUserAndCluster when a tokenFile is given
  • ConfigDiscovery.groovy — populate tokenPath in fromCluster
  • K8sClient.groovyapply() retries on 401 when tokenPath is set; onRetry listener calls new refreshToken() to re-read the file and update config.token

Test plan

  • New ClientConfigTest cases assert tokenPath is preserved for tokenFile (Nextflow + kubeconfig) and not set for inline tokens
  • ConfigDiscoveryTest.should load from cluster env extended to assert config.tokenPath == TOKEN_FILE
  • New K8sClientTest.should re-read token from disk and retry on 401 when tokenPath is set — verifies file is re-read and second request uses the fresh token
  • New K8sClientTest.should not retry 401 when tokenPath is not set — preserves existing fail-fast behavior for inline tokens
  • All 247 nf-k8s tests pass (./gradlew :plugins:nf-k8s:test)

🤖 Generated with Claude Code

Three caching layers caused the K8s client to hold an expired
service-account token after the kubelet rotated it on disk, leading
to 401 Unauthorized after ~60 minutes on clusters with short-lived
projected tokens (e.g. AKS default 3607s).

This change closes the remaining failure modes left open after the
fixes in #6742 and #6925:

- K8sTaskHandler no longer caches the K8sClient at construction;
  every access now delegates to the executor's Guava-cached client,
  so handlers pick up refreshes automatically.
- ClientConfig retains the token file path (in-cluster discovery
  and kubeconfig/Nextflow-config tokenFile entries) so the token
  can be re-read from disk later.
- K8sClient.apply() retries on HTTP 401 when tokenPath is set:
  the onRetry hook re-reads the token file (kubelet writes the
  rotated token to the same mount path) and updates the in-memory
  config before retrying. 401s with no tokenPath still propagate
  immediately as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@netlify
Copy link
Copy Markdown

netlify Bot commented Apr 30, 2026

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit bb396fd
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/69f39c5e9c0512000894d48c

@pditommaso pditommaso requested a review from adamrtalbot April 30, 2026 18:17
Copy link
Copy Markdown
Collaborator

@adamrtalbot adamrtalbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks good to me:

  • if 401,
  • refresh token
  • to maxAttempts

Will need some real world verification.

@pditommaso
Copy link
Copy Markdown
Member Author

Have you a way to replicate and test a dev build?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

K8s executor caches client token at wrong layer — PR #6742 token refresh never triggered

2 participants