Skip to content

security(github-copilot): per-pod KDF for stored Copilot token#300

Open
pjdoland wants to merge 1 commit into
plmbr:mainfrom
pjdoland:feat/per-user-token-kdf
Open

security(github-copilot): per-pod KDF for stored Copilot token#300
pjdoland wants to merge 1 commit into
plmbr:mainfrom
pjdoland:feat/per-user-token-kdf

Conversation

@pjdoland
Copy link
Copy Markdown
Collaborator

@pjdoland pjdoland commented May 18, 2026

Summary

The encrypted Copilot token at ~/.jupyter/nbi/user-data.json was keyed via PBKDF2 from NBI_GH_ACCESS_TOKEN_PASSWORD whose default value is the public literal nbi-access-token-password. Every install with the default ships the same key. An attacker who could read another tenant's user-data.json (shared NFS home, accidental backup leak, errant git commit) could decrypt the OAuth access token directly: the ciphertext is decryption-equivalent to plaintext for anyone holding the known password.

Solution

Mix a per-pod + per-user secret into the KDF password input before PBKDF2. The on-disk blob format is unchanged (16 random salt bytes + Fernet ciphertext); the change is purely in what string feeds PBKDF2HMAC.derive().

Resolution order in _machine_user_secret() (first non-empty wins, the rest still contribute to the concatenation):

  1. NBI_POD_IDENTITY env var (highest priority). On a multi-tenant Kubernetes deployment, /etc/machine-id typically resolves to the host node's value (shared across co-tenants) and every container runs as uid 1000. The automatic sources collapse to a value shared across co-tenants on the same node. Admins pin NBI_POD_IDENTITY to a value that stays stable across pod restarts (JupyterHub username, per-user spawn secret) for real cross-tenant isolation. Avoid metadata.uid from the downward API: that value rotates on every restart and would invalidate stored tokens on each restart.
  2. /etc/machine-id or /var/lib/dbus/machine-id when mounted.
  3. socket.gethostname() as a last resort. In Kubernetes the hostname equals the pod name, which differs across pods and gives per-pod isolation even though it is not a confidential value.

POSIX uid is always included. The Windows path is getattr-guarded so a missing os.getuid does not crash the read path.

New encrypt_user_secret / decrypt_user_secret helpers in notebook_intelligence/util.py wrap the existing encrypt_with_password / decrypt_with_password with the derived password. The github_copilot.py read/write call sites switch to the new helpers; tests on the bare helpers continue to cover the underlying crypto.

Legacy migration: blobs encrypted before this change decrypt via a fallback to the bare password and are silently rewritten under the v2 KDF on the next read. The legacy-fallback exception handler catches InvalidToken only, so a corrupted blob or programmer error surfaces loudly instead of being masked. The rewrite goes through the existing _save_user_data helper (introduced in PR #293) so the 0o600 file mode contract is preserved across the upgrade and a crash mid-rewrite (a K8s OOMKill during the silent upgrade) cannot leave a torn user-data.json on disk. If the rewrite itself fails (read-only filesystem, transient I/O), the read still returns the decrypted plaintext so the user's login completes; the next read falls back to the legacy path again until the underlying I/O problem clears.

Testing

  • tests/test_user_secret_kdf.py: 13 cases covering round-trip, cross-pod isolation by machine-id and uid, hostname fallback when /etc/machine-id is absent, legacy-blob fallback (and refuses with allow_legacy=False), NBI_POD_IDENTITY overriding the automatic sources, an end-to-end silent-upgrade test that drives read_stored_github_access_token against a fake user_data_file on tmp_path and verifies the on-disk bytes change between calls, and a rewrite-failure test that asserts the read still returns plaintext when write_github_access_token raises.
  • Full suites: pytest 786 pass, tsc clean, prettier clean, jest clean.

Risks / follow-ups

  • Behavior change: legacy blobs decrypt via fallback and are upgraded transparently; users do not have to re-login. A pod that loses its machine-id source between sessions (e.g., pod respawn without /etc/machine-id mounted, hostname changes, NBI_POD_IDENTITY unset or pointed at a rotating value like metadata.uid) will fail to decrypt v2 blobs and the user is silently logged out. The admin-guide and PRIVACY notes now document this, recommend stable NBI_POD_IDENTITY values, and explicitly warn against metadata.uid from the downward API.
  • Argon2id over PBKDF2: 1.2M PBKDF2-HMAC-SHA256 iterations meet OWASP 2023 guidance. Argon2id would resist GPU/ASIC attack better; tracked as a separate hardening follow-up.
  • File mode 0o600: enforced via _save_user_data (per PR security: enforce 0o600 on the encrypted GitHub token file #293) on every write including the silent legacy-blob upgrade. The merge into this branch adopts that helper at both write sites.
  • Cross-tenant isolation on K8s: this PR is necessary but not sufficient when /etc/machine-id is shared across co-tenants and uid pins to 1000. The NBI_POD_IDENTITY env knob lets admins close that gap with a one-line spawn config; the admin guide section calls this out.

@pjdoland pjdoland added the bug Something isn't working label May 18, 2026
@pjdoland pjdoland changed the title feat(github-copilot): per-pod KDF for stored Copilot token security(github-copilot): per-pod KDF for stored Copilot token May 18, 2026
pjdoland added a commit to pjdoland/notebook-intelligence that referenced this pull request May 18, 2026
Second-pass review surfaced two items:

1. The default-password WARNING mentioned NBI_POD_IDENTITY as a knob
   the operator could set, but that env var ships in PR plmbr#300 (per-pod
   KDF) and does not exist on the branch's merge base. Drop the
   reference until plmbr#300 lands; the remaining knobs
   (NBI_GH_ACCESS_TOKEN_PASSWORD per-user, NBI_ALLOW_DEFAULT_TOKEN_PASSWORD
   opt-out) are sufficient guidance for the deployments this PR
   targets.

2. The read-path warn test only asserted the WARNING fired; the
   contract is "warn AND return the plaintext" because a regression
   that aborts the read on warn would silently log the user out.
   Rewritten to encrypt a real token, run the read, and assert both
   halves.

Also pin that the non-shared branch of the warning omits the
directory path so log analysis can tell the two postures apart by
message content alone.

The silent-fail UX (when NBI_REFUSE_DEFAULT_TOKEN_PASSWORD_ON_SHARED_FS
refuses the write, the user sees no in-app signal) is documented as a
known limitation in the PR description but not addressed here:
plumbing storage-failure status back to the chat sidebar is a
separable change that touches the capability handshake, and the
audit's recommended remediation (operators set the env vars before
rollout) means in-app feedback is a UX enhancement rather than a
regression blocker.
@mbektas
Copy link
Copy Markdown
Collaborator

mbektas commented May 18, 2026

@pjdoland this looks great but can we allow to turn off this new logic and use the existing method. In some environments filesystem is not shared and reboots change the machine id which would require a re-login.

@pjdoland
Copy link
Copy Markdown
Collaborator Author

Definitely. That is almost certainly the correct approach...as it might be for some of these other security PRS.

pjdoland added a commit to pjdoland/notebook-intelligence that referenced this pull request May 19, 2026
Second-pass review surfaced two items:

1. The default-password WARNING mentioned NBI_POD_IDENTITY as a knob
   the operator could set, but that env var ships in PR plmbr#300 (per-pod
   KDF) and does not exist on the branch's merge base. Drop the
   reference until plmbr#300 lands; the remaining knobs
   (NBI_GH_ACCESS_TOKEN_PASSWORD per-user, NBI_ALLOW_DEFAULT_TOKEN_PASSWORD
   opt-out) are sufficient guidance for the deployments this PR
   targets.

2. The read-path warn test only asserted the WARNING fired; the
   contract is "warn AND return the plaintext" because a regression
   that aborts the read on warn would silently log the user out.
   Rewritten to encrypt a real token, run the read, and assert both
   halves.

Also pin that the non-shared branch of the warning omits the
directory path so log analysis can tell the two postures apart by
message content alone.

The silent-fail UX (when NBI_REFUSE_DEFAULT_TOKEN_PASSWORD_ON_SHARED_FS
refuses the write, the user sees no in-app signal) is documented as a
known limitation in the PR description but not addressed here:
plumbing storage-failure status back to the chat sidebar is a
separable change that touches the capability handshake, and the
audit's recommended remediation (operators set the env vars before
rollout) means in-app feedback is a UX enhancement rather than a
regression blocker.
The encrypted token at ~/.jupyter/nbi/user-data.json was keyed via
PBKDF2 from a public default password (NBI_GH_ACCESS_TOKEN_PASSWORD,
default literal "nbi-access-token-password"). Every install with the
default shipped the same key, so an attacker who read user-data.json
off a shared filesystem could decrypt the OAuth access token directly,
no brute force.

Mix a per-pod + per-user secret into the KDF password input before
PBKDF2. Resolution order (first non-empty wins, the rest still
contribute to the concatenation):

  1. NBI_POD_IDENTITY env var (highest priority). Admins on K8s where
     /etc/machine-id and uid are shared across co-tenants pin this to
     a per-pod or per-user value (JupyterHub username, spawn token,
     Pod metadata.uid via the downward API).
  2. /etc/machine-id or /var/lib/dbus/machine-id when mounted.
  3. socket.gethostname() as a last resort (pod name in K8s).

POSIX uid is always included; getattr-guarded for Windows.

The on-disk blob format is unchanged (16 salt bytes + Fernet
ciphertext). New encrypt_user_secret / decrypt_user_secret helpers
wrap the existing crypto with the derived password. Legacy blobs
(those written before this change) decrypt via a fallback to the
bare password and are silently rewritten under the v2 KDF on the
next read, so existing users do not have to re-login.

The rewrite goes through the existing _atomic_write_json helper so a
crash mid-rewrite (K8s OOMKill during the silent upgrade) cannot
leave a torn user-data.json on disk. The legacy-fallback exception
handler catches InvalidToken only, so a corrupted blob or programmer
error surfaces instead of being masked as "needs the legacy path."

Admin-guide and PRIVACY notes updated to call out the K8s
shared-machine-id reality and recommend NBI_POD_IDENTITY for
multi-tenant deployments.
@pjdoland pjdoland force-pushed the feat/per-user-token-kdf branch from 05f10cd to f78be29 Compare May 19, 2026 10:19
@pjdoland pjdoland added this to the 5.1.x milestone May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants