
v0.2 — Kubernetes management tab + fat-privileged container + CI repair #1

Merged
tm4rtin17 merged 25 commits into main from v0.2 on May 9, 2026

Conversation

tm4rtin17 (Owner) commented May 9, 2026

Summary

Promotes 25 commits from v0.2 to main for the v0.2.0 release. Three big themes:

  • Kubernetes management tab end-to-end (Phases A–D). Read-only inventory + detail drawers + pod log streaming + pod exec + lifecycle actions (restart/scale/delete/cordon) + ConfigMap structured editor + read-only Secret viewer (masked, audited) + Monaco YAML editor with server-side dry-run.
  • Fat-privileged Docker container as a deployment shape that drives every host integration (Services, Updates, Network, Logs/journal, Terminal, K8s) without sidecars. Bare-metal install remains the unprivileged path. Image went from ~25 MB distroless to ~230 MB Debian-slim with the binaries the integrations need; documented as the higher-blast-radius option.
  • CI repaired end-to-end. The pipeline had been failing at workflow startup since v0.1 — even main's last push (m9 polish) failed at 0s. Fixed in five commits: Go version drift, invalid YAML in the awk-format step, missing package-lock.json + outdated golangci-lint, missing v2 config, Dockerfile/vite path mismatch.

Plus: terminal PAM login flow, Tailscale CGNAT awareness in the public-bind banner, capability-gated nav (hides tabs whose backend isn't reachable), two nil-slice JSON fixes that were blanking the Containers and Network tabs, and a docker-compose volume-external fix so admin accounts don't get stranded across docker run ↔ docker compose up switches.

Full per-section detail in CHANGELOG.md.

Verification

  • CI: all three jobs (Go vet+lint+test, Web typecheck+build, Docker image) pass on the v0.2 head (latest run).
  • Sensitive-data audit re-run before this PR — zero secret-shaped strings introduced across the 25 commits; runtime data lives under /var/lib/controlroom/ and the Docker named volume, both outside the repo; .claude/ (local agent definitions) ignored.
  • All four K8s phases verified in-cluster on the test K3s cluster: image loaded onto both nodes, RBAC widening applied, deployment rolled, every endpoint mounts and auth-gates correctly, pods/exec works through SPDY, manifest dry-run + apply work through the Monaco editor.

Migration notes for main

  • Container deployment: run docker volume create controlroom-data once before docker compose up. The compose volume is now external: true to avoid the docker run ↔ compose namespacing trap.
  • In-cluster deployment: re-apply deploy/k8s/rbac.yaml to pick up the widened verbs (pods/exec: create, pods: delete, nodes: patch, apps/*: patch,update, apps/*/scale: update, services/configmaps: update, secrets: get,list).

Test plan

  • Reviewer pulls v0.2, runs make image && docker volume create controlroom-data && docker compose -f deploy/docker-compose.yml up -d, and sees the K8s tab with the full feature set.
  • Reviewer scans CHANGELOG.md § 0.2.0 for accuracy.
  • Reviewer skims docs/SECURITY.md § "Privilege scoping by deployment shape" — make sure the three-shape compromise-impact statements match what we want to publish.
  • Optional: kubectl apply -f deploy/k8s/{namespace,rbac,pvc,deployment,service}.yaml against a test cluster to verify the in-cluster shape still rolls cleanly.

🤖 Generated with Claude Code

tm4rtin17 added 25 commits May 8, 2026 15:28
- PublicBindBanner: treat 100.64.0.0/10 (Tailscale CGNAT) as private so
  the destructive "public-looking address" warning no longer fires when
  reaching ControlRoom over Tailscale.
- internal/logs: add Available() (cached exec.LookPath); /api/logs/journal
  and /ws/logs/journal now return 503 with a clear operator message when
  journalctl is missing instead of bubbling up an exec error as 500.
- main: warn at boot when journalctl is not in PATH, matching the existing
  systemd-unavailable warning.
- web/Logs: source toggle (Journal / Containers). Containers mode reuses
  /api/containers + the existing /ws/containers/:id/logs WebSocket — no
  new backend route — with picker, client-side substring filter, pause,
  and 1k-line cap. Journal-mode errors now surface the real backend
  message instead of "Could not fetch logs."
- deploy/docker-compose: add group_add: ["${DOCKER_GID:-999}"] so the
  nonroot uid 65532 can read /var/run/docker.sock; refresh the header
  comment to match the new logs behavior.
- .gitignore: ignore .claude/ — local Claude Code config and project
  agent definitions (.claude/agents/*.md). These contain
  working-tree-specific tooling and should not be published.
- .gitignore: ignore *.crt — round out the existing *.pem / *.key
  entries so a stray TLS certificate dropped in the tree can't be
  committed.
Adds GET /api/system/capabilities — `{systemd, docker, journal}` booleans
derived from the existing nil-client / logs.Available() patterns — and
threads useCapabilities() into AppLayout to filter the sidebar.

In the container deployment this hides the Services tab (no dbus socket
mounted, no privilege to drive host systemd) instead of letting the
operator click into a dead-end 503. The Containers tab gets the same
treatment so a host without /var/run/docker.sock isn't shown a broken
tab either.

The Logs tab stays visible regardless: it has its own Journal/Containers
source toggle and degrades gracefully on its own.
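
A minimal sketch of the capability probe described above, with the integration clients stubbed out (names here are assumptions, not the project's actual code):

```go
package api

import (
	"encoding/json"
	"net/http"
	"os/exec"
)

// Assumed stand-ins: nil when the integration is unavailable.
var systemdConn, dockerClient interface{}

// handleCapabilities reports which host integrations are reachable so
// the SPA can filter the sidebar.
func handleCapabilities(w http.ResponseWriter, _ *http.Request) {
	// The real code caches this lookup via logs.Available().
	_, journalErr := exec.LookPath("journalctl")
	caps := struct {
		Systemd bool `json:"systemd"`
		Docker  bool `json:"docker"`
		Journal bool `json:"journal"`
	}{
		Systemd: systemdConn != nil,
		Docker:  dockerClient != nil,
		Journal: journalErr == nil,
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(caps)
}
```
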
The frontend types Container.ports and Container.labels as non-null
arrays/objects (web/src/lib/containers.ts) and dereferences them
directly (e.g. ContainerCard.tsx accesses container.ports.length).
Go's encoding/json marshals nil slices and nil maps as null, so a
container with no published ports or no labels would crash the SPA
with a blank-screen runtime error.

Initialize Ports, Labels, Mounts, Networks, Command, and Env to
non-nil empty values in List() and Inspect() before the struct is
returned. The compose label lookups now read from the local
nil-guarded copy instead of the daemon's possibly-nil map.
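
The failure mode and the fix are easy to see in a runnable illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Container struct {
	Ports  []string          `json:"ports"`
	Labels map[string]string `json:"labels"`
}

func main() {
	var c Container // zero value: nil slice, nil map
	b, _ := json.Marshal(c)
	fmt.Println(string(b)) // {"ports":null,"labels":null}; the SPA dereferences .length and crashes

	// The fix applied in List()/Inspect(): normalize to non-nil empties.
	if c.Ports == nil {
		c.Ports = []string{}
	}
	if c.Labels == nil {
		c.Labels = map[string]string{}
	}
	b, _ = json.Marshal(c)
	fmt.Println(string(b)) // {"ports":[],"labels":{}}
}
```
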
Adds a Kubernetes tab to the SPA backed by client-go, plus K8s
manifests to deploy ControlRoom into the cluster it's meant to
manage. Read-only for now: Nodes, Namespaces, Workloads
(Deployment/StatefulSet/DaemonSet), Pods, Services. No write
actions, log streaming, or exec — those land in Phase B/C/D.

Backend:
- internal/k8s/ — client-go wrapper. New(ctx) tries in-cluster first,
  then $KUBECONFIG, ~/.kube/config, /etc/rancher/k3s/k3s.yaml. Returns
  ErrUnavailable on all failures, mirroring the docker pattern; a
  sketch of the lookup order follows this list.
- internal/api/k8s/ — REST handlers under /api/k8s/{nodes,namespaces,
  workloads,pods,services}. Nil-client → 503.
- /api/system/capabilities adds `kubernetes: bool` so the SPA
  hides the tab when the cluster client isn't reachable.
- Best-effort wiring in main.go matching systemd/docker patterns.
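
A minimal sketch of that lookup order, with helper names assumed (the real wrapper lives in internal/k8s/):

```go
package k8s

import (
	"errors"
	"os"
	"path/filepath"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

// ErrUnavailable mirrors the nil docker-client pattern described above.
var ErrUnavailable = errors.New("kubernetes unavailable")

// newClientset tries in-cluster config first, then each kubeconfig candidate.
func newClientset() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		for _, path := range []string{
			os.Getenv("KUBECONFIG"),
			filepath.Join(homedir.HomeDir(), ".kube", "config"),
			"/etc/rancher/k3s/k3s.yaml",
		} {
			if path == "" {
				continue
			}
			if cfg, err = clientcmd.BuildConfigFromFlags("", path); err == nil {
				break
			}
		}
	}
	if err != nil {
		return nil, ErrUnavailable
	}
	return kubernetes.NewForConfig(cfg)
}
```
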

Frontend:
- web/src/lib/k8s.ts — types + TanStack Query hooks (10s refetch).
- web/src/routes/Kubernetes.tsx — namespace picker, sub-tabs, search.
- web/src/components/k8s/ — list components per resource kind.
- formatRelativeTime() helper for the Age columns.

Deploy:
- deploy/k8s/{namespace,rbac,pvc,deployment,service,ingress}.yaml
  — minimal in-cluster install. ClusterRole has get/list/watch only
  on the resources we surface; widens in Phase C when actions land.
- Single-replica deployment by design — SQLite + ROX PVC.
- Pod runs as uid 65532, drop ALL caps, readOnlyRootFilesystem.

Verified:
- make image succeeds.
- Image loaded into both K3s nodes (multi-node cluster).
- Pod reaches Ready, /api/healthz returns 200.
- No "kubernetes unavailable" warning in logs — client-go connects
  via the projected SA token.
Adds click-to-detail across the Kubernetes tab and a live log
viewer for pods. Read-only still — no write actions yet.

Backend (internal/k8s/detail.go, internal/api/k8s/k8s.go):
- GET /api/k8s/nodes/:name → NodeDetail (conditions, allocatable,
  taints, events).
- GET /api/k8s/workloads/:namespace/:kind/:name → WorkloadDetail
  for Deployment / StatefulSet / DaemonSet (selector, strategy,
  conditions, events).
- GET /api/k8s/pods/:namespace/:name → PodDetail (per-container
  status, conditions, QoS, node, events).
- GET /api/k8s/services/:namespace/:name → ServiceDetail
  (endpoint addresses, selector, events).
- WS /ws/k8s/pods/:namespace/:name/logs?container=&tail= — streams
  via CoreV1.Pods().GetLogs().Stream(). Frame shape matches
  /ws/containers/:id/logs (type=line, stream=stdout, line) so the
  SPA reuses the LogStream UX.
- Events listed via field selector on involvedObject — server-side
  filter, no full-list scan. Cluster-scoped Node events scan all
  namespaces.
- DNS-1123 regex with length guard validates :namespace and :name
  before they reach the API server.
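
A label-shaped sketch of that guard (the project's exact regex may differ; 253 is the DNS subdomain length cap):

```go
package k8s

import "regexp"

// dns1123 matches lowercase alphanumerics and hyphens with no
// leading or trailing hyphen.
var dns1123 = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validResourceName rejects anything that couldn't be a legal namespace
// or resource name before it is interpolated into an API-server request.
func validResourceName(s string) bool {
	return s != "" && len(s) <= 253 && dns1123.MatchString(s)
}
```
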

Frontend (web/src/components/k8s/*Detail.tsx + helpers):
- Sheet-based drawer per resource kind, opened by row click in the
  list views. State held in routes/Kubernetes.tsx so opening a pod
  doesn't clobber a node selection.
- Reusable EventsTable and ConditionsTable.
- PodLogStream — WebSocket viewer mirroring ContainerLogStream
  (pause/resume/clear, 1000-line cap, auto-scroll).
- Container picker in PodDetail: clicking a container row syncs
  the active container in the picker chips and the log stream.
- Used the existing Sheet primitive — no new deps.

RBAC widening (deploy/k8s/rbac.yaml):
- pods/log → get (so the SA can stream container output).
- endpoints → get/list/watch (for ServiceDetail.endpoints).
- Still no secrets, no PVs, no write verbs. Phase C territory.

Verified:
- make image, image loaded into both K3s nodes (piserver + piserver2).
- kubectl apply -f deploy/k8s/rbac.yaml succeeded.
- rollout restart deployment/controlroom rolled cleanly within 90s.
- /api/healthz → 200, /api/system/capabilities reports
  kubernetes:true.
- kubectl auth can-i get pods/log as the SA → yes.
- No "kubernetes unavailable" warning in pod logs.
The docker-compose container can't reach K3s' API server at
127.0.0.1:6443 from inside its bridge namespace, and the cert SAN
doesn't include the docker bridge gateway IP. So we rewrite the
kubeconfig server URL to the host's K3s node-ip (which is in the
cert SAN by default) and mount that copy read-only into the
container.

deploy/scripts/setup-host-kubeconfig.sh
  - One-shot helper run on the host (sudo). Reads
    /etc/rancher/k3s/k3s.yaml, picks a host IP that's in the cert
    SAN (override → /etc/rancher/k3s/config.yaml node-ip →
    hostname -I), writes the rewritten config to
    /var/lib/controlroom/kubeconfig with mode 0600 and owner
    65532:65532 (the controlroom container's nonroot uid).
  - Verifies the chosen IP is in the cert's SubjectAltName before
    writing — warns the operator if not.

deploy/docker-compose.yml
  - Mount /var/lib/controlroom/kubeconfig:/etc/k3s/kubeconfig:ro
  - KUBECONFIG=/etc/k3s/kubeconfig — client-go's first lookup path.
  - Comment header reframed: progressive integration model. Docker
    + K8s wired now; journal / systemd / apt left as TODO mounts.

Security note: the K3s admin kubeconfig has cluster-admin. The
container now has cluster-admin. Matches the user's stated intent
of "fat-privileged container" but worth calling out — RBAC the SA
later if you want to scope this down.
…, network

Bundle that lights up every host-integration tab in the docker-compose
deployment by giving the container the host namespaces, capabilities,
and bind mounts it needs. Bare-metal install (deploy/install.sh) is
unchanged and remains the unprivileged path.

Image (deploy/Dockerfile)
  - Drop gcr.io/distroless/static-debian12:nonroot, switch runtime to
    debian:bookworm-slim.
  - Install: bash, ca-certificates, sudo, util-linux (nsenter), procps,
    iproute2, iptables, ufw, systemd (journalctl + systemctl + libsystemd0
    + libdbus-1-3), apt-utils.
  - Run as root; SYS_ADMIN can't survive a uid transition.
  - Image size ~230 MB (up from ~25 MB distroless). Acceptable; documented.

Compose (deploy/docker-compose.yml)
  - network_mode: host, pid: host
  - cap_add: SYS_ADMIN (nsenter) + SYS_PTRACE (host ps/top) + NET_ADMIN
    (ufw/iptables)
  - security_opt: apparmor:unconfined + seccomp:unconfined
  - user: 0:0
  - Mounts: dbus socket (rw), /run/log/journal + /var/log/journal +
    /etc/machine-id (ro) for journalctl, /etc/rancher/k3s/k3s.yaml
    directly (network_mode:host means 127.0.0.1:6443 is reachable so
    the rewritten kubeconfig from setup-host-kubeconfig.sh is no longer
    needed), /var/cache/apt rw + /etc/apt + /var/lib/dpkg ro for apt.
  - env: CR_HOST_SHELL=true so pty wraps shells with nsenter.
  - Header reframed with security model + integration matrix.

pty (internal/pty/pty.go + config.go + api/terminal + api/router)
  - New Options.HostShell. When true, exec.Command becomes
    `/usr/bin/nsenter -t 1 -m -u -i -n -p -- <shell> --login` so the
    user gets a shell in the host's mount/UTS/IPC/network/PID namespaces
    — equivalent to ssh-ing into the host; a sketch follows this list.
  - CR_HOST_SHELL env var (default false). Wired through
    config.Config → api.Deps.Cfg → terminalapi.Deps.HostShell →
    pty.Options.HostShell.
  - Stat-checks /usr/bin/nsenter when HostShell=true; clear error if
    missing instead of an opaque exec failure.
  - Bare-metal behavior is byte-identical when CR_HOST_SHELL=false.
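
A sketch of that branch, with the Options type trimmed to what the example needs:

```go
package pty

import (
	"fmt"
	"os"
	"os/exec"
)

// Options is reduced to the two fields this sketch uses.
type Options struct {
	HostShell bool
	Shell     string // e.g. "/bin/bash"
}

// buildCmd returns the process to attach to the PTY.
func buildCmd(opts Options) (*exec.Cmd, error) {
	if !opts.HostShell {
		return exec.Command(opts.Shell, "--login"), nil
	}
	// Fail early with a clear message instead of an opaque exec error.
	if _, err := os.Stat("/usr/bin/nsenter"); err != nil {
		return nil, fmt.Errorf("CR_HOST_SHELL=true but nsenter is missing: %w", err)
	}
	// Enter PID 1's mount/UTS/IPC/net/PID namespaces, then run the shell there.
	return exec.Command("/usr/bin/nsenter",
		"-t", "1", "-m", "-u", "-i", "-n", "-p", "--",
		opts.Shell, "--login"), nil
}
```
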

setup-host-kubeconfig.sh + deploy/k8s/README.md
  - Header rewritten: script no longer needed for the standard
    docker-compose deployment, retained for non-host-network cases.
  - K8s README adds a paragraph contrasting in-cluster (unprivileged
    nonroot SA, scoped RBAC) vs docker-compose (fat-privileged) shapes.

Verified:
  - make image succeeds at 230MB.
  - Container boots clean — none of the four "unavailable" warnings
    (systemd / journalctl / kubernetes / docker).
  - host_shell=true logged at boot.
  - All required binaries present in image: bash, sudo, nsenter, ip,
    ufw, journalctl, systemctl, apt, apt-get.
  - `nsenter -t 1` from inside the container produces a real host
    shell (returns the host hostname + kernel).
  - /api/healthz → 200.

Security note (re-stating compose header): a compromise of ControlRoom
is now effectively root on the host. Intended tradeoff for a single-user
homelab where the operator already has root. The bare-metal install
remains the unprivileged option.
The fat-privileged container runs as root, and the previous nsenter
default dropped the operator straight into a root shell — a meaningful
escalation over the SSH UX it should mirror. Now it runs login(1)
inside the host namespaces; the user types host credentials in the
xterm, PAM authenticates, and login execs the user's shell at the
user's uid only on success. Same audit trail as SSH (auth.log,
last-login records).

Wiring:
- internal/pty/pty.go: new Options.LoginMode. When HostShell+LoginMode
  the spawned process is `nsenter -t 1 -m -u -i -n -p -- /bin/login`.
  Removed the misleading container-side stat for login (after nsenter
  -m the binary resolves from the host's filesystem; a stat from the
  container's view doesn't tell us anything useful).
- internal/config/config.go: CR_TERMINAL_LOGIN env var, default false.
- internal/api/terminal + router: thread cfg.TerminalLogin through to
  pty.Options.LoginMode.
- cmd/controlroom/main.go: log the new flag at boot.
- deploy/docker-compose.yml: CR_TERMINAL_LOGIN=true alongside
  CR_HOST_SHELL=true. Comment explains the security improvement.

Bare-metal install is unchanged — CR_TERMINAL_LOGIN defaults false,
existing pty.New behavior preserved.

Refused setup: LoginMode without HostShell returns an error. login(1)
is meaningful only when there are namespaces to enter; otherwise we'd
just be running it as the controlroom user with nothing to gain.
…ompt

util-linux login(1) refuses to run interactively when execv'd from a
regular process — it expects to run as a child of getty/sshd, with specific
TTY ioctls, and exits silently otherwise. Confirmed end-to-end: even
with a real PTY allocated by Python and a controlling terminal set,
/bin/login produces no output and exits 1.

su(1) is a regular interactive program that works the way login should.
Replace the LoginMode exec target with a tiny inline bash script that:

  1. Prints a short banner with the hostname.
  2. Prompts for the username.
  3. exec su -l "$username".

su then prompts for the password through PAM (host /etc/pam.d/su rules),
authenticates, and execs the user's shell at the user's uid. The
password never touches Go, audit goes through PAM as before, and the
UX is identical to SSH: type user, type password, you're in.

LoginPath constant removed; replaced with the loginScript string. No
external file added — the script is small enough to live as a string
constant in pty.go.

Verified: a Python PTY harness around the same exec.Command path that
pty.New uses now prints "ControlRoom Terminal — host login at <host>"
followed by "username: " and waits for input, exactly like SSH.
Root invoking su(1) is special-cased to skip PAM authentication —
"root can become anyone" — so when the controlroom container (uid 0)
ran `su -l pi` the operator landed as pi without ever proving the
password. That's not the SSH-equivalent UX we wanted; it's worse than
the old root-shell default because now it's pretending to authenticate.

Drop to nobody:nogroup with setpriv before invoking su. su now sees a
non-root caller, runs full PAM auth via /etc/pam.d/su, the password is
prompted in the xterm, and on success the setuid bit on /bin/su lets
it escalate to root and then drop to the target user. Lockouts and
auth.log entries flow through PAM as expected.

Verified: a Python PTY harness around the new exec path now prints
"Password:" after entering the username, exactly like SSH.
Same bug class as 74dd538 fixed for the Docker client — Network.tsx
crashes the SPA with a blank screen when iface.ips or fw.data.rules
arrives as null instead of an empty array, because the frontend
dereferences .length directly.

Two fields needed nil-guards:

- internal/network/iface.go: Interface.IPs starts as a nil slice when
  an interface has no addresses (e.g. some down/dummy/bridge ifaces).
  Now initialized to []string{}. Flags inherits whatever `ip -j addr
  show` returns (which is non-nil in practice but not guaranteed) so
  we nil-guard that too.

- internal/network/ufw.go: UFWStatus.Rules starts as a nil slice when
  the firewall is disabled or has no rules. Now initialized to
  []UFWRule{} in parseUFWStatus before the parse loop.

Verified: rebuilt image, recreated container. The Network tab should
render an empty Interfaces grid + "No rules" instead of crashing.
The shipped docs lagged badly behind v0.2:
  - README claimed "v0.1 in development", didn't mention the
    Kubernetes tab, listed footprint targets that no longer apply,
    and warned that the Docker container deployment lost Services
    and Logs (we wired both via the fat-privileged container).
  - INSTALL.md only covered bare-metal + the old distroless docker
    flavor; nothing about the in-cluster Pod, host-shell login, or
    the new env vars.
  - SECURITY.md described privilege scoping for one shape only;
    the new fat-privileged container needed an explicit "this is
    effectively root on the host" callout, and the in-cluster Pod
    needed a section on the scoped ClusterRole.

This commit:

README.md (~140 lines)
  - v0.2 status; Kubernetes in the feature list; three deployment
    shapes table; quick-start per shape; updated project layout
    (deploy/k8s, deploy/scripts, internal/k8s, internal/api/k8s);
    Go 1.25+; doc index pointing at INSTALL/CONFIG/SECURITY/SPEC.

docs/INSTALL.md (rewritten, ~340 lines)
  - Choose-a-shape comparison table with privilege model + which
    tabs work in each.
  - Path A bare-metal: prerequisites, install, first-run wizard,
    update, uninstall, common pitfalls.
  - Path B Docker fat-privileged: full mount/cap inventory, the
    nsenter+setpriv+su login flow explained step-by-step, common
    pitfalls (terminal misbehaviors, kubeconfig issues).
  - Path C in-cluster Pod: prerequisites, image-loading on
    multi-node K3s, ClusterRole scope (current + Phase C plans),
    three access patterns (Ingress / port-forward / NodePort),
    common pitfalls (ImagePullBackOff, missing pods/log RBAC).
  - TLS modes section; Kubernetes integration section covering
    the four kubeconfig discovery paths.
  - Post-install checklist.

docs/CONFIG.md (new, ~180 lines)
  - Every CR_* env var with default + purpose, grouped by Core,
    TLS, Integrations, Container-only, Compose helpers.
  - Per-tab capability matrix (which tab works in which shape and
    what mount/cap each one needs).
  - Per-shape file paths table.
  - Capability detection / debugging walkthrough.
  - Compose env-file pattern.

docs/SECURITY.md (rewritten, ~180 lines)
  - Bumped to v0.2.
  - Privilege scoping section split into three subsections
    (bare-metal / fat-privileged container / in-cluster Pod) with
    explicit compromise-impact statements.
  - Audit section calls out PAM/auth.log + apt history.log as the
    real audit trail for the fat-privileged container's host ops.
  - Public-bind detection updated to mention RFC 6598 / Tailscale
    CGNAT awareness (was listed as a gap; now shipped).
  - Known gaps refreshed: K8s Phase C/D, cert-manager TLS for the
    Pod shape, etc.
Adds the four standard cluster operations to the SPA, with audit and
RBAC scoped to exactly what's needed.

Backend (internal/k8s/actions.go, internal/api/k8s/k8s.go)
  - RestartWorkload(ns, kind, name): merge-patches the pod template
    annotations with kubectl.kubernetes.io/restartedAt = now.
    Same trick `kubectl rollout restart` uses. Works for Deployment /
    StatefulSet / DaemonSet via case-insensitive kind dispatch; a sketch
    follows this list.
  - ScaleWorkload(ns, kind, name, replicas): GetScale → mutate
    Spec.Replicas → UpdateScale, preserving resourceVersion so the
    optimistic-concurrency check in the API server actually fires.
    Deployment + StatefulSet only; DaemonSet returns 400.
  - DeletePod(ns, name, gracePeriod, force): Delete with optional
    grace-period override. force=true hard-overrides ?grace=N to 0.
  - CordonNode(name, cordoned): merge-patches spec.unschedulable.
    Distinct audit actions for cordon vs uncordon so the trail
    captures intent.

  - Validation: every :namespace, :name, :kind goes through the
    existing DNS-1123 regex + length guard. Replicas bounds-checked
    0..1000 server-side.
  - Audit: every action writes a store.AuditEntry with
    target=ns/name (or just name for cluster-scoped Node), detail
    map carrying the relevant params. Outcome=failure includes the
    error message. Best-effort write, never fails the request.
  - DB threaded into k8sapi.Deps; router updated.
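
A sketch of the restart patch for the Deployment case, assuming the wrapper holds a clientset (helper name hypothetical):

```go
package k8s

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// restartDeployment reproduces `kubectl rollout restart` by bumping the
// restartedAt annotation on the pod template, which forces a new rollout.
func restartDeployment(ctx context.Context, cs *kubernetes.Clientset, ns, name string) error {
	patch := fmt.Sprintf(
		`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
		time.Now().Format(time.RFC3339))
	_, err := cs.AppsV1().Deployments(ns).Patch(ctx, name,
		types.MergePatchType, []byte(patch), metav1.PatchOptions{})
	return err
}
```
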

RBAC widening (deploy/k8s/rbac.yaml)
  - apps/{deployments,statefulsets,daemonsets}: patch  (restart)
  - apps/{deployments,statefulsets}/scale:    update (scale)
  - core/pods:                                 delete (rotate)
  - core/nodes:                                patch  (cordon)
  - Still NO secrets, NO persistentvolumes, NO delete on apps/*
    (would let the SPA wipe a workload outside GitOps).

Frontend (web/src/lib/k8s.ts + the three Detail drawers)
  - useRestartWorkload, useScaleWorkload, useDeletePod, useCordonNode
    — each useMutation invalidates the matching list + detail keys
    on success. Pattern matches useContainerAction.
  - WorkloadDetail: Restart button (always); Scale button with a
    number input prefilled from ready.desired (Deployment+STS only).
  - PodDetail: red Delete button with a "Force (skip graceful
    shutdown)" checkbox; closes the drawer on success via a new
    onClose prop wired up from routes/Kubernetes.tsx.
  - NodeDetail: toggle button derives current cordon state from the
    node.kubernetes.io/unschedulable taint (which `kubectl cordon`
    applies), falling back to the SchedulingDisabled condition.
  - Every destructive action goes through AlertDialog with the
    impact spelled out. Pending state + inline error on failure.

Verified end-to-end:
  - make image succeeds at 230MB.
  - Image loaded to both K3s nodes.
  - kubectl apply rbac.yaml; rollout restart succeeds.
  - 4 new endpoints all return 401 (auth required, routes mounted).
  - kubectl auth can-i {patch deployments, update deployments/scale,
    delete pods, patch nodes} as the SA → all "yes".
  - SPA bundle contains all four action mutation hooks.
Adds "shell into a container" to the Kubernetes tab. Equivalent to
`kubectl exec -it <pod> -c <container> -- /bin/sh`, but in the browser.

Backend (internal/k8s/exec.go + internal/api/k8s/k8s.go)
  - Client.PodExec uses k8s.io/client-go/tools/remotecommand to open a
    SPDY exec stream to the API server. Stashes the rest.Config on the
    Client struct (NewSPDYExecutor needs it; the existing typed
    clientset wasn't enough). A sketch follows this list.
  - WS handler at /ws/k8s/pods/:namespace/:name/exec wires:
      WS binary in   → io.Pipe → SPDY stdin
      SPDY stdout    → wsExecWriter → WS binary out
      SPDY stderr    → wsExecWriter → WS binary out (merged; xterm
                                       doesn't distinguish)
      WS text JSON   → resize        → wsSizeQueue → SPDY resize stream
  - Wire format mirrors /ws/terminal exactly so the frontend can clone
    Terminal.tsx's xterm wiring with minimal changes:
      init: { rows, cols, container, command? }   (default command ["/bin/sh"])
      err:  { type: "error", err: "..." }
  - Lifecycle correctness: cancelling ctx unblocks StreamWithContext,
    closing stdinW unblocks any blocked SPDY stdin read, closing
    sq.ch causes wsSizeQueue.Next to return nil so the SPDY size
    tracker exits cleanly. No goroutine leaks.
  - Audit: k8s.pod.exec.start at session open, k8s.pod.exec.end on
    close with bytes_in/bytes_out/duration_ms — same shape as
    terminal.session_end so a single audit query catches both.
  - Idle timeout 30 min, matching the terminal handler.
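
A sketch of the SPDY plumbing under the same assumptions; with a TTY the API server merges stderr into stdout, matching the merged frames above:

```go
package k8s

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// podExec opens an interactive exec stream to one container of a pod.
func podExec(ctx context.Context, cfg *rest.Config, ns, pod, container string,
	stdin io.Reader, stdout io.Writer, resize remotecommand.TerminalSizeQueue) error {

	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	req := cs.CoreV1().RESTClient().Post().
		Resource("pods").Namespace(ns).Name(pod).SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   []string{"/bin/sh"}, // the handler's default command
			Stdin:     true,
			Stdout:    true,
			TTY:       true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		return err
	}
	// Cancelling ctx unblocks this call, which is what gives the handler
	// its clean-shutdown story.
	return exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdin:             stdin,
		Stdout:            stdout,
		Tty:               true,
		TerminalSizeQueue: resize,
	})
}
```
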

RBAC (deploy/k8s/rbac.yaml)
  - core/pods/exec: [create]. Minimum verb needed for the K8s API
    server to accept the SubjectAccessReview that the SPDY exec call
    triggers. Nothing else widened in this commit.

Frontend (web/src/components/k8s/PodExecModal.tsx + PodDetail.tsx)
  - Sheet-based modal: container picker, command picker (sh / bash /
    custom), Reconnect, status pill — same xterm theme + addons as
    routes/Terminal.tsx.
  - PodDetail Containers table gets an Exec column. Click → modal
    opens with that container preselected. e.stopPropagation on the
    button so it doesn't fight the row's log-viewer container picker.
  - Reconnect generation counter bumps on container or command change,
    forcing a clean WS rebuild with the new init frame.
  - podExecURL helper added to web/src/lib/k8s.ts.

Dependency churn
  - go mod tidy pulled in the SPDY transitive set:
      github.com/gorilla/websocket
      github.com/moby/spdystream
      github.com/mxk/go-flowrate
    All transitive of k8s.io/client-go/tools/remotecommand. No new
    direct dependencies.

Verified end-to-end:
  - make image succeeds.
  - Image loaded into both K3s nodes.
  - kubectl apply rbac.yaml; rollout restart succeeds.
  - kubectl auth can-i create --subresource=exec pods → yes (SA's
    perms via can-i --list also confirm pods/exec [create]).
  - WS upgrade attempt returns 401 unauthenticated (route mounted,
    auth gate firing).
  - SPA bundle contains the new PodExecModal + podExecURL strings.
Adds a ConfigMaps tab to the Kubernetes route, with full list / view /
edit. Uses a simple key-value editor for now (no Monaco — that lands
with D4's manifest editor for the full YAML edit experience).

Backend (internal/k8s/configmaps.go + internal/api/k8s/k8s.go)
  - GET  /api/k8s/configmaps?namespace=
         → { configmaps: [{name, namespace, keys[], age}] }
         List omits data values to keep responses small; only metadata
         and the lex-sorted key list.
  - GET  /api/k8s/configmaps/:ns/:name
         → ConfigMapDetail (configmap + labels + annotations + data
         + binary_keys + events).
         binary_keys lists names of binaryData entries; the bytes
         themselves aren't returned (avoids accidentally exposing
         non-text secrets-in-configmaps).
  - PUT  /api/k8s/configmaps/:ns/:name
         body: { data: Record<string,string> }
         resp: {ok: true} / 400 (invalid key) / 404 / 409 (conflict)
              / 413 (over 1 MiB).
         Uses Update (not Patch) so the API server's
         resourceVersion-based optimistic concurrency check fires and
         we get 409 instead of silent overwrite when two operators
          edit at once. A sketch follows this list.
  - Validation: key matches `^[a-zA-Z0-9._-]+$`, total `len(k)+len(v)`
         under 1 MiB. Audit row per update with key_count + size_bytes.
         metadata.name/namespace are URL-canonical and not editable;
         labels/annotations are read-only here (D4 will handle them
         via manifest edit).
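
A sketch of that read-modify-update round trip (helper name and error mapping assumed):

```go
package k8s

import (
	"context"
	"errors"

	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// errConflict is surfaced to the SPA as 409.
var errConflict = errors.New("configmap changed since it was loaded")

// updateConfigMap overwrites Data via Update so the API server's
// resourceVersion check (carried along from the Get) can fire.
func updateConfigMap(ctx context.Context, cs *kubernetes.Clientset,
	ns, name string, data map[string]string) error {

	cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err // IsNotFound maps to 404 in the handler
	}
	cm.Data = data
	if _, err := cs.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
		if k8serrors.IsConflict(err) {
			return errConflict
		}
		return err
	}
	return nil
}
```
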

Frontend (lib/k8s.ts + components/k8s/ConfigMap{List,Detail}.tsx)
  - Three new TS interfaces matching the backend contract; three
    hooks (useK8sConfigMaps, useK8sConfigMapDetail,
    useUpdateConfigMap) following the Phase B/C invalidation pattern.
  - ConfigMapList: table with Name / Namespace / Keys-count / Age,
    namespace picker driven from the existing parent state.
  - ConfigMapDetail: Sheet drawer with key/value table editor
    (`<Input>` for keys, multi-line `<textarea>` for values),
    add/remove rows, Save with AlertDialog confirmation showing
    key count + total bytes via formatBytes. Dirty detection via
    JSON.stringify of sorted entries against the original snapshot.
  - 409 conflict UX: dedicated inline Alert with a Reload button
    that refetches detail and discards local edits.
  - Read-only sections for labels (chips), annotations (collapsible),
    binary_keys ("Edit via kubectl; not editable here"), events
    (reuses existing EventsTable).

RBAC (deploy/k8s/rbac.yaml)
  - core/configmaps: [update]. get/list/watch were already in scope
    from Phase A. No widening for binary data — read still returns
    only key names.

Verified end-to-end:
  - make image succeeds.
  - Image loaded into both K3s nodes; rollout restart clean.
  - Docker compose container recreated with the new image.
  - kubectl auth can-i update/list configmaps as the SA → both yes.
  - All 3 endpoints return 401 (mounted, auth-gated).
  - SPA bundle contains ConfigMapList / ConfigMapDetail / hooks.
CI has been failing at workflow startup since before v0.2 (the m9
push to main also failed). Three real issues plus a couple of polish
items:

1. Go version drift. ci.yml pinned Go 1.23 while go.mod requires
   go 1.25.0 (k8s.io/client-go pulled in via Phase A). setup-go would
   install 1.23 and `go mod download` would fail. Switched to
   `go-version-file: go.mod` so the toolchain follows go.mod and
   can't drift again.

2. Wrong embed-stub path. internal/web/web.go does
       //go:embed all:dist
   from its own package directory, so the embed needs
   internal/web/dist/index.html — not web/dist/index.html. The old
   stub step created the wrong path; the Go compile would fail with
   "no matching files for embed". Fixed.

3. Trigger only ran on `push: [main]`, never on dev branches. Added
   "v*" so v0.2 (and future v0.3, v0.4) get CI feedback before merge.

Polish:
   - Added a `Vet` step (cheap, fast, often catches things lint
     misses).
   - Scoped the test run to ./internal/... ./cmd/... — ./... would
     descend into web/node_modules/flatted/golang/pkg/flatted (an
     npm dep that ships an embedded .go file) and fail to compile.
   - npm cache + cache-dependency-path on the web job for faster
     reruns.
   - Lint kept on `latest` with a comment explaining we'll pin if
     it ever flakes from a release upgrade.

The Docker image job is unchanged. It builds the same fat-privileged
runtime image we deploy from. Image size is ~230MB now (vs ~25MB
when ci.yml was first written) — the existing "Inspect image size"
step just reports, no enforcement, so no further change needed.
That step's run: scalar contained `image size: %.1f` — YAML plain
scalars can't contain colon-space, which made the parser see the
`%.1f` as a value of an implicit mapping with key `image size`. The
result was the entire workflow failing at startup with "this run
likely failed because of a workflow file issue" — zero jobs running,
zero seconds duration. Same bug existed on v0.1; main's m9 push hit
it too.

Switched to `run: |` block scalar so the awk format string is opaque
to YAML.
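
The trap is reproducible in a few lines with gopkg.in/yaml.v3 (an illustration only; the actual failure was GitHub's workflow parser rejecting ci.yml):

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

func main() {
	var out map[string]any

	// Plain scalar containing ": " — the parser sees an implicit nested
	// mapping and errors out, which is what killed the workflow at startup.
	broken := []byte(`run: awk '{printf "image size: %.1f", $1}'`)
	fmt.Println(yaml.Unmarshal(broken, &out)) // "mapping values are not allowed" parse error

	// Block scalar: the body is opaque to YAML, so the format string survives.
	fixed := []byte("run: |\n  awk '{printf \"image size: %.1f\", $1}'")
	fmt.Println(yaml.Unmarshal(fixed, &out), out["run"]) // <nil> plus the raw command
}
```
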
Two fixes for the first real-execution CI run:

1. web/package-lock.json wasn't tracked, so the runner couldn't
   resolve `cache-dependency-path`, the cache step errored, and
   `npm ci` would have fallen through to `npm install` anyway —
   neither reproducible nor cached. Tracked the lock file (188K,
   one-time inflation; future updates will be small diffs).

2. golangci-lint v1.64.8 (which the v6 action's `latest` resolves
   to) is built with Go 1.24 and refuses to lint code targeting
   Go 1.25 ("the Go language version used to build golangci-lint
   is lower than the targeted Go version"). Bumped to action @v8
   pinned at v2.5.0 (built with Go 1.25). v2.x has a different
   config format but we have no .golangci.yml, so default linters
   run unchanged.
golangci-lint v2 doesn't run any linters without an explicit config
(v1 silently ran the standard set). Added a minimal v2 config that
enables the same set v1 had on by default — errcheck, govet,
ineffassign, staticcheck, unused — plus targeted exclusion rules
for idiomatic patterns we don't want to flag:

  - `defer x.Close()` — the canonical Go idiom; close-on-defer
    failures aren't actionable.
  - `.Set(Read|Write)Deadline(...)` on websocket conns — same idea;
    a failed deadline just means the next op will error and we'll
    exit cleanly.
  - `_ = d.DB.WriteAudit(...)` — audits are best-effort by design.
  - tests routinely ignore cleanup errors.

Two real findings fixed in source:

  - internal/docker/fake.go:98 — staticcheck QF1008 flagged
    `c.Container.State = state` as redundant once `c.State = state`
    was already on the line above. Embedded field rewrites are
    transparent here; dropped the duplicate.
  - internal/api/auth/auth.go:191 — `type pendingEnrollments
    struct{}` was declared with a TODO comment but never referenced.
    Removed; the TODO now lives in MILESTONES.md territory rather
    than dead code.

Local `golangci-lint run` against the same v2.5.0 the CI runs is
clean: 0 issues.
…b/dist)

The committed web-builder stage writes its bundle to /src/web/dist
(vite.config.ts has `outDir: 'dist'` resolved relative to web/), but
the go-builder COPY was reading from /src/internal/web/dist —
nothing there, build error:

    failed to compute cache key: ... "/src/internal/web/dist": not found

`make image` worked locally because my working-tree Dockerfile had
the correct path, but I never committed that fix; the broken COPY
from a pre-v0.2 in-flight relocation attempt has been on origin/v0.2
since the fat-privileged container rewrite. Aligning the path so
the bundle from web-builder actually reaches go-builder for embed.

Embed target inside the runtime image is unchanged — internal/web/dist
relative to the Go module — only the source path of the COPY changed.
Adds a Secrets tab to the Kubernetes route. Read-only — no edit, create,
or delete. This is the most-sensitive RBAC widening so far, and the
design choices reflect that.

Backend (internal/k8s/secrets.go + internal/api/k8s/k8s.go)
  - GET  /api/k8s/secrets?namespace=
         → { secrets: [{name, namespace, type, keys[], age}] }
         List omits values; only metadata + lex-sorted key list. Type
         is the K8s SecretType (Opaque, kubernetes.io/tls,
         dockerconfigjson, etc.).
  - GET  /api/k8s/secrets/:ns/:name
         → SecretDetail (secret + labels + annotations + data + events).
         data values are base64-decoded to plaintext on the backend.
         Non-UTF-8 bytes get U+FFFD substitution via standard JSON
          encoder behavior — most secrets are text, so this is acceptable.

  - Per-detail audit: every detail GET (success or failure) writes a
    store.AuditEntry with action `k8s.secret.read`, target `ns/name`,
    detail `{"key_count": N}`. Key NAMES are not logged, values are
    not logged. The audit table records THAT a secret was read, not
    what was in it. List endpoint is intentionally NOT audited (no
    values exposed there).
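
A sketch of that audit write, with the store surface stubbed for illustration (the real AuditEntry lives in the project's store package; field names are assumptions):

```go
package k8sapi

// AuditEntry stubs the assumed store type.
type AuditEntry struct {
	Action  string
	Target  string
	Outcome string
	Detail  map[string]any
}

// AuditWriter stubs the assumed store interface.
type AuditWriter interface{ WriteAudit(AuditEntry) error }

// auditSecretRead records THAT a secret was read, never what was in it:
// no key names, no values, just a count.
func auditSecretRead(db AuditWriter, ns, name string, keyCount int, readErr error) {
	entry := AuditEntry{
		Action:  "k8s.secret.read",
		Target:  ns + "/" + name,
		Outcome: "success",
		Detail:  map[string]any{"key_count": keyCount},
	}
	if readErr != nil {
		entry.Outcome = "failure"
		entry.Detail["error"] = readErr.Error()
	}
	_ = db.WriteAudit(entry) // best-effort by design; never fails the request
}
```
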

RBAC (deploy/k8s/rbac.yaml)
  - core/secrets: [get, list]
  - Deliberately NOT watch — keeps a long-lived stream of all cluster
    secrets from being established. Reads are point-in-time and
    audited; watch would not be.
  - Comment explicitly explains the trade.

Frontend (web/src/components/k8s/Secret{List,Detail}.tsx + lib hooks)
  - SecretList: table with Name/Namespace/Type-badge/Keys/Age. Type
    badge strips the kubernetes.io/ prefix for compactness.
  - SecretDetail: Sheet drawer with:
      • muted informational banner: "Read-only. Values are sensitive
        — every detail view writes an audit row." (informational, not
        alarmist red.)
      • Reveal All / Hide All toggle resets to OFF every time the
        drawer opens — re-opening always re-masks.
      • Per-row eye toggle + copy-to-clipboard button.
      • Masking: 8 fixed bullets regardless of value length so the
        DOM never leaks length information.
      • JSON pretty-print: try/catch around JSON.parse + stringify;
        falls back to raw text on parse failure.
      • TLS-shaped values (start with -----BEGIN, contain newlines)
        rendered in a monospace <pre>.
      • Copy fallback: navigator.clipboard.writeText with a hidden
        <textarea>+execCommand fallback for insecure contexts.
  - useK8sSecretDetail uses staleTime: 60s and refetchOnWindowFocus:
    false so re-opening doesn't immediately re-hit the API and pile
    up audit rows.

Verified end-to-end:
  - Local golangci-lint v2.5.0 clean (0 issues), tsc --noEmit clean.
  - make image succeeds at 230MB.
  - Image loaded into both K3s nodes; deployment rolled clean.
  - Docker compose container recreated with the standard
    fat-privileged flags.
  - kubectl auth can-i: list secrets=yes, get secrets=yes,
    watch secrets=no (intentional).
  - Both endpoints return 401 (mounted, auth-gated).
  - SPA bundle contains 24 references to SecretList/SecretDetail/
    useK8sSecrets/secrets path.

Side change: deploy/docker-compose.yml `image:` flipped from
ghcr.io/tm4rtin17/controlroom:latest to controlroom:dev so
`docker compose up` matches the local-build workflow without
forcing a registry pull. Production deployments should pin a
specific tag.
Adds an "Edit YAML" button to the existing detail drawers for the
five editable resource kinds (Deployment / StatefulSet / DaemonSet
/ Service / ConfigMap). Click → Sheet with a Monaco editor; Dry-run
validates server-side; Apply writes back via PUT semantics.

Backend (internal/k8s/manifest.go + internal/api/k8s/k8s.go)
  - GET  /api/k8s/manifest?kind=&namespace=&name=
         → { yaml, resource_version, gvk }
         The dynamic client fetches the resource as unstructured;
         we strip metadata.{managedFields,creationTimestamp,uid,
         resourceVersion,generation} and the entire status before
         marshalling so the editable surface is just spec + labels +
         annotations. resourceVersion is sent separately so the
         apply round-trip can detect conflicts without leaking RV
         into the editable buffer.
  - POST /api/k8s/manifest
         body: { yaml, resource_version, dry_run }
         → { ok, resource_version, warnings, dry_run }
         dyn.Update with metav1.UpdateOptions{DryRun: [DryRunAll]}
         when dry_run=true. PUT (Update) semantics, not server-side
         Apply — kubectl-edit-equivalent. The API server rejects on
         stale RV with k8serrors.IsConflict → 409.
  - Cross-check enforced: the YAML's `kind`, `metadata.namespace`,
    and `metadata.name` must match the URL path values; mismatch
    returns 400 ErrManifestMismatch. Prevents an "edit foo, but
    YAML targets bar" cross-resource attack; a sketch follows this list.
  - Editable kinds allowlist (deployment / statefulset / daemonset
    / service / configmap). Anything outside → 400. Pods are
    excluded (immutable fields). Nodes are excluded (metadata-only
    edits warrant a different flow). Secrets stay read-only per D3.
  - Audit: k8s.manifest.dry_run / k8s.manifest.apply with
    {kind, namespace, name, bytes, dry_run}. The YAML body itself
    is never logged — only its size — so a future audit dump can't
    replay sensitive data that flowed through the editor.
  - Error mapping: 400 (parse / cross-check / not-editable kind),
    404, 409 (IsConflict), 422 (IsInvalid — admission webhook
    denial passes through verbatim), 500.
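
A sketch of the cross-check and the field stripping on the unstructured object (helper names assumed):

```go
package k8s

import (
	"errors"
	"strings"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// ErrManifestMismatch maps to 400 in the handler.
var ErrManifestMismatch = errors.New("manifest does not match the requested resource")

// checkManifestTarget rejects a YAML body that names a different resource
// than the one the request path says is being edited.
func checkManifestTarget(obj *unstructured.Unstructured, kind, ns, name string) error {
	if !strings.EqualFold(obj.GetKind(), kind) ||
		obj.GetNamespace() != ns || obj.GetName() != name {
		return ErrManifestMismatch
	}
	return nil
}

// stripServerFields removes the server-managed noise so the editable
// buffer is just spec + labels + annotations, as described above.
func stripServerFields(obj *unstructured.Unstructured) {
	for _, f := range []string{"managedFields", "creationTimestamp", "uid",
		"resourceVersion", "generation"} {
		unstructured.RemoveNestedField(obj.Object, "metadata", f)
	}
	unstructured.RemoveNestedField(obj.Object, "status")
}
```
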

RBAC (deploy/k8s/rbac.yaml)
  - apps/{deployments,statefulsets,daemonsets}: patch + update
    (patch was already in scope from Phase C for restart; update
    is new for full YAML edits).
  - core/services: update (was get/list/watch only).
  - core/configmaps: update was already in scope from D2.

Frontend (lib/k8s.ts + components/k8s/ManifestEditor.tsx)
  - useK8sManifest: useQuery with refetchOnWindowFocus:false and
    staleTime:Infinity so unsaved edits aren't clobbered by a
    background refetch. Operator-driven Reload only.
  - useApplyManifest: useMutation. On non-dry-run success, invalidates
    workload/service/configmap detail + list queries.
  - ManifestEditor.tsx (Sheet drawer):
    • Monaco lazy-loaded via React.lazy → ~3MB chunk fetched only
      when the Sheet first opens; main bundle stays at the size it
      was before D4.
    • Dark theme, YAML mode, no minimap, font-mono, wordWrap on,
      scrollBeyondLastLine off.
    • Header buttons: Reload (with dirty-state confirmation),
      Dry-run, Apply (always confirms via AlertDialog).
    • Apply disabled when YAML is unchanged.
    • 409 conflict UX has its own dedicated Alert + inline Reload
      button, separate from generic / 422-admission errors.
    • resource_version shown muted in the header so operators know
      which revision they're working against.
  - WorkloadDetail / ServiceDetail / ConfigMapDetail each grew an
    Edit YAML button alongside their existing actions. ConfigMap
    keeps its structured key/value editor too — the YAML editor is
    the alternative when you need to edit metadata/labels/annotations.

Dependency: web/package.json adds @monaco-editor/react ^4.6.0 (the
small wrapper; monaco-editor itself is its peer/transitive). Lock
file regenerated locally before this commit so CI's `npm ci` works
without falling back to `npm install`.

Verified end-to-end:
  - Local golangci-lint v2.5.0 clean (0 issues), tsc --noEmit clean.
  - make image succeeds at ~230MB (Monaco lives in lazy chunks under
    /assets, not in the main runtime image bloat path).
  - Image loaded into both K3s nodes; deployment rolled cleanly.
  - kubectl auth can-i update {deployments|statefulsets|daemonsets|
    services|configmaps} as the SA → all five yes.
  - GET /api/k8s/manifest and POST /api/k8s/manifest both return
    401 (mounted, auth-gated).
  - SPA main bundle contains ManifestEditor + the manifest URL
    helpers; Monaco itself is in a separate lazy chunk.
`docker compose up` namespaces named volumes with the project name —
when run from deploy/ the volume becomes deploy_controlroom-data, not
controlroom-data. That silently creates a fresh empty volume rather
than reusing the host-level one, stranding the operator's admin user,
JWT signing key, and TLS material on the other volume; the SPA then
shows the first-run setup wizard as if it were a clean install.

Marking the volume external means compose refuses to start if the
volume isn't already present (operator runs `docker volume create
controlroom-data` once on first install) AND uses that exact named
volume thereafter. Switching between `docker run -v controlroom-data:
/data` and `docker compose up` now lands on the same persistent state.
- Add CHANGELOG.md (Keep-a-Changelog) covering v0.1.0 → v0.2.0. The
  Unreleased section is reserved for post-merge work on main.
- README features table: rewrite the Kubernetes row to mention
  D1 exec, D2 ConfigMap edit, D3 Secret view, D4 manifest editor.
  Add a CHANGELOG link to the doc index.
- docs/INSTALL.md: rewrite the in-cluster ClusterRole table (Phase A
  → D verbs grouped by phase that introduced them, with explicit
  exclusions); add the `docker volume create controlroom-data` step
  to the compose path now that the volume is `external: true`.
- docs/CONFIG.md: per-tab capability matrix gains rows for the K8s
  sub-features (exec, lifecycle, configmap edit, secret view,
  manifest edit) so operators can see which RBAC verbs each one
  needs.
- docs/SECURITY.md: in-cluster Pod section's RBAC table updated for
  the full Phases A→D verb set; per-Secret-read and per-manifest-edit
  audit rows called out; Known gaps refreshed (K8s tab moved from
  "open" to "closed in v0.2", multi-cluster + RBAC roles added as
  v0.3 candidates).

Audit re-run before this commit: zero secret-shaped strings introduced
across the 24 commits in v0.2; runtime data still under /var/lib/* outside
the repo; .claude/ still ignored.
tm4rtin17 merged commit be8ab22 into main on May 9, 2026
6 checks passed
tm4rtin17 deleted the v0.2 branch on May 9, 2026 at 18:29