
Cache-locality-aware worker claim + reap #451

Merged
benben merged 1 commit into main from ben/node-age-aware-scheduling on Apr 23, 2026

Conversation

benben commented on Apr 23, 2026

Summary

Both worker scheduling paths in `K8sWorkerPool` iterate the `p.workers` map in Go's randomized order, so claims and reaps are non-deterministic. That churns nodes: a new query may land on a cold-cache node while an idle reap may kill the oldest, warmest one, defeating the NVMe cache proxy we just rolled out.

Fix: track the first time this CP saw a worker on each node (`nodeFirstSeen map[string]time.Time`, stamped when a pod reports Ready with `Spec.NodeName` set) and use it to order both paths:

  • `findIdleWorkerLocked`: prefer the idle worker on the oldest-seen node so sessions land where the NVMe cache is warmest.
  • `reapIdleWorkers`: sort reap candidates by `nodeFirstSeen` descending so excess idle workers are evicted from the newest-seen nodes first, letting Karpenter consolidate those nodes and keeping the old, warm ones alive longer.

Workers missing a `nodeName` (an informer race in the first few milliseconds after spawn) sort as "new", so they are reaped first and claimed last, which is the safe direction. A sketch of both orderings follows.
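For concreteness, here is a rough sketch of that ordering. Only `nodeFirstSeen`, `stampNodeFirstSeen`, `findIdleWorkerLocked`, and `reapIdleWorkers` are names from this PR; the `worker` fields and the `firstSeen` helper are illustrative assumptions, not the actual implementation:

```go
package pool

import (
	"sort"
	"sync"
	"time"
)

// Illustrative shapes; the real K8sWorkerPool differs.
type worker struct {
	id       string
	nodeName string // empty until the informer reports Spec.NodeName
	idle     bool
}

type pool struct {
	mu            sync.Mutex
	workers       map[string]*worker
	nodeFirstSeen map[string]time.Time // first time this CP saw a worker on each node
}

// stampNodeFirstSeen records when this CP first saw any worker on a
// node. Idempotent: later workers on the same node never move the stamp.
func (p *pool) stampNodeFirstSeen(node string) {
	if node == "" {
		return // informer has not filled in Spec.NodeName yet
	}
	if _, ok := p.nodeFirstSeen[node]; !ok {
		p.nodeFirstSeen[node] = time.Now()
	}
}

// firstSeen returns a node's stamp. Unknown or empty nodes get the
// caller-supplied "now" fallback, so they sort as newest: reaped
// first, claimed last.
func (p *pool) firstSeen(node string, fallback time.Time) time.Time {
	if t, ok := p.nodeFirstSeen[node]; ok {
		return t
	}
	return fallback
}

// findIdleWorkerLocked prefers the idle worker on the oldest-seen
// node so sessions land where the cache is warmest. Caller holds p.mu.
func (p *pool) findIdleWorkerLocked() *worker {
	now := time.Now()
	var best *worker
	var bestSeen time.Time
	for _, w := range p.workers {
		if !w.idle {
			continue
		}
		if seen := p.firstSeen(w.nodeName, now); best == nil || seen.Before(bestSeen) {
			best, bestSeen = w, seen
		}
	}
	return best
}

// sortReapCandidates orders excess idle workers newest-node-first so
// reapIdleWorkers evicts from fresh nodes, letting Karpenter
// consolidate them while old warm nodes keep their workers.
func (p *pool) sortReapCandidates(candidates []*worker) {
	now := time.Now()
	sort.Slice(candidates, func(i, j int) bool {
		return p.firstSeen(candidates[i].nodeName, now).
			After(p.firstSeen(candidates[j].nodeName, now))
	})
}
```

Passing a single `now` into both comparisons keeps the sort comparator consistent even when several candidates have no stamp yet.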

Design choices

  • Per-CP state, no coordination. Each CP only decides about workers it owns. The shared store already prevents over-spawn; the local `idleCount > minWorkers` gate already prevents over-reap. Guaranteeing the absolute oldest node across CPs would need a shared-store round-trip or a cluster-scoped `Node.Get` (extra RBAC) for marginal benefit.
  • No new k8s API calls, no new RBAC. `nodeName` comes from the pod the informer already watches; no `Nodes().Get` needed.
  • The map is pruned when the last worker on a given `nodeName` retires (an O(N) scan inside the pool mutex, where N is the per-CP worker count, which stays small); see the sketch below.
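Continuing the sketch above (the body is an assumption; only the function name comes from this PR), the prune is a single guarded delete:

```go
// pruneNodeFirstSeen drops a node's stamp once its last worker
// retires. It is an O(N) scan over p.workers with N the per-CP worker
// count, run inside the pool mutex, so it stays cheap.
func (p *pool) pruneNodeFirstSeen(node string) {
	for _, w := range p.workers {
		if w.nodeName == node {
			return // another worker still lives on this node
		}
	}
	delete(p.nodeFirstSeen, node)
}
```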

Test plan

  • 8 new unit tests in `k8s_pool_test.go` covering:
    • `stampNodeFirstSeen` idempotency + empty-node guard
    • `pruneNodeFirstSeen` only drops entries with no remaining worker
    • claim prefers oldest node, falls back cleanly when nodeName is empty
    • reap picks newest-node first
    • reap stops at minWorkers
    • reap skips workers still within idleTimeout
  • Full `go test -tags kubernetes ./controlplane/` passes
  • After merge: watch `duckgres_org_workers_active` by node and Karpenter node churn; expect fewer long-lived nodes being recycled and more of the newest nodes being evicted. A sketch of one test follows.
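As a flavor of the coverage, the idempotency + empty-node-guard case might look roughly like this, reusing the sketch types above plus `import "testing"` (the real tests in `k8s_pool_test.go` may be structured differently):

```go
func TestStampNodeFirstSeen(t *testing.T) {
	p := &pool{
		workers:       map[string]*worker{},
		nodeFirstSeen: map[string]time.Time{},
	}

	p.stampNodeFirstSeen("node-a")
	first := p.nodeFirstSeen["node-a"]

	time.Sleep(time.Millisecond)
	p.stampNodeFirstSeen("node-a") // re-stamp must not move the time
	if got := p.nodeFirstSeen["node-a"]; !got.Equal(first) {
		t.Fatalf("stamp moved: got %v, want %v", got, first)
	}

	p.stampNodeFirstSeen("") // empty-node guard: no entry created
	if _, ok := p.nodeFirstSeen[""]; ok {
		t.Fatal("empty nodeName was stamped")
	}
}
```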

🤖 Generated with Claude Code

Both paths iterated the p.workers map in randomized Go order, so claims
and reaps were non-deterministic. That churned worker nodes — a new
query might claim a worker on a freshly-provisioned node with a cold
cache-proxy, and an idle reap might kill the worker on the longest-
lived node (best cache hit rate).

Track first-seen time per node in the pool (nodeFirstSeen map, populated
when a worker pod reports Ready with a Spec.NodeName) and use it to
order both paths:

  - findIdleWorkerLocked: prefer idle worker on the oldest-seen node,
    so sessions land on the node with the most warmed-up NVMe parquet
    pages.
  - reapIdleWorkers: sort reap candidates by nodeFirstSeen desc so
    excess idle workers are evicted from the newest-seen nodes first,
    letting Karpenter consolidate those and keeping the old nodes warm.

Workers with no known nodeName (informer race on first few ms after
spawn) sort as 'new' — get reaped first, claimed last — which is the
safe direction.

No RBAC changes. Per-CP state, no cross-CP coordination needed (store
already prevents over-spawn; the local idleCount > minWorkers gate
prevents over-reap). Map is pruned on last-worker-retire per node.

Tests cover: stamp idempotency, empty-node guard, prune correctness,
oldest-node-wins claim, unknown-node fallback, newest-node-first reap,
minWorkers floor, idleTimeout gate.
@benben benben requested a review from a team April 23, 2026 11:26
@benben benben merged commit 40c2809 into main Apr 23, 2026
21 checks passed
@benben benben deleted the ben/node-age-aware-scheduling branch April 23, 2026 11:26