Feature Description
Add a heavyweight, cluster-backed verification tier to the Foreman coder gate: a clean-room Kubernetes Job that runs envtest + e2e (kind) on a node that has a cluster, executed after the fast in-workspace gate passes, with failures fed back to the coder loop the same way the fast gate's failures are.
Problem Statement
As a maintainer running Foreman coders, I want the gate to catch real-cluster integration regressions before a PR is opened, so coder solutions don't pass the gate and then fail CI on envtest/e2e.
The current coder gate runs entirely in the coder's workspace: gofmt, go vet, go build, golangci-lint, and (since #762/#763) a fast unit-test tier. It has no Kubernetes cluster, so it structurally cannot run envtest (KUBEBUILDER_ASSETS) or e2e (kind). Any coder change whose behavior only manifests against a real cluster — CRD reconciliation, operator logic, CLI-against-cluster — can pass the full in-workspace gate (including unit tests, which mock the cluster and give false confidence) and only fail in CI after the PR is open.
This is systematic, not a one-off. It has now bitten the #731 cache-list work twice and the earlier envtest trap once:
#731 ([FEATURE] Make llmkube cache list per-InferenceService cache aware (post #729)): the coder produced a correct per-InferenceService cache list feature (verified working against a live 31-model cluster), but it regressed a pre-existing e2e test. cache list inspects every labeled cache PVC, and a Pending (unbound) per-service PVC, created by an InferenceService whose fake e2e model never produces a running serving pod, left the inspector blocked in waitForPodRunning until the suite's global timeout. The unit tests passed (mocked client); only the kind e2e caught it.
The earlier envtest trap: the coder ran envtest in its workspace, where KUBEBUILDER_ASSETS is unset, and the run hung.
No amount of better unit testing closes this: mocked tests cannot model real PVC layout, real pod/volume lifecycle, or real reconciliation timing. It is integration territory by nature.
Proposed Solution
A post-fast-gate tier that runs the integration suite on a cluster-backed Job:
The Job checks out the coder's branch and runs make test (envtest) and/or make test-e2e (kind), bounded by a timeout.
On failure, feed the output back to the coder loop as gate feedback (same mechanism as the fast gate), so the coder fixes and resubmits; on pass, the GO stands.
Keep the fast in-workspace gate as the first, cheap line of defense; the cluster Job is the heavyweight second tier (only run when the fast gate is green).
Alternatives Considered
Run e2e/envtest in the coder workspace — rejected: the workspace has no cluster, and provisioning one per coder is heavy and slow; the envtest trap shows the coder should not run these directly.