[FEATURE] Foreman coder gate: cluster-backed clean-room Job tier for envtest/e2e #768

@Defilan


Feature Description

Add a heavyweight, cluster-backed verification tier to the Foreman coder gate: a clean-room Kubernetes Job that runs envtest + e2e (kind) on a node that has a cluster, executed after the fast in-workspace gate passes, with failures fed back to the coder loop the same way the fast gate's failures are.

Problem Statement

As a maintainer running Foreman coders, I want the gate to catch real-cluster integration regressions before a PR is opened, so coder solutions don't pass the gate and then fail CI on envtest/e2e.

The current coder gate runs entirely in the coder's workspace: gofmt, go vet, go build, golangci-lint, and (since #762/#763) a fast unit-test tier. It has no Kubernetes cluster, so it structurally cannot run envtest (KUBEBUILDER_ASSETS) or e2e (kind). Any coder change whose behavior only manifests against a real cluster — CRD reconciliation, operator logic, CLI-against-cluster — can pass the full in-workspace gate (including unit tests, which mock the cluster and give false confidence) and only fail in CI after the PR is open.

This is systematic, not a one-off. It has now bitten the #731 cache-list work twice, on top of the earlier envtest trap:

  • #731 (Make llmkube cache list per-InferenceService cache aware, post #729): the coder produced a correct per-InferenceService cache list feature (verified working against a live 31-model cluster), but it regressed a pre-existing e2e test. cache list inspects every labeled cache PVC. A Pending (unbound) per-service PVC, created by an InferenceService whose fake e2e model never produces a running serving pod, left the inspector blocked in waitForPodRunning until the suite's global timeout. The unit tests passed (mocked client); only the kind e2e caught it.
  • The earlier envtest trap: the coder ran envtest in its workspace, where KUBEBUILDER_ASSETS is unset, and the run hung.

No amount of better unit testing closes this: mocked tests cannot model real PVC layout, real pod/volume lifecycle, or real reconciliation timing. It is integration territory by nature.

Proposed Solution

A post-fast-gate tier that runs the integration suite on a cluster-backed Job:

  • After the in-workspace fast gate (gofmt/vet/build/lint/unit) accepts a coder's GO, dispatch a clean-room Job on a node that has a cluster (Shadowstack; ties to #659, Example: spot-capacity GPU NodePool for Foreman gate Jobs).
  • The Job checks out the coder's branch and runs make test (envtest) and/or make test-e2e (kind), bounded by a timeout.
  • On failure, feed the output back to the coder loop as gate feedback (same mechanism as the fast gate), so the coder fixes and resubmits; on pass, the GO stands.
  • Keep the fast in-workspace gate as the first, cheap line of defense; the cluster Job is the heavyweight second tier (only run when the fast gate is green).

Alternatives Considered

  • Run e2e/envtest in the coder workspace — rejected: the workspace has no cluster, and provisioning one per coder is heavy and slow; the envtest trap shows the coder should not run these directly.
  • Rely on better unit tests — insufficient: mocked tests gave false confidence on exactly the #731 regression.
  • Accept CI as the only integration gate — status quo; it pushes the failure past the GO and into review, which is what this issue exists to fix.

Additional Context

Priority

  • High - Would significantly improve my workflow

Willingness to Contribute

  • Yes, I can submit a PR
  • Yes, I can help test

Metadata

Labels: enhancement (New feature or request)
