From aebfcc9706ede461e2bcb9d6cc2d0241b32023fc Mon Sep 17 00:00:00 2001 From: gziv Date: Thu, 23 Apr 2026 11:40:30 +0300 Subject: [PATCH 1/4] feat: add infrastructure manifests, network policies, LiteLLM, cleanup, and ops docs RBAC: explicit ServiceAccount, configmaps and PVC permissions. Security: pod security reference, resource quota, four network policies (default-deny, direct-api, litellm, self-hosted). LiteLLM: HA deployment, service, config, and secret template for Vertex AI mode. Storage: workspace and dead-letter PVCs, daily cleanup CronJob for stale pods/PipelineRuns/imagestreams. Docs: failure handling (retries, timeouts, dead-letter, partial recovery) and infrastructure deployment guide. --- Docs/failure_handling.md | 96 +++++++++ Docs/infrastructure_ops.md | 183 ++++++++++++++++++ config/litellm/configmap.yaml | 25 +++ config/litellm/deployment.yaml | 82 ++++++++ config/litellm/secret_template.yaml | 15 ++ config/litellm/service.yaml | 17 ++ config/rbac.yaml | 14 ++ .../security/network_policy_default_deny.yaml | 14 ++ .../security/network_policy_direct_api.yaml | 38 ++++ config/security/network_policy_litellm.yaml | 36 ++++ .../security/network_policy_self_hosted.yaml | 38 ++++ config/security/pod_security.yaml | 40 ++++ config/security/resource_quota.yaml | 17 ++ config/storage/cleanup_cronjob.yaml | 111 +++++++++++ config/storage/dead_letter_pvc.yaml | 17 ++ config/storage/workspace_pvc.yaml | 13 ++ scripts/cleanup.sh | 58 ++++++ 17 files changed, 814 insertions(+) create mode 100644 Docs/failure_handling.md create mode 100644 Docs/infrastructure_ops.md create mode 100644 config/litellm/configmap.yaml create mode 100644 config/litellm/deployment.yaml create mode 100644 config/litellm/secret_template.yaml create mode 100644 config/litellm/service.yaml create mode 100644 config/security/network_policy_default_deny.yaml create mode 100644 config/security/network_policy_direct_api.yaml create mode 100644 config/security/network_policy_litellm.yaml create mode 100644 config/security/network_policy_self_hosted.yaml create mode 100644 config/security/pod_security.yaml create mode 100644 config/security/resource_quota.yaml create mode 100644 config/storage/cleanup_cronjob.yaml create mode 100644 config/storage/dead_letter_pvc.yaml create mode 100644 config/storage/workspace_pvc.yaml create mode 100755 scripts/cleanup.sh diff --git a/Docs/failure_handling.md b/Docs/failure_handling.md new file mode 100644 index 0000000..8db2ce0 --- /dev/null +++ b/Docs/failure_handling.md @@ -0,0 +1,96 @@ +# Failure Handling, Retries, and Idempotency + +## Retry Policy + +Each pipeline task has a retry configuration based on its idempotency +characteristics. Retries are defined in `pipeline.yaml` using Tekton's +`retries` field. + +| Task | Retries | Rationale | +|---|---|---| +| `clone-repo` (ClusterTask) | 2 | Network-dependent, fully idempotent | +| `validate` | 1 | Read-only, deterministic | +| `scaffold` | 1 | Deterministic template rendering | +| `build-push` | 2 | Transient registry/network errors; Buildah is idempotent with layer caching | +| `harbor-eval` | 0 | Long-running (up to 3h), not idempotent — partial trial results would conflict with a fresh run | +| `analyze` | 1 | Reads from workspace, deterministic computation | +| `store-results` | 2 | Database transient errors; upsert logic ensures idempotency via `pipeline_run_id` uniqueness | + +## Timeouts + +Per-task timeouts prevent hung tasks from consuming cluster resources +indefinitely. 
Set in `pipeline.yaml` using Tekton's `timeout` field.
+
+| Task | Timeout | Notes |
+|---|---|---|
+| `clone-repo` | 5m | Large repos may need adjustment |
+| `validate` | 10m | Includes py_compile on all test files |
+| `scaffold` | 10m | Jinja2 rendering + file copy |
+| `build-push` | 30m | Two container builds (treatment + control) |
+| `harbor-eval` | 3h | 20 trials x 2 variants; adjust based on task complexity |
+| `analyze` | 15m | Statistical computation + report generation |
+| `store-results` | 15m | Database writes + observer notifications |
+| **Pipeline total** | 4h | Overall safety net; below the 4h25m sum of per-task maxima, since tasks rarely all run to their limits |
+
+## Non-Retryable Failures
+
+Certain failure categories should not be retried because they indicate
+a problem that will not resolve on its own:
+
+- **Validation failures** — malformed submission, missing required files
+- **Schema violations** — invalid `metadata.yaml`
+- **Build failures** from syntax errors in user code
+- **Harbor evaluation failures** from test assertion errors (the skill genuinely fails)
+
+These are distinguished from transient failures (network timeouts,
+registry 503s, DB connection drops) by exit code conventions:
+
+| Exit Code | Meaning | Retry? |
+|---|---|---|
+| 0 | Success | -- |
+| 1 | Transient/recoverable error | Yes |
+| 2 | Validation/user error (non-retryable) | No |
+| 3 | Infrastructure error (retryable) | Yes |
+
+Scripts should use `sys.exit(2)` for user-facing errors to signal
+Tekton that a retry would not help.
+
+## Dead-Letter Path
+
+When a PipelineRun fails after exhausting retries:
+
+1. **Artifacts are retained** on the workspace PVC (not cleaned up)
+2. Failed run artifacts are copied to the `abevalflow-dead-letter` PVC
+   by the cleanup CronJob (instead of being deleted)
+3. Dead-letter artifacts are retained for 14 days (configurable via
+   `DEAD_LETTER_RETENTION_DAYS`)
+4. PipelineRun metadata remains queryable via `tkn pipelinerun describe`
+   until the cleanup CronJob prunes it (default 7 days)
+
+## Partial-Run Recovery
+
+Tekton does not natively support resuming a pipeline from a specific
+task. The recovery strategy is:
+
+1. **Workspace snapshot** — the PVC retains all intermediate artifacts
+   from completed tasks. A re-run with the same submission will
+   overwrite these, effectively starting fresh.
+
+2. **Harbor checkpointing** — the Harbor fork persists individual trial
+   results to the workspace as they complete. If `harbor-eval` fails
+   mid-way (e.g., after 15 of 20 trials), the partial `result.json`
+   files are available for inspection. However, the analysis step
+   expects a complete set, so a re-run of `harbor-eval` is needed.
+
+3. **Manual re-trigger** — use `tkn pipeline start` with the same
+   parameters to re-run the full pipeline. Since all tasks before the
+   failure point are idempotent, they will complete quickly using
+   cached layers (builds) or deterministic outputs (scaffold).
+
+## Concurrency
+
+- **PipelineRuns** — no built-in Tekton limit; use `ResourceQuota` on
+  the namespace (`config/security/resource_quota.yaml`) to cap total
+  pods, which indirectly limits concurrent runs.
+- **Trial Pods** — Harbor's `OpenShiftEnvironment` controls concurrency
+  via its `max_concurrent` parameter in the job config.
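+
+## Exit-Code Sketch
+
+A minimal sketch of the exit-code convention above, as a task-script
+skeleton. The exception names and `run_task` body are illustrative
+placeholders, not part of the current codebase; an infrastructure
+error would analogously return 3.
+
+```python
+import sys
+
+
+class UserError(Exception):
+    """Validation/user error -- retrying will not help (exit 2)."""
+
+
+class TransientError(Exception):
+    """Recoverable error -- Tekton may retry (exit 1)."""
+
+
+def run_task() -> None:
+    """Placeholder for the real task logic (validate, scaffold, ...)."""
+
+
+def main() -> int:
+    try:
+        run_task()
+    except UserError as exc:
+        print(f"user error (no retry): {exc}", file=sys.stderr)
+        return 2
+    except TransientError as exc:
+        print(f"transient error (retry may help): {exc}", file=sys.stderr)
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
+```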
diff --git a/Docs/infrastructure_ops.md b/Docs/infrastructure_ops.md new file mode 100644 index 0000000..6bfe1fc --- /dev/null +++ b/Docs/infrastructure_ops.md @@ -0,0 +1,183 @@ +# Infrastructure & Operations Guide + +Deployment and operations reference for running ABEvalFlow on OpenShift. + +## Prerequisites + +- OpenShift cluster with Pipelines operator (Tekton) installed +- `oc` CLI authenticated with cluster-admin or namespace-admin +- `tkn` CLI (optional, for manual pipeline triggers and PipelineRun cleanup) + +## Namespace Setup + +```bash +oc new-project ab-eval-flow --description="ABEvalFlow A/B evaluation pipeline" +``` + +## Deployment Order + +Apply manifests in this order to satisfy dependencies: + +```bash +# 1. RBAC — ServiceAccount, Roles, RoleBindings +oc apply -f config/rbac.yaml + +# 2. Security — resource quotas +oc apply -f config/security/resource_quota.yaml + +# 3. Network policies — choose ONE based on LLM mode (see below) +oc apply -f config/security/network_policy_default_deny.yaml +oc apply -f config/security/network_policy_.yaml + +# 4. Storage — workspace and dead-letter PVCs +oc apply -f config/storage/workspace_pvc.yaml +oc apply -f config/storage/dead_letter_pvc.yaml + +# 5. Cleanup CronJob +oc apply -f config/storage/cleanup_cronjob.yaml + +# 6. Tekton tasks +oc apply -f pipeline/tasks/ + +# 7. Tekton triggers +oc apply -f pipeline/triggers/ + +# 8. Expose EventListener +oc create route edge el-submission-listener \ + --service=el-submission-listener \ + --port=http-listener + +# 9. (Optional) LiteLLM — only for Vertex AI mode +oc apply -f config/litellm/ +``` + +## Network Policy Selection + +Choose the network policy that matches your LLM access mode. Always +apply the default-deny policy first, then add the mode-specific allow +policy. + +| LLM Mode | Policies to Apply | Effect | +|---|---|---| +| Direct API key | `default_deny` + `direct_api` | Trial pods can reach provider HTTPS endpoints + DNS | +| Vertex AI + LiteLLM | `default_deny` + `litellm` | Trial pods can only reach in-cluster LiteLLM on port 4000 | +| Self-hosted model | `default_deny` + `self_hosted` | Trial pods can only reach in-cluster model server | + +Trial pods must carry the label `abevalflow/role: trial` for policies +to take effect. The Harbor fork's `OpenShiftEnvironment` should set +this label when creating trial pods. + +## LiteLLM Setup (Vertex AI Mode Only) + +1. Create the credentials secret with your GCP service account key: + +```bash +oc create secret generic litellm-credentials \ + --from-file=GOOGLE_APPLICATION_CREDENTIALS_JSON=path/to/sa-key.json \ + --from-literal=LITELLM_MASTER_KEY=$(openssl rand -hex 32) \ + -n ab-eval-flow +``` + +2. Edit `config/litellm/configmap.yaml` to set your GCP project and + model routing. + +3. Apply the manifests: + +```bash +oc apply -f config/litellm/ +``` + +4. Verify the proxy is healthy: + +```bash +oc get pods -l app.kubernetes.io/name=litellm -n ab-eval-flow +oc port-forward svc/litellm 4000:4000 -n ab-eval-flow & +curl http://localhost:4000/health +``` + +## Storage + +| PVC | Purpose | Default Size | +|---|---|---| +| `abevalflow-workspace` | Shared pipeline workspace (source, builds, results) | 5Gi | +| `abevalflow-dead-letter` | Retained artifacts from failed runs | 2Gi | + +Adjust sizes based on expected submission volume and image sizes. + +## Cleanup CronJob + +Runs daily at 03:00 UTC. 
Configurable via environment variables: + +| Variable | Default | Description | +|---|---|---| +| `NAMESPACE` | `ab-eval-flow` | Target namespace | +| `POD_AGE_HOURS` | `24` | Delete completed/failed trial pods older than this | +| `PIPELINERUN_AGE_DAYS` | `7` | Delete PipelineRuns older than this | + +To run cleanup manually: + +```bash +oc create job --from=cronjob/abevalflow-cleanup manual-cleanup -n ab-eval-flow +``` + +## Resource Quotas + +The default quota (`config/security/resource_quota.yaml`) limits: + +| Resource | Limit | +|---|---| +| Pods | 50 | +| CPU requests | 32 cores | +| Memory requests | 64Gi | +| CPU limits | 64 cores | +| Memory limits | 128Gi | +| PVCs | 10 | + +Adjust based on cluster capacity and expected concurrency. + +## Pod Security + +Trial pods spawned by Harbor's `OpenShiftEnvironment` should follow the +security context documented in `config/security/pod_security.yaml`: + +- `runAsNonRoot: true` +- `allowPrivilegeEscalation: false` +- Drop all Linux capabilities +- Seccomp `RuntimeDefault` +- Resource requests/limits per trial pod + +The Harbor fork currently sets `HOME=/tmp` instead of +`readOnlyRootFilesystem: true` for agent compatibility. This is +documented in `Docs/harbor_openshift_backend.md`. + +## Failure Handling + +See [failure_handling.md](failure_handling.md) for retry policies, +timeouts, dead-letter path, and partial-run recovery. + +## Verification + +After deploying, verify the infrastructure: + +```bash +# Check ServiceAccount +oc get sa pipeline -n ab-eval-flow + +# Check RBAC +oc auth can-i create pods --as=system:serviceaccount:ab-eval-flow:pipeline -n ab-eval-flow + +# Check network policies +oc get networkpolicy -n ab-eval-flow + +# Check PVCs +oc get pvc -n ab-eval-flow + +# Check CronJob +oc get cronjob -n ab-eval-flow + +# Check EventListener +oc get el,route -n ab-eval-flow + +# Check resource quota usage +oc describe resourcequota eval-resource-quota -n ab-eval-flow +``` diff --git a/config/litellm/configmap.yaml b/config/litellm/configmap.yaml new file mode 100644 index 0000000..5652666 --- /dev/null +++ b/config/litellm/configmap.yaml @@ -0,0 +1,25 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: litellm-config + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + app.kubernetes.io/name: litellm +data: + config.yaml: | + model_list: + - model_name: claude-sonnet + litellm_params: + model: vertex_ai/claude-3-5-sonnet@20241022 + vertex_project: "" + vertex_location: "global" + + litellm_settings: + drop_params: true + set_verbose: false + num_retries: 2 + request_timeout: 120 + + general_settings: + master_key: "os.environ/LITELLM_MASTER_KEY" diff --git a/config/litellm/deployment.yaml b/config/litellm/deployment.yaml new file mode 100644 index 0000000..9b5441d --- /dev/null +++ b/config/litellm/deployment.yaml @@ -0,0 +1,82 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: litellm + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + app.kubernetes.io/name: litellm +spec: + replicas: 2 + selector: + matchLabels: + app.kubernetes.io/name: litellm + template: + metadata: + labels: + app.kubernetes.io/name: litellm + app.kubernetes.io/part-of: abevalflow + spec: + serviceAccountName: pipeline + containers: + - name: litellm + image: ghcr.io/berriai/litellm:main-latest + ports: + - containerPort: 4000 + name: http + args: + - "--config" + - "/app/config/config.yaml" + - "--port" + - "4000" + env: + - name: LITELLM_MASTER_KEY + valueFrom: + secretKeyRef: + name: 
litellm-credentials + key: LITELLM_MASTER_KEY + - name: GOOGLE_APPLICATION_CREDENTIALS + value: /app/credentials/gcp-sa-key.json + volumeMounts: + - name: config + mountPath: /app/config + readOnly: true + - name: credentials + mountPath: /app/credentials + readOnly: true + resources: + requests: + cpu: "250m" + memory: "256Mi" + limits: + cpu: "1" + memory: "1Gi" + readinessProbe: + httpGet: + path: /health + port: 4000 + initialDelaySeconds: 10 + periodSeconds: 10 + livenessProbe: + httpGet: + path: /health + port: 4000 + initialDelaySeconds: 15 + periodSeconds: 30 + securityContext: + runAsNonRoot: true + allowPrivilegeEscalation: false + capabilities: + drop: ["ALL"] + seccompProfile: + type: RuntimeDefault + volumes: + - name: config + configMap: + name: litellm-config + - name: credentials + secret: + secretName: litellm-credentials + items: + - key: GOOGLE_APPLICATION_CREDENTIALS_JSON + path: gcp-sa-key.json diff --git a/config/litellm/secret_template.yaml b/config/litellm/secret_template.yaml new file mode 100644 index 0000000..01b28f3 --- /dev/null +++ b/config/litellm/secret_template.yaml @@ -0,0 +1,15 @@ +apiVersion: v1 +kind: Secret +metadata: + name: litellm-credentials + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + app.kubernetes.io/name: litellm +type: Opaque +stringData: + # Vertex AI service account key (JSON) + # Replace with actual credentials before applying + GOOGLE_APPLICATION_CREDENTIALS_JSON: "" + # Master key for LiteLLM admin API + LITELLM_MASTER_KEY: "" diff --git a/config/litellm/service.yaml b/config/litellm/service.yaml new file mode 100644 index 0000000..dbf06c7 --- /dev/null +++ b/config/litellm/service.yaml @@ -0,0 +1,17 @@ +apiVersion: v1 +kind: Service +metadata: + name: litellm + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + app.kubernetes.io/name: litellm +spec: + type: ClusterIP + selector: + app.kubernetes.io/name: litellm + ports: + - name: http + port: 4000 + targetPort: 4000 + protocol: TCP diff --git a/config/rbac.yaml b/config/rbac.yaml index 0b313e8..474ed40 100644 --- a/config/rbac.yaml +++ b/config/rbac.yaml @@ -1,3 +1,11 @@ +apiVersion: v1 +kind: ServiceAccount +metadata: + name: pipeline + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +--- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: @@ -33,6 +41,12 @@ rules: - apiGroups: [""] resources: [secrets] verbs: [get] + - apiGroups: [""] + resources: [configmaps] + verbs: [get, list] + - apiGroups: [""] + resources: [persistentvolumeclaims] + verbs: [get, list, create, delete] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding diff --git a/config/security/network_policy_default_deny.yaml b/config/security/network_policy_default_deny.yaml new file mode 100644 index 0000000..ec45d8c --- /dev/null +++ b/config/security/network_policy_default_deny.yaml @@ -0,0 +1,14 @@ +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: trial-pod-default-deny-egress + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +spec: + podSelector: + matchLabels: + abevalflow/role: trial + policyTypes: + - Egress + egress: [] diff --git a/config/security/network_policy_direct_api.yaml b/config/security/network_policy_direct_api.yaml new file mode 100644 index 0000000..39502f0 --- /dev/null +++ b/config/security/network_policy_direct_api.yaml @@ -0,0 +1,38 @@ +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: trial-pod-direct-api-egress + 
namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + annotations: + abevalflow/llm-mode: direct-api + abevalflow/note: >- + Allows trial pods to reach LLM provider APIs and DNS. + Apply this policy INSTEAD OF default-deny when using direct API keys. +spec: + podSelector: + matchLabels: + abevalflow/role: trial + policyTypes: + - Egress + egress: + # DNS resolution + - to: + - namespaceSelector: {} + ports: + - protocol: UDP + port: 53 + - protocol: TCP + port: 53 + # HTTPS to LLM providers + - to: + - ipBlock: + cidr: 0.0.0.0/0 + except: + - 10.0.0.0/8 + - 172.16.0.0/12 + - 192.168.0.0/16 + ports: + - protocol: TCP + port: 443 diff --git a/config/security/network_policy_litellm.yaml b/config/security/network_policy_litellm.yaml new file mode 100644 index 0000000..157ffde --- /dev/null +++ b/config/security/network_policy_litellm.yaml @@ -0,0 +1,36 @@ +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: trial-pod-litellm-egress + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + annotations: + abevalflow/llm-mode: vertex+litellm + abevalflow/note: >- + Allows trial pods to reach only the in-cluster LiteLLM proxy + and DNS. No external egress. Apply INSTEAD OF default-deny + when using Vertex AI mode. +spec: + podSelector: + matchLabels: + abevalflow/role: trial + policyTypes: + - Egress + egress: + # DNS resolution + - to: + - namespaceSelector: {} + ports: + - protocol: UDP + port: 53 + - protocol: TCP + port: 53 + # LiteLLM proxy (same namespace) + - to: + - podSelector: + matchLabels: + app.kubernetes.io/name: litellm + ports: + - protocol: TCP + port: 4000 diff --git a/config/security/network_policy_self_hosted.yaml b/config/security/network_policy_self_hosted.yaml new file mode 100644 index 0000000..6828a31 --- /dev/null +++ b/config/security/network_policy_self_hosted.yaml @@ -0,0 +1,38 @@ +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: trial-pod-self-hosted-egress + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + annotations: + abevalflow/llm-mode: self-hosted + abevalflow/note: >- + Allows trial pods to reach only the in-cluster model endpoint + and DNS. No external egress. Apply INSTEAD OF default-deny + when using a self-hosted model (vLLM, Ollama, etc.). + Update the podSelector or namespaceSelector to match your + model serving deployment. +spec: + podSelector: + matchLabels: + abevalflow/role: trial + policyTypes: + - Egress + egress: + # DNS resolution + - to: + - namespaceSelector: {} + ports: + - protocol: UDP + port: 53 + - protocol: TCP + port: 53 + # Self-hosted model endpoint (adjust selector to match your deployment) + - to: + - podSelector: + matchLabels: + app.kubernetes.io/name: model-server + ports: + - protocol: TCP + port: 8080 diff --git a/config/security/pod_security.yaml b/config/security/pod_security.yaml new file mode 100644 index 0000000..62f9eb3 --- /dev/null +++ b/config/security/pod_security.yaml @@ -0,0 +1,40 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: trial-pod-security-reference + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +data: + # Reference security context for trial Pods spawned by Harbor's + # OpenShiftEnvironment. The Harbor fork reads these values from + # environment_kwargs in the job config; this ConfigMap documents + # the intended baseline. + # + # NOTE: The Harbor fork does NOT currently set readOnlyRootFilesystem + # because many agents write to $HOME. 
The mitigation is HOME=/tmp + # with an emptyDir mount. If agent compatibility allows, enable + # readOnlyRootFilesystem in the job config's environment_kwargs. + security_context.yaml: | + securityContext: + runAsNonRoot: true + allowPrivilegeEscalation: false + capabilities: + drop: ["ALL"] + seccompProfile: + type: RuntimeDefault + volumeMounts: + - name: tmp + mountPath: /tmp + volumes: + - name: tmp + emptyDir: + sizeLimit: 512Mi + resource_limits.yaml: | + resources: + requests: + cpu: "250m" + memory: "512Mi" + limits: + cpu: "2" + memory: "4Gi" diff --git a/config/security/resource_quota.yaml b/config/security/resource_quota.yaml new file mode 100644 index 0000000..bf1c7d7 --- /dev/null +++ b/config/security/resource_quota.yaml @@ -0,0 +1,17 @@ +apiVersion: v1 +kind: ResourceQuota +metadata: + name: eval-resource-quota + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +spec: + hard: + # Trial pods: 40 max concurrent (20 treatment + 20 control) + # plus pipeline task pods and EventListener + pods: "50" + requests.cpu: "32" + requests.memory: "64Gi" + limits.cpu: "64" + limits.memory: "128Gi" + persistentvolumeclaims: "10" diff --git a/config/storage/cleanup_cronjob.yaml b/config/storage/cleanup_cronjob.yaml new file mode 100644 index 0000000..51f305a --- /dev/null +++ b/config/storage/cleanup_cronjob.yaml @@ -0,0 +1,111 @@ +apiVersion: batch/v1 +kind: CronJob +metadata: + name: abevalflow-cleanup + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +spec: + schedule: "0 3 * * *" + concurrencyPolicy: Forbid + successfulJobsHistoryLimit: 3 + failedJobsHistoryLimit: 3 + jobTemplate: + spec: + backoffLimit: 1 + template: + spec: + serviceAccountName: pipeline + restartPolicy: OnFailure + containers: + - name: cleanup + image: image-registry.openshift-image-registry.svc:5000/openshift/cli:latest + command: ["/bin/bash", "/scripts/cleanup.sh"] + env: + - name: NAMESPACE + value: ab-eval-flow + - name: POD_AGE_HOURS + value: "24" + - name: PIPELINERUN_AGE_DAYS + value: "7" + volumeMounts: + - name: scripts + mountPath: /scripts + readOnly: true + resources: + requests: + cpu: "50m" + memory: "64Mi" + limits: + cpu: "200m" + memory: "128Mi" + securityContext: + runAsNonRoot: true + allowPrivilegeEscalation: false + capabilities: + drop: ["ALL"] + volumes: + - name: scripts + configMap: + name: cleanup-script + defaultMode: 0755 +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: cleanup-script + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +data: + cleanup.sh: | + #!/usr/bin/env bash + set -euo pipefail + + NAMESPACE="${NAMESPACE:-ab-eval-flow}" + POD_AGE_HOURS="${POD_AGE_HOURS:-24}" + PIPELINERUN_AGE_DAYS="${PIPELINERUN_AGE_DAYS:-7}" + + log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $*"; } + + log "Starting cleanup in namespace=${NAMESPACE}" + + # Delete completed/failed trial pods older than threshold + log "Removing completed/failed trial pods older than ${POD_AGE_HOURS}h..." 
+ threshold=$((POD_AGE_HOURS * 3600)) + for phase in Succeeded Failed; do + oc get pods -n "${NAMESPACE}" --field-selector="status.phase=${phase}" \ + -l abevalflow/role=trial -o name 2>/dev/null | while read -r pod; do + age=$(oc get "${pod}" -n "${NAMESPACE}" -o jsonpath='{.metadata.creationTimestamp}' 2>/dev/null) + age_sec=$(python3 -c " + from datetime import datetime, timezone + t = datetime.fromisoformat('${age}'.replace('Z','+00:00')) + print(int((datetime.now(timezone.utc) - t).total_seconds())) + " 2>/dev/null || echo 0) + if [ "${age_sec}" -gt "${threshold}" ]; then + log "Deleting ${pod} (age=${age_sec}s, phase=${phase})" + oc delete "${pod}" -n "${NAMESPACE}" --grace-period=0 || true + fi + done + done + + # Delete old PipelineRuns (keep recent N days worth) + log "Removing PipelineRuns older than ${PIPELINERUN_AGE_DAYS}d..." + if command -v tkn &>/dev/null; then + tkn pipelinerun delete -n "${NAMESPACE}" --keep="${PIPELINERUN_AGE_DAYS}" --force 2>/dev/null || true + else + log "tkn not available, skipping PipelineRun cleanup" + fi + + # Prune empty image streams + log "Pruning empty image streams..." + oc get imagestream -n "${NAMESPACE}" -o name 2>/dev/null | while read -r is; do + tags=$(oc get "${is}" -n "${NAMESPACE}" -o jsonpath='{.status.tags}' 2>/dev/null || echo "[]") + count=$(python3 -c "import json; print(len(json.loads('${tags}') or []))" 2>/dev/null || echo 0) + if [ "${count}" -eq 0 ]; then + log "Deleting empty ${is}" + oc delete "${is}" -n "${NAMESPACE}" || true + fi + done + + log "Cleanup complete" diff --git a/config/storage/dead_letter_pvc.yaml b/config/storage/dead_letter_pvc.yaml new file mode 100644 index 0000000..0e55fff --- /dev/null +++ b/config/storage/dead_letter_pvc.yaml @@ -0,0 +1,17 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: abevalflow-dead-letter + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + annotations: + abevalflow/note: >- + Retains artifacts from failed PipelineRuns for debugging. + Cleaned up by the cleanup CronJob after retention period. +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 2Gi diff --git a/config/storage/workspace_pvc.yaml b/config/storage/workspace_pvc.yaml new file mode 100644 index 0000000..6ff22d3 --- /dev/null +++ b/config/storage/workspace_pvc.yaml @@ -0,0 +1,13 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: abevalflow-workspace + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 5Gi diff --git a/scripts/cleanup.sh b/scripts/cleanup.sh new file mode 100755 index 0000000..0d88af4 --- /dev/null +++ b/scripts/cleanup.sh @@ -0,0 +1,58 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Cleanup script for ab-eval-flow namespace. +# Intended to run as a CronJob to remove stale resources. + +NAMESPACE="${NAMESPACE:-ab-eval-flow}" +POD_AGE_HOURS="${POD_AGE_HOURS:-24}" +PIPELINERUN_AGE_DAYS="${PIPELINERUN_AGE_DAYS:-7}" +DEAD_LETTER_RETENTION_DAYS="${DEAD_LETTER_RETENTION_DAYS:-14}" + +log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $*"; } + +log "Starting cleanup in namespace=${NAMESPACE}" + +# Delete completed/failed trial pods older than threshold +log "Removing completed/failed pods older than ${POD_AGE_HOURS}h..." 
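+# Trial pods are selected by the abevalflow/role=trial label; each one is
+# age-checked against POD_AGE_HOURS before deletion.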
+completed=$(oc get pods -n "${NAMESPACE}" \
+  --field-selector=status.phase==Succeeded -l abevalflow/role=trial \
+  -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || true)
+failed=$(oc get pods -n "${NAMESPACE}" \
+  --field-selector=status.phase==Failed -l abevalflow/role=trial \
+  -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || true)
+
+for pod in ${completed} ${failed}; do
+  age_seconds=$(oc get pod "${pod}" -n "${NAMESPACE}" \
+    -o jsonpath='{.status.startTime}' 2>/dev/null | \
+    xargs -I{} python3 -c "
+from datetime import datetime, timezone
+start = datetime.fromisoformat('{}'.replace('Z', '+00:00'))
+print(int((datetime.now(timezone.utc) - start).total_seconds()))
+" 2>/dev/null || echo 0)
+  threshold=$((POD_AGE_HOURS * 3600))
+  if [ "${age_seconds}" -gt "${threshold}" ]; then
+    log "Deleting pod ${pod} (age=${age_seconds}s)"
+    oc delete pod "${pod}" -n "${NAMESPACE}" --grace-period=0 || true
+  fi
+done
+
+# Delete old PipelineRuns
+log "Removing PipelineRuns older than ${PIPELINERUN_AGE_DAYS}d..."
+if command -v tkn &>/dev/null; then
+  tkn pipelinerun delete -n "${NAMESPACE}" \
+    --keep="${PIPELINERUN_AGE_DAYS}" \
+    --force 2>/dev/null || true
+fi
+
+# Prune internal registry images for deleted submissions
+log "Pruning unused image streams..."
+oc get imagestream -n "${NAMESPACE}" -o name 2>/dev/null | while read -r is; do
+  tag_count=$(oc get "${is}" -n "${NAMESPACE}" -o jsonpath='{.status.tags}' 2>/dev/null | python3 -c "import sys,json; print(len(json.load(sys.stdin) or []))" 2>/dev/null || echo 0)
+  if [ "${tag_count}" -eq 0 ]; then
+    log "Deleting empty imagestream ${is}"
+    oc delete "${is}" -n "${NAMESPACE}" || true
+  fi
+done
+
+log "Cleanup complete"

From 1ea3530341059976d65a1dd97556c7287ce430bc Mon Sep 17 00:00:00 2001
From: gziv
Date: Thu, 23 Apr 2026 12:57:09 +0300
Subject: [PATCH 2/4] fix: address PR #13 review feedback on infra-ops manifests

- Rename PIPELINERUN_AGE_DAYS to PIPELINERUN_KEEP_COUNT (count semantics)
- Clarify dead-letter PVC as reserved for future use, not yet automated
- Mark per-task retries/timeouts as target policy (not yet applied)
- Change network policy annotations from INSTEAD OF to IN ADDITION TO
- Pin LiteLLM image to main-v1.82.6 instead of main-latest
- Add dedicated litellm ServiceAccount instead of reusing pipeline SA
- Scope secrets get verb with resourceNames in RBAC
- Remove inline cleanup script from CronJob ConfigMap (single source)
- Rename pod_security.yaml to pod_security_reference.yaml
- Add port guidance note to self-hosted network policy
---
 Docs/failure_handling.md                      | 33 +++++-----
 Docs/infrastructure_ops.md                    | 13 ++--
 config/litellm/deployment.yaml                |  4 +-
 config/litellm/serviceaccount.yaml            |  8 +++
 config/rbac.yaml                              |  1 +
 .../security/network_policy_direct_api.yaml   |  2 +-
 config/security/network_policy_litellm.yaml   |  2 +-
 .../security/network_policy_self_hosted.yaml  |  5 +-
 ...urity.yaml => pod_security_reference.yaml} |  0
 config/storage/cleanup_cronjob.yaml           | 64 ++++---------------
 scripts/cleanup.sh                            |  9 ++-
 11 files changed, 58 insertions(+), 83 deletions(-)
 create mode 100644 config/litellm/serviceaccount.yaml
 rename config/security/{pod_security.yaml => pod_security_reference.yaml} (100%)

diff --git a/Docs/failure_handling.md b/Docs/failure_handling.md
index 8db2ce0..ed20b3e 100644
--- a/Docs/failure_handling.md
+++ b/Docs/failure_handling.md
@@ -1,12 +1,13 @@
 # Failure Handling, Retries, and Idempotency
 
-## Retry Policy
+## Retry Policy (Target — Not Yet Applied)
 
-Each 
pipeline task has a retry configuration based on its idempotency -characteristics. Retries are defined in `pipeline.yaml` using Tekton's -`retries` field. +> **Note:** The per-task retry values below are the target policy. They +> will be added to `pipeline.yaml` once the pipeline assembly PR merges +> and per-task `retries` fields are wired. Currently `pipeline.yaml` +> only sets aggregate `spec.timeouts`. -| Task | Retries | Rationale | +| Task | Planned Retries | Rationale | |---|---|---| | `clone-repo` (ClusterTask) | 2 | Network-dependent, fully idempotent | | `validate` | 1 | Read-only, deterministic | @@ -16,12 +17,13 @@ characteristics. Retries are defined in `pipeline.yaml` using Tekton's | `analyze` | 1 | Reads from workspace, deterministic computation | | `store-results` | 2 | Database transient errors; upsert logic ensures idempotency via `pipeline_run_id` uniqueness | -## Timeouts +## Timeouts (Target — Not Yet Applied) -Per-task timeouts prevent hung tasks from consuming cluster resources -indefinitely. Set in `pipeline.yaml` using Tekton's `timeout` field. +> **Note:** The per-task timeouts below are the target policy. They will +> be added to `pipeline.yaml` alongside the retry values. Currently only +> aggregate timeouts are set: `pipeline: 4h`, `tasks: 3h`. -| Task | Timeout | Notes | +| Task | Planned Timeout | Notes | |---|---|---| | `clone-repo` | 5m | Large repos may need adjustment | | `validate` | 10m | Includes py_compile on all test files | @@ -60,12 +62,13 @@ Tekton that a retry would not help. When a PipelineRun fails after exhausting retries: 1. **Artifacts are retained** on the workspace PVC (not cleaned up) -2. Failed run artifacts are copied to the `abevalflow-dead-letter` PVC - by the cleanup CronJob (instead of being deleted) -3. Dead-letter artifacts are retained for 14 days (configurable via - `DEAD_LETTER_RETENTION_DAYS`) -4. PipelineRun metadata remains queryable via `tkn pipelinerun describe` - until the cleanup CronJob prunes it (default 7 days) +2. The `abevalflow-dead-letter` PVC is provisioned and reserved for + failed-run artifact storage. Automatic copy logic is **not yet + implemented** — operators can manually copy artifacts from the + workspace PVC for post-mortem analysis. +3. PipelineRun metadata remains queryable via `tkn pipelinerun describe` + until the cleanup CronJob prunes it (keeps the 7 most recent by + count, configurable via `PIPELINERUN_KEEP_COUNT`) ## Partial-Run Recovery diff --git a/Docs/infrastructure_ops.md b/Docs/infrastructure_ops.md index 6bfe1fc..87619f8 100644 --- a/Docs/infrastructure_ops.md +++ b/Docs/infrastructure_ops.md @@ -33,7 +33,10 @@ oc apply -f config/security/network_policy_.yaml oc apply -f config/storage/workspace_pvc.yaml oc apply -f config/storage/dead_letter_pvc.yaml -# 5. Cleanup CronJob +# 5. Cleanup — create ConfigMap from script, then apply CronJob +oc create configmap cleanup-script \ + --from-file=cleanup.sh=scripts/cleanup.sh \ + -n ab-eval-flow --dry-run=client -o yaml | oc apply -f - oc apply -f config/storage/cleanup_cronjob.yaml # 6. Tekton tasks @@ -48,6 +51,8 @@ oc create route edge el-submission-listener \ --port=http-listener # 9. (Optional) LiteLLM — only for Vertex AI mode +# Creates a dedicated litellm ServiceAccount, Deployment, Service, and ConfigMap. +# Requires the litellm-credentials Secret (see LiteLLM Setup below). 
oc apply -f config/litellm/ ``` @@ -100,7 +105,7 @@ curl http://localhost:4000/health | PVC | Purpose | Default Size | |---|---|---| | `abevalflow-workspace` | Shared pipeline workspace (source, builds, results) | 5Gi | -| `abevalflow-dead-letter` | Retained artifacts from failed runs | 2Gi | +| `abevalflow-dead-letter` | Reserved for failed-run artifacts (manual use for now) | 2Gi | Adjust sizes based on expected submission volume and image sizes. @@ -112,7 +117,7 @@ Runs daily at 03:00 UTC. Configurable via environment variables: |---|---|---| | `NAMESPACE` | `ab-eval-flow` | Target namespace | | `POD_AGE_HOURS` | `24` | Delete completed/failed trial pods older than this | -| `PIPELINERUN_AGE_DAYS` | `7` | Delete PipelineRuns older than this | +| `PIPELINERUN_KEEP_COUNT` | `7` | Keep the N most recent PipelineRuns, delete the rest | To run cleanup manually: @@ -138,7 +143,7 @@ Adjust based on cluster capacity and expected concurrency. ## Pod Security Trial pods spawned by Harbor's `OpenShiftEnvironment` should follow the -security context documented in `config/security/pod_security.yaml`: +security context documented in `config/security/pod_security_reference.yaml`: - `runAsNonRoot: true` - `allowPrivilegeEscalation: false` diff --git a/config/litellm/deployment.yaml b/config/litellm/deployment.yaml index 9b5441d..748c064 100644 --- a/config/litellm/deployment.yaml +++ b/config/litellm/deployment.yaml @@ -17,10 +17,10 @@ spec: app.kubernetes.io/name: litellm app.kubernetes.io/part-of: abevalflow spec: - serviceAccountName: pipeline + serviceAccountName: litellm containers: - name: litellm - image: ghcr.io/berriai/litellm:main-latest + image: ghcr.io/berriai/litellm:main-v1.82.6 ports: - containerPort: 4000 name: http diff --git a/config/litellm/serviceaccount.yaml b/config/litellm/serviceaccount.yaml new file mode 100644 index 0000000..d6cb39a --- /dev/null +++ b/config/litellm/serviceaccount.yaml @@ -0,0 +1,8 @@ +apiVersion: v1 +kind: ServiceAccount +metadata: + name: litellm + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + app.kubernetes.io/name: litellm diff --git a/config/rbac.yaml b/config/rbac.yaml index 474ed40..2867de1 100644 --- a/config/rbac.yaml +++ b/config/rbac.yaml @@ -41,6 +41,7 @@ rules: - apiGroups: [""] resources: [secrets] verbs: [get] + resourceNames: [ab-eval-db-credentials, llm-credentials] - apiGroups: [""] resources: [configmaps] verbs: [get, list] diff --git a/config/security/network_policy_direct_api.yaml b/config/security/network_policy_direct_api.yaml index 39502f0..7f63097 100644 --- a/config/security/network_policy_direct_api.yaml +++ b/config/security/network_policy_direct_api.yaml @@ -9,7 +9,7 @@ metadata: abevalflow/llm-mode: direct-api abevalflow/note: >- Allows trial pods to reach LLM provider APIs and DNS. - Apply this policy INSTEAD OF default-deny when using direct API keys. + Apply IN ADDITION TO default-deny when using direct API keys. spec: podSelector: matchLabels: diff --git a/config/security/network_policy_litellm.yaml b/config/security/network_policy_litellm.yaml index 157ffde..cd82b8b 100644 --- a/config/security/network_policy_litellm.yaml +++ b/config/security/network_policy_litellm.yaml @@ -9,7 +9,7 @@ metadata: abevalflow/llm-mode: vertex+litellm abevalflow/note: >- Allows trial pods to reach only the in-cluster LiteLLM proxy - and DNS. No external egress. Apply INSTEAD OF default-deny + and DNS. No external egress. Apply IN ADDITION TO default-deny when using Vertex AI mode. 
spec: podSelector: diff --git a/config/security/network_policy_self_hosted.yaml b/config/security/network_policy_self_hosted.yaml index 6828a31..35f4e05 100644 --- a/config/security/network_policy_self_hosted.yaml +++ b/config/security/network_policy_self_hosted.yaml @@ -9,10 +9,11 @@ metadata: abevalflow/llm-mode: self-hosted abevalflow/note: >- Allows trial pods to reach only the in-cluster model endpoint - and DNS. No external egress. Apply INSTEAD OF default-deny + and DNS. No external egress. Apply IN ADDITION TO default-deny when using a self-hosted model (vLLM, Ollama, etc.). Update the podSelector or namespaceSelector to match your - model serving deployment. + model serving deployment. Adjust port to match your server + (vLLM=8000, Ollama=11434, TGI=80). spec: podSelector: matchLabels: diff --git a/config/security/pod_security.yaml b/config/security/pod_security_reference.yaml similarity index 100% rename from config/security/pod_security.yaml rename to config/security/pod_security_reference.yaml diff --git a/config/storage/cleanup_cronjob.yaml b/config/storage/cleanup_cronjob.yaml index 51f305a..e6c0a9c 100644 --- a/config/storage/cleanup_cronjob.yaml +++ b/config/storage/cleanup_cronjob.yaml @@ -26,7 +26,7 @@ spec: value: ab-eval-flow - name: POD_AGE_HOURS value: "24" - - name: PIPELINERUN_AGE_DAYS + - name: PIPELINERUN_KEEP_COUNT value: "7" volumeMounts: - name: scripts @@ -50,6 +50,14 @@ spec: name: cleanup-script defaultMode: 0755 --- +# The cleanup-script ConfigMap should be created from scripts/cleanup.sh +# to maintain a single source of truth: +# +# oc create configmap cleanup-script \ +# --from-file=cleanup.sh=scripts/cleanup.sh \ +# -n ab-eval-flow --dry-run=client -o yaml | oc apply -f - +# +# Alternatively, apply it manually: apiVersion: v1 kind: ConfigMap metadata: @@ -57,55 +65,5 @@ metadata: namespace: ab-eval-flow labels: app.kubernetes.io/part-of: abevalflow -data: - cleanup.sh: | - #!/usr/bin/env bash - set -euo pipefail - - NAMESPACE="${NAMESPACE:-ab-eval-flow}" - POD_AGE_HOURS="${POD_AGE_HOURS:-24}" - PIPELINERUN_AGE_DAYS="${PIPELINERUN_AGE_DAYS:-7}" - - log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $*"; } - - log "Starting cleanup in namespace=${NAMESPACE}" - - # Delete completed/failed trial pods older than threshold - log "Removing completed/failed trial pods older than ${POD_AGE_HOURS}h..." - threshold=$((POD_AGE_HOURS * 3600)) - for phase in Succeeded Failed; do - oc get pods -n "${NAMESPACE}" --field-selector="status.phase=${phase}" \ - -l abevalflow/role=trial -o name 2>/dev/null | while read -r pod; do - age=$(oc get "${pod}" -n "${NAMESPACE}" -o jsonpath='{.metadata.creationTimestamp}' 2>/dev/null) - age_sec=$(python3 -c " - from datetime import datetime, timezone - t = datetime.fromisoformat('${age}'.replace('Z','+00:00')) - print(int((datetime.now(timezone.utc) - t).total_seconds())) - " 2>/dev/null || echo 0) - if [ "${age_sec}" -gt "${threshold}" ]; then - log "Deleting ${pod} (age=${age_sec}s, phase=${phase})" - oc delete "${pod}" -n "${NAMESPACE}" --grace-period=0 || true - fi - done - done - - # Delete old PipelineRuns (keep recent N days worth) - log "Removing PipelineRuns older than ${PIPELINERUN_AGE_DAYS}d..." - if command -v tkn &>/dev/null; then - tkn pipelinerun delete -n "${NAMESPACE}" --keep="${PIPELINERUN_AGE_DAYS}" --force 2>/dev/null || true - else - log "tkn not available, skipping PipelineRun cleanup" - fi - - # Prune empty image streams - log "Pruning empty image streams..." 
- oc get imagestream -n "${NAMESPACE}" -o name 2>/dev/null | while read -r is; do - tags=$(oc get "${is}" -n "${NAMESPACE}" -o jsonpath='{.status.tags}' 2>/dev/null || echo "[]") - count=$(python3 -c "import json; print(len(json.loads('${tags}') or []))" 2>/dev/null || echo 0) - if [ "${count}" -eq 0 ]; then - log "Deleting empty ${is}" - oc delete "${is}" -n "${NAMESPACE}" || true - fi - done - - log "Cleanup complete" +data: {} + # Populate from scripts/cleanup.sh — see comment above. diff --git a/scripts/cleanup.sh b/scripts/cleanup.sh index 0d88af4..97598bf 100755 --- a/scripts/cleanup.sh +++ b/scripts/cleanup.sh @@ -6,8 +6,7 @@ set -euo pipefail NAMESPACE="${NAMESPACE:-ab-eval-flow}" POD_AGE_HOURS="${POD_AGE_HOURS:-24}" -PIPELINERUN_AGE_DAYS="${PIPELINERUN_AGE_DAYS:-7}" -DEAD_LETTER_RETENTION_DAYS="${DEAD_LETTER_RETENTION_DAYS:-14}" +PIPELINERUN_KEEP_COUNT="${PIPELINERUN_KEEP_COUNT:-7}" log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $*"; } @@ -37,11 +36,11 @@ print(int((datetime.now(timezone.utc) - start).total_seconds())) fi done -# Delete old PipelineRuns -log "Removing PipelineRuns older than ${PIPELINERUN_AGE_DAYS}d..." +# Delete old PipelineRuns (keep the N most recent by count) +log "Pruning PipelineRuns, keeping most recent ${PIPELINERUN_KEEP_COUNT}..." if command -v tkn &>/dev/null; then tkn pipelinerun delete -n "${NAMESPACE}" \ - --keep="${PIPELINERUN_AGE_DAYS}" \ + --keep="${PIPELINERUN_KEEP_COUNT}" \ --force 2>/dev/null || true fi From c8faa9d94e2dee55847551337b0a9fa9edf0f59a Mon Sep 17 00:00:00 2001 From: gziv Date: Thu, 23 Apr 2026 13:01:30 +0300 Subject: [PATCH 3/4] fix: correct secret name in RBAC and add fail-fast ConfigMap placeholder - resourceNames: llm-credentials -> litellm-credentials (matches config/litellm/secret_template.yaml) - Replace empty data: {} ConfigMap stub with a placeholder script that exits 1 with instructions, preventing silent broken-state on oc apply --- config/rbac.yaml | 2 +- config/storage/cleanup_cronjob.yaml | 8 ++++++-- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/config/rbac.yaml b/config/rbac.yaml index 2867de1..1516dd6 100644 --- a/config/rbac.yaml +++ b/config/rbac.yaml @@ -41,7 +41,7 @@ rules: - apiGroups: [""] resources: [secrets] verbs: [get] - resourceNames: [ab-eval-db-credentials, llm-credentials] + resourceNames: [ab-eval-db-credentials, litellm-credentials] - apiGroups: [""] resources: [configmaps] verbs: [get, list] diff --git a/config/storage/cleanup_cronjob.yaml b/config/storage/cleanup_cronjob.yaml index e6c0a9c..2473525 100644 --- a/config/storage/cleanup_cronjob.yaml +++ b/config/storage/cleanup_cronjob.yaml @@ -65,5 +65,9 @@ metadata: namespace: ab-eval-flow labels: app.kubernetes.io/part-of: abevalflow -data: {} - # Populate from scripts/cleanup.sh — see comment above. +data: + cleanup.sh: | + #!/usr/bin/env bash + echo "ERROR: placeholder — regenerate this ConfigMap from scripts/cleanup.sh:" + echo " oc create configmap cleanup-script --from-file=cleanup.sh=scripts/cleanup.sh -n ab-eval-flow --dry-run=client -o yaml | oc apply -f -" + exit 1 From d34f381dae4201c736526acd84ebffa3fa5438fb Mon Sep 17 00:00:00 2001 From: gziv Date: Thu, 23 Apr 2026 13:04:20 +0300 Subject: [PATCH 4/4] fix: remove ConfigMap stub from cleanup_cronjob.yaml to prevent overwrite MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The placeholder ConfigMap was reapplied by oc apply, overwriting the real script created in the prior step. 
Removed entirely — the ConfigMap is now managed only via oc create configmap --from-file as documented. --- config/storage/cleanup_cronjob.yaml | 21 +++------------------ 1 file changed, 3 insertions(+), 18 deletions(-) diff --git a/config/storage/cleanup_cronjob.yaml b/config/storage/cleanup_cronjob.yaml index 2473525..efd9b07 100644 --- a/config/storage/cleanup_cronjob.yaml +++ b/config/storage/cleanup_cronjob.yaml @@ -49,25 +49,10 @@ spec: configMap: name: cleanup-script defaultMode: 0755 ---- -# The cleanup-script ConfigMap should be created from scripts/cleanup.sh -# to maintain a single source of truth: + +# NOTE: The cleanup-script ConfigMap is NOT bundled in this file. +# Create it from the canonical script before applying this CronJob: # # oc create configmap cleanup-script \ # --from-file=cleanup.sh=scripts/cleanup.sh \ # -n ab-eval-flow --dry-run=client -o yaml | oc apply -f - -# -# Alternatively, apply it manually: -apiVersion: v1 -kind: ConfigMap -metadata: - name: cleanup-script - namespace: ab-eval-flow - labels: - app.kubernetes.io/part-of: abevalflow -data: - cleanup.sh: | - #!/usr/bin/env bash - echo "ERROR: placeholder — regenerate this ConfigMap from scripts/cleanup.sh:" - echo " oc create configmap cleanup-script --from-file=cleanup.sh=scripts/cleanup.sh -n ab-eval-flow --dry-run=client -o yaml | oc apply -f -" - exit 1
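+#
+# One way to verify the live ConfigMap matches the repo copy (the
+# backslash escapes the dot in the data key for jsonpath):
+#
+#   oc get configmap cleanup-script -n ab-eval-flow \
+#     -o jsonpath='{.data.cleanup\.sh}' | diff - scripts/cleanup.sh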