From aebfcc9706ede461e2bcb9d6cc2d0241b32023fc Mon Sep 17 00:00:00 2001 From: gziv Date: Thu, 23 Apr 2026 11:40:30 +0300 Subject: [PATCH 1/4] feat: add infrastructure manifests, network policies, LiteLLM, cleanup, and ops docs RBAC: explicit ServiceAccount, configmaps and PVC permissions. Security: pod security reference, resource quota, four network policies (default-deny, direct-api, litellm, self-hosted). LiteLLM: HA deployment, service, config, and secret template for Vertex AI mode. Storage: workspace and dead-letter PVCs, daily cleanup CronJob for stale pods/PipelineRuns/imagestreams. Docs: failure handling (retries, timeouts, dead-letter, partial recovery) and infrastructure deployment guide. --- Docs/failure_handling.md | 96 +++++++++ Docs/infrastructure_ops.md | 183 ++++++++++++++++++ config/litellm/configmap.yaml | 25 +++ config/litellm/deployment.yaml | 82 ++++++++ config/litellm/secret_template.yaml | 15 ++ config/litellm/service.yaml | 17 ++ config/rbac.yaml | 14 ++ .../security/network_policy_default_deny.yaml | 14 ++ .../security/network_policy_direct_api.yaml | 38 ++++ config/security/network_policy_litellm.yaml | 36 ++++ .../security/network_policy_self_hosted.yaml | 38 ++++ config/security/pod_security.yaml | 40 ++++ config/security/resource_quota.yaml | 17 ++ config/storage/cleanup_cronjob.yaml | 111 +++++++++++ config/storage/dead_letter_pvc.yaml | 17 ++ config/storage/workspace_pvc.yaml | 13 ++ scripts/cleanup.sh | 58 ++++++ 17 files changed, 814 insertions(+) create mode 100644 Docs/failure_handling.md create mode 100644 Docs/infrastructure_ops.md create mode 100644 config/litellm/configmap.yaml create mode 100644 config/litellm/deployment.yaml create mode 100644 config/litellm/secret_template.yaml create mode 100644 config/litellm/service.yaml create mode 100644 config/security/network_policy_default_deny.yaml create mode 100644 config/security/network_policy_direct_api.yaml create mode 100644 config/security/network_policy_litellm.yaml create mode 100644 config/security/network_policy_self_hosted.yaml create mode 100644 config/security/pod_security.yaml create mode 100644 config/security/resource_quota.yaml create mode 100644 config/storage/cleanup_cronjob.yaml create mode 100644 config/storage/dead_letter_pvc.yaml create mode 100644 config/storage/workspace_pvc.yaml create mode 100755 scripts/cleanup.sh diff --git a/Docs/failure_handling.md b/Docs/failure_handling.md new file mode 100644 index 0000000..8db2ce0 --- /dev/null +++ b/Docs/failure_handling.md @@ -0,0 +1,96 @@ +# Failure Handling, Retries, and Idempotency + +## Retry Policy + +Each pipeline task has a retry configuration based on its idempotency +characteristics. Retries are defined in `pipeline.yaml` using Tekton's +`retries` field. + +| Task | Retries | Rationale | +|---|---|---| +| `clone-repo` (ClusterTask) | 2 | Network-dependent, fully idempotent | +| `validate` | 1 | Read-only, deterministic | +| `scaffold` | 1 | Deterministic template rendering | +| `build-push` | 2 | Transient registry/network errors; Buildah is idempotent with layer caching | +| `harbor-eval` | 0 | Long-running (up to 3h), not idempotent — partial trial results would conflict with a fresh run | +| `analyze` | 1 | Reads from workspace, deterministic computation | +| `store-results` | 2 | Database transient errors; upsert logic ensures idempotency via `pipeline_run_id` uniqueness | + +## Timeouts + +Per-task timeouts prevent hung tasks from consuming cluster resources +indefinitely. 
Set in `pipeline.yaml` using Tekton's `timeout` field.
+
+| Task | Timeout | Notes |
+|---|---|---|
+| `clone-repo` | 5m | Large repos may need adjustment |
+| `validate` | 10m | Includes py_compile on all test files |
+| `scaffold` | 10m | Jinja2 rendering + file copy |
+| `build-push` | 30m | Two container builds (treatment + control) |
+| `harbor-eval` | 3h | 20 trials x 2 variants; adjust based on task complexity |
+| `analyze` | 15m | Statistical computation + report generation |
+| `store-results` | 15m | Database writes + observer notifications |
+| **Pipeline total** | 4h | Overall safety net; below the 4h25m sum of per-task maxima, since tasks rarely all run to their limits |
+
+## Non-Retryable Failures
+
+Certain failure categories should not be retried because they indicate
+a problem that will not resolve on its own:
+
+- **Validation failures** — malformed submission, missing required files
+- **Schema violations** — invalid `metadata.yaml`
+- **Build failures** from syntax errors in user code
+- **Harbor evaluation failures** from test assertion errors (the skill genuinely fails)
+
+These are distinguished from transient failures (network timeouts,
+registry 503s, DB connection drops) by exit code conventions:
+
+| Exit Code | Meaning | Retry? |
+|---|---|---|
+| 0 | Success | -- |
+| 1 | Transient/recoverable error | Yes |
+| 2 | Validation/user error (non-retryable) | No |
+| 3 | Infrastructure error (retryable) | Yes |
+
+Scripts should use `sys.exit(2)` for user-facing errors to signal
+Tekton that a retry would not help.
+
+## Dead-Letter Path
+
+When a PipelineRun fails after exhausting retries:
+
+1. **Artifacts are retained** on the workspace PVC (not cleaned up)
+2. Failed run artifacts are copied to the `abevalflow-dead-letter` PVC
+   by the cleanup CronJob (instead of being deleted)
+3. Dead-letter artifacts are retained for 14 days (configurable via
+   `DEAD_LETTER_RETENTION_DAYS`)
+4. PipelineRun metadata remains queryable via `tkn pipelinerun describe`
+   until the cleanup CronJob prunes it (default 7 days)
+
+## Partial-Run Recovery
+
+Tekton does not natively support resuming a pipeline from a specific
+task. The recovery strategy is:
+
+1. **Workspace snapshot** — the PVC retains all intermediate artifacts
+   from completed tasks. A re-run with the same submission will
+   overwrite these, effectively starting fresh.
+
+2. **Harbor checkpointing** — the Harbor fork persists individual trial
+   results to the workspace as they complete. If `harbor-eval` fails
+   mid-way (e.g., after 15 of 20 trials), the partial `result.json`
+   files are available for inspection. However, the analysis step
+   expects a complete set, so a re-run of `harbor-eval` is needed.
+
+3. **Manual re-trigger** — use `tkn pipeline start` with the same
+   parameters to re-run the full pipeline. Since all tasks before the
+   failure point are idempotent, they will complete quickly using
+   cached layers (builds) or deterministic outputs (scaffold).
+
+## Concurrency
+
+- **PipelineRuns** — no built-in Tekton limit; use `ResourceQuota` on
+  the namespace (`config/security/resource_quota.yaml`) to cap total
+  pods, which indirectly limits concurrent runs.
+- **Trial Pods** — Harbor's `OpenShiftEnvironment` controls concurrency
+  via its `max_concurrent` parameter in the job config.
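+
+## Exit-Code Sketch
+
+A minimal sketch of the exit-code convention above, as a task-script
+skeleton. The exception names and `run_task` body are illustrative
+placeholders, not part of the current codebase; an infrastructure
+error would analogously return 3.
+
+```python
+import sys
+
+
+class UserError(Exception):
+    """Validation/user error -- retrying will not help (exit 2)."""
+
+
+class TransientError(Exception):
+    """Recoverable error -- Tekton may retry (exit 1)."""
+
+
+def run_task() -> None:
+    """Placeholder for the real task logic (validate, scaffold, ...)."""
+
+
+def main() -> int:
+    try:
+        run_task()
+    except UserError as exc:
+        print(f"user error (no retry): {exc}", file=sys.stderr)
+        return 2
+    except TransientError as exc:
+        print(f"transient error (retry may help): {exc}", file=sys.stderr)
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
+```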
diff --git a/Docs/infrastructure_ops.md b/Docs/infrastructure_ops.md new file mode 100644 index 0000000..6bfe1fc --- /dev/null +++ b/Docs/infrastructure_ops.md @@ -0,0 +1,183 @@ +# Infrastructure & Operations Guide + +Deployment and operations reference for running ABEvalFlow on OpenShift. + +## Prerequisites + +- OpenShift cluster with Pipelines operator (Tekton) installed +- `oc` CLI authenticated with cluster-admin or namespace-admin +- `tkn` CLI (optional, for manual pipeline triggers and PipelineRun cleanup) + +## Namespace Setup + +```bash +oc new-project ab-eval-flow --description="ABEvalFlow A/B evaluation pipeline" +``` + +## Deployment Order + +Apply manifests in this order to satisfy dependencies: + +```bash +# 1. RBAC — ServiceAccount, Roles, RoleBindings +oc apply -f config/rbac.yaml + +# 2. Security — resource quotas +oc apply -f config/security/resource_quota.yaml + +# 3. Network policies — choose ONE based on LLM mode (see below) +oc apply -f config/security/network_policy_default_deny.yaml +oc apply -f config/security/network_policy_.yaml + +# 4. Storage — workspace and dead-letter PVCs +oc apply -f config/storage/workspace_pvc.yaml +oc apply -f config/storage/dead_letter_pvc.yaml + +# 5. Cleanup CronJob +oc apply -f config/storage/cleanup_cronjob.yaml + +# 6. Tekton tasks +oc apply -f pipeline/tasks/ + +# 7. Tekton triggers +oc apply -f pipeline/triggers/ + +# 8. Expose EventListener +oc create route edge el-submission-listener \ + --service=el-submission-listener \ + --port=http-listener + +# 9. (Optional) LiteLLM — only for Vertex AI mode +oc apply -f config/litellm/ +``` + +## Network Policy Selection + +Choose the network policy that matches your LLM access mode. Always +apply the default-deny policy first, then add the mode-specific allow +policy. + +| LLM Mode | Policies to Apply | Effect | +|---|---|---| +| Direct API key | `default_deny` + `direct_api` | Trial pods can reach provider HTTPS endpoints + DNS | +| Vertex AI + LiteLLM | `default_deny` + `litellm` | Trial pods can only reach in-cluster LiteLLM on port 4000 | +| Self-hosted model | `default_deny` + `self_hosted` | Trial pods can only reach in-cluster model server | + +Trial pods must carry the label `abevalflow/role: trial` for policies +to take effect. The Harbor fork's `OpenShiftEnvironment` should set +this label when creating trial pods. + +## LiteLLM Setup (Vertex AI Mode Only) + +1. Create the credentials secret with your GCP service account key: + +```bash +oc create secret generic litellm-credentials \ + --from-file=GOOGLE_APPLICATION_CREDENTIALS_JSON=path/to/sa-key.json \ + --from-literal=LITELLM_MASTER_KEY=$(openssl rand -hex 32) \ + -n ab-eval-flow +``` + +2. Edit `config/litellm/configmap.yaml` to set your GCP project and + model routing. + +3. Apply the manifests: + +```bash +oc apply -f config/litellm/ +``` + +4. Verify the proxy is healthy: + +```bash +oc get pods -l app.kubernetes.io/name=litellm -n ab-eval-flow +oc port-forward svc/litellm 4000:4000 -n ab-eval-flow & +curl http://localhost:4000/health +``` + +## Storage + +| PVC | Purpose | Default Size | +|---|---|---| +| `abevalflow-workspace` | Shared pipeline workspace (source, builds, results) | 5Gi | +| `abevalflow-dead-letter` | Retained artifacts from failed runs | 2Gi | + +Adjust sizes based on expected submission volume and image sizes. + +## Cleanup CronJob + +Runs daily at 03:00 UTC. 
Configurable via environment variables: + +| Variable | Default | Description | +|---|---|---| +| `NAMESPACE` | `ab-eval-flow` | Target namespace | +| `POD_AGE_HOURS` | `24` | Delete completed/failed trial pods older than this | +| `PIPELINERUN_AGE_DAYS` | `7` | Delete PipelineRuns older than this | + +To run cleanup manually: + +```bash +oc create job --from=cronjob/abevalflow-cleanup manual-cleanup -n ab-eval-flow +``` + +## Resource Quotas + +The default quota (`config/security/resource_quota.yaml`) limits: + +| Resource | Limit | +|---|---| +| Pods | 50 | +| CPU requests | 32 cores | +| Memory requests | 64Gi | +| CPU limits | 64 cores | +| Memory limits | 128Gi | +| PVCs | 10 | + +Adjust based on cluster capacity and expected concurrency. + +## Pod Security + +Trial pods spawned by Harbor's `OpenShiftEnvironment` should follow the +security context documented in `config/security/pod_security.yaml`: + +- `runAsNonRoot: true` +- `allowPrivilegeEscalation: false` +- Drop all Linux capabilities +- Seccomp `RuntimeDefault` +- Resource requests/limits per trial pod + +The Harbor fork currently sets `HOME=/tmp` instead of +`readOnlyRootFilesystem: true` for agent compatibility. This is +documented in `Docs/harbor_openshift_backend.md`. + +## Failure Handling + +See [failure_handling.md](failure_handling.md) for retry policies, +timeouts, dead-letter path, and partial-run recovery. + +## Verification + +After deploying, verify the infrastructure: + +```bash +# Check ServiceAccount +oc get sa pipeline -n ab-eval-flow + +# Check RBAC +oc auth can-i create pods --as=system:serviceaccount:ab-eval-flow:pipeline -n ab-eval-flow + +# Check network policies +oc get networkpolicy -n ab-eval-flow + +# Check PVCs +oc get pvc -n ab-eval-flow + +# Check CronJob +oc get cronjob -n ab-eval-flow + +# Check EventListener +oc get el,route -n ab-eval-flow + +# Check resource quota usage +oc describe resourcequota eval-resource-quota -n ab-eval-flow +``` diff --git a/config/litellm/configmap.yaml b/config/litellm/configmap.yaml new file mode 100644 index 0000000..5652666 --- /dev/null +++ b/config/litellm/configmap.yaml @@ -0,0 +1,25 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: litellm-config + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + app.kubernetes.io/name: litellm +data: + config.yaml: | + model_list: + - model_name: claude-sonnet + litellm_params: + model: vertex_ai/claude-3-5-sonnet@20241022 + vertex_project: "" + vertex_location: "global" + + litellm_settings: + drop_params: true + set_verbose: false + num_retries: 2 + request_timeout: 120 + + general_settings: + master_key: "os.environ/LITELLM_MASTER_KEY" diff --git a/config/litellm/deployment.yaml b/config/litellm/deployment.yaml new file mode 100644 index 0000000..9b5441d --- /dev/null +++ b/config/litellm/deployment.yaml @@ -0,0 +1,82 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: litellm + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + app.kubernetes.io/name: litellm +spec: + replicas: 2 + selector: + matchLabels: + app.kubernetes.io/name: litellm + template: + metadata: + labels: + app.kubernetes.io/name: litellm + app.kubernetes.io/part-of: abevalflow + spec: + serviceAccountName: pipeline + containers: + - name: litellm + image: ghcr.io/berriai/litellm:main-latest + ports: + - containerPort: 4000 + name: http + args: + - "--config" + - "/app/config/config.yaml" + - "--port" + - "4000" + env: + - name: LITELLM_MASTER_KEY + valueFrom: + secretKeyRef: + name: 
litellm-credentials + key: LITELLM_MASTER_KEY + - name: GOOGLE_APPLICATION_CREDENTIALS + value: /app/credentials/gcp-sa-key.json + volumeMounts: + - name: config + mountPath: /app/config + readOnly: true + - name: credentials + mountPath: /app/credentials + readOnly: true + resources: + requests: + cpu: "250m" + memory: "256Mi" + limits: + cpu: "1" + memory: "1Gi" + readinessProbe: + httpGet: + path: /health + port: 4000 + initialDelaySeconds: 10 + periodSeconds: 10 + livenessProbe: + httpGet: + path: /health + port: 4000 + initialDelaySeconds: 15 + periodSeconds: 30 + securityContext: + runAsNonRoot: true + allowPrivilegeEscalation: false + capabilities: + drop: ["ALL"] + seccompProfile: + type: RuntimeDefault + volumes: + - name: config + configMap: + name: litellm-config + - name: credentials + secret: + secretName: litellm-credentials + items: + - key: GOOGLE_APPLICATION_CREDENTIALS_JSON + path: gcp-sa-key.json diff --git a/config/litellm/secret_template.yaml b/config/litellm/secret_template.yaml new file mode 100644 index 0000000..01b28f3 --- /dev/null +++ b/config/litellm/secret_template.yaml @@ -0,0 +1,15 @@ +apiVersion: v1 +kind: Secret +metadata: + name: litellm-credentials + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + app.kubernetes.io/name: litellm +type: Opaque +stringData: + # Vertex AI service account key (JSON) + # Replace with actual credentials before applying + GOOGLE_APPLICATION_CREDENTIALS_JSON: "" + # Master key for LiteLLM admin API + LITELLM_MASTER_KEY: "" diff --git a/config/litellm/service.yaml b/config/litellm/service.yaml new file mode 100644 index 0000000..dbf06c7 --- /dev/null +++ b/config/litellm/service.yaml @@ -0,0 +1,17 @@ +apiVersion: v1 +kind: Service +metadata: + name: litellm + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + app.kubernetes.io/name: litellm +spec: + type: ClusterIP + selector: + app.kubernetes.io/name: litellm + ports: + - name: http + port: 4000 + targetPort: 4000 + protocol: TCP diff --git a/config/rbac.yaml b/config/rbac.yaml index 0b313e8..474ed40 100644 --- a/config/rbac.yaml +++ b/config/rbac.yaml @@ -1,3 +1,11 @@ +apiVersion: v1 +kind: ServiceAccount +metadata: + name: pipeline + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +--- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: @@ -33,6 +41,12 @@ rules: - apiGroups: [""] resources: [secrets] verbs: [get] + - apiGroups: [""] + resources: [configmaps] + verbs: [get, list] + - apiGroups: [""] + resources: [persistentvolumeclaims] + verbs: [get, list, create, delete] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding diff --git a/config/security/network_policy_default_deny.yaml b/config/security/network_policy_default_deny.yaml new file mode 100644 index 0000000..ec45d8c --- /dev/null +++ b/config/security/network_policy_default_deny.yaml @@ -0,0 +1,14 @@ +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: trial-pod-default-deny-egress + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +spec: + podSelector: + matchLabels: + abevalflow/role: trial + policyTypes: + - Egress + egress: [] diff --git a/config/security/network_policy_direct_api.yaml b/config/security/network_policy_direct_api.yaml new file mode 100644 index 0000000..39502f0 --- /dev/null +++ b/config/security/network_policy_direct_api.yaml @@ -0,0 +1,38 @@ +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: trial-pod-direct-api-egress + 
namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + annotations: + abevalflow/llm-mode: direct-api + abevalflow/note: >- + Allows trial pods to reach LLM provider APIs and DNS. + Apply this policy INSTEAD OF default-deny when using direct API keys. +spec: + podSelector: + matchLabels: + abevalflow/role: trial + policyTypes: + - Egress + egress: + # DNS resolution + - to: + - namespaceSelector: {} + ports: + - protocol: UDP + port: 53 + - protocol: TCP + port: 53 + # HTTPS to LLM providers + - to: + - ipBlock: + cidr: 0.0.0.0/0 + except: + - 10.0.0.0/8 + - 172.16.0.0/12 + - 192.168.0.0/16 + ports: + - protocol: TCP + port: 443 diff --git a/config/security/network_policy_litellm.yaml b/config/security/network_policy_litellm.yaml new file mode 100644 index 0000000..157ffde --- /dev/null +++ b/config/security/network_policy_litellm.yaml @@ -0,0 +1,36 @@ +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: trial-pod-litellm-egress + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + annotations: + abevalflow/llm-mode: vertex+litellm + abevalflow/note: >- + Allows trial pods to reach only the in-cluster LiteLLM proxy + and DNS. No external egress. Apply INSTEAD OF default-deny + when using Vertex AI mode. +spec: + podSelector: + matchLabels: + abevalflow/role: trial + policyTypes: + - Egress + egress: + # DNS resolution + - to: + - namespaceSelector: {} + ports: + - protocol: UDP + port: 53 + - protocol: TCP + port: 53 + # LiteLLM proxy (same namespace) + - to: + - podSelector: + matchLabels: + app.kubernetes.io/name: litellm + ports: + - protocol: TCP + port: 4000 diff --git a/config/security/network_policy_self_hosted.yaml b/config/security/network_policy_self_hosted.yaml new file mode 100644 index 0000000..6828a31 --- /dev/null +++ b/config/security/network_policy_self_hosted.yaml @@ -0,0 +1,38 @@ +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: trial-pod-self-hosted-egress + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + annotations: + abevalflow/llm-mode: self-hosted + abevalflow/note: >- + Allows trial pods to reach only the in-cluster model endpoint + and DNS. No external egress. Apply INSTEAD OF default-deny + when using a self-hosted model (vLLM, Ollama, etc.). + Update the podSelector or namespaceSelector to match your + model serving deployment. +spec: + podSelector: + matchLabels: + abevalflow/role: trial + policyTypes: + - Egress + egress: + # DNS resolution + - to: + - namespaceSelector: {} + ports: + - protocol: UDP + port: 53 + - protocol: TCP + port: 53 + # Self-hosted model endpoint (adjust selector to match your deployment) + - to: + - podSelector: + matchLabels: + app.kubernetes.io/name: model-server + ports: + - protocol: TCP + port: 8080 diff --git a/config/security/pod_security.yaml b/config/security/pod_security.yaml new file mode 100644 index 0000000..62f9eb3 --- /dev/null +++ b/config/security/pod_security.yaml @@ -0,0 +1,40 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: trial-pod-security-reference + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +data: + # Reference security context for trial Pods spawned by Harbor's + # OpenShiftEnvironment. The Harbor fork reads these values from + # environment_kwargs in the job config; this ConfigMap documents + # the intended baseline. + # + # NOTE: The Harbor fork does NOT currently set readOnlyRootFilesystem + # because many agents write to $HOME. 
The mitigation is HOME=/tmp + # with an emptyDir mount. If agent compatibility allows, enable + # readOnlyRootFilesystem in the job config's environment_kwargs. + security_context.yaml: | + securityContext: + runAsNonRoot: true + allowPrivilegeEscalation: false + capabilities: + drop: ["ALL"] + seccompProfile: + type: RuntimeDefault + volumeMounts: + - name: tmp + mountPath: /tmp + volumes: + - name: tmp + emptyDir: + sizeLimit: 512Mi + resource_limits.yaml: | + resources: + requests: + cpu: "250m" + memory: "512Mi" + limits: + cpu: "2" + memory: "4Gi" diff --git a/config/security/resource_quota.yaml b/config/security/resource_quota.yaml new file mode 100644 index 0000000..bf1c7d7 --- /dev/null +++ b/config/security/resource_quota.yaml @@ -0,0 +1,17 @@ +apiVersion: v1 +kind: ResourceQuota +metadata: + name: eval-resource-quota + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +spec: + hard: + # Trial pods: 40 max concurrent (20 treatment + 20 control) + # plus pipeline task pods and EventListener + pods: "50" + requests.cpu: "32" + requests.memory: "64Gi" + limits.cpu: "64" + limits.memory: "128Gi" + persistentvolumeclaims: "10" diff --git a/config/storage/cleanup_cronjob.yaml b/config/storage/cleanup_cronjob.yaml new file mode 100644 index 0000000..51f305a --- /dev/null +++ b/config/storage/cleanup_cronjob.yaml @@ -0,0 +1,111 @@ +apiVersion: batch/v1 +kind: CronJob +metadata: + name: abevalflow-cleanup + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +spec: + schedule: "0 3 * * *" + concurrencyPolicy: Forbid + successfulJobsHistoryLimit: 3 + failedJobsHistoryLimit: 3 + jobTemplate: + spec: + backoffLimit: 1 + template: + spec: + serviceAccountName: pipeline + restartPolicy: OnFailure + containers: + - name: cleanup + image: image-registry.openshift-image-registry.svc:5000/openshift/cli:latest + command: ["/bin/bash", "/scripts/cleanup.sh"] + env: + - name: NAMESPACE + value: ab-eval-flow + - name: POD_AGE_HOURS + value: "24" + - name: PIPELINERUN_AGE_DAYS + value: "7" + volumeMounts: + - name: scripts + mountPath: /scripts + readOnly: true + resources: + requests: + cpu: "50m" + memory: "64Mi" + limits: + cpu: "200m" + memory: "128Mi" + securityContext: + runAsNonRoot: true + allowPrivilegeEscalation: false + capabilities: + drop: ["ALL"] + volumes: + - name: scripts + configMap: + name: cleanup-script + defaultMode: 0755 +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: cleanup-script + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +data: + cleanup.sh: | + #!/usr/bin/env bash + set -euo pipefail + + NAMESPACE="${NAMESPACE:-ab-eval-flow}" + POD_AGE_HOURS="${POD_AGE_HOURS:-24}" + PIPELINERUN_AGE_DAYS="${PIPELINERUN_AGE_DAYS:-7}" + + log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $*"; } + + log "Starting cleanup in namespace=${NAMESPACE}" + + # Delete completed/failed trial pods older than threshold + log "Removing completed/failed trial pods older than ${POD_AGE_HOURS}h..." 
+ threshold=$((POD_AGE_HOURS * 3600)) + for phase in Succeeded Failed; do + oc get pods -n "${NAMESPACE}" --field-selector="status.phase=${phase}" \ + -l abevalflow/role=trial -o name 2>/dev/null | while read -r pod; do + age=$(oc get "${pod}" -n "${NAMESPACE}" -o jsonpath='{.metadata.creationTimestamp}' 2>/dev/null) + age_sec=$(python3 -c " + from datetime import datetime, timezone + t = datetime.fromisoformat('${age}'.replace('Z','+00:00')) + print(int((datetime.now(timezone.utc) - t).total_seconds())) + " 2>/dev/null || echo 0) + if [ "${age_sec}" -gt "${threshold}" ]; then + log "Deleting ${pod} (age=${age_sec}s, phase=${phase})" + oc delete "${pod}" -n "${NAMESPACE}" --grace-period=0 || true + fi + done + done + + # Delete old PipelineRuns (keep recent N days worth) + log "Removing PipelineRuns older than ${PIPELINERUN_AGE_DAYS}d..." + if command -v tkn &>/dev/null; then + tkn pipelinerun delete -n "${NAMESPACE}" --keep="${PIPELINERUN_AGE_DAYS}" --force 2>/dev/null || true + else + log "tkn not available, skipping PipelineRun cleanup" + fi + + # Prune empty image streams + log "Pruning empty image streams..." + oc get imagestream -n "${NAMESPACE}" -o name 2>/dev/null | while read -r is; do + tags=$(oc get "${is}" -n "${NAMESPACE}" -o jsonpath='{.status.tags}' 2>/dev/null || echo "[]") + count=$(python3 -c "import json; print(len(json.loads('${tags}') or []))" 2>/dev/null || echo 0) + if [ "${count}" -eq 0 ]; then + log "Deleting empty ${is}" + oc delete "${is}" -n "${NAMESPACE}" || true + fi + done + + log "Cleanup complete" diff --git a/config/storage/dead_letter_pvc.yaml b/config/storage/dead_letter_pvc.yaml new file mode 100644 index 0000000..0e55fff --- /dev/null +++ b/config/storage/dead_letter_pvc.yaml @@ -0,0 +1,17 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: abevalflow-dead-letter + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + annotations: + abevalflow/note: >- + Retains artifacts from failed PipelineRuns for debugging. + Cleaned up by the cleanup CronJob after retention period. +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 2Gi diff --git a/config/storage/workspace_pvc.yaml b/config/storage/workspace_pvc.yaml new file mode 100644 index 0000000..6ff22d3 --- /dev/null +++ b/config/storage/workspace_pvc.yaml @@ -0,0 +1,13 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: abevalflow-workspace + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 5Gi diff --git a/scripts/cleanup.sh b/scripts/cleanup.sh new file mode 100755 index 0000000..0d88af4 --- /dev/null +++ b/scripts/cleanup.sh @@ -0,0 +1,58 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Cleanup script for ab-eval-flow namespace. +# Intended to run as a CronJob to remove stale resources. + +NAMESPACE="${NAMESPACE:-ab-eval-flow}" +POD_AGE_HOURS="${POD_AGE_HOURS:-24}" +PIPELINERUN_AGE_DAYS="${PIPELINERUN_AGE_DAYS:-7}" +DEAD_LETTER_RETENTION_DAYS="${DEAD_LETTER_RETENTION_DAYS:-14}" + +log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $*"; } + +log "Starting cleanup in namespace=${NAMESPACE}" + +# Delete completed/failed trial pods older than threshold +log "Removing completed/failed pods older than ${POD_AGE_HOURS}h..." 
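+# Trial pods are selected by the abevalflow/role=trial label; each one is
+# age-checked against POD_AGE_HOURS before deletion.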
+completed=$(oc get pods -n "${NAMESPACE}" \
+  --field-selector=status.phase==Succeeded -l abevalflow/role=trial \
+  -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || true)
+failed=$(oc get pods -n "${NAMESPACE}" \
+  --field-selector=status.phase==Failed -l abevalflow/role=trial \
+  -o jsonpath='{.items[*].metadata.name}' 2>/dev/null || true)
+
+for pod in ${completed} ${failed}; do
+  age_seconds=$(oc get pod "${pod}" -n "${NAMESPACE}" \
+    -o jsonpath='{.status.startTime}' 2>/dev/null | \
+    xargs -I{} python3 -c "
+from datetime import datetime, timezone
+start = datetime.fromisoformat('{}'.replace('Z', '+00:00'))
+print(int((datetime.now(timezone.utc) - start).total_seconds()))
+" 2>/dev/null || echo 0)
+  threshold=$((POD_AGE_HOURS * 3600))
+  if [ "${age_seconds}" -gt "${threshold}" ]; then
+    log "Deleting pod ${pod} (age=${age_seconds}s)"
+    oc delete pod "${pod}" -n "${NAMESPACE}" --grace-period=0 || true
+  fi
+done
+
+# Delete old PipelineRuns
+log "Removing PipelineRuns older than ${PIPELINERUN_AGE_DAYS}d..."
+if command -v tkn &>/dev/null; then
+  tkn pipelinerun delete -n "${NAMESPACE}" \
+    --keep="${PIPELINERUN_AGE_DAYS}" \
+    --force 2>/dev/null || true
+fi
+
+# Prune internal registry images for deleted submissions
+log "Pruning unused image streams..."
+oc get imagestream -n "${NAMESPACE}" -o name 2>/dev/null | while read -r is; do
+  tag_count=$(oc get "${is}" -n "${NAMESPACE}" -o jsonpath='{.status.tags}' 2>/dev/null | python3 -c "import sys,json; print(len(json.load(sys.stdin) or []))" 2>/dev/null || echo 0)
+  if [ "${tag_count}" -eq 0 ]; then
+    log "Deleting empty imagestream ${is}"
+    oc delete "${is}" -n "${NAMESPACE}" || true
+  fi
+done
+
+log "Cleanup complete"

From 1ea3530341059976d65a1dd97556c7287ce430bc Mon Sep 17 00:00:00 2001
From: gziv
Date: Thu, 23 Apr 2026 12:57:09 +0300
Subject: [PATCH 2/4] fix: address PR #13 review feedback on infra-ops manifests

- Rename PIPELINERUN_AGE_DAYS to PIPELINERUN_KEEP_COUNT (count semantics)
- Clarify dead-letter PVC as reserved for future use, not yet automated
- Mark per-task retries/timeouts as target policy (not yet applied)
- Change network policy annotations from INSTEAD OF to IN ADDITION TO
- Pin LiteLLM image to main-v1.82.6 instead of main-latest
- Add dedicated litellm ServiceAccount instead of reusing pipeline SA
- Scope secrets get verb with resourceNames in RBAC
- Remove inline cleanup script from CronJob ConfigMap (single source)
- Rename pod_security.yaml to pod_security_reference.yaml
- Add port guidance note to self-hosted network policy
---
 Docs/failure_handling.md                      | 33 +++++-----
 Docs/infrastructure_ops.md                    | 13 ++--
 config/litellm/deployment.yaml                |  4 +-
 config/litellm/serviceaccount.yaml            |  8 +++
 config/rbac.yaml                              |  1 +
 .../security/network_policy_direct_api.yaml   |  2 +-
 config/security/network_policy_litellm.yaml   |  2 +-
 .../security/network_policy_self_hosted.yaml  |  5 +-
 ...urity.yaml => pod_security_reference.yaml} |  0
 config/storage/cleanup_cronjob.yaml           | 64 ++++---------------
 scripts/cleanup.sh                            |  9 ++-
 11 files changed, 58 insertions(+), 83 deletions(-)
 create mode 100644 config/litellm/serviceaccount.yaml
 rename config/security/{pod_security.yaml => pod_security_reference.yaml} (100%)

diff --git a/Docs/failure_handling.md b/Docs/failure_handling.md
index 8db2ce0..ed20b3e 100644
--- a/Docs/failure_handling.md
+++ b/Docs/failure_handling.md
@@ -1,12 +1,13 @@
 # Failure Handling, Retries, and Idempotency
 
-## Retry Policy
+## Retry Policy (Target — Not Yet Applied)
 
-Each 
pipeline task has a retry configuration based on its idempotency -characteristics. Retries are defined in `pipeline.yaml` using Tekton's -`retries` field. +> **Note:** The per-task retry values below are the target policy. They +> will be added to `pipeline.yaml` once the pipeline assembly PR merges +> and per-task `retries` fields are wired. Currently `pipeline.yaml` +> only sets aggregate `spec.timeouts`. -| Task | Retries | Rationale | +| Task | Planned Retries | Rationale | |---|---|---| | `clone-repo` (ClusterTask) | 2 | Network-dependent, fully idempotent | | `validate` | 1 | Read-only, deterministic | @@ -16,12 +17,13 @@ characteristics. Retries are defined in `pipeline.yaml` using Tekton's | `analyze` | 1 | Reads from workspace, deterministic computation | | `store-results` | 2 | Database transient errors; upsert logic ensures idempotency via `pipeline_run_id` uniqueness | -## Timeouts +## Timeouts (Target — Not Yet Applied) -Per-task timeouts prevent hung tasks from consuming cluster resources -indefinitely. Set in `pipeline.yaml` using Tekton's `timeout` field. +> **Note:** The per-task timeouts below are the target policy. They will +> be added to `pipeline.yaml` alongside the retry values. Currently only +> aggregate timeouts are set: `pipeline: 4h`, `tasks: 3h`. -| Task | Timeout | Notes | +| Task | Planned Timeout | Notes | |---|---|---| | `clone-repo` | 5m | Large repos may need adjustment | | `validate` | 10m | Includes py_compile on all test files | @@ -60,12 +62,13 @@ Tekton that a retry would not help. When a PipelineRun fails after exhausting retries: 1. **Artifacts are retained** on the workspace PVC (not cleaned up) -2. Failed run artifacts are copied to the `abevalflow-dead-letter` PVC - by the cleanup CronJob (instead of being deleted) -3. Dead-letter artifacts are retained for 14 days (configurable via - `DEAD_LETTER_RETENTION_DAYS`) -4. PipelineRun metadata remains queryable via `tkn pipelinerun describe` - until the cleanup CronJob prunes it (default 7 days) +2. The `abevalflow-dead-letter` PVC is provisioned and reserved for + failed-run artifact storage. Automatic copy logic is **not yet + implemented** — operators can manually copy artifacts from the + workspace PVC for post-mortem analysis. +3. PipelineRun metadata remains queryable via `tkn pipelinerun describe` + until the cleanup CronJob prunes it (keeps the 7 most recent by + count, configurable via `PIPELINERUN_KEEP_COUNT`) ## Partial-Run Recovery diff --git a/Docs/infrastructure_ops.md b/Docs/infrastructure_ops.md index 6bfe1fc..87619f8 100644 --- a/Docs/infrastructure_ops.md +++ b/Docs/infrastructure_ops.md @@ -33,7 +33,10 @@ oc apply -f config/security/network_policy_.yaml oc apply -f config/storage/workspace_pvc.yaml oc apply -f config/storage/dead_letter_pvc.yaml -# 5. Cleanup CronJob +# 5. Cleanup — create ConfigMap from script, then apply CronJob +oc create configmap cleanup-script \ + --from-file=cleanup.sh=scripts/cleanup.sh \ + -n ab-eval-flow --dry-run=client -o yaml | oc apply -f - oc apply -f config/storage/cleanup_cronjob.yaml # 6. Tekton tasks @@ -48,6 +51,8 @@ oc create route edge el-submission-listener \ --port=http-listener # 9. (Optional) LiteLLM — only for Vertex AI mode +# Creates a dedicated litellm ServiceAccount, Deployment, Service, and ConfigMap. +# Requires the litellm-credentials Secret (see LiteLLM Setup below). 
oc apply -f config/litellm/ ``` @@ -100,7 +105,7 @@ curl http://localhost:4000/health | PVC | Purpose | Default Size | |---|---|---| | `abevalflow-workspace` | Shared pipeline workspace (source, builds, results) | 5Gi | -| `abevalflow-dead-letter` | Retained artifacts from failed runs | 2Gi | +| `abevalflow-dead-letter` | Reserved for failed-run artifacts (manual use for now) | 2Gi | Adjust sizes based on expected submission volume and image sizes. @@ -112,7 +117,7 @@ Runs daily at 03:00 UTC. Configurable via environment variables: |---|---|---| | `NAMESPACE` | `ab-eval-flow` | Target namespace | | `POD_AGE_HOURS` | `24` | Delete completed/failed trial pods older than this | -| `PIPELINERUN_AGE_DAYS` | `7` | Delete PipelineRuns older than this | +| `PIPELINERUN_KEEP_COUNT` | `7` | Keep the N most recent PipelineRuns, delete the rest | To run cleanup manually: @@ -138,7 +143,7 @@ Adjust based on cluster capacity and expected concurrency. ## Pod Security Trial pods spawned by Harbor's `OpenShiftEnvironment` should follow the -security context documented in `config/security/pod_security.yaml`: +security context documented in `config/security/pod_security_reference.yaml`: - `runAsNonRoot: true` - `allowPrivilegeEscalation: false` diff --git a/config/litellm/deployment.yaml b/config/litellm/deployment.yaml index 9b5441d..748c064 100644 --- a/config/litellm/deployment.yaml +++ b/config/litellm/deployment.yaml @@ -17,10 +17,10 @@ spec: app.kubernetes.io/name: litellm app.kubernetes.io/part-of: abevalflow spec: - serviceAccountName: pipeline + serviceAccountName: litellm containers: - name: litellm - image: ghcr.io/berriai/litellm:main-latest + image: ghcr.io/berriai/litellm:main-v1.82.6 ports: - containerPort: 4000 name: http diff --git a/config/litellm/serviceaccount.yaml b/config/litellm/serviceaccount.yaml new file mode 100644 index 0000000..d6cb39a --- /dev/null +++ b/config/litellm/serviceaccount.yaml @@ -0,0 +1,8 @@ +apiVersion: v1 +kind: ServiceAccount +metadata: + name: litellm + namespace: ab-eval-flow + labels: + app.kubernetes.io/part-of: abevalflow + app.kubernetes.io/name: litellm diff --git a/config/rbac.yaml b/config/rbac.yaml index 474ed40..2867de1 100644 --- a/config/rbac.yaml +++ b/config/rbac.yaml @@ -41,6 +41,7 @@ rules: - apiGroups: [""] resources: [secrets] verbs: [get] + resourceNames: [ab-eval-db-credentials, llm-credentials] - apiGroups: [""] resources: [configmaps] verbs: [get, list] diff --git a/config/security/network_policy_direct_api.yaml b/config/security/network_policy_direct_api.yaml index 39502f0..7f63097 100644 --- a/config/security/network_policy_direct_api.yaml +++ b/config/security/network_policy_direct_api.yaml @@ -9,7 +9,7 @@ metadata: abevalflow/llm-mode: direct-api abevalflow/note: >- Allows trial pods to reach LLM provider APIs and DNS. - Apply this policy INSTEAD OF default-deny when using direct API keys. + Apply IN ADDITION TO default-deny when using direct API keys. spec: podSelector: matchLabels: diff --git a/config/security/network_policy_litellm.yaml b/config/security/network_policy_litellm.yaml index 157ffde..cd82b8b 100644 --- a/config/security/network_policy_litellm.yaml +++ b/config/security/network_policy_litellm.yaml @@ -9,7 +9,7 @@ metadata: abevalflow/llm-mode: vertex+litellm abevalflow/note: >- Allows trial pods to reach only the in-cluster LiteLLM proxy - and DNS. No external egress. Apply INSTEAD OF default-deny + and DNS. No external egress. Apply IN ADDITION TO default-deny when using Vertex AI mode. 
spec: podSelector: diff --git a/config/security/network_policy_self_hosted.yaml b/config/security/network_policy_self_hosted.yaml index 6828a31..35f4e05 100644 --- a/config/security/network_policy_self_hosted.yaml +++ b/config/security/network_policy_self_hosted.yaml @@ -9,10 +9,11 @@ metadata: abevalflow/llm-mode: self-hosted abevalflow/note: >- Allows trial pods to reach only the in-cluster model endpoint - and DNS. No external egress. Apply INSTEAD OF default-deny + and DNS. No external egress. Apply IN ADDITION TO default-deny when using a self-hosted model (vLLM, Ollama, etc.). Update the podSelector or namespaceSelector to match your - model serving deployment. + model serving deployment. Adjust port to match your server + (vLLM=8000, Ollama=11434, TGI=80). spec: podSelector: matchLabels: diff --git a/config/security/pod_security.yaml b/config/security/pod_security_reference.yaml similarity index 100% rename from config/security/pod_security.yaml rename to config/security/pod_security_reference.yaml diff --git a/config/storage/cleanup_cronjob.yaml b/config/storage/cleanup_cronjob.yaml index 51f305a..e6c0a9c 100644 --- a/config/storage/cleanup_cronjob.yaml +++ b/config/storage/cleanup_cronjob.yaml @@ -26,7 +26,7 @@ spec: value: ab-eval-flow - name: POD_AGE_HOURS value: "24" - - name: PIPELINERUN_AGE_DAYS + - name: PIPELINERUN_KEEP_COUNT value: "7" volumeMounts: - name: scripts @@ -50,6 +50,14 @@ spec: name: cleanup-script defaultMode: 0755 --- +# The cleanup-script ConfigMap should be created from scripts/cleanup.sh +# to maintain a single source of truth: +# +# oc create configmap cleanup-script \ +# --from-file=cleanup.sh=scripts/cleanup.sh \ +# -n ab-eval-flow --dry-run=client -o yaml | oc apply -f - +# +# Alternatively, apply it manually: apiVersion: v1 kind: ConfigMap metadata: @@ -57,55 +65,5 @@ metadata: namespace: ab-eval-flow labels: app.kubernetes.io/part-of: abevalflow -data: - cleanup.sh: | - #!/usr/bin/env bash - set -euo pipefail - - NAMESPACE="${NAMESPACE:-ab-eval-flow}" - POD_AGE_HOURS="${POD_AGE_HOURS:-24}" - PIPELINERUN_AGE_DAYS="${PIPELINERUN_AGE_DAYS:-7}" - - log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $*"; } - - log "Starting cleanup in namespace=${NAMESPACE}" - - # Delete completed/failed trial pods older than threshold - log "Removing completed/failed trial pods older than ${POD_AGE_HOURS}h..." - threshold=$((POD_AGE_HOURS * 3600)) - for phase in Succeeded Failed; do - oc get pods -n "${NAMESPACE}" --field-selector="status.phase=${phase}" \ - -l abevalflow/role=trial -o name 2>/dev/null | while read -r pod; do - age=$(oc get "${pod}" -n "${NAMESPACE}" -o jsonpath='{.metadata.creationTimestamp}' 2>/dev/null) - age_sec=$(python3 -c " - from datetime import datetime, timezone - t = datetime.fromisoformat('${age}'.replace('Z','+00:00')) - print(int((datetime.now(timezone.utc) - t).total_seconds())) - " 2>/dev/null || echo 0) - if [ "${age_sec}" -gt "${threshold}" ]; then - log "Deleting ${pod} (age=${age_sec}s, phase=${phase})" - oc delete "${pod}" -n "${NAMESPACE}" --grace-period=0 || true - fi - done - done - - # Delete old PipelineRuns (keep recent N days worth) - log "Removing PipelineRuns older than ${PIPELINERUN_AGE_DAYS}d..." - if command -v tkn &>/dev/null; then - tkn pipelinerun delete -n "${NAMESPACE}" --keep="${PIPELINERUN_AGE_DAYS}" --force 2>/dev/null || true - else - log "tkn not available, skipping PipelineRun cleanup" - fi - - # Prune empty image streams - log "Pruning empty image streams..." 
- oc get imagestream -n "${NAMESPACE}" -o name 2>/dev/null | while read -r is; do - tags=$(oc get "${is}" -n "${NAMESPACE}" -o jsonpath='{.status.tags}' 2>/dev/null || echo "[]") - count=$(python3 -c "import json; print(len(json.loads('${tags}') or []))" 2>/dev/null || echo 0) - if [ "${count}" -eq 0 ]; then - log "Deleting empty ${is}" - oc delete "${is}" -n "${NAMESPACE}" || true - fi - done - - log "Cleanup complete" +data: {} + # Populate from scripts/cleanup.sh — see comment above. diff --git a/scripts/cleanup.sh b/scripts/cleanup.sh index 0d88af4..97598bf 100755 --- a/scripts/cleanup.sh +++ b/scripts/cleanup.sh @@ -6,8 +6,7 @@ set -euo pipefail NAMESPACE="${NAMESPACE:-ab-eval-flow}" POD_AGE_HOURS="${POD_AGE_HOURS:-24}" -PIPELINERUN_AGE_DAYS="${PIPELINERUN_AGE_DAYS:-7}" -DEAD_LETTER_RETENTION_DAYS="${DEAD_LETTER_RETENTION_DAYS:-14}" +PIPELINERUN_KEEP_COUNT="${PIPELINERUN_KEEP_COUNT:-7}" log() { echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $*"; } @@ -37,11 +36,11 @@ print(int((datetime.now(timezone.utc) - start).total_seconds())) fi done -# Delete old PipelineRuns -log "Removing PipelineRuns older than ${PIPELINERUN_AGE_DAYS}d..." +# Delete old PipelineRuns (keep the N most recent by count) +log "Pruning PipelineRuns, keeping most recent ${PIPELINERUN_KEEP_COUNT}..." if command -v tkn &>/dev/null; then tkn pipelinerun delete -n "${NAMESPACE}" \ - --keep="${PIPELINERUN_AGE_DAYS}" \ + --keep="${PIPELINERUN_KEEP_COUNT}" \ --force 2>/dev/null || true fi From c8faa9d94e2dee55847551337b0a9fa9edf0f59a Mon Sep 17 00:00:00 2001 From: gziv Date: Thu, 23 Apr 2026 13:01:30 +0300 Subject: [PATCH 3/4] fix: correct secret name in RBAC and add fail-fast ConfigMap placeholder - resourceNames: llm-credentials -> litellm-credentials (matches config/litellm/secret_template.yaml) - Replace empty data: {} ConfigMap stub with a placeholder script that exits 1 with instructions, preventing silent broken-state on oc apply --- config/rbac.yaml | 2 +- config/storage/cleanup_cronjob.yaml | 8 ++++++-- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/config/rbac.yaml b/config/rbac.yaml index 2867de1..1516dd6 100644 --- a/config/rbac.yaml +++ b/config/rbac.yaml @@ -41,7 +41,7 @@ rules: - apiGroups: [""] resources: [secrets] verbs: [get] - resourceNames: [ab-eval-db-credentials, llm-credentials] + resourceNames: [ab-eval-db-credentials, litellm-credentials] - apiGroups: [""] resources: [configmaps] verbs: [get, list] diff --git a/config/storage/cleanup_cronjob.yaml b/config/storage/cleanup_cronjob.yaml index e6c0a9c..2473525 100644 --- a/config/storage/cleanup_cronjob.yaml +++ b/config/storage/cleanup_cronjob.yaml @@ -65,5 +65,9 @@ metadata: namespace: ab-eval-flow labels: app.kubernetes.io/part-of: abevalflow -data: {} - # Populate from scripts/cleanup.sh — see comment above. +data: + cleanup.sh: | + #!/usr/bin/env bash + echo "ERROR: placeholder — regenerate this ConfigMap from scripts/cleanup.sh:" + echo " oc create configmap cleanup-script --from-file=cleanup.sh=scripts/cleanup.sh -n ab-eval-flow --dry-run=client -o yaml | oc apply -f -" + exit 1 From d34f381dae4201c736526acd84ebffa3fa5438fb Mon Sep 17 00:00:00 2001 From: gziv Date: Thu, 23 Apr 2026 13:04:20 +0300 Subject: [PATCH 4/4] fix: remove ConfigMap stub from cleanup_cronjob.yaml to prevent overwrite MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The placeholder ConfigMap was reapplied by oc apply, overwriting the real script created in the prior step. 
Removed entirely — the ConfigMap is now managed only via oc create configmap --from-file as documented. --- config/storage/cleanup_cronjob.yaml | 21 +++------------------ 1 file changed, 3 insertions(+), 18 deletions(-) diff --git a/config/storage/cleanup_cronjob.yaml b/config/storage/cleanup_cronjob.yaml index 2473525..efd9b07 100644 --- a/config/storage/cleanup_cronjob.yaml +++ b/config/storage/cleanup_cronjob.yaml @@ -49,25 +49,10 @@ spec: configMap: name: cleanup-script defaultMode: 0755 ---- -# The cleanup-script ConfigMap should be created from scripts/cleanup.sh -# to maintain a single source of truth: + +# NOTE: The cleanup-script ConfigMap is NOT bundled in this file. +# Create it from the canonical script before applying this CronJob: # # oc create configmap cleanup-script \ # --from-file=cleanup.sh=scripts/cleanup.sh \ # -n ab-eval-flow --dry-run=client -o yaml | oc apply -f - -# -# Alternatively, apply it manually: -apiVersion: v1 -kind: ConfigMap -metadata: - name: cleanup-script - namespace: ab-eval-flow - labels: - app.kubernetes.io/part-of: abevalflow -data: - cleanup.sh: | - #!/usr/bin/env bash - echo "ERROR: placeholder — regenerate this ConfigMap from scripts/cleanup.sh:" - echo " oc create configmap cleanup-script --from-file=cleanup.sh=scripts/cleanup.sh -n ab-eval-flow --dry-run=client -o yaml | oc apply -f -" - exit 1
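+#
+# One way to verify the live ConfigMap matches the repo copy (the
+# backslash escapes the dot in the data key for jsonpath):
+#
+#   oc get configmap cleanup-script -n ab-eval-flow \
+#     -o jsonpath='{.data.cleanup\.sh}' | diff - scripts/cleanup.sh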