99 changes: 99 additions & 0 deletions Docs/failure_handling.md
@@ -0,0 +1,99 @@
# Failure Handling, Retries, and Idempotency

## Retry Policy (Target — Not Yet Applied)

> **Note:** The per-task retry values below are the target policy. They
> will be added to `pipeline.yaml` once the pipeline assembly PR merges
> and per-task `retries` fields are wired. Currently `pipeline.yaml`
> only sets aggregate `spec.timeouts`.

| Task | Planned Retries | Rationale |
|---|---|---|
| `clone-repo` (ClusterTask) | 2 | Network-dependent, fully idempotent |
| `validate` | 1 | Read-only, deterministic |
| `scaffold` | 1 | Deterministic template rendering |
| `build-push` | 2 | Transient registry/network errors; Buildah is idempotent with layer caching |
| `harbor-eval` | 0 | Long-running (up to 3h), not idempotent — partial trial results would conflict with a fresh run |
| `analyze` | 1 | Reads from workspace, deterministic computation |
| `store-results` | 2 | Database transient errors; upsert logic ensures idempotency via `pipeline_run_id` uniqueness |
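
Once wired, these values become `retries` fields on the pipeline's task
entries. A minimal sketch of the target shape (task names from the table
above; `taskRef`, `params`, and `workspaces` omitted, and the exact layout
depends on the pipeline assembly PR):

```yaml
# Target per-task retries (not yet applied to pipeline.yaml).
spec:
  tasks:
    - name: clone-repo
      retries: 2   # network-dependent, fully idempotent
    - name: build-push
      retries: 2   # transient registry/network errors
    - name: harbor-eval
      retries: 0   # not idempotent; never retried automatically
```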

## Timeouts (Target — Not Yet Applied)

> **Note:** The per-task timeouts below are the target policy. They will
> be added to `pipeline.yaml` alongside the retry values. Currently only
> aggregate timeouts are set: `pipeline: 4h`, `tasks: 3h`.

| Task | Planned Timeout | Notes |
|---|---|---|
| `clone-repo` | 5m | Large repos may need adjustment |
| `validate` | 10m | Includes py_compile on all test files |
| `scaffold` | 10m | Jinja2 rendering + file copy |
| `build-push` | 30m | Two container builds (treatment + control) |
| `harbor-eval` | 3h | 20 trials x 2 variants; adjust based on task complexity |
| `analyze` | 15m | Statistical computation + report generation |
| `store-results` | 15m | Database writes + observer notifications |
| **Pipeline total** | 4h | Overall cap (note the per-task timeouts above sum to ~4h25m in the worst case) |
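
The matching per-task `timeout` fields would sit alongside the `retries`
values; a sketch under the same caveats, with the existing aggregate
timeouts kept as the outer safety net:

```yaml
# Target per-task timeouts (not yet applied to pipeline.yaml).
spec:
  tasks:
    - name: build-push
      timeout: "30m"   # two container builds (treatment + control)
    - name: harbor-eval
      timeout: "3h"    # 20 trials x 2 variants
    - name: store-results
      timeout: "15m"   # database writes + observer notifications
```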

## Non-Retryable Failures

Certain failure categories should not be retried because they indicate
a problem that will not resolve on its own:

- **Validation failures** — malformed submission, missing required files
- **Schema violations** — invalid `metadata.yaml`
- **Build failures** from syntax errors in user code
- **Harbor evaluation failures** from test assertion errors (the skill genuinely fails)

These are distinguished from transient failures (network timeouts,
registry 503s, DB connection drops) by exit code conventions:

| Exit Code | Meaning | Retry? |
|---|---|---|
| 0 | Success | -- |
| 1 | Transient/recoverable error | Yes |
| 2 | Validation/user error (non-retryable) | No |
| 3 | Infrastructure error (retryable) | Yes |

Scripts should use `sys.exit(2)` for user-facing errors to signal that
a retry would not help.
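
As an illustration only (this is not one of the real task definitions in
`pipeline/tasks/`), a validation step following the convention might look
like:

```yaml
# Hypothetical step sketch showing the exit-code convention; the image
# and file check are illustrative.
steps:
  - name: validate
    image: python:3.11-slim
    script: |
      #!/usr/bin/env python3
      import sys
      from pathlib import Path

      if not Path("metadata.yaml").is_file():
          # Validation/user error: exit 2 marks the failure as non-retryable.
          print("missing metadata.yaml", file=sys.stderr)
          sys.exit(2)
      sys.exit(0)
```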

## Dead-Letter Path

When a PipelineRun fails after exhausting retries:

1. **Artifacts are retained** on the workspace PVC (not cleaned up)
2. The `abevalflow-dead-letter` PVC is provisioned and reserved for
   failed-run artifact storage. Automatic copy logic is **not yet
   implemented**; operators can manually copy artifacts from the
   workspace PVC for post-mortem analysis (see the helper-pod sketch
   after this list).
3. PipelineRun metadata remains queryable via `tkn pipelinerun describe`
until the cleanup CronJob prunes it (keeps the 7 most recent by
count, configurable via `PIPELINERUN_KEEP_COUNT`)
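
Until automatic copying lands, the simplest manual path is a one-off
helper pod that mounts both PVCs. A sketch (pod name and copy command are
illustrative; if the workspace PVC is ReadWriteOnce, run this while no
PipelineRun holds the volume):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dead-letter-copy   # throwaway helper; delete after the copy
  namespace: ab-eval-flow
spec:
  restartPolicy: Never
  containers:
    - name: copy
      image: registry.access.redhat.com/ubi9/ubi-minimal
      command: ["sh", "-c", "cp -a /workspace/. /dead-letter/"]
      volumeMounts:
        - name: workspace
          mountPath: /workspace
          readOnly: true
        - name: dead-letter
          mountPath: /dead-letter
  volumes:
    - name: workspace
      persistentVolumeClaim:
        claimName: abevalflow-workspace
    - name: dead-letter
      persistentVolumeClaim:
        claimName: abevalflow-dead-letter
```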

## Partial-Run Recovery

Tekton does not natively support resuming a pipeline from a specific
task. The recovery strategy is:

1. **Workspace snapshot** — the PVC retains all intermediate artifacts
from completed tasks. A re-run with the same submission will
overwrite these, effectively starting fresh.

2. **Harbor checkpointing** — the Harbor fork persists individual trial
results to the workspace as they complete. If `harbor-eval` fails
mid-way (e.g., after 15 of 20 trials), the partial `result.json`
files are available for inspection. However, the analysis step
expects a complete set, so a re-run of `harbor-eval` is needed.

3. **Manual re-trigger** — use `tkn pipeline start` with the same
parameters to re-run the full pipeline. Since all tasks before the
failure point are idempotent, they will complete quickly using
cached layers (builds) or deterministic outputs (scaffold).
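
For example, with hypothetical pipeline, parameter, and workspace names
(copy the real values from the failed run via
`tkn pipelinerun describe <failed-run>`):

```bash
tkn pipeline start abevalflow \
  -n ab-eval-flow \
  -p submission-url=https://git.example.com/skills/demo.git \
  -p submission-revision=main \
  -w name=shared,claimName=abevalflow-workspace \
  --serviceaccount pipeline \
  --showlog
```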

## Concurrency

- **PipelineRuns** — no built-in Tekton limit; use `ResourceQuota` on
the namespace (`config/security/resource_quota.yaml`) to cap total
pods, which indirectly limits concurrent runs.
- **Trial Pods** — Harbor's `OpenShiftEnvironment` controls concurrency
via its `max_concurrent` parameter in the job config.
188 changes: 188 additions & 0 deletions Docs/infrastructure_ops.md
@@ -0,0 +1,188 @@
# Infrastructure & Operations Guide

Deployment and operations reference for running ABEvalFlow on OpenShift.

## Prerequisites

- OpenShift cluster with Pipelines operator (Tekton) installed
- `oc` CLI authenticated with cluster-admin or namespace-admin
- `tkn` CLI (optional, for manual pipeline triggers and PipelineRun cleanup)

## Namespace Setup

```bash
oc new-project ab-eval-flow --description="ABEvalFlow A/B evaluation pipeline"
```

## Deployment Order

Apply manifests in this order to satisfy dependencies:

```bash
# 1. RBAC — ServiceAccount, Roles, RoleBindings
oc apply -f config/rbac.yaml

# 2. Security — resource quotas
oc apply -f config/security/resource_quota.yaml

# 3. Network policies — choose ONE based on LLM mode (see below)
oc apply -f config/security/network_policy_default_deny.yaml
oc apply -f config/security/network_policy_<mode>.yaml

# 4. Storage — workspace and dead-letter PVCs
oc apply -f config/storage/workspace_pvc.yaml
oc apply -f config/storage/dead_letter_pvc.yaml

# 5. Cleanup — create ConfigMap from script, then apply CronJob
oc create configmap cleanup-script \
--from-file=cleanup.sh=scripts/cleanup.sh \
-n ab-eval-flow --dry-run=client -o yaml | oc apply -f -
oc apply -f config/storage/cleanup_cronjob.yaml

# 6. Tekton tasks
oc apply -f pipeline/tasks/

# 7. Tekton triggers
oc apply -f pipeline/triggers/

# 8. Expose EventListener
oc create route edge el-submission-listener \
--service=el-submission-listener \
--port=http-listener

# 9. (Optional) LiteLLM — only for Vertex AI mode
# Creates a dedicated litellm ServiceAccount, Deployment, Service, and ConfigMap.
# Requires the litellm-credentials Secret (see LiteLLM Setup below).
oc apply -f config/litellm/
```

## Network Policy Selection

Choose the network policy that matches your LLM access mode. Always
apply the default-deny policy first, then add the mode-specific allow
policy.

| LLM Mode | Policies to Apply | Effect |
|---|---|---|
| Direct API key | `default_deny` + `direct_api` | Trial pods can reach provider HTTPS endpoints + DNS |
| Vertex AI + LiteLLM | `default_deny` + `litellm` | Trial pods can only reach in-cluster LiteLLM on port 4000 |
| Self-hosted model | `default_deny` + `self_hosted` | Trial pods can only reach in-cluster model server |

Trial pods must carry the label `abevalflow/role: trial` for policies
to take effect. The Harbor fork's `OpenShiftEnvironment` should set
this label when creating trial pods.
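
For orientation, the litellm-mode allow policy plausibly looks like the
sketch below; the authoritative manifest is the corresponding file under
`config/security/`, and the selectors here are inferred from the label
above and the LiteLLM manifests:

```yaml
# Sketch only; check the real litellm-mode policy file for the
# authoritative rules.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-trial-to-litellm
  namespace: ab-eval-flow
spec:
  podSelector:
    matchLabels:
      abevalflow/role: trial          # trial pods must carry this label
  policyTypes: ["Egress"]
  egress:
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: litellm
      ports:
        - protocol: TCP
          port: 4000
    - ports:                          # DNS, likely needed to resolve the Service name
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```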

## LiteLLM Setup (Vertex AI Mode Only)

1. Create the credentials secret with your GCP service account key:

```bash
oc create secret generic litellm-credentials \
--from-file=GOOGLE_APPLICATION_CREDENTIALS_JSON=path/to/sa-key.json \
--from-literal=LITELLM_MASTER_KEY=$(openssl rand -hex 32) \
-n ab-eval-flow
```

2. Edit `config/litellm/configmap.yaml` to set your GCP project and
model routing.

3. Apply the manifests:

```bash
oc apply -f config/litellm/
```

4. Verify the proxy is healthy:

```bash
oc get pods -l app.kubernetes.io/name=litellm -n ab-eval-flow
oc port-forward svc/litellm 4000:4000 -n ab-eval-flow &
# /health requires auth once a master key is set; pass it as a bearer token
curl -H "Authorization: Bearer $(oc get secret litellm-credentials -n ab-eval-flow \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)" http://localhost:4000/health
```

## Storage

| PVC | Purpose | Default Size |
|---|---|---|
| `abevalflow-workspace` | Shared pipeline workspace (source, builds, results) | 5Gi |
| `abevalflow-dead-letter` | Reserved for failed-run artifacts (manual use for now) | 2Gi |

Adjust sizes based on expected submission volume and image sizes.
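
To grow an existing claim in place (only works if the backing
StorageClass has `allowVolumeExpansion: true`; otherwise recreate the PVC
at the new size):

```bash
oc patch pvc abevalflow-workspace -n ab-eval-flow \
  --type merge -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'
```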

## Cleanup CronJob

Runs daily at 03:00 UTC. Configurable via environment variables:

| Variable | Default | Description |
|---|---|---|
| `NAMESPACE` | `ab-eval-flow` | Target namespace |
| `POD_AGE_HOURS` | `24` | Delete completed/failed trial pods older than this |
| `PIPELINERUN_KEEP_COUNT` | `7` | Keep the N most recent PipelineRuns, delete the rest |

To run cleanup manually:

```bash
oc create job --from=cronjob/abevalflow-cleanup manual-cleanup -n ab-eval-flow
```
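
The defaults can be overridden without editing the manifest, assuming the
variables are plain env entries on the CronJob's pod template (check
`config/storage/cleanup_cronjob.yaml` first):

```bash
oc set env cronjob/abevalflow-cleanup PIPELINERUN_KEEP_COUNT=14 POD_AGE_HOURS=48 -n ab-eval-flow
```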

## Resource Quotas

The default quota (`config/security/resource_quota.yaml`) limits:

| Resource | Limit |
|---|---|
| Pods | 50 |
| CPU requests | 32 cores |
| Memory requests | 64Gi |
| CPU limits | 64 cores |
| Memory limits | 128Gi |
| PVCs | 10 |

Adjust based on cluster capacity and expected concurrency.
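
In manifest form, the table corresponds roughly to the following (the
authoritative values live in `config/security/resource_quota.yaml`; the
quota name matches the one used in the Verification section):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: eval-resource-quota
  namespace: ab-eval-flow
spec:
  hard:
    pods: "50"
    requests.cpu: "32"
    requests.memory: 64Gi
    limits.cpu: "64"
    limits.memory: 128Gi
    persistentvolumeclaims: "10"
```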

## Pod Security

Trial pods spawned by Harbor's `OpenShiftEnvironment` should follow the
security context documented in `config/security/pod_security_reference.yaml`:

- `runAsNonRoot: true`
- `allowPrivilegeEscalation: false`
- Drop all Linux capabilities
- Seccomp `RuntimeDefault`
- Resource requests/limits per trial pod

For agent compatibility, the Harbor fork currently sets `HOME=/tmp`
rather than enabling `readOnlyRootFilesystem: true`; see
`Docs/harbor_openshift_backend.md` for details.
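
In pod-spec terms, the list above translates to roughly the following
container-level settings; the resource figures are placeholders to tune
per trial workload, not values taken from `pod_security_reference.yaml`:

```yaml
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
env:
  - name: HOME          # writable HOME in place of a read-only root filesystem
    value: /tmp
resources:
  requests:
    cpu: "500m"        # placeholder
    memory: 512Mi      # placeholder
  limits:
    cpu: "2"           # placeholder
    memory: 2Gi        # placeholder
```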

## Failure Handling

See [failure_handling.md](failure_handling.md) for retry policies,
timeouts, dead-letter path, and partial-run recovery.

## Verification

After deploying, verify the infrastructure:

```bash
# Check ServiceAccount
oc get sa pipeline -n ab-eval-flow

# Check RBAC
oc auth can-i create pods --as=system:serviceaccount:ab-eval-flow:pipeline -n ab-eval-flow

# Check network policies
oc get networkpolicy -n ab-eval-flow

# Check PVCs
oc get pvc -n ab-eval-flow

# Check CronJob
oc get cronjob -n ab-eval-flow

# Check EventListener
oc get el,route -n ab-eval-flow

# Check resource quota usage
oc describe resourcequota eval-resource-quota -n ab-eval-flow
```
25 changes: 25 additions & 0 deletions config/litellm/configmap.yaml
@@ -0,0 +1,25 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: litellm-config
namespace: ab-eval-flow
labels:
app.kubernetes.io/part-of: abevalflow
app.kubernetes.io/name: litellm
data:
config.yaml: |
model_list:
- model_name: claude-sonnet
litellm_params:
model: vertex_ai/claude-3-5-sonnet@20241022
vertex_project: "<gcp-project-id>"
vertex_location: "global"
litellm_settings:
drop_params: true
set_verbose: false
num_retries: 2
request_timeout: 120
general_settings:
master_key: "os.environ/LITELLM_MASTER_KEY"
82 changes: 82 additions & 0 deletions config/litellm/deployment.yaml
@@ -0,0 +1,82 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: litellm
namespace: ab-eval-flow
labels:
app.kubernetes.io/part-of: abevalflow
app.kubernetes.io/name: litellm
spec:
replicas: 2
selector:
matchLabels:
app.kubernetes.io/name: litellm
template:
metadata:
labels:
app.kubernetes.io/name: litellm
app.kubernetes.io/part-of: abevalflow
spec:
serviceAccountName: litellm
containers:
- name: litellm
image: ghcr.io/berriai/litellm:main-v1.82.6
ports:
- containerPort: 4000
name: http
args:
- "--config"
- "/app/config/config.yaml"
- "--port"
- "4000"
env:
- name: LITELLM_MASTER_KEY
valueFrom:
secretKeyRef:
name: litellm-credentials
key: LITELLM_MASTER_KEY
- name: GOOGLE_APPLICATION_CREDENTIALS
value: /app/credentials/gcp-sa-key.json
volumeMounts:
- name: config
mountPath: /app/config
readOnly: true
- name: credentials
mountPath: /app/credentials
readOnly: true
resources:
requests:
cpu: "250m"
memory: "256Mi"
limits:
cpu: "1"
memory: "1Gi"
readinessProbe:
httpGet:
path: /health
port: 4000
initialDelaySeconds: 10
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 4000
initialDelaySeconds: 15
periodSeconds: 30
securityContext:
runAsNonRoot: true
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
seccompProfile:
type: RuntimeDefault
volumes:
- name: config
configMap:
name: litellm-config
- name: credentials
secret:
secretName: litellm-credentials
items:
- key: GOOGLE_APPLICATION_CREDENTIALS_JSON
path: gcp-sa-key.json