99 changes: 99 additions & 0 deletions Docs/failure_handling.md
@@ -0,0 +1,99 @@
# Failure Handling, Retries, and Idempotency

## Retry Policy (Target — Not Yet Applied)

> **Note:** The per-task retry values below are the target policy. They
> will be added to `pipeline.yaml` once the pipeline assembly PR merges
> and per-task `retries` fields are wired. Currently `pipeline.yaml`
> only sets aggregate `spec.timeouts`.

| Task | Planned Retries | Rationale |
|---|---|---|
| `clone-repo` (ClusterTask) | 2 | Network-dependent, fully idempotent |
| `validate` | 1 | Read-only, deterministic |
| `scaffold` | 1 | Deterministic template rendering |
| `build-push` | 2 | Transient registry/network errors; Buildah is idempotent with layer caching |
| `harbor-eval` | 0 | Long-running (up to 3h), not idempotent — partial trial results would conflict with a fresh run |
| `analyze` | 1 | Reads from workspace, deterministic computation |
| `store-results` | 2 | Database transient errors; upsert logic ensures idempotency via `pipeline_run_id` uniqueness |
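
Once wired, these values become `retries` fields on the pipeline's task
entries. A minimal sketch of the target shape (task names from the table
above; `taskRef`, `params`, and `workspaces` omitted, and the exact layout
depends on the pipeline assembly PR):

```yaml
# Target per-task retries (not yet applied to pipeline.yaml).
spec:
  tasks:
    - name: clone-repo
      retries: 2   # network-dependent, fully idempotent
    - name: build-push
      retries: 2   # transient registry/network errors
    - name: harbor-eval
      retries: 0   # not idempotent; never retried automatically
```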

## Timeouts (Target — Not Yet Applied)

> **Note:** The per-task timeouts below are the target policy. They will
> be added to `pipeline.yaml` alongside the retry values. Currently only
> aggregate timeouts are set: `pipeline: 4h`, `tasks: 3h`.

| Task | Planned Timeout | Notes |
|---|---|---|
| `clone-repo` | 5m | Large repos may need adjustment |
| `validate` | 10m | Includes py_compile on all test files |
| `scaffold` | 10m | Jinja2 rendering + file copy |
| `build-push` | 30m | Two container builds (treatment + control) |
| `harbor-eval` | 3h | 20 trials x 2 variants; adjust based on task complexity |
| `analyze` | 15m | Statistical computation + report generation |
| `store-results` | 15m | Database writes + observer notifications |
| **Pipeline total** | 4h | Overall cap (note the per-task timeouts above sum to ~4h25m in the worst case) |
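
The matching per-task `timeout` fields would sit alongside the `retries`
values; a sketch under the same caveats, with the existing aggregate
timeouts kept as the outer safety net:

```yaml
# Target per-task timeouts (not yet applied to pipeline.yaml).
spec:
  tasks:
    - name: build-push
      timeout: "30m"   # two container builds (treatment + control)
    - name: harbor-eval
      timeout: "3h"    # 20 trials x 2 variants
    - name: store-results
      timeout: "15m"   # database writes + observer notifications
```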

## Non-Retryable Failures

Certain failure categories should not be retried because they indicate
a problem that will not resolve on its own:

- **Validation failures** — malformed submission, missing required files
- **Schema violations** — invalid `metadata.yaml`
- **Build failures** from syntax errors in user code
- **Harbor evaluation failures** from test assertion errors (the skill genuinely fails)

These are distinguished from transient failures (network timeouts,
registry 503s, DB connection drops) by exit code conventions:

| Exit Code | Meaning | Retry? |
|---|---|---|
| 0 | Success | -- |
| 1 | Transient/recoverable error | Yes |
| 2 | Validation/user error (non-retryable) | No |
| 3 | Infrastructure error (retryable) | Yes |

Scripts should use `sys.exit(2)` for user-facing errors to signal that
a retry would not help.
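
As an illustration only (this is not one of the real task definitions in
`pipeline/tasks/`), a validation step following the convention might look
like:

```yaml
# Hypothetical step sketch showing the exit-code convention; the image
# and file check are illustrative.
steps:
  - name: validate
    image: python:3.11-slim
    script: |
      #!/usr/bin/env python3
      import sys
      from pathlib import Path

      if not Path("metadata.yaml").is_file():
          # Validation/user error: exit 2 marks the failure as non-retryable.
          print("missing metadata.yaml", file=sys.stderr)
          sys.exit(2)
      sys.exit(0)
```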

## Dead-Letter Path

When a PipelineRun fails after exhausting retries:

1. **Artifacts are retained** on the workspace PVC (not cleaned up)
2. The `abevalflow-dead-letter` PVC is provisioned and reserved for
   failed-run artifact storage. Automatic copy logic is **not yet
   implemented**; operators can manually copy artifacts from the
   workspace PVC for post-mortem analysis (see the helper-pod sketch
   after this list).
3. PipelineRun metadata remains queryable via `tkn pipelinerun describe`
until the cleanup CronJob prunes it (keeps the 7 most recent by
count, configurable via `PIPELINERUN_KEEP_COUNT`)
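
Until automatic copying lands, the simplest manual path is a one-off
helper pod that mounts both PVCs. A sketch (pod name and copy command are
illustrative; if the workspace PVC is ReadWriteOnce, run this while no
PipelineRun holds the volume):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dead-letter-copy   # throwaway helper; delete after the copy
  namespace: ab-eval-flow
spec:
  restartPolicy: Never
  containers:
    - name: copy
      image: registry.access.redhat.com/ubi9/ubi-minimal
      command: ["sh", "-c", "cp -a /workspace/. /dead-letter/"]
      volumeMounts:
        - name: workspace
          mountPath: /workspace
          readOnly: true
        - name: dead-letter
          mountPath: /dead-letter
  volumes:
    - name: workspace
      persistentVolumeClaim:
        claimName: abevalflow-workspace
    - name: dead-letter
      persistentVolumeClaim:
        claimName: abevalflow-dead-letter
```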

## Partial-Run Recovery

Tekton does not natively support resuming a pipeline from a specific
task. The recovery strategy is:

1. **Workspace snapshot** — the PVC retains all intermediate artifacts
from completed tasks. A re-run with the same submission will
overwrite these, effectively starting fresh.

2. **Harbor checkpointing** — the Harbor fork persists individual trial
results to the workspace as they complete. If `harbor-eval` fails
mid-way (e.g., after 15 of 20 trials), the partial `result.json`
files are available for inspection. However, the analysis step
expects a complete set, so a re-run of `harbor-eval` is needed.

3. **Manual re-trigger** — use `tkn pipeline start` with the same
parameters to re-run the full pipeline. Since all tasks before the
failure point are idempotent, they will complete quickly using
cached layers (builds) or deterministic outputs (scaffold).
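
For example, with hypothetical pipeline, parameter, and workspace names
(copy the real values from the failed run via
`tkn pipelinerun describe <failed-run>`):

```bash
tkn pipeline start abevalflow \
  -n ab-eval-flow \
  -p submission-url=https://git.example.com/skills/demo.git \
  -p submission-revision=main \
  -w name=shared,claimName=abevalflow-workspace \
  --serviceaccount pipeline \
  --showlog
```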

## Concurrency

- **PipelineRuns** — no built-in Tekton limit; use `ResourceQuota` on
the namespace (`config/security/resource_quota.yaml`) to cap total
pods, which indirectly limits concurrent runs.
- **Trial Pods** — Harbor's `OpenShiftEnvironment` controls concurrency
via its `max_concurrent` parameter in the job config.
188 changes: 188 additions & 0 deletions Docs/infrastructure_ops.md
@@ -0,0 +1,188 @@
# Infrastructure & Operations Guide

Deployment and operations reference for running ABEvalFlow on OpenShift.

## Prerequisites

- OpenShift cluster with Pipelines operator (Tekton) installed
- `oc` CLI authenticated with cluster-admin or namespace-admin
- `tkn` CLI (optional, for manual pipeline triggers and PipelineRun cleanup)

## Namespace Setup

```bash
oc new-project ab-eval-flow --description="ABEvalFlow A/B evaluation pipeline"
```

## Deployment Order

Apply manifests in this order to satisfy dependencies:

```bash
# 1. RBAC — ServiceAccount, Roles, RoleBindings
oc apply -f config/rbac.yaml

# 2. Security — resource quotas
oc apply -f config/security/resource_quota.yaml

# 3. Network policies — choose ONE based on LLM mode (see below)
oc apply -f config/security/network_policy_default_deny.yaml
oc apply -f config/security/network_policy_<mode>.yaml

# 4. Storage — workspace and dead-letter PVCs
oc apply -f config/storage/workspace_pvc.yaml
oc apply -f config/storage/dead_letter_pvc.yaml

# 5. Cleanup — create ConfigMap from script, then apply CronJob
oc create configmap cleanup-script \
--from-file=cleanup.sh=scripts/cleanup.sh \
-n ab-eval-flow --dry-run=client -o yaml | oc apply -f -
oc apply -f config/storage/cleanup_cronjob.yaml

# 6. Tekton tasks
oc apply -f pipeline/tasks/

# 7. Tekton triggers
oc apply -f pipeline/triggers/

# 8. Expose EventListener
oc create route edge el-submission-listener \
--service=el-submission-listener \
--port=http-listener

# 9. (Optional) LiteLLM — only for Vertex AI mode
# Creates a dedicated litellm ServiceAccount, Deployment, Service, and ConfigMap.
# Requires the litellm-credentials Secret (see LiteLLM Setup below).
oc apply -f config/litellm/
```

## Network Policy Selection

Choose the network policy that matches your LLM access mode. Always
apply the default-deny policy first, then add the mode-specific allow
policy.

| LLM Mode | Policies to Apply | Effect |
|---|---|---|
| Direct API key | `default_deny` + `direct_api` | Trial pods can reach provider HTTPS endpoints + DNS |
| Vertex AI + LiteLLM | `default_deny` + `litellm` | Trial pods can only reach in-cluster LiteLLM on port 4000 |
| Self-hosted model | `default_deny` + `self_hosted` | Trial pods can only reach in-cluster model server |

Trial pods must carry the label `abevalflow/role: trial` for policies
to take effect. The Harbor fork's `OpenShiftEnvironment` should set
this label when creating trial pods.
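
For orientation, the litellm-mode allow policy plausibly looks like the
sketch below; the authoritative manifest is the corresponding file under
`config/security/`, and the selectors here are inferred from the label
above and the LiteLLM manifests:

```yaml
# Sketch only; check the real litellm-mode policy file for the
# authoritative rules.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-trial-to-litellm
  namespace: ab-eval-flow
spec:
  podSelector:
    matchLabels:
      abevalflow/role: trial          # trial pods must carry this label
  policyTypes: ["Egress"]
  egress:
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: litellm
      ports:
        - protocol: TCP
          port: 4000
    - ports:                          # DNS, likely needed to resolve the Service name
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```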

## LiteLLM Setup (Vertex AI Mode Only)

1. Create the credentials secret with your GCP service account key:

```bash
oc create secret generic litellm-credentials \
--from-file=GOOGLE_APPLICATION_CREDENTIALS_JSON=path/to/sa-key.json \
--from-literal=LITELLM_MASTER_KEY=$(openssl rand -hex 32) \
-n ab-eval-flow
```

2. Edit `config/litellm/configmap.yaml` to set your GCP project and
model routing.

3. Apply the manifests:

```bash
oc apply -f config/litellm/
```

4. Verify the proxy is healthy:

```bash
oc get pods -l app.kubernetes.io/name=litellm -n ab-eval-flow
oc port-forward svc/litellm 4000:4000 -n ab-eval-flow &
# /health requires auth once a master key is set; pass it as a bearer token
curl -H "Authorization: Bearer $(oc get secret litellm-credentials -n ab-eval-flow \
  -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)" http://localhost:4000/health
```

## Storage

| PVC | Purpose | Default Size |
|---|---|---|
| `abevalflow-workspace` | Shared pipeline workspace (source, builds, results) | 5Gi |
| `abevalflow-dead-letter` | Reserved for failed-run artifacts (manual use for now) | 2Gi |

Adjust sizes based on expected submission volume and image sizes.
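
To grow an existing claim in place (only works if the backing
StorageClass has `allowVolumeExpansion: true`; otherwise recreate the PVC
at the new size):

```bash
oc patch pvc abevalflow-workspace -n ab-eval-flow \
  --type merge -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'
```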

## Cleanup CronJob

Runs daily at 03:00 UTC. Configurable via environment variables:

| Variable | Default | Description |
|---|---|---|
| `NAMESPACE` | `ab-eval-flow` | Target namespace |
| `POD_AGE_HOURS` | `24` | Delete completed/failed trial pods older than this |
| `PIPELINERUN_KEEP_COUNT` | `7` | Keep the N most recent PipelineRuns, delete the rest |

To run cleanup manually:

```bash
oc create job --from=cronjob/abevalflow-cleanup manual-cleanup -n ab-eval-flow
```
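
The defaults can be overridden without editing the manifest, assuming the
variables are plain env entries on the CronJob's pod template (check
`config/storage/cleanup_cronjob.yaml` first):

```bash
oc set env cronjob/abevalflow-cleanup PIPELINERUN_KEEP_COUNT=14 POD_AGE_HOURS=48 -n ab-eval-flow
```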

## Resource Quotas

The default quota (`config/security/resource_quota.yaml`) limits:

| Resource | Limit |
|---|---|
| Pods | 50 |
| CPU requests | 32 cores |
| Memory requests | 64Gi |
| CPU limits | 64 cores |
| Memory limits | 128Gi |
| PVCs | 10 |

Adjust based on cluster capacity and expected concurrency.
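
In manifest form, the table corresponds roughly to the following (the
authoritative values live in `config/security/resource_quota.yaml`; the
quota name matches the one used in the Verification section):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: eval-resource-quota
  namespace: ab-eval-flow
spec:
  hard:
    pods: "50"
    requests.cpu: "32"
    requests.memory: 64Gi
    limits.cpu: "64"
    limits.memory: 128Gi
    persistentvolumeclaims: "10"
```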

## Pod Security

Trial pods spawned by Harbor's `OpenShiftEnvironment` should follow the
security context documented in `config/security/pod_security_reference.yaml`:

- `runAsNonRoot: true`
- `allowPrivilegeEscalation: false`
- Drop all Linux capabilities
- Seccomp `RuntimeDefault`
- Resource requests/limits per trial pod

For agent compatibility, the Harbor fork currently sets `HOME=/tmp`
rather than enabling `readOnlyRootFilesystem: true`; see
`Docs/harbor_openshift_backend.md` for details.
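
In pod-spec terms, the list above translates to roughly the following
container-level settings; the resource figures are placeholders to tune
per trial workload, not values taken from `pod_security_reference.yaml`:

```yaml
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
env:
  - name: HOME          # writable HOME in place of a read-only root filesystem
    value: /tmp
resources:
  requests:
    cpu: "500m"        # placeholder
    memory: 512Mi      # placeholder
  limits:
    cpu: "2"           # placeholder
    memory: 2Gi        # placeholder
```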

## Failure Handling

See [failure_handling.md](failure_handling.md) for retry policies,
timeouts, dead-letter path, and partial-run recovery.

## Verification

After deploying, verify the infrastructure:

```bash
# Check ServiceAccount
oc get sa pipeline -n ab-eval-flow

# Check RBAC
oc auth can-i create pods --as=system:serviceaccount:ab-eval-flow:pipeline -n ab-eval-flow

# Check network policies
oc get networkpolicy -n ab-eval-flow

# Check PVCs
oc get pvc -n ab-eval-flow

# Check CronJob
oc get cronjob -n ab-eval-flow

# Check EventListener
oc get el,route -n ab-eval-flow

# Check resource quota usage
oc describe resourcequota eval-resource-quota -n ab-eval-flow
```
25 changes: 25 additions & 0 deletions config/litellm/configmap.yaml
@@ -0,0 +1,25 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: litellm-config
namespace: ab-eval-flow
labels:
app.kubernetes.io/part-of: abevalflow
app.kubernetes.io/name: litellm
data:
config.yaml: |
model_list:
- model_name: claude-sonnet
litellm_params:
model: vertex_ai/claude-3-5-sonnet@20241022
vertex_project: "<gcp-project-id>"
vertex_location: "global"
litellm_settings:
drop_params: true
set_verbose: false
num_retries: 2
request_timeout: 120
general_settings:
master_key: "os.environ/LITELLM_MASTER_KEY"
82 changes: 82 additions & 0 deletions config/litellm/deployment.yaml
@@ -0,0 +1,82 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: litellm
namespace: ab-eval-flow
labels:
app.kubernetes.io/part-of: abevalflow
app.kubernetes.io/name: litellm
spec:
replicas: 2
selector:
matchLabels:
app.kubernetes.io/name: litellm
template:
metadata:
labels:
app.kubernetes.io/name: litellm
app.kubernetes.io/part-of: abevalflow
spec:
serviceAccountName: litellm
containers:
- name: litellm
image: ghcr.io/berriai/litellm:main-v1.82.6
ports:
- containerPort: 4000
name: http
args:
- "--config"
- "/app/config/config.yaml"
- "--port"
- "4000"
env:
- name: LITELLM_MASTER_KEY
valueFrom:
secretKeyRef:
name: litellm-credentials
key: LITELLM_MASTER_KEY
- name: GOOGLE_APPLICATION_CREDENTIALS
value: /app/credentials/gcp-sa-key.json
volumeMounts:
- name: config
mountPath: /app/config
readOnly: true
- name: credentials
mountPath: /app/credentials
readOnly: true
resources:
requests:
cpu: "250m"
memory: "256Mi"
limits:
cpu: "1"
memory: "1Gi"
readinessProbe:
httpGet:
path: /health
port: 4000
initialDelaySeconds: 10
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 4000
initialDelaySeconds: 15
periodSeconds: 30
securityContext:
runAsNonRoot: true
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
seccompProfile:
type: RuntimeDefault
volumes:
- name: config
configMap:
name: litellm-config
- name: credentials
secret:
secretName: litellm-credentials
items:
- key: GOOGLE_APPLICATION_CREDENTIALS_JSON
path: gcp-sa-key.json