
feat: APPENG-4910 — Infrastructure manifests, network policies, LiteLLM, cleanup, and ops docs #13

Merged
GuyZivRH merged 4 commits into main from APPENG-4910/infra-ops on Apr 26, 2026
Conversation

@GuyZivRH (Collaborator) commented Apr 23, 2026

Summary

  • RBAC: Explicit ServiceAccount, added configmaps and PVC permissions to pipeline Role
  • Security: Pod security context reference, namespace ResourceQuota (50 pods, 32 CPU, 64Gi), four network policies by LLM mode (default-deny, direct-api, litellm, self-hosted)
  • LiteLLM: HA Deployment (2 replicas), Service, ConfigMap, Secret template for Vertex AI mode
  • Storage: Workspace PVC (5Gi), dead-letter PVC (2Gi) for failed run artifacts
  • Cleanup: Daily CronJob removing stale trial pods (>24h), old PipelineRuns (>7d), empty ImageStreams
  • Docs: Failure handling reference (retries, timeouts, exit codes, dead-letter, partial recovery) and infrastructure deployment guide with verification steps

Deferred (follow-up)

Verification (tested on OpenShift cluster)

All manifests applied to ab-eval-flow namespace and verified:

  • RBAC: ServiceAccount pipeline exists; oc auth can-i confirmed create pods, get secrets, create PVCs, get configmaps — all yes
  • Network policies: 2 policies active (default-deny + direct-api), targeting abevalflow/role=trial pods
  • Resource quota: Applied — 0/50 pods, 0/32 CPU, 2/10 PVCs used
  • Workspace PVC (5Gi): Bound
  • Dead-letter PVC (2Gi): Bound
  • Cleanup CronJob: Scheduled (daily 03:00 UTC), not yet triggered
  • EventListener: Ready, route live at el-submission-listener-ab-eval-flow.apps.appeng.clusters.se-apps.redhat.com

…p, and ops docs

RBAC: explicit ServiceAccount, configmaps and PVC permissions.
Security: pod security reference, resource quota, four network
policies (default-deny, direct-api, litellm, self-hosted).
LiteLLM: HA deployment, service, config, and secret template for
Vertex AI mode. Storage: workspace and dead-letter PVCs, daily
cleanup CronJob for stale pods/PipelineRuns/imagestreams.
Docs: failure handling (retries, timeouts, dead-letter, partial
recovery) and infrastructure deployment guide.
@GuyZivRH GuyZivRH requested review from dmartinol and r2dedios April 23, 2026 08:50
@GuyZivRH (Collaborator, Author) left a comment

Summary

Solid infrastructure slice for APPENG-4910: RBAC extensions, layered network policies by LLM mode, quotas/PSS references, workspace + dead-letter PVCs, a daily cleanup CronJob, optional LiteLLM stack for Vertex mode, and two ops docs (infrastructure_ops.md, failure_handling.md). The PR description’s cluster verification checklist is unusually thorough and increases confidence.

Reviewer: Cursor / Composer (composer-2)


Blocking / should-fix

1. PipelineRun cleanup: --keep is a count, not days

Both scripts/cleanup.sh and the embedded script in config/storage/cleanup_cronjob.yaml use:

tkn pipelinerun delete ... --keep="${PIPELINERUN_AGE_DAYS}"

With default 7, Tekton CLI keeps the 7 most recent PipelineRuns and deletes the rest — it does not mean “older than 7 days”. The env var name and comments (“older than N days”) are therefore misleading and the behavior does not match the stated intent.

Suggestion: Use tkn pipelinerun delete --keep-since … (time-based, check your tkn version for units) or implement age-based deletion with oc/kubectl, and rename the variable (e.g. PIPELINERUN_KEEP_COUNT vs a real age parameter).
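A minimal age-based sketch of the suggested alternative, assuming GNU `date` on the CronJob image. The `oc` loop in the comments is illustrative only; the variable name mirrors the PR's, but the logic itself is hypothetical, not code from this repo:

```shell
#!/usr/bin/env bash
# Sketch: delete PipelineRuns older than N days (age-based, not count-based).
# Assumes GNU date; the oc invocation below is illustrative, not from this PR.
set -euo pipefail

PIPELINERUN_AGE_DAYS="${PIPELINERUN_AGE_DAYS:-7}"
cutoff_epoch="$(date -u -d "${PIPELINERUN_AGE_DAYS} days ago" +%s)"

# Succeeds when an RFC 3339 creationTimestamp is older than the cutoff.
is_expired() {
  [ "$(date -u -d "$1" +%s)" -lt "${cutoff_epoch}" ]
}

# On a real cluster this check would drive something like:
#   oc get pipelinerun -n "${NAMESPACE}" \
#     -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' \
#   | while read -r name created; do
#       is_expired "${created}" && oc delete pipelinerun "${name}" -n "${NAMESPACE}"
#     done

# Local demonstration of the age check itself:
is_expired "2020-01-01T00:00:00Z" && echo "old run: expired"
is_expired "$(date -u +%Y-%m-%dT%H:%M:%SZ)" || echo "fresh run: kept"
```

This keeps the "older than N days" semantics the variable name promises, independent of how busy the cluster is.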


Should-fix (docs / accuracy)

2. Docs/failure_handling.md vs actual pipeline.yaml

The doc tables describe per-task retries and per-task timeout on pipeline.yaml. On the current branch, pipeline/pipeline.yaml has spec.timeouts only (pipeline/tasks aggregate) and no per-task retries / timeout entries. The PR body already defers some of this until another PR lands; the doc should say “planned / target policy” or be gated on the YAML actually containing those fields, to avoid operators trusting settings that are not applied.


Minor / follow-ups

  • Two cleanup implementations: scripts/cleanup.sh and the ConfigMap-embedded script differ (e.g. pod age source, DEAD_LETTER_RETENTION_DAYS only in the repo script). Consider a single source of truth (e.g. ConfigMap generated from scripts/cleanup.sh in CI, or document that the CronJob is canonical).
  • LiteLLM image: ghcr.io/berriai/litellm:main-latest is a floating tag; pinning by digest or release tag improves reproducibility and supply-chain review.
  • config/litellm/secret_template.yaml: Placeholder text suggests base64 for GOOGLE_APPLICATION_CREDENTIALS_JSON while stringData normally holds raw JSON; align wording with infrastructure_ops.md (oc create secret --from-file).
  • LiteLLM serviceAccountName: pipeline: Works if the pipeline SA is intended to run the proxy, but a dedicated SA with minimal RBAC is often cleaner than reusing the pipeline SA that also has image-builder/trial permissions.
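On the secret_template.yaml point above, a minimal shell sketch of the stringData-vs-data distinction (the JSON value is a placeholder, not a real credential):

```shell
# Sketch of the Secret encoding distinction: `stringData` takes the raw value,
# `data` takes base64. The JSON here is a placeholder, not a real credential.
raw_json='{"type":"service_account","project_id":"example"}'

# What would belong under data: (base64-encoded)
encoded="$(printf '%s' "${raw_json}" | base64 -w0)"

# Kubernetes decodes data: back to the raw bytes, i.e. this round-trips:
decoded="$(printf '%s' "${encoded}" | base64 -d)"
[ "${decoded}" = "${raw_json}" ] && echo "round-trip ok"

# With stringData — or `oc create secret generic litellm-credentials \
#   --from-file=GOOGLE_APPLICATION_CREDENTIALS_JSON=sa.json` as the ops guide
# documents — no manual base64 step is needed; the raw JSON goes in directly.
```

Aligning the template's placeholder text with one of these two paths avoids double-encoded credentials.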

What looks good

  • Network policy split (default deny egress + mode-specific allow) is clear and documented with a selection table.
  • Direct-API policy using ipBlock with RFC1918 exceptions is a reasonable default for “HTTPS to the internet, not the cluster CIDRs.”
  • RBAC: Adding ConfigMap list/get and PVC create/delete on the trial-manager Role matches Tekton + cleanup needs described elsewhere.
  • CronJob hardening: non-root, dropped caps, resource limits, concurrencyPolicy: Forbid.
  • Ops guide: Apply order, optional LiteLLM path, and verification commands are practical.

Verdict: Comment — approve the direction of the PR, but please fix the tkn pipelinerun delete --keep semantics (and align failure_handling.md with what is actually in pipeline.yaml) before or immediately after merge.

@GuyZivRH (Collaborator, Author) commented:
Reviewed in detail — sharing findings from the manifests/scripts themselves.

  1. High — Dead-letter flow is documented but not implemented
  • Files: Docs/failure_handling.md, config/storage/cleanup_cronjob.yaml, scripts/cleanup.sh
  • Docs state failed-run artifacts are copied to abevalflow-dead-letter and retained via DEAD_LETTER_RETENTION_DAYS.
  • Implementation currently does not mount workspace/dead-letter PVCs in the cleanup job and has no copy/retention logic for dead-letter artifacts.
  • This creates a gap between documented recovery guarantees and runtime behavior.

Suggested fix:

  • Either implement dead-letter copy + retention (PVC mounts + copy/prune logic), or remove/defer this claim in docs until implemented.
  2. Medium — PipelineRun cleanup behavior is mismatched and likely skipped in default image
  • Files: config/storage/cleanup_cronjob.yaml, scripts/cleanup.sh, Docs/infrastructure_ops.md
  • Script uses tkn pipelinerun delete --keep="${PIPELINERUN_AGE_DAYS}".
  • --keep is a count, not an age duration, so PIPELINERUN_AGE_DAYS naming/behavior is inconsistent.
  • CronJob image is openshift/cli; when tkn is unavailable, PipelineRun cleanup is skipped.

Suggested fix:

  • Use a Tekton-capable image or age-based deletion via oc; align variable naming with actual semantics.
  3. Medium — NetworkPolicy annotation guidance conflicts with secure usage
  • Files: config/security/network_policy_direct_api.yaml, config/security/network_policy_litellm.yaml, config/security/network_policy_self_hosted.yaml, Docs/infrastructure_ops.md
  • Annotations in policy manifests say apply mode policy “INSTEAD OF default-deny”.
  • Docs correctly instruct default-deny + one mode-specific allow policy.
  • Following the annotation text alone can result in broader-than-intended egress.

Suggested fix:

  • Update annotations to match docs (apply default-deny first, then add one mode-specific allow policy).

Happy to re-review after updates.

@GuyZivRH (Collaborator, Author) left a comment
Review — APPENG-4910: Infrastructure manifests, network policies, LiteLLM, cleanup, and ops docs

Reviewed by: Claude Sonnet 4.6
Commit: aebfcc9

Good overall PR. The scope (RBAC, security, LiteLLM, storage, cleanup, ops docs) is well-organized across clearly named files, the manifests are labeled consistently, and the infrastructure guide is comprehensive. Cluster verification evidence in the PR description adds confidence.

Six Should Fix findings and five Minor ones below.


Should Fix

1. LiteLLM image uses a mutable tag

config/litellm/deployment.yaml:

image: ghcr.io/berriai/litellm:main-latest

main-latest is a rolling tag — it will silently change on the next push to the LiteLLM repo, breaking reproducibility and potentially introducing breaking changes. Pin to a specific release tag (e.g., ghcr.io/berriai/litellm:main-v1.67.0 or similar) consistent with the pinning convention used elsewhere in this project (ubi9 images are all pinned with digest-level tags).


2. LiteLLM Deployment uses the pipeline ServiceAccount

config/litellm/deployment.yaml:

serviceAccountName: pipeline

The pipeline ServiceAccount has broad permissions: create/delete Pods, get Secrets, create/delete PVCs, system:image-builder. A LiteLLM proxy only needs to run — it requires no Kubernetes API access at all. Using a dedicated minimal ServiceAccount (or the default SA with no extra bindings) removes an unnecessary privilege escalation path if the LiteLLM container is ever compromised.


3. secrets get verb in pipeline-trial-manager Role is unscoped

config/rbac.yaml:

- apiGroups: [""]
  resources: [secrets]
  verbs: [get]

This allows the pipeline ServiceAccount to get any Secret in the namespace, including litellm-credentials. Scope it to the specific secrets actually needed, e.g.:

- apiGroups: [""]
  resources: [secrets]
  verbs: [get]
  resourceNames: [ab-eval-db-credentials]

4. Dead-letter artifact copy is documented but not implemented

Docs/failure_handling.md (lines 60-66) states:

Failed run artifacts are copied to the abevalflow-dead-letter PVC by the cleanup CronJob

Neither the CronJob's inline script nor scripts/cleanup.sh contains any copy-to-dead-letter logic — both only delete or prune. The abevalflow-dead-letter PVC will stay empty. Either implement the copy step or revise the failure-handling doc to describe the PVC as "reserved for future use / manual artifact staging."


5. tkn pipelinerun delete --keep is count-based, not age-based

Both the CronJob inline script and scripts/cleanup.sh:

tkn pipelinerun delete -n "${NAMESPACE}" --keep="${PIPELINERUN_AGE_DAYS}" --force

The --keep flag retains the N most recent PipelineRuns by count, not by age. With PIPELINERUN_AGE_DAYS=7 it keeps the 7 most recent runs regardless of how old they are. On a busy cluster this could delete 8-hour-old runs; on an idle cluster it could keep 3-month-old runs. The variable name PIPELINERUN_AGE_DAYS is doubly misleading. Either use the correct flag (if tkn supports age-based deletion) or rename the variable to PIPELINERUN_KEEP_COUNT and update the ops docs.


6. Network policy annotation contradicts the ops guide

config/security/network_policy_litellm.yaml (and direct_api, self_hosted):

abevalflow/note: >-
  Apply this policy INSTEAD OF default-deny when using Vertex AI mode.

Docs/infrastructure_ops.md says:

Always apply the default-deny policy first, then add the mode-specific allow policy.

The annotation promotes the weaker single-policy approach. NetworkPolicies are additive (union), so applying default-deny + litellm is functionally identical to applying litellm alone — but the two-policy setup from the ops guide is safer: if the mode-specific policy is accidentally deleted, the default-deny remains as a backstop. The annotation on all three mode-specific policies should say "Apply IN ADDITION TO default-deny" to match the ops guide and the safer practice.


Minor

7. Two divergent implementations of the cleanup script

The CronJob (config/storage/cleanup_cronjob.yaml) embeds a copy of the cleanup script in an inline ConfigMap, while scripts/cleanup.sh exists as a separate file with different pod-age logic (different jsonpath query, different Python date handling). They will inevitably drift. Consider one of:

  • Remove scripts/cleanup.sh if the CronJob's inline version is canonical.
  • Have the CronJob mount scripts/cleanup.sh from a ConfigMap that is generated from the actual file (with a build step), keeping one source of truth.

8. pod_security.yaml filename suggests enforcement, content is documentation

config/security/pod_security.yaml is a ConfigMap of reference values — it enforces nothing on its own. The file comment is clear about this, but the filename could mislead operators into thinking a PodSecurityPolicy or SCC has been applied. Consider renaming to pod_security_reference.yaml or trial_pod_security_reference.yaml to match the ConfigMap's own metadata.name (trial-pod-security-reference).


9. DEAD_LETTER_RETENTION_DAYS defined but never wired to the CronJob

scripts/cleanup.sh declares DEAD_LETTER_RETENTION_DAYS="${DEAD_LETTER_RETENTION_DAYS:-14}" but this variable is absent from the CronJob's env block and, as noted in finding #4, nothing actually uses it. Remove the variable or wire it through once the dead-letter copy step is implemented.


10. Exit-code convention documented but not yet adopted

Docs/failure_handling.md (lines 48-56) documents a sys.exit(2) convention for non-retryable user errors. None of the existing scripts (validate.py, scaffold.py, analyze.py, store_results.py, query_results.py) implement it — they all exit with 1 or rely on uncaught exceptions. This doesn't block the PR (it's a forward-looking convention) but a follow-up ticket should be opened to retrofit the convention so Tekton's retry logic can distinguish transient from user errors.
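For reference, the documented convention in miniature — an illustrative sketch of the exit-code mapping, not the project's actual implementation:

```shell
# Illustrative sketch of the documented exit-code convention: 0 = success,
# 2 = non-retryable user error, anything else = transient (retryable).
# Not code from this repo — just the convention in miniature.
classify_exit() {
  case "$1" in
    0) echo "success" ;;
    2) echo "user-error: do not retry" ;;
    *) echo "transient: eligible for Tekton retry" ;;
  esac
}

classify_exit 0   # success
classify_exit 2   # user-error: do not retry
classify_exit 1   # transient: eligible for Tekton retry
```

Once the scripts adopt this, Tekton retry policies (or a wrapper step) can branch on the exit code instead of retrying every failure.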


11. Self-hosted network policy hardcodes port 8080

config/security/network_policy_self_hosted.yaml:

ports:
  - protocol: TCP
    port: 8080

Common self-hosted inference servers use different ports: vLLM defaults to 8000, Ollama to 11434, TGI to 80. The comment says to adjust the selector, but not the port. Add a note alongside the port that it should also be adjusted to match the actual server.


Not a bug — confirmed correct

  • ResourceQuota limits (50 pods, 32 CPU, 64Gi) are consistent with 40 concurrent trial pods + overhead ✅
  • egress: [] in default-deny correctly denies all egress from abevalflow/role: trial pods ✅
  • LiteLLM NetworkPolicy targets pods by app.kubernetes.io/name: litellm — matches the Deployment's pod labels ✅
  • concurrencyPolicy: Forbid on the CronJob prevents overlapping cleanup runs ✅
  • Dead-letter PVC accessModes: ReadWriteOnce is fine; the cleanup job will be the only writer ✅

@GuyZivRH (Collaborator, Author) commented:

Model used for this review: Codex 5.3.

@GuyZivRH (Collaborator, Author) commented:

Model: composer-2 (Composer)

(Follow-up: naming the model explicitly for the review comment above.)

- Rename PIPELINERUN_AGE_DAYS to PIPELINERUN_KEEP_COUNT (count semantics)
- Clarify dead-letter PVC as reserved for future use, not yet automated
- Mark per-task retries/timeouts as target policy (not yet applied)
- Change network policy annotations from INSTEAD OF to IN ADDITION TO
- Pin LiteLLM image to main-v1.82.6 instead of main-latest
- Add dedicated litellm ServiceAccount instead of reusing pipeline SA
- Scope secrets get verb with resourceNames in RBAC
- Remove inline cleanup script from CronJob ConfigMap (single source)
- Rename pod_security.yaml to pod_security_reference.yaml
- Add port guidance note to self-hosted network policy
- resourceNames: llm-credentials -> litellm-credentials (matches
  config/litellm/secret_template.yaml)
- Replace empty data: {} ConfigMap stub with a placeholder script that
  exits 1 with instructions, preventing silent broken-state on oc apply
…rite

The placeholder ConfigMap was reapplied by oc apply, overwriting the
real script created in the prior step. Removed entirely — the ConfigMap
is now managed only via oc create configmap --from-file as documented.
@r2dedios left a comment
LGTM!

@r2dedios left a comment

Please take a look. Nice job

Comment thread config/rbac.yaml
Comment on lines +49 to +50
resources: [persistentvolumeclaims]
verbs: [get, list, create, delete]

Does the SA really need permission to delete PVCs?

Comment thread (network policy, egress section):

egress:
  # DNS resolution
  - to:

Limiting DNS queries to the cluster's DNS server might be a nice improvement.

My suggestion:

  - to:
      - namespaceSelector:
          matchLabels:
            name: openshift-dns  # adjust for your cluster
      ports:
        - protocol: UDP
          port: 53

@GuyZivRH GuyZivRH merged commit c2dc34e into main Apr 26, 2026