
feat: APPENG-4910 — Infrastructure manifests, network policies, LiteLLM, cleanup, and ops docs #13

Merged
GuyZivRH merged 4 commits into main from APPENG-4910/infra-ops on Apr 26, 2026
Conversation

@GuyZivRH (Collaborator) commented Apr 23, 2026

Summary

  • RBAC: Explicit ServiceAccount, added configmaps and PVC permissions to pipeline Role
  • Security: Pod security context reference, namespace ResourceQuota (50 pods, 32 CPU, 64Gi), four network policies by LLM mode (default-deny, direct-api, litellm, self-hosted)
  • LiteLLM: HA Deployment (2 replicas), Service, ConfigMap, Secret template for Vertex AI mode
  • Storage: Workspace PVC (5Gi), dead-letter PVC (2Gi) for failed run artifacts
  • Cleanup: Daily CronJob removing stale trial pods (>24h), old PipelineRuns (>7d), empty ImageStreams
  • Docs: Failure handling reference (retries, timeouts, exit codes, dead-letter, partial recovery) and infrastructure deployment guide with verification steps

Deferred (follow-up)

Verification (tested on OpenShift cluster)

All manifests applied to ab-eval-flow namespace and verified:

  • RBAC: ServiceAccount pipeline exists; oc auth can-i confirmed create pods, get secrets, create PVCs, get configmaps — all yes
  • Network policies: 2 policies active (default-deny + direct-api), targeting abevalflow/role=trial pods
  • Resource quota: Applied — 0/50 pods, 0/32 CPU, 2/10 PVCs used
  • Workspace PVC (5Gi): Bound
  • Dead-letter PVC (2Gi): Bound
  • Cleanup CronJob: Scheduled (daily 03:00 UTC), not yet triggered
  • EventListener: Ready, route live at el-submission-listener-ab-eval-flow.apps.appeng.clusters.se-apps.redhat.com

…p, and ops docs

RBAC: explicit ServiceAccount, configmaps and PVC permissions.
Security: pod security reference, resource quota, four network
policies (default-deny, direct-api, litellm, self-hosted).
LiteLLM: HA deployment, service, config, and secret template for
Vertex AI mode. Storage: workspace and dead-letter PVCs, daily
cleanup CronJob for stale pods/PipelineRuns/imagestreams.
Docs: failure handling (retries, timeouts, dead-letter, partial
recovery) and infrastructure deployment guide.
@GuyZivRH GuyZivRH requested review from dmartinol and r2dedios April 23, 2026 08:50
@GuyZivRH (Collaborator, Author) left a comment

Summary

Solid infrastructure slice for APPENG-4910: RBAC extensions, layered network policies by LLM mode, quotas/PSS references, workspace + dead-letter PVCs, a daily cleanup CronJob, optional LiteLLM stack for Vertex mode, and two ops docs (infrastructure_ops.md, failure_handling.md). The PR description’s cluster verification checklist is unusually thorough and increases confidence.

Reviewer: Cursor / Composer (composer-2)


Blocking / should-fix

1. PipelineRun cleanup: --keep is a count, not days

Both scripts/cleanup.sh and the embedded script in config/storage/cleanup_cronjob.yaml use:

tkn pipelinerun delete ... --keep="${PIPELINERUN_AGE_DAYS}"

With default 7, Tekton CLI keeps the 7 most recent PipelineRuns and deletes the rest — it does not mean “older than 7 days”. The env var name and comments (“older than N days”) are therefore misleading and the behavior does not match the stated intent.

Suggestion: Use tkn pipelinerun delete --keep-since … (time-based, check your tkn version for units) or implement age-based deletion with oc/kubectl, and rename the variable (e.g. PIPELINERUN_KEEP_COUNT vs a real age parameter).
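A minimal age-based sketch of the suggested alternative, assuming GNU `date` on the CronJob image. The `oc` loop in the comments is illustrative only; the variable name mirrors the PR's, but the logic itself is hypothetical, not code from this repo:

```shell
#!/usr/bin/env bash
# Sketch: delete PipelineRuns older than N days (age-based, not count-based).
# Assumes GNU date; the oc invocation below is illustrative, not from this PR.
set -euo pipefail

PIPELINERUN_AGE_DAYS="${PIPELINERUN_AGE_DAYS:-7}"
cutoff_epoch="$(date -u -d "${PIPELINERUN_AGE_DAYS} days ago" +%s)"

# Succeeds when an RFC 3339 creationTimestamp is older than the cutoff.
is_expired() {
  [ "$(date -u -d "$1" +%s)" -lt "${cutoff_epoch}" ]
}

# On a real cluster this check would drive something like:
#   oc get pipelinerun -n "${NAMESPACE}" \
#     -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' \
#   | while read -r name created; do
#       is_expired "${created}" && oc delete pipelinerun "${name}" -n "${NAMESPACE}"
#     done

# Local demonstration of the age check itself:
is_expired "2020-01-01T00:00:00Z" && echo "old run: expired"
is_expired "$(date -u +%Y-%m-%dT%H:%M:%SZ)" || echo "fresh run: kept"
```

This keeps the "older than N days" semantics the variable name promises, independent of how busy the cluster is.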


Should-fix (docs / accuracy)

2. Docs/failure_handling.md vs actual pipeline.yaml

The doc tables describe per-task retries and per-task timeout on pipeline.yaml. On the current branch, pipeline/pipeline.yaml has spec.timeouts only (pipeline/tasks aggregate) and no per-task retries / timeout entries. The PR body already defers some of this until another PR lands; the doc should say “planned / target policy” or be gated on the YAML actually containing those fields, to avoid operators trusting settings that are not applied.


Minor / follow-ups

  • Two cleanup implementations: scripts/cleanup.sh and the ConfigMap-embedded script differ (e.g. pod age source, DEAD_LETTER_RETENTION_DAYS only in the repo script). Consider a single source of truth (e.g. ConfigMap generated from scripts/cleanup.sh in CI, or document that the CronJob is canonical).
  • LiteLLM image: ghcr.io/berriai/litellm:main-latest is a floating tag; pinning by digest or release tag improves reproducibility and supply-chain review.
  • config/litellm/secret_template.yaml: Placeholder text suggests base64 for GOOGLE_APPLICATION_CREDENTIALS_JSON while stringData normally holds raw JSON; align wording with infrastructure_ops.md (oc create secret --from-file).
  • LiteLLM serviceAccountName: pipeline: Works if the pipeline SA is intended to run the proxy, but a dedicated SA with minimal RBAC is often cleaner than reusing the pipeline SA that also has image-builder/trial permissions.
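On the secret_template.yaml point above, a minimal shell sketch of the stringData-vs-data distinction (the JSON value is a placeholder, not a real credential):

```shell
# Sketch of the Secret encoding distinction: `stringData` takes the raw value,
# `data` takes base64. The JSON here is a placeholder, not a real credential.
raw_json='{"type":"service_account","project_id":"example"}'

# What would belong under data: (base64-encoded)
encoded="$(printf '%s' "${raw_json}" | base64 -w0)"

# Kubernetes decodes data: back to the raw bytes, i.e. this round-trips:
decoded="$(printf '%s' "${encoded}" | base64 -d)"
[ "${decoded}" = "${raw_json}" ] && echo "round-trip ok"

# With stringData — or `oc create secret generic litellm-credentials \
#   --from-file=GOOGLE_APPLICATION_CREDENTIALS_JSON=sa.json` as the ops guide
# documents — no manual base64 step is needed; the raw JSON goes in directly.
```

Aligning the template's placeholder text with one of these two paths avoids double-encoded credentials.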

What looks good

  • Network policy split (default deny egress + mode-specific allow) is clear and documented with a selection table.
  • Direct-API policy using ipBlock with RFC1918 exceptions is a reasonable default for “HTTPS to the internet, not the cluster CIDRs.”
  • RBAC: Adding ConfigMap list/get and PVC create/delete on the trial-manager Role matches Tekton + cleanup needs described elsewhere.
  • CronJob hardening: non-root, dropped caps, resource limits, concurrencyPolicy: Forbid.
  • Ops guide: Apply order, optional LiteLLM path, and verification commands are practical.

Verdict: Comment — approve the direction of the PR, but please fix the tkn pipelinerun delete --keep semantics (and align failure_handling.md with what is actually in pipeline.yaml) before or immediately after merge.

@GuyZivRH (Collaborator, Author) commented:
Reviewed in detail — sharing findings from the manifests/scripts themselves.

  1. High — Dead-letter flow is documented but not implemented
  • Files: Docs/failure_handling.md, config/storage/cleanup_cronjob.yaml, scripts/cleanup.sh
  • Docs state failed-run artifacts are copied to abevalflow-dead-letter and retained via DEAD_LETTER_RETENTION_DAYS.
  • Implementation currently does not mount workspace/dead-letter PVCs in the cleanup job and has no copy/retention logic for dead-letter artifacts.
  • This creates a gap between documented recovery guarantees and runtime behavior.

Suggested fix:

  • Either implement dead-letter copy + retention (PVC mounts + copy/prune logic), or remove/defer this claim in docs until implemented.
  2. Medium — PipelineRun cleanup behavior is mismatched and likely skipped in default image
  • Files: config/storage/cleanup_cronjob.yaml, scripts/cleanup.sh, Docs/infrastructure_ops.md
  • Script uses tkn pipelinerun delete --keep="${PIPELINERUN_AGE_DAYS}".
  • --keep is a count, not an age duration, so PIPELINERUN_AGE_DAYS naming/behavior is inconsistent.
  • CronJob image is openshift/cli; when tkn is unavailable, PipelineRun cleanup is skipped.

Suggested fix:

  • Use a Tekton-capable image or age-based deletion via oc; align variable naming with actual semantics.
  3. Medium — NetworkPolicy annotation guidance conflicts with secure usage
  • Files: config/security/network_policy_direct_api.yaml, config/security/network_policy_litellm.yaml, config/security/network_policy_self_hosted.yaml, Docs/infrastructure_ops.md
  • Annotations in policy manifests say apply mode policy “INSTEAD OF default-deny”.
  • Docs correctly instruct default-deny + one mode-specific allow policy.
  • Following the annotation text alone can result in broader-than-intended egress.

Suggested fix:

  • Update annotations to match docs (apply default-deny first, then add one mode-specific allow policy).

Happy to re-review after updates.

@GuyZivRH (Collaborator, Author) left a comment
Review — APPENG-4910: Infrastructure manifests, network policies, LiteLLM, cleanup, and ops docs

Reviewed by: Claude Sonnet 4.6
Commit: aebfcc9

Good overall PR. The scope (RBAC, security, LiteLLM, storage, cleanup, ops docs) is well-organized across clearly named files, the manifests are labeled consistently, and the infrastructure guide is comprehensive. Cluster verification evidence in the PR description adds confidence.

Six Should Fix findings and five Minor ones below.


Should Fix

1. LiteLLM image uses a mutable tag

config/litellm/deployment.yaml:

image: ghcr.io/berriai/litellm:main-latest

main-latest is a rolling tag — it will silently change on the next push to the LiteLLM repo, breaking reproducibility and potentially introducing breaking changes. Pin to a specific release tag (e.g., ghcr.io/berriai/litellm:main-v1.67.0 or similar) consistent with the pinning convention used elsewhere in this project (ubi9 images are all pinned with digest-level tags).


2. LiteLLM Deployment uses the pipeline ServiceAccount

config/litellm/deployment.yaml:

serviceAccountName: pipeline

The pipeline ServiceAccount has broad permissions: create/delete Pods, get Secrets, create/delete PVCs, system:image-builder. A LiteLLM proxy only needs to run — it requires no Kubernetes API access at all. Using a dedicated minimal ServiceAccount (or the default SA with no extra bindings) removes an unnecessary privilege escalation path if the LiteLLM container is ever compromised.


3. secrets get verb in pipeline-trial-manager Role is unscoped

config/rbac.yaml:

- apiGroups: [""]
  resources: [secrets]
  verbs: [get]

This allows the pipeline ServiceAccount to get any Secret in the namespace, including litellm-credentials. Scope it to the specific secrets actually needed, e.g.:

- apiGroups: [""]
  resources: [secrets]
  verbs: [get]
  resourceNames: [ab-eval-db-credentials]

4. Dead-letter artifact copy is documented but not implemented

Docs/failure_handling.md (lines 60-66) states:

Failed run artifacts are copied to the abevalflow-dead-letter PVC by the cleanup CronJob

Neither the CronJob's inline script nor scripts/cleanup.sh contains any copy-to-dead-letter logic — both only delete or prune. The abevalflow-dead-letter PVC will stay empty. Either implement the copy step or revise the failure-handling doc to describe the PVC as "reserved for future use / manual artifact staging."


5. tkn pipelinerun delete --keep is count-based, not age-based

Both the CronJob inline script and scripts/cleanup.sh:

tkn pipelinerun delete -n "${NAMESPACE}" --keep="${PIPELINERUN_AGE_DAYS}" --force

The --keep flag retains the N most recent PipelineRuns by count, not by age. With PIPELINERUN_AGE_DAYS=7 it keeps the 7 most recent runs regardless of how old they are. On a busy cluster this could delete 8-hour-old runs; on an idle cluster it could keep 3-month-old runs. The variable name PIPELINERUN_AGE_DAYS is doubly misleading. Either use the correct flag (if tkn supports age-based deletion) or rename the variable to PIPELINERUN_KEEP_COUNT and update the ops docs.


6. Network policy annotation contradicts the ops guide

config/security/network_policy_litellm.yaml (and direct_api, self_hosted):

abevalflow/note: >-
  Apply this policy INSTEAD OF default-deny when using Vertex AI mode.

Docs/infrastructure_ops.md says:

Always apply the default-deny policy first, then add the mode-specific allow policy.

The annotation promotes the weaker single-policy approach. NetworkPolicies are additive (union), so applying default-deny + litellm is functionally identical to applying litellm alone — but the two-policy setup from the ops guide is safer: if the mode-specific policy is accidentally deleted, the default-deny remains as a backstop. The annotation on all three mode-specific policies should say "Apply IN ADDITION TO default-deny" to match the ops guide and the safer practice.


Minor

7. Two divergent implementations of the cleanup script

The CronJob (config/storage/cleanup_cronjob.yaml) embeds a copy of the cleanup script in an inline ConfigMap, while scripts/cleanup.sh exists as a separate file with different pod-age logic (different jsonpath query, different Python date handling). They will inevitably drift. Consider one of:

  • Remove scripts/cleanup.sh if the CronJob's inline version is canonical.
  • Have the CronJob mount scripts/cleanup.sh from a ConfigMap that is generated from the actual file (with a build step), keeping one source of truth.

8. pod_security.yaml filename suggests enforcement, content is documentation

config/security/pod_security.yaml is a ConfigMap of reference values — it enforces nothing on its own. The file comment is clear about this, but the filename could mislead operators into thinking a PodSecurityPolicy or SCC has been applied. Consider renaming to pod_security_reference.yaml or trial_pod_security_reference.yaml to match the ConfigMap's own metadata.name (trial-pod-security-reference).


9. DEAD_LETTER_RETENTION_DAYS defined but never wired to the CronJob

scripts/cleanup.sh declares DEAD_LETTER_RETENTION_DAYS="${DEAD_LETTER_RETENTION_DAYS:-14}" but this variable is absent from the CronJob's env block and, as noted in finding #4, nothing actually uses it. Remove the variable or wire it through once the dead-letter copy step is implemented.


10. Exit-code convention documented but not yet adopted

Docs/failure_handling.md (lines 48-56) documents a sys.exit(2) convention for non-retryable user errors. None of the existing scripts (validate.py, scaffold.py, analyze.py, store_results.py, query_results.py) implement it — they all exit with 1 or rely on uncaught exceptions. This doesn't block the PR (it's a forward-looking convention) but a follow-up ticket should be opened to retrofit the convention so Tekton's retry logic can distinguish transient from user errors.
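For reference, the documented convention in miniature — an illustrative sketch of the exit-code mapping, not the project's actual implementation:

```shell
# Illustrative sketch of the documented exit-code convention: 0 = success,
# 2 = non-retryable user error, anything else = transient (retryable).
# Not code from this repo — just the convention in miniature.
classify_exit() {
  case "$1" in
    0) echo "success" ;;
    2) echo "user-error: do not retry" ;;
    *) echo "transient: eligible for Tekton retry" ;;
  esac
}

classify_exit 0   # success
classify_exit 2   # user-error: do not retry
classify_exit 1   # transient: eligible for Tekton retry
```

Once the scripts adopt this, Tekton retry policies (or a wrapper step) can branch on the exit code instead of retrying every failure.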


11. Self-hosted network policy hardcodes port 8080

config/security/network_policy_self_hosted.yaml:

ports:
  - protocol: TCP
    port: 8080

Common self-hosted inference servers use different ports: vLLM defaults to 8000, Ollama to 11434, TGI to 80. The comment says to adjust the selector, but not the port. Add a note alongside the port that it should also be adjusted to match the actual server.


Not a bug — confirmed correct

  • ResourceQuota limits (50 pods, 32 CPU, 64Gi) are consistent with 40 concurrent trial pods + overhead ✅
  • egress: [] in default-deny correctly denies all egress from abevalflow/role: trial pods ✅
  • LiteLLM NetworkPolicy targets pods by app.kubernetes.io/name: litellm — matches the Deployment's pod labels ✅
  • concurrencyPolicy: Forbid on the CronJob prevents overlapping cleanup runs ✅
  • Dead-letter PVC accessModes: ReadWriteOnce is fine; the cleanup job will be the only writer ✅

@GuyZivRH (Collaborator, Author) commented:

Model used for this review: Codex 5.3.

@GuyZivRH (Collaborator, Author) commented:

Model: composer-2 (Composer)

(Follow-up: naming the model explicitly for the review comment above.)

- Rename PIPELINERUN_AGE_DAYS to PIPELINERUN_KEEP_COUNT (count semantics)
- Clarify dead-letter PVC as reserved for future use, not yet automated
- Mark per-task retries/timeouts as target policy (not yet applied)
- Change network policy annotations from INSTEAD OF to IN ADDITION TO
- Pin LiteLLM image to main-v1.82.6 instead of main-latest
- Add dedicated litellm ServiceAccount instead of reusing pipeline SA
- Scope secrets get verb with resourceNames in RBAC
- Remove inline cleanup script from CronJob ConfigMap (single source)
- Rename pod_security.yaml to pod_security_reference.yaml
- Add port guidance note to self-hosted network policy
- resourceNames: llm-credentials -> litellm-credentials (matches
  config/litellm/secret_template.yaml)
- Replace empty data: {} ConfigMap stub with a placeholder script that
  exits 1 with instructions, preventing silent broken-state on oc apply
…rite

The placeholder ConfigMap was reapplied by oc apply, overwriting the
real script created in the prior step. Removed entirely — the ConfigMap
is now managed only via oc create configmap --from-file as documented.
@r2dedios left a comment
LGTM!

@r2dedios left a comment

Please take a look. Nice job

Comment thread config/rbac.yaml
Comment on lines +49 to +50
resources: [persistentvolumeclaims]
verbs: [get, list, create, delete]

Does the SA really need permission to delete PVCs?

Comment thread (network policy, egress section):

egress:
  # DNS resolution
  - to:

Limiting DNS queries to the cluster's DNS server might be a nice improvement.

My suggestion:

  - to:
      - namespaceSelector:
          matchLabels:
            name: openshift-dns  # adjust for your cluster
      ports:
        - protocol: UDP
          port: 53

@GuyZivRH GuyZivRH merged commit c2dc34e into main Apr 26, 2026