Conversation
…p, and ops docs

- RBAC: explicit ServiceAccount, ConfigMap and PVC permissions.
- Security: pod security reference, resource quota, four network policies (default-deny, direct-api, litellm, self-hosted).
- LiteLLM: HA deployment, service, config, and secret template for Vertex AI mode.
- Storage: workspace and dead-letter PVCs, daily cleanup CronJob for stale pods/PipelineRuns/imagestreams.
- Docs: failure handling (retries, timeouts, dead-letter, partial recovery) and infrastructure deployment guide.
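For context while reading the findings below, a minimal sketch of the default-deny egress policy, reconstructed from details quoted in the reviews (the `abevalflow/role: trial` selector and the `egress: []` shape); treat it as illustrative, not the PR's actual manifest:

```yaml
# Illustrative reconstruction, not the PR's manifest: deny all egress
# from trial pods; mode-specific policies then add allowed destinations.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress   # name assumed for illustration
  namespace: ab-eval-flow
spec:
  podSelector:
    matchLabels:
      abevalflow/role: trial
  policyTypes:
    - Egress
  egress: []                  # empty rule list: all egress denied
```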
GuyZivRH
left a comment
Summary
Solid infrastructure slice for APPENG-4910: RBAC extensions, layered network policies by LLM mode, quotas/PSS references, workspace + dead-letter PVCs, a daily cleanup CronJob, optional LiteLLM stack for Vertex mode, and two ops docs (infrastructure_ops.md, failure_handling.md). The PR description’s cluster verification checklist is unusually thorough and increases confidence.
Reviewer: Cursor / Composer (composer-2)
Blocking / should-fix
1. PipelineRun cleanup: --keep is a count, not days
Both scripts/cleanup.sh and the embedded script in config/storage/cleanup_cronjob.yaml use:
tkn pipelinerun delete ... --keep="${PIPELINERUN_AGE_DAYS}"
With default 7, Tekton CLI keeps the 7 most recent PipelineRuns and deletes the rest — it does not mean “older than 7 days”. The env var name and comments (“older than N days”) are therefore misleading and the behavior does not match the stated intent.
Suggestion: Use tkn pipelinerun delete --keep-since … (time-based, check your tkn version for units) or implement age-based deletion with oc/kubectl, and rename the variable (e.g. PIPELINERUN_KEEP_COUNT vs a real age parameter).
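As a hedged sketch of the age-based alternative, assuming GNU date and oc are available in the cleanup image (variable names follow the existing script):

```bash
# Delete PipelineRuns older than N days (sketch; assumes GNU date and oc).
NAMESPACE="${NAMESPACE:-ab-eval-flow}"
PIPELINERUN_AGE_DAYS="${PIPELINERUN_AGE_DAYS:-7}"
cutoff=$(date -u -d "${PIPELINERUN_AGE_DAYS} days ago" +%s)

oc get pipelineruns -n "${NAMESPACE}" \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}' |
while read -r name created; do
  # creationTimestamp is RFC3339, which GNU date parses directly
  if [ "$(date -u -d "${created}" +%s)" -lt "${cutoff}" ]; then
    oc delete pipelinerun "${name}" -n "${NAMESPACE}"
  fi
done
```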
Should-fix (docs / accuracy)
2. Docs/failure_handling.md vs actual pipeline.yaml
The doc tables describe per-task retries and per-task timeout on pipeline.yaml. On the current branch, pipeline/pipeline.yaml has spec.timeouts only (pipeline/tasks aggregate) and no per-task retries / timeout entries. The PR body already defers some of this until another PR lands; the doc should say “planned / target policy” or be gated on the YAML actually containing those fields, to avoid operators trusting settings that are not applied.
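If the doc is reframed as target policy, the eventual pipeline.yaml change would look roughly like this (task name and values are illustrative; `retries` and `timeout` are standard Tekton PipelineTask fields):

```yaml
spec:
  tasks:
    - name: run-trial     # illustrative task name
      retries: 2          # per-task retries (not yet on this branch)
      timeout: "30m"      # per-task timeout (not yet on this branch)
      taskRef:
        name: run-trial
```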
Minor / follow-ups
- Two cleanup implementations: `scripts/cleanup.sh` and the ConfigMap-embedded script differ (e.g. pod age source, `DEAD_LETTER_RETENTION_DAYS` only in the repo script). Consider a single source of truth (e.g. a ConfigMap generated from `scripts/cleanup.sh` in CI, or document that the CronJob is canonical).
- LiteLLM image: `ghcr.io/berriai/litellm:main-latest` is a floating tag; pinning by digest or release tag improves reproducibility and supply-chain review.
- `config/litellm/secret_template.yaml`: placeholder text suggests base64 for `GOOGLE_APPLICATION_CREDENTIALS_JSON`, while `stringData` normally holds raw JSON; align the wording with `infrastructure_ops.md` (`oc create secret --from-file`).
- LiteLLM `serviceAccountName: pipeline`: works if the pipeline SA is intended to run the proxy, but a dedicated SA with minimal RBAC is often cleaner than reusing the pipeline SA, which also has image-builder/trial permissions.
What looks good
- Network policy split (default deny egress + mode-specific allow) is clear and documented with a selection table.
- Direct-API policy using `ipBlock` with RFC1918 exceptions is a reasonable default for "HTTPS to the internet, not the cluster CIDRs" (see the sketch after this list).
- RBAC: adding ConfigMap list/get and PVC create/delete on the trial-manager Role matches the Tekton + cleanup needs described elsewhere.
- CronJob hardening: non-root, dropped caps, resource limits, `concurrencyPolicy: Forbid`.
- Ops guide: apply order, optional LiteLLM path, and verification commands are practical.
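For reference, a minimal sketch of that direct-API egress shape (CIDRs are the standard RFC1918 ranges; the pod selector is omitted):

```yaml
egress:
  - to:
      - ipBlock:
          cidr: 0.0.0.0/0
          except:            # keep cluster-internal RFC1918 ranges blocked
            - 10.0.0.0/8
            - 172.16.0.0/12
            - 192.168.0.0/16
    ports:
      - protocol: TCP
        port: 443            # HTTPS only
```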
Verdict: Comment — approve the direction of the PR, but please fix the tkn pipelinerun delete --keep semantics (and align failure_handling.md with what is actually in pipeline.yaml) before or immediately after merge.
Reviewed in detail — sharing findings from the manifests/scripts themselves.
Happy to re-review after updates.
GuyZivRH
left a comment
Review — APPENG-4910: Infrastructure manifests, network policies, LiteLLM, cleanup, and ops docs
Reviewed by: Claude Sonnet 4.6
Commit: aebfcc9
Good overall PR. The scope (RBAC, security, LiteLLM, storage, cleanup, ops docs) is well-organized across clearly named files, the manifests are labeled consistently, and the infrastructure guide is comprehensive. Cluster verification evidence in the PR description adds confidence.
Six Should Fix findings and five Minor ones below.
Should Fix
1. LiteLLM image uses a mutable tag
config/litellm/deployment.yaml:

    image: ghcr.io/berriai/litellm:main-latest

`main-latest` is a rolling tag: it will silently change on the next push to the LiteLLM repo, breaking reproducibility and potentially introducing breaking changes. Pin to a specific release tag (e.g., `ghcr.io/berriai/litellm:main-v1.67.0` or similar), consistent with the pinning convention used elsewhere in this project (the ubi9 images are all pinned with digest-level tags).
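For reference, the digest-pinned form would be (the digest is a placeholder to resolve from the registry, e.g. with skopeo inspect):

```yaml
# Sketch only: <digest> is a placeholder, not a real value.
image: ghcr.io/berriai/litellm@sha256:<digest>
```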
2. LiteLLM Deployment uses the pipeline ServiceAccount
config/litellm/deployment.yaml:

    serviceAccountName: pipeline

The pipeline ServiceAccount has broad permissions: create/delete Pods, get Secrets, create/delete PVCs, system:image-builder. A LiteLLM proxy only needs to run; it requires no Kubernetes API access at all. Using a dedicated minimal ServiceAccount (or the default SA with no extra bindings) removes an unnecessary privilege escalation path if the LiteLLM container is ever compromised.
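A dedicated SA could be as small as this sketch (the name is illustrative; `automountServiceAccountToken: false` reflects that the proxy needs no API access), referenced via `serviceAccountName: litellm` in the Deployment:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litellm                        # illustrative name
  namespace: ab-eval-flow
automountServiceAccountToken: false    # proxy needs no Kubernetes API access
```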
3. secrets get verb in pipeline-trial-manager Role is unscoped
config/rbac.yaml:
- apiGroups: [""]
resources: [secrets]
verbs: [get]This allows the pipeline ServiceAccount to get any Secret in the namespace, including litellm-credentials. Scope it to the specific secrets actually needed, e.g.:
- apiGroups: [""]
resources: [secrets]
verbs: [get]
resourceNames: [ab-eval-db-credentials]4. Dead-letter artifact copy is documented but not implemented
Docs/failure_handling.md (lines 60-66) states:

> Failed run artifacts are copied to the `abevalflow-dead-letter` PVC by the cleanup CronJob

Neither the CronJob's inline script nor scripts/cleanup.sh contains any copy-to-dead-letter logic; both only delete or prune. The abevalflow-dead-letter PVC will stay empty. Either implement the copy step or revise the failure-handling doc to describe the PVC as "reserved for future use / manual artifact staging."
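If the copy step is implemented, it could take roughly this shape; the mount paths and per-run directory layout below are assumptions, not taken from the PR:

```bash
# Hypothetical dead-letter copy step (sketch). Assumes both PVCs are
# mounted in the cleanup pod and artifacts are grouped per PipelineRun.
WORKSPACE_DIR="${WORKSPACE_DIR:-/workspace}"        # assumed mount point
DEAD_LETTER_DIR="${DEAD_LETTER_DIR:-/dead-letter}"  # assumed mount point

for run in $(oc get pipelineruns -n "${NAMESPACE}" -o name); do
  name="${run#*/}"  # strip the pipelinerun.tekton.dev/ prefix
  succeeded=$(oc get "${run}" -n "${NAMESPACE}" \
    -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}')
  if [ "${succeeded}" = "False" ] && [ -d "${WORKSPACE_DIR}/${name}" ]; then
    mkdir -p "${DEAD_LETTER_DIR}/${name}"
    cp -a "${WORKSPACE_DIR}/${name}/." "${DEAD_LETTER_DIR}/${name}/"
  fi
done
```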
5. tkn pipelinerun delete --keep is count-based, not age-based
Both the CronJob inline script and scripts/cleanup.sh:

    tkn pipelinerun delete -n "${NAMESPACE}" --keep="${PIPELINERUN_AGE_DAYS}" --force

The --keep flag retains the N most recent PipelineRuns by count, not by age. With PIPELINERUN_AGE_DAYS=7 it keeps the 7 most recent runs regardless of how old they are. On a busy cluster this could delete 8-hour-old runs; on an idle cluster it could keep 3-month-old runs. The variable name PIPELINERUN_AGE_DAYS is doubly misleading. Either use a time-based flag (if the installed tkn supports one) or rename the variable to PIPELINERUN_KEEP_COUNT and update the ops docs (see the age-based sketch under finding 1 in the previous review).
6. Network policy annotation contradicts the ops guide
config/security/network_policy_litellm.yaml (and direct_api, self_hosted):

    abevalflow/note: >-
      Apply this policy INSTEAD OF default-deny when using Vertex AI mode.

Docs/infrastructure_ops.md says:

> Always apply the default-deny policy first, then add the mode-specific allow policy.
The annotation promotes the weaker single-policy approach. NetworkPolicies are additive (union), so applying default-deny + litellm is functionally identical to applying litellm alone — but the two-policy setup from the ops guide is safer: if the mode-specific policy is accidentally deleted, the default-deny remains as a backstop. The annotation on all three mode-specific policies should say "Apply IN ADDITION TO default-deny" to match the ops guide and the safer practice.
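e.g., on all three mode-specific policies:

```yaml
abevalflow/note: >-
  Apply this policy IN ADDITION TO default-deny when using Vertex AI mode.
```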
Minor
7. Two divergent implementations of the cleanup script
The CronJob (config/storage/cleanup_cronjob.yaml) embeds a copy of the cleanup script in an inline ConfigMap, while scripts/cleanup.sh exists as a separate file with different pod-age logic (different jsonpath query, different Python date handling). They will inevitably drift. Consider one of:
- Remove scripts/cleanup.sh if the CronJob's inline version is canonical.
- Have the CronJob mount scripts/cleanup.sh from a ConfigMap that is generated from the actual file (with a build step), keeping one source of truth; see the one-liner below.
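For the second option, the generation step can be a one-liner in CI (the ConfigMap name is illustrative and should match the CronJob's volume reference):

```bash
# Regenerate the cleanup ConfigMap from the canonical script.
oc create configmap cleanup-script \
  --from-file=cleanup.sh=scripts/cleanup.sh \
  -n ab-eval-flow --dry-run=client -o yaml | oc apply -f -
```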
8. pod_security.yaml filename suggests enforcement, content is documentation
config/security/pod_security.yaml is a ConfigMap of reference values — it enforces nothing on its own. The file comment is clear about this, but the filename could mislead operators into thinking a PodSecurityPolicy or SCC has been applied. Consider renaming to pod_security_reference.yaml or trial_pod_security_reference.yaml to match the ConfigMap's own metadata.name (trial-pod-security-reference).
9. DEAD_LETTER_RETENTION_DAYS defined but never wired to the CronJob
scripts/cleanup.sh declares DEAD_LETTER_RETENTION_DAYS="${DEAD_LETTER_RETENTION_DAYS:-14}" but this variable is absent from the CronJob's env block and, as noted in finding #4, nothing actually uses it. Remove the variable or wire it through once the dead-letter copy step is implemented.
10. Exit-code convention documented but not yet adopted
Docs/failure_handling.md (lines 48-56) documents a sys.exit(2) convention for non-retryable user errors. None of the existing scripts (validate.py, scaffold.py, analyze.py, store_results.py, query_results.py) implement it — they all exit with 1 or rely on uncaught exceptions. This doesn't block the PR (it's a forward-looking convention) but a follow-up ticket should be opened to retrofit the convention so Tekton's retry logic can distinguish transient from user errors.
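As a sketch of the convention for the follow-up (the exception class is hypothetical):

```python
import sys

EXIT_TRANSIENT = 1    # retryable: let Tekton's retry logic apply
EXIT_USER_ERROR = 2   # non-retryable user error: fail fast, no retry

class UserInputError(Exception):
    """Hypothetical marker for non-retryable user errors."""

def main() -> int:
    try:
        # ... task work goes here ...
        return 0
    except UserInputError:
        return EXIT_USER_ERROR
    except Exception:
        return EXIT_TRANSIENT

if __name__ == "__main__":
    sys.exit(main())
```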
11. Self-hosted network policy hardcodes port 8080
config/security/network_policy_self_hosted.yaml:
    ports:
      - protocol: TCP
        port: 8080

Common self-hosted inference servers use different ports: vLLM defaults to 8000, Ollama to 11434, TGI to 80. The comment says to adjust the selector, but not the port. Add a note alongside the port that it should also be adjusted to match the actual server.
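For example:

```yaml
ports:
  - protocol: TCP
    port: 8080   # adjust to your server: vLLM 8000, Ollama 11434, TGI 80
```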
Not a bug — confirmed correct
- ResourceQuota limits (50 pods, 32 CPU, 64Gi) are consistent with 40 concurrent trial pods + overhead ✅
- `egress: []` in default-deny correctly denies all egress from `abevalflow/role: trial` pods ✅
- LiteLLM NetworkPolicy targets pods by `app.kubernetes.io/name: litellm`, matching the Deployment's pod labels ✅
- `concurrencyPolicy: Forbid` on the CronJob prevents overlapping cleanup runs ✅
- Dead-letter PVC `accessModes: ReadWriteOnce` is fine; the cleanup job will be the only writer ✅
Model used for this review: Codex 5.3.

Model: composer-2 (Composer). (Follow-up: the review comment above should be read with this model name stated explicitly.)
- Rename PIPELINERUN_AGE_DAYS to PIPELINERUN_KEEP_COUNT (count semantics)
- Clarify dead-letter PVC as reserved for future use, not yet automated
- Mark per-task retries/timeouts as target policy (not yet applied)
- Change network policy annotations from INSTEAD OF to IN ADDITION TO
- Pin LiteLLM image to main-v1.82.6 instead of main-latest
- Add dedicated litellm ServiceAccount instead of reusing pipeline SA
- Scope secrets get verb with resourceNames in RBAC
- Remove inline cleanup script from CronJob ConfigMap (single source)
- Rename pod_security.yaml to pod_security_reference.yaml
- Add port guidance note to self-hosted network policy
- resourceNames: llm-credentials -> litellm-credentials (matches config/litellm/secret_template.yaml)
- Replace empty data: {} ConfigMap stub with a placeholder script that exits 1 with instructions, preventing silent broken-state on oc apply
…rite

The placeholder ConfigMap was reapplied by oc apply, overwriting the real script created in the prior step. Removed entirely; the ConfigMap is now managed only via oc create configmap --from-file, as documented.
    resources: [persistentvolumeclaims]
    verbs: [get, list, create, delete]
Does the SA really need permission to delete PVCs?
    - Egress
    egress:
      # DNS resolution
      - to:
Limiting DNS queries to the cluster's DNS server might be a nice improvement:
My suggestion:
    - to:
        - namespaceSelector:
            matchLabels:
              name: openshift-dns   # adjust to your cluster
      ports:
        - protocol: UDP
          port: 53
Summary
Deferred (follow-up)
Verification (tested on OpenShift cluster)
All manifests applied to the `ab-eval-flow` namespace and verified:
- ServiceAccount `pipeline` exists
- `oc auth can-i` confirmed create pods, get secrets, create PVCs, get configmaps: all yes
- `abevalflow/role=trial` pods
- `el-submission-listener-ab-eval-flow.apps.appeng.clusters.se-apps.redhat.com`