diff --git a/workflows/cve-fixer/.cve-fix/stolostron-grafana/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-grafana/examples.md
new file mode 100755
index 00000000..3af72243
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-grafana/examples.md
@@ -0,0 +1,30 @@
+
+
+
+
+## Titles
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (common across stolostron org)
+- `fix(cve): CVE-YYYY-XXXXX - <package>` (conventional commit style, also used in org)
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (common across stolostron org)
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-kube-rbac-proxy/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-kube-rbac-proxy/examples.md
new file mode 100755
index 00000000..390e1c3d
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-kube-rbac-proxy/examples.md
@@ -0,0 +1,36 @@
+
+
+## Titles
+- `fix(cve): CVE-YYYY-XXXXX - <package>` (5/30 merged PRs)
+  - e.g. `fix(cve): CVE-2026-33186 - google.golang.org/grpc`
+  - e.g. `fix(cve): CVE-2026-33186 - google.golang.org/grpc`
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (5/30 merged PRs)
+  - e.g. `Security: Fix CVE-2026-33186 (gRPC-Go) - release-2.13`
+  - e.g. `Security: Fix CVE-2026-33186 (gRPC-Go) - release-2.14`
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (10/30 merged PRs)
+  - e.g. `fix/cve-2026-33186-grpc-go-backplane-2.10-attempt-1`
+  - e.g. `fix/cve-2026-33186-grpc-go-backplane-2.9-attempt-1`
+- `dependabot/<ecosystem>/<package>-<version>` (2/30 merged PRs)
+  - e.g. `dependabot/go_modules/golang.org/x/net-0.38.0`
+  - e.g. `dependabot/go_modules/golang.org/x/oauth2-0.27.0`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps)
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-kube-state-metrics/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-kube-state-metrics/examples.md
new file mode 100755
index 00000000..a7408bb0
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-kube-state-metrics/examples.md
@@ -0,0 +1,33 @@
+
+
+## Titles
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/15 merged PRs)
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (4/15 merged PRs)
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1`
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.15-attempt-1`
+- `dependabot/<ecosystem>/<package>-<version>` (2/15 merged PRs)
+  - e.g. `dependabot/go_modules/github.com/golang-jwt/jwt/v5-5.2.2`
+  - e.g. `dependabot/go_modules/golang.org/x/crypto-0.35.0`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps)
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-memcached-exporter/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-memcached-exporter/examples.md
new file mode 100755
index 00000000..3af72243
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-memcached-exporter/examples.md
@@ -0,0 +1,30 @@
+
+
+
+
+## Titles
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (common across stolostron org)
+- `fix(cve): CVE-YYYY-XXXXX - <package>` (conventional commit style, also used in org)
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (common across stolostron org)
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-addon/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-addon/examples.md
new file mode 100755
index 00000000..c0e621f5
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-addon/examples.md
@@ -0,0 +1,30 @@
+
+
+## Titles
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/10 merged PRs)
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (4/10 merged PRs)
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1`
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.15-attempt-1`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps)
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-operator/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-operator/examples.md
new file mode 100755
index 00000000..72e15064
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-operator/examples.md
@@ -0,0 +1,32 @@
+
+
+## Titles
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/12 merged PRs)
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (4/12 merged PRs)
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1`
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.15-attempt-1`
+- `dependabot/<ecosystem>/<package>-<version>` (1/12 merged PRs)
+  - e.g. `dependabot/go_modules/go.opentelemetry.io/otel/sdk-1.40.0`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps)
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-node-exporter/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-node-exporter/examples.md
new file mode 100755
index 00000000..8cc1dc04
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-node-exporter/examples.md
@@ -0,0 +1,28 @@
+
+
+## Titles
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (common across stolostron org)
+- `fix(cve): CVE-YYYY-XXXXX - <package>` (conventional commit style, also used in org)
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (common across stolostron org)
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps)
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-observatorium-operator/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-observatorium-operator/examples.md
new file mode 100755
index 00000000..b1b8a4a2
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-observatorium-operator/examples.md
@@ -0,0 +1,30 @@
+
+
+## Titles
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (common across stolostron org)
+- `fix(cve): CVE-YYYY-XXXXX - <package>` (conventional commit style, also used in org)
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (common across stolostron org)
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps)
+- `vendor/` directory is vendored — run `go mod vendor` after dependency changes
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- This repo vendors dependencies — run `go mod vendor` after `go mod tidy`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-observatorium/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-observatorium/examples.md
new file mode 100755
index 00000000..6debae0b
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-observatorium/examples.md
@@ -0,0 +1,30 @@
+
+
+## Titles
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/19 merged PRs)
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (4/19 merged PRs)
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1`
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.15-attempt-1`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps)
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-prometheus-alertmanager/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-prometheus-alertmanager/examples.md
new file mode 100755
index 00000000..07ff82db
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-prometheus-alertmanager/examples.md
@@ -0,0 +1,32 @@
+
+
+## Titles
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/26 merged PRs)
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+- `Other CVE title format` (1/26 merged PRs)
+  - e.g. `[release-2.10] fix: CVE-2023-45288 ensure golang/x/net is 0.23+`
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (4/26 merged PRs)
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1`
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.15-attempt-1`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps)
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-prometheus-operator/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-prometheus-operator/examples.md
new file mode 100755
index 00000000..f6c23f29
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-prometheus-operator/examples.md
@@ -0,0 +1,30 @@
+
+
+## Titles
+- `fix(cve): CVE-YYYY-XXXXX - <package>` (4/22 merged PRs)
+  - e.g. `fix(cve): CVE-2026-33186 - google.golang.org/grpc [release-2.17]`
+  - e.g. `fix(cve): CVE-2026-33186 - google.golang.org/grpc [release-2.16]`
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (4/22 merged PRs)
+  - e.g. `fix/cve-2026-33186-grpc-release-2.17-attempt-1`
+  - e.g. `fix/cve-2026-33186-grpc-release-2.16-attempt-1`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps)
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-prometheus/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-prometheus/examples.md
new file mode 100755
index 00000000..a1bf7586
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-prometheus/examples.md
@@ -0,0 +1,33 @@
+
+
+## Titles
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/27 merged PRs)
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (4/27 merged PRs)
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1`
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.15-attempt-1`
+- `dependabot/<ecosystem>/<package>-<version>` (3/27 merged PRs)
+  - e.g. `dependabot/go_modules/github.com/golang-jwt/jwt/v5-5.2.2`
+  - e.g. `dependabot/go_modules/golang.org/x/crypto-0.35.0`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps)
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-thanos-receive-controller/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-thanos-receive-controller/examples.md
new file mode 100755
index 00000000..8dee3c2c
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-thanos-receive-controller/examples.md
@@ -0,0 +1,32 @@
+
+
+## Titles
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/17 merged PRs)
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go) - release-2.15`
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go) - release-2.14`
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (4/17 merged PRs)
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.15`
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.14`
+- `dependabot/<ecosystem>/<package>-<version>` (1/17 merged PRs)
+  - e.g. `dependabot/go_modules/golang.org/x/crypto-0.35.0`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps)
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/cve-fixer/.cve-fix/stolostron-thanos/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-thanos/examples.md
new file mode 100755
index 00000000..b51136f2
--- /dev/null
+++ b/workflows/cve-fixer/.cve-fix/stolostron-thanos/examples.md
@@ -0,0 +1,38 @@
+
+
+## Titles
+- `Security: Fix CVE-YYYY-XXXXX (<package>)` (3/35 merged PRs)
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+  - e.g. `Security: Fix CVE-2026-33186 (grpc-go)`
+- `Other CVE title format` (2/35 merged PRs)
+  - e.g. `fix: [release-2.10] CVE-2023-45288 ensure golang/x/net is 0.23+`
+  - e.g. `CVE-2023-45288 ensure golang/x/net is 0.23+`
+- `Bump <package> from X to Y to fix CVE-YYYY-XXXXX` (1/35 merged PRs)
+  - e.g. `Bump google.golang.org/grpc to v1.79.3 to fix CVE-2026-33186`
+
+## Branches
+- `fix/cve-<cve-id>-<package>-<branch>-attempt-N` (4/35 merged PRs)
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1`
+  - e.g. `fix/cve-2026-33186-grpc-go-release-2.17-attempt-1`
+- `dependabot/<ecosystem>/<package>-<version>` (3/35 merged PRs)
+  - e.g. `dependabot/go_modules/golang.org/x/crypto-0.35.0`
+  - e.g. `dependabot/go_modules/github.com/golang-jwt/jwt/v5-5.2.2`
+
+## Files
+- `go.mod` + `go.sum` always change together for Go dependency updates
+- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps)
+
+## Co-upgrades
+- When bumping a Go dependency, always run `go mod tidy` to update `go.sum`
+- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator`
+
+## PR Description
+- Include CVE ID, severity, and affected package in description
+- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches
+- Include test results section
+- For multi-branch fixes, create separate PRs per branch (not a single PR)
+
+## Don'ts
+- ❌ Do not combine multiple CVE fixes in a single PR
+- ❌ Do not target the wrong release branch (verify `--base` matches intended branch)
+- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
diff --git a/workflows/guidance-generator/.ambient/ambient.json b/workflows/guidance-generator/.ambient/ambient.json
new file mode 100644
index 00000000..6f25a442
--- /dev/null
+++ b/workflows/guidance-generator/.ambient/ambient.json
@@ -0,0 +1,10 @@
+{
+  "name": "PR Guidance Generator",
+  "description": "Analyze merged and closed fix PRs across one or more repositories to generate compact guidance files that teach automated workflows (CVE Fixer, Bugfix) how to create PRs matching each repo's conventions. Each repo is processed independently.",
+  "systemPrompt": "You are a PR pattern analyst for the Ambient Code Platform. Your role is to help teams generate and maintain guidance files that teach automated fix workflows how to create pull requests matching their repository's conventions.\n\nKEY RESPONSIBILITIES:\n- Fetch and analyze historical fix PRs from one or more GitHub repositories\n- Process each repository independently — one failure must not abort others\n- Extract patterns from merged PRs (what works) and closed PRs (what to avoid)\n- Generate compact, high-signal guidance files — no fluff, no verbose examples\n- Create one pull request per repository with the generated guidance files\n- Update existing guidance files with patterns from new PRs\n- Print a final summary listing all PR URLs and any failures\n\nWORKFLOW METHODOLOGY:\n1. GENERATE - Parse multiple repos, loop over each: analyze PR history (or specific PRs), create guidance files, open a PR per repo\n2. UPDATE - Parse multiple repos, loop over each: fetch new PRs (or specific PRs), merge patterns, open an update PR per repo\n\nAVAILABLE COMMANDS:\n/guidance.generate <repo-url> [<repo-url> ...] [--cve-only] [--bugfix-only] [--limit N] [--pr <refs>]\n/guidance.update <repo-url> [<repo-url> ...] [--cve-only] [--bugfix-only] [--pr <refs>]\n\nBoth commands accept repos and --pr refs space-separated, comma-separated, or mixed.\nFull PR URLs in --pr apply only to their matching repo; plain numbers apply to all repos.\n\nOUTPUT LOCATIONS (per repo):\n- Raw PR data: artifacts/guidance/<repo-slug>/raw/\n- Analysis output: artifacts/guidance/<repo-slug>/analysis/\n- Generated files: artifacts/guidance/<repo-slug>/output/\n\nCORE PRINCIPLES:\n- Process each repo independently in a loop — never let one repo failure abort others\n- Guidance files target ~80 lines; never drop rules to enforce the limit\n- Use adaptive rule threshold: 3+ PRs (large bucket), 2+ (medium), 1+ with limited-data warning (small). Skip file only if 0 merged PRs.\n- In --pr mode: never drop a user-specified PR even if it does not match bucket patterns\n- Merged PRs = positive examples. Closed PRs = what to avoid.\n- REQUEST_CHANGES review comments reveal what workflows should do proactively.\n- Never guess patterns — only state what the PR data supports.\n- Sanitize control characters from all PR text fields before JSON construction.",
+  "startupPrompt": "Ask the user which repository or repositories they want to analyze, and whether they want to generate new guidance files or update existing ones. If they are unsure, ask whether their repos already have .cve-fix/examples.md or .bugfix/guidance.md — if yes, suggest /guidance.update; if no, suggest /guidance.generate. Keep the introduction short: one sentence describing what the workflow does, then a concise list of the two commands and their key flags. Do not use marketing language or a canned greeting.",
+  "results": {
+    "Generated Guidance": "artifacts/guidance/**/output/*.md",
+    "PR Analysis": "artifacts/guidance/**/analysis/*.md"
+  }
+}
diff --git a/workflows/guidance-generator/.claude/commands/guidance.generate.md b/workflows/guidance-generator/.claude/commands/guidance.generate.md
new file mode 100644
index 00000000..0d978be9
--- /dev/null
+++ b/workflows/guidance-generator/.claude/commands/guidance.generate.md
@@ -0,0 +1,962 @@
+# /guidance.generate - Generate PR Guidance Files
+
+## Purpose
+Analyze a GitHub repository's fix PR history to generate compact guidance files
+for the CVE Fixer (`.cve-fix/examples.md`) and Bugfix (`.bugfix/guidance.md`)
+workflows, then open a PR in that repo adding those files.
+
+## Execution Style
+
+Be concise. Brief status per phase, full summary at end.
+
+Example:
+```
+Fetching PRs from org/repo... 147 total
+  CVE bucket: 38 PRs (28 merged, 10 closed)
+  Bugfix bucket: 61 PRs (54 merged, 7 closed)
+
+Fetching per-PR details... Done
+Synthesizing patterns...
+  CVE: 14 rules extracted (threshold: 3 PRs, or 1 if limited data)
+  Bugfix: 11 rules extracted
+
+Writing guidance files... Done
+Creating PR in org/repo... https://github.com/org/repo/pull/88
+
+Artifacts: artifacts/guidance/org-repo/
+```
+
+## Prerequisites
+
+- GitHub CLI (`gh`) installed and authenticated: `gh auth status`
+- `jq` installed
+- Write access to the target repository (for PR creation)
+
+## Arguments
+
+```
+/guidance.generate <repo-url> [<repo-url> ...] [--cve-only] [--bugfix-only] [--limit N]
+/guidance.generate <repo-url>[,<repo-url>,...] [--cve-only] [--bugfix-only] [--limit N]
+/guidance.generate <repo-url> [<repo-url> ...] --pr <ref>[,<ref>,...]
+```
+
+- `repo-url`: One or more repos — space-separated or comma-separated (or both).
+  Accepts full GitHub URLs (`https://github.com/org/repo`) or `org/repo` slugs.
+  Each repo is processed independently and gets its own PR.
+- `--cve-only`: Skip bugfix analysis for all repos
+- `--bugfix-only`: Skip CVE analysis for all repos
+- `--limit N`: Max PRs to fetch per bucket per repo (default: 100, min: 20)
+- `--pr <refs>`: PR URLs or numbers — space-separated, comma-separated, or mixed.
+  Full URLs (`https://github.com/org/repo/pull/123`) are applied only to their
+  matching repo. Plain numbers (`123`) are applied to all repos.
+
+## Process
+
+### 1. Parse Arguments and Validate
+
+Parse all repo references (space-separated, comma-separated, or mixed) and
+`--pr` into structured data. Validate `gh` auth once before the loop.
+
+```bash
+# Validate gh auth once
+gh auth status || { echo "ERROR: gh not authenticated. Run 'gh auth login'"; exit 1; }
+
+# Normalize repo args: replace commas with spaces, strip the GitHub URL prefix,
+# deduplicate, and collect into the REPOS array
+normalize_repo() {
+  local REF="$1"
+  if [[ "$REF" =~ github\.com/([a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+) ]]; then
+    echo "${BASH_REMATCH[1]}"
+  elif [[ "$REF" =~ ^[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+$ ]]; then
+    echo "$REF"
+  else
+    echo "WARNING: Cannot parse repo '$REF' — skipping" >&2
+    echo ""
+  fi
+}
+
+REPOS=()
+for RAW in $(echo "$REPO_ARGS" | tr ',' ' '); do
+  NORMALIZED=$(normalize_repo "$RAW")
+  [ -n "$NORMALIZED" ] && REPOS+=("$NORMALIZED")
+done
+
+# Deduplicate
+REPOS=($(printf '%s\n' "${REPOS[@]}" | awk '!seen[$0]++'))
+
+if [ ${#REPOS[@]} -eq 0 ]; then
+  echo "ERROR: No valid repository references provided."
+  echo "Usage: /guidance.generate org/repo1 org/repo2"
+  exit 1
+fi
+
+echo "Repos to process (${#REPOS[@]}):"
+for R in "${REPOS[@]}"; do echo "  - $R"; done
+
+# Parse --pr: full URLs map to their repo; plain numbers apply to all repos
+declare -A REPO_SPECIFIC_PRS  # keyed by "org/repo", value = space-separated PR numbers
+GLOBAL_PR_NUMBERS=""          # plain numbers — applied to every repo
+
+if [ -n "$PR_REFS" ]; then
+  IFS=',' read -ra PR_LIST <<< "$(echo "$PR_REFS" | tr ' ' ',')"
+  for PR_REF in "${PR_LIST[@]}"; do
+    PR_REF=$(echo "$PR_REF" | tr -d ' ')
+    [ -z "$PR_REF" ] && continue
+    if [[ "$PR_REF" =~ github\.com/([a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+)/pull/([0-9]+) ]]; then
+      PR_REPO="${BASH_REMATCH[1]}"
+      PR_NUM="${BASH_REMATCH[2]}"
+      REPO_SPECIFIC_PRS["$PR_REPO"]="${REPO_SPECIFIC_PRS[$PR_REPO]:-} $PR_NUM"
+    elif [[ "$PR_REF" =~ ^[0-9]+$ ]]; then
+      GLOBAL_PR_NUMBERS="$GLOBAL_PR_NUMBERS $PR_REF"
+    else
+      echo "WARNING: Could not parse PR reference '$PR_REF' — skipping"
+    fi
+  done
+  GLOBAL_PR_NUMBERS=$(echo "$GLOBAL_PR_NUMBERS" | tr -s ' ' | sed 's/^ //')
+fi
+
+# Accumulators for the final summary
+PR_RESULTS=()    # "org/repo -> <pr-url>"
+FAILED_REPOS=()  # "org/repo -> <reason>"
+```
+
+---
+> **Steps 2–8 repeat for each repo in `${REPOS[@]}`.**
+
+```bash +for REPO in "${REPOS[@]}"; do + echo "" + echo "=== $REPO ===" + + # Validate this repo is accessible; skip on failure rather than aborting all + if ! gh repo view "$REPO" --json name > /dev/null 2>&1; then + echo " ERROR: Cannot access $REPO — skipping" + FAILED_REPOS+=("$REPO -> cannot access repository") + continue + fi + + REPO_SLUG=$(echo "$REPO" | tr '/' '-') + + # Combine repo-specific --pr numbers with global plain numbers for this repo + SPECIFIC_PR_NUMBERS="${REPO_SPECIFIC_PRS[$REPO]:-} $GLOBAL_PR_NUMBERS" + SPECIFIC_PR_NUMBERS=$(echo "$SPECIFIC_PR_NUMBERS" | tr -s ' ' | sed 's/^ //') + [ -n "$SPECIFIC_PR_NUMBERS" ] && echo " Manual PR mode: PR(s) $SPECIFIC_PR_NUMBERS" + + mkdir -p "artifacts/guidance/$REPO_SLUG/raw" + mkdir -p "artifacts/guidance/$REPO_SLUG/analysis" + mkdir -p "artifacts/guidance/$REPO_SLUG/output" + mkdir -p "/tmp/guidance-gen/$REPO_SLUG" +``` + +### 2. Fetch PR Metadata (Pass 1 — lightweight) + +**If `--pr` was specified**, skip bulk fetch and build the metadata list directly +from the given PR numbers: + +```bash +LIMIT="${LIMIT:-100}" + +if [ -n "$SPECIFIC_PR_NUMBERS" ]; then + # Manual mode: fetch metadata only for the specified PRs + echo "[]" > "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" + for NUMBER in $SPECIFIC_PR_NUMBERS; do + PR_META=$(gh pr view "$NUMBER" --repo "$REPO" \ + --json number,title,state,mergedAt,closedAt,labels,headRefName,latestReviews \ + 2>/dev/null) + if [ $? -ne 0 ] || [ -z "$PR_META" ]; then + echo "WARNING: Could not fetch PR #$NUMBER — skipping" + continue + fi + jq --argjson meta "$PR_META" '. 
+ [$meta]' \ + "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/all-prs.json.tmp" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/all-prs.json.tmp" \ + "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" + done + TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json") + echo "Loaded $TOTAL specified PR(s) from $REPO" +else + # Auto mode: bulk fetch all recent PRs + gh pr list \ + --repo "$REPO" \ + --state all \ + --limit 200 \ + --json number,title,state,mergedAt,closedAt,labels,headRefName,latestReviews \ + > "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" + TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json") + echo "Fetched $TOTAL PRs from $REPO" +fi +``` + +### 3. Filter into Buckets + +Use jq to split into CVE and bugfix buckets based on title and branch patterns. + +In **auto mode**: CVE PRs take priority — a PR cannot be in both buckets. +In **manual mode (`--pr`)**: classify normally, but if a specified PR matches +neither pattern, include it in both buckets and let Claude determine during +synthesis which guidance file it informs. Never silently drop a user-specified PR. 
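Before the full jq filters run, the bucket decision itself can be pictured with a tiny standalone sketch. The patterns here are abbreviated stand-ins, not the canonical ones defined in the next step, and `classify` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Illustrative single-PR bucket decision (patterns abbreviated).
CVE_TITLE='CVE-[0-9]{4}-[0-9]+|^[Ss]ecurity:|^fix\(cve\):'
CVE_BRANCH='^fix/cve-|^dependabot/'
BUGFIX_TITLE='^fix[(:]|^bugfix'

classify() {
  local TITLE="$1" BRANCH="$2"
  if echo "$TITLE" | grep -qiE "$CVE_TITLE" || echo "$BRANCH" | grep -qiE "$CVE_BRANCH"; then
    echo "cve"
  elif echo "$TITLE" | grep -qiE "$BUGFIX_TITLE"; then
    echo "bugfix"
  else
    echo "both"   # manual-mode fallback: unmatched PRs go to both buckets
  fi
}

classify "fix(cve): CVE-2026-33186 - grpc-go" "fix/cve-2026-33186"  # -> cve
classify "fix: nil pointer dereference" "fix/nil-ptr"               # -> bugfix
classify "Refactor helpers" "chore/refactor"                        # -> both
```

The third case shows the manual-mode fallback: a PR matching neither pattern lands in both buckets for later classification.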
+ +```bash +# Explicit CVE/security signals — pass through unconditionally +CVE_EXPLICIT='CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE' +# Dependency/version bump patterns — may contain security patches; require body scan +CVE_DEP_PATTERN='^[Bb]ump |^deps\(|^build\(deps\)' +# Combined: either explicit or dep pattern matches the CVE bucket initially +CVE_PATTERN="${CVE_EXPLICIT}|${CVE_DEP_PATTERN}" +CVE_BRANCH_PATTERN='^fix/cve-|^security/cve-|^dependabot/|^renovate/' +BUGFIX_PATTERN='^fix[:(]|^bugfix|^bug[[:space:]]fix|closes[[:space:]]#[0-9]+|fixes[[:space:]]#[0-9]+' +BUGFIX_BRANCH_PATTERN='^(bugfix|fix|bug)/' +# Keyword that confirms a dep-pattern match is security-relevant +SECURITY_BODY='CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|security|vulnerab|security.advisory' + +if [ -n "$SPECIFIC_PR_NUMBERS" ]; then + # Manual mode: classify each PR, fallback to both buckets if unmatched + jq '[.[] | select( + (.title | test("'"$CVE_PATTERN"'"; "i")) or + (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i")) + )]' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" + + jq '[.[] | select( + ( + (.title | test("'"$BUGFIX_PATTERN"'"; "i")) or + (.headRefName | test("'"$BUGFIX_BRANCH_PATTERN"'"; "i")) + ) and + (.title | test("'"$CVE_PATTERN"'"; "i") | not) and + (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i") | not) + )]' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" + + # Any PR that matched neither bucket: add to both with a warning + UNMATCHED=$(jq '[.[] | select( + ((.title | test("'"$CVE_PATTERN"'"; "i")) or (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i")) | not) and + ((.title | test("'"$BUGFIX_PATTERN"'"; "i")) or (.headRefName | test("'"$BUGFIX_BRANCH_PATTERN"'"; "i")) | not) + )]' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json") + UNMATCHED_COUNT=$(echo "$UNMATCHED" | jq 'length') + if [ "$UNMATCHED_COUNT" -gt 0 ]; then + 
UNMATCHED_NUMS=$(echo "$UNMATCHED" | jq -r '.[].number' | tr '\n' ',' | sed 's/,$//') + echo " NOTE: PR(s) #$UNMATCHED_NUMS did not match CVE or bugfix patterns — included in both buckets for Claude to classify" + jq --argjson extra "$UNMATCHED" '. + $extra' \ + "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" > "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json.tmp" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json.tmp" "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" + jq --argjson extra "$UNMATCHED" '. + $extra' \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" > "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json.tmp" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json.tmp" "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" + fi +else + # Auto mode: strict filtering, CVE takes priority + jq --argjson limit "$LIMIT" '[ + .[] | select( + (.title | test("'"$CVE_PATTERN"'"; "i")) or + (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i")) + ) + ] | .[:$limit]' \ + "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" + + jq --argjson limit "$LIMIT" '[ + .[] | select( + ( + (.title | test("'"$BUGFIX_PATTERN"'"; "i")) or + (.headRefName | test("'"$BUGFIX_BRANCH_PATTERN"'"; "i")) + ) and + (.title | test("'"$CVE_PATTERN"'"; "i") | not) and + (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i") | not) + ) + ] | .[:$limit]' \ + "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" +fi + +# Body scan: for dep-pattern matches without an explicit CVE/GHSA in the title, +# fetch the PR body and verify it contains a security indicator. +# Explicit CVE/GHSA/Security titles pass through unconditionally. +# Only runs in auto mode — manual --pr mode trusts the user's selection. 
+if [ -z "$SPECIFIC_PR_NUMBERS" ]; then + DEP_ONLY_NUMS=$(jq -r '[.[] | select( + (.title | test("'"$CVE_DEP_PATTERN"'"; "i")) and + (.title | test("'"$CVE_EXPLICIT"'"; "i") | not) + ) | .number] | .[]' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") + + for PR_NUM in $DEP_ONLY_NUMS; do + BODY=$(gh pr view "$PR_NUM" --repo "$REPO" --json body \ + --jq '.body // ""' 2>/dev/null | sanitize_str) + if ! echo "$BODY" | grep -qiE "$SECURITY_BODY"; then + echo " Dropped PR #$PR_NUM from CVE bucket — dep update with no security signal in body" + jq --argjson n "$PR_NUM" '[.[] | select(.number != $n)]' \ + "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json.tmp" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json.tmp" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" + fi + done +fi + +CVE_TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") +CVE_MERGED=$(jq '[.[] | select(.state == "MERGED")] | length' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") +CVE_CLOSED=$(jq '[.[] | select(.state == "CLOSED")] | length' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") + +BUGFIX_TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json") +BUGFIX_MERGED=$(jq '[.[] | select(.state == "MERGED")] | length' "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json") +BUGFIX_CLOSED=$(jq '[.[] | select(.state == "CLOSED")] | length' "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json") + +echo " CVE bucket: $CVE_TOTAL PRs ($CVE_MERGED merged, $CVE_CLOSED closed)" +echo " Bugfix bucket: $BUGFIX_TOTAL PRs ($BUGFIX_MERGED merged, $BUGFIX_CLOSED closed)" +``` + +If both buckets are empty, report this clearly and exit — the repo may not have +recognizable fix PR naming conventions. Suggest the user check PR title patterns. + +### 3.5. Fetch Commit Fallback + +For any bucket with fewer than 3 merged PRs, scan recent commits as a supplementary +signal source. Skip this step entirely if `--pr` was specified (user chose the data). 
+ +```bash +# Fetch commit fallback for a bucket if merged PR count < 3 +# Args: BUCKET_LABEL META_FILE OUT_FILE MSG_PATTERN +fetch_commit_fallback() { + local LABEL="$1" + local META_FILE="$2" + local OUT_FILE="$3" + local MSG_PATTERN="$4" + + echo "[]" > "$OUT_FILE" + + # Skip if manual PR mode — user chose the data explicitly + [ -n "$SPECIFIC_PR_NUMBERS" ] && return + + local MERGED_COUNT + MERGED_COUNT=$(jq '[.[] | select(.state == "MERGED")] | length' "$META_FILE") + + if [ "$MERGED_COUNT" -ge 3 ]; then + return # Enough PR data — no fallback needed + fi + + echo " $LABEL bucket: $MERGED_COUNT merged PRs — scanning commits as fallback..." + + # Fetch up to 100 recent commit messages (lightweight — no file data yet) + gh api "repos/$REPO/commits?per_page=100" \ + --jq '.[] | {sha: .sha, message: .commit.message}' \ + > "/tmp/guidance-gen/$REPO_SLUG/${LABEL}-commits-raw.jsonl" 2>/dev/null + + local SAMPLED=0 + local MAX_COMMITS=50 + + while IFS= read -r LINE && [ "$SAMPLED" -lt "$MAX_COMMITS" ]; do + local SHA MSG_RAW TITLE + + SHA=$(echo "$LINE" | jq -r '.sha') + MSG_RAW=$(echo "$LINE" | jq -r '.message' | sanitize_str) + TITLE=$(echo "$MSG_RAW" | head -1) + + # Filter by message pattern for this bucket + echo "$TITLE" | grep -qiE "$MSG_PATTERN" || continue + + # For dep/bump commits without an explicit CVE/GHSA in the title, + # verify the commit body contains a security indicator. + # MSG_RAW already contains the full message — no extra API call needed. + if echo "$TITLE" | grep -qiE "^[Bb]ump |^deps\(|^build\(deps\)"; then + if ! echo "$TITLE" | grep -qiE "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):"; then + if ! 
echo "$MSG_RAW" | grep -qiE "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|security|vulnerab"; then + continue # dep update with no security signal — skip + fi + fi + fi + + # Fetch file list for this commit (targeted — only for matched commits) + local FILES + FILES=$(gh api "repos/$REPO/commits/$SHA" \ + --jq '[.files[].filename]' 2>/dev/null || echo "[]") + + local BODY + BODY=$(echo "$MSG_RAW" | tail -n +2 | tr '\n' ' ' | cut -c1-300) + + local RECORD + RECORD=$(jq -n \ + --arg sha "$SHA" \ + --arg title "$TITLE" \ + --arg body "$BODY" \ + --argjson files "$FILES" \ + '{source: "commit", sha: $sha, state: "MERGED", + title: $title, branch: "", labels: [], + files: $files, changes_requested: [], close_reason: null, + commit_body: $body}' 2>/tmp/guidance-jq-err.txt) + + if [ $? -ne 0 ]; then + echo " WARNING: commit $SHA skipped — $(cat /tmp/guidance-jq-err.txt)" + continue + fi + + jq --argjson rec "$RECORD" '. + [$rec]' "$OUT_FILE" > "${OUT_FILE}.tmp" \ + && mv "${OUT_FILE}.tmp" "$OUT_FILE" + SAMPLED=$((SAMPLED + 1)) + + done < "/tmp/guidance-gen/$REPO_SLUG/${LABEL}-commits-raw.jsonl" + + local COMMIT_COUNT + COMMIT_COUNT=$(jq 'length' "$OUT_FILE") + echo " Found $COMMIT_COUNT matching $LABEL commits" + + # Save to artifacts for transparency + cp "$OUT_FILE" "artifacts/guidance/$REPO_SLUG/raw/${LABEL}-commits.json" +} + +fetch_commit_fallback "cve" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-commits.json" \ + "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE|^[Bb]ump |^deps\(|^build\(deps\)" + +fetch_commit_fallback "bugfix" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-commits.json" \ + "^fix[:(]|^bugfix|^bug fix|fixes[[:space:]]#[0-9]+|closes[[:space:]]#[0-9]+" +``` + +### 4. Fetch Per-PR Details (Pass 2 — targeted) + +For each PR in both buckets, fetch only: file paths changed and review data. 
+For closed PRs, also fetch the last 2 comments (closing context). + +Process each bucket the same way. Replace `$META_FILE` and `$OUT_FILE` accordingly. + +```bash +# Strip control characters from a string (keeps printable ASCII + tab + newline) +sanitize_str() { + tr -cd '[:print:]\t\n' +} + +fetch_pr_details() { + local META_FILE="$1" + local OUT_FILE="$2" + local COUNT=$(jq 'length' "$META_FILE") + local FAILED=0 + + echo "[]" > "$OUT_FILE" + + for i in $(seq 0 $((COUNT - 1))); do + NUMBER=$(jq -r ".[$i].number" "$META_FILE") + STATE=$(jq -r ".[$i].state" "$META_FILE") + # Sanitize string fields at extraction time to strip control characters + TITLE=$(jq -r ".[$i].title" "$META_FILE" | sanitize_str) + BRANCH=$(jq -r ".[$i].headRefName" "$META_FILE" | sanitize_str) + LABELS=$(jq -c "[.[$i].labels[].name]" "$META_FILE") + + # Fetch files and reviews in one call + PR_DETAIL=$(gh pr view "$NUMBER" --repo "$REPO" \ + --json files,reviews 2>/dev/null) + + FILES=$(echo "$PR_DETAIL" | jq -c '[.files[].path]') + + # Extract REQUEST_CHANGES review bodies — sanitize inside jq before truncating + CHANGES_REQ=$(echo "$PR_DETAIL" | jq -c '[ + .reviews[] | + select(.state == "CHANGES_REQUESTED") | + .body | + gsub("[\\u0000-\\u0008\\u000b-\\u001f\\u007f]"; "") | + gsub("\\n|\\r"; " ") | + .[0:200] + ]') + + # For closed PRs: get last 2 comments, sanitize inside jq + CLOSE_REASON="null" + if [ "$STATE" = "CLOSED" ]; then + CLOSE_REASON=$(gh pr view "$NUMBER" --repo "$REPO" \ + --json comments \ + --jq '.comments | .[-2:] | map( + .body | + gsub("[\\u0000-\\u0008\\u000b-\\u001f\\u007f]"; "") | + gsub("\\n|\\r"; " ") | + .[0:200] + ) | join(" | ")' \ + 2>/dev/null | jq -Rs '.') + fi + + # Build compact record — capture jq errors per PR instead of silently dropping + RECORD=$(jq -n \ + --argjson number "$NUMBER" \ + --arg state "$STATE" \ + --arg title "$TITLE" \ + --arg branch "$BRANCH" \ + --argjson labels "$LABELS" \ + --argjson files "$FILES" \ + --argjson changes_requested 
"$CHANGES_REQ" \ + --argjson close_reason "$CLOSE_REASON" \ + '{number: $number, state: $state, title: $title, branch: $branch, + labels: $labels, files: $files, + changes_requested: $changes_requested, close_reason: $close_reason}' \ + 2>/tmp/guidance-jq-err.txt) + + if [ $? -ne 0 ]; then + echo " WARNING: PR #$NUMBER skipped — jq error: $(cat /tmp/guidance-jq-err.txt)" + FAILED=$((FAILED + 1)) + continue + fi + + jq --argjson rec "$RECORD" '. + [$rec]' "$OUT_FILE" > "${OUT_FILE}.tmp" \ + && mv "${OUT_FILE}.tmp" "$OUT_FILE" + done + + if [ "$FAILED" -gt 0 ]; then + echo " WARNING: $FAILED PR(s) skipped due to unparseable content. Check raw data in artifacts." + fi +} + +fetch_pr_details \ + "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-details.json" + +fetch_pr_details \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-details.json" + +# Merge commit fallback records into the detail files +jq -s '.[0] + .[1]' \ + "/tmp/guidance-gen/$REPO_SLUG/cve-details.json" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-commits.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/cve-details-merged.json" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/cve-details-merged.json" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-details.json" + +jq -s '.[0] + .[1]' \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-details.json" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-commits.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/bugfix-details-merged.json" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/bugfix-details-merged.json" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-details.json" + +# Save to artifacts for reference +cp "/tmp/guidance-gen/$REPO_SLUG/cve-details.json" \ + "artifacts/guidance/$REPO_SLUG/raw/cve-prs.json" +cp "/tmp/guidance-gen/$REPO_SLUG/bugfix-details.json" \ + "artifacts/guidance/$REPO_SLUG/raw/bugfix-prs.json" +``` + +### 5. Synthesize Patterns + +Read `cve-details.json` and `bugfix-details.json` from the artifacts. 
+Analyze them as the agent — do NOT write a script for this step. + +**Records have two sources — treat them differently:** + +Records with no `source` field (or `source != "commit"`) are PR records. +Records with `source: "commit"` came from the commit fallback and have no +`changes_requested` or `close_reason` data. + +**Inclusion thresholds by source:** + +| Source | Min occurrences per rule | +|--------|--------------------------| +| Merged PRs (10+ in bucket) | 3 | +| Merged PRs (3–9 in bucket) | 2 | +| Merged PRs (1–2 in bucket) | 1 | +| Commits only | 5 | +| Mixed (PRs + commits) | 3 total, at least 1 PR | + +**What to extract from PR records:** +- **Title format**: What template do titles follow? +- **Branch format**: What naming pattern do branches use? +- **Files changed**: Which files appear together most often? +- **Labels**: What labels are consistently applied? +- **Co-changes**: When package A changes, does package B always change too? +- **From changes_requested**: What reviewers asked for — these become proactive rules. +- **From close_reason + changes_requested**: Why PRs were rejected — these become "don'ts". + +**What to extract from commit records (no reviewer signal available):** +- **Message format**: Title line pattern, body structure, trailers (`Co-authored-by:`, `Fixes #`) +- **Files changed**: Which files appear together in fix commits +- **Co-changes**: Package co-upgrade patterns visible in file sets + +**Commit-only rules cannot populate the "Don'ts" section** — there is no rejection +signal from commits. If a bucket is commit-only, omit the Don'ts section entirely. 
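The threshold table reads as a small decision function. The sketch below is illustrative only (the synthesis itself is agent work, not a script, per the note above); `min_occurrences` is a hypothetical name, and it assumes the mixed row applies whenever a small PR bucket was supplemented by commit fallback:

```shell
#!/usr/bin/env bash
# Encodes the inclusion-threshold table: merged-PR count and commit count
# in a bucket determine the minimum occurrences a rule needs.
min_occurrences() {
  local PRS="$1" COMMITS="$2"
  if [ "$PRS" -ge 10 ]; then
    echo 3                                  # large PR bucket
  elif [ "$PRS" -ge 3 ]; then
    echo 2                                  # medium PR bucket
  elif [ "$PRS" -ge 1 ]; then
    # 1-2 merged PRs: mixed rule (3 total, >=1 PR) if commits supplement
    [ "$COMMITS" -gt 0 ] && echo 3 || echo 1
  else
    echo 5                                  # commit-only bucket
  fi
}

min_occurrences 12 0   # -> 3
min_occurrences 4 0    # -> 2
min_occurrences 1 6    # -> 3 (mixed)
min_occurrences 0 20   # -> 5 (commits only)
```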
+ +**Evidence notation:** +- PR-sourced: `(8/9 merged PRs)` +- Commit-sourced: `(7 commits)` +- Mixed: `(3/4 merged PRs + 5 commits)` + +**Output of synthesis step:** +Write an intermediate analysis file per bucket: + +``` +artifacts/guidance//analysis/cve-patterns.md +artifacts/guidance//analysis/bugfix-patterns.md +``` + +Each analysis file is a structured list: +``` +TITLE_FORMAT: "Security: Fix CVE-YYYY-XXXXX ()" (3/4 merged PRs + 6 commits) +BRANCH_FORMAT: "fix/cve-YYYY-XXXXX--attempt-N" (3/4 merged PRs) +FILES_GO_STDLIB: go.mod + Dockerfile + Dockerfile.konflux (8 commits) +PROACTIVE_go_sum: Include go.sum — flagged missing in N closed PRs +DONT_multiple_cves: One CVE per PR — N closed PRs rejected for combining +... +``` + +### 6. Generate Guidance Files + +From the analysis files, generate the final guidance files. + +**Formatting constraints:** +- Target 80 lines per file — this is a guideline for fresh generation, not a hard truncation +- No narrative paragraphs — one rule per line or a tight code block +- Evidence counts are inline and terse: `(N/M merged)`, `(N closed PRs)` +- No full PR examples — only the distilled pattern +- If the synthesized output naturally exceeds 80 lines (many strong patterns), + include all rules that meet the threshold. Note the line count in the PR description. + +**CVE guidance file template** — write to `artifacts/guidance//output/cve-fix-guidance.md`. + +When in manual PR mode, the header must note which PRs were used: + +```markdown +# CVE Fix Guidance — + +``` + +When commit fallback was used, add a `commit-fallback` count to the header: + +```markdown +# CVE Fix Guidance — + +``` + +In auto mode with no fallback needed, omit the `cve-commits` field: + +```markdown +# CVE Fix Guidance — + + +## Titles +`` (N/N) + +## Branches +`` (N/N) + +## Files — + (N/N) + + +## PR Description +Required sections (missing caused REQUEST_CHANGES in N PRs): +-
+-
+... + +## Jira / Issue References + (N PRs flagged incorrect format) + +## Don'ts +- (N cases) +- (N cases) +... +``` + +**Bugfix guidance file template** — write to `artifacts/guidance//output/bugfix-guidance.md`: + +```markdown +# Bugfix Guidance — + + +## Titles +`` (N/N) + +## Branches +`` (N/N) + +## Scope Values + (from N PRs) + +## Test Requirements + (N/N merged PRs included this) + +## PR Must Include +- (N PRs) +... + +## Don'ts +- (N cases) +... +``` + +**Threshold rules — adapt based on available data:** +- 10+ merged PRs in bucket → require 3+ PRs per rule (standard threshold) +- 3–9 merged PRs → require 2+ PRs per rule +- 1–2 merged PRs → require 1+ PR per rule; add a `limited-data` warning in the file header + +**If a section has no rules meeting the applicable threshold, omit that section entirely.** +Do not write sections with placeholder text or "not enough data" notes — just omit them. + +**If a bucket has 0 merged PRs**, skip that guidance file entirely and log why. + +**If only one bucket had data** (e.g., no CVE PRs found), only generate the file for +the bucket that had data. Log which file was skipped and why. + +### 7. Create Pull Request in Target Repository + +Clone the repository, add the guidance files, and open a PR. 
+ +```bash +TODAY=$(date +%Y-%m-%d) +BRANCH_NAME="chore/add-pr-guidance-$TODAY" + +# Clone to /tmp +CLONE_DIR="/tmp/guidance-gen/$REPO_SLUG/repo" +git clone "https://github.com/$REPO.git" "$CLONE_DIR" +cd "$CLONE_DIR" + +# Configure git credentials +gh auth setup-git 2>/dev/null || true + +# Create branch +git checkout -b "$BRANCH_NAME" + +# Copy generated files +CVE_OUTPUT="$OLDPWD/artifacts/guidance/$REPO_SLUG/output/cve-fix-guidance.md" +BUGFIX_OUTPUT="$OLDPWD/artifacts/guidance/$REPO_SLUG/output/bugfix-guidance.md" + +if [ -f "$CVE_OUTPUT" ]; then + mkdir -p .cve-fix + cp "$CVE_OUTPUT" .cve-fix/examples.md +fi + +if [ -f "$BUGFIX_OUTPUT" ]; then + mkdir -p .bugfix + cp "$BUGFIX_OUTPUT" .bugfix/guidance.md +fi + +# Commit +git add .cve-fix .bugfix +git commit -m "chore: add automated PR guidance files + +Guidance files generated by the PR Guidance Generator workflow. +These files teach automated fix workflows how this repo expects +PRs to be structured, based on analysis of merged and closed PRs. + +Files added: +$([ -f "$CVE_OUTPUT" ] && echo " - .cve-fix/examples.md (CVE fix conventions)") +$([ -f "$BUGFIX_OUTPUT" ] && echo " - .bugfix/guidance.md (Bugfix conventions)") + +Co-Authored-By: PR Guidance Generator " + +# Build PR body +CVE_MERGED_COUNT=$(jq '[.[] | select(.state == "MERGED")] | length' \ + "$OLDPWD/artifacts/guidance/$REPO_SLUG/raw/cve-prs.json" 2>/dev/null || echo 0) +CVE_CLOSED_COUNT=$(jq '[.[] | select(.state == "CLOSED")] | length' \ + "$OLDPWD/artifacts/guidance/$REPO_SLUG/raw/cve-prs.json" 2>/dev/null || echo 0) +BUGFIX_MERGED_COUNT=$(jq '[.[] | select(.state == "MERGED")] | length' \ + "$OLDPWD/artifacts/guidance/$REPO_SLUG/raw/bugfix-prs.json" 2>/dev/null || echo 0) +BUGFIX_CLOSED_COUNT=$(jq '[.[] | select(.state == "CLOSED")] | length' \ + "$OLDPWD/artifacts/guidance/$REPO_SLUG/raw/bugfix-prs.json" 2>/dev/null || echo 0) + +PR_BODY=$(cat <\` periodically to refresh with new PRs. 
+ +--- +Generated by PR Guidance Generator workflow +EOF +) + +# Fork-aware push and PR creation +UPSTREAM_OWNER="${REPO%%/*}" +REPO_NAME="${REPO##*/}" +DEFAULT_BRANCH=$(gh repo view "$REPO" --json defaultBranchRef --jq '.defaultBranchRef.name') +GH_USER=$(gh api user --jq .login 2>/dev/null || \ + gh api /installation/repositories --jq '.repositories[0].owner.login' 2>/dev/null || \ + echo "") + +git config user.name "${GH_USER:-guidance-generator}" +git config user.email "${GH_USER:-guidance}@users.noreply.github.com" + +FORK_PUSH=false +FORK_OWNER="" + +# Attempt 1: direct push to upstream +if git push origin "$BRANCH_NAME" 2>/tmp/guidance-push-err.txt; then + echo " Pushed to upstream directly" +elif [ -n "$GH_USER" ]; then + # Attempt 2: find or create a fork + echo " Direct push failed — checking for fork of $REPO..." + FORK=$(gh repo list "$GH_USER" --fork --json nameWithOwner,parent \ + --jq ".[] | select(.parent.owner.login == \"$UPSTREAM_OWNER\" and .parent.name == \"$REPO_NAME\") | .nameWithOwner" \ + 2>/dev/null) + + if [ -z "$FORK" ]; then + echo " No fork found — creating fork..." + if gh repo fork "$REPO" --clone=false 2>/dev/null; then + sleep 3 # give GitHub time to provision the fork + FORK="$GH_USER/$REPO_NAME" + echo " Fork created: $FORK" + else + echo " ERROR: Could not create fork automatically." + echo " Create one manually at: https://github.com/$REPO/fork" + echo " Then re-run: /guidance.generate $REPO" + FAILED_REPOS+=("$REPO -> fork creation failed; create at https://github.com/$REPO/fork and re-run") + cd /; rm -rf "/tmp/guidance-gen/$REPO_SLUG"; continue + fi + else + echo " Found existing fork: $FORK" + fi + + FORK_OWNER="${FORK%%/*}" + git remote add fork "https://github.com/$FORK.git" 2>/dev/null || \ + git remote set-url fork "https://github.com/$FORK.git" + git push fork "$BRANCH_NAME" + FORK_PUSH=true +else + # No gh auth and direct push failed — provide manual fallback + echo " ERROR: Push failed and gh is not authenticated." 
+ echo " Manual steps to submit this PR:" + echo " 1. Fork https://github.com/$REPO" + echo " 2. git -C /tmp/guidance-gen/$REPO_SLUG/repo remote add fork https://github.com/YOUR_USER/$REPO_NAME.git" + echo " 3. git -C /tmp/guidance-gen/$REPO_SLUG/repo push fork $BRANCH_NAME" + echo " 4. Open PR: https://github.com/$REPO/compare/$BRANCH_NAME" + FAILED_REPOS+=("$REPO -> push failed, no gh auth; see manual steps above") + cd /; rm -rf "/tmp/guidance-gen/$REPO_SLUG"; continue +fi + +# Create PR +if $FORK_PUSH; then + PR_URL=$(gh pr create \ + --repo "$REPO" \ + --base "$DEFAULT_BRANCH" \ + --head "$FORK_OWNER:$BRANCH_NAME" \ + --title "chore: add automated PR guidance files" \ + --body "$PR_BODY") +else + PR_URL=$(gh pr create \ + --repo "$REPO" \ + --base "$DEFAULT_BRANCH" \ + --title "chore: add automated PR guidance files" \ + --body "$PR_BODY") +fi +echo "PR created: $PR_URL" +``` + +### 8. Cleanup (per repo) + +```bash + cd / + rm -rf "/tmp/guidance-gen/$REPO_SLUG" + + # Collect result for final summary + if [ -n "${PR_URL:-}" ]; then + PR_RESULTS+=("$REPO -> $PR_URL") + else + FAILED_REPOS+=("$REPO -> PR creation failed (see output above)") + fi + +done # end of per-repo loop +``` + +### 9. Print Summary + +Print one entry per repo, then a totals line. + +``` +Done. Processed repo(s). 
+
+org/repo1
+  CVE: 12 rules | Bugfix: 9 rules
+  PR: https://github.com/org/repo1/pull/88
+
+org/repo2
+  CVE: skipped (0 merged CVE PRs)
+  Bugfix: 7 rules
+  PR: https://github.com/org/repo2/pull/41
+
+org/repo3 — FAILED: cannot access repository
+
+---
+PRs created: <N> | Failed: <N>
+```
+
+## Output
+
+- `artifacts/guidance/<repo-slug>/raw/cve-prs.json` — raw compact PR data
+- `artifacts/guidance/<repo-slug>/raw/bugfix-prs.json`
+- `artifacts/guidance/<repo-slug>/analysis/cve-patterns.md` — intermediate patterns
+- `artifacts/guidance/<repo-slug>/analysis/bugfix-patterns.md`
+- `artifacts/guidance/<repo-slug>/output/cve-fix-guidance.md` — final CVE guidance
+- `artifacts/guidance/<repo-slug>/output/bugfix-guidance.md` — final bugfix guidance
+- Pull request in target repository
+
+## Success Criteria
+
+- [ ] All repos parsed from input (space and comma separated)
+- [ ] gh auth validated once before the loop
+- [ ] Each repo processed independently — one failure does not abort others
+- [ ] Per-repo: both buckets filtered from PR metadata
+- [ ] Per-repo: per-PR details fetched (files + review REQUEST_CHANGES)
+- [ ] Per-repo: patterns synthesized with adaptive threshold
+- [ ] Per-repo: guidance files written to artifacts/guidance/<repo-slug>/output/
+- [ ] Per-repo: PR created in target repo
+- [ ] Per-repo: /tmp cleaned up after PR creation
+- [ ] Final summary lists all repos with PR URLs and any failures
+
+## Notes
+
+### Limited Data
+Never skip a guidance file just because a bucket has few merged PRs.
+Only skip if the bucket has **0 merged PRs**.
+
+For small datasets, apply an adaptive threshold and add a warning to the file header:
+
+```markdown
+<!-- limited-data: only N merged PRs analyzed -->
+```
+
+This gives the workflow something to work with while signalling to reviewers
+that the file should be revisited once more PRs accumulate.
+
+Log: "CVE bucket has N merged PR(s) — generating with limited-data warning."
+
+### Repos with No Matching PRs
+If neither bucket has data, the repo likely uses non-standard PR naming.
+Report this and ask the user to provide example PR numbers or title patterns +so the filters can be adjusted. + +### GitHub API Rate Limits +`gh` uses authenticated calls (5000 req/hr). The per-PR detail fetch makes +2 API calls per PR (files+reviews, and comments for closed PRs). +At the default limit of 100 per bucket, worst case is ~400 API calls — well +within limits. If the user hits rate limits, reduce with `--limit 50`. + +### If .cve-fix/ or .bugfix/ Already Exist in Repo +If these directories already exist in the default branch, do not overwrite silently. +Warn the user: "Existing guidance files found in repo. Use /guidance.update instead, +or pass --force to overwrite." +Check with: +```bash +gh api repos/$REPO/contents/.cve-fix/examples.md > /dev/null 2>&1 && EXISTING_CVE=true +gh api repos/$REPO/contents/.bugfix/guidance.md > /dev/null 2>&1 && EXISTING_BUGFIX=true +``` diff --git a/workflows/guidance-generator/.claude/commands/guidance.update.md b/workflows/guidance-generator/.claude/commands/guidance.update.md new file mode 100644 index 00000000..5e0e10d0 --- /dev/null +++ b/workflows/guidance-generator/.claude/commands/guidance.update.md @@ -0,0 +1,858 @@ +# /guidance.update - Update Existing PR Guidance Files + +## Purpose +Fetch PRs created since the last analysis, extract new patterns, merge them +into existing guidance files, and open a PR in the repository with the updates. + +## Execution Style + +Be concise. Brief status per phase, full summary at end. + +Example: +``` +Reading existing guidance from org/repo... + .cve-fix/examples.md — last analyzed: 2026-01-15 + .bugfix/guidance.md — last analyzed: 2026-01-15 + +Fetching PRs since 2026-01-15... 23 new PRs + CVE bucket: 8 PRs (6 merged, 2 closed) + Bugfix bucket: 12 PRs (11 merged, 1 closed) + +Synthesizing new patterns... + CVE: 2 new rules, 3 evidence counts updated, 1 contradiction flagged + Bugfix: 1 new rule, 2 evidence counts updated + +Updating files and creating PR... 
https://github.com/org/repo/pull/103 +``` + +## Prerequisites + +- GitHub CLI (`gh`) installed and authenticated +- `jq` installed +- Guidance files must already exist in the repo (run `/guidance.generate` first) + +## Arguments + +``` +/guidance.update [ ...] [--cve-only] [--bugfix-only] +/guidance.update [,,...] [--cve-only] [--bugfix-only] +/guidance.update [ ...] --pr [ ...] +``` + +- `repo-url`: One or more repos — space-separated or comma-separated (or both). + Each repo is updated independently and gets its own PR. +- `--cve-only`: Only update `.cve-fix/examples.md` — skip bugfix guidance. +- `--bugfix-only`: Only update `.bugfix/guidance.md` — skip CVE guidance. +- `--pr `: PR URLs or numbers — space-separated, comma-separated, or mixed. + Full URLs are applied only to their matching repo. Plain numbers are applied to + all repos. The `last-analyzed` date is still updated to today in all files. + +## Process + +### 1. Parse Arguments and Validate + +```bash +# Validate gh auth once +gh auth status || { echo "ERROR: gh not authenticated. Run 'gh auth login'"; exit 1; } + +# Normalize repo args: replace commas with spaces, strip GitHub URL prefix, deduplicate +normalize_repo() { + local REF="$1" + if [[ "$REF" =~ github\.com/([a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+) ]]; then + echo "${BASH_REMATCH[1]}" + elif [[ "$REF" =~ ^[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+$ ]]; then + echo "$REF" + else + echo "WARNING: Cannot parse repo '$REF' — skipping" >&2 + echo "" + fi +} + +REPOS=() +for RAW in $(echo "$REPO_ARGS" | tr ',' ' '); do + NORMALIZED=$(normalize_repo "$RAW") + [ -n "$NORMALIZED" ] && REPOS+=("$NORMALIZED") +done + +REPOS=($(printf '%s\n' "${REPOS[@]}" | awk '!seen[$0]++')) + +if [ ${#REPOS[@]} -eq 0 ]; then + echo "ERROR: No valid repository references provided." 
+ exit 1 +fi + +echo "Repos to process (${#REPOS[@]}):" +for R in "${REPOS[@]}"; do echo " - $R"; done + +# Parse --pr: full URLs map to their repo; plain numbers apply to all repos +declare -A REPO_SPECIFIC_PRS +GLOBAL_PR_NUMBERS="" + +if [ -n "$PR_REFS" ]; then + IFS=',' read -ra PR_LIST <<< "$(echo "$PR_REFS" | tr ' ' ',')" + for PR_REF in "${PR_LIST[@]}"; do + PR_REF=$(echo "$PR_REF" | tr -d ' ') + if [[ "$PR_REF" =~ github\.com/([a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+)/pull/([0-9]+) ]]; then + PR_REPO="${BASH_REMATCH[1]}" + PR_NUM="${BASH_REMATCH[2]}" + REPO_SPECIFIC_PRS["$PR_REPO"]="${REPO_SPECIFIC_PRS[$PR_REPO]:-} $PR_NUM" + elif [[ "$PR_REF" =~ ^[0-9]+$ ]]; then + GLOBAL_PR_NUMBERS="$GLOBAL_PR_NUMBERS $PR_REF" + else + echo "WARNING: Could not parse PR reference '$PR_REF' — skipping" + fi + done + GLOBAL_PR_NUMBERS=$(echo "$GLOBAL_PR_NUMBERS" | tr -s ' ' | sed 's/^ //') +fi + +# Parse scope flags (apply to all repos) +CVE_ONLY=false +BUGFIX_ONLY=false +[ "${CVE_ONLY_FLAG:-}" = "true" ] && CVE_ONLY=true +[ "${BUGFIX_ONLY_FLAG:-}" = "true" ] && BUGFIX_ONLY=true +if $CVE_ONLY && $BUGFIX_ONLY; then + echo "ERROR: --cve-only and --bugfix-only are mutually exclusive." + exit 1 +fi + +PR_RESULTS=() +FAILED_REPOS=() +``` + +--- +> **Steps 2–9 repeat for each repo in `${REPOS[@]}`.** + +```bash +for REPO in "${REPOS[@]}"; do + echo "" + echo "=== $REPO ===" + + if ! 
gh repo view "$REPO" --json name > /dev/null 2>&1; then + echo " ERROR: Cannot access $REPO — skipping" + FAILED_REPOS+=("$REPO -> cannot access repository") + continue + fi + + REPO_SLUG=$(echo "$REPO" | tr '/' '-') + + SPECIFIC_PR_NUMBERS="${REPO_SPECIFIC_PRS[$REPO]:-} $GLOBAL_PR_NUMBERS" + SPECIFIC_PR_NUMBERS=$(echo "$SPECIFIC_PR_NUMBERS" | tr -s ' ' | sed 's/^ //') + [ -n "$SPECIFIC_PR_NUMBERS" ] && echo " Manual PR mode: PR(s) $SPECIFIC_PR_NUMBERS" + + mkdir -p "artifacts/guidance/$REPO_SLUG/raw" + mkdir -p "artifacts/guidance/$REPO_SLUG/analysis" + mkdir -p "artifacts/guidance/$REPO_SLUG/output" + mkdir -p "/tmp/guidance-gen/$REPO_SLUG" +``` + +### 2. Read Existing Guidance Files from Repository + +Clone the repo and read the existing guidance files. Extract the +`last-analyzed` date from each file's header comment. + +```bash +CLONE_DIR="/tmp/guidance-gen/$REPO_SLUG/repo" +git clone "https://github.com/$REPO.git" "$CLONE_DIR" +cd "$CLONE_DIR" +gh auth setup-git 2>/dev/null || true + +CVE_FILE="$CLONE_DIR/.cve-fix/examples.md" +BUGFIX_FILE="$CLONE_DIR/.bugfix/guidance.md" + +FOUND_CVE=false +FOUND_BUGFIX=false +LAST_DATE="" + +$BUGFIX_ONLY && echo " --bugfix-only: skipping CVE guidance" +$CVE_ONLY && echo " --cve-only: skipping bugfix guidance" + +if [ -f "$CVE_FILE" ] && ! $BUGFIX_ONLY; then + FOUND_CVE=true + # Extract date from: + CVE_DATE=$(grep -m1 'last-analyzed:' "$CVE_FILE" | \ + grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' | head -1) + echo " .cve-fix/examples.md — last analyzed: ${CVE_DATE:-unknown}" + LAST_DATE="$CVE_DATE" +fi + +if [ -f "$BUGFIX_FILE" ] && ! 
$CVE_ONLY; then
+  FOUND_BUGFIX=true
+  BUGFIX_DATE=$(grep -m1 'last-analyzed:' "$BUGFIX_FILE" | \
+    grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' | head -1)
+  echo "  .bugfix/guidance.md — last analyzed: ${BUGFIX_DATE:-unknown}"
+  # Use earlier of the two dates to avoid missing PRs
+  if [ -n "$BUGFIX_DATE" ] && [ -n "$LAST_DATE" ]; then
+    LAST_DATE=$(echo -e "$LAST_DATE\n$BUGFIX_DATE" | sort | head -1)
+  elif [ -n "$BUGFIX_DATE" ]; then
+    LAST_DATE="$BUGFIX_DATE"
+  fi
+fi
+```
+
+**If neither file exists**, stop and redirect:
+
+```
+Neither .cve-fix/examples.md nor .bugfix/guidance.md found in <repo>.
+Run /guidance.generate to create them first.
+```
+
+**If `last-analyzed` date cannot be parsed**, warn the user and default to
+fetching the last 90 days of PRs, then proceed.
+
+```bash
+if [ -z "$LAST_DATE" ]; then
+  echo "WARNING: Could not parse last-analyzed date. Defaulting to last 90 days."
+  LAST_DATE=$(date -d "90 days ago" +%Y-%m-%d 2>/dev/null || \
+    date -v-90d +%Y-%m-%d 2>/dev/null)
+fi
+
+echo "Fetching PRs since $LAST_DATE..."
+```
+
+### 3. Fetch New PRs (Pass 1)
+
+**If `--pr` was specified**, skip the date-based bulk fetch and load only the given PRs:
+
+```bash
+if [ -n "$SPECIFIC_PR_NUMBERS" ]; then
+  # Manual mode: fetch only specified PRs
+  echo "[]" > "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json"
+  for NUMBER in $SPECIFIC_PR_NUMBERS; do
+    PR_META=$(gh pr view "$NUMBER" --repo "$REPO" \
+      --json number,title,state,mergedAt,closedAt,labels,headRefName,latestReviews \
+      2>/dev/null)
+    if [ $? -ne 0 ] || [ -z "$PR_META" ]; then
+      echo "WARNING: Could not fetch PR #$NUMBER — skipping"
+      continue
+    fi
+    jq --argjson meta "$PR_META" '. + [$meta]' \
+      "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" \
+      > "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json.tmp" \
+      && mv "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json.tmp" \
+            "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json"
+  done
+  NEW_TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json")
+  echo "Loaded $NEW_TOTAL specified PR(s)"
+else
+  # Auto mode: fetch all PRs since last-analyzed date
+  gh pr list \
+    --repo "$REPO" \
+    --state all \
+    --limit 200 \
+    --search "created:>$LAST_DATE" \
+    --json number,title,state,mergedAt,closedAt,labels,headRefName,latestReviews \
+    > "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json"
+  NEW_TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json")
+  echo "Fetched $NEW_TOTAL new PRs since $LAST_DATE"
+  if [ "$NEW_TOTAL" -eq 0 ]; then
+    echo "No new PRs found since $LAST_DATE. Guidance files are already up to date."
+    # `continue`, not `exit` — exiting here would abort the remaining repos
+    cd /
+    rm -rf "/tmp/guidance-gen/$REPO_SLUG"
+    continue
+  fi
+fi
+```
+
+### 4. Filter New PRs into Buckets
+
+In **auto mode**: CVE PRs take priority. In **manual mode (`--pr`)**: if a
+specified PR matches neither pattern, include it in both buckets for Claude to classify. 
+
+```bash
+# Strip control characters (also defined in step 5; needed from this step on)
+sanitize_str() { tr -cd '[:print:]\t\n'; }
+
+# Explicit CVE/security signals — pass through unconditionally
+CVE_EXPLICIT='CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE'
+# Dependency/version bump patterns — require body scan to confirm security relevance
+CVE_DEP_PATTERN='^[Bb]ump |^deps\(|^build\(deps\)'
+CVE_PATTERN="${CVE_EXPLICIT}|${CVE_DEP_PATTERN}"
+CVE_BRANCH_PATTERN='^fix/cve-|^security/cve-|^dependabot/|^renovate/'
+BUGFIX_PATTERN='^fix[:(]|^bugfix|^bug[[:space:]]fix|closes[[:space:]]#[0-9]+|fixes[[:space:]]#[0-9]+'
+BUGFIX_BRANCH_PATTERN='^(bugfix|fix|bug)/'
+SECURITY_BODY='CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|security|vulnerab'
+
+# Pass patterns with --arg, never by splicing them into the jq program text:
+# the `\(` sequences above would otherwise be parsed by jq as string
+# interpolation and break the filter.
+jq --arg cve "$CVE_PATTERN" --arg cvebr "$CVE_BRANCH_PATTERN" '[.[] | select(
+  (.title | test($cve; "i")) or
+  (.headRefName | test($cvebr; "i"))
+)]' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" \
+  > "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json"
+
+jq --arg cve "$CVE_PATTERN" --arg cvebr "$CVE_BRANCH_PATTERN" \
+   --arg bug "$BUGFIX_PATTERN" --arg bugbr "$BUGFIX_BRANCH_PATTERN" \
+'[.[] | select(
+  (
+    (.title | test($bug; "i")) or
+    (.headRefName | test($bugbr; "i"))
+  ) and
+  (.title | test($cve; "i") | not) and
+  (.headRefName | test($cvebr; "i") | not)
+)]' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" \
+  > "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json"
+
+# In manual mode: add unmatched PRs to both buckets
+if [ -n "$SPECIFIC_PR_NUMBERS" ]; then
+  UNMATCHED=$(jq --arg cve "$CVE_PATTERN" --arg cvebr "$CVE_BRANCH_PATTERN" \
+    --arg bug "$BUGFIX_PATTERN" --arg bugbr "$BUGFIX_BRANCH_PATTERN" \
+  '[.[] | select(
+    (((.title | test($cve; "i")) or (.headRefName | test($cvebr; "i"))) | not) and
+    (((.title | test($bug; "i")) or (.headRefName | test($bugbr; "i"))) | not)
+  )]' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json")
+  UNMATCHED_COUNT=$(echo "$UNMATCHED" | jq 'length')
+  if [ "$UNMATCHED_COUNT" -gt 0 ]; then
+    UNMATCHED_NUMS=$(echo "$UNMATCHED" | jq -r '.[].number' | tr '\n' ',' | sed 's/,$//')
+    echo "  NOTE: PR(s) #$UNMATCHED_NUMS did not match CVE or bugfix patterns — included in both buckets"
+    for META_FILE in "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" \
+                     "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json"; do
+      jq --argjson extra "$UNMATCHED" '. + $extra' "$META_FILE" > "${META_FILE}.tmp" \
+        && mv "${META_FILE}.tmp" "$META_FILE"
+    done
+  fi
+fi
+
+# Body scan: for dep-pattern matches without an explicit CVE/GHSA title,
+# verify the PR body contains a security indicator before keeping it.
+# Only runs in auto mode — manual --pr mode trusts the user's selection.
+if [ -z "$SPECIFIC_PR_NUMBERS" ]; then
+  DEP_ONLY_NUMS=$(jq -r --arg dep "$CVE_DEP_PATTERN" --arg explicit "$CVE_EXPLICIT" \
+  '[.[] | select(
+    (.title | test($dep; "i")) and
+    (.title | test($explicit; "i") | not)
+  ) | .number] | .[]' "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json")
+
+  for PR_NUM in $DEP_ONLY_NUMS; do
+    BODY=$(gh pr view "$PR_NUM" --repo "$REPO" --json body \
+      --jq '.body // ""' 2>/dev/null | sanitize_str)
+    if ! echo "$BODY" | grep -qiE "$SECURITY_BODY"; then
+      echo "  Dropped PR #$PR_NUM from CVE bucket — dep update with no security signal in body"
+      jq --argjson n "$PR_NUM" '[.[] | select(.number != $n)]' \
+        "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" \
+        > "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json.tmp" \
+        && mv "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json.tmp" \
+              "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json"
+    fi
+  done
+fi
+
+# Zero out skipped buckets so subsequent steps treat them as empty
+$BUGFIX_ONLY && echo "[]" > "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json"
+$CVE_ONLY && echo "[]" > "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json"
+
+NEW_CVE=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json")
+NEW_BUGFIX=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json")
+echo "  CVE bucket: $NEW_CVE new PRs"
+echo "  Bugfix bucket: $NEW_BUGFIX new PRs"
+```
+
+### 4.5. Fetch Commit Fallback
+
+For any bucket with fewer than 3 new merged PRs since the last-analyzed date,
+scan recent commits as supplementary signal. Skip if `--pr` was specified. 
+ +```bash +fetch_commit_fallback() { + local LABEL="$1" + local META_FILE="$2" + local OUT_FILE="$3" + local MSG_PATTERN="$4" + + echo "[]" > "$OUT_FILE" + + [ -n "$SPECIFIC_PR_NUMBERS" ] && return + + local MERGED_COUNT + MERGED_COUNT=$(jq '[.[] | select(.state == "MERGED")] | length' "$META_FILE") + + if [ "$MERGED_COUNT" -ge 3 ]; then + return + fi + + echo " $LABEL bucket: $MERGED_COUNT new merged PRs — scanning commits as fallback..." + + gh api "repos/$REPO/commits?per_page=100" \ + --jq '.[] | {sha: .sha, message: .commit.message}' \ + > "/tmp/guidance-gen/$REPO_SLUG/${LABEL}-commits-raw.jsonl" 2>/dev/null + + local SAMPLED=0 + local MAX_COMMITS=50 + + while IFS= read -r LINE && [ "$SAMPLED" -lt "$MAX_COMMITS" ]; do + local SHA MSG_RAW TITLE + + SHA=$(echo "$LINE" | jq -r '.sha') + MSG_RAW=$(echo "$LINE" | jq -r '.message' | sanitize_str) + TITLE=$(echo "$MSG_RAW" | head -1) + + echo "$TITLE" | grep -qiE "$MSG_PATTERN" || continue + + # For dep/bump commits without explicit CVE/GHSA in title, verify body has security signal. + # MSG_RAW already contains the full message — no extra API call needed. + if echo "$TITLE" | grep -qiE "^[Bb]ump |^deps\(|^build\(deps\)"; then + if ! echo "$TITLE" | grep -qiE "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):"; then + if ! 
echo "$MSG_RAW" | grep -qiE "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|security|vulnerab"; then + continue # dep update with no security signal — skip + fi + fi + fi + + local FILES + FILES=$(gh api "repos/$REPO/commits/$SHA" \ + --jq '[.files[].filename]' 2>/dev/null || echo "[]") + + local BODY + BODY=$(echo "$MSG_RAW" | tail -n +2 | tr '\n' ' ' | cut -c1-300) + + local RECORD + RECORD=$(jq -n \ + --arg sha "$SHA" \ + --arg title "$TITLE" \ + --arg body "$BODY" \ + --argjson files "$FILES" \ + '{source: "commit", sha: $sha, state: "MERGED", + title: $title, branch: "", labels: [], + files: $files, changes_requested: [], close_reason: null, + commit_body: $body}' 2>/tmp/guidance-jq-err.txt) + + if [ $? -ne 0 ]; then + echo " WARNING: commit $SHA skipped — $(cat /tmp/guidance-jq-err.txt)" + continue + fi + + jq --argjson rec "$RECORD" '. + [$rec]' "$OUT_FILE" > "${OUT_FILE}.tmp" \ + && mv "${OUT_FILE}.tmp" "$OUT_FILE" + SAMPLED=$((SAMPLED + 1)) + + done < "/tmp/guidance-gen/$REPO_SLUG/${LABEL}-commits-raw.jsonl" + + local COMMIT_COUNT + COMMIT_COUNT=$(jq 'length' "$OUT_FILE") + echo " Found $COMMIT_COUNT matching $LABEL commits" + cp "$OUT_FILE" "artifacts/guidance/$REPO_SLUG/raw/${LABEL}-commits.json" +} + +fetch_commit_fallback "cve" \ + "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-commits.json" \ + "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE|^[Bb]ump |^deps\(|^build\(deps\)" + +fetch_commit_fallback "bugfix" \ + "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-commits.json" \ + "^fix[:(]|^bugfix|^bug fix|fixes[[:space:]]#[0-9]+|closes[[:space:]]#[0-9]+" +``` + +### 5. Fetch Per-PR Details (Pass 2) + +Same as `/guidance.generate` — files + reviews per PR, closing context for closed PRs. 
+ +```bash +# Strip control characters from a string (keeps printable ASCII + tab + newline) +sanitize_str() { + tr -cd '[:print:]\t\n' +} + +fetch_pr_details() { + local META_FILE="$1" + local OUT_FILE="$2" + local COUNT=$(jq 'length' "$META_FILE") + local FAILED=0 + + echo "[]" > "$OUT_FILE" + + for i in $(seq 0 $((COUNT - 1))); do + NUMBER=$(jq -r ".[$i].number" "$META_FILE") + STATE=$(jq -r ".[$i].state" "$META_FILE") + # Sanitize string fields at extraction time to strip control characters + TITLE=$(jq -r ".[$i].title" "$META_FILE" | sanitize_str) + BRANCH=$(jq -r ".[$i].headRefName" "$META_FILE" | sanitize_str) + LABELS=$(jq -c "[.[$i].labels[].name]" "$META_FILE") + + PR_DETAIL=$(gh pr view "$NUMBER" --repo "$REPO" \ + --json files,reviews 2>/dev/null) + + FILES=$(echo "$PR_DETAIL" | jq -c '[.files[].path]') + + # Extract REQUEST_CHANGES review bodies — sanitize inside jq before truncating + CHANGES_REQ=$(echo "$PR_DETAIL" | jq -c '[ + .reviews[] | + select(.state == "CHANGES_REQUESTED") | + .body | + gsub("[\\u0000-\\u0008\\u000b-\\u001f\\u007f]"; "") | + gsub("\\n|\\r"; " ") | + .[0:200] + ]') + + # For closed PRs: get last 2 comments, sanitize inside jq + CLOSE_REASON="null" + if [ "$STATE" = "CLOSED" ]; then + CLOSE_REASON=$(gh pr view "$NUMBER" --repo "$REPO" \ + --json comments \ + --jq '.comments | .[-2:] | map( + .body | + gsub("[\\u0000-\\u0008\\u000b-\\u001f\\u007f]"; "") | + gsub("\\n|\\r"; " ") | + .[0:200] + ) | join(" | ")' \ + 2>/dev/null | jq -Rs '.') + fi + + # Build compact record — capture jq errors per PR instead of silently dropping + RECORD=$(jq -n \ + --argjson number "$NUMBER" \ + --arg state "$STATE" \ + --arg title "$TITLE" \ + --arg branch "$BRANCH" \ + --argjson labels "$LABELS" \ + --argjson files "$FILES" \ + --argjson changes_requested "$CHANGES_REQ" \ + --argjson close_reason "$CLOSE_REASON" \ + '{number: $number, state: $state, title: $title, branch: $branch, + labels: $labels, files: $files, + changes_requested: 
$changes_requested, close_reason: $close_reason}' \ + 2>/tmp/guidance-jq-err.txt) + + if [ $? -ne 0 ]; then + echo " WARNING: PR #$NUMBER skipped — jq error: $(cat /tmp/guidance-jq-err.txt)" + FAILED=$((FAILED + 1)) + continue + fi + + jq --argjson rec "$RECORD" '. + [$rec]' "$OUT_FILE" > "${OUT_FILE}.tmp" \ + && mv "${OUT_FILE}.tmp" "$OUT_FILE" + done + + if [ "$FAILED" -gt 0 ]; then + echo " WARNING: $FAILED PR(s) skipped due to unparseable content. Check raw data in artifacts." + fi +} + +fetch_pr_details \ + "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/new-cve-details.json" + +fetch_pr_details \ + "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details.json" + +# Merge commit fallback records into the detail files +jq -s '.[0] + .[1]' \ + "/tmp/guidance-gen/$REPO_SLUG/new-cve-details.json" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-commits.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/new-cve-details-merged.json" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/new-cve-details-merged.json" \ + "/tmp/guidance-gen/$REPO_SLUG/new-cve-details.json" + +jq -s '.[0] + .[1]' \ + "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details.json" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-commits.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details-merged.json" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details-merged.json" \ + "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details.json" + +cp "/tmp/guidance-gen/$REPO_SLUG/new-cve-details.json" \ + "artifacts/guidance/$REPO_SLUG/raw/new-cve-prs.json" +cp "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details.json" \ + "artifacts/guidance/$REPO_SLUG/raw/new-bugfix-prs.json" +``` + +### 6. Synthesize New Patterns + +Read both the new PR detail files AND the existing guidance files. + +As the agent, analyze the new detail records. 
Records with `source: "commit"`
+came from the commit fallback — treat them differently from PR records:
+
+**Thresholds for new rules:**
+
+| Source | Min for a new rule |
+|--------|--------------------|
+| Merged PRs | 3 (or 2 if bucket had <10, or 1 if <3) |
+| Commits only | 5 |
+| Mixed (PRs + commits) | 3 total, at least 1 PR |
+
+**What commit records contribute:**
+- Message/title format patterns
+- File co-change patterns
+- Commit body trailer conventions (`Co-authored-by:`, `Fixes #`, etc.)
+
+**What commit records do NOT contribute:**
+- Don'ts section (no rejection signal)
+- Reviewer expectation rules (no `changes_requested` data)
+
+**Evidence notation for new rules:**
+- PR only: `(3/4 merged PRs)`
+- Commit only: `(6 commits)`
+- Mixed: `(2/3 merged PRs + 4 commits)`
+
+For each pattern found in the combined data:
+
+**A. New rule** — meets the threshold above and does not already exist in the
+guidance file. Add it to the appropriate section.
+
+**B. Reinforced rule** — already exists in the guidance file.
+Update evidence count: `(8/9 merged)` → `(14/15 merged)` or add commit count:
+`(8/9 merged)` → `(8/9 merged PRs + 5 commits)`.
+
+**C. Contradicting rule** — a pattern in new merged PRs that directly contradicts
+a "don't". Flag it:
+```
+- [REVIEW NEEDED] Multiple CVEs per PR — previously flagged as a don't,
+  but PR #N was merged combining CVEs. Policy may have changed. (N/N new merged)
+```
+
+**D. New don't** — pattern from newly closed PRs (3+ cases). Commits cannot
+produce new don'ts. Add only PR-sourced rejections here.
+
+Write findings to:
+- `artifacts/guidance/<repo-slug>/analysis/cve-update-patterns.md`
+- `artifacts/guidance/<repo-slug>/analysis/bugfix-update-patterns.md`
+
+Format: same structured list as in `/guidance.generate` step 5.
+
+### 7. Merge Patterns into Existing Guidance Files
+
+Read the cloned guidance files and apply the changes from step 6. 
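+
+For a reinforced rule, the in-place count update can be sketched with `sed`.
+This helper is illustrative, not part of the workflow contract: the file,
+rule snippet, and counts would come from the step-6 analysis, and the snippet
+is assumed to be unique in the file and free of `/` characters.
+
+```bash
+# Sketch: bump the "(N/M merged PRs)" evidence suffix on the line matching
+# a rule snippet. File, snippet, and counts below are illustrative.
+update_count() {
+  local file="$1" snippet="$2" new_count="$3"
+  # -i.bak works on both GNU and BSD sed; BRE parens are literal
+  sed -i.bak "/${snippet}/ s|([0-9][0-9]*/[0-9][0-9]*[^)]*)|(${new_count})|" "$file" \
+    && rm -f "${file}.bak"
+}
+
+printf -- '- `fix(cve): CVE` title style (8/9 merged PRs)\n' > /tmp/demo-rule.md
+update_count /tmp/demo-rule.md 'fix(cve)' '14/15 merged PRs'
+grep -F '(14/15 merged PRs)' /tmp/demo-rule.md   # prints the updated rule line
+```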
+
+**Editing rules:**
+- Update evidence counts in-place: find the line, update the `(N/M ...)` count
+- Append new rules to the bottom of the appropriate section
+- Append new don'ts to the Don'ts section
+- Add `[REVIEW NEEDED]` lines at the bottom of the relevant section for contradictions
+- Update the `last-analyzed` date in the header comment
+- Update the merged/closed counts in the header comment
+- Do NOT reorder existing rules — preserve the file structure
+
+After editing, count the lines in each file. Never drop existing rules to
+make room — always append new rules in full. If the file now exceeds 80 lines,
+note it but do not truncate:
+
+```bash
+# One file may be absent in --cve-only / --bugfix-only runs — default to 0
+CVE_LINES=0; [ -f "$CVE_FILE" ] && CVE_LINES=$(wc -l < "$CVE_FILE")
+BUGFIX_LINES=0; [ -f "$BUGFIX_FILE" ] && BUGFIX_LINES=$(wc -l < "$BUGFIX_FILE")
+
+OVERSIZE_NOTE=""
+if [ "$CVE_LINES" -gt 80 ]; then
+  echo "  NOTE: .cve-fix/examples.md is now ${CVE_LINES} lines (target: 80)"
+  OVERSIZE_NOTE="${OVERSIZE_NOTE}\n- \`.cve-fix/examples.md\` is ${CVE_LINES} lines. Consider running \`/guidance.generate\` to rebuild and consolidate."
+fi
+if [ "$BUGFIX_LINES" -gt 80 ]; then
+  echo "  NOTE: .bugfix/guidance.md is now ${BUGFIX_LINES} lines (target: 80)"
+  OVERSIZE_NOTE="${OVERSIZE_NOTE}\n- \`.bugfix/guidance.md\` is ${BUGFIX_LINES} lines. Consider running \`/guidance.generate\` to rebuild and consolidate."
+fi
+```
+
+Include `$OVERSIZE_NOTE` in the PR description if non-empty so the reviewer
+knows the file has grown and can decide whether to trigger a full rebuild.
+
+**Update the header:**
+```
+<!-- last-analyzed: YYYY-MM-DD | merged: <n> | closed: <n> -->
+```
+
+Copy the updated files to artifacts output:
+```bash
+[ -f "$CVE_FILE" ] && cp "$CVE_FILE" "artifacts/guidance/$REPO_SLUG/output/cve-fix-guidance.md"
+[ -f "$BUGFIX_FILE" ] && cp "$BUGFIX_FILE" "artifacts/guidance/$REPO_SLUG/output/bugfix-guidance.md"
+```
+
+### 8. 
Create Pull Request with Updates
+
+```bash
+TODAY=$(date +%Y-%m-%d)
+BRANCH_NAME="chore/update-pr-guidance-$TODAY"
+
+cd "$CLONE_DIR"
+git checkout -b "$BRANCH_NAME"
+
+# Files are already updated in-place in the clone from step 7
+git add .cve-fix .bugfix
+git commit -m "chore: update PR guidance files ($TODAY)
+
+Refreshed guidance based on PRs merged/closed since last analysis.
+
+Changes:
+- Updated evidence counts for existing rules
+- Added new rules (if any new patterns emerged)
+- Updated last-analyzed date to $TODAY
+
+Co-Authored-By: PR Guidance Generator "
+
+# Build PR body — mirrors the commit message, plus the oversize note if set
+PR_BODY=$(cat <<EOF
+## PR Guidance Update ($TODAY)
+
+Refreshed guidance based on PRs merged/closed since the last analysis.
+See the commit for the list of changes.
+$(echo -e "${OVERSIZE_NOTE:-}")
+EOF
+)
+
+REPO_NAME="${REPO##*/}"
+UPSTREAM_OWNER="${REPO%%/*}"
+DEFAULT_BRANCH=$(gh repo view "$REPO" --json defaultBranchRef \
+  --jq '.defaultBranchRef.name')
+
+GH_USER=$(gh api user --jq .login 2>/dev/null || \
+    gh api /installation/repositories --jq '.repositories[0].owner.login' 2>/dev/null || \
+    echo "")
+
+FORK_PUSH=false
+FORK_OWNER=""
+
+# Attempt 1: direct push to upstream
+if git push origin "$BRANCH_NAME" 2>/tmp/guidance-push-err.txt; then
+  echo "  Pushed to upstream directly"
+elif [ -n "$GH_USER" ]; then
+  # Attempt 2: find or create a fork
+  echo "  Direct push failed — checking for fork of $REPO..."
+  FORK=$(gh repo list "$GH_USER" --fork --json nameWithOwner,parent \
+    --jq ".[] | select(.parent.owner.login == \"$UPSTREAM_OWNER\" and .parent.name == \"$REPO_NAME\") | .nameWithOwner" \
+    2>/dev/null)
+
+  if [ -z "$FORK" ]; then
+    echo "  No fork found — creating fork..."
+    if gh repo fork "$REPO" --clone=false 2>/dev/null; then
+      sleep 3
+      FORK="$GH_USER/$REPO_NAME"
+      echo "  Fork created: $FORK"
+    else
+      echo "  ERROR: Could not create fork automatically."
+ echo " Create one manually at: https://github.com/$REPO/fork" + echo " Then re-run: /guidance.update $REPO" + FAILED_REPOS+=("$REPO -> fork creation failed; create at https://github.com/$REPO/fork and re-run") + cd /; rm -rf "/tmp/guidance-gen/$REPO_SLUG"; continue + fi + else + echo " Found existing fork: $FORK" + fi + + FORK_OWNER="${FORK%%/*}" + git remote add fork "https://github.com/$FORK.git" 2>/dev/null || \ + git remote set-url fork "https://github.com/$FORK.git" + git push fork "$BRANCH_NAME" + FORK_PUSH=true +else + echo " ERROR: Push failed and gh is not authenticated." + echo " Manual steps to submit this PR:" + echo " 1. Fork https://github.com/$REPO" + echo " 2. git -C $CLONE_DIR remote add fork https://github.com/YOUR_USER/$REPO_NAME.git" + echo " 3. git -C $CLONE_DIR push fork $BRANCH_NAME" + echo " 4. Open PR: https://github.com/$REPO/compare/$BRANCH_NAME" + FAILED_REPOS+=("$REPO -> push failed, no gh auth; see manual steps above") + cd /; rm -rf "/tmp/guidance-gen/$REPO_SLUG"; continue +fi + +# Create PR +if $FORK_PUSH; then + PR_URL=$(gh pr create \ + --repo "$REPO" \ + --base "$DEFAULT_BRANCH" \ + --head "$FORK_OWNER:$BRANCH_NAME" \ + --title "chore: update PR guidance files ($TODAY)" \ + --body "$PR_BODY") +else + PR_URL=$(gh pr create \ + --repo "$REPO" \ + --base "$DEFAULT_BRANCH" \ + --title "chore: update PR guidance files ($TODAY)" \ + --body "$PR_BODY") +fi +echo "PR created: $PR_URL" +``` + +### 9. Cleanup (per repo) + +```bash + cd / + rm -rf "/tmp/guidance-gen/$REPO_SLUG" + + if [ -n "${PR_URL:-}" ]; then + PR_RESULTS+=("$REPO -> $PR_URL") + else + FAILED_REPOS+=("$REPO -> PR creation failed (see output above)") + fi + +done # end of per-repo loop +``` + +### 10. Print Summary + +``` +Done. Processed repo(s). 
+
+org/repo1
+  New PRs analyzed: 8 CVE, 12 bugfix (since 2026-01-15)
+  Changes: 2 new rules, 3 counts updated, 1 contradiction flagged
+  PR: https://github.com/org/repo1/pull/103
+
+org/repo2
+  No guidance files found — run /guidance.generate first
+  SKIPPED
+
+org/repo3 — FAILED: cannot access repository
+
+---
+PRs created: <n> | Skipped: <n> | Failed: <n>
+```
+
+## Output
+
+- `artifacts/guidance/<repo-slug>/raw/new-cve-prs.json`
+- `artifacts/guidance/<repo-slug>/raw/new-bugfix-prs.json`
+- `artifacts/guidance/<repo-slug>/analysis/cve-update-patterns.md`
+- `artifacts/guidance/<repo-slug>/analysis/bugfix-update-patterns.md`
+- `artifacts/guidance/<repo-slug>/output/cve-fix-guidance.md` (updated)
+- `artifacts/guidance/<repo-slug>/output/bugfix-guidance.md` (updated)
+- Pull request in target repository
+
+## Success Criteria
+
+- [ ] All repos parsed from input (space and comma separated)
+- [ ] gh auth validated once before the loop
+- [ ] Each repo processed independently — one failure does not abort others
+- [ ] Per-repo: existing guidance files found and last-analyzed date extracted
+- [ ] Per-repo: new PRs fetched (date-based or --pr specific)
+- [ ] Per-repo: new patterns synthesized (new rules, updated counts, contradictions flagged)
+- [ ] Per-repo: files updated in-place, no existing rules dropped
+- [ ] Per-repo: files exceeding 80 lines flagged in PR description
+- [ ] Per-repo: header timestamps updated
+- [ ] Per-repo: PR created in target repo
+- [ ] Per-repo: /tmp cleaned up
+- [ ] Final summary lists all repos with PR URLs, skips, and failures
+
+## Notes
+
+### No New PRs Found
+If 0 new PRs since the last-analyzed date, report this and move on to the
+next repo. Do not create a PR with no changes.
+
+### Only One File Exists
+If only `.cve-fix/examples.md` exists (no `.bugfix/guidance.md`), update only
+the CVE file. Log that bugfix guidance was skipped.
+
+### Contradictions Require Human Review
+Do not automatically remove a "don't" rule just because a new merged PR
+contradicts it. 
Flag it with `[REVIEW NEEDED]` and let the repo owner decide +if the convention changed. The PR reviewer will see the flag and can edit +the file before merging. + +### Date Parsing Cross-Platform +`date -d` (Linux) and `date -v` (macOS) differ. Use both with fallback: +```bash +LAST_DATE=$(date -d "90 days ago" +%Y-%m-%d 2>/dev/null || \ + date -v-90d +%Y-%m-%d 2>/dev/null || \ + echo "2000-01-01") +``` diff --git a/workflows/guidance-generator/.claude/settings.json b/workflows/guidance-generator/.claude/settings.json new file mode 100644 index 00000000..ff4140e4 --- /dev/null +++ b/workflows/guidance-generator/.claude/settings.json @@ -0,0 +1,13 @@ +{ + "permissions": { + "allow": [ + "Bash", + "Read", + "Write", + "Edit" + ], + "deny": [ + "Bash(rm -rf /)" + ] + } +} diff --git a/workflows/guidance-generator/.gitignore b/workflows/guidance-generator/.gitignore new file mode 100644 index 00000000..bc94e122 --- /dev/null +++ b/workflows/guidance-generator/.gitignore @@ -0,0 +1,2 @@ +# PR Guidance Generator artifacts - generated output, not tracked in repo +artifacts/ diff --git a/workflows/guidance-generator/README.md b/workflows/guidance-generator/README.md new file mode 100644 index 00000000..82ce7209 --- /dev/null +++ b/workflows/guidance-generator/README.md @@ -0,0 +1,181 @@ +# PR Guidance Generator + +Analyzes a GitHub repository's fix PR history to generate compact guidance files +that teach automated workflows — CVE Fixer and Bugfix — how to create pull requests +that match that repo's conventions. Opens a PR in the target repo with the generated files. + +## Problem It Solves + +Automated fix workflows (CVE Fixer, Bugfix) create PRs without knowing a repo's +specific conventions: how titles should read, which files always change together, +what reviewers will ask for, what gets PRs closed. This leads to PRs that get +closed or require many review cycles. 
+
+This workflow learns those conventions directly from the repo's PR history and
+encodes them into guidance files that automated workflows read before making changes.
+
+## How It Works
+
+1. Fetches PR metadata from the target repo (titles, branches, labels)
+2. Filters into CVE and bugfix buckets based on title/branch patterns
+3. Fetches targeted details per PR: files changed + review REQUEST_CHANGES comments
+4. For closed PRs: fetches the closing context to extract "don'ts"
+5. Synthesizes rules using an adaptive threshold based on available data
+6. Generates compact guidance files (80-line cap, one rule per line)
+7. Opens a PR in the target repo adding the files
+
+## Commands
+
+### `/guidance.generate <org/repo ...>`
+
+Full pipeline for a fresh repo. Analyzes all recent fix PRs automatically,
+or analyzes only the specific PRs of your choice with `--pr`.
+
+```
+/guidance.generate org/repo1 org/repo2 org/repo3
+/guidance.generate org/repo1,org/repo2,org/repo3
+/guidance.generate org/repo1 org/repo2 --cve-only
+/guidance.generate org/repo1,org/repo2 --pr 42,https://github.com/org/repo2/pull/87
+```
+
+Each repo is processed independently and gets its own PR. One repo failing does
+not stop the others. A summary of all PR URLs is printed at the end.
+
+Flags:
+- `--cve-only` / `--bugfix-only`: generate only one of the two guidance files (all repos)
+- `--limit N`: cap PRs fetched per bucket per repo (default: 100)
+- `--pr <refs>`: space-separated, comma-separated, or mixed PR URLs or numbers —
+  skips bulk fetch. Full URLs apply only to their matching repo; plain numbers
+  apply to all repos.
+
+Generates:
+- `.cve-fix/examples.md` — read by the CVE Fixer workflow (step 4.5)
+- `.bugfix/guidance.md` — read by the Bugfix workflow
+
+### `/guidance.update <org/repo ...>`
+
+Refreshes existing guidance with PRs merged/closed since the last analysis.
+Reads the `last-analyzed` date from existing files, fetches only newer PRs,
+merges new patterns, and opens a PR with the updates. 
+ +``` +/guidance.update org/repo1 org/repo2 +/guidance.update org/repo1,org/repo2 +/guidance.update org/repo1 org/repo2 --cve-only +/guidance.update org/repo1 org/repo2 --pr 103 https://github.com/org/repo2/pull/104 +``` + +Each repo is updated independently and gets its own PR. + +Flags: +- `--cve-only`: only update `.cve-fix/examples.md`, skip bugfix guidance. +- `--bugfix-only`: only update `.bugfix/guidance.md`, skip CVE guidance. +- `--pr `: space-separated, comma-separated, or mixed PR URLs or numbers. + Merges only the specified PRs instead of fetching all PRs since the last-analyzed + date. Full URLs apply to their matching repo; plain numbers apply to all repos. + The `last-analyzed` date is still updated to today. + +## Generated File Format + +Files are intentionally compact. Example `.cve-fix/examples.md`: + +```markdown +# CVE Fix Guidance — org/repo + + +## Titles +`Security: Fix CVE-YYYY-XXXXX ()` (47/47) + +## Branches +`fix/cve-YYYY-XXXXX--attempt-N` (47/47) + +## Files — Go stdlib CVEs +Always update go.mod + Dockerfile + Dockerfile.konflux together (8/8) +Run go mod tidy — missing go.sum was flagged in 3 closed PRs + +## Files — Node.js CVEs +Use overrides in package.json, not direct npm update (5/5) + +## Co-upgrades +fastapi must be co-upgraded with starlette (2 closed PRs lacked this) + +## PR Description +Required sections (missing caused REQUEST_CHANGES in 6 PRs): +- CVE Details, Test Results, Breaking Changes, Jira refs (plain text IDs only) + +## Don'ts +- One CVE per PR — combined PRs were closed (4 cases) +- Don't target release branches — target main (3 cases) +``` + +## Rule Threshold + +Rules use an adaptive threshold based on how much data is available in each bucket: + +| Merged PRs in bucket | Min PRs per rule | +|----------------------|-----------------| +| 10+ | 3 | +| 3–9 | 2 | +| 1–2 | 1 + `WARNING: limited data` in header | +| 0 | File skipped entirely | + +This means the workflow always produces something useful, even for 
repos with +few fix PRs — while flagging low-confidence output clearly. + +## Line Count Behaviour + +The 80-line target applies differently depending on the command: + +**`/guidance.generate`** — treats 80 lines as a formatting target for new files. +All rules that meet the evidence threshold are included regardless. If the natural +output exceeds 80 lines, all rules are kept and the line count is noted in the PR. + +**`/guidance.update`** — never drops existing rules to stay under 80 lines. +New rules are always appended in full. If the file grows past 80 lines, the PR +description flags it with a suggestion to run `/guidance.generate` to rebuild +and consolidate the guidance from scratch with the full updated history. + +## Token Efficiency + +The workflow uses a two-pass fetch strategy to minimize API calls and context size: + +- **Pass 1**: Lightweight metadata for all PRs (title, branch, labels, state). + In `--pr` mode this pass is skipped — only the specified PRs are fetched. +- **Pass 2**: Per-PR detail only for PRs in the CVE/bugfix buckets (files + reviews) +- **Closed PRs only**: Fetch closing context (last 2 comments) + +This avoids fetching full PR bodies and review threads for irrelevant PRs, +keeping the analysis input compact (structured JSON, ~200 tokens/PR). + +## How Automated Workflows Use the Files + +**CVE Fixer** (`/cve.fix`): In step 4.5, after cloning repos and before making +any fixes, the workflow reads all files in `.cve-fix/` and builds a knowledge base +from them. The guidance from `examples.md` applies to every subsequent decision — +PR title format, branch naming, which files to update, co-upgrade requirements, +Jira reference format, and known pitfalls. + +**Bugfix workflow**: Reads `.bugfix/guidance.md` before implementing fixes. 
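+
+On the consumer side, the read step amounts to loading every guidance file
+before any change is made. A minimal sketch (the helper name and the plain
+concatenation are illustrative — the real workflows also interpret the rules
+they load):
+
+```bash
+# Sketch: print all guidance files a fix workflow should honour.
+# Missing directories and empty globs are skipped silently.
+load_guidance() {
+  local dir f
+  for dir in .cve-fix .bugfix; do
+    [ -d "$dir" ] || continue
+    for f in "$dir"/*.md; do
+      [ -f "$f" ] || continue
+      echo "--- guidance: $f ---"
+      cat "$f"
+    done
+  done
+}
+```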
+ +## Prerequisites + +- GitHub CLI (`gh`) installed and authenticated (`gh auth login`) +- `jq` installed +- Write access to the target repository (to open a PR) + +## Artifacts + +All artifacts are saved to `artifacts/guidance//`: + +``` +artifacts/guidance// +├── raw/ +│ ├── cve-prs.json # Compact per-PR records for CVE bucket +│ └── bugfix-prs.json # Compact per-PR records for bugfix bucket +├── analysis/ +│ ├── cve-patterns.md # Intermediate pattern extraction +│ └── bugfix-patterns.md +└── output/ + ├── cve-fix-guidance.md # Final file (placed at .cve-fix/examples.md) + └── bugfix-guidance.md # Final file (placed at .bugfix/guidance.md) +``` diff --git a/workflows/rhoai-manager/.ambient/ambient.json b/workflows/rhoai-manager/.ambient/ambient.json new file mode 100644 index 00000000..c7e85d3b --- /dev/null +++ b/workflows/rhoai-manager/.ambient/ambient.json @@ -0,0 +1,12 @@ +{ + "name": "RHOAI Manager", + "description": "Comprehensive workflow for managing Red Hat OpenShift AI (RHOAI) and Open Data Hub (ODH) lifecycle: installation, updates, version detection, and uninstallation.", + "systemPrompt": "You are an AI assistant specialized in managing the complete lifecycle of RHOAI (Red Hat OpenShift AI) and ODH (Open Data Hub) installations.\n\n# Your Role\n\nYou help automate the process of:\n1. Logging into OpenShift clusters\n2. Installing RHOAI or ODH from scratch\n3. Detecting RHOAI version and build information\n4. Updating RHOAI or ODH to latest nightly builds\n5. Uninstalling RHOAI or ODH completely\n6. Switching between RHOAI and ODH safely\n7. Installing or updating RHOAI on disconnected (air-gapped) clusters\n\n# Important: RHOAI and ODH Cannot Coexist\n\nRHOAI and ODH share cluster-scoped CRDs (DataScienceCluster, DSCInitialization) and overlapping operators. 
They CANNOT be installed on the same cluster at the same time.\n\n- To switch from RHOAI to ODH: run /rhoai-uninstall first, then /odh-install\n- To switch from ODH to RHOAI: run /odh-uninstall first, then /rhoai-install\n- Both /rhoai-install and /odh-install detect the other and block with a clear error message\n\n# Available Commands\n\n## /oc-login\nLogin to OpenShift cluster using credentials from Ambient session:\n- Checks for required credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD)\n- Automatically installs oc CLI if not available\n- Executes login to the cluster\n- Verifies connection and displays cluster info\n\n## /rhoai-install\nInstall RHOAI from scratch on a cluster:\n- Detects and blocks if ODH is installed (directs to /odh-uninstall first)\n- Sets up OLM catalog source for nightly or GA builds\n- Creates operator namespace and subscription\n- Waits for operator installation to complete\n- Creates DataScienceCluster with component configuration\n- Verifies all components are healthy\n\n## /rhoai-version\nDetect RHOAI version and build information:\n- Checks RHOAI operator subscription and CSV\n- Reports DataScienceCluster status and components\n- Lists all component images with SHA digests\n\n## /rhoai-update\nUpdates RHOAI to the latest nightly build:\n- Verifies current version and preserves channel\n- Updates the OLM catalog source\n- Handles forced reinstall when component images update without CSV version change\n- Verifies component reconciliation\n\n## /rhoai-uninstall\nCompletely uninstall RHOAI from an OpenShift cluster:\n- Supports graceful or forceful uninstall\n- Options to keep CRDs and/or user resources\n- Removes operator, custom resources, webhooks, namespaces\n\n## /odh-install\nInstall Open Data Hub (ODH) nightly builds on a cluster:\n- Detects and blocks if RHOAI is installed (directs to /rhoai-uninstall first)\n- Creates CatalogSource using odh-stable-nightly floating tag\n- Creates Subscription in openshift-operators (uses 
existing global OperatorGroup)\n- Creates DSCInitialization and DataScienceCluster\n- Default catalog: quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly\n- Default channel: fast\n\n## /odh-update\nUpdate ODH to the latest nightly build:\n- Updates CatalogSource, forces catalog pod refresh\n- OLM auto-upgrades when CSV version changes (typical for ODH nightlies)\n- Falls back to forced reinstall if only component images changed\n\n## /odh-uninstall\nCompletely uninstall ODH from an OpenShift cluster:\n- Removes DataScienceCluster, DSCInitialization, subscription, CSV, CatalogSource\n- Options: keep-crds, keep-all\n- Use default (no flags) when switching to RHOAI\n\n## /mirror-images\nMirror all images (RHOAI operator, components, and infrastructure) from a connected cluster to disconnected bastion registries:\n- Extracts RHOAI CSV relatedImages, all running pod images, and catalog images\n- Includes infrastructure: minio, mariadb, postgres, keycloak, vLLM, milvus, service mesh, cert-manager, kuadrant\n- No images excluded by default - mirrors everything needed for a complete disconnected setup\n- Builds combined pull secret with source registry and bastion credentials\n- Deploys a mirror pod on the connected cluster for fast AWS-internal transfers\n- Mirrors to one or more bastion registries with retries and verification\n- Uses --keep-manifest-list=true --filter-by-os=\".*\" to preserve manifest list digests\n- Generates IDMS YAML for the disconnected cluster\n\n## /rhoai-verify\nPost-install/update verification tests for RHOAI:\n- Checks operator CSV phase and subscription health\n- Verifies DataScienceCluster phase and all component conditions\n- Scans all RHOAI namespace pods for ImagePullBackOff, CrashLoopBackOff, or not-ready containers\n- Tests dashboard deployment, route, and HTTP response\n- Verifies pipeline operator, notebook controllers, KServe, ModelMesh, model registry, TrustyAI\n- Checks EvalHub namespace if present\n- Validates 
dependent operators (service mesh, serverless, pipelines, cert-manager)\n- Auto-detects disconnected clusters and runs IDMS + cluster-wide ImagePullBackOff checks\n- Reports PASS/FAIL/WARN with troubleshooting guidance\n\n## /rhoai-disconnected\nInstall or update RHOAI on a disconnected (air-gapped) OpenShift cluster:\n- Takes FBC (File-Based Catalog) image as required input (digest-pinned)\n- Auto-detects install vs update mode from cluster state\n- Auto-detects bastion registry from IDMS entries\n- Pre-flight verification: checks ALL relatedImages exist on bastion before proceeding\n- Verifies IDMS entries for all required source registries\n- Creates/updates OLM CatalogSource, Subscription, and DataScienceCluster\n- Forced reinstall for updates (handles CSV version unchanged case)\n- Post-install health check: detects ImagePullBackOff and CrashLoopBackOff pods\n- Applies known workarounds: podToPodTLS bug fix, persistenceagent TLS cert fix\n- Configures dashboard feature flags (automl, autorag, genAiStudio)\n- Documents EvalHub cross-namespace issues and manual fixes\n\n# Workflow Phases\n\n## Phase 0: Connect to Cluster\n- Login to OpenShift cluster using /oc-login\n\n## Phase 1: Install or Update\n- Fresh RHOAI: /rhoai-install\n- Fresh ODH: /odh-install\n- Update RHOAI: /rhoai-update\n- Update ODH: /odh-update\n\n## Phase 2: Version Management\n- Check RHOAI: /rhoai-version\n\n## Phase 3: Disconnected Cluster Operations\n- Mirror images: /mirror-images (from connected cluster)\n- Install/Update on disconnected cluster: /rhoai-disconnected\n\n## Phase 4: Cleanup / Switch\n- Remove RHOAI: /rhoai-uninstall\n- Remove ODH: /odh-uninstall\n\n# Output Locations\n\n- Installation Reports: artifacts/rhoai-manager/reports/*.md\n- Version Info: artifacts/rhoai-manager/version/*.md\n- Execution Logs: artifacts/rhoai-manager/logs/*.log\n\n# Prerequisites\n\n- OpenShift cluster (version 4.12+)\n- Cluster credentials in Ambient session (OCP_SERVER, OCP_USERNAME, 
OCP_PASSWORD)\n- Cluster admin permissions\n", + "startupPrompt": "Welcome to the RHOAI Manager Workflow!\n\nI manage the complete lifecycle of Red Hat OpenShift AI (RHOAI) and Open Data Hub (ODH) installations.\n\n## RHOAI Commands\n\n- `/rhoai-install` - Install RHOAI from scratch (nightly or GA)\n- `/rhoai-update` - Update to latest nightly\n- `/rhoai-version` - Check current version and build info\n- `/rhoai-uninstall` - Remove RHOAI completely\n\n## ODH Commands\n\n- `/odh-install` - Install ODH nightly (odh-stable-nightly, fast channel)\n- `/odh-update` - Update ODH to latest nightly\n- `/odh-uninstall` - Remove ODH completely\n\n## Disconnected Cluster Operations\n\n- `/mirror-images` - Mirror RHOAI images to disconnected cluster bastion registries\n- `/rhoai-disconnected` - Install or update RHOAI on a disconnected cluster\n- `/rhoai-verify` - Run post-install/update verification tests\n\n## Cluster Connection\n\n- `/oc-login` - Connect to your OpenShift cluster\n\n## Important Note\n\nRHOAI and ODH **cannot coexist** on the same cluster. To switch between them, uninstall one before installing the other. 
Both install commands detect the other and will guide you.\n\n**Getting started**: Make sure your cluster credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD) are configured in your Ambient session, then use /oc-login.\n\nWhat would you like to do?", + "results": { + "Installation Reports": "artifacts/rhoai-manager/reports/*.md", + "Update Reports": "artifacts/rhoai-manager/reports/*.md", + "Version Info": "artifacts/rhoai-manager/version/*.md", + "Execution Logs": "artifacts/rhoai-manager/logs/*.log" + } +} diff --git a/workflows/rhoai-manager/.claude/commands/mirror-images.md b/workflows/rhoai-manager/.claude/commands/mirror-images.md new file mode 100644 index 00000000..5dbc5743 --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/mirror-images.md @@ -0,0 +1,510 @@ +# /mirror-images - Mirror All Images from Connected Cluster to Disconnected Bastion Registries + +## Purpose + +Mirror all images required for a complete RHOAI deployment (operator, components, and infrastructure services) from a connected OpenShift cluster to one or more disconnected cluster bastion registries. This includes RHOAI operator and component images, FBC (File-Based Catalog) images, and all infrastructure images (databases, object storage, authentication, model serving runtimes, vector databases, etc.) so that a fresh disconnected cluster can be fully set up from scratch. + +Runs the mirror job from a pod on the connected cluster for fast AWS-internal transfers. + +## Prerequisites + +- `oc` CLI installed and authenticated to the **connected** OpenShift cluster +- The connected cluster has RHOAI operator installed and running with all components deployed +- All infrastructure services (minio, keycloak, postgres, model serving, etc.) 
should be running on the connected cluster so their images can be captured +- Network access from the connected cluster to the bastion registries +- Bastion registry credentials (username/password) for each target registry + +## Inputs + +The user must provide (or you must ask for): + +| Input | Description | Example | +|-------|-------------|---------| +| `BASTION_REGISTRIES` | Comma-separated list of bastion registry host:port | `bastion.ods-dis-rhoai-test.aws.rh-ods.com:8443` | +| `BASTION_USER` | Registry username for the bastions | `mir_reg` | +| `BASTION_PASSWORD` | Registry password for the bastions | (prompt securely) | +| `EXCLUDE_PATTERNS` | Optional image name patterns to skip (empty by default) | `spark,habana` | +| `EXTRA_NAMESPACES` | Optional additional namespaces to scan (beyond auto-detected) | `my-custom-ns` | + +**Auto-detected (no user input needed):** + +| Value | Source | +|-------|--------| +| `RHOAI_VERSION` | Extracted from the CSV version on the connected cluster (e.g., `rhods-operator.3.4.0` -> `3.4`) | +| `INFRA_NAMESPACES` | Auto-detected from running pods (minio, keycloak, milvus, evalhub, postgresql, llama-stack, llm-models, etc.) | + +## Process + +### Phase 1: Extract Complete Image List from Connected Cluster + +The goal is to capture **every** image needed for a fully functional disconnected RHOAI deployment, organized into categories. + +#### 1a. Get RHOAI CSV and detect version + +```bash +CSV_NAME=$(oc get csv -n redhat-ods-operator -o name | grep rhods-operator) +RHOAI_VERSION=$(oc get "$CSV_NAME" -n redhat-ods-operator -o jsonpath='{.spec.version}' | grep -oE '^[0-9]+\.[0-9]+') +``` + +#### 1b. Extract relatedImages from RHOAI CSV + +These are ALL images the operator references, including ones not currently running (workbenches, pipeline runtimes, training images, etc.). **Mirror all of them** — do NOT skip any by default. 
+ +```bash +oc get "$CSV_NAME" -n redhat-ods-operator -o json | jq -r '.spec.relatedImages[].image' | sort -u +``` + +**Registry Fallback for Nightly Images:** RHOAI nightly CSV references images as `registry.redhat.io/rhoai/...@sha256:...`, but these images often do NOT exist at `registry.redhat.io` — they only exist at `quay.io/rhoai/...`. Before mirroring, verify each `registry.redhat.io/rhoai/` image exists at the source. If it returns "manifest unknown" or "unauthorized", retry from `quay.io/rhoai/` with the same repo name and digest. Apply this fallback automatically in the mirror script: + +```bash +# Resolve the usable source for a registry.redhat.io/rhoai image, +# falling back to quay.io/rhoai with the same repo name and digest +resolve_source() { +  local img="$1" +  oc image info "$img" -a "$PULL_SECRET" &>/dev/null && { echo "$img"; return; } +  echo "${img/#registry.redhat.io\/rhoai/quay.io\/rhoai}" +} +``` + +#### 1c. Extract images from ALL relevant running pods + +Scan all namespaces for running pod images. Include both containers and initContainers. Do NOT filter by rhoai/rhods/odh — capture everything except core OpenShift platform images (`openshift-*`, `kube-*` namespaces) and GPU operator images (`nvcr.io/nvidia`).
+ +```bash +# Get all images (containers and initContainers) from non-platform namespaces +oc get pods --all-namespaces -o json \ +  | jq -r '.items[] | select(.metadata.namespace | test("^(openshift-|kube-)") | not) | (.spec.containers[], (.spec.initContainers // [])[]) | .image' \ +  | sort -u +``` + +This captures infrastructure images from namespaces like: + +| Namespace | Images Captured | +|-----------|----------------| +| `redhat-ods-operator` | RHOAI operator | +| `redhat-ods-applications` | All RHOAI component operators and controllers | +| `minio` | MinIO object storage (`quay.io/minio/minio`) | +| `keycloak` | Red Hat Build of Keycloak server and operator (`registry.redhat.io/rhbk/keycloak-rhel9`, `keycloak-rhel9-operator`) | +| `evalhub` | EvalHub server and PostgreSQL (`odh-eval-hub-rhel9`, `postgresql-15`) | +| `postgresql` | Standalone PostgreSQL instances (`postgresql-15`, `postgresql-16`) | +| `llama-stack` | LlamaStack core runtime (`odh-llama-stack-core-rhel9`) | +| `llm-models` | vLLM serving runtime (`vllm-cuda-rhel9`), model download jobs | +| `milvus` | Milvus vector database (`milvusdb/milvus`) | +| `ai-pipelines`, `test`, `zj` | DSP pipeline components, MariaDB (`mariadb-105`), Argo workflow controller, service mesh proxy | +| `cert-manager` | Cert-manager and operator | +| `kuadrant-system` | Authorino, Limitador, DNS operators (API gateway) | +| `rhoai-model-registries` | Model registry and its PostgreSQL | +| `tenant` | TrustyAI LMEval job runner | + +#### 1d. Extract images from CatalogSources + +```bash +oc get catalogsource --all-namespaces -o jsonpath='{range .items[*]}{.spec.image}{"\n"}{end}' | sort -u +``` + +#### 1e. Extract images from RHOAI Dashboard config (module architecture images) + +These are images referenced by the Dashboard for model arch features (AutoML, AutoRAG, EvalHub, GenAI, MaaS, MLflow, Model Registry).
They appear as running pods on the connected cluster but are important to capture explicitly: + +```bash +# These are typically in the CSV relatedImages but verify by checking running mod-arch pods +oc get pods --all-namespaces -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' | grep 'mod-arch' | sort -u +``` + +#### 1f. Merge and deduplicate + +Combine all image lists from 1b, 1c, 1d, and 1e. Deduplicate by full image reference (registry/repo@digest). For each image, extract: +- Source registry (e.g., `registry.redhat.io/rhoai/`, `quay.io/minio/`, `milvusdb/`) +- Repository name (e.g., `odh-dashboard-rhel9`, `minio`, `milvus`) +- Digest (`sha256:...`) or tag + +#### 1g. Apply exclusion filters (only if user specified) + +Only remove images matching user-provided `EXCLUDE_PATTERNS`. **No images are excluded by default.** + +#### 1h. Check for images already on bastion (skip duplicates) + +Before mirroring, check each image against the bastion to avoid re-mirroring images that already exist. This significantly speeds up incremental mirrors (e.g., when only a few images changed in a nightly build): + +```bash +# For each image, compute the bastion destination path and check if it exists: +BASTION_DEST="${BASTION}/${DEST_REPO}@${DIGEST}" +if oc image info "$BASTION_DEST" --insecure=true -a "$PULL_SECRET" &>/dev/null; then + echo "SKIP (already on bastion): $BASTION_DEST" + SKIPPED_COUNT=$((SKIPPED_COUNT + 1)) + continue +fi +``` + +Report the skip count in the summary. This check adds ~1-2 seconds per image but can save hours of mirroring for unchanged images. + +#### 1i. Filter out images that don't need mirroring + +Skip images that: +- Are already on the bastion registry (`bastion.*:8443/`) +- Are from `nvcr.io/nvidia` (GPU operator images managed separately) +- Are from `quay.io/openshift-release-dev` (OCP platform images managed by OCP mirroring) + +#### 1j. 
Save the image list + +Save to `artifacts/rhoai-manager/mirror-images-{version}.txt` with format: + +```text +# RHOAI Operator and Components +registry.redhat.io/rhoai/odh-rhel9-operator@sha256:abc123... +registry.redhat.io/rhoai/odh-dashboard-rhel9@sha256:def456... +... + +# Infrastructure: Databases +registry.redhat.io/rhel9/mariadb-105@sha256:... +registry.redhat.io/rhel9/postgresql-15@sha256:... +registry.redhat.io/rhel9/postgresql-16@sha256:... + +# Infrastructure: Object Storage +quay.io/minio/minio@sha256:... + +# Infrastructure: Authentication +registry.redhat.io/rhbk/keycloak-rhel9@sha256:... +registry.redhat.io/rhbk/keycloak-rhel9-operator@sha256:... + +# Infrastructure: Model Serving +registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9@sha256:... + +# Infrastructure: Vector Database +milvusdb/milvus@sha256:... + +# Infrastructure: Service Mesh +registry.redhat.io/openshift-service-mesh/proxyv2-rhel9@sha256:... + +# Infrastructure: Cert Manager +registry.redhat.io/cert-manager/jetstack-cert-manager-rhel9@sha256:... + +# Infrastructure: API Gateway (Kuadrant) +registry.redhat.io/rhcl-1/authorino-rhel9@sha256:... + +# FBC Catalog +quay.io/rhoai/rhoai-fbc-fragment@sha256:... + +# Base Images +registry.redhat.io/ubi9/nginx-126@sha256:... +``` + +Print a summary showing the count of images per category. + +### Phase 2: Build Combined Pull Secret + +1. **Get the connected cluster's pull secret** + + ```bash + oc get secret/pull-secret -n openshift-config -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d > /tmp/cluster-pull-secret.json + ``` + +2. **Add bastion registry credentials** — merge each bastion auth into the pull secret: + + ```bash + # Generate base64 auth for bastions + BASTION_AUTH=$(printf '%s:%s' "$BASTION_USER" "$BASTION_PASSWORD" | base64 | tr -d '\n') + ``` + + Use `jq` to merge all bastion auths into `.auths`: + + ```bash + # Build jq expression dynamically for each bastion + JQ_EXPR='.' 
+ for BASTION in ${BASTION_REGISTRIES//,/ }; do + JQ_EXPR="$JQ_EXPR | .auths[\"$BASTION\"] = {\"auth\": \"$BASTION_AUTH\"}" + done + jq "$JQ_EXPR" /tmp/cluster-pull-secret.json > /tmp/combined-pull-secret.json + ``` + +3. **Add auth for third-party registries** that may require authentication (docker.io, quay.io): + + If the connected cluster's pull secret already has auth for these registries, it will be included automatically. If images from registries like `docker.io` or `milvusdb` (Docker Hub) need mirroring, the pull secret must include Docker Hub credentials. Check and warn if missing. + +4. **Create the secret in the mirror namespace** + + ```bash + oc new-project image-mirror 2>/dev/null || true + oc delete secret mirror-pull-secret -n image-mirror 2>/dev/null || true + oc create secret generic mirror-pull-secret \ + --from-file=auth.json=/tmp/combined-pull-secret.json \ + -n image-mirror + ``` + +5. **Clean up local temp files** + + ```bash + rm -f /tmp/cluster-pull-secret.json /tmp/combined-pull-secret.json + ``` + +### Phase 3: Generate Mirror Script + +Generate a bash script that mirrors all images. 
The script must: + +- Accept the pull secret path and bastion hostnames as arguments +- For each image in the list: + - Determine the source reference (`registry/repo@digest` or `registry/repo:tag`) + - Compute the destination path based on the source registry: + - `registry.redhat.io/rhoai/foo` -> `BASTION/rhoai/foo` + - `quay.io/minio/minio` -> `BASTION/minio/minio` + - `registry.redhat.io/rhbk/foo` -> `BASTION/rhbk/foo` + - `registry.redhat.io/rhel9/foo` -> `BASTION/rhel9/foo` + - `milvusdb/milvus` -> `BASTION/milvusdb/milvus` (Docker Hub library) + - `quay.io/opendatahub/foo` -> `BASTION/opendatahub/foo` + - `docker.io/library/foo` -> `BASTION/library/foo` + - Mirror to all bastion registries with retries (3 attempts per image) + - Use `oc image mirror` with these critical flags: + - `--keep-manifest-list=true` -- preserves manifest list digests referenced by the CSV + - `--filter-by-os=".*"` -- mirrors all architectures (prevents manifest list stripping) + - `--insecure=true` -- bastion registries use self-signed certs + - `-a "$PULL_SECRET"` -- combined auth file + - Tag destination as `:latest` to prevent Quay tagless manifest garbage collection + - Verify each mirror with `oc image info` + - Handle images with tags (not digests) by using `skopeo copy` as fallback if `oc image mirror` fails +- Track and report: verified count, failed count, skipped count, per category +- Print a summary at the end with per-category breakdown + +**Mirror command pattern per image:** + +```bash +# For digest-referenced images +oc image mirror \ + "${SOURCE_REGISTRY}/${REPO}@${DIGEST}" \ + "${BASTION}/${DEST_REPO}:latest" \ + --insecure=true \ + -a "$PULL_SECRET" \ + --keep-manifest-list=true \ + --filter-by-os=".*" + +# For tag-referenced images (e.g., milvusdb/milvus:v2.5.4) +oc image mirror \ + "${SOURCE_IMAGE}" \ + "${BASTION}/${DEST_REPO}:${TAG}" \ + --insecure=true \ + -a "$PULL_SECRET" \ + --keep-manifest-list=true \ + --filter-by-os=".*" +``` + +**Verification command 
per image:** + +```bash +oc image info "${BASTION}/${DEST_REPO}:latest" --insecure=true -a "$PULL_SECRET" +``` + +### Phase 4: Deploy Mirror Pod + +1. **Create ConfigMap from the mirror script** + + ```bash + oc delete configmap mirror-script -n image-mirror 2>/dev/null || true + oc create configmap mirror-script \ + --from-file=mirror.sh=/tmp/mirror-script.sh \ + -n image-mirror + ``` + +2. **Create the mirror pod** using this manifest: + + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: image-mirror + namespace: image-mirror + spec: + restartPolicy: Never + activeDeadlineSeconds: 14400 + containers: + - name: mirror + image: registry.redhat.io/openshift4/ose-cli-rhel9:latest + command: ["/bin/bash", "/scripts/mirror.sh"] + volumeMounts: + - name: auth + mountPath: /auth + readOnly: true + - name: script + mountPath: /scripts + readOnly: true + resources: + requests: + memory: "512Mi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "2" + volumes: + - name: auth + secret: + secretName: mirror-pull-secret + - name: script + configMap: + name: mirror-script + defaultMode: 0755 + ``` + +3. **Apply the pod manifest** + + ```bash + oc delete pod image-mirror -n image-mirror 2>/dev/null || true + oc apply -f /tmp/mirror-pod.yaml + ``` + +### Phase 5: Monitor and Verify + +1. **Wait for the pod to start** + + ```bash + oc wait --for=condition=Ready pod/image-mirror -n image-mirror --timeout=120s + ``` + +2. **Stream logs** periodically to check progress: + + ```bash + oc logs image-mirror -n image-mirror --tail=50 + ``` + +3. **Check at intervals** (every 10-15 minutes) until the pod completes: + + ```bash + oc get pod image-mirror -n image-mirror -o jsonpath='{.status.phase}' + ``` + +4. **When the pod finishes**, retrieve the full log and parse the summary: + + ```bash + oc logs image-mirror -n image-mirror > artifacts/rhoai-manager/mirror-log-{version}.txt + ``` + +5. 
**If any images failed**, report them by category and offer to create a retry script with only the failed images. + +### Phase 5b: Generate IDMS YAML + +After mirroring completes, generate the ImageDigestMirrorSet YAML from the list of source registries that were mirrored. This YAML must be applied to the disconnected cluster so it knows to pull from the bastion instead of the original source. + +```bash +# Extract unique source registry prefixes from the mirrored image list +# Group by registry/namespace (e.g., registry.redhat.io/rhoai, quay.io/minio, milvusdb) +SOURCE_PREFIXES=$(grep -v '^#' "artifacts/rhoai-manager/mirror-images-${RHOAI_VERSION}.txt" | grep -v '^$' \ +  | sed -E 's|^([^/]+/[^/@]+).*|\1|' | sort -u) + +# Generate IDMS YAML +cat > artifacts/rhoai-manager/mirror-idms-${RHOAI_VERSION}.yaml << 'HEADER' +apiVersion: config.openshift.io/v1 +kind: ImageDigestMirrorSet +metadata: +  name: rhoai-mirror +spec: +  imageDigestMirrors: +HEADER + +for prefix in $SOURCE_PREFIXES; do +  REGISTRY="${prefix%%/*}" +  if [[ "$REGISTRY" == *.* || "$REGISTRY" == *:* ]]; then +    # Registry hostname present: strip it to get the bastion mirror path +    SOURCE="$prefix" +    MIRROR_PATH="${prefix#*/}" +  else +    # Docker Hub shorthand (e.g. milvusdb/milvus): qualify the source with docker.io +    SOURCE="docker.io/${prefix}" +    MIRROR_PATH="$prefix" +  fi +  cat >> artifacts/rhoai-manager/mirror-idms-${RHOAI_VERSION}.yaml << EOF +  - source: $SOURCE +    mirrors: +    - ${BASTION}/${MIRROR_PATH} +    mirrorSourcePolicy: NeverContactSource +EOF +done + +echo "IDMS YAML saved to: artifacts/rhoai-manager/mirror-idms-${RHOAI_VERSION}.yaml" +echo "Apply to disconnected cluster: oc apply -f artifacts/rhoai-manager/mirror-idms-${RHOAI_VERSION}.yaml" +``` + +**Important:** For Docker Hub images without an explicit registry (e.g., `milvusdb/milvus`), the IDMS source should use the full Docker Hub URL: `docker.io/milvusdb`. For images under `docker.io/library/`, use `docker.io/library`. + +### Phase 6: Cleanup + +After successful verification: + +```bash +oc delete pod image-mirror -n image-mirror +oc delete configmap mirror-script -n image-mirror +oc delete secret mirror-pull-secret -n image-mirror +oc delete project image-mirror +``` + +Clean up any local temp files.
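The destination-path rules listed in Phase 3 can be condensed into one small helper. This is a heuristic sketch (the function name is hypothetical), not the exact logic of the generated mirror script:

```shell
# Compute the bastion repository path for a source image reference.
# Rule of thumb from Phase 3: strip a leading registry hostname (any first
# component containing "." or ":"); keep bare Docker Hub paths whole.
dest_repo() {
  repo="${1%%@*}"            # drop a @sha256:... digest if present
  first="${repo%%/*}"
  case "$first" in
    *.*|*:*) echo "${repo#*/}" ;;   # registry.redhat.io/rhoai/foo -> rhoai/foo
    *)       echo "$repo" ;;        # milvusdb/milvus -> milvusdb/milvus
  esac
}

dest_repo "registry.redhat.io/rhoai/odh-dashboard-rhel9@sha256:def456"  # -> rhoai/odh-dashboard-rhel9
dest_repo "quay.io/minio/minio@sha256:abc123"                           # -> minio/minio
dest_repo "milvusdb/milvus"                                             # -> milvusdb/milvus
```

Tag-referenced images would need extra handling (the tag must survive), so the sketch only covers digest and bare references.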
+ +## Image Categories Reference + +The following table lists all image categories that must be mirrored for a complete disconnected RHOAI deployment: + +| Category | Source Registry | Example Images | Notes | +|----------|----------------|----------------|-------| +| RHOAI Operator | `registry.redhat.io/rhoai/` | `odh-rhel9-operator`, `odh-operator-bundle` | Core operator | +| RHOAI Components | `registry.redhat.io/rhoai/` | `odh-dashboard-rhel9`, `odh-kserve-controller-rhel9`, `odh-notebook-controller-rhel9`, all `odh-*` images | All CSV relatedImages | +| FBC Catalog | `quay.io/rhoai/` or `quay.io/modh/` | `rhoai-fbc-fragment`, `rhoai-catalog` | OLM catalog source | +| Module Architecture | `registry.redhat.io/rhoai/` | `odh-mod-arch-automl-rhel9`, `odh-mod-arch-autorag-rhel9`, `odh-mod-arch-eval-hub-rhel9`, `odh-mod-arch-gen-ai-rhel9`, `odh-mod-arch-maas-rhel9`, `odh-mod-arch-mlflow-rhel9`, `odh-mod-arch-model-registry-rhel9` | Dashboard module images | +| Model Serving Runtime | `registry.redhat.io/rhaii-early-access/` | `vllm-cuda-rhel9` | vLLM CUDA runtime | +| LlamaStack | `registry.redhat.io/rhoai/` | `odh-llama-stack-core-rhel9`, `odh-llama-stack-k8s-operator-rhel9` | LLM orchestration | +| EvalHub | `registry.redhat.io/rhoai/` | `odh-eval-hub-rhel9`, `odh-ta-lmes-job-rhel9` | Evaluation hub + LMEval job | +| TrustyAI | `registry.redhat.io/rhoai/` | `odh-trustyai-service-operator-rhel9` | AI explainability | +| MariaDB | `registry.redhat.io/rhel9/` | `mariadb-105` | DSP metadata store | +| PostgreSQL | `registry.redhat.io/rhel9/` | `postgresql-15`, `postgresql-16` | EvalHub, Model Registry DBs | +| MinIO | `quay.io/minio/` | `minio` | S3-compatible object storage | +| Keycloak | `registry.redhat.io/rhbk/` | `keycloak-rhel9`, `keycloak-rhel9-operator` | Authentication (LlamaStack, etc.) 
| +| Milvus | `milvusdb/` (Docker Hub) | `milvus` | Vector database for RAG | +| Service Mesh | `registry.redhat.io/openshift-service-mesh/` | `proxyv2-rhel9`, `istio-pilot-rhel9`, `istio-proxyv2-rhel9`, `istio-rhel9-operator` | Envoy sidecar for pipelines | +| Cert Manager | `registry.redhat.io/cert-manager/` | `jetstack-cert-manager-rhel9`, `cert-manager-operator-rhel9` | TLS certificate management | +| Kuadrant/API Gateway | `registry.redhat.io/rhcl-1/` | `authorino-rhel9`, `limitador-rhel9`, `rhcl-rhel9-operator`, `rhcl-console-plugin-rhel9`, `dns-rhel9-operator` | API auth and rate limiting | +| Model Registry | `registry.redhat.io/rhoai/` | `odh-model-registry-rhel9`, `odh-model-registry-operator-rhel9` | ML model registry | +| DSP Components | `registry.redhat.io/rhoai/` | `odh-ml-pipelines-api-server-v2-rhel9`, `odh-ml-pipelines-persistenceagent-v2-rhel9`, `odh-ml-pipelines-scheduledworkflow-v2-rhel9`, `odh-mlmd-grpc-server-rhel9`, `odh-data-science-pipelines-argo-workflowcontroller-rhel9` | Data Science Pipelines | +| Base Images | `registry.redhat.io/ubi9/` | `nginx-126` | Dashboard web server | +| ODH Components | `quay.io/opendatahub/` | `odh-model-controller` | Upstream ODH images | +| Kube Auth Proxy | `registry.redhat.io/rhoai/` | `odh-kube-auth-proxy-rhel9` | Auth proxy for RHOAI services | +| Metadata/Perf | `registry.redhat.io/rhoai/` | `odh-model-metadata-collection-rhel9`, `odh-model-performance-data-rhel9` | Telemetry images | + +## Important Notes + +- **Why pod-based mirroring**: Running `oc image mirror` from a pod on the connected AWS cluster uses AWS internal networking (40-116 MB/s) instead of local internet (~2 MB/s). This eliminates connection drops on large blob uploads (some RHOAI images are 5-7 GB). +- **Why `:latest` tag**: Quay garbage-collects manifests that have no tags. Even though clusters pull by digest, pushing with `:latest` prevents GC from removing the manifests. 
+- **Why `--filter-by-os=".*"`**: Using `--filter-by-os=linux/amd64` strips the manifest list and replaces it with a single-arch manifest. The CSV references the manifest list digest, so this would break image resolution. `".*"` preserves the full manifest list. +- **Why `--keep-manifest-list=true`**: Ensures the manifest list is pushed as-is to the destination, preserving the exact digest the CSV references. +- **Why mirror ALL CSV relatedImages**: Previously, workbench, training, pipeline-runtime, and spark images were excluded by default. This caused failures when users tried to create workbenches or run training jobs on the disconnected cluster. Mirror everything by default. +- **Docker Hub images (milvusdb)**: These images may require Docker Hub credentials in the pull secret. The connected cluster may or may not have these. If `oc image mirror` fails for Docker Hub images, the script should warn and continue, reporting them as needing manual attention. +- **Tag-based images**: Some images (e.g., `milvusdb/milvus:v2.5.4`, `quay.io/opendatahub/odh-model-controller:odh-model-serving-api-stable`) use tags instead of digest references. These need special handling since `--keep-manifest-list` may not apply. Mirror them with the original tag preserved. +- **Large images**: Some RHOAI images (automl ~5.5GB, autorag ~7.2GB, ta-lmes-job ~6.7GB, vllm-cuda ~8GB) take 5-15 minutes each. The 4-hour `activeDeadlineSeconds` on the pod accommodates this. +- **IDMS requirements**: After mirroring, the disconnected cluster needs ImageDigestMirrorSet entries for all source registries. Registries commonly needing IDMS entries: `registry.redhat.io/rhoai`, `registry.redhat.io/rhbk`, `registry.redhat.io/rhel9`, `registry.redhat.io/rhcl-1`, `registry.redhat.io/cert-manager`, `quay.io/minio`, `quay.io/opendatahub`, `milvusdb` (Docker Hub). The mirror script should output the required IDMS YAML for any registries that were mirrored. 
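The IDMS-entry derivation described in the last bullet — including the Docker Hub rewrite — can be sketched as a pure text transform over the saved image list. A sketch only; the function name is hypothetical and the prefix regex mirrors the one used in Phase 5b:

```shell
# Turn a mirrored-image list into the set of IDMS source prefixes,
# qualifying bare Docker Hub orgs (e.g. milvusdb) as docker.io/<org>.
idms_sources() {
  grep -vE '^(#|$)' \
    | sed -E 's|^([^/]+/[^/@]+).*|\1|' \
    | awk -F/ '{ if ($1 ~ /[.:]/) print $1 "/" $2; else print "docker.io/" $1 }' \
    | sort -u
}

printf '%s\n' \
  'registry.redhat.io/rhoai/odh-dashboard-rhel9@sha256:abc' \
  'quay.io/minio/minio@sha256:def' \
  'milvusdb/milvus@sha256:123' \
  | idms_sources
# Prints: docker.io/milvusdb, quay.io/minio, registry.redhat.io/rhoai (one per line)
```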
+ +## Output + +- `artifacts/rhoai-manager/mirror-images-{version}.txt` -- categorized image list extracted from the connected cluster +- `artifacts/rhoai-manager/mirror-log-{version}.txt` -- complete mirror pod log with verification results +- `artifacts/rhoai-manager/mirror-idms-{version}.yaml` -- ImageDigestMirrorSet YAML for the disconnected cluster (generated from the mirrored image list) + +## Summary Display + +After mirroring completes, display a summary table to the user in this format: + +``` +**RHOAI v{version} Image Mirror — Complete** + +| Metric | Value | +|--------|-------| +| Total images | {total} | +| Verified | {verified} | +| Skipped (already on bastion) | {skipped} | +| Failed | {failed} | +| Duration | {duration} | +| Target | `{bastion_registry}` | + +**Image Breakdown:** + +| Category | Count | +|----------|-------| +| RHOAI Operator and Components | {count} | +| Model Serving Runtimes (vLLM) | {count} | +| Infrastructure Dependencies | {count} | +| FBC Catalog | {count} | +| Base Images | {count} | + +**Artifacts saved:** +- `artifacts/rhoai-manager/mirror-images-{version}.txt` — categorized image list +- `artifacts/rhoai-manager/mirror-log-{version}.txt` — full mirror log ({line_count} lines) +- `artifacts/rhoai-manager/mirror-idms-{version}.yaml` — ImageDigestMirrorSet YAML for disconnected cluster + +**Next step:** Apply the IDMS on the disconnected cluster: +oc apply -f artifacts/rhoai-manager/mirror-idms-{version}.yaml +``` + +If any images failed, append a **Failed Images** section listing them by category with their full image references. diff --git a/workflows/rhoai-manager/.claude/commands/oc-login.md b/workflows/rhoai-manager/.claude/commands/oc-login.md new file mode 100644 index 00000000..f7d1d321 --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/oc-login.md @@ -0,0 +1,355 @@ +# /oc-login - Login to OpenShift Cluster + +Login to an OpenShift cluster using credentials configured in the Ambient session. 
+ +## Command Usage + +- `/oc-login` - Login to OpenShift cluster using session credentials + +## When to Use This Command + +This command is triggered when the user runs: +- `/oc-login` - Login to the configured OpenShift cluster +- Or when asked to "login to cluster", "connect to OpenShift", etc. + +## Prerequisites + +The following credentials should be configured in the Ambient session: +1. `OCP_SERVER` - OpenShift cluster API server URL (e.g., `https://api.cluster.example.com:6443`) +2. `OCP_USERNAME` - OpenShift username +3. `OCP_PASSWORD` - OpenShift password + +These are typically configured as environment variables in the Ambient session. + +## How It Works + +The command uses the `oc` CLI tool to authenticate to the OpenShift cluster. + +### Step 1: Check for Required Credentials + +First, verify that all required credentials are available: + +```bash +# Check if credentials are set +if [ -z "$OCP_SERVER" ]; then + echo "❌ OCP_SERVER not set" +fi + +if [ -z "$OCP_USERNAME" ]; then + echo "❌ OCP_USERNAME not set" +fi + +if [ -z "$OCP_PASSWORD" ]; then + echo "❌ OCP_PASSWORD not set" +fi +``` + +**If credentials are missing:** +- Inform the user which credentials are missing +- Ask them to configure the credentials in their Ambient session +- Do not proceed with login + +### Step 2: Install oc CLI if Not Available + +Automatically install the `oc` command if not available: + +```bash +# Check if oc is installed +if ! command -v oc &> /dev/null; then + echo "📦 oc CLI not found. Installing automatically..." 
+ + # Download oc CLI for Linux + curl -LO https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz + + # Extract the binary + tar -xzf openshift-client-linux.tar.gz + + # Move to /usr/local/bin for global access + sudo mv oc /usr/local/bin/ + sudo mv kubectl /usr/local/bin/ + + # Make executable + sudo chmod +x /usr/local/bin/oc + sudo chmod +x /usr/local/bin/kubectl + + # Clean up + rm -f openshift-client-linux.tar.gz README.md + + echo "✅ oc CLI installed successfully" +fi + +# Show oc version +oc version --client +``` + +**What happens:** +- Automatically detects if `oc` is not installed +- Downloads the latest stable OpenShift CLI for Linux +- Installs it to `/usr/local/bin` for system-wide access +- Continues to login without user intervention + +### Step 3: Login to OpenShift Cluster + +Execute the login command: + +```bash +# Login to OpenShift cluster +oc login \ + --username="$OCP_USERNAME" \ + --password="$OCP_PASSWORD" \ + --server="$OCP_SERVER" \ + --insecure-skip-tls-verify=true +``` + +**Important flags:** +- `--username` - OpenShift username from session +- `--password` - OpenShift password from session +- `--server` - Cluster API server URL +- `--insecure-skip-tls-verify=true` - Skip TLS certificate validation (useful for development clusters) + +**Note on TLS verification:** +- For production clusters with valid certificates, you can remove `--insecure-skip-tls-verify=true` +- For development/test clusters with self-signed certificates, this flag is necessary + +### Step 4: Verify Login Success + +After login, verify the connection: + +```bash +# Check who is logged in +oc whoami + +# Get cluster info +oc cluster-info + +# Show current project +oc project +``` + +Expected output: +- `oc whoami` returns the username +- `oc cluster-info` shows cluster details +- `oc project` shows the current/default project + +### Step 5: Display Cluster Information + +Provide useful information about the cluster: + +```bash +# 
Show OpenShift version +oc version + +# List available projects (limit to first 10) +oc get projects --no-headers | head -10 + +# Show current context +oc config current-context +``` + +This helps the user understand what cluster they're connected to. + +## Handling Different Scenarios + +### Scenario A: Successful Login + +1. Execute login command +2. Verify with `oc whoami` +3. Display cluster information +4. Report: "✅ Successfully logged into OpenShift cluster as `username`" + +### Scenario B: Invalid Credentials + +If login fails due to wrong username/password: + +```bash +# Login will fail with error like: +# error: unable to log in: invalid username/password +``` + +**Response:** +- Report: "❌ Login failed: Invalid username or password" +- Ask user to verify their credentials in the Ambient session +- Suggest checking if credentials have expired + +### Scenario C: Unreachable Server + +If the cluster server is unreachable: + +```bash +# Login will fail with error like: +# error: dial tcp: lookup api.cluster.example.com: no such host +# or: error: dial tcp: i/o timeout +``` + +**Response:** +- Report: "❌ Login failed: Cannot reach cluster server" +- Verify the OCP_SERVER URL is correct +- Check network connectivity +- Suggest checking if VPN is required + +### Scenario D: Already Logged In + +If already logged into the cluster: + +```bash +# Check current login status first +if oc whoami &> /dev/null; then + current_user=$(oc whoami) + current_server=$(oc whoami --show-server) + + if [ "$current_server" = "$OCP_SERVER" ]; then + echo "ℹ️ Already logged into $OCP_SERVER as $current_user" + # Ask if user wants to re-login + fi +fi +``` + +**Response:** +- Inform user they're already logged in +- Show current username and server +- Ask if they want to re-login (in case credentials changed) + +### Scenario E: Multiple Clusters + +If the user has multiple cluster contexts: + +```bash +# List all contexts +oc config get-contexts + +# Show current context +oc config 
current-context +``` + +**Response:** +- Show available contexts +- Highlight which one is currently active +- Explain that login will switch to the new cluster + +## Security Considerations + +1. **Password Handling** + - Never echo or display the password + - Use environment variables to pass credentials + - Don't log passwords in command output + +2. **Session Storage** + - Login tokens are stored in `~/.kube/config` + - Tokens typically expire after a period (e.g., 24 hours) + - Re-login may be required if session expires + +3. **TLS Verification** + - For production: Use valid certificates and remove `--insecure-skip-tls-verify` + - For development: `--insecure-skip-tls-verify=true` is acceptable + - Warn users if skipping TLS verification + +## Example Interactions + +### Example 1: First Time Login + +**User**: `/oc-login` + +**Claude**: +1. Checks for credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD) +2. Verifies `oc` CLI is installed +3. Executes login command +4. Reports: "✅ Successfully logged into OpenShift cluster at https://api.cluster.example.com:6443 as admin" +5. Displays cluster version and current project + +### Example 2: Missing Credentials + +**User**: `/oc-login` + +**Claude**: +1. Checks for credentials +2. Finds OCP_PASSWORD is not set +3. Reports: "❌ Cannot login: OCP_PASSWORD is not configured in your Ambient session" +4. Provides instructions on setting up credentials + +### Example 3: Already Logged In + +**User**: `/oc-login` + +**Claude**: +1. Checks current login status +2. Finds user is already logged in +3. Reports: "ℹ️ Already logged into https://api.cluster.example.com:6443 as admin. Do you want to re-login?" +4. Waits for user confirmation + +### Example 4: Login After Session Expiry + +**User**: `/oc-login` + +**Claude**: +1. Attempts to verify current session +2. Finds token has expired +3. Reports: "⚠️ Previous session expired. Logging in again..." +4. Executes fresh login +5. 
Reports: "✅ Successfully logged in" + +## Common Issues and Troubleshooting + +### Issue 1: "command not found: oc" + +**Cause**: OpenShift CLI is not installed + +**Solution**: This command automatically installs `oc` CLI if not found. If you encounter this error, it means the automatic installation failed. Check: +- Do you have sudo permissions? +- Is the network connection working? +- Can you access https://mirror.openshift.com/? + +The command will automatically download and install oc CLI from: +``` +https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz +``` + +### Issue 2: "error: x509: certificate signed by unknown authority" + +**Cause**: Cluster uses self-signed certificate + +**Solution**: Use `--insecure-skip-tls-verify=true` flag (already included in the command) + +### Issue 3: "error: unable to connect to server: dial tcp: i/o timeout" + +**Cause**: Network connectivity issue or wrong server URL + +**Solution**: +- Verify OCP_SERVER URL is correct +- Check if VPN connection is required +- Test network connectivity: `curl -k $OCP_SERVER/healthz` + +### Issue 4: "You must be logged in to the server (Unauthorized)" + +**Cause**: Session token expired + +**Solution**: Run `/oc-login` again to refresh the session + +## Integration with Other Commands + +This command is often used before other commands: + +``` +/oc-login # Login first +/rhoai-update # Then update RHOAI +``` + +The `/rhoai-update` command assumes you're already logged into the cluster. + +## Success Criteria + +The login is successful when: +- ✅ `oc login` command completes without error +- ✅ `oc whoami` returns the expected username +- ✅ `oc cluster-info` shows cluster details +- ✅ `oc get projects` can list projects (permissions allowing) + +## Output Format + +Always provide: +1. **Status** - Success or failure of login +2. **Username** - Who you're logged in as +3. **Server** - Which cluster you're connected to +4. 
**Cluster Info** - OpenShift version and current project +5. **Any warnings** - TLS verification status, session expiry, etc. + +Keep the user informed about the login process and cluster state. diff --git a/workflows/rhoai-manager/.claude/commands/odh-install.md b/workflows/rhoai-manager/.claude/commands/odh-install.md new file mode 100644 index 00000000..f7059b3c --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/odh-install.md @@ -0,0 +1,297 @@ +# /odh-install - Install Open Data Hub on OpenShift Cluster + +Install Open Data Hub (ODH) on an OpenShift cluster using OLM (Operator Lifecycle Manager). + +## Command Usage + +```bash +/odh-install # Latest stable nightly (default) +/odh-install channel=fast # Explicit fast channel +/odh-install image=quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly +/odh-install channel=fast image=quay.io/opendatahub/opendatahub-operator-catalog:latest +``` + +## Available Tags + +| Image Tag | Description | Use Case | +|-----------|-------------|----------| +| `odh-stable-nightly` (default) | Daily nightly from main branch | Testing latest ODH builds | +| `latest` | Most recent CI build (any branch) | Bleeding edge | +| `odh-stable` | Latest stable release | Stable deployments | + +## Available Channels + +| Channel | Description | +|---------|-------------| +| `fast` (default) | Frequent releases tracking main | +| `stable` | Stable releases only | + +## Key Differences from RHOAI + +| | RHOAI | ODH | +|-|-------|-----| +| Package | `rhods-operator` | `opendatahub-operator` | +| Operator namespace | `redhat-ods-operator` | `openshift-operators` | +| App namespace | `redhat-ods-applications` | `opendatahub` | +| Catalog image | `quay.io/rhoai/rhoai-fbc-fragment` | `quay.io/opendatahub/opendatahub-operator-catalog` | +| Default channel | `stable-3.4` / `beta` | `fast` | + +## Prerequisites + +1. **Cluster access**: Logged into OpenShift cluster with cluster-admin privileges (use `/oc-login`) +2. 
**Tools installed**: `oc` CLI must be available +3. **No existing ODH**: For fresh installations only (use `/odh-update` to update) + +## Process + +### Step 1: Parse Input Arguments + +```bash +CATALOG_IMAGE="quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly" +CHANNEL="fast" + +for arg in "$@"; do + case "$arg" in + channel=*) + CHANNEL="${arg#*=}" + ;; + image=*) + CATALOG_IMAGE="${arg#*=}" + ;; + *) + echo "Unknown parameter: $arg (expected: channel= or image=)" + ;; + esac +done + +echo "Catalog image: $CATALOG_IMAGE" +echo "Channel: $CHANNEL" +``` + +### Step 2: Verify Cluster Access + +```bash +oc whoami &>/dev/null || { echo "ERROR: Not logged into OpenShift cluster"; exit 1; } +echo "Logged in as: $(oc whoami)" +echo "Cluster: $(oc whoami --show-server)" + +# Check if RHOAI is installed — RHOAI and ODH cannot coexist +if oc get csv -n redhat-ods-operator 2>/dev/null | grep -q rhods-operator; then + RHOAI_CSV=$(oc get csv -n redhat-ods-operator --no-headers 2>/dev/null | grep rhods-operator | awk '{print $1}') + echo "" + echo "ERROR: RHOAI is installed on this cluster ($RHOAI_CSV)" + echo "" + echo "RHOAI and ODH cannot coexist — they both manage the same" + echo "cluster-scoped DataScienceCluster CRD and overlapping operators." + echo "" + echo "To install ODH, first uninstall RHOAI:" + echo " /rhoai-uninstall" + echo "" + echo "Then re-run:" + echo " /odh-install" + exit 1 +fi + +# Check if ODH is already installed +if oc get csv -n openshift-operators 2>/dev/null | grep -q opendatahub-operator; then + echo "ERROR: ODH already installed. Use /odh-update to update." + exit 1 +fi +echo "No existing ODH or RHOAI installation detected — proceeding" +``` + +### Step 3: Create CatalogSource + +```bash +echo "Creating ODH CatalogSource..." 
+cat << EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: CatalogSource +metadata: + name: odh-catalog + namespace: openshift-marketplace +spec: + sourceType: grpc + image: ${CATALOG_IMAGE} + displayName: Open Data Hub + publisher: ODH Community + updateStrategy: + registryPoll: + interval: 15m + grpcPodConfig: + securityContextConfig: restricted +EOF + +# Wait for catalog pod to be running +TIMEOUT=120 +ELAPSED=0 +while [[ $ELAPSED -lt $TIMEOUT ]]; do + PHASE=$(oc get pod -n openshift-marketplace -l olm.catalogSource=odh-catalog \ + -o jsonpath='{.items[0].status.phase}' 2>/dev/null || echo "") + if [[ "$PHASE" == "Running" ]]; then + echo "CatalogSource ready" + break + fi + sleep 5 + ELAPSED=$((ELAPSED + 5)) + echo "Waiting for catalog pod... (${ELAPSED}s/${TIMEOUT}s)" +done +``` + +### Step 4: Create Subscription + +ODH installs into `openshift-operators` which already has a global OperatorGroup — no need to create one. + +```bash +echo "Creating ODH Subscription..." +cat << EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: Subscription +metadata: + name: opendatahub-operator + namespace: openshift-operators +spec: + channel: ${CHANNEL} + name: opendatahub-operator + source: odh-catalog + sourceNamespace: openshift-marketplace + installPlanApproval: Automatic +EOF +``` + +### Step 5: Wait for Operator CSV + +```bash +TIMEOUT=600 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CSV_LINE=$(oc get csv -n openshift-operators 2>/dev/null | grep opendatahub-operator || echo "") + if [[ -n "$CSV_LINE" ]]; then + CSV_NAME=$(echo "$CSV_LINE" | awk '{print $1}') + CSV_PHASE=$(echo "$CSV_LINE" | awk '{print $NF}') + echo "CSV: $CSV_NAME, Phase: $CSV_PHASE" + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + echo "ODH operator installed successfully" + break + fi + fi + sleep 10 + ELAPSED=$((ELAPSED + 10)) + echo "Waiting for CSV... 
(${ELAPSED}s/${TIMEOUT}s)" +done + +[[ "$CSV_PHASE" == "Succeeded" ]] || { echo "ERROR: CSV did not reach Succeeded"; exit 1; } +``` + +### Step 6: Create DSCInitialization + +```bash +echo "Creating DSCInitialization..." +cat << EOF | oc apply -f - +apiVersion: dscinitialization.opendatahub.io/v1 +kind: DSCInitialization +metadata: + name: default-dsci +spec: + applicationsNamespace: opendatahub + monitoring: + managementState: Managed + namespace: opendatahub + trustedCABundle: + managementState: Managed + devFlags: + logMode: production +EOF +sleep 10 +``` + +### Step 7: Create DataScienceCluster + +```bash +echo "Creating DataScienceCluster..." +cat << EOF | oc apply -f - +apiVersion: datasciencecluster.opendatahub.io/v1 +kind: DataScienceCluster +metadata: + name: default-dsc +spec: + components: + dashboard: + managementState: Managed + workbenches: + managementState: Managed + datasciencepipelines: + managementState: Managed + kserve: + managementState: Managed + serving: + managementState: Removed + modelmeshserving: + managementState: Managed + ray: + managementState: Managed + kueue: + managementState: Managed + trainingoperator: + managementState: Managed + trustyai: + managementState: Managed + modelregistry: + managementState: Managed + feastoperator: + managementState: Managed +EOF +``` + +### Step 8: Wait for DSC Ready + +```bash +TIMEOUT=600 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + READY=$(oc get datasciencecluster default-dsc \ + -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || echo "") + echo "DSC Ready: ${READY:-Unknown}" + if [[ "$READY" == "True" ]]; then + echo "DataScienceCluster is Ready" + break + fi + sleep 15 + ELAPSED=$((ELAPSED + 15)) + echo "Waiting for DSC... 
(${ELAPSED}s/${TIMEOUT}s)" +done +``` + +### Step 9: Verify Installation + +```bash +echo "" +echo "=== ODH Installation Summary ===" +echo "" +echo "CSV:" +oc get csv -n openshift-operators | grep opendatahub-operator + +echo "" +echo "DSC Status:" +oc get datasciencecluster default-dsc \ + -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.status}{"\n"}{end}' | grep -v "False" + +echo "" +echo "Dashboard:" +DASHBOARD=$(oc get route odh-dashboard -n opendatahub -o jsonpath='{.spec.host}' 2>/dev/null || echo "Not ready yet") +echo " https://$DASHBOARD" + +echo "" +echo "ODH installation complete!" +``` + +## Common Issues + +| Problem | Solution | +|---------|----------| +| CSV stuck in `Installing` | Check operator pod logs: `oc logs -n openshift-operators -l name=opendatahub-operator` | +| DSC not Ready | Check components: `oc get dsc default-dsc -o yaml \| grep -A5 conditions` | +| Feast label selector error | Delete old deployment: `oc delete deployment feast-operator-controller-manager -n opendatahub` | +| Catalog pod not starting | Check image pull: `oc describe pod -n openshift-marketplace -l olm.catalogSource=odh-catalog` | diff --git a/workflows/rhoai-manager/.claude/commands/odh-pr-tracker.md b/workflows/rhoai-manager/.claude/commands/odh-pr-tracker.md new file mode 100644 index 00000000..0fa33c7d --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/odh-pr-tracker.md @@ -0,0 +1,118 @@ +# /odh-pr-tracker - Check if ODH PRs are in the RHOAI Build + +Check whether one or more ODH (Open Data Hub) pull requests have been pulled into the latest RHOAI build. + +## Purpose + +When developers merge changes into an `opendatahub-io/` upstream, those changes don't automatically appear in RHOAI images. The RHOAI team periodically syncs upstream commits into their `red-hat-data-services/` fork and pins a specific commit in the build config. This command tells you whether a given ODH PR has made it through that pipeline. 
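The check starts from a PR URL like the one above; pulling the upstream org, repo, and PR number out of it is pure string handling. A minimal sketch (the URL and function name are illustrative):

```bash
#!/bin/sh
# Split https://github.com/<org>/<repo>/pull/<number> into its parts.
parse_pr_url() {
  rest="${1#https://github.com/}"   # org/repo/pull/number
  org="${rest%%/*}"
  rest="${rest#*/}"
  repo="${rest%%/*}"
  number="${rest##*/}"
  echo "$org $repo $number"
}

parse_pr_url "https://github.com/opendatahub-io/eval-hub/pull/123"
# → opendatahub-io eval-hub 123
```

These three values feed the `gh` calls in the steps that follow.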
+
+Works for any component tracked in the RHOAI build config — odh-dashboard, eval-hub, or anything else.
+
+## How It Works
+
+ODH changes flow like this:
+1. PR merged into `opendatahub-io/<repo>` (upstream)
+2. RHOAI team syncs upstream into `red-hat-data-services/<repo>` (fork)
+3. Build config (`red-hat-data-services/RHOAI-Build-Config`) is updated with the pinned commit
+4. Konflux builds the image from that pinned commit
+
+"Is my PR in RHOAI?" = is the PR's merge commit an ancestor of the commit currently pinned in the RHOAI build config?
+
+## Prerequisites
+
+- `gh` CLI authenticated with access to `red-hat-data-services` org
+
+## Steps
+
+For each PR URL provided by the user (e.g. `https://github.com/opendatahub-io/eval-hub/pull/123`):
+
+### 1. Get the PR merge commit
+
+Parse the PR URL to extract the upstream org/repo and PR number, then:
+
+```bash
+gh pr view <pr_number> --repo <upstream_org>/<repo> \
+  --json mergeCommit,mergedAt,state,title
+```
+
+If `state` is not `"MERGED"`, report it as unmerged and skip further checks.
+
+### 2. Find the RHOAI-pinned commit for this repo
+
+Fetch the full build config map:
+
+```bash
+curl -sf https://raw.githubusercontent.com/red-hat-data-services/RHOAI-Build-Config/rhoai-3.4/catalog/catalog_build_args.map
+```
+
+The fork URL is almost always `red-hat-data-services/<repo>` (same repo name, different org). Find the line:
+
+```
+<COMPONENT>_GIT_URL=https://github.com/red-hat-data-services/<repo>
+```
+
+There may be multiple components pointing to the same repo (e.g. odh-dashboard has several modular-arch entries). Pick the most relevant one — for the dashboard use `ODH_DASHBOARD_GIT_URL`, otherwise take the first match. Then swap `_GIT_URL` → `_GIT_COMMIT` in the variable name to get the pinned SHA.
+
+Example for eval-hub:
+```
+ODH_EVAL_HUB_GIT_URL=https://github.com/red-hat-data-services/eval-hub
+ODH_EVAL_HUB_GIT_COMMIT=1aad0fe1...
+```
+
+### 3. Compare the two commits
+
+```bash
+gh api "repos/red-hat-data-services/<repo>/compare/<pr_merge_commit>...<rhoai_commit>" \
+  --jq '{status: .status, behind_by: .behind_by}'
+```
+
+Interpret the result:
+- `status: "ahead"` and `behind_by: 0` → PR commit IS an ancestor of the RHOAI commit → **included** ✅
+- `status: "diverged"` or `behind_by > 0` → PR is NOT yet in the RHOAI build → **not included** ❌
+- `status: "behind"` → RHOAI is behind the PR commit → **not included** ❌
+- `status: "identical"` → same commit → **included** ✅
+
+The merge commit SHA is the same in both repos because the fork mirrors upstream commits directly (not rebased).
+
+### 4. Output a clear summary
+
+For each PR:
+
+```
+PR #<number>: <title> [<upstream_org>/<repo>]
+  Merged: <mergedAt>
+  RHOAI build at: <rhoai_commit_short> (rhoai-3.4 branch)
+  Status: ✅ Included in latest RHOAI build
+      — or —
+  ❌ NOT yet in RHOAI build
+```
+
+If multiple PRs were provided, check all of them and summarize together.
+
+## Notes
+
+- The `rhoai-3.4` branch is the active release branch as of early 2026. If it no longer exists, check `https://github.com/red-hat-data-services/RHOAI-Build-Config` for the current branch and use that instead.
+- If the repo name differs between upstream and the RH fork, the `_GIT_URL` lookup will still find it — just grep for the fork URL directly.
+- This checks what's in the **build config**, not what's on a specific cluster. To check a deployed cluster, also compare the cluster's running image against the build config.
+
+## Example Usage
+
+**User**: `/odh-pr-tracker https://github.com/opendatahub-io/odh-dashboard/pull/6959`
+
+**Claude**:
+1. Gets merge commit `f754568f` for PR #6959 in `opendatahub-io/odh-dashboard`
+2. Finds `ODH_DASHBOARD_GIT_URL=.../odh-dashboard` → grabs `ODH_DASHBOARD_GIT_COMMIT=297a39d8`
+3. Compares: status `ahead`, `behind_by: 0` → included
+4. Reports: ✅ PR #6959 is included in the latest RHOAI build
+
+**User**: `/odh-pr-tracker https://github.com/opendatahub-io/eval-hub/pull/42`
+
+**Claude**:
+1. Gets merge commit for PR #42 in `opendatahub-io/eval-hub`
+2. 
Finds `ODH_EVAL_HUB_GIT_URL=.../eval-hub` → grabs `ODH_EVAL_HUB_GIT_COMMIT=1aad0fe1` +3. Compares commits in `red-hat-data-services/eval-hub` +4. Reports result + +**User**: `/odh-pr-tracker https://github.com/opendatahub-io/odh-dashboard/pull/6959 https://github.com/opendatahub-io/eval-hub/pull/42` + +Claude checks both PRs and reports status for each. diff --git a/workflows/rhoai-manager/.claude/commands/odh-uninstall.md b/workflows/rhoai-manager/.claude/commands/odh-uninstall.md new file mode 100644 index 00000000..808a014e --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/odh-uninstall.md @@ -0,0 +1,182 @@ +# /odh-uninstall - Uninstall Open Data Hub from Cluster + +Completely uninstall Open Data Hub (ODH) from an OpenShift cluster, removing all related resources. + +## Command Usage + +```bash +/odh-uninstall # Standard uninstall (removes everything) +/odh-uninstall keep-crds # Uninstall but keep CRDs +/odh-uninstall keep-all # Keep CRDs and user resources (projects, models, etc.) +``` + +## Uninstall Options + +| Option | Removes Operator | Removes CRDs | Removes User Resources | +|--------|-----------------|--------------|----------------------| +| (default) | Yes | Yes | Yes | +| `keep-crds` | Yes | No | Yes | +| `keep-all` | Yes | No | No | + +## Prerequisites + +1. **Cluster access**: Logged into OpenShift cluster with cluster-admin privileges (use `/oc-login`) +2. 
**ODH installed**: ODH must be installed on the cluster + +## Process + +### Step 1: Parse Arguments and Verify + +```bash +KEEP_CRDS=false +KEEP_ALL=false + +for arg in "$@"; do + case "$arg" in + keep-crds) KEEP_CRDS=true ;; + keep-all) KEEP_CRDS=true; KEEP_ALL=true ;; + *) echo "Unknown option: $arg (valid: keep-crds, keep-all)" ;; + esac +done + +oc whoami &>/dev/null || { echo "ERROR: Not logged into OpenShift cluster"; exit 1; } +echo "Logged in as: $(oc whoami)" +echo "Cluster: $(oc whoami --show-server)" +echo "" +echo "Uninstall options: keep-crds=$KEEP_CRDS keep-all=$KEEP_ALL" + +# Verify ODH is installed +if ! oc get csv -n openshift-operators 2>/dev/null | grep -q opendatahub-operator; then + echo "ODH does not appear to be installed on this cluster" + exit 0 +fi + +ODH_CSV=$(oc get csv -n openshift-operators --no-headers 2>/dev/null | grep opendatahub-operator | awk '{print $1}') +echo "Found ODH: $ODH_CSV" +``` + +### Step 2: Delete DataScienceCluster and DSCInitialization + +```bash +echo "" +echo "=== Step 2: Removing DataScienceCluster and DSCInitialization ===" + +oc delete datasciencecluster --all --timeout=60s 2>/dev/null || true +oc delete dscinitializations.dscinitialization.opendatahub.io --all --timeout=60s 2>/dev/null || true +sleep 10 +``` + +### Step 3: Delete CSV and Subscription + +```bash +echo "" +echo "=== Step 3: Removing ODH operator subscription and CSV ===" + +oc delete subscription opendatahub-operator -n openshift-operators 2>/dev/null || true +oc delete csv "$ODH_CSV" -n openshift-operators 2>/dev/null || true + +# Remove catalog source +oc delete catalogsource odh-catalog -n openshift-marketplace 2>/dev/null || true +sleep 10 +``` + +### Step 4: Remove User Resources (unless keep-all) + +```bash +if [[ "$KEEP_ALL" != "true" ]]; then + echo "" + echo "=== Step 4: Removing user resources ===" + + # Delete data science projects + for ns in $(oc get namespace -l opendatahub.io/dashboard=true -o name 2>/dev/null); do + echo 
"Deleting namespace: $ns" + oc delete $ns --timeout=60s 2>/dev/null || true + done + + # Remove finalizers from any stuck resources + for crd in notebooks.kubeflow.org inferenceservices.serving.kserve.io \ + datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io; do + oc get $crd -A -o name 2>/dev/null | while read res; do + oc patch $res --type=json -p '[{"op":"remove","path":"/metadata/finalizers"}]' 2>/dev/null || true + done + done +else + echo "=== Step 4: Skipping user resources (keep-all) ===" +fi +``` + +### Step 5: Remove ODH Namespace + +```bash +echo "" +echo "=== Step 5: Removing ODH application namespace ===" + +if [[ "$KEEP_ALL" != "true" ]]; then + oc delete namespace opendatahub --timeout=120s 2>/dev/null || { + echo "Namespace stuck — removing finalizers..." + oc get namespace opendatahub -o json 2>/dev/null | \ + python3 -c "import sys,json; d=json.load(sys.stdin); d['spec']['finalizers']=[]; print(json.dumps(d))" | \ + oc replace --raw /api/v1/namespaces/opendatahub/finalize -f - 2>/dev/null || true + } +fi +``` + +### Step 6: Remove CRDs (unless keep-crds) + +```bash +if [[ "$KEEP_CRDS" != "true" ]]; then + echo "" + echo "=== Step 6: Removing ODH CRDs ===" + + # Get all CRDs owned by ODH + ODH_CRDS=$(oc get crd -o name 2>/dev/null | grep -E \ + "opendatahub|datasciencecluster|dscinitialization|featuretracker|datasciencepipeline" || true) + + for crd in $ODH_CRDS; do + echo "Deleting CRD: $crd" + oc delete $crd --timeout=30s 2>/dev/null || true + done +else + echo "=== Step 6: Skipping CRD removal (keep-crds) ===" +fi +``` + +### Step 7: Verify Cleanup + +```bash +echo "" +echo "=== Uninstall Complete ===" +echo "" + +# Check for remaining resources +REMAINING_CSV=$(oc get csv -n openshift-operators 2>/dev/null | grep opendatahub || echo "") +REMAINING_NS=$(oc get namespace opendatahub 2>/dev/null || echo "") + +if [[ -z "$REMAINING_CSV" && -z "$REMAINING_NS" ]]; then + echo "ODH successfully removed" +else + [[ -n 
"$REMAINING_CSV" ]] && echo "WARNING: CSV still present: $REMAINING_CSV" + [[ -n "$REMAINING_NS" ]] && echo "WARNING: Namespace 'opendatahub' still present" +fi + +echo "" +echo "To install ODH again: /odh-install" +echo "To install RHOAI: /rhoai-install" +``` + +## Switching from ODH to RHOAI + +If you want to install RHOAI after ODH, use the **default** uninstall (no flags): + +```bash +/odh-uninstall +/rhoai-install +``` + +Do **not** use `keep-crds` or `keep-all` when switching to RHOAI — RHOAI installs its own versions of the shared CRDs (`DataScienceCluster`, etc.) and leftover ODH CRDs will conflict. + +## Notes + +- ODH and RHOAI share cluster-scoped CRDs (`DataScienceCluster`, `DSCInitialization`) — they cannot coexist +- If the `opendatahub` namespace gets stuck on termination, the command attempts to remove its finalizers automatically +- User data (notebooks, pipelines, models) in data science project namespaces is deleted by default — use `keep-all` to preserve it (note: ODH user data is not compatible with RHOAI namespaces) diff --git a/workflows/rhoai-manager/.claude/commands/odh-update.md b/workflows/rhoai-manager/.claude/commands/odh-update.md new file mode 100644 index 00000000..e1f103cd --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/odh-update.md @@ -0,0 +1,244 @@ +# /odh-update - Update Open Data Hub to Latest Nightly + +Update an existing ODH installation to the latest nightly build or a specific version. 
+ +## Command Usage + +```bash +/odh-update # Pull latest odh-stable-nightly +/odh-update image=quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly +/odh-update image=quay.io/opendatahub/opendatahub-operator-catalog:latest +``` + +## Available Image Tags + +| Tag | Updated | Use Case | +|-----|---------|----------| +| `odh-stable-nightly` (default) | Daily at midnight UTC | Pull latest nightly | +| `latest` | On every push | Bleeding edge | +| `odh-stable` | Stable releases | Stable deployments | + +## How ODH Updates Work + +ODH nightlies typically bump the CSV version daily (unlike RHOAI stable which keeps the same version). This means: +- **Updating the CatalogSource + refreshing the catalog pod** is usually enough +- OLM detects the new CSV version and auto-creates an InstallPlan +- No forced reinstall needed in most cases (unlike RHOAI) + +If the CSV version doesn't change (component images only), this command handles the forced reinstall automatically. + +## Prerequisites + +1. **Existing ODH**: ODH must already be installed (use `/odh-install` for fresh installations) +2. **Cluster access**: Logged into OpenShift cluster with cluster-admin privileges (use `/oc-login`) + +## Process + +### Step 1: Parse Input Arguments + +```bash +CATALOG_IMAGE="quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly" + +for arg in "$@"; do + case "$arg" in + image=*) + CATALOG_IMAGE="${arg#*=}" + ;; + *) + echo "Unknown parameter: $arg (expected: image=)" + ;; + esac +done + +echo "Target catalog image: $CATALOG_IMAGE" +``` + +### Step 2: Verify Cluster Access and Existing Installation + +```bash +oc whoami &>/dev/null || { echo "ERROR: Not logged into OpenShift cluster"; exit 1; } +echo "Logged in as: $(oc whoami)" +echo "Cluster: $(oc whoami --show-server)" + +CSV_LINE=$(oc get csv -n openshift-operators 2>/dev/null | grep opendatahub-operator || echo "") +[[ -n "$CSV_LINE" ]] || { echo "ERROR: ODH not installed. 
Use /odh-install first."; exit 1; } + +CURRENT_CSV=$(echo "$CSV_LINE" | awk '{print $1}') +CURRENT_CHANNEL=$(oc get subscription opendatahub-operator -n openshift-operators \ + -o jsonpath='{.spec.channel}' 2>/dev/null || echo "fast") + +echo "Current CSV: $CURRENT_CSV" +echo "Current channel: $CURRENT_CHANNEL (will be preserved)" +``` + +### Step 3: Update CatalogSource + +```bash +echo "Updating ODH CatalogSource to: $CATALOG_IMAGE" +oc patch catalogsource odh-catalog -n openshift-marketplace --type=merge \ + -p "{\"spec\":{\"image\":\"${CATALOG_IMAGE}\"}}" 2>&1 || { + # CatalogSource may not exist yet, create it + cat << EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: CatalogSource +metadata: + name: odh-catalog + namespace: openshift-marketplace +spec: + sourceType: grpc + image: ${CATALOG_IMAGE} + displayName: Open Data Hub + publisher: ODH Community + updateStrategy: + registryPoll: + interval: 15m +EOF +} +``` + +### Step 4: Force Catalog Refresh + +```bash +echo "Forcing catalog pod to pull latest image..." +oc delete pod -n openshift-marketplace -l olm.catalogSource=odh-catalog 2>/dev/null || true + +TIMEOUT=120 +ELAPSED=0 +while [[ $ELAPSED -lt $TIMEOUT ]]; do + PHASE=$(oc get pod -n openshift-marketplace -l olm.catalogSource=odh-catalog \ + -o jsonpath='{.items[0].status.phase}' 2>/dev/null || echo "") + if [[ "$PHASE" == "Running" ]]; then + echo "Catalog refreshed with latest image" + break + fi + sleep 5 + ELAPSED=$((ELAPSED + 5)) + echo "Waiting for catalog pod... (${ELAPSED}s/${TIMEOUT}s)" +done +``` + +### Step 5: Wait for OLM to Detect New Version + +OLM polls the catalog every 15 minutes but also reacts within ~30s of the catalog pod coming up. + +```bash +echo "Waiting for OLM to detect new CSV version..." 
+sleep 30
+
+NEW_CSV_LINE=$(oc get csv -n openshift-operators 2>/dev/null | grep opendatahub-operator || echo "")
+NEW_CSV=$(echo "$NEW_CSV_LINE" | awk '{print $1}')
+
+if [[ "$NEW_CSV" != "$CURRENT_CSV" ]]; then
+  echo "New CSV detected: $NEW_CSV (was: $CURRENT_CSV)"
+  echo "OLM is auto-upgrading..."
+else
+  echo "CSV version unchanged: $CURRENT_CSV"
+  echo "Checking for newer component images in catalog..."
+
+  # Get catalog operator image ($NF grabs the ref whether the line reads
+  # "image: <ref>" or "- image: <ref>")
+  CATALOG_POD=$(oc get pod -n openshift-marketplace -l olm.catalogSource=odh-catalog -o name | head -1)
+  CATALOG_OP=$(oc exec -n openshift-marketplace "$CATALOG_POD" -- \
+    sh -c "grep -B1 'odh_rhel9_operator_image\|manager_image' /configs/opendatahub-operator/catalog.yaml 2>/dev/null | grep 'image:' | tail -1 | awk '{print \$NF}'" 2>/dev/null || echo "")
+  DEPLOYED_OP=$(oc get deployment opendatahub-operator-controller-manager -n openshift-operators \
+    -o jsonpath='{.spec.template.spec.containers[0].image}' 2>/dev/null || echo "")
+
+  if [[ -n "$CATALOG_OP" && "$DEPLOYED_OP" != "$CATALOG_OP" ]]; then
+    echo "Newer component images found — performing forced reinstall..."
+
+    SUB=$(oc get subscription opendatahub-operator -n openshift-operators \
+      -o jsonpath='{.metadata.name}' 2>/dev/null)
+    oc delete csv "$CURRENT_CSV" -n openshift-operators 2>&1 || true
+    sleep 5
+    oc delete subscription "$SUB" -n openshift-operators 2>&1 || true
+    sleep 5
+
+    cat << EOF | oc apply -f -
+apiVersion: operators.coreos.com/v1alpha1
+kind: Subscription
+metadata:
+  name: opendatahub-operator
+  namespace: openshift-operators
+spec:
+  channel: ${CURRENT_CHANNEL}
+  name: opendatahub-operator
+  source: odh-catalog
+  sourceNamespace: openshift-marketplace
+  installPlanApproval: Automatic
+EOF
+    echo "Subscription recreated — waiting for new CSV..." 
+ else + echo "All component images are up to date — no reinstall needed" + fi +fi +``` + +### Step 6: Wait for CSV to Succeed + +```bash +TIMEOUT=600 +ELAPSED=0 +CSV_PHASE="" + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CSV_LINE=$(oc get csv -n openshift-operators 2>/dev/null | grep opendatahub-operator | grep -v Replacing || echo "") + if [[ -n "$CSV_LINE" ]]; then + CSV_NAME=$(echo "$CSV_LINE" | awk '{print $1}') + CSV_PHASE=$(echo "$CSV_LINE" | awk '{print $NF}') + echo "CSV: $CSV_NAME, Phase: $CSV_PHASE" + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + echo "ODH operator updated successfully" + break + fi + fi + sleep 10 + ELAPSED=$((ELAPSED + 10)) + echo "Waiting for CSV... (${ELAPSED}s/${TIMEOUT}s)" +done + +[[ "$CSV_PHASE" == "Succeeded" ]] || echo "WARNING: CSV not yet Succeeded — check manually" +``` + +### Step 7: Verify DSC Still Ready + +```bash +sleep 15 +READY=$(oc get datasciencecluster default-dsc \ + -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || echo "") + +echo "" +echo "=== ODH Update Summary ===" +echo "" +echo "CSV:" +oc get csv -n openshift-operators | grep opendatahub-operator + +echo "" +echo "Catalog image: $CATALOG_IMAGE" +echo "DSC Ready: ${READY:-Unknown}" + +if [[ "$READY" != "True" ]]; then + echo "" + echo "DSC not yet Ready — not-ready components:" + oc get datasciencecluster default-dsc \ + -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.status}{" ("}{.reason}{")\n"}{end}' \ + 2>/dev/null | grep -v "True\|Removed" || true +fi + +echo "" +echo "ODH update complete!" +``` + +## Pulling the Latest Nightly Daily + +Since `odh-stable-nightly` is rebuilt every day at midnight UTC, just re-run: + +```bash +/odh-update +``` + +Or manually: +```bash +# Refresh catalog pod to pull latest nightly +oc delete pod -n openshift-marketplace -l olm.catalogSource=odh-catalog +``` + +OLM will detect the new CSV version and auto-upgrade within ~30 seconds. 
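+
+To confirm the deleted pod actually pulled a newer build (useful since `odh-stable-nightly` is a floating tag), compare the resolved digest before and after the refresh. A quick sketch, assuming a single catalog pod:
+
+```bash
+# Digest the running catalog pod actually resolved the tag to
+oc get pod -n openshift-marketplace -l olm.catalogSource=odh-catalog \
+  -o jsonpath='{.items[0].status.containerStatuses[0].imageID}'
+```
+
+If the digest is unchanged after the pod restart, the nightly has not been rebuilt yet.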
diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-disconnected.md b/workflows/rhoai-manager/.claude/commands/rhoai-disconnected.md new file mode 100644 index 00000000..04d707c7 --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/rhoai-disconnected.md @@ -0,0 +1,1041 @@ +# /rhoai-disconnected - Install or Update RHOAI on a Disconnected OpenShift Cluster + +Install or update Red Hat OpenShift AI (RHOAI) on a disconnected (air-gapped) OpenShift cluster. This command handles the unique requirements of disconnected environments: verifying images exist on the bastion registry, using digest-pinned FBC catalogs, applying known workarounds for disconnected-specific issues, and validating all pods can pull their images. + +## Command Usage + +```bash +# Install RHOAI on a fresh disconnected cluster +/rhoai-disconnected install fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:fe1157d5... + +# Update existing RHOAI to a new build +/rhoai-disconnected update fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:abc123... + +# Auto-detect install vs update +/rhoai-disconnected fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:fe1157d5... + +# With explicit bastion and channel +/rhoai-disconnected fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:fe1157d5... bastion=bastion.example.com:8443 channel=stable-3.4 +``` + +## Inputs + +| Input | Required | Description | Example | +|-------|----------|-------------|---------| +| `fbc` | **Yes** | FBC (File-Based Catalog) image reference. Must include `@sha256:` digest. This is the **source** reference (IDMS rewrites to bastion). | `quay.io/rhoai/rhoai-fbc-fragment@sha256:fe1157d5...` | +| `bastion` | No (auto-detected) | Bastion registry host:port. Auto-detected from IDMS if not specified. | `bastion.ods-dis-rhoai-test.aws.rh-ods.com:8443` | +| `channel` | No | OLM subscription channel. Default: `stable-3.4` for install, preserved for update. | `stable-3.4`, `beta` | +| `install` / `update` | No | Force install or update mode. 
Auto-detected if omitted. | |
+
+**Auto-detected:**
+
+| Value | Source |
+|-------|--------|
+| `BASTION` | Extracted from IDMS entries for `quay.io/rhoai` or `registry.redhat.io/rhoai` |
+| `MODE` | `install` if no RHOAI CSV exists, `update` if one does |
+| `RHOAI_VERSION` | Extracted from the CSV version after install/update |
+
+## Prerequisites
+
+1. Logged into the **disconnected** OpenShift cluster with cluster-admin privileges (`/oc-login`)
+2. `oc` CLI and `jq` available
+3. FBC image and ALL component images already mirrored to the bastion (use `/mirror-images` on the connected cluster first)
+4. IDMS (ImageDigestMirrorSet) entries configured for all source registries
+5. No ODH installation on the cluster (RHOAI and ODH cannot coexist)
+6. **Dependent operators installed** — RHOAI DSC requires these operators to fully reconcile:
+   - Red Hat OpenShift Service Mesh (provides `DestinationRule` CRD — required for KServe/gateway)
+   - Red Hat OpenShift Serverless (provides `KnativeServing` — required for KServe)
+   - Red Hat OpenShift Pipelines (provides Tekton — required for DSP)
+   - cert-manager for Red Hat OpenShift (provides `Certificate` CRD — required for TLS)
+
+## Process
+
+### Step 1: Parse Input Arguments
+
+```bash
+# Minimal error helper used throughout this command
+die() { echo "ERROR: $*" >&2; exit 1; }
+
+# Defaults
+FBC_IMAGE=""
+BASTION=""
+CHANNEL=""
+MODE="" # install or update, auto-detected if empty
+
+# Parse key=value arguments
+for arg in "$@"; do
+  case "$arg" in
+    fbc=*) FBC_IMAGE="${arg#*=}" ;;
+    bastion=*) BASTION="${arg#*=}" ;;
+    channel=*) CHANNEL="${arg#*=}" ;;
+    install) MODE="install" ;;
+    update) MODE="update" ;;
+  esac
+done
+
+# Validate FBC image is provided and uses digest
+if [[ -z "$FBC_IMAGE" ]]; then
+  die "FBC image is required. Usage: /rhoai-disconnected fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:..."
+fi
+
+if [[ "$FBC_IMAGE" != *"@sha256:"* ]]; then
+  echo "WARNING: FBC image should use @sha256: digest for reproducibility on disconnected clusters." 
+ echo " Provided: $FBC_IMAGE" + echo " Floating tags may resolve to different images if the bastion cache is stale." +fi +``` + +### Step 2: Verify Cluster Access and Detect Mode + +```bash +command -v oc &>/dev/null || die "oc command not found" +command -v jq &>/dev/null || die "jq command not found" +oc whoami &>/dev/null || die "Not logged into an OpenShift cluster" + +echo "Logged in as: $(oc whoami)" +echo "Cluster: $(oc whoami --show-server)" + +# Check ODH conflict +if oc get csv -n openshift-operators 2>/dev/null | grep -q opendatahub-operator; then + die "ODH is installed. Uninstall ODH first with /odh-uninstall before installing RHOAI." +fi + +# Auto-detect install vs update +if [[ -z "$MODE" ]]; then + if oc get csv -n redhat-ods-operator 2>/dev/null | grep -q rhods-operator; then + MODE="update" + echo "Detected existing RHOAI installation -> UPDATE mode" + else + MODE="install" + echo "No existing RHOAI installation -> INSTALL mode" + fi +fi + +# Set default channel +if [[ -z "$CHANNEL" ]]; then + if [[ "$MODE" == "update" ]]; then + CHANNEL=$(oc get subscription -n redhat-ods-operator -o jsonpath='{.items[0].spec.channel}' 2>/dev/null || echo "stable-3.4") + echo "Preserving existing channel: $CHANNEL" + else + CHANNEL="stable-3.4" + echo "Using default channel: $CHANNEL" + fi +fi +``` + +### Step 2b: Verify Dependent Operators + +RHOAI DSC cannot fully reconcile without dependent operators. Missing operators cause specific component failures (e.g., KServe fails without Service Mesh, DSP fails without Pipelines). Check and warn early. 
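+
+If one of these is missing, it can usually be installed from the mirrored Red Hat operators catalog before re-running this command. A minimal sketch for Serverless (the channel, the `redhat-operators` CatalogSource name, and a pre-existing `openshift-serverless` namespace/OperatorGroup are assumptions - match them to your mirrored catalog):
+
+```bash
+cat <<EOF | oc apply -f -
+apiVersion: operators.coreos.com/v1alpha1
+kind: Subscription
+metadata:
+  name: serverless-operator
+  namespace: openshift-serverless
+spec:
+  channel: stable
+  name: serverless-operator
+  source: redhat-operators
+  sourceNamespace: openshift-marketplace
+EOF
+```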
+ +```bash +echo "" +echo "=== Checking Dependent Operators ===" + +MISSING_DEPS=() + +# Service Mesh — required for KServe gateway (DestinationRule CRD) +if oc get crd destinationrules.networking.istio.io &>/dev/null; then + echo " Service Mesh: OK" +else + MISSING_DEPS+=("Red Hat OpenShift Service Mesh (DestinationRule CRD missing — KServe gateway will fail)") +fi + +# Serverless — required for KServe (KnativeServing) +if oc get crd knativeservings.operator.knative.dev &>/dev/null; then + echo " Serverless: OK" +else + MISSING_DEPS+=("Red Hat OpenShift Serverless (KnativeServing CRD missing — KServe will fail)") +fi + +# Pipelines — required for Data Science Pipelines (Tekton) +if oc get crd pipelines.tekton.dev &>/dev/null; then + echo " Pipelines: OK" +else + MISSING_DEPS+=("Red Hat OpenShift Pipelines (Tekton CRD missing — DSP will fail)") +fi + +# Cert Manager — required for TLS certificate management +if oc get crd certificates.cert-manager.io &>/dev/null; then + echo " Cert Manager: OK" +else + MISSING_DEPS+=("cert-manager for Red Hat OpenShift (Certificate CRD missing — TLS cert management will fail)") +fi + +if [[ ${#MISSING_DEPS[@]} -gt 0 ]]; then + echo "" + echo "WARNING: ${#MISSING_DEPS[@]} dependent operator(s) are missing:" + for dep in "${MISSING_DEPS[@]}"; do + echo " - $dep" + done + echo "" + echo "RHOAI will install but DSC may not fully reconcile." + echo "Install these operators from the disconnected catalog before proceeding, or continue with partial functionality." + echo "" + echo "Continuing in 10 seconds... 
(Ctrl+C to cancel)"
+  sleep 10
+fi
+```
+
+### Step 3: Auto-Detect Bastion from IDMS
+
+```bash
+if [[ -z "$BASTION" ]]; then
+  # Extract bastion from IDMS entries for the rhoai sources
+  # (either registry.redhat.io/rhoai or quay.io/rhoai, per the table above)
+  BASTION=$(oc get imagedigestmirrorset -o jsonpath='{range .items[*]}{range .spec.imageDigestMirrors[*]}{.source}{"|"}{.mirrors[0]}{"\n"}{end}{end}' 2>/dev/null \
+    | grep -E '(registry\.redhat\.io|quay\.io)/rhoai' \
+    | head -1 \
+    | awk -F'|' '{print $2}' \
+    | sed 's|/rhoai$||')
+
+  if [[ -z "$BASTION" ]]; then
+    die "Could not auto-detect bastion from IDMS. Provide it explicitly: bastion=host:port"
+  fi
+
+  echo "Auto-detected bastion: $BASTION"
+fi
+```
+
+### Step 4: Pre-Flight Image Verification
+
+This is the critical step that prevents the ImagePullBackOff failures seen on disconnected clusters. Verify that the FBC image and key component images exist on the bastion BEFORE proceeding.
+
+```bash
+echo ""
+echo "=== Pre-Flight Image Verification ==="
+echo "Checking that required images exist on bastion: $BASTION"
+
+PULL_SECRET_JSON=$(oc get secret/pull-secret -n openshift-config -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d)
+TMPFILE=$(mktemp)
+chmod 600 "$TMPFILE"
+trap 'rm -f "$TMPFILE"' EXIT
+echo "$PULL_SECRET_JSON" > "$TMPFILE"
+
+MISSING_IMAGES=()
+VERIFIED_COUNT=0
+
+# 4a. 
Verify FBC image on bastion
+# Compute bastion FBC path from the source FBC reference
+FBC_REPO=$(echo "$FBC_IMAGE" | sed 's|@sha256:.*||' | awk -F'/' '{print $NF}')
+FBC_DIGEST=$(echo "$FBC_IMAGE" | grep -oE 'sha256:[a-f0-9]+')
+
+# The FBC may be mirrored under different paths depending on IDMS config
+# Try the IDMS-mapped path first, then common paths
+FBC_BASTION_CANDIDATES=(
+  "${BASTION}/rhoai/${FBC_REPO}@${FBC_DIGEST}"
+  "${BASTION}/catalogs/${FBC_REPO}@${FBC_DIGEST}"
+  "${BASTION}/modh/${FBC_REPO}@${FBC_DIGEST}"
+)
+
+FBC_FOUND=false
+for candidate in "${FBC_BASTION_CANDIDATES[@]}"; do
+  if oc image info "$candidate" --insecure=true -a "$TMPFILE" &>/dev/null; then
+    echo "FBC image verified: $candidate"
+    FBC_FOUND=true
+    FBC_BASTION_REF="$candidate"
+    break
+  fi
+done
+
+if [[ "$FBC_FOUND" != "true" ]]; then
+  MISSING_IMAGES+=("FBC: $FBC_IMAGE")
+  echo "MISSING: FBC image not found on bastion"
+fi
+
+# 4b. Extract relatedImages from the FBC catalog and verify key images
+# Render the catalog from the FBC image to get the CSV's relatedImages
+echo ""
+echo "Extracting relatedImages from FBC catalog..."
+
+# Create a temporary pod to read the FBC catalog content. Discard the
+# "pod/fbc-verify created" output from `oc run` (it would pollute
+# CATALOG_CONTENT and mask failures), and wait for the pod to complete
+# before reading its logs.
+CATALOG_CONTENT=$(oc run fbc-verify --image="$FBC_IMAGE" --restart=Never \
+  --command -- cat /configs/rhods-operator/catalog.yaml >/dev/null 2>&1 && \
+  oc wait pod/fbc-verify --for=jsonpath='{.status.phase}'=Succeeded --timeout=60s >/dev/null 2>&1 && \
+  oc logs fbc-verify 2>/dev/null; oc delete pod fbc-verify --force >/dev/null 2>&1 || true)
+
+# If pod-based extraction fails (common on disconnected), use the CatalogSource approach:
+# Create a temporary CatalogSource, wait for it, then query via the catalog pod
+if [[ -z "$CATALOG_CONTENT" ]]; then
+  echo "Direct extraction failed, using CatalogSource approach..." 
+ + # Create temp CatalogSource + cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: CatalogSource +metadata: + name: rhoai-catalog-verify + namespace: openshift-marketplace +spec: + displayName: "RHOAI Verify (temp)" + image: $FBC_IMAGE + sourceType: grpc +EOF + + # Wait for catalog pod to be ready + TIMEOUT=120 + ELAPSED=0 + while [[ $ELAPSED -lt $TIMEOUT ]]; do + CATALOG_STATE=$(oc get catalogsource rhoai-catalog-verify -n openshift-marketplace \ + -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null || echo "") + if [[ "$CATALOG_STATE" == "READY" ]]; then + break + fi + sleep 5 + ELAPSED=$((ELAPSED + 5)) + done + + if [[ "$CATALOG_STATE" == "READY" ]]; then + CATALOG_POD=$(oc get pod -n openshift-marketplace -l olm.catalogSource=rhoai-catalog-verify -o name 2>/dev/null | head -1) + if [[ -n "$CATALOG_POD" ]]; then + CATALOG_CONTENT=$(oc exec -n openshift-marketplace "$CATALOG_POD" -- cat /configs/rhods-operator/catalog.yaml 2>/dev/null || echo "") + fi + fi +fi + +# 4c. Parse relatedImages and verify each on bastion +# Pre-fetch full IDMS source-to-mirror mappings for path resolution +IDMS_SOURCES_FULL=$(oc get imagedigestmirrorset -o jsonpath='{range .items[*]}{range .spec.imageDigestMirrors[*]}{.source}{"|"}{.mirrors[0]}{"\n"}{end}{end}' 2>/dev/null || echo "") + +if [[ -n "$CATALOG_CONTENT" ]]; then + # Extract all image references from the catalog + RELATED_IMAGES=$(echo "$CATALOG_CONTENT" | grep -oE 'registry\.[^"]+@sha256:[a-f0-9]+|quay\.io[^"]+@sha256:[a-f0-9]+' | sort -u) + + TOTAL_IMAGES=$(echo "$RELATED_IMAGES" | wc -l | tr -d ' ') + echo "Found $TOTAL_IMAGES relatedImages in FBC catalog" + echo "Verifying each image exists on bastion..." 
+ + while IFS= read -r img; do + [[ -z "$img" ]] && continue + + # Compute bastion path using IDMS entries to find the correct mirror path + # Extract source prefix from the image (e.g., registry.redhat.io/rhoai from registry.redhat.io/rhoai/odh-dashboard-rhel9@sha256:abc) + IMG_SOURCE_PREFIX=$(echo "$img" | sed -E 's|/[^/]+@sha256:.*||') + IMG_NAME_DIGEST=$(echo "$img" | sed -E "s|^${IMG_SOURCE_PREFIX}/||") + + # Look up the mirror path from IDMS for this source prefix + IDMS_MIRROR=$(echo "$IDMS_SOURCES_FULL" | grep "^${IMG_SOURCE_PREFIX}|" | head -1 | awk -F'|' '{print $2}') + + if [[ -n "$IDMS_MIRROR" ]]; then + BASTION_IMG="${IDMS_MIRROR}/${IMG_NAME_DIGEST}" + else + # Fallback: strip registry hostname, prepend bastion + IMG_PATH=$(echo "$img" | sed -E 's|^[^/]+/||') + BASTION_IMG="${BASTION}/${IMG_PATH}" + fi + + if oc image info "$BASTION_IMG" --insecure=true -a "$TMPFILE" &>/dev/null; then + VERIFIED_COUNT=$((VERIFIED_COUNT + 1)) + else + MISSING_IMAGES+=("$img") + fi + done <<< "$RELATED_IMAGES" +else + echo "WARNING: Could not extract relatedImages from FBC. Skipping image verification." + echo "Proceed with caution - pods may fail with ImagePullBackOff if images are missing." +fi + +# Clean up temp CatalogSource +oc delete catalogsource rhoai-catalog-verify -n openshift-marketplace 2>/dev/null || true + +# 4d. Report results +echo "" +echo "=== Pre-Flight Results ===" +echo "Verified: $VERIFIED_COUNT images" +echo "Missing: ${#MISSING_IMAGES[@]} images" + +if [[ ${#MISSING_IMAGES[@]} -gt 0 ]]; then + echo "" + echo "MISSING IMAGES:" + for img in "${MISSING_IMAGES[@]}"; do + echo " $img" + done + echo "" + echo "ERROR: ${#MISSING_IMAGES[@]} images are missing from the bastion registry." + echo "Run /mirror-images on the connected cluster to mirror these images first." 
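+  # Optionally persist the list (path is arbitrary) so it can be handed to the
+  # mirroring workflow instead of re-running this pre-flight to collect it
+  printf '%s\n' "${MISSING_IMAGES[@]}" | sed 's/^FBC: //' > /tmp/rhoai-missing-images.txt
+  echo "Missing-image list saved to /tmp/rhoai-missing-images.txt"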
+ die "Pre-flight image verification failed" +fi + +echo "All images verified on bastion" +``` + +### Step 5: Verify IDMS Entries + +```bash +echo "" +echo "=== Verifying IDMS Entries ===" + +# Check that IDMS entries exist for all source registries used by RHOAI +REQUIRED_SOURCES=( + "registry.redhat.io/rhoai" + "registry.redhat.io/rhel9" + "registry.redhat.io/ubi9" + "registry.redhat.io/openshift-service-mesh" + "registry.redhat.io/rhbk" + "registry.redhat.io/cert-manager" + "registry.redhat.io/rhcl-1" + "registry.redhat.io/rhaii-early-access" + "quay.io/rhoai" + "quay.io/minio" + "quay.io/opendatahub" + "docker.io/milvusdb" +) + +IDMS_SOURCES=$(oc get imagedigestmirrorset -o jsonpath='{range .items[*]}{range .spec.imageDigestMirrors[*]}{.source}{"\n"}{end}{end}' 2>/dev/null | sort -u) + +MISSING_IDMS=() +for source in "${REQUIRED_SOURCES[@]}"; do + if echo "$IDMS_SOURCES" | grep -q "$source"; then + echo " IDMS OK: $source" + else + MISSING_IDMS+=("$source") + echo " IDMS MISSING: $source" + fi +done + +if [[ ${#MISSING_IDMS[@]} -gt 0 ]]; then + echo "" + echo "WARNING: ${#MISSING_IDMS[@]} IDMS entries are missing." + echo "Pods pulling from these registries will fail with ImagePullBackOff." + echo "The IDMS YAML can be generated by /mirror-images." + echo "" + echo "Continuing anyway - but watch for ImagePullBackOff errors." 
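+  # For reference, each missing source needs an ImageDigestMirrorSet entry of
+  # this shape (mirror host/path below are illustrative - use your bastion's):
+  #
+  #   apiVersion: config.openshift.io/v1
+  #   kind: ImageDigestMirrorSet
+  #   metadata:
+  #     name: rhoai-mirrors
+  #   spec:
+  #     imageDigestMirrors:
+  #     - source: registry.redhat.io/rhoai
+  #       mirrors:
+  #       - bastion.example.com:8443/rhoai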
+fi +``` + +### Step 6: Create or Update CatalogSource + +```bash +echo "" +echo "=== Setting Up OLM Catalog ===" + +# Use the FBC image reference directly - IDMS handles rewriting to bastion +cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: CatalogSource +metadata: + name: rhoai-catalog-dev + namespace: openshift-marketplace +spec: + displayName: "Red Hat OpenShift AI" + image: $FBC_IMAGE + publisher: Red Hat + sourceType: grpc + updateStrategy: + registryPoll: + interval: 30m +EOF + +echo "CatalogSource created/updated with image: $FBC_IMAGE" + +# Force catalog pod refresh to ensure it picks up the new image +CATALOG_POD=$(oc get pod -n openshift-marketplace -l olm.catalogSource=rhoai-catalog-dev -o name 2>/dev/null | head -1) +if [[ -n "$CATALOG_POD" ]]; then + echo "Deleting old catalog pod to force image refresh..." + oc delete "$CATALOG_POD" -n openshift-marketplace 2>/dev/null || true +fi + +# Wait for catalog to be READY +TIMEOUT=180 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CATALOG_STATE=$(oc get catalogsource rhoai-catalog-dev -n openshift-marketplace \ + -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null || echo "") + + if [[ "$CATALOG_STATE" == "READY" ]]; then + echo "CatalogSource is READY" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo " CatalogSource state: ${CATALOG_STATE:-Unknown} (${ELAPSED}s/${TIMEOUT}s)" +done + +[[ "$CATALOG_STATE" == "READY" ]] || die "CatalogSource not READY after ${TIMEOUT}s. Check that the FBC image is accessible on the bastion." +``` + +### Step 7: Install - Create Namespace, OperatorGroup, Subscription (Install mode only) + +```bash +if [[ "$MODE" == "install" ]]; then + OPERATOR_NAMESPACE="redhat-ods-operator" + + # Create namespace + if ! 
oc get namespace "$OPERATOR_NAMESPACE" &>/dev/null; then + oc create namespace "$OPERATOR_NAMESPACE" + echo "Created namespace: $OPERATOR_NAMESPACE" + fi + + # Create OperatorGroup + cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1 +kind: OperatorGroup +metadata: + name: rhods-operator + namespace: $OPERATOR_NAMESPACE +spec: + targetNamespaces: + - $OPERATOR_NAMESPACE +EOF + + echo "OperatorGroup created" + + # Create Subscription + cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: Subscription +metadata: + name: rhoai-operator-dev + namespace: $OPERATOR_NAMESPACE +spec: + channel: $CHANNEL + installPlanApproval: Automatic + name: rhods-operator + source: rhoai-catalog-dev + sourceNamespace: openshift-marketplace +EOF + + echo "Subscription created (channel: $CHANNEL)" +fi +``` + +### Step 8: Update - Forced Reinstall to Pick Up New Images (Update mode only) + +On disconnected clusters, OLM may not auto-update if only component images changed (CSV version unchanged). Force a reinstall. 
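+
+To confirm OLM really has nothing to act on before forcing the reinstall, compare the Subscription's installed and resolved CSVs (a sketch; the subscription name matches the one this command manages):
+
+```bash
+oc get subscription rhoai-operator-dev -n redhat-ods-operator \
+  -o jsonpath='{.status.installedCSV}{" -> "}{.status.currentCSV}{"\n"}'
+# Equal values after the catalog refresh mean OLM sees no upgrade to perform
+```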
+ +```bash +if [[ "$MODE" == "update" ]]; then + echo "" + echo "=== Forcing Operator Reinstall ===" + + # Record current state + OLD_CSV=$(oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator | grep -v Replacing | awk '{print $1}') + SUB_NAME=$(oc get subscription -n redhat-ods-operator -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) + + echo "Current CSV: $OLD_CSV" + echo "Current subscription: $SUB_NAME" + + # Delete CSV to force OLM to reinstall from updated catalog + if [[ -n "$OLD_CSV" ]]; then + echo "Deleting CSV: $OLD_CSV" + oc delete csv "$OLD_CSV" -n redhat-ods-operator || true + sleep 10 + fi + + # Delete and recreate subscription + if [[ -n "$SUB_NAME" ]]; then + echo "Deleting subscription: $SUB_NAME" + oc delete subscription "$SUB_NAME" -n redhat-ods-operator || true + sleep 5 + fi + + # Recreate subscription pointing to updated catalog + cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: Subscription +metadata: + name: rhoai-operator-dev + namespace: redhat-ods-operator +spec: + channel: $CHANNEL + installPlanApproval: Automatic + name: rhods-operator + source: rhoai-catalog-dev + sourceNamespace: openshift-marketplace +EOF + + echo "Subscription recreated (channel: $CHANNEL)" +fi +``` + +### Step 9: Wait for Operator CSV + +```bash +echo "" +echo "=== Waiting for Operator CSV ===" + +CSV_PHASE="" +TIMEOUT=600 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CSV_LINE=$(oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator | grep -v Replacing || echo "") + + if [[ -n "$CSV_LINE" ]]; then + CSV_NAME=$(echo "$CSV_LINE" | awk '{print $1}') + CSV_PHASE=$(echo "$CSV_LINE" | awk '{print $NF}') + echo "CSV: $CSV_NAME, Phase: $CSV_PHASE" + + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + echo "Operator CSV installed successfully" + break + fi + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) +done + +[[ "$CSV_PHASE" == "Succeeded" ]] || die "Operator did not reach Succeeded 
phase within ${TIMEOUT}s" + +# Extract version +RHOAI_VERSION=$(oc get csv "$CSV_NAME" -n redhat-ods-operator -o jsonpath='{.spec.version}' 2>/dev/null | grep -oE '^[0-9]+\.[0-9]+') +echo "RHOAI Version: $RHOAI_VERSION" +``` + +### Step 10: Create/Configure DataScienceCluster + +```bash +echo "" +echo "=== Configuring DataScienceCluster ===" + +# Wait for DSCInitialization +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get dscinitializations default-dsci &>/dev/null; then + echo "DSCInitialization found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) +done + +# For install mode, create DSC from CSV initialization-resource +if [[ "$MODE" == "install" ]]; then + CSV_NAME=$(oc get csv -n redhat-ods-operator 2>/dev/null | awk '/rhods-operator/{print $1; exit}') + if [[ -n "$CSV_NAME" ]]; then + oc get csv "$CSV_NAME" -n redhat-ods-operator \ + -o jsonpath='{.metadata.annotations.operatorframework\.io/initialization-resource}' \ + > /tmp/default-dsc.json + oc apply -f /tmp/default-dsc.json + echo "DSC created from CSV initialization-resource" + rm -f /tmp/default-dsc.json + fi +fi + +# Wait for DSC to exist +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get datasciencecluster default-dsc &>/dev/null; then + echo "DataScienceCluster found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) +done + +# Patch DSC to enable required components with disconnected-specific settings +cat > /tmp/dsc-patch.yaml << 'YAML' +spec: + components: + aipipelines: + managementState: Managed + argoWorkflowsControllers: + managementState: Managed + kserve: + serving: + managementState: Managed + rawDeploymentServiceConfig: Headless + nim: + managementState: Managed + airGapped: true + llamastackoperator: + managementState: Managed + mlflowoperator: + managementState: Managed + trustyai: + managementState: Managed + trainer: + managementState: Removed +YAML + +oc patch 
datasciencecluster default-dsc --type merge --patch-file /tmp/dsc-patch.yaml || \ + die "Failed to patch DataScienceCluster" + +echo "DSC component configuration applied:" +echo " - aipipelines: Managed (with argoWorkflowsControllers)" +echo " - kserve: Managed (rawDeploymentServiceConfig: Headless for disconnected)" +echo " - nim: Managed (airGapped: true for disconnected)" +echo " - llamastackoperator: Managed" +echo " - mlflowoperator: Managed" +echo " - trustyai: Managed" +echo " - trainer: Removed (requires JobSet operator)" + +rm -f /tmp/dsc-patch.yaml +``` + +### Step 11: Wait for DSC Ready + +```bash +echo "" +echo "=== Waiting for DataScienceCluster ===" + +TIMEOUT=600 +INTERVAL=15 +ELAPSED=0 +DSC_PHASE="" + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + DSC_PHASE=$(oc get datasciencecluster -o jsonpath='{.items[0].status.phase}' 2>/dev/null || echo "Unknown") + echo "DSC phase: $DSC_PHASE" + + if [[ "$DSC_PHASE" == "Ready" ]]; then + echo "DataScienceCluster is Ready" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) +done + +if [[ "$DSC_PHASE" != "Ready" ]]; then + echo "WARNING: DSC is not Ready after ${TIMEOUT}s (current: ${DSC_PHASE:-Unknown})" + echo "Not-ready components:" + oc get dsc default-dsc -o json 2>/dev/null | \ + jq -r '.status.conditions[] | select(.status=="False") | select(.message | test("Removed") | not) | " \(.type): \(.message)"' 2>/dev/null || true +fi +``` + +### Step 12: Post-Install/Update Health Check - Verify No ImagePullBackOff + +This is critical for disconnected clusters. After the operator reconciles, check ALL pods in RHOAI namespaces for ImagePullBackOff or ErrImagePull errors. + +```bash +echo "" +echo "=== Post-Install Health Check ===" +echo "Waiting 60 seconds for operator to reconcile pods..." 
+sleep 60 + +PROBLEM_PODS=() + +# Check pods in all RHOAI-related namespaces +for ns in redhat-ods-operator redhat-ods-applications; do + PODS=$(oc get pods -n "$ns" --no-headers 2>/dev/null || echo "") + + while IFS= read -r line; do + [[ -z "$line" ]] && continue + POD_NAME=$(echo "$line" | awk '{print $1}') + STATUS=$(echo "$line" | awk '{print $3}') + + if [[ "$STATUS" == "ImagePullBackOff" || "$STATUS" == "ErrImagePull" ]]; then + # Get the failing image + FAILING_IMAGE=$(oc get pod "$POD_NAME" -n "$ns" -o jsonpath='{range .status.containerStatuses[*]}{.state.waiting.message}{"\n"}{end}' 2>/dev/null | grep -oE 'image "[^"]+"' | head -1) + PROBLEM_PODS+=("$ns/$POD_NAME: $STATUS ($FAILING_IMAGE)") + elif [[ "$STATUS" == "CrashLoopBackOff" ]]; then + PROBLEM_PODS+=("$ns/$POD_NAME: $STATUS") + fi + done <<< "$PODS" +done + +if [[ ${#PROBLEM_PODS[@]} -gt 0 ]]; then + echo "" + echo "WARNING: ${#PROBLEM_PODS[@]} pods have issues:" + for pod in "${PROBLEM_PODS[@]}"; do + echo " $pod" + done + echo "" + echo "For ImagePullBackOff: The image is missing from the bastion. Run /mirror-images to mirror it." + echo "For CrashLoopBackOff: Check pod logs for root cause (may be the podToPodTLS bug - see Step 13)." +else + echo "All pods in RHOAI namespaces are running normally" +fi +``` + +### Step 13: Apply Known Disconnected Workarounds + +#### 13a. podToPodTLS Bug Workaround + +In some RHOAI nightly builds, the DSP operator sets `--caCertPath` flag in pipeline component deployments, but the binary only supports `--mlPipelineServiceTLSCert`. This causes CrashLoopBackOff for `scheduledworkflow` and other pipeline pods with error: `flag provided but not defined: -caCertPath`. + +The workaround is to set `podToPodTLS: false` on all DataSciencePipelinesApplication (DSPA) CRs. This must be applied AFTER the operator creates the DSPA resources. + +```bash +echo "" +echo "=== Applying Known Disconnected Workarounds ===" + +# 13a. 
Check for and fix podToPodTLS bug +# Only apply if pipeline components are enabled and DSPAs exist +DSPA_LIST=$(oc get datasciencepipelinesapplication --all-namespaces --no-headers 2>/dev/null || echo "") + +if [[ -n "$DSPA_LIST" ]]; then + echo "Found DataSciencePipelinesApplication resources. Checking for podToPodTLS bug..." + + while IFS= read -r line; do + [[ -z "$line" ]] && continue + DSPA_NS=$(echo "$line" | awk '{print $1}') + DSPA_NAME=$(echo "$line" | awk '{print $2}') + + # Check if any pipeline pods are in CrashLoopBackOff with caCertPath error + CRASH_PODS=$(oc get pods -n "$DSPA_NS" --no-headers 2>/dev/null | grep CrashLoopBackOff || echo "") + + if [[ -n "$CRASH_PODS" ]]; then + # Check logs for the specific caCertPath error + for crash_pod in $(echo "$CRASH_PODS" | awk '{print $1}'); do + if oc logs "$crash_pod" -n "$DSPA_NS" --tail=5 2>/dev/null | grep -q "caCertPath"; then + echo " Found podToPodTLS bug in $DSPA_NS/$DSPA_NAME" + echo " Applying workaround: podToPodTLS=false" + oc patch datasciencepipelinesapplication "$DSPA_NAME" -n "$DSPA_NS" \ + --type='merge' -p '{"spec":{"podToPodTLS":false}}' + fi + done + fi + + # Also proactively set podToPodTLS=false to prevent the issue + CURRENT_TLS=$(oc get datasciencepipelinesapplication "$DSPA_NAME" -n "$DSPA_NS" \ + -o jsonpath='{.spec.podToPodTLS}' 2>/dev/null || echo "") + if [[ "$CURRENT_TLS" != "false" ]]; then + echo " Setting podToPodTLS=false on $DSPA_NS/$DSPA_NAME (proactive)" + oc patch datasciencepipelinesapplication "$DSPA_NAME" -n "$DSPA_NS" \ + --type='merge' -p '{"spec":{"podToPodTLS":false}}' + fi + done <<< "$DSPA_LIST" +else + echo "No DSPAs found (pipelines not yet configured). podToPodTLS workaround will need to be applied after creating DSPAs." + echo " Command: oc patch datasciencepipelinesapplication <name> -n <namespace> --type='merge' -p '{\"spec\":{\"podToPodTLS\":false}}'" +fi +``` + +#### 13b. 
PersistenceAgent TLS Certificate Fix (Proactive + Reactive) + +The pipeline persistenceagent may fail with `x509: certificate signed by unknown authority` when connecting to the pipeline API server. This happens because the trusted CA bundle doesn't include the OpenShift service-ca that signed the pipeline API server cert. + +**Proactive fix:** Apply the service-ca to ALL DSPA trusted CA configmaps immediately, before waiting for a crash. This prevents the issue entirely. + +```bash +# 13b. Proactively fix persistenceagent TLS for all DSPAs +SERVICE_CA=$(oc get configmap openshift-service-ca.crt -n openshift-config-managed \ + -o jsonpath='{.data.service-ca\.crt}' 2>/dev/null || echo "") + +if [[ -n "$SERVICE_CA" && -n "$DSPA_LIST" ]]; then + while IFS= read -r line; do + [[ -z "$line" ]] && continue + DSPA_NS=$(echo "$line" | awk '{print $1}') + DSPA_NAME=$(echo "$line" | awk '{print $2}') + + CM_NAME="dsp-trusted-ca-${DSPA_NAME}" + + # Wait for the configmap to be created by the operator (up to 60s) + TIMEOUT=60 + ELAPSED=0 + while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get configmap "$CM_NAME" -n "$DSPA_NS" &>/dev/null; then + break + fi + sleep 5 + ELAPSED=$((ELAPSED + 5)) + done + + if oc get configmap "$CM_NAME" -n "$DSPA_NS" &>/dev/null; then + CURRENT_CA=$(oc get configmap "$CM_NAME" -n "$DSPA_NS" -o jsonpath='{.data.dsp-ca\.crt}' 2>/dev/null || echo "") + + if [[ -n "$CURRENT_CA" ]] && ! 
echo "$CURRENT_CA" | grep -q "openshift-service-serving-signer"; then + echo " Proactively appending service-ca to $CM_NAME in $DSPA_NS" + COMBINED_CA="${CURRENT_CA} +${SERVICE_CA}" + TMPCA=$(mktemp) + chmod 600 "$TMPCA" + echo "$COMBINED_CA" > "$TMPCA" + oc create configmap "$CM_NAME" -n "$DSPA_NS" \ + --from-file=dsp-ca.crt="$TMPCA" \ + --dry-run=client -o yaml | oc replace -f - + rm -f "$TMPCA" + echo " Service-ca appended to $CM_NAME" + + # Restart persistenceagent if it exists (may or may not be running yet) + PA_POD=$(oc get pods -n "$DSPA_NS" --no-headers 2>/dev/null | grep persistenceagent | awk '{print $1}') + if [[ -n "$PA_POD" ]]; then + oc delete pod "$PA_POD" -n "$DSPA_NS" 2>/dev/null || true + echo " Restarted persistenceagent pod" + fi + else + echo " $CM_NAME in $DSPA_NS already has service-ca (or empty)" + fi + else + echo " WARNING: $CM_NAME not found in $DSPA_NS after ${TIMEOUT}s. Will need manual fix after DSPA creates it." + fi + done <<< "$DSPA_LIST" +elif [[ -z "$SERVICE_CA" ]]; then + echo " WARNING: Could not retrieve openshift-service-ca.crt — persistenceagent TLS fix skipped" +fi +``` + +### Step 14: Configure Dashboard Features + +```bash +echo "" +echo "=== Configuring Dashboard ===" + +# Wait for dashboard +TIMEOUT=300 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + READY=$(oc get deployment rhods-dashboard -n redhat-ods-applications -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo "0") + DESIRED=$(oc get deployment rhods-dashboard -n redhat-ods-applications -o jsonpath='{.spec.replicas}' 2>/dev/null || echo "0") + + if [[ "$READY" -gt 0 && "$READY" -eq "$DESIRED" ]]; then + echo "Dashboard deployment is ready ($READY/$DESIRED)" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) +done + +# Wait for OdhDashboardConfig +TIMEOUT=120 +ELAPSED=0 +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + break + fi + 
sleep 10 + ELAPSED=$((ELAPSED + 10)) +done + +# Enable feature flags +if oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + oc patch odhdashboardconfig odh-dashboard-config -n redhat-ods-applications --type merge -p '{ + "spec": { + "dashboardConfig": { + "automl": true, + "autorag": true, + "genAiStudio": true + } + } + }' 2>/dev/null || echo "WARNING: Failed to patch dashboard config" + + echo "Dashboard feature flags configured (automl, autorag, genAiStudio)" + + # Restart dashboard to pick up changes + oc rollout restart deployment rhods-dashboard -n redhat-ods-applications 2>/dev/null || true +else + echo "WARNING: OdhDashboardConfig not found. Feature flags will need manual configuration." +fi +``` + +### Step 15: Final Verification + +```bash +echo "" +echo "==========================================" +echo " RHOAI ${MODE^^} Summary (Disconnected)" +echo "==========================================" + +# CSV info +echo "" +echo "Operator CSV:" +oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator || echo " WARNING: CSV not found" + +# Version +echo "" +CSV_NAME=$(oc get csv -n redhat-ods-operator 2>/dev/null | awk '/rhods-operator/{print $1; exit}') +if [[ -n "$CSV_NAME" ]]; then + VERSION=$(oc get csv "$CSV_NAME" -n redhat-ods-operator -o jsonpath='{.spec.version}' 2>/dev/null) + echo "RHOAI Version: $VERSION" +fi + +# FBC image +echo "FBC Image: $FBC_IMAGE" +echo "Channel: $CHANNEL" +echo "Bastion: $BASTION" + +# DSC status +echo "" +echo "DataScienceCluster:" +DSC_PHASE=$(oc get datasciencecluster -o jsonpath='{.items[0].status.phase}' 2>/dev/null || echo "Unknown") +echo " Phase: $DSC_PHASE" + +# Dashboard URL +echo "" +echo "Dashboard:" +DASHBOARD_ROUTE=$(oc get route rhods-dashboard -n redhat-ods-applications -o jsonpath='{.spec.host}' 2>/dev/null || echo "") +if [[ -n "$DASHBOARD_ROUTE" ]]; then + echo " https://$DASHBOARD_ROUTE" +else + echo " Route not found yet" +fi + +# Pod health summary 
+echo "" +echo "Pod Health (RHOAI namespaces):" +for ns in redhat-ods-operator redhat-ods-applications; do + TOTAL=$(oc get pods -n "$ns" --no-headers 2>/dev/null | wc -l | tr -d ' ') + RUNNING=$(oc get pods -n "$ns" --no-headers 2>/dev/null | grep Running | wc -l | tr -d ' ') + ISSUES=$(oc get pods -n "$ns" --no-headers 2>/dev/null | grep -cE 'ImagePullBackOff|ErrImagePull|CrashLoopBackOff' | tr -d ' ') + echo " $ns: $RUNNING/$TOTAL running, $ISSUES with issues" +done + +echo "" +if [[ "$DSC_PHASE" == "Ready" ]]; then + echo "RHOAI ${MODE} on disconnected cluster complete!" +else + echo "RHOAI ${MODE} completed but DSC is not fully Ready." + echo "Check pod status and apply workarounds if needed." +fi +``` + +## Known Issues and Workarounds + +### 1. podToPodTLS CrashLoopBackOff (DSP Components) + +**Symptom:** Pipeline pods (`scheduledworkflow`, `persistenceagent`) crash with `flag provided but not defined: -caCertPath` + +**Cause:** RHOAI nightly build bug -- operator sets `--caCertPath` in deployment spec but the binary only supports `--mlPipelineServiceTLSCert` + +**Fix:** Applied automatically in Step 13a. For new DSPAs created after install: +```bash +oc patch datasciencepipelinesapplication <name> -n <namespace> --type='merge' -p '{"spec":{"podToPodTLS":false}}' +``` + +### 2. PersistenceAgent x509 Certificate Error + +**Symptom:** `persistenceagent` crashes with `x509: certificate signed by unknown authority` when connecting to `ds-pipeline-*.svc.cluster.local:8888` + +**Cause:** The DSP trusted CA configmap has Mozilla CA bundle but NOT the OpenShift service-ca that signed the pipeline API server cert + +**Fix:** Applied automatically in Step 13b. Manual fix: +```bash +# Get the service-ca +SERVICE_CA=$(oc get configmap openshift-service-ca.crt -n openshift-config-managed -o jsonpath='{.data.service-ca\.crt}') +# Append to the existing DSP CA configmap +``` + +### 3. 
Missing Images on Bastion + +**Symptom:** Multiple pods in `ImagePullBackOff` state after install/update + +**Cause:** Not all RHOAI images were mirrored to the bastion before install/update + +**Prevention:** Step 4 (pre-flight verification) catches this before proceeding. Always run `/mirror-images` on the connected cluster first. + +### 4. EvalHub Cross-Namespace Issues + +**Symptom:** EvalHub evaluation jobs fail when running in a different namespace than `evalhub` + +**Cause:** EvalHub operator creates K8s Jobs in the target namespace but doesn't create the required ServiceAccount (`evalhub-evalhub-job`) or ConfigMap (`evalhub-service-ca`) there + +**Fix:** Manually create the SA and copy the ConfigMap: +```bash +oc create sa evalhub-evalhub-job -n <target-namespace> +oc adm policy add-role-to-user edit system:serviceaccount:<target-namespace>:evalhub-evalhub-job -n <target-namespace> +oc get configmap evalhub-service-ca -n evalhub -o json | \ + jq 'del(.metadata.namespace,.metadata.resourceVersion,.metadata.uid,.metadata.creationTimestamp,.metadata.managedFields,.metadata.ownerReferences)' | \ + oc create -n <target-namespace> -f - +``` + +## Output + +The command creates a report at `artifacts/rhoai-manager/reports/disconnected-{install|update}-report-[timestamp].md` with: +- FBC image reference and digest +- Pre-flight verification results +- Operator CSV details +- DataScienceCluster status +- Pod health check results +- Workarounds applied +- Dashboard URL diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-install.md b/workflows/rhoai-manager/.claude/commands/rhoai-install.md new file mode 100644 index 00000000..d4c69e06 --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/rhoai-install.md @@ -0,0 +1,572 @@ +# /rhoai-install - Install RHOAI on OpenShift Cluster + +Install Red Hat OpenShift AI (RHOAI) on an OpenShift cluster using OLM (Operator Lifecycle Manager). 
+ +## Command Usage + +### Development/Nightly Builds (default) +```bash +/rhoai-install # Latest dev catalog (3.4 beta) +/rhoai-install channel=beta # Explicit beta channel +/rhoai-install image=quay.io/modh/rhoai-catalog:latest-release-3.5 # Custom image +``` + +### GA Production Releases +```bash +/rhoai-install catalog=redhat-operators # GA catalog, stable channel +/rhoai-install catalog=redhat-operators channel=fast # GA catalog, fast channel +/rhoai-install catalog=redhat-operators channel=stable # GA catalog, stable channel +``` + +### Combined Parameters +```bash +/rhoai-install catalog=rhoai-catalog-dev channel=beta image=quay.io/modh/rhoai-catalog:custom +``` + +## Catalog Types + +| Catalog | Description | Use Case | +|---------|-------------|----------| +| `rhoai-catalog-dev` (default) | Development nightly builds | Testing EA/nightly builds | +| `redhat-operators` | Red Hat certified GA releases | Production deployments | + +## Available Channels + +| Channel | Description | Catalog Type | +|---------|-------------|--------------| +| `beta` (default) | Latest EA/nightly builds | rhoai-catalog-dev | +| `fast` | Early GA releases | redhat-operators | +| `stable` | Stable GA releases | redhat-operators | + +## Prerequisites + +Before running this command: +1. **Cluster access**: Logged into OpenShift cluster with cluster-admin privileges (use `/oc-login`) +2. **Tools installed**: `oc` CLI and `jq` must be available +3. 
**No existing RHOAI**: This command is for fresh installations only + +## Process + +### Step 1: Parse Input Arguments + +```bash +# Default values +CATALOG_SOURCE="rhoai-catalog-dev" +CATALOG_IMAGE="" +CHANNEL="beta" +CUSTOM_IMAGE_OVERRIDE="" + +# Parse key=value arguments +for arg in "$@"; do + case "$arg" in + catalog=*) + CATALOG_SOURCE="${arg#*=}" + ;; + channel=*) + CHANNEL="${arg#*=}" + ;; + image=*) + CUSTOM_IMAGE_OVERRIDE="${arg#*=}" + ;; + *) + echo "⚠️ Unknown parameter: $arg (expected: catalog=, channel=, or image=)" + ;; + esac +done + +# Smart defaults based on catalog type +if [[ "$CATALOG_SOURCE" == "rhoai-catalog-dev" ]]; then + # Development catalog - use custom image or default + if [[ -n "$CUSTOM_IMAGE_OVERRIDE" ]]; then + CATALOG_IMAGE="$CUSTOM_IMAGE_OVERRIDE" + else + CATALOG_IMAGE="quay.io/modh/rhoai-catalog:latest-release-3.4" + fi + CATALOG_NAMESPACE="openshift-marketplace" + USE_CUSTOM_CATALOG=true + + echo "📦 Catalog: Development (rhoai-catalog-dev)" + echo " Image: $CATALOG_IMAGE" + echo " Channel: $CHANNEL" + +elif [[ "$CATALOG_SOURCE" == "redhat-operators" ]]; then + # GA catalog - uses built-in Red Hat operators catalog + CATALOG_IMAGE="" + CATALOG_NAMESPACE="openshift-marketplace" + USE_CUSTOM_CATALOG=false + + echo "📦 Catalog: GA Production (redhat-operators)" + echo " Channel: $CHANNEL" + + if [[ -n "$CUSTOM_IMAGE_OVERRIDE" ]]; then + echo "⚠️ WARNING: image parameter ignored for redhat-operators catalog (uses built-in catalog)" + fi + +else + echo "❌ ERROR: Unknown catalog '$CATALOG_SOURCE'" + echo " Supported: rhoai-catalog-dev, redhat-operators" + exit 1 +fi +``` + +**Parameter Summary:** +- `catalog` - Catalog source to use (default: `rhoai-catalog-dev`) +- `channel` - Subscription channel (default: `beta`) +- `image` - Custom catalog image (only for rhoai-catalog-dev) + +### Step 2: Verify Cluster Access + +```bash +# Check prerequisites +command -v oc &>/dev/null || die "oc command not found" +command -v jq &>/dev/null || die 
"jq command not found" +oc whoami &>/dev/null || die "Not logged into an OpenShift cluster" + +echo "Logged in as: $(oc whoami)" +echo "Cluster: $(oc whoami --show-server)" + +# Check if ODH is installed — RHOAI and ODH cannot coexist +if oc get csv -n openshift-operators 2>/dev/null | grep -q opendatahub-operator; then + ODH_CSV=$(oc get csv -n openshift-operators --no-headers 2>/dev/null | grep opendatahub-operator | awk '{print $1}') + echo "" + echo "ERROR: ODH (Open Data Hub) is installed on this cluster ($ODH_CSV)" + echo "" + echo "RHOAI and ODH cannot coexist — they both manage the same" + echo "cluster-scoped DataScienceCluster CRD and overlapping operators." + echo "" + echo "To install RHOAI, first uninstall ODH:" + echo " /odh-uninstall" + echo "" + echo "Then re-run:" + echo " /rhoai-install" + die "ODH must be uninstalled before installing RHOAI" +fi + +# Verify RHOAI is not already installed +if oc get csv -n redhat-ods-operator 2>/dev/null | grep -q rhods-operator; then + die "RHOAI is already installed. Use /rhoai-update to update existing installation." +fi +``` + +### Step 3: Create Operator Namespace + +```bash +OPERATOR_NAMESPACE="redhat-ods-operator" + +# Create namespace if it doesn't exist +if ! 
oc get namespace "$OPERATOR_NAMESPACE" &>/dev/null; then + oc create namespace "$OPERATOR_NAMESPACE" + echo "✅ Created namespace: $OPERATOR_NAMESPACE" +else + echo "✅ Namespace already exists: $OPERATOR_NAMESPACE" +fi +``` + +### Step 4: Create CatalogSource (if using custom catalog) + +```bash +if [[ "$USE_CUSTOM_CATALOG" == "true" ]]; then + echo "Creating custom CatalogSource: $CATALOG_SOURCE" + + cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: CatalogSource +metadata: + name: $CATALOG_SOURCE + namespace: $CATALOG_NAMESPACE +spec: + displayName: "Red Hat OpenShift AI Dev Catalog" + image: $CATALOG_IMAGE + publisher: Red Hat + sourceType: grpc + updateStrategy: + registryPoll: + interval: 30m +EOF + + echo "✅ CatalogSource created: $CATALOG_SOURCE" + + # Wait for catalog to be ready + echo "Waiting for CatalogSource to be ready..." + TIMEOUT=300 + INTERVAL=10 + ELAPSED=0 + + while [[ $ELAPSED -lt $TIMEOUT ]]; do + CATALOG_STATE=$(oc get catalogsource "$CATALOG_SOURCE" -n "$CATALOG_NAMESPACE" \ + -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null || echo "") + + if [[ "$CATALOG_STATE" == "READY" ]]; then + echo "✅ CatalogSource is READY" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo " CatalogSource state: ${CATALOG_STATE:-Unknown} (${ELAPSED}s/${TIMEOUT}s)" + done + + [[ "$CATALOG_STATE" == "READY" ]] || echo "⚠️ WARNING: CatalogSource not READY after ${TIMEOUT}s" +else + echo "Using built-in catalog: $CATALOG_SOURCE" +fi +``` + +### Step 5: Create OperatorGroup + +```bash +# Create OperatorGroup in operator namespace +cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1 +kind: OperatorGroup +metadata: + name: rhods-operator + namespace: $OPERATOR_NAMESPACE +spec: + targetNamespaces: + - $OPERATOR_NAMESPACE +EOF + +echo "✅ OperatorGroup created" +``` + +### Step 6: Create Subscription + +```bash +# Create Subscription +cat <<EOF | oc apply -f - +apiVersion: 
operators.coreos.com/v1alpha1 +kind: Subscription +metadata: + name: rhods-operator + namespace: $OPERATOR_NAMESPACE +spec: + channel: $CHANNEL + installPlanApproval: Automatic + name: rhods-operator + source: $CATALOG_SOURCE + sourceNamespace: $CATALOG_NAMESPACE +EOF + +echo "✅ Subscription created" +echo " Channel: $CHANNEL" +echo " Source: $CATALOG_SOURCE" + +sleep 5 +``` + +This creates: +- **Namespace**: `redhat-ods-operator` +- **CatalogSource**: Custom catalog (if using dev catalog) or uses built-in `redhat-operators` +- **Subscription**: `rhods-operator` pointing to the chosen catalog +- **OperatorGroup**: For the operator namespace + +### Step 7: Wait for Operator CSV + +```bash +# Wait up to 600 seconds for CSV to reach Succeeded +CSV_PHASE="" +TIMEOUT=600 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CSV_LINE=$(oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator | grep -v Replacing || echo "") + + if [[ -n "$CSV_LINE" ]]; then + CSV_NAME=$(echo "$CSV_LINE" | awk "{print \$1}") + CSV_PHASE=$(echo "$CSV_LINE" | awk "{print \$NF}") + echo "CSV: $CSV_NAME, Phase: $CSV_PHASE" + + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + echo "✅ Operator installed successfully" + break + fi + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for rhods-operator CSV... (${ELAPSED}s/${TIMEOUT}s)" +done + +[[ "$CSV_PHASE" == "Succeeded" ]] || die "Operator did not reach Succeeded phase within ${TIMEOUT}s" +``` + +### Step 8: Create DataScienceCluster + +```bash +# Wait for DSCInitialization +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get dscinitializations default-dsci &>/dev/null; then + echo "✅ DSCInitialization found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for DSCInitialization... 
(${ELAPSED}s/${TIMEOUT}s)" +done + +oc get dscinitializations default-dsci &>/dev/null || die "DSCInitialization not found within ${TIMEOUT}s" + +# Extract DSC from CSV initialization-resource +CSV_NAME=$(oc get csv -n redhat-ods-operator 2>/dev/null | awk '/rhods-operator/{print $1; exit}') +if [[ -n "$CSV_NAME" ]]; then + oc get csv "$CSV_NAME" -n redhat-ods-operator \ + -o jsonpath='{.metadata.annotations.operatorframework\.io/initialization-resource}' \ + > /tmp/default-dsc.json + + oc apply -f /tmp/default-dsc.json + echo "✅ DSC created from CSV initialization-resource" +else + die "Cannot find rhods-operator CSV in redhat-ods-operator namespace" +fi +``` + +### Step 9: Configure DSC Components + +```bash +# Wait for DSC to exist +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get datasciencecluster default-dsc &>/dev/null; then + echo "✅ DataScienceCluster found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for DataScienceCluster... 
(${ELAPSED}s/${TIMEOUT}s)" +done + +# Patch DSC to enable required components +cat > /tmp/dsc-components-patch.yaml << 'YAML' +spec: + components: + aipipelines: + managementState: Managed + argoWorkflowsControllers: + managementState: Managed + llamastackoperator: + managementState: Managed + mlflowoperator: + managementState: Managed + trainer: + managementState: Removed +YAML + +oc patch datasciencecluster default-dsc --type merge --patch-file /tmp/dsc-components-patch.yaml || \ + die "Failed to patch DataScienceCluster" + +echo "✅ DSC component configuration applied:" +echo " - aipipelines: Managed (with argoWorkflowsControllers)" +echo " - llamastackoperator: Managed" +echo " - mlflowoperator: Managed" +echo " - trainer: Removed (requires JobSet operator)" + +sleep 5 +``` + +**Why these components?** +- `aipipelines`: For AI/ML pipelines with Argo Workflows +- `llamastackoperator`: For Llama Stack server deployments +- `mlflowoperator`: For ML experiment tracking +- `trainer`: Removed (requires JobSet operator, not available by default) + +### Step 10: Wait for DSC Ready + +```bash +# Wait for DataScienceCluster to be Ready +TIMEOUT=600 +INTERVAL=15 +ELAPSED=0 +DSC_PHASE="" + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + DSC_PHASE=$(oc get datasciencecluster -o jsonpath="{.items[0].status.phase}" 2>/dev/null || echo "Unknown") + echo "DSC phase: $DSC_PHASE" + + if [[ "$DSC_PHASE" == "Ready" ]]; then + echo "✅ DataScienceCluster is Ready" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for DataScienceCluster... 
(${ELAPSED}s/${TIMEOUT}s)" +done + +if [[ "$DSC_PHASE" != "Ready" ]]; then + echo "⚠️ WARNING: DSC is not Ready after ${TIMEOUT}s (current: ${DSC_PHASE:-Unknown})" + echo "Not-ready components:" + oc get dsc default-dsc -o json 2>/dev/null | \ + jq -r '.status.conditions[] | select(.status=="False") | select(.message | test("Removed") | not) | " \(.type): \(.message)"' 2>/dev/null || true +fi +``` + +### Step 11: Wait for Dashboard + +```bash +# Wait for dashboard deployment to be ready +TIMEOUT=300 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + READY=$(oc get deployment rhods-dashboard -n redhat-ods-applications -o jsonpath="{.status.readyReplicas}" 2>/dev/null || echo "0") + DESIRED=$(oc get deployment rhods-dashboard -n redhat-ods-applications -o jsonpath="{.spec.replicas}" 2>/dev/null || echo "0") + + if [[ "$READY" -gt 0 && "$READY" -eq "$DESIRED" ]]; then + echo "✅ Dashboard deployment is ready" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for dashboard deployment... (${ELAPSED}s/${TIMEOUT}s)" +done + +echo "Dashboard containers:" +oc get deployment rhods-dashboard -n redhat-ods-applications \ + -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\n"}{end}' 2>/dev/null || \ + echo " Dashboard deployment not found" +``` + +### Step 12: Configure Dashboard Features + +```bash +# Wait for OdhDashboardConfig to exist +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + echo "✅ OdhDashboardConfig found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for OdhDashboardConfig... (${ELAPSED}s/${TIMEOUT}s)" +done + +if ! 
oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + echo "⚠️ WARNING: OdhDashboardConfig not found yet, feature flags will be configured when available" +else + # Enable feature flags + oc patch odhdashboardconfig odh-dashboard-config -n redhat-ods-applications --type merge -p '{ + "spec": { + "dashboardConfig": { + "automl": true, + "autorag": true, + "genAiStudio": true + } + } + }' || { + echo "⚠️ WARNING: Failed to patch dashboard config, feature flags may need manual configuration" + } + + echo "✅ Dashboard feature flags configured:" + echo " - automl: enabled" + echo " - autorag: enabled" + echo " - genAiStudio: enabled" + + # Restart dashboard to pick up changes + echo "Restarting dashboard to apply feature flag changes..." + oc rollout restart deployment rhods-dashboard -n redhat-ods-applications 2>/dev/null || true + sleep 3 +fi +``` + +### Step 13: Verify Installation + +```bash +echo "" +echo "=== Installation Summary ===" + +# Show CSV +echo "" +echo "CSV:" +oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator || echo " WARNING: CSV not found" + +# Show Dashboard URL +echo "" +echo "Dashboard:" +DASHBOARD_ROUTE=$(oc get route rhods-dashboard -n redhat-ods-applications -o jsonpath='{.spec.host}' 2>/dev/null || echo "") +if [[ -n "$DASHBOARD_ROUTE" ]]; then + echo " https://$DASHBOARD_ROUTE" +else + echo " WARNING: Dashboard route not found yet" +fi + +echo "" +echo "✅ RHOAI installation complete!" 
+``` + +## Output + +The command creates a report at `artifacts/rhoai-manager/reports/install-report-[timestamp].md` with: +- Installation parameters (catalog source, channel, image) +- Operator CSV details +- DataScienceCluster status +- Configured components +- Dashboard URL +- Feature flags enabled + +## Usage Examples + +### Development/Testing +```bash +# Install latest dev build (default: beta channel, dev catalog) +/rhoai-install + +# Install from dev catalog with custom image +/rhoai-install image=quay.io/modh/rhoai-catalog:latest-release-3.5 + +# Install from dev catalog with specific channel +/rhoai-install channel=beta +``` + +### Production GA +```bash +# Install from GA catalog (stable channel) +/rhoai-install catalog=redhat-operators channel=stable + +# Install from GA catalog (fast channel for early GA releases) +/rhoai-install catalog=redhat-operators channel=fast + +# Install from GA catalog with default stable channel +/rhoai-install catalog=redhat-operators +``` + +Or simply ask: +- "Install RHOAI from dev catalog" +- "Install RHOAI from production catalog" +- "Set up RHOAI on my cluster" +- "Install latest RHOAI nightly" + +## Common Issues + +**Problem:** CSV stuck in "Installing" phase +**Solution:** Check operator pod logs in `redhat-ods-operator` namespace + +**Problem:** DSC not reaching Ready +**Solution:** Check component conditions with `oc get dsc default-dsc -o yaml | yq '.status.conditions'` + +**Problem:** Dashboard not accessible +**Solution:** Verify route exists and check dashboard pod logs in `redhat-ods-applications` + +## Next Steps + +After installation: +1. Access the dashboard at the URL shown in the output +2. Configure user access and permissions +3. Deploy models and workbenches +4. Set up data connections + +To update RHOAI to a newer version, use `/rhoai-update`. 
diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-uninstall.md b/workflows/rhoai-manager/.claude/commands/rhoai-uninstall.md new file mode 100644 index 00000000..97ca5d8d --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/rhoai-uninstall.md @@ -0,0 +1,436 @@ +# /rhoai-uninstall - Uninstall RHOAI from Cluster + +Completely uninstall Red Hat OpenShift AI (RHOAI) from an OpenShift cluster, removing all related resources. + +## Purpose + +This command performs a comprehensive cleanup of RHOAI, removing the operator, custom resources, CRDs, and all related namespaces. + +## Prerequisites + +- Must be logged into an OpenShift cluster (use `/oc-login` first if needed) +- Cluster admin permissions required +- RHOAI must be installed on the cluster + +## Command Usage + +- `/rhoai-uninstall` - Standard uninstall (forceful cleanup) +- `/rhoai-uninstall graceful` - Graceful uninstall followed by forceful cleanup +- `/rhoai-uninstall keep-crds` - Uninstall but keep CRDs +- `/rhoai-uninstall keep-all` - Keep CRDs and user resources (projects, models, etc.) + +## Uninstall Options + +### Standard Uninstall (Default) +Forcefully removes all RHOAI resources including: +- Operator and subscriptions +- Custom resources (DSC, DSCInitialization, etc.) 
+- CRDs +- Namespaces +- User resources (data science projects, models, workbenches) + +### Graceful Uninstall +Attempts graceful removal first, then forceful cleanup: +- Allows RHOAI to clean up resources in proper order +- Runs finalizers correctly +- Falls back to forceful cleanup if graceful fails + +### Keep CRDs +Removes RHOAI but keeps the CRDs installed + +### Keep All +Keeps both CRDs and user resources: +- Data science projects remain +- User models, workbenches, connections preserved +- Useful for reinstalling RHOAI without losing user work + +## Uninstall Process + +### Step 1: Verify Cluster Access + +Check that you're logged into the cluster with admin permissions: + +```bash +# Verify login +oc whoami + +# Verify admin permissions +oc auth can-i delete namespace +``` + +If not logged in or lacking permissions, stop and inform the user. + +### Step 2: Check Current RHOAI Installation + +Verify RHOAI is installed: + +```bash +# Check for RHOAI operator namespace +oc get namespace redhat-ods-operator 2>/dev/null + +# Check for RHOAI operator +oc get csv -n redhat-ods-operator | grep rhods-operator + +# Check for DataScienceCluster +oc get datasciencecluster -A +``` + +Report what's found and confirm with user before proceeding. + +### Step 3: Graceful Uninstall (if requested) + +If graceful uninstall is requested: + +```bash +# Create the deletion ConfigMap +oc create configmap delete-self-managed-odh -n redhat-ods-operator + +# Label it to trigger graceful deletion +oc label configmap/delete-self-managed-odh \ + api.openshift.com/addon-managed-odh-delete=true \ + -n redhat-ods-operator + +# Wait for redhat-ods-applications namespace to be removed (up to 5 minutes) +echo "Waiting for graceful deletion to complete (max 5 minutes)..." 
+if oc wait --for=delete --timeout=300s namespace redhat-ods-applications 2>/dev/null; then
+  echo "✅ Graceful deletion completed successfully"
+else
+  echo "⚠️ Graceful deletion timed out or failed, proceeding with forceful cleanup"
+fi
+
+# Clean up the ConfigMap
+oc delete configmap delete-self-managed-odh -n redhat-ods-operator --ignore-not-found
+```
+
+### Step 4: Delete RHOAI Custom Resources
+
+Remove all RHOAI custom resources before deleting CRDs:
+
+```bash
+# Delete DataScienceCluster (cluster-scoped, so no namespace handling needed)
+echo "Deleting DataScienceCluster resources..."
+for dsc in $(oc get datasciencecluster -o name 2>/dev/null); do
+  oc patch "$dsc" --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true
+  oc delete "$dsc" --timeout=60s --ignore-not-found
+done
+
+# Delete DSCInitialization (also cluster-scoped)
+echo "Deleting DSCInitialization resources..."
+for dsci in $(oc get dscinitialization -o name 2>/dev/null); do
+  oc patch "$dsci" --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true
+  oc delete "$dsci" --timeout=60s --ignore-not-found
+done
+
+# Delete Notebooks (namespaced, and they often have finalizers)
+echo "Deleting Notebook resources..."
+oc get notebooks.kubeflow.org -A -o custom-columns=:metadata.name,:metadata.namespace --no-headers 2>/dev/null | \
+  while read -r name namespace; do
+    oc patch notebooks.kubeflow.org "$name" -n "$namespace" --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true
+    oc delete notebooks.kubeflow.org "$name" -n "$namespace" --timeout=60s --ignore-not-found
+  done
+
+# Delete InferenceServices
+echo "Deleting InferenceService resources..."
+oc delete inferenceservices.serving.kserve.io --all -A --ignore-not-found --timeout=60s
+
+# Delete ServingRuntimes
+echo "Deleting ServingRuntime resources..."
+oc delete servingruntimes.serving.kserve.io --all -A --ignore-not-found --timeout=60s + +# Delete DataSciencePipelinesApplications +echo "Deleting DataSciencePipelinesApplication resources..." +oc delete datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io --all -A --ignore-not-found --timeout=60s +``` + +### Step 5: Delete Webhooks + +Remove validating and mutating webhooks that may block deletion: + +```bash +# Delete RHOAI-related validating webhooks +echo "Deleting validating webhooks..." +oc get validatingwebhookconfiguration -o json | \ + jq -r '.items[] | select(.metadata.name | test("odh|rhods|opendatahub|kserve")) | .metadata.name' | \ + xargs -r oc delete validatingwebhookconfiguration + +# Delete RHOAI-related mutating webhooks +echo "Deleting mutating webhooks..." +oc get mutatingwebhookconfiguration -o json | \ + jq -r '.items[] | select(.metadata.name | test("odh|rhods|opendatahub|kserve")) | .metadata.name' | \ + xargs -r oc delete mutatingwebhookconfiguration +``` + +### Step 6: Delete RHOAI Operator + +Remove the operator subscription and CSV: + +```bash +# Delete subscription +echo "Deleting RHOAI operator subscription..." +oc delete subscription rhods-operator -n redhat-ods-operator --ignore-not-found --timeout=60s + +# Delete CSV +echo "Deleting ClusterServiceVersion..." +CSV_NAME=$(oc get csv -n redhat-ods-operator -o custom-columns=:metadata.name --no-headers | grep rhods-operator) +if [ -n "$CSV_NAME" ]; then + oc delete csv $CSV_NAME -n redhat-ods-operator --ignore-not-found --timeout=60s +fi + +# Delete catalog source if it's a dev catalog +echo "Checking for dev catalog sources..." +if oc get catalogsource rhoai-catalog-dev -n openshift-marketplace &>/dev/null; then + echo "Deleting rhoai-catalog-dev..." 
+  oc delete catalogsource rhoai-catalog-dev -n openshift-marketplace --ignore-not-found
+fi
+```
+
+### Step 7: Delete Namespaces
+
+Remove all RHOAI-related namespaces:
+
+```bash
+# List of RHOAI namespaces
+NAMESPACES="redhat-ods-operator redhat-ods-applications redhat-ods-applications-auth-provider redhat-ods-monitoring rhods-notebooks rhoai-model-registries"
+
+for ns in $NAMESPACES; do
+  if oc get namespace "$ns" &>/dev/null; then
+    echo "Deleting namespace: $ns"
+
+    # Delete all resources in the namespace first
+    oc delete all --all -n "$ns" --ignore-not-found --timeout=30s 2>/dev/null || true
+
+    # Delete the namespace
+    oc delete namespace "$ns" --ignore-not-found --timeout=60s || true
+
+    # If stuck in Terminating, clear spec.finalizers via the finalize subresource
+    # (a plain merge patch on spec.finalizers is ignored for namespaces)
+    if oc get namespace "$ns" -o jsonpath='{.status.phase}' 2>/dev/null | grep -q "Terminating"; then
+      echo "  Namespace stuck in Terminating, removing finalizers..."
+      oc get namespace "$ns" -o json | jq '.spec.finalizers = []' | \
+        oc replace --raw "/api/v1/namespaces/$ns/finalize" -f - 2>/dev/null || true
+    fi
+  fi
+done
+```
+
+### Step 8: Delete CRDs (unless keep-crds or keep-all)
+
+If the user didn't request to keep CRDs:
+
+```bash
+echo "Deleting RHOAI CRDs..."
+ +# Core RHOAI CRDs +oc delete crd datascienceclusters.datasciencecluster.opendatahub.io --ignore-not-found +oc delete crd dscinitializations.dscinitialization.opendatahub.io --ignore-not-found +oc delete crd acceleratorprofiles.dashboard.opendatahub.io --ignore-not-found +oc delete crd hardwareprofiles.dashboard.opendatahub.io --ignore-not-found +oc delete crd odhapplications.dashboard.opendatahub.io --ignore-not-found +oc delete crd odhdashboardconfigs.opendatahub.io --ignore-not-found +oc delete crd odhdocuments.dashboard.opendatahub.io --ignore-not-found +oc delete crd modelregistries.modelregistry.opendatahub.io --ignore-not-found + +# KServe CRDs +oc delete crd inferenceservices.serving.kserve.io --ignore-not-found +oc delete crd servingruntimes.serving.kserve.io --ignore-not-found +oc delete crd inferencegraphs.serving.kserve.io --ignore-not-found + +# Notebook CRDs (remove finalizers first) +oc get notebooks.kubeflow.org -A -o custom-columns=:metadata.name,:metadata.namespace --no-headers | \ + while read name namespace; do + oc patch notebooks.kubeflow.org $name -n $namespace --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true + done +oc delete crd notebooks.kubeflow.org --ignore-not-found + +# DataSciencePipelinesApplications +oc delete crd datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io --ignore-not-found + +# All CRDs labeled by RHOAI operator +oc delete crd -l operators.coreos.com/rhods-operator.redhat-ods-operator --ignore-not-found + +# Ray CRDs +oc delete crd rayclusters.ray.io --ignore-not-found +oc delete crd rayjobs.ray.io --ignore-not-found +oc delete crd rayservices.ray.io --ignore-not-found + +# CodeFlare CRDs +oc delete crd appwrappers.workload.codeflare.dev --ignore-not-found + +# TrustyAI CRDs +oc delete crd trustyaiservices.trustyai.opendatahub.io --ignore-not-found +``` + +### Step 9: Clean Up User Resources (if keep-all not requested) + +Remove user data science projects and resources: 
+ +```bash +# Find and delete data science project namespaces +echo "Looking for user data science projects..." +USER_PROJECTS=$(oc get namespaces -l opendatahub.io/dashboard=true -o custom-columns=:metadata.name --no-headers) + +if [ -n "$USER_PROJECTS" ]; then + echo "Found user projects: $USER_PROJECTS" + for project in $USER_PROJECTS; do + echo " Deleting project: $project" + oc delete namespace $project --ignore-not-found --timeout=60s || true + done +else + echo "No user data science projects found" +fi +``` + +### Step 10: Verify Cleanup + +Check that all resources have been removed: + +```bash +# Check for remaining RHOAI namespaces +echo "Checking for remaining RHOAI namespaces..." +REMAINING_NS=$(oc get namespaces | grep -E "redhat-ods|rhods|rhoai" || echo "") +if [ -n "$REMAINING_NS" ]; then + echo "⚠️ Some namespaces still exist:" + echo "$REMAINING_NS" +else + echo "✅ All RHOAI namespaces removed" +fi + +# Check for RHOAI CRDs +echo "Checking for remaining RHOAI CRDs..." +REMAINING_CRDS=$(oc get crd | grep -E "opendatahub|kubeflow|kserve" || echo "") +if [ -n "$REMAINING_CRDS" ]; then + echo "⚠️ Some CRDs still exist:" + echo "$REMAINING_CRDS" +else + echo "✅ All RHOAI CRDs removed" +fi + +# Check for operator +echo "Checking for RHOAI operator..." +REMAINING_CSV=$(oc get csv -A | grep rhods-operator || echo "") +if [ -n "$REMAINING_CSV" ]; then + echo "⚠️ RHOAI operator still exists:" + echo "$REMAINING_CSV" +else + echo "✅ RHOAI operator removed" +fi +``` + +### Step 11: Report Summary + +Provide a summary of what was removed: + +``` +✅ RHOAI Uninstall Complete! + +Removed: +- RHOAI Operator +- DataScienceCluster and DSCInitialization +- All RHOAI namespaces +- Custom Resources (notebooks, inference services, etc.) +[- CRDs (if not kept)] +[- User data science projects (if not kept)] + +The cluster is now clean and ready for a fresh RHOAI installation if needed. +``` + +## Important Warnings + +**Before running this command, warn the user:** + +1. 
**⚠️ Data Loss Warning** + - This will DELETE all RHOAI resources including user workbenches, models, and data + - User should backup any important work first + - Cannot be undone + +2. **⚠️ Cluster Access Required** + - Requires cluster-admin permissions + - Will modify cluster-wide resources (CRDs, webhooks) + +3. **⚠️ Downtime Warning** + - Any running workloads will be terminated + - Data science pipelines will be stopped + - Active model servers will be shut down + +## Example Interactions + +### Example 1: Standard Uninstall + +**User**: `/rhoai-uninstall` + +**Claude**: +1. Checks cluster access and RHOAI installation +2. Warns about data loss and asks for confirmation +3. Deletes custom resources +4. Removes webhooks +5. Deletes operator +6. Removes namespaces +7. Deletes CRDs +8. Reports: "✅ RHOAI completely removed from cluster" + +### Example 2: Graceful Uninstall + +**User**: `/rhoai-uninstall graceful` + +**Claude**: +1. Creates deletion ConfigMap +2. Waits for graceful deletion (up to 5 minutes) +3. If graceful succeeds, cleans up remaining resources +4. If graceful fails/times out, proceeds with forceful cleanup +5. Reports final status + +### Example 3: Keep User Resources + +**User**: `/rhoai-uninstall keep-all` + +**Claude**: +1. Removes RHOAI operator and core resources +2. Keeps CRDs installed +3. Preserves user data science projects +4. Reports: "✅ RHOAI operator removed. CRDs and user projects preserved." 
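The `keep-crds` and `keep-all` behaviors in the examples above reduce to two guard variables that Steps 8 and 9 consult before deleting anything. A minimal sketch of that argument handling (the `parse_keep_flags` function and variable names are illustrative, not part of the actual command):

```bash
#!/usr/bin/env bash
# Map the optional keep-crds / keep-all arguments onto guard variables.
# Step 8 (CRD deletion) would run only when KEEP_CRDS=false;
# Step 9 (user project cleanup) only when KEEP_USER_PROJECTS=false.
parse_keep_flags() {
  KEEP_CRDS=false
  KEEP_USER_PROJECTS=false
  local arg
  for arg in "$@"; do
    case "$arg" in
      keep-crds) KEEP_CRDS=true ;;
      keep-all)  KEEP_CRDS=true; KEEP_USER_PROJECTS=true ;;
    esac
  done
}

parse_keep_flags keep-all
echo "KEEP_CRDS=$KEEP_CRDS KEEP_USER_PROJECTS=$KEEP_USER_PROJECTS"
# → KEEP_CRDS=true KEEP_USER_PROJECTS=true
```

Guarding the destructive steps this way keeps the default path fully destructive while making the `keep-*` opt-outs explicit.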
+
+## Troubleshooting
+
+### Issue 1: Namespaces Stuck in Terminating
+
+**Cause**: Finalizers or webhooks blocking deletion
+
+**Solution**:
+```bash
+# Remove finalizers
+oc patch namespace <ns-name> -p '{"spec":{"finalizers":[]}}' --type=merge
+
+# Delete RHOAI-related blocking webhooks (avoid --all, which would also
+# delete webhooks belonging to other operators and cluster components)
+oc get validatingwebhookconfiguration -o name | grep -E "odh|rhods|opendatahub|kserve" | xargs -r oc delete
+oc get mutatingwebhookconfiguration -o name | grep -E "odh|rhods|opendatahub|kserve" | xargs -r oc delete
+```
+
+### Issue 2: CRDs Won't Delete
+
+**Cause**: Custom resources for that CRD still exist
+
+**Solution**: Delete all remaining custom resources first; if any hang, remove their finalizers, then retry the CRD deletion
+
+### Issue 3: Permission Denied
+
+**Cause**: Insufficient permissions
+
+**Solution**: Must be cluster-admin. Check with `oc auth can-i delete namespace`
+
+## Integration with Other Commands
+
+Typical workflow:
+```
+/oc-login          # Login to cluster
+/rhoai-version     # Check what's installed
+/rhoai-uninstall   # Remove RHOAI
+```
+
+## Success Criteria
+
+Uninstall is successful when:
+- ✅ All RHOAI namespaces deleted
+- ✅ RHOAI operator removed
+- ✅ CRDs deleted (unless kept)
+- ✅ No RHOAI webhooks remain
+- ✅ `oc get csv -A | grep rhods-operator` returns nothing
diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-update.md b/workflows/rhoai-manager/.claude/commands/rhoai-update.md
new file mode 100644
index 00000000..c07b07b9
--- /dev/null
+++ b/workflows/rhoai-manager/.claude/commands/rhoai-update.md
@@ -0,0 +1,642 @@
+# /rhoai-update - Update RHOAI to Newer Build
+
+Update an existing Red Hat OpenShift AI (RHOAI) installation to a newer nightly build or version.
+ +## Command Usage + +- `/rhoai-update` - Update to latest available nightly (currently 3.4, preserves current channel) +- `/rhoai-update 3.4` - Update to RHOAI 3.4 (preserves current channel) +- `/rhoai-update 3.4-ea.2` - Update to RHOAI 3.4 EA build 2 +- `/rhoai-update 3.4 -c beta` - Update to 3.4 and change channel to beta +- `/rhoai-update 3.3 -c stable-3.3` - Update to 3.3 and change to stable-3.3 channel +- `/rhoai-update 3.4@sha256:abc123...` - Update to 3.4 with specific SHA digest + +## Available Channels + +| Channel | Description | Use Case | +|---------|-------------|----------| +| `beta` | Latest EA builds | Testing 3.4.0-ea.x builds | +| `stable` | Latest GA release across all versions | Production stable | +| `stable-3.4` | RHOAI 3.4.x GA | Latest 3.4 GA nightly (recommended) | +| `stable-3.3` | RHOAI 3.3.x GA | Stable 3.3 releases | + +## Prerequisites + +Before running this command: +1. **Existing RHOAI**: RHOAI must already be installed (use `/rhoai-install` for fresh installations) +2. **Cluster access**: Logged into OpenShift cluster with cluster-admin privileges (use `/oc-login`) +3. 
**Tools installed**: `oc` CLI and `jq` must be available
+
+## Process
+
+### Step 1: Parse Input Arguments
+
+```bash
+# Default values
+VERSION_ARG=""
+CHANNEL=""  # Will be set from existing subscription if not specified
+USER_SPECIFIED_CHANNEL=false
+
+# Parse arguments
+while [[ $# -gt 0 ]]; do
+  case $1 in
+    -c|--channel)
+      CHANNEL="$2"
+      USER_SPECIFIED_CHANNEL=true
+      shift 2
+      ;;
+    *)
+      VERSION_ARG="$1"
+      shift
+      ;;
+  esac
+done
+
+# Build image URL
+if [[ -z "$VERSION_ARG" ]]; then
+  IMAGE="quay.io/rhoai/rhoai-fbc-fragment:rhoai-3.4"
+  echo "No version specified, defaulting to RHOAI 3.4"
+elif [[ "$VERSION_ARG" == *"/"* ]]; then
+  IMAGE="$VERSION_ARG"
+elif [[ "$VERSION_ARG" == rhoai-* ]]; then
+  IMAGE="quay.io/rhoai/rhoai-fbc-fragment:${VERSION_ARG}"
+else
+  IMAGE="quay.io/rhoai/rhoai-fbc-fragment:rhoai-${VERSION_ARG}"
+fi
+
+echo "Target image: $IMAGE"
+```
+
+### Step 2: Verify Cluster Access and Existing Installation
+
+```bash
+# Helper used by this and later steps: print an error and abort
+die() { echo "ERROR: $*" >&2; exit 1; }
+
+# Check prerequisites
+command -v oc &>/dev/null || die "oc command not found"
+command -v jq &>/dev/null || die "jq command not found"
+oc whoami &>/dev/null || die "Not logged into an OpenShift cluster"
+
+echo "Logged in as: $(oc whoami)"
+echo "Cluster: $(oc whoami --show-server)"
+
+# Verify RHOAI is already installed
+if ! oc get csv -n redhat-ods-operator 2>/dev/null | grep -q rhods-operator; then
+  die "RHOAI is not installed. Use /rhoai-install for a fresh installation."
+fi
+
+echo "✅ Detected existing RHOAI installation"
+```
+
+### Step 3: Handle Channel Preservation/Change
+
+```bash
+# Get existing channel from subscription
+EXISTING_CHANNEL=$(oc get subscription -n redhat-ods-operator -o jsonpath='{.items[0].spec.channel}' 2>/dev/null || echo "")
+
+if [[ -n "$EXISTING_CHANNEL" ]]; then
+  echo "Current channel: $EXISTING_CHANNEL"
+
+  if [[ "$USER_SPECIFIED_CHANNEL" == "true" && "$CHANNEL" != "$EXISTING_CHANNEL" ]]; then
+    echo ""
+    echo "⚠️ WARNING: Channel change requested!"
+ echo " Current channel: $EXISTING_CHANNEL" + echo " New channel: $CHANNEL" + echo " Changing channels may cause unexpected upgrades or downgrades!" + echo "" + + # In interactive mode, prompt user + # In automated mode, preserve existing channel for safety + if [[ -t 0 ]]; then + read -p "Do you want to CHANGE the channel? [y/N] " -n 1 -r + echo + if [[ ! $REPLY =~ ^[Yy]$ ]]; then + echo "Preserving existing channel: $EXISTING_CHANNEL" + CHANNEL="$EXISTING_CHANNEL" + fi + else + echo "Automated mode: Preserving existing channel for safety" + CHANNEL="$EXISTING_CHANNEL" + fi + else + # User didn't specify channel, preserve existing + CHANNEL="$EXISTING_CHANNEL" + echo "Preserving existing channel: $CHANNEL" + fi +else + # No existing channel found, use beta as default + [[ -z "$CHANNEL" ]] && CHANNEL="beta" + echo "No existing channel found, using: $CHANNEL" +fi + +echo "Target channel: $CHANNEL" +``` + +### Step 4: Clone olminstall Repository + +```bash +OLMINSTALL_REPO="https://gitlab.cee.redhat.com/data-hub/olminstall.git" +OLMINSTALL_DIR="/tmp/olminstall" + +if [ -d "$OLMINSTALL_DIR" ]; then + echo "Updating existing clone..." + git -C "$OLMINSTALL_DIR" pull --rebase --quiet 2>/dev/null || true +else + echo "Cloning from $OLMINSTALL_REPO..." + git clone --quiet "$OLMINSTALL_REPO" "$OLMINSTALL_DIR" +fi + +[[ -d "$OLMINSTALL_DIR" ]] || die "Failed to clone olminstall" +echo "olminstall ready" +``` + +### Step 5: Update RHOAI Catalog + +```bash +cd "$OLMINSTALL_DIR" +bash setup.sh -t operator -i "$IMAGE" -u "$CHANNEL" +``` + +This updates: +- **CatalogSource**: `rhoai-catalog-dev` with new image +- **Subscription**: May update to new channel if specified + +### Step 6: Force Catalog Refresh + +```bash +# Force catalog to pull fresh image by deleting the pod +echo "Forcing catalog refresh to ensure latest component images..." 
+ +CATALOG_POD=$(oc get pod -n openshift-marketplace -l olm.catalogSource=rhoai-catalog-dev -o name 2>/dev/null | head -1) + +if [[ -n "$CATALOG_POD" ]]; then + echo "Deleting catalog pod to force fresh image pull..." + oc delete "$CATALOG_POD" -n openshift-marketplace 2>/dev/null || true + + # Wait for new catalog pod to be ready + TIMEOUT=120 + INTERVAL=5 + ELAPSED=0 + + while [[ $ELAPSED -lt $TIMEOUT ]]; do + NEW_POD=$(oc get pod -n openshift-marketplace -l olm.catalogSource=rhoai-catalog-dev -o jsonpath="{.items[0].status.phase}" 2>/dev/null || echo "") + + if [[ "$NEW_POD" == "Running" ]]; then + echo "✅ Catalog refreshed with latest image" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for new catalog pod... (${ELAPSED}s/${TIMEOUT}s)" + done + + if [[ "$NEW_POD" != "Running" ]]; then + echo "⚠️ WARNING: Catalog pod not ready, image comparison may use stale data" + fi +else + echo "ℹ️ Catalog pod not found yet, will be created fresh" +fi +``` + +### Step 7: Wait for Operator CSV + +```bash +# Wait up to 600 seconds for CSV to reach Succeeded +CSV_PHASE="" +TIMEOUT=600 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CSV_LINE=$(oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator | grep -v Replacing || echo "") + + if [[ -n "$CSV_LINE" ]]; then + CSV_NAME=$(echo "$CSV_LINE" | awk "{print \$1}") + CSV_PHASE=$(echo "$CSV_LINE" | awk "{print \$NF}") + echo "CSV: $CSV_NAME, Phase: $CSV_PHASE" + + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + echo "✅ Operator CSV is in Succeeded state" + break + fi + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for rhods-operator CSV... 
(${ELAPSED}s/${TIMEOUT}s)" +done + +[[ "$CSV_PHASE" == "Succeeded" ]] || die "Operator did not reach Succeeded phase within ${TIMEOUT}s" +``` + +### Step 8: Check for Newer Component Images (Critical for Updates) + +```bash +echo "" +echo "=== Checking for Newer Component Images in Catalog ===" + +# Verify catalog source is using the target image +CATALOG_SOURCE_IMAGE=$(oc get catalogsource rhoai-catalog-dev -n openshift-marketplace -o jsonpath='{.spec.image}' 2>/dev/null || echo "") + +if [[ -n "$CATALOG_SOURCE_IMAGE" ]]; then + echo "CatalogSource image: $CATALOG_SOURCE_IMAGE" + + if [[ "$CATALOG_SOURCE_IMAGE" != "$IMAGE" ]]; then + echo "⚠️ WARNING: CatalogSource image doesn't match target!" + echo " Expected: $IMAGE" + echo " Actual: $CATALOG_SOURCE_IMAGE" + fi +else + echo "⚠️ WARNING: Could not verify CatalogSource image" +fi + +# Get current CSV +CURRENT_CSV=$(oc get csv -n redhat-ods-operator -o jsonpath='{.items[0].metadata.name}' 2>/dev/null | grep rhods-operator) + +# Get catalog pod +CATALOG_POD=$(oc get pod -n openshift-marketplace -l olm.catalogSource=rhoai-catalog-dev -o name 2>/dev/null | head -1) + +if [[ -z "$CATALOG_POD" ]]; then + echo "ℹ️ Catalog pod not found, skipping image comparison" +else + echo "Comparing all component images between CSV and catalog..." 
+ + # Get all relatedImages from current CSV + CURRENT_IMAGES=$(oc get csv "$CURRENT_CSV" -n redhat-ods-operator -o json 2>/dev/null | \ + jq -r '.spec.relatedImages[] | "\(.name)|\(.image)"' 2>/dev/null || echo "") + + if [[ -z "$CURRENT_IMAGES" ]]; then + echo "⚠️ Could not retrieve current CSV images" + else + # Get catalog.yaml content once + CATALOG_YAML=$(oc exec -n openshift-marketplace "$CATALOG_POD" -- cat /configs/rhods-operator/catalog.yaml 2>/dev/null || echo "") + + if [[ -z "$CATALOG_YAML" ]]; then + echo "⚠️ Could not retrieve catalog images" + else + IMAGES_DIFFER=false + DIFF_COUNT=0 + + # Compare each image + while IFS='|' read -r img_name img_url; do + [[ -z "$img_name" ]] && continue + + # Extract catalog image for this component + CATALOG_IMAGE=$(echo "$CATALOG_YAML" | grep -A 1 "name: $img_name" | grep "image:" | awk '{print $3}' || echo "") + + if [[ -n "$CATALOG_IMAGE" && "$img_url" != "$CATALOG_IMAGE" ]]; then + # Extract just the digest for cleaner output + CURRENT_DIGEST="${img_url##*@}" + CATALOG_DIGEST="${CATALOG_IMAGE##*@}" + + # Only report if digests actually differ (not just registry URLs) + if [[ "$CURRENT_DIGEST" != "$CATALOG_DIGEST" ]]; then + echo "⚠️ Newer image found: $img_name" + echo " Current: ${CURRENT_DIGEST:0:20}..." + echo " Catalog: ${CATALOG_DIGEST:0:20}..." + IMAGES_DIFFER=true + DIFF_COUNT=$((DIFF_COUNT + 1)) + fi + fi + done <<< "$CURRENT_IMAGES" + + if [[ "$IMAGES_DIFFER" == "true" ]]; then + echo "" + echo "Found $DIFF_COUNT component image(s) with newer versions in catalog." + echo "CSV version is unchanged, but component images have been updated." + echo "Forcing subscription reinstall to pick up newer images..." 
+ echo "" + + # Trigger forced reinstall - SEE STEP 9 BELOW + else + echo "✅ All component images are up to date" + fi + fi + fi +fi +``` + +**Why this matters:** +- OLM may not automatically update if CSV version hasn't changed +- Component images can be updated in the catalog without CSV version bump +- Without forced reinstall, you'd be running old component images + +### Step 9: Perform Forced Reinstall (If Newer Images Found) + +This step only runs if newer component images were detected in Step 8. + +```bash +# Get current subscription info +SUB_NAME=$(oc get subscription -n redhat-ods-operator -o jsonpath='{.items[0].metadata.name}') +CSV_NAME=$(oc get csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator -o jsonpath='{.items[0].metadata.name}') +CURRENT_CHANNEL=$(oc get subscription -n redhat-ods-operator -o jsonpath='{.items[0].spec.channel}') + +echo "Current subscription: $SUB_NAME" +echo "Current CSV: $CSV_NAME" +echo "Current channel: $CURRENT_CHANNEL" + +# Delete CSV +echo "Deleting CSV..." +oc delete csv "$CSV_NAME" -n redhat-ods-operator || true +sleep 10 + +# Delete subscription +echo "Deleting subscription..." +oc delete subscription "$SUB_NAME" -n redhat-ods-operator || true +sleep 5 + +# Recreate subscription with same channel +echo "Recreating subscription (channel: $CURRENT_CHANNEL)..." +cat > /tmp/subscription-rhoai.yaml << YAML +apiVersion: operators.coreos.com/v1alpha1 +kind: Subscription +metadata: + name: rhoai-operator-dev + namespace: redhat-ods-operator +spec: + channel: ${CURRENT_CHANNEL} + installPlanApproval: Automatic + name: rhods-operator + source: rhoai-catalog-dev + sourceNamespace: openshift-marketplace +YAML + +oc apply -f /tmp/subscription-rhoai.yaml + +# Wait for new install plan +echo "Waiting for new install plan..." +sleep 15 + +# Wait for CSV to be installed +echo "Waiting for CSV to be installed from updated catalog..." 
+TIMEOUT=300 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CSV_PHASE=$(oc get csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator -o jsonpath="{.items[0].status.phase}" 2>/dev/null || echo "") + NEW_CSV_NAME=$(oc get csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator -o jsonpath="{.items[0].metadata.name}" 2>/dev/null || echo "") + + echo "CSV: $NEW_CSV_NAME, Phase: ${CSV_PHASE:-Pending}" + + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + echo "✅ CSV reinstalled successfully" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for CSV after reinstall... (${ELAPSED}s/${TIMEOUT}s)" +done + +[[ "$CSV_PHASE" == "Succeeded" ]] || die "CSV did not reach Succeeded after forced reinstall" + +# Verify new images +echo "" +echo "=== Verifying New Component Images ===" +NEW_AUTOML=$(oc get csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator -o jsonpath='{.spec.relatedImages[?(@.name=="odh_mod_arch_automl_image")].image}' 2>/dev/null || echo "") +NEW_AUTORAG=$(oc get csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator -o jsonpath='{.spec.relatedImages[?(@.name=="odh_mod_arch_autorag_image")].image}' 2>/dev/null || echo "") + +[[ -n "$NEW_AUTOML" ]] && echo "AutoML: ${NEW_AUTOML##*@}" +[[ -n "$NEW_AUTORAG" ]] && echo "AutoRAG: ${NEW_AUTORAG##*@}" + +echo "✅ Operator reinstalled with newer component images" +``` + +### Step 10: Configure DSC Components + +```bash +# Wait for DSC to exist +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get datasciencecluster default-dsc &>/dev/null; then + echo "✅ DataScienceCluster found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for DataScienceCluster... (${ELAPSED}s/${TIMEOUT}s)" +done + +if ! 
oc get datasciencecluster default-dsc &>/dev/null; then + echo "⚠️ WARNING: DSC not found. You may need to create it manually." +else + # Patch DSC to enable required components + cat > /tmp/dsc-components-patch.yaml << 'YAML' +spec: + components: + aipipelines: + managementState: Managed + argoWorkflowsControllers: + managementState: Managed + llamastackoperator: + managementState: Managed + mlflowoperator: + managementState: Managed + trainer: + managementState: Removed +YAML + + oc patch datasciencecluster default-dsc --type merge --patch-file /tmp/dsc-components-patch.yaml || \ + die "Failed to patch DataScienceCluster" + + echo "✅ DSC component configuration applied:" + echo " - aipipelines: Managed (with argoWorkflowsControllers)" + echo " - llamastackoperator: Managed" + echo " - mlflowoperator: Managed" + echo " - trainer: Removed (requires JobSet operator)" + + sleep 5 +fi +``` + +### Step 11: Wait for DSC Ready + +```bash +# Wait for DataScienceCluster to be Ready +TIMEOUT=600 +INTERVAL=15 +ELAPSED=0 +DSC_PHASE="" + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + DSC_PHASE=$(oc get datasciencecluster -o jsonpath="{.items[0].status.phase}" 2>/dev/null || echo "Unknown") + echo "DSC phase: $DSC_PHASE" + + if [[ "$DSC_PHASE" == "Ready" ]]; then + echo "✅ DataScienceCluster is Ready" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for DataScienceCluster... 
(${ELAPSED}s/${TIMEOUT}s)" +done + +if [[ "$DSC_PHASE" != "Ready" ]]; then + echo "⚠️ WARNING: DSC is not Ready after ${TIMEOUT}s (current: ${DSC_PHASE:-Unknown})" + echo "Not-ready components:" + oc get dsc default-dsc -o json 2>/dev/null | \ + jq -r '.status.conditions[] | select(.status=="False") | select(.message | test("Removed") | not) | " \(.type): \(.message)"' 2>/dev/null || true +fi +``` + +### Step 12: Wait for Dashboard + +```bash +# Wait for dashboard deployment to be ready +TIMEOUT=300 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + READY=$(oc get deployment rhods-dashboard -n redhat-ods-applications -o jsonpath="{.status.readyReplicas}" 2>/dev/null || echo "0") + DESIRED=$(oc get deployment rhods-dashboard -n redhat-ods-applications -o jsonpath="{.spec.replicas}" 2>/dev/null || echo "0") + + if [[ "$READY" -gt 0 && "$READY" -eq "$DESIRED" ]]; then + echo "✅ Dashboard deployment is ready" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for dashboard deployment... (${ELAPSED}s/${TIMEOUT}s)" +done + +if [[ "$READY" -lt "$DESIRED" ]]; then + echo "⚠️ WARNING: Dashboard deployment not fully ready" +fi + +echo "Dashboard containers:" +oc get deployment rhods-dashboard -n redhat-ods-applications \ + -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\n"}{end}' 2>/dev/null || \ + echo " Dashboard deployment not found" +``` + +### Step 13: Configure Dashboard Features + +```bash +# Wait for OdhDashboardConfig to exist +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + echo "✅ OdhDashboardConfig found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for OdhDashboardConfig... (${ELAPSED}s/${TIMEOUT}s)" +done + +if ! 
oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + echo "⚠️ WARNING: OdhDashboardConfig not found yet, feature flags will be configured when available" +else + # Enable feature flags + oc patch odhdashboardconfig odh-dashboard-config -n redhat-ods-applications --type merge -p '{ + "spec": { + "dashboardConfig": { + "automl": true, + "autorag": true, + "genAiStudio": true + } + } + }' || { + echo "⚠️ WARNING: Failed to patch dashboard config, feature flags may need manual configuration" + } + + echo "✅ Dashboard feature flags configured:" + echo " - automl: enabled" + echo " - autorag: enabled" + echo " - genAiStudio: enabled" + + # Restart dashboard to pick up changes + echo "Restarting dashboard to apply feature flag changes..." + oc rollout restart deployment rhods-dashboard -n redhat-ods-applications 2>/dev/null || true + sleep 3 +fi +``` + +### Step 14: Verify Update + +```bash +echo "" +echo "=== Update Summary ===" + +# Show CSV +echo "" +echo "CSV:" +oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator || echo " WARNING: CSV not found" + +# Show Dashboard URL +echo "" +echo "Dashboard:" +DASHBOARD_ROUTE=$(oc get route rhods-dashboard -n redhat-ods-applications -o jsonpath='{.spec.host}' 2>/dev/null || echo "") +if [[ -n "$DASHBOARD_ROUTE" ]]; then + echo " https://$DASHBOARD_ROUTE" +else + echo " WARNING: Dashboard route not found yet" +fi + +echo "" +echo "✅ RHOAI update complete!" 
+``` + +## Output + +The command creates a report at `artifacts/rhoai-update/reports/update-report-[timestamp].md` with: +- Update parameters (version, channel, image) +- Operator CSV details (old vs new) +- Component image comparison results +- Whether forced reinstall was performed +- DataScienceCluster status +- Dashboard URL + +## Usage Examples + +```bash +# Update to latest RHOAI (preserves current channel) +/rhoai-update + +# Update to RHOAI 3.4 EA build 2 +/rhoai-update 3.4-ea.2 + +# Update to RHOAI 3.3 stable and change channel +/rhoai-update 3.3 -c stable-3.3 + +# Update with specific SHA digest +/rhoai-update 3.4@sha256:abc123def456... +``` + +Or simply ask: +- "Update RHOAI to latest" +- "Upgrade to RHOAI 3.4" +- "Update RHOAI to latest nightly" + +## Common Issues + +**Problem:** Component images not updating even though catalog was updated +**Solution:** This is expected - the forced reinstall (Step 9) handles this automatically + +**Problem:** Channel change warning appears +**Solution:** Confirm you want to change channels, or let it preserve the existing channel + +**Problem:** DSC components revert to default after update +**Solution:** The command re-applies component configuration in Step 10 + +**Problem:** Dashboard shows old features after update +**Solution:** Feature flags are re-applied in Step 13, dashboard pod is restarted + +## Next Steps + +After updating: +1. Verify all workloads are still running +2. Check dashboard for new features +3. Test model deployments +4. Review component logs for any errors + +To check current RHOAI version and build info, use `/rhoai-version`. 
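The usage examples above all flow through the Step 1 argument-to-image mapping, which can be exercised standalone (no cluster needed) to sanity-check which catalog image a given argument will resolve to. A sketch (the `build_image` helper name is illustrative; the mapping mirrors Step 1):

```bash
#!/usr/bin/env bash
# Reproduce the Step 1 mapping: an empty argument falls back to the
# default tag, a full image reference (contains "/") passes through,
# a "rhoai-*" tag gets the registry prefix, and a bare version gets
# both the registry prefix and the "rhoai-" tag prefix.
build_image() {
  local v="$1"
  if [[ -z "$v" ]]; then
    echo "quay.io/rhoai/rhoai-fbc-fragment:rhoai-3.4"
  elif [[ "$v" == *"/"* ]]; then
    echo "$v"
  elif [[ "$v" == rhoai-* ]]; then
    echo "quay.io/rhoai/rhoai-fbc-fragment:${v}"
  else
    echo "quay.io/rhoai/rhoai-fbc-fragment:rhoai-${v}"
  fi
}

build_image "3.4-ea.2"   # → quay.io/rhoai/rhoai-fbc-fragment:rhoai-3.4-ea.2
build_image ""           # → quay.io/rhoai/rhoai-fbc-fragment:rhoai-3.4
```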
diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-verify.md b/workflows/rhoai-manager/.claude/commands/rhoai-verify.md new file mode 100644 index 00000000..469c8dc7 --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/rhoai-verify.md @@ -0,0 +1,654 @@ +# /rhoai-verify - Post-Install/Update Verification Tests for RHOAI + +Run a comprehensive suite of verification tests after RHOAI install or update to confirm all components are healthy and functional. Works on both connected and disconnected clusters. + +## Command Usage + +```bash +# Run all tests +/rhoai-verify + +# Run specific test categories +/rhoai-verify quick # Operator + DSC + pod health only +/rhoai-verify full # All tests including smoke tests +``` + +## Inputs + +| Input | Required | Default | Description | +|-------|----------|---------|-------------| +| `quick` / `full` | No | `full` | Test scope | + +## Process + +### Step 1: Initialize Test Report + +```bash +REPORT_FILE="artifacts/rhoai-manager/reports/verify-$(date +%Y%m%d-%H%M%S).md" +mkdir -p artifacts/rhoai-manager/reports + +PASS_COUNT=0 +FAIL_COUNT=0 +WARN_COUNT=0 + +pass() { PASS_COUNT=$((PASS_COUNT + 1)); echo " PASS: $1"; } +fail() { FAIL_COUNT=$((FAIL_COUNT + 1)); echo " FAIL: $1"; } +warn() { WARN_COUNT=$((WARN_COUNT + 1)); echo " WARN: $1"; } + +echo "=== RHOAI Post-Install/Update Verification ===" +echo "Cluster: $(oc whoami --show-server 2>/dev/null)" +echo "User: $(oc whoami 2>/dev/null)" +echo "Date: $(date -u +%Y-%m-%dT%H:%M:%SZ)" +echo "" +``` + +### Step 2: Operator Health + +Verify the RHOAI operator CSV is installed and in Succeeded phase. + +```bash +echo "=== Test 1: Operator Health ===" + +# 2a. 
Check CSV +CSV_LINE=$(oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator | grep -v Replacing || echo "") + +if [[ -z "$CSV_LINE" ]]; then + fail "No RHOAI CSV found in redhat-ods-operator namespace" +else + CSV_NAME=$(echo "$CSV_LINE" | awk '{print $1}') + CSV_PHASE=$(echo "$CSV_LINE" | awk '{print $NF}') + CSV_VERSION=$(oc get csv "$CSV_NAME" -n redhat-ods-operator -o jsonpath='{.spec.version}' 2>/dev/null) + + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + pass "CSV $CSV_NAME is Succeeded (version: $CSV_VERSION)" + else + fail "CSV $CSV_NAME phase is $CSV_PHASE (expected: Succeeded)" + fi +fi + +# 2b. Check Subscription +SUB=$(oc get subscription -n redhat-ods-operator -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "") +if [[ -n "$SUB" ]]; then + SUB_STATE=$(oc get subscription "$SUB" -n redhat-ods-operator -o jsonpath='{.status.state}' 2>/dev/null || echo "Unknown") + SUB_CHANNEL=$(oc get subscription "$SUB" -n redhat-ods-operator -o jsonpath='{.spec.channel}' 2>/dev/null || echo "Unknown") + SUB_SOURCE=$(oc get subscription "$SUB" -n redhat-ods-operator -o jsonpath='{.spec.source}' 2>/dev/null || echo "Unknown") + + if [[ "$SUB_STATE" == "AtLatestKnown" ]]; then + pass "Subscription $SUB state: $SUB_STATE (channel: $SUB_CHANNEL, source: $SUB_SOURCE)" + else + warn "Subscription $SUB state: $SUB_STATE (expected: AtLatestKnown)" + fi +else + fail "No RHOAI subscription found" +fi + +# 2c. 
Check CatalogSource +CATALOG=$(oc get subscription "$SUB" -n redhat-ods-operator -o jsonpath='{.spec.source}' 2>/dev/null || echo "") +if [[ -n "$CATALOG" ]]; then + CATALOG_STATE=$(oc get catalogsource "$CATALOG" -n openshift-marketplace \ + -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null || echo "Unknown") + + if [[ "$CATALOG_STATE" == "READY" ]]; then + pass "CatalogSource $CATALOG is READY" + else + fail "CatalogSource $CATALOG state: $CATALOG_STATE (expected: READY)" + fi +fi + +echo "" +``` + +### Step 3: DataScienceCluster Health + +Verify DSC exists and is in Ready phase with all managed components healthy. + +```bash +echo "=== Test 2: DataScienceCluster Health ===" + +# 3a. Check DSCInitialization +DSCI_PHASE=$(oc get dscinitializations default-dsci -o jsonpath='{.status.phase}' 2>/dev/null || echo "NotFound") +if [[ "$DSCI_PHASE" == "Ready" ]]; then + pass "DSCInitialization phase: Ready" +else + fail "DSCInitialization phase: $DSCI_PHASE (expected: Ready)" +fi + +# 3b. Check DSC phase +DSC_PHASE=$(oc get datasciencecluster -o jsonpath='{.items[0].status.phase}' 2>/dev/null || echo "NotFound") +if [[ "$DSC_PHASE" == "Ready" ]]; then + pass "DataScienceCluster phase: Ready" +else + fail "DataScienceCluster phase: $DSC_PHASE (expected: Ready)" +fi + +# 3c. 
Check individual component conditions +DSC_CONDITIONS=$(oc get datasciencecluster -o json 2>/dev/null | \ + jq -r '.items[0].status.conditions[] | "\(.type)|\(.status)|\(.message // "")"' 2>/dev/null || echo "") + +if [[ -n "$DSC_CONDITIONS" ]]; then + while IFS='|' read -r ctype cstatus cmsg; do + [[ -z "$ctype" ]] && continue + # Skip conditions that are about Removed components + if echo "$cmsg" | grep -qi "removed"; then + continue + fi + if [[ "$cstatus" == "True" ]]; then + pass "Component $ctype: Ready" + else + fail "Component $ctype: Not Ready ($cmsg)" + fi + done <<< "$DSC_CONDITIONS" +fi + +echo "" +``` + +### Step 4: Pod Health Across RHOAI Namespaces + +Check all pods in RHOAI-related namespaces for failures. + +```bash +echo "=== Test 3: Pod Health ===" + +RHOAI_NAMESPACES="redhat-ods-operator redhat-ods-applications redhat-ods-monitoring" + +for ns in $RHOAI_NAMESPACES; do + # Skip if namespace doesn't exist + if ! oc get namespace "$ns" &>/dev/null; then + continue + fi + + PODS=$(oc get pods -n "$ns" --no-headers 2>/dev/null || echo "") + if [[ -z "$PODS" ]]; then + warn "No pods found in $ns" + continue + fi + + TOTAL=0 + RUNNING=0 + COMPLETED=0 + ISSUES=0 + ISSUE_DETAILS="" + + while IFS= read -r line; do + [[ -z "$line" ]] && continue + TOTAL=$((TOTAL + 1)) + POD_NAME=$(echo "$line" | awk '{print $1}') + STATUS=$(echo "$line" | awk '{print $3}') + READY=$(echo "$line" | awk '{print $2}') + + case "$STATUS" in + Running) + # Check if all containers are ready + READY_NUM=$(echo "$READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$READY" | cut -d/ -f2) + if [[ "$READY_NUM" == "$TOTAL_NUM" ]]; then + RUNNING=$((RUNNING + 1)) + else + ISSUES=$((ISSUES + 1)) + ISSUE_DETAILS="${ISSUE_DETAILS}\n $POD_NAME: Running but not ready ($READY)" + fi + ;; + Completed|Succeeded) + COMPLETED=$((COMPLETED + 1)) + ;; + ImagePullBackOff|ErrImagePull) + ISSUES=$((ISSUES + 1)) + ISSUE_DETAILS="${ISSUE_DETAILS}\n $POD_NAME: $STATUS (missing image on registry)" + ;; + 
CrashLoopBackOff) + ISSUES=$((ISSUES + 1)) + ISSUE_DETAILS="${ISSUE_DETAILS}\n $POD_NAME: $STATUS (check logs: oc logs $POD_NAME -n $ns)" + ;; + *) + ISSUES=$((ISSUES + 1)) + ISSUE_DETAILS="${ISSUE_DETAILS}\n $POD_NAME: $STATUS" + ;; + esac + done <<< "$PODS" + + if [[ $ISSUES -eq 0 ]]; then + pass "$ns: $RUNNING running, $COMPLETED completed, $TOTAL total" + else + fail "$ns: $ISSUES pods with issues out of $TOTAL total" + echo -e "$ISSUE_DETAILS" + fi +done + +echo "" +``` + +### Step 5: Dashboard Accessibility + +Verify the RHOAI dashboard is reachable and responding. + +```bash +echo "=== Test 4: Dashboard Accessibility ===" + +# 5a. Check deployment +DASH_READY=$(oc get deployment rhods-dashboard -n redhat-ods-applications \ + -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo "0") +DASH_DESIRED=$(oc get deployment rhods-dashboard -n redhat-ods-applications \ + -o jsonpath='{.spec.replicas}' 2>/dev/null || echo "0") + +if [[ "$DASH_READY" -gt 0 && "$DASH_READY" -eq "$DASH_DESIRED" ]]; then + pass "Dashboard deployment ready ($DASH_READY/$DASH_DESIRED replicas)" +else + fail "Dashboard deployment not ready ($DASH_READY/$DASH_DESIRED replicas)" +fi + +# 5b. Check route exists +DASH_ROUTE=$(oc get route rhods-dashboard -n redhat-ods-applications \ + -o jsonpath='{.spec.host}' 2>/dev/null || echo "") + +if [[ -n "$DASH_ROUTE" ]]; then + pass "Dashboard route exists: https://$DASH_ROUTE" +else + fail "Dashboard route not found" +fi + +# 5c. HTTP health check (expect 403 or 200 — both mean dashboard is responding) +if [[ -n "$DASH_ROUTE" ]]; then + HTTP_CODE=$(/usr/bin/curl -sk -o /dev/null -w '%{http_code}' "https://$DASH_ROUTE" 2>/dev/null || echo "000") + + if [[ "$HTTP_CODE" == "200" || "$HTTP_CODE" == "403" || "$HTTP_CODE" == "302" ]]; then + pass "Dashboard HTTP response: $HTTP_CODE (responding)" + else + fail "Dashboard HTTP response: $HTTP_CODE (expected 200, 302, or 403)" + fi +fi + +# 5d. 
Check dashboard feature flags +if oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + AUTOML=$(oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications \ + -o jsonpath='{.spec.dashboardConfig.automl}' 2>/dev/null || echo "unset") + AUTORAG=$(oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications \ + -o jsonpath='{.spec.dashboardConfig.autorag}' 2>/dev/null || echo "unset") + GENAISTUDIO=$(oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications \ + -o jsonpath='{.spec.dashboardConfig.genAiStudio}' 2>/dev/null || echo "unset") + + echo " Dashboard features: automl=$AUTOML, autorag=$AUTORAG, genAiStudio=$GENAISTUDIO" +fi + +echo "" +``` + +### Step 6: Pipeline (Data Science Pipelines) Readiness + +Verify the DSP operator and controllers are running. If DSPAs exist, verify their health. + +```bash +echo "=== Test 5: Data Science Pipelines ===" + +# 6a. Check DSP operator deployment +DSP_OPERATOR=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "data-science-pipelines-operator" || echo "") + +if [[ -n "$DSP_OPERATOR" ]]; then + DSP_NAME=$(echo "$DSP_OPERATOR" | awk '{print $1}') + DSP_READY=$(echo "$DSP_OPERATOR" | awk '{print $2}') + READY_NUM=$(echo "$DSP_READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$DSP_READY" | cut -d/ -f2) + + if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then + pass "DSP operator deployment ready ($DSP_READY)" + else + fail "DSP operator deployment not ready ($DSP_READY)" + fi +else + warn "DSP operator deployment not found (pipelines may be set to Removed)" +fi + +# 6b. 
Check existing DSPAs +DSPA_LIST=$(oc get datasciencepipelinesapplication --all-namespaces --no-headers 2>/dev/null || echo "") + +if [[ -n "$DSPA_LIST" ]]; then + while IFS= read -r line; do + [[ -z "$line" ]] && continue + DSPA_NS=$(echo "$line" | awk '{print $1}') + DSPA_NAME=$(echo "$line" | awk '{print $2}') + DSPA_READY=$(echo "$line" | awk '{print $NF}') + + # Check DSPA status + DSPA_PHASE=$(oc get datasciencepipelinesapplication "$DSPA_NAME" -n "$DSPA_NS" \ + -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || echo "Unknown") + + if [[ "$DSPA_PHASE" == "True" ]]; then + pass "DSPA $DSPA_NS/$DSPA_NAME: Ready" + else + fail "DSPA $DSPA_NS/$DSPA_NAME: Not Ready" + fi + + # Check podToPodTLS (known issue) + POD_TLS=$(oc get datasciencepipelinesapplication "$DSPA_NAME" -n "$DSPA_NS" \ + -o jsonpath='{.spec.podToPodTLS}' 2>/dev/null || echo "unset") + if [[ "$POD_TLS" != "false" ]]; then + warn "DSPA $DSPA_NS/$DSPA_NAME: podToPodTLS=$POD_TLS (set to false if pipeline pods crash with caCertPath error)" + fi + + # Check pipeline pods in that namespace + CRASH_PODS=$(oc get pods -n "$DSPA_NS" --no-headers 2>/dev/null | grep -E "CrashLoopBackOff|ImagePullBackOff" || echo "") + if [[ -n "$CRASH_PODS" ]]; then + fail "DSPA $DSPA_NS has crashing/failing pods:" + echo "$CRASH_PODS" | while read -r pline; do + echo " $(echo "$pline" | awk '{print $1, $3}')" + done + fi + done <<< "$DSPA_LIST" +else + echo " No DSPAs configured yet (create one to test pipelines)" +fi + +echo "" +``` + +### Step 7: Workbench / Notebook Controller Readiness + +```bash +echo "=== Test 6: Workbench / Notebook Controller ===" + +# 7a. 
Check notebook controller
+# Exclude the ODH controller here; it would also match "notebook-controller" and is checked separately in 7b
+NB_CONTROLLER=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "notebook-controller" | grep -v "odh-notebook-controller" | head -1 || echo "")
+
+if [[ -n "$NB_CONTROLLER" ]]; then
+  NB_NAME=$(echo "$NB_CONTROLLER" | awk '{print $1}')
+  NB_READY=$(echo "$NB_CONTROLLER" | awk '{print $2}')
+  READY_NUM=$(echo "$NB_READY" | cut -d/ -f1)
+  TOTAL_NUM=$(echo "$NB_READY" | cut -d/ -f2)
+
+  if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then
+    pass "Notebook controller ready ($NB_READY)"
+  else
+    fail "Notebook controller not ready ($NB_READY)"
+  fi
+else
+  warn "Notebook controller deployment not found"
+fi
+
+# 7b. Check ODH notebook controller
+ODH_NB=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "odh-notebook-controller" | head -1 || echo "")
+
+if [[ -n "$ODH_NB" ]]; then
+  ODH_NB_NAME=$(echo "$ODH_NB" | awk '{print $1}')
+  ODH_NB_READY=$(echo "$ODH_NB" | awk '{print $2}')
+  READY_NUM=$(echo "$ODH_NB_READY" | cut -d/ -f1)
+  TOTAL_NUM=$(echo "$ODH_NB_READY" | cut -d/ -f2)
+
+  if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then
+    pass "ODH notebook controller ready ($ODH_NB_READY)"
+  else
+    fail "ODH notebook controller not ready ($ODH_NB_READY)"
+  fi
+fi
+
+# 7c. Check workbench namespace
+# jsonpath returns an empty string (exit 0, not an error) when the field is unset,
+# so apply the default explicitly instead of relying on || echo
+WB_NS=$(oc get datasciencecluster -o jsonpath='{.items[0].spec.components.workbenches.workbenchNamespace}' 2>/dev/null || echo "")
+WB_NS=${WB_NS:-rhods-notebooks}
+if oc get namespace "$WB_NS" &>/dev/null; then
+  pass "Workbench namespace $WB_NS exists"
+else
+  warn "Workbench namespace $WB_NS not found"
+fi
+
+echo ""
+```
+
+### Step 8: Model Serving Readiness (KServe / ModelMesh)
+
+```bash
+echo "=== Test 7: Model Serving ==="
+
+# 8a. 
Check KServe controller +KSERVE=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "kserve-controller" | head -1 || echo "") + +if [[ -n "$KSERVE" ]]; then + KS_READY=$(echo "$KSERVE" | awk '{print $2}') + READY_NUM=$(echo "$KS_READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$KS_READY" | cut -d/ -f2) + + if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then + pass "KServe controller ready ($KS_READY)" + else + fail "KServe controller not ready ($KS_READY)" + fi +else + warn "KServe controller not found (kserve may be Removed)" +fi + +# 8b. Check ModelMesh controller +MODELMESH=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "modelmesh-controller" | head -1 || echo "") + +if [[ -n "$MODELMESH" ]]; then + MM_READY=$(echo "$MODELMESH" | awk '{print $2}') + READY_NUM=$(echo "$MM_READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$MM_READY" | cut -d/ -f2) + + if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then + pass "ModelMesh controller ready ($MM_READY)" + else + fail "ModelMesh controller not ready ($MM_READY)" + fi +else + echo " ModelMesh controller not found (may not be deployed)" +fi + +# 8c. Check ServingRuntimes exist +SR_COUNT=$(oc get servingruntimes -n redhat-ods-applications --no-headers 2>/dev/null | wc -l | tr -d ' ') +if [[ "$SR_COUNT" -gt 0 ]]; then + pass "Found $SR_COUNT ServingRuntime(s) in redhat-ods-applications" +else + warn "No ServingRuntimes found in redhat-ods-applications" +fi + +# 8d. 
Check InferenceServices across cluster +IS_COUNT=$(oc get inferenceservice --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' ') +if [[ "$IS_COUNT" -gt 0 ]]; then + echo " Found $IS_COUNT InferenceService(s) across cluster" + # Check each for readiness + oc get inferenceservice --all-namespaces --no-headers 2>/dev/null | while read -r line; do + IS_NS=$(echo "$line" | awk '{print $1}') + IS_NAME=$(echo "$line" | awk '{print $2}') + IS_READY=$(echo "$line" | awk '{print $NF}') + echo " $IS_NS/$IS_NAME: $IS_READY" + done +else + echo " No InferenceServices deployed" +fi + +echo "" +``` + +### Step 9: Model Registry Readiness + +```bash +echo "=== Test 8: Model Registry ===" + +MR_OPERATOR=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "model-registry-operator" | head -1 || echo "") + +if [[ -n "$MR_OPERATOR" ]]; then + MR_READY=$(echo "$MR_OPERATOR" | awk '{print $2}') + READY_NUM=$(echo "$MR_READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$MR_READY" | cut -d/ -f2) + + if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then + pass "Model Registry operator ready ($MR_READY)" + else + fail "Model Registry operator not ready ($MR_READY)" + fi +else + warn "Model Registry operator not found (may be set to Removed)" +fi + +# Check registry namespace +MR_NS=$(oc get datasciencecluster -o jsonpath='{.items[0].spec.components.modelregistry.registriesNamespace}' 2>/dev/null || echo "") +if [[ -n "$MR_NS" ]]; then + if oc get namespace "$MR_NS" &>/dev/null; then + pass "Model Registry namespace $MR_NS exists" + else + warn "Model Registry namespace $MR_NS not found" + fi +fi + +echo "" +``` + +### Step 10: TrustyAI / EvalHub Readiness + +```bash +echo "=== Test 9: TrustyAI / EvalHub ===" + +# 10a. 
TrustyAI operator
+TRUSTYAI=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "trustyai" | head -1 || echo "")
+
+if [[ -n "$TRUSTYAI" ]]; then
+  TA_READY=$(echo "$TRUSTYAI" | awk '{print $2}')
+  READY_NUM=$(echo "$TA_READY" | cut -d/ -f1)
+  TOTAL_NUM=$(echo "$TA_READY" | cut -d/ -f2)
+
+  if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then
+    pass "TrustyAI operator ready ($TA_READY)"
+  else
+    fail "TrustyAI operator not ready ($TA_READY)"
+  fi
+else
+  warn "TrustyAI operator not found (may be set to Removed)"
+fi
+
+# 10b. Check EvalHub namespace and resources
+if oc get namespace evalhub &>/dev/null; then
+  EVALHUB_PODS=$(oc get pods -n evalhub --no-headers 2>/dev/null || echo "")
+  # grep -c always prints a count (even "0") and merely exits non-zero on no match,
+  # so guard with || true; using || echo "0" would append a second "0" line and
+  # break the numeric comparisons below
+  EH_TOTAL=$(echo "$EVALHUB_PODS" | grep -c '.' || true)
+  EH_RUNNING=$(echo "$EVALHUB_PODS" | grep -c "Running" || true)
+  EH_ISSUES=$(echo "$EVALHUB_PODS" | grep -cE "CrashLoopBackOff|ImagePullBackOff|Error" || true)
+
+  if [[ "$EH_ISSUES" -eq 0 && "$EH_RUNNING" -gt 0 ]]; then
+    pass "EvalHub namespace: $EH_RUNNING/$EH_TOTAL pods running"
+  elif [[ "$EH_ISSUES" -gt 0 ]]; then
+    fail "EvalHub namespace: $EH_ISSUES pods with issues"
+  else
+    warn "EvalHub namespace exists but no running pods"
+  fi
+
+  # Check EvalHub route
+  EH_ROUTE=$(oc get route -n evalhub --no-headers 2>/dev/null | head -1 | awk '{print $2}' || echo "")
+  if [[ -n "$EH_ROUTE" ]]; then
+    pass "EvalHub route: https://$EH_ROUTE"
+  fi
+else
+  echo "  EvalHub namespace not found (not configured)"
+fi
+
+echo ""
+```
+
+### Step 11: Dependent Operator Health
+
+Check that key dependent operators (service mesh, serverless, pipelines, cert-manager) are installed and healthy. 
+ +```bash +echo "=== Test 10: Dependent Operators ===" + +DEPENDENT_OPERATORS=( + "servicemeshoperator" + "openshift-pipelines-operator-rh" + "serverless-operator" + "openshift-cert-manager-operator" +) + +for op in "${DEPENDENT_OPERATORS[@]}"; do + OP_CSV=$(oc get csv --all-namespaces 2>/dev/null | grep "$op" | grep -v Replacing | head -1 || echo "") + + if [[ -n "$OP_CSV" ]]; then + OP_PHASE=$(echo "$OP_CSV" | awk '{print $NF}') + OP_NAME=$(echo "$OP_CSV" | awk '{print $2}') + if [[ "$OP_PHASE" == "Succeeded" ]]; then + pass "$op ($OP_NAME): Succeeded" + else + warn "$op ($OP_NAME): $OP_PHASE" + fi + else + warn "$op: not installed" + fi +done + +echo "" +``` + +### Step 12: Disconnected-Specific Checks (auto-detected) + +If running on a disconnected cluster (detected by IDMS presence), run additional checks. + +```bash +echo "=== Test 11: Disconnected Cluster Checks ===" + +IDMS_COUNT=$(oc get imagedigestmirrorset --no-headers 2>/dev/null | wc -l | tr -d ' ') + +if [[ "$IDMS_COUNT" -gt 0 ]]; then + echo " Detected disconnected cluster ($IDMS_COUNT IDMS entries)" + + # Check IDMS entries for key RHOAI sources + REQUIRED_SOURCES=("registry.redhat.io/rhoai" "registry.redhat.io/rhel9" "registry.redhat.io/ubi9") + IDMS_SOURCES=$(oc get imagedigestmirrorset -o jsonpath='{range .items[*]}{range .spec.imageDigestMirrors[*]}{.source}{"\n"}{end}{end}' 2>/dev/null | sort -u) + + for source in "${REQUIRED_SOURCES[@]}"; do + if echo "$IDMS_SOURCES" | grep -q "$source"; then + pass "IDMS entry exists for $source" + else + fail "IDMS entry missing for $source" + fi + done + + # Check for any ImagePullBackOff across ALL namespaces (not just RHOAI) + IPB_PODS=$(oc get pods --all-namespaces --no-headers 2>/dev/null | grep -E "ImagePullBackOff|ErrImagePull" | head -10 || echo "") + if [[ -n "$IPB_PODS" ]]; then + IPB_COUNT=$(echo "$IPB_PODS" | wc -l | tr -d ' ') + warn "$IPB_COUNT pods with ImagePullBackOff across cluster (may indicate missing mirrored images)" + echo "$IPB_PODS" 
| while read -r line; do + echo " $(echo "$line" | awk '{print $1"/"$2": "$4}')" + done + else + pass "No ImagePullBackOff pods across cluster" + fi +else + echo " Connected cluster detected (no IDMS entries) — skipping disconnected checks" +fi + +echo "" +``` + +### Step 13: Test Summary + +```bash +echo "==========================================" +echo " RHOAI Verification Summary" +echo "==========================================" +echo "" +echo " PASS: $PASS_COUNT" +echo " FAIL: $FAIL_COUNT" +echo " WARN: $WARN_COUNT" +echo "" + +if [[ $FAIL_COUNT -eq 0 ]]; then + echo " Result: ALL TESTS PASSED" +else + echo " Result: $FAIL_COUNT FAILURE(S) DETECTED" + echo "" + echo " Troubleshooting:" + echo " - ImagePullBackOff: Run /mirror-images to mirror missing images" + echo " - CrashLoopBackOff: Check pod logs (may need podToPodTLS workaround)" + echo " - DSC not Ready: Check component conditions with: oc get dsc -o yaml" + echo " - CSV not Succeeded: Check InstallPlan and operator logs" +fi + +echo "" +echo "Cluster: $(oc whoami --show-server 2>/dev/null)" +echo "RHOAI Version: ${CSV_VERSION:-Unknown}" +``` + +Write the test results to the report file in markdown format for archival. + +## Output + +Report saved to `artifacts/rhoai-manager/reports/verify-[timestamp].md` with: +- Cluster info and RHOAI version +- Per-test PASS/FAIL/WARN results +- Summary counts +- Troubleshooting guidance for failures diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-version.md b/workflows/rhoai-manager/.claude/commands/rhoai-version.md new file mode 100644 index 00000000..87877110 --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/rhoai-version.md @@ -0,0 +1,169 @@ +# /rhoai-version - Detect RHOAI Version and Build Information + +Detect the Red Hat OpenShift AI (RHOAI) version and build information installed on the currently connected OpenShift cluster. 
+ +## Purpose + +This command provides comprehensive version information about the RHOAI installation including operator version, component status, and deployed image digests. + +## Prerequisites + +- Must be logged into an OpenShift cluster (use `/oc-login` first if needed) +- RHOAI must be installed on the cluster + +## Steps + +### 1. Verify OpenShift Login + +Run `oc whoami` and `oc whoami --show-server` to confirm you are logged into an OpenShift cluster. If not logged in, stop and inform the user they need to authenticate first with `/oc-login`. + +### 2. Detect RHOAI Operator Subscription + +Run the following to extract subscription details directly (avoid `-o yaml` as it produces excessive output): + +```bash +oc get subscriptions.operators.coreos.com rhods-operator -n redhat-ods-operator -o jsonpath='Channel: {.spec.channel}{"\n"}Source: {.spec.source}{"\n"}Approval: {.spec.installPlanApproval}{"\n"}Current CSV: {.status.currentCSV}{"\n"}Installed CSV: {.status.installedCSV}{"\n"}Starting CSV: {.spec.startingCSV}{"\n"}' 2>/dev/null +``` + +If no subscription found in `redhat-ods-operator`, check `openshift-operators`: + +```bash +oc get subscriptions.operators.coreos.com -n openshift-operators -o jsonpath='{range .items[?(@.spec.name=="rhods-operator")]}Channel: {.spec.channel}{"\n"}Source: {.spec.source}{"\n"}Approval: {.spec.installPlanApproval}{"\n"}Current CSV: {.status.currentCSV}{"\n"}{end}' 2>/dev/null +``` + +### 3. Check ClusterServiceVersion (CSV) + +Get only the RHOAI operator CSV (filter by name to avoid noisy output): + +```bash +oc get csv -n redhat-ods-operator -o custom-columns=NAME:.metadata.name,DISPLAY:.spec.displayName,VERSION:.spec.version,PHASE:.status.phase 2>/dev/null | grep -E 'NAME|rhods-operator' +``` + +### 4. Check DataScienceCluster + +**Do NOT use `-o yaml` for the full DSC resource** — it is very large and the jsonpath for nested dynamic component keys does not extract cleanly. 
+ +Instead, use these targeted commands: + +**Get component managementState values:** +```bash +oc get datasciencecluster default-dsc -o json 2>/dev/null | python3 -c " +import sys, json +dsc = json.load(sys.stdin) +comps = dsc.get('spec', {}).get('components', {}) +for name, cfg in sorted(comps.items()): + state = cfg.get('managementState', 'Unknown') if isinstance(cfg, dict) else 'Unknown' + print(f' {name}: {state}') +" +``` + +**Get status conditions:** +```bash +oc get datasciencecluster default-dsc -o jsonpath='{range .status.conditions[*]}{.type}: {.status} ({.reason}){"\n"}{end}' 2>/dev/null +``` + +### 5. Check DSCInitialization + +```bash +oc get dscinitializations default-dsci -o jsonpath='Name: {.metadata.name}{"\n"}Monitoring: {.spec.monitoring.managementState}{"\n"}' 2>/dev/null +``` + +### 6. Extract Operator Image + +```bash +oc get deployment rhods-operator -n redhat-ods-operator -o jsonpath='{.spec.template.spec.containers[*].image}' 2>/dev/null +``` + +If not found, try the ODH deployment name: +```bash +oc get deployment opendatahub-operator-controller-manager -n openshift-operators -o jsonpath='{.spec.template.spec.containers[*].image}' 2>/dev/null +``` + +### 7. Get Component Images (Always Run) + +Collect all deployed component images from `redhat-ods-applications`. This is NOT optional — always include this table. + +```bash +oc get deployments -n redhat-ods-applications -o custom-columns='COMPONENT:.metadata.name,IMAGE:.spec.template.spec.containers[0].image' 2>/dev/null +``` + +Parse each image to extract a short image name and the `sha256` digest. Present as a markdown table: + +``` +| Component | Image | Digest (short) | +|----------------------------------|------------------------------------------------------|-----------------| +| rhods-dashboard | odh-dashboard-rhel9 | sha256:db295f.. | +| kserve-controller-manager | odh-kserve-controller-rhel9 | sha256:e83b4b.. | +| ... | ... | ... 
| +``` + +To build this table: +- **Component** = the deployment name +- **Image** = the portion after the last `/` and before `@sha256:` (e.g., `odh-dashboard-rhel9`) +- **Digest (short)** = first 8 characters of the sha256 hash + +### 8. Present Summary + +Output a clear summary in this format: + +``` +== RHOAI Version Summary == + +Cluster: <server URL> +Logged in as: <username> + +Operator: + Name: <CSV name> + Version: <version> + Phase: <phase> + Channel: <subscription channel> + Source: <catalog source> + Approval: <install plan approval> + Operator Image: <image reference> + +DataScienceCluster: + Name: default-dsc + Status: <Ready/Not Ready> (<conditions summary>) + Components: + - <component>: <Managed|Removed> + ... + +DSCInitialization: + Name: default-dsci + Monitoring: <Managed/Removed> + +== Component Images (redhat-ods-applications) == + +| Component | Image | Digest (short) | +|-----------|-------|-----------------| +| ... | ... | ... | +``` + +If any resource is not found, note it clearly (e.g., "Not installed" or "Namespace not found") rather than failing silently. + +## Important Notes + +- **Do NOT use `oc get datasciencecluster -o yaml`** — the output is extremely large (hundreds of lines) and jsonpath with dynamic component keys fails to extract cleanly. Use the `python3 -c` approach in Step 4 or targeted jsonpath for conditions. +- **Do NOT use `-o yaml` for subscriptions** — use targeted jsonpath to extract only the fields you need. +- **The DSC resource name is `default-dsc`** and DSCI is `default-dsci` on standard RHOAI installs. Always reference by name for reliable extraction. +- **Component keys in `spec.components` are dynamic** and vary by RHOAI version. Do not hardcode a list — iterate over whatever keys exist. 
+- **Status conditions have changed across versions** — older versions used `Available/Progressing/Degraded/Upgradeable`, newer versions (3.x) use per-component `*Ready` conditions plus `Ready`, `ProvisioningSucceeded`, `ComponentsReady`. Handle both. + +## Example Usage + +**User**: `/rhoai-version` + +**Claude**: +1. Checks if user is logged into cluster +2. Queries RHOAI operator subscription and CSV +3. Checks DataScienceCluster and DSCInitialization status +4. Lists all component images with digests +5. Presents formatted summary with all version information + +## Integration with Other Commands + +This command is useful: +- Before running `/rhoai-update` to know current version +- After running `/rhoai-update` to verify the new version +- For troubleshooting RHOAI installations +- For documenting the current cluster state diff --git a/workflows/rhoai-manager/README.md b/workflows/rhoai-manager/README.md new file mode 100644 index 00000000..55406ab2 --- /dev/null +++ b/workflows/rhoai-manager/README.md @@ -0,0 +1,332 @@ +# RHOAI Manager + +Comprehensive workflow for managing the complete lifecycle of Red Hat OpenShift AI (RHOAI) and Open Data Hub (ODH): installation, updates, version detection, and uninstallation. + +## Overview + +This workflow provides an AI-powered pipeline for: +- Installing RHOAI or ODH from scratch on OpenShift clusters +- Updating RHOAI or ODH to latest nightly builds +- Detecting RHOAI version and build information +- Completely uninstalling RHOAI or ODH when needed +- Managing cluster connections and authentication +- Safely switching between RHOAI and ODH + +## Important: RHOAI and ODH Cannot Coexist + +RHOAI and ODH share cluster-scoped CRDs (`DataScienceCluster`, `DSCInitialization`) and overlapping operators. They **cannot** be installed on the same cluster simultaneously. Both `/rhoai-install` and `/odh-install` detect the other and block with a clear error. 
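+
+The coexistence check can be sketched as a pre-flight guard. This is a minimal sketch, not the commands' exact detection logic; the CSV-name patterns are assumed from the operator package names (`rhods-operator` for RHOAI, `opendatahub-operator` for ODH) used elsewhere in this workflow:
+
+```bash
+# Abort early if either operator's CSV already exists anywhere on the cluster.
+if oc get csv --all-namespaces --no-headers 2>/dev/null | \
+   grep -qE 'rhods-operator|opendatahub-operator'; then
+  echo "Existing RHOAI/ODH installation detected; run the matching uninstall command first" >&2
+  exit 1
+fi
+```
+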
+ +## Structure + +``` +workflows/rhoai-manager/ +├── .ambient/ +│ └── ambient.json # Workflow configuration +├── .claude/ +│ └── commands/ +│ ├── oc-login.md # OpenShift cluster login +│ ├── rhoai-install.md # RHOAI installation +│ ├── rhoai-version.md # RHOAI version detection +│ ├── rhoai-update.md # RHOAI update to latest nightly +│ ├── rhoai-uninstall.md # RHOAI uninstall +│ ├── odh-install.md # ODH installation +│ ├── odh-update.md # ODH update to latest nightly +│ ├── odh-uninstall.md # ODH uninstall +│ ├── odh-pr-tracker.md # Track ODH PRs in RHOAI builds +│ ├── mirror-images.md # Mirror images to disconnected bastions +│ ├── rhoai-disconnected.md # Install/update RHOAI on disconnected clusters +│ └── rhoai-verify.md # Post-install/update verification tests +└── README.md # This file +``` + +## Commands + +### /oc-login + +Login to OpenShift cluster using credentials from Ambient session. + +**Usage:** `/oc-login` + +**Required env vars:** `OCP_SERVER`, `OCP_USERNAME`, `OCP_PASSWORD` + +--- + +### /rhoai-install + +Install RHOAI from scratch on an OpenShift cluster. + +**Usage:** +```bash +/rhoai-install # Latest dev nightly (default) +/rhoai-install channel=stable-3.4 # GA stable-3.4 channel +/rhoai-install catalog=redhat-operators # GA production catalog +``` + +**Prerequisite:** No existing RHOAI **or ODH** installation (detected automatically). + +**What gets deployed:** +- Operator namespace: `redhat-ods-operator` +- Application namespace: `redhat-ods-applications` +- DataScienceCluster with all components + +--- + +### /rhoai-update + +Update RHOAI to the latest nightly or GA build. + +**Usage:** +```bash +/rhoai-update # Pull latest (preserves current channel) +/rhoai-update 3.4 -c stable-3.4 # Update with explicit channel +``` + +**Features:** Preserves channel, auto-detects newer component images, forces reinstall if needed. + +--- + +### /rhoai-version + +Check installed RHOAI version, CSV, catalog digest, and all component image SHAs. 
+ +**Usage:** `/rhoai-version` + +--- + +### /rhoai-uninstall + +Completely uninstall RHOAI from an OpenShift cluster. + +**Usage:** +```bash +/rhoai-uninstall # Remove everything (use this before installing ODH) +/rhoai-uninstall graceful # Graceful then forceful cleanup +/rhoai-uninstall keep-crds # Keep CRDs +/rhoai-uninstall keep-all # Keep CRDs and user resources +``` + +--- + +### /odh-install + +Install Open Data Hub (ODH) nightly on an OpenShift cluster. + +**Usage:** +```bash +/odh-install # odh-stable-nightly catalog, fast channel (default) +/odh-install channel=fast image=quay.io/opendatahub/opendatahub-operator-catalog:latest +``` + +**Prerequisite:** No existing ODH **or RHOAI** installation (detected automatically). + +**Key differences from RHOAI:** + +| | RHOAI | ODH | +|-|-------|-----| +| Package | `rhods-operator` | `opendatahub-operator` | +| Operator namespace | `redhat-ods-operator` | `openshift-operators` | +| App namespace | `redhat-ods-applications` | `opendatahub` | +| Default channel | `stable-3.4` / `beta` | `fast` | +| Nightly tag | `rhoai-3.4` (floating) | `odh-stable-nightly` (floating) | + +--- + +### /odh-update + +Update ODH to the latest nightly build. + +**Usage:** +```bash +/odh-update # Pull latest odh-stable-nightly +/odh-update image=quay.io/opendatahub/opendatahub-operator-catalog:latest +``` + +**Note:** ODH nightlies typically bump the CSV version daily, so OLM auto-upgrades without a forced reinstall in most cases. + +--- + +### /odh-uninstall + +Completely uninstall ODH from an OpenShift cluster. + +**Usage:** +```bash +/odh-uninstall # Remove everything (use this before installing RHOAI) +/odh-uninstall keep-crds # Keep CRDs +/odh-uninstall keep-all # Keep CRDs and user resources +``` + +**Note:** Use the default (no flags) when switching to RHOAI — `keep-crds` or `keep-all` would leave conflicting CRDs. + +--- + +### /odh-pr-tracker + +Track whether an ODH pull request has been included in the latest RHOAI build. 
+ +**Usage:** `/odh-pr-tracker <pr-number>` + +--- + +### /mirror-images + +Mirror all images needed for a complete disconnected RHOAI deployment from a connected cluster to one or more bastion registries. Includes RHOAI operator, all components, and infrastructure services. + +**Usage:** `/mirror-images` + +**What it does:** + +1. Extracts images from connected cluster's CSV relatedImages (all of them, no exclusions by default) +2. Scans all relevant namespaces for running pod images (minio, keycloak, postgres, milvus, vLLM, service mesh, etc.) +3. Captures catalog source images and module architecture images +4. Builds a combined pull secret with source registry and bastion credentials +5. Deploys a mirror pod on the connected cluster (fast AWS-internal transfers) +6. Mirrors all images to each bastion with `--keep-manifest-list=true --filter-by-os=".*"` +7. Tags destinations with `:latest` to prevent Quay tagless manifest GC +8. Verifies every image on each bastion, reports failures by category +9. Generates IDMS (ImageDigestMirrorSet) YAML for the disconnected cluster + +**Required inputs:** Bastion registry address(es), bastion credentials. RHOAI version is auto-detected. Optional exclude patterns (empty by default). + +--- + +### /rhoai-disconnected + +Install or update RHOAI on a disconnected (air-gapped) OpenShift cluster using a digest-pinned FBC catalog image. + +**Usage:** +```bash +/rhoai-disconnected fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:fe1157d5... +/rhoai-disconnected install fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:... +/rhoai-disconnected update fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:... +/rhoai-disconnected fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:... bastion=host:8443 channel=stable-3.4 +``` + +**Required input:** `fbc=<image@sha256:digest>` — the FBC catalog image (must be already mirrored to bastion via `/mirror-images`). 
+ +**Optional inputs:** `bastion=<host:port>` (auto-detected from IDMS), `channel=<channel>` (default: `stable-3.4`), `install`/`update` (auto-detected). + +**What it does:** + +1. Auto-detects install vs update mode and bastion registry from IDMS +2. **Pre-flight verification**: checks that the FBC image and ALL relatedImages exist on the bastion before proceeding +3. Verifies IDMS entries cover all required source registries +4. Creates/updates OLM CatalogSource, namespace, OperatorGroup, and Subscription +5. For updates: forces CSV reinstall to pick up new component images +6. Waits for operator CSV and DataScienceCluster to reach Ready state +7. Post-install health check: detects ImagePullBackOff and CrashLoopBackOff pods +8. Applies known workarounds (podToPodTLS bug, persistenceagent TLS cert) +9. Configures dashboard feature flags (automl, autorag, genAiStudio) + +**Prerequisite:** All images mirrored to bastion (use `/mirror-images` on connected cluster first). IDMS configured on disconnected cluster. + +--- + +### /rhoai-verify + +Run post-install/update verification tests to confirm all RHOAI components are healthy and functional. + +**Usage:** +```bash +/rhoai-verify # Run all tests (default: full) +/rhoai-verify quick # Operator + DSC + pod health only +/rhoai-verify full # All tests including smoke tests +``` + +**What it checks:** + +1. Operator health — CSV phase, subscription state, CatalogSource readiness +2. DataScienceCluster — phase, component conditions +3. Pod health — scans all RHOAI namespaces for ImagePullBackOff, CrashLoopBackOff, not-ready containers +4. Dashboard — deployment readiness, route existence, HTTP response +5. Data Science Pipelines — DSP operator, DSPA health, podToPodTLS status +6. Workbenches — notebook controller, ODH notebook controller, workbench namespace +7. Model Serving — KServe controller, ModelMesh controller, ServingRuntimes, InferenceServices +8. Model Registry — operator readiness, registry namespace +9. 
TrustyAI / EvalHub — TrustyAI operator, EvalHub namespace/pods/route +10. Dependent operators — service mesh, serverless, pipelines, cert-manager +11. Disconnected checks (auto-detected) — IDMS entries, cluster-wide ImagePullBackOff scan + +**Output:** Report at `artifacts/rhoai-manager/reports/verify-[timestamp].md` with PASS/FAIL/WARN summary and troubleshooting guidance. + +--- + +## Typical Workflows + +### Fresh RHOAI Installation +``` +1. /oc-login +2. /rhoai-install +3. /rhoai-verify +``` + +### Fresh ODH Installation +``` +1. /oc-login +2. /odh-install +3. /rhoai-version # (check via version command — ODH has no dedicated version command yet) +``` + +### Pull Latest Nightly (RHOAI) +``` +1. /oc-login +2. /rhoai-update +3. /rhoai-verify +``` + +### Pull Latest Nightly (ODH) +``` +1. /oc-login +2. /odh-update +``` + +### Switch from RHOAI to ODH +``` +1. /oc-login +2. /rhoai-uninstall # Standard uninstall (removes CRDs) +3. /odh-install +``` + +### Switch from ODH to RHOAI +``` +1. /oc-login +2. /odh-uninstall # Standard uninstall (removes CRDs) +3. /rhoai-install +``` + +### Mirror Images to Disconnected Clusters +``` +1. /oc-login # Connect to the connected cluster +2. /mirror-images # Mirror all RHOAI + infrastructure images to bastion(s) +``` + +### Install/Update RHOAI on Disconnected Cluster +``` +1. /oc-login # Connect to the disconnected cluster +2. /rhoai-disconnected fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:... +3. /rhoai-verify # Verify everything is healthy +``` + +### Decommission +``` +1. /oc-login +2. 
/rhoai-uninstall # or /odh-uninstall +``` + +## Prerequisites + +- OpenShift cluster (version 4.12+) +- `oc` CLI installed (auto-installed if missing) +- Cluster credentials configured in Ambient session: + - `OCP_SERVER` - OpenShift cluster API URL + - `OCP_USERNAME` - Your OpenShift username + - `OCP_PASSWORD` - Your OpenShift password +- Cluster admin permissions + +## Output Artifacts + +All artifacts are stored in `artifacts/rhoai-manager/`: + +- `reports/*.md` - Installation and update reports +- `version/*.md` - Version detection summaries +- `logs/*.log` - Detailed execution logs