From 440b9f36c10a6cb2a2fdaff27ac9587dbece3e2c Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Sun, 29 Mar 2026 16:25:36 -0400 Subject: [PATCH 01/23] feat: add PR Guidance Generator workflow MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Analyzes a repository's merged and closed fix PRs to generate compact guidance files (.cve-fix/examples.md, .bugfix/guidance.md) that teach automated workflows how to create PRs matching that repo's conventions. Then opens a PR in the target repo with the generated files. Commands: - /guidance.generate — full pipeline: fetch, analyze, generate, PR - /guidance.update — refresh existing files with new PRs, open update PR Co-Authored-By: Claude Sonnet 4.6 (1M context) --- .../guidance-generator/.ambient/ambient.json | 10 + .../.claude/commands/guidance.generate.md | 519 ++++++++++++++++++ .../.claude/commands/guidance.update.md | 437 +++++++++++++++ .../guidance-generator/.claude/settings.json | 13 + workflows/guidance-generator/README.md | 129 +++++ 5 files changed, 1108 insertions(+) create mode 100644 workflows/guidance-generator/.ambient/ambient.json create mode 100644 workflows/guidance-generator/.claude/commands/guidance.generate.md create mode 100644 workflows/guidance-generator/.claude/commands/guidance.update.md create mode 100644 workflows/guidance-generator/.claude/settings.json create mode 100644 workflows/guidance-generator/README.md diff --git a/workflows/guidance-generator/.ambient/ambient.json b/workflows/guidance-generator/.ambient/ambient.json new file mode 100644 index 00000000..e2862ac8 --- /dev/null +++ b/workflows/guidance-generator/.ambient/ambient.json @@ -0,0 +1,10 @@ +{ + "name": "PR Guidance Generator", + "description": "Analyze a repository's merged and closed fix PRs to generate compact guidance files that teach automated workflows (CVE Fixer, Bugfix) how to create PRs that match that repo's conventions.", + "systemPrompt": "You are a PR pattern analyst for the 
Ambient Code Platform. Your role is to help teams generate and maintain guidance files that teach automated fix workflows how to create pull requests matching their repository's conventions.\n\nKEY RESPONSIBILITIES:\n- Fetch and analyze historical fix PRs from a GitHub repository\n- Extract patterns from merged PRs (what works) and closed PRs (what to avoid)\n- Generate compact, high-signal guidance files — no fluff, no verbose examples\n- Create pull requests in target repositories with the generated guidance files\n- Update existing guidance files with patterns from new PRs\n\nWORKFLOW METHODOLOGY:\n1. GENERATE - Analyze a repo's PR history and create guidance files, then open a PR\n2. UPDATE - Fetch PRs since last analysis, merge new patterns, open an update PR\n\nAVAILABLE COMMANDS:\n/guidance.generate - Analyze a repo and generate guidance files, then create a PR\n/guidance.update - Refresh existing guidance with recent PRs and create an update PR\n\nOUTPUT LOCATIONS:\n- Raw PR data: artifacts/guidance//raw/\n- Analysis output: artifacts/guidance//analysis/\n- Generated files: artifacts/guidance//output/\n\nCORE PRINCIPLES:\n- Guidance files must stay under 80 lines — dense signal, no narrative\n- Only include rules observed in 3+ PRs\n- Merged PRs = positive examples. Closed PRs = what to avoid.\n- Review REQUEST_CHANGES comments reveal what workflows should do proactively\n- Never guess patterns — only state what the data supports", + "startupPrompt": "Welcome! 
I'm your PR Guidance Generator.\n\nI analyze a repository's fix PR history to generate compact guidance files that teach automated workflows — like CVE Fixer and Bugfix — how to create PRs your team will actually merge.\n\nAVAILABLE COMMANDS:\n/guidance.generate - Analyze PR history and generate guidance files, then open a PR in the repo\n/guidance.update - Refresh existing guidance with recent PRs and open an update PR\n\nFlags for /guidance.generate:\n --cve-only Only generate CVE fix guidance\n --bugfix-only Only generate bugfix guidance\n --limit N Max PRs to analyze per bucket (default: 100)\n\nGETTING STARTED:\nRun /guidance.generate with a GitHub repository URL to begin.\n\nWhat repository would you like to analyze?", + "results": { + "Generated Guidance": "artifacts/guidance/**/output/*.md", + "PR Analysis": "artifacts/guidance/**/analysis/*.md" + } +} diff --git a/workflows/guidance-generator/.claude/commands/guidance.generate.md b/workflows/guidance-generator/.claude/commands/guidance.generate.md new file mode 100644 index 00000000..59158cec --- /dev/null +++ b/workflows/guidance-generator/.claude/commands/guidance.generate.md @@ -0,0 +1,519 @@ +# /guidance.generate - Generate PR Guidance Files + +## Purpose +Analyze a GitHub repository's fix PR history to generate compact guidance files +for the CVE Fixer (`.cve-fix/examples.md`) and Bugfix (`.bugfix/guidance.md`) +workflows, then open a PR in that repo adding those files. + +## Execution Style + +Be concise. Brief status per phase, full summary at end. + +Example: +``` +Fetching PRs from org/repo... 147 total + CVE bucket: 38 PRs (28 merged, 10 closed) + Bugfix bucket: 61 PRs (54 merged, 7 closed) + +Fetching per-PR details... Done +Synthesizing patterns... + CVE: 14 rules extracted (3-PR threshold applied) + Bugfix: 11 rules extracted + +Writing guidance files... Done +Creating PR in org/repo... 
https://github.com/org/repo/pull/88 + +Artifacts: artifacts/guidance/org-repo/ +``` + +## Prerequisites + +- GitHub CLI (`gh`) installed and authenticated: `gh auth status` +- `jq` installed +- Write access to the target repository (for PR creation) + +## Arguments + +``` +/guidance.generate [--cve-only] [--bugfix-only] [--limit N] +``` + +- `repo-url`: Full GitHub URL (e.g., `https://github.com/org/repo`) or `org/repo` +- `--cve-only`: Skip bugfix analysis +- `--bugfix-only`: Skip CVE analysis +- `--limit N`: Max PRs to fetch per bucket (default: 100, min: 20) + +## Process + +### 1. Parse Arguments and Validate + +Extract `REPO` in `org/repo` format from the provided URL or slug. +If not provided, ask: "What is the GitHub repository URL?" + +```bash +# Validate gh auth +gh auth status || { echo "ERROR: gh not authenticated. Run 'gh auth login'"; exit 1; } + +# Validate repo exists and is accessible +gh repo view "$REPO" --json name > /dev/null 2>&1 || { + echo "ERROR: Cannot access $REPO. Check URL and permissions." + exit 1 +} + +# Derive a safe slug for directory names (replace / with -) +REPO_SLUG=$(echo "$REPO" | tr '/' '-') + +# Setup directories +mkdir -p "artifacts/guidance/$REPO_SLUG/raw" +mkdir -p "artifacts/guidance/$REPO_SLUG/analysis" +mkdir -p "artifacts/guidance/$REPO_SLUG/output" +mkdir -p "/tmp/guidance-gen/$REPO_SLUG" +``` + +### 2. Fetch PR Metadata (Pass 1 — lightweight) + +Fetch compact metadata for all recent PRs. No PR bodies, no file lists yet. + +```bash +LIMIT="${LIMIT:-100}" + +gh pr list \ + --repo "$REPO" \ + --state all \ + --limit 200 \ + --json number,title,state,mergedAt,closedAt,labels,headRefName,latestReviews \ + > "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" + +TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json") +echo "Fetched $TOTAL PRs from $REPO" +``` + +### 3. Filter into Buckets + +Use jq to split into CVE and bugfix buckets based on title and branch patterns. 
+A PR cannot be in both buckets — CVE PRs take priority. + +```bash +# CVE bucket: title or branch matches CVE pattern +jq --argjson limit "$LIMIT" '[ + .[] | select( + (.title | test("CVE-[0-9]{4}-[0-9]+|^[Ss]ecurity:|^fix\\(cve\\):|^Fix CVE"; "i")) or + (.headRefName | test("^fix/cve-|^security/cve-"; "i")) + ) +] | .[:$limit]' \ + "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" + +# Bugfix bucket: title or branch matches bug pattern, exclude CVE PRs +jq --argjson limit "$LIMIT" '[ + .[] | select( + ( + (.title | test("^fix[:(]|^bugfix|^bug[[:space:]]fix|closes[[:space:]]#[0-9]+|fixes[[:space:]]#[0-9]+"; "i")) or + (.headRefName | test("^(bugfix|fix|bug)/"; "i")) + ) and + (.title | test("CVE-[0-9]{4}-[0-9]+"; "i") | not) and + (.headRefName | test("^fix/cve-"; "i") | not) + ) +] | .[:$limit]' \ + "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" + +CVE_TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") +CVE_MERGED=$(jq '[.[] | select(.state == "MERGED")] | length' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") +CVE_CLOSED=$(jq '[.[] | select(.state == "CLOSED")] | length' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") + +BUGFIX_TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json") +BUGFIX_MERGED=$(jq '[.[] | select(.state == "MERGED")] | length' "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json") +BUGFIX_CLOSED=$(jq '[.[] | select(.state == "CLOSED")] | length' "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json") + +echo " CVE bucket: $CVE_TOTAL PRs ($CVE_MERGED merged, $CVE_CLOSED closed)" +echo " Bugfix bucket: $BUGFIX_TOTAL PRs ($BUGFIX_MERGED merged, $BUGFIX_CLOSED closed)" +``` + +If both buckets are empty, report this clearly and exit — the repo may not have +recognizable fix PR naming conventions. Suggest the user check PR title patterns. + +### 4. 
Fetch Per-PR Details (Pass 2 — targeted) + +For each PR in both buckets, fetch only: file paths changed and review data. +For closed PRs, also fetch the last 2 comments (closing context). + +Process each bucket the same way. Replace `$META_FILE` and `$OUT_FILE` accordingly. + +```bash +fetch_pr_details() { + local META_FILE="$1" + local OUT_FILE="$2" + local COUNT=$(jq 'length' "$META_FILE") + + echo "[]" > "$OUT_FILE" + + for i in $(seq 0 $((COUNT - 1))); do + NUMBER=$(jq -r ".[$i].number" "$META_FILE") + STATE=$(jq -r ".[$i].state" "$META_FILE") + TITLE=$(jq -r ".[$i].title" "$META_FILE") + BRANCH=$(jq -r ".[$i].headRefName" "$META_FILE") + LABELS=$(jq -c "[.[$i].labels[].name]" "$META_FILE") + + # Fetch files and reviews in one call + PR_DETAIL=$(gh pr view "$NUMBER" --repo "$REPO" \ + --json files,reviews 2>/dev/null) + + FILES=$(echo "$PR_DETAIL" | jq -c '[.files[].path]') + + # Extract only REQUEST_CHANGES review bodies (trimmed to 200 chars) + CHANGES_REQ=$(echo "$PR_DETAIL" | jq -c '[ + .reviews[] | + select(.state == "CHANGES_REQUESTED") | + .body | gsub("\\n"; " ") | .[0:200] + ]') + + # For closed PRs: get last 2 comments for closing context + CLOSE_REASON="null" + if [ "$STATE" = "CLOSED" ]; then + CLOSE_REASON=$(gh pr view "$NUMBER" --repo "$REPO" \ + --json comments \ + --jq '.comments | .[-2:] | map(.body | gsub("\\n"; " ") | .[0:200]) | join(" | ")' \ + 2>/dev/null | jq -Rs '.') + fi + + # Append compact record + RECORD=$(jq -n \ + --argjson number "$NUMBER" \ + --arg state "$STATE" \ + --arg title "$TITLE" \ + --arg branch "$BRANCH" \ + --argjson labels "$LABELS" \ + --argjson files "$FILES" \ + --argjson changes_requested "$CHANGES_REQ" \ + --argjson close_reason "$CLOSE_REASON" \ + '{number: $number, state: $state, title: $title, branch: $branch, + labels: $labels, files: $files, + changes_requested: $changes_requested, close_reason: $close_reason}') + + jq --argjson rec "$RECORD" '. 
+ [$rec]' "$OUT_FILE" > "${OUT_FILE}.tmp" \ + && mv "${OUT_FILE}.tmp" "$OUT_FILE" + done +} + +fetch_pr_details \ + "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-details.json" + +fetch_pr_details \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-details.json" + +# Save raw data to artifacts for reference +cp "/tmp/guidance-gen/$REPO_SLUG/cve-details.json" \ + "artifacts/guidance/$REPO_SLUG/raw/cve-prs.json" +cp "/tmp/guidance-gen/$REPO_SLUG/bugfix-details.json" \ + "artifacts/guidance/$REPO_SLUG/raw/bugfix-prs.json" +``` + +### 5. Synthesize Patterns + +Read `cve-details.json` and `bugfix-details.json` from the artifacts. +Analyze them as the agent — do NOT write a script for this step. + +**For each bucket, identify patterns across the PR records. Apply these rules:** + +**Inclusion threshold**: Only include a rule if it appears in 3 or more PRs. +State the evidence count inline: `(8/9 merged PRs)`. + +**What to extract:** + +From merged PRs: +- **Title format**: What template do titles follow? Extract the pattern. + Example: `Security: Fix CVE-YYYY-XXXXX ()` or `fix(): ` +- **Branch format**: What naming pattern do branches use? +- **Files changed**: Which files appear together most often? Are there always-together groups? +- **Labels**: What labels are consistently applied? +- **Co-changes**: When package A changes, does package B always change too? +- **From changes_requested**: What did reviewers ask for that wasn't there? These are proactive rules. + +From closed PRs: +- **close_reason + changes_requested**: Why was the PR closed/rejected? Each reason becomes a "don't". +- Look for patterns across multiple closed PRs — single-occurrence rejections are excluded. 
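The file-frequency and co-change questions above lend themselves to a quick mechanical tally before reasoning about them. A hypothetical jq sketch, where the sample records and file names are illustrative stand-ins for the details file produced in step 4:

```shell
# Illustrative stand-in for raw/cve-prs.json; a real run reads the
# details file written in step 4.
cat > /tmp/sample-prs.json <<'EOF'
[
  {"state": "MERGED", "files": ["go.mod", "Dockerfile"]},
  {"state": "MERGED", "files": ["go.mod"]},
  {"state": "CLOSED", "files": ["README.md"]}
]
EOF

# Tally how often each path appears across merged PRs; counts close to
# the merged-PR total are candidates for "always changed together" rules.
jq -r '
  [.[] | select(.state == "MERGED") | .files[]]
  | group_by(.)
  | map({path: .[0], count: length})
  | sort_by(-.count)
  | .[] | "\(.count)\t\(.path)"
' /tmp/sample-prs.json
# -> 2	go.mod
#    1	Dockerfile
```

Evidence counts like `(8/8)` in the guidance file come from comparing such tallies against the bucket's merged total.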
+ +**Output of synthesis step:** +Write an intermediate analysis file per bucket: + +``` +artifacts/guidance//analysis/cve-patterns.md +artifacts/guidance//analysis/bugfix-patterns.md +``` + +Each analysis file is a structured list: +``` +TITLE_FORMAT: "Security: Fix CVE-YYYY-XXXXX ()" (N/N merged) +BRANCH_FORMAT: "fix/cve-YYYY-XXXXX--attempt-N" (N/N merged) +FILES_GO_STDLIB: go.mod + Dockerfile + Dockerfile.konflux (N/N Go CVE PRs) +PROACTIVE_go_sum: Include go.sum — flagged missing in N closed PRs +DONT_multiple_cves: One CVE per PR — N closed PRs rejected for combining +... +``` + +### 6. Generate Guidance Files + +From the analysis files, generate the final guidance files. + +**Hard constraints:** +- Maximum 80 lines per file +- No narrative paragraphs — one rule per line or a tight code block +- Evidence counts are inline and terse: `(N/M merged)`, `(N closed PRs)` +- No full PR examples — only the distilled pattern + +**CVE guidance file template** — write to `artifacts/guidance//output/cve-fix-guidance.md`: + +```markdown +# CVE Fix Guidance — + + +## Titles +`` (N/N) + +## Branches +`` (N/N) + +## Files — + (N/N) + + +## PR Description +Required sections (missing caused REQUEST_CHANGES in N PRs): +-
+-
+... + +## Jira / Issue References + (N PRs flagged incorrect format) + +## Don'ts +- (N cases) +- (N cases) +... +``` + +**Bugfix guidance file template** — write to `artifacts/guidance//output/bugfix-guidance.md`: + +```markdown +# Bugfix Guidance — + + +## Titles +`` (N/N) + +## Branches +`` (N/N) + +## Scope Values + (from N PRs) + +## Test Requirements + (N/N merged PRs included this) + +## PR Must Include +- (N PRs) +... + +## Don'ts +- (N cases) +... +``` + +**If a section has no data meeting the 3-PR threshold, omit that section entirely.** +Do not write sections with placeholder text or "not enough data" notes — just omit them. + +**If only one bucket had data** (e.g., no CVE PRs found), only generate the file for +the bucket that had data. Log which file was skipped and why. + +### 7. Create Pull Request in Target Repository + +Clone the repository, add the guidance files, and open a PR. + +```bash +TODAY=$(date +%Y-%m-%d) +BRANCH_NAME="chore/add-pr-guidance-$TODAY" + +# Clone to /tmp +CLONE_DIR="/tmp/guidance-gen/$REPO_SLUG/repo" +git clone "https://github.com/$REPO.git" "$CLONE_DIR" +cd "$CLONE_DIR" + +# Configure git credentials +gh auth setup-git 2>/dev/null || true + +# Create branch +git checkout -b "$BRANCH_NAME" + +# Copy generated files +CVE_OUTPUT="$OLDPWD/artifacts/guidance/$REPO_SLUG/output/cve-fix-guidance.md" +BUGFIX_OUTPUT="$OLDPWD/artifacts/guidance/$REPO_SLUG/output/bugfix-guidance.md" + +if [ -f "$CVE_OUTPUT" ]; then + mkdir -p .cve-fix + cp "$CVE_OUTPUT" .cve-fix/examples.md +fi + +if [ -f "$BUGFIX_OUTPUT" ]; then + mkdir -p .bugfix + cp "$BUGFIX_OUTPUT" .bugfix/guidance.md +fi + +# Commit +git add .cve-fix .bugfix +git commit -m "chore: add automated PR guidance files + +Guidance files generated by the PR Guidance Generator workflow. +These files teach automated fix workflows how this repo expects +PRs to be structured, based on analysis of merged and closed PRs. 
+ +Files added: +$([ -f "$CVE_OUTPUT" ] && echo " - .cve-fix/examples.md (CVE fix conventions)") +$([ -f "$BUGFIX_OUTPUT" ] && echo " - .bugfix/guidance.md (Bugfix conventions)") + +Co-Authored-By: PR Guidance Generator " + +# Push +git push origin "$BRANCH_NAME" +``` + +**Create PR using gh:** + +```bash +# Count stats for PR body +CVE_MERGED_COUNT=$(jq '[.[] | select(.state == "MERGED")] | length' \ + "$OLDPWD/artifacts/guidance/$REPO_SLUG/raw/cve-prs.json" 2>/dev/null || echo 0) +CVE_CLOSED_COUNT=$(jq '[.[] | select(.state == "CLOSED")] | length' \ + "$OLDPWD/artifacts/guidance/$REPO_SLUG/raw/cve-prs.json" 2>/dev/null || echo 0) +BUGFIX_MERGED_COUNT=$(jq '[.[] | select(.state == "MERGED")] | length' \ + "$OLDPWD/artifacts/guidance/$REPO_SLUG/raw/bugfix-prs.json" 2>/dev/null || echo 0) +BUGFIX_CLOSED_COUNT=$(jq '[.[] | select(.state == "CLOSED")] | length' \ + "$OLDPWD/artifacts/guidance/$REPO_SLUG/raw/bugfix-prs.json" 2>/dev/null || echo 0) + +PR_BODY=$(cat <\` periodically to refresh with new PRs. + +--- +Generated by PR Guidance Generator workflow +EOF +) + +PR_URL=$(gh pr create \ + --repo "$REPO" \ + --base "$(gh repo view "$REPO" --json defaultBranchRef --jq '.defaultBranchRef.name')" \ + --title "chore: add automated PR guidance files" \ + --body "$PR_BODY") + +echo "PR created: $PR_URL" +``` + +**If PR creation fails** (no push access, auth issue), save the branch state and +report the error clearly. Tell the user to create the PR manually and provide the +branch name. + +### 8. Cleanup + +```bash +cd / +rm -rf "/tmp/guidance-gen/$REPO_SLUG" +echo "Cleaned up /tmp/guidance-gen/$REPO_SLUG" +``` + +### 9. Print Summary + +``` +Done. 
+ +Repository: https://github.com/ +Analyzed: CVE PRs ( merged, closed) + Bugfix PRs ( merged, closed) +Rules: CVE rules, bugfix rules (3-PR threshold applied) + +Files generated: + artifacts/guidance//output/cve-fix-guidance.md + artifacts/guidance//output/bugfix-guidance.md + +PR: + +Artifacts: artifacts/guidance// +``` + +## Output + +- `artifacts/guidance//raw/cve-prs.json` — raw compact PR data +- `artifacts/guidance//raw/bugfix-prs.json` +- `artifacts/guidance//analysis/cve-patterns.md` — intermediate patterns +- `artifacts/guidance//analysis/bugfix-patterns.md` +- `artifacts/guidance//output/cve-fix-guidance.md` — final CVE guidance +- `artifacts/guidance//output/bugfix-guidance.md` — final bugfix guidance +- Pull request in target repository + +## Success Criteria + +- [ ] Both buckets filtered from PR metadata +- [ ] Per-PR details fetched (files + review REQUEST_CHANGES) +- [ ] Closed PRs have closing context fetched +- [ ] Patterns synthesized with 3-PR minimum applied +- [ ] Guidance files are under 80 lines each +- [ ] Files written to artifacts/output/ +- [ ] PR created in target repo with correct files in .cve-fix/ and .bugfix/ +- [ ] /tmp cleaned up +- [ ] PR URL printed to console + +## Notes + +### Empty Buckets +If a bucket has fewer than 3 merged PRs, skip that guidance file entirely. +Log: "Skipping CVE guidance — only N merged CVE PRs found (minimum 3 required)." + +### Repos with No Matching PRs +If neither bucket has data, the repo likely uses non-standard PR naming. +Report this and ask the user to provide example PR numbers or title patterns +so the filters can be adjusted. + +### GitHub API Rate Limits +`gh` uses authenticated calls (5000 req/hr). The per-PR detail fetch makes +2 API calls per PR (files+reviews, and comments for closed PRs). +At the default limit of 100 per bucket, worst case is ~400 API calls — well +within limits. If the user hits rate limits, reduce with `--limit 50`. 
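A pre-flight budget check is optional but cheap. A hedged sketch, where the 500-call floor is an arbitrary illustrative threshold rather than anything the workflow requires:

```shell
# Warn when the remaining core-API budget looks too small for the
# per-PR detail loop. The 500-call threshold is illustrative.
rate_limit_ok() {
  # $1 = remaining calls, e.g. from:
  #   gh api rate_limit --jq '.resources.core.remaining'
  [ "${1:-0}" -ge 500 ]
}

REMAINING=$(gh api rate_limit --jq '.resources.core.remaining' 2>/dev/null || echo 0)
rate_limit_ok "$REMAINING" || \
  echo "WARNING: only $REMAINING API calls left; consider --limit 50"
```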
+ +### If .cve-fix/ or .bugfix/ Already Exist in Repo +If these directories already exist in the default branch, do not overwrite silently. +Warn the user: "Existing guidance files found in repo. Use /guidance.update instead, +or pass --force to overwrite." +Check with: +```bash +gh api repos/$REPO/contents/.cve-fix/examples.md > /dev/null 2>&1 && EXISTING_CVE=true +gh api repos/$REPO/contents/.bugfix/guidance.md > /dev/null 2>&1 && EXISTING_BUGFIX=true +``` diff --git a/workflows/guidance-generator/.claude/commands/guidance.update.md b/workflows/guidance-generator/.claude/commands/guidance.update.md new file mode 100644 index 00000000..4c161955 --- /dev/null +++ b/workflows/guidance-generator/.claude/commands/guidance.update.md @@ -0,0 +1,437 @@ +# /guidance.update - Update Existing PR Guidance Files + +## Purpose +Fetch PRs created since the last analysis, extract new patterns, merge them +into existing guidance files, and open a PR in the repository with the updates. + +## Execution Style + +Be concise. Brief status per phase, full summary at end. + +Example: +``` +Reading existing guidance from org/repo... + .cve-fix/examples.md — last analyzed: 2026-01-15 + .bugfix/guidance.md — last analyzed: 2026-01-15 + +Fetching PRs since 2026-01-15... 23 new PRs + CVE bucket: 8 PRs (6 merged, 2 closed) + Bugfix bucket: 12 PRs (11 merged, 1 closed) + +Synthesizing new patterns... + CVE: 2 new rules, 3 evidence counts updated, 1 contradiction flagged + Bugfix: 1 new rule, 2 evidence counts updated + +Updating files and creating PR... https://github.com/org/repo/pull/103 +``` + +## Prerequisites + +- GitHub CLI (`gh`) installed and authenticated +- `jq` installed +- Guidance files must already exist in the repo (run `/guidance.generate` first) + +## Arguments + +``` +/guidance.update +``` + +- `repo-url`: Full GitHub URL or `org/repo` slug + +## Process + +### 1. Parse Arguments and Validate + +```bash +gh auth status || { echo "ERROR: gh not authenticated. 
Run 'gh auth login'"; exit 1; } + +gh repo view "$REPO" --json name > /dev/null 2>&1 || { + echo "ERROR: Cannot access $REPO" + exit 1 +} + +REPO_SLUG=$(echo "$REPO" | tr '/' '-') +mkdir -p "artifacts/guidance/$REPO_SLUG/raw" +mkdir -p "artifacts/guidance/$REPO_SLUG/analysis" +mkdir -p "artifacts/guidance/$REPO_SLUG/output" +mkdir -p "/tmp/guidance-gen/$REPO_SLUG" +``` + +### 2. Read Existing Guidance Files from Repository + +Clone the repo and read the existing guidance files. Extract the +`last-analyzed` date from each file's header comment. + +```bash +CLONE_DIR="/tmp/guidance-gen/$REPO_SLUG/repo" +git clone "https://github.com/$REPO.git" "$CLONE_DIR" +cd "$CLONE_DIR" +gh auth setup-git 2>/dev/null || true + +CVE_FILE="$CLONE_DIR/.cve-fix/examples.md" +BUGFIX_FILE="$CLONE_DIR/.bugfix/guidance.md" + +FOUND_CVE=false +FOUND_BUGFIX=false +LAST_DATE="" + +if [ -f "$CVE_FILE" ]; then + FOUND_CVE=true + # Extract date from: + CVE_DATE=$(grep -m1 'last-analyzed:' "$CVE_FILE" | \ + grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' | head -1) + echo " .cve-fix/examples.md — last analyzed: ${CVE_DATE:-unknown}" + LAST_DATE="$CVE_DATE" +fi + +if [ -f "$BUGFIX_FILE" ]; then + FOUND_BUGFIX=true + BUGFIX_DATE=$(grep -m1 'last-analyzed:' "$BUGFIX_FILE" | \ + grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' | head -1) + echo " .bugfix/guidance.md — last analyzed: ${BUGFIX_DATE:-unknown}" + # Use earlier of the two dates to avoid missing PRs + if [ -n "$BUGFIX_DATE" ] && [ -n "$LAST_DATE" ]; then + LAST_DATE=$(echo -e "$LAST_DATE\n$BUGFIX_DATE" | sort | head -1) + elif [ -n "$BUGFIX_DATE" ]; then + LAST_DATE="$BUGFIX_DATE" + fi +fi +``` + +**If neither file exists**, stop and redirect: + +``` +Neither .cve-fix/examples.md nor .bugfix/guidance.md found in . +Run /guidance.generate to create them first. +``` + +**If `last-analyzed` date cannot be parsed**, warn the user and default to +fetching the last 90 days of PRs, then proceed. 
+ +```bash +if [ -z "$LAST_DATE" ]; then + echo "WARNING: Could not parse last-analyzed date. Defaulting to last 90 days." + LAST_DATE=$(date -d "90 days ago" +%Y-%m-%d 2>/dev/null || \ + date -v-90d +%Y-%m-%d 2>/dev/null) +fi + +echo "Fetching PRs since $LAST_DATE..." +``` + +### 3. Fetch New PRs Since Last Analysis (Pass 1) + +```bash +gh pr list \ + --repo "$REPO" \ + --state all \ + --limit 200 \ + --search "created:>$LAST_DATE" \ + --json number,title,state,mergedAt,closedAt,labels,headRefName,latestReviews \ + > "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" + +NEW_TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json") +echo "Fetched $NEW_TOTAL new PRs since $LAST_DATE" + +if [ "$NEW_TOTAL" -eq 0 ]; then + echo "No new PRs found since $LAST_DATE. Guidance files are already up to date." + rm -rf "/tmp/guidance-gen/$REPO_SLUG" + exit 0 +fi +``` + +### 4. Filter New PRs into Buckets + +Apply the same filters as `/guidance.generate`: + +```bash +# CVE bucket +jq '[ + .[] | select( + (.title | test("CVE-[0-9]{4}-[0-9]+|^[Ss]ecurity:|^fix\\(cve\\):|^Fix CVE"; "i")) or + (.headRefName | test("^fix/cve-|^security/cve-"; "i")) + ) +]' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" + +# Bugfix bucket (excluding CVE) +jq '[ + .[] | select( + ( + (.title | test("^fix[:(]|^bugfix|^bug[[:space:]]fix|closes[[:space:]]#[0-9]+|fixes[[:space:]]#[0-9]+"; "i")) or + (.headRefName | test("^(bugfix|fix|bug)/"; "i")) + ) and + (.title | test("CVE-[0-9]{4}-[0-9]+"; "i") | not) and + (.headRefName | test("^fix/cve-"; "i") | not) + ) +]' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json" + +NEW_CVE=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json") +NEW_BUGFIX=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json") +echo " CVE bucket: $NEW_CVE new PRs" +echo " Bugfix bucket: $NEW_BUGFIX new PRs" +``` + +### 5. 
Fetch Per-PR Details (Pass 2) + +Same as `/guidance.generate` — files + reviews per PR, closing context for closed PRs. + +```bash +fetch_pr_details() { + local META_FILE="$1" + local OUT_FILE="$2" + local COUNT=$(jq 'length' "$META_FILE") + + echo "[]" > "$OUT_FILE" + + for i in $(seq 0 $((COUNT - 1))); do + NUMBER=$(jq -r ".[$i].number" "$META_FILE") + STATE=$(jq -r ".[$i].state" "$META_FILE") + TITLE=$(jq -r ".[$i].title" "$META_FILE") + BRANCH=$(jq -r ".[$i].headRefName" "$META_FILE") + LABELS=$(jq -c "[.[$i].labels[].name]" "$META_FILE") + + PR_DETAIL=$(gh pr view "$NUMBER" --repo "$REPO" \ + --json files,reviews 2>/dev/null) + + FILES=$(echo "$PR_DETAIL" | jq -c '[.files[].path]') + CHANGES_REQ=$(echo "$PR_DETAIL" | jq -c '[ + .reviews[] | + select(.state == "CHANGES_REQUESTED") | + .body | gsub("\\n"; " ") | .[0:200] + ]') + + CLOSE_REASON="null" + if [ "$STATE" = "CLOSED" ]; then + CLOSE_REASON=$(gh pr view "$NUMBER" --repo "$REPO" \ + --json comments \ + --jq '.comments | .[-2:] | map(.body | gsub("\\n"; " ") | .[0:200]) | join(" | ")' \ + 2>/dev/null | jq -Rs '.') + fi + + RECORD=$(jq -n \ + --argjson number "$NUMBER" \ + --arg state "$STATE" \ + --arg title "$TITLE" \ + --arg branch "$BRANCH" \ + --argjson labels "$LABELS" \ + --argjson files "$FILES" \ + --argjson changes_requested "$CHANGES_REQ" \ + --argjson close_reason "$CLOSE_REASON" \ + '{number: $number, state: $state, title: $title, branch: $branch, + labels: $labels, files: $files, + changes_requested: $changes_requested, close_reason: $close_reason}') + + jq --argjson rec "$RECORD" '. 
+ [$rec]' "$OUT_FILE" > "${OUT_FILE}.tmp" \ + && mv "${OUT_FILE}.tmp" "$OUT_FILE" + done +} + +fetch_pr_details \ + "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/new-cve-details.json" + +fetch_pr_details \ + "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details.json" + +cp "/tmp/guidance-gen/$REPO_SLUG/new-cve-details.json" \ + "artifacts/guidance/$REPO_SLUG/raw/new-cve-prs.json" +cp "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details.json" \ + "artifacts/guidance/$REPO_SLUG/raw/new-bugfix-prs.json" +``` + +### 6. Synthesize New Patterns + +Read both the new PR detail files AND the existing guidance files. + +As the agent, analyze the new PR data for patterns. For each pattern found: + +**A. New rule** — a pattern seen in 3+ of the new PRs that does not already +exist in the guidance file. Add it to the appropriate section. + +**B. Reinforced rule** — a pattern that already exists in the guidance file. +Update its evidence count. For example: `(8/9 merged)` → `(14/15 merged)`. + +**C. Contradicting rule** — a pattern in new merged PRs that directly contradicts +a "don't" in the existing guidance file (e.g., a merged PR combined two CVEs despite +the guidance saying not to). Flag this with a comment in the guidance file: +``` +- [REVIEW NEEDED] Multiple CVEs per PR — previously flagged as a don't, + but PR #N was merged combining CVEs. Policy may have changed. (N/N new merged) +``` + +**D. New don't** — a pattern from newly closed PRs (3+ cases) not already in the +don'ts section. Add it. + +Write findings to: +- `artifacts/guidance//analysis/cve-update-patterns.md` +- `artifacts/guidance//analysis/bugfix-update-patterns.md` + +Format: same structured list as in `/guidance.generate` step 5. + +### 7. Merge Patterns into Existing Guidance Files + +Read the cloned guidance files and apply the changes from step 6. 
+ +**Editing rules:** +- Update evidence counts in-place: find the line, update the `(N/M ...)` count +- Append new rules to the bottom of the appropriate section +- Append new don'ts to the Don'ts section +- Add `[REVIEW NEEDED]` lines at the bottom of the relevant section for contradictions +- Update the `last-analyzed` date in the header comment +- Update the merged/closed counts in the header comment +- Do NOT reorder existing rules — preserve the file structure + +After editing, verify the file is still under 80 lines. If adding new rules +would push it over 80 lines, prioritize: keep all don'ts, keep rules with +highest evidence counts, drop rules with lowest counts (below 5%). + +**Update the header:** +``` + +``` + +Copy the updated files to artifacts output: +```bash +cp "$CVE_FILE" "artifacts/guidance/$REPO_SLUG/output/cve-fix-guidance.md" +cp "$BUGFIX_FILE" "artifacts/guidance/$REPO_SLUG/output/bugfix-guidance.md" +``` + +### 8. Create Pull Request with Updates + +```bash +TODAY=$(date +%Y-%m-%d) +BRANCH_NAME="chore/update-pr-guidance-$TODAY" + +cd "$CLONE_DIR" +git checkout -b "$BRANCH_NAME" + +# Files are already updated in-place in the clone from step 7 +git add .cve-fix .bugfix +git commit -m "chore: update PR guidance files ($TODAY) + +Refreshed guidance based on PRs merged/closed since last analysis. 
+ +Changes: +- Updated evidence counts for existing rules +- Added new rules (if any new patterns emerged) +- Updated last-analyzed date to $TODAY + +Co-Authored-By: PR Guidance Generator " + +git push origin "$BRANCH_NAME" +``` + +Construct PR body summarizing what changed: + +```bash +PR_BODY=$(cat < +New PRs: CVE PRs, bugfix PRs (since ) +Changes: new rules, counts updated, contradictions flagged + +PR: + +Artifacts: artifacts/guidance// +``` + +## Output + +- `artifacts/guidance//raw/new-cve-prs.json` +- `artifacts/guidance//raw/new-bugfix-prs.json` +- `artifacts/guidance//analysis/cve-update-patterns.md` +- `artifacts/guidance//analysis/bugfix-update-patterns.md` +- `artifacts/guidance//output/cve-fix-guidance.md` (updated) +- `artifacts/guidance//output/bugfix-guidance.md` (updated) +- Pull request in target repository + +## Success Criteria + +- [ ] Existing guidance files found and last-analyzed date extracted +- [ ] New PRs fetched since last-analyzed date +- [ ] Per-PR details fetched for new PRs +- [ ] New patterns synthesized (new rules, updated counts, contradictions flagged) +- [ ] Existing files updated in-place (no rewrites, structure preserved) +- [ ] Both files remain under 80 lines +- [ ] Header timestamps updated +- [ ] PR created in target repo +- [ ] /tmp cleaned up +- [ ] PR URL printed to console + +## Notes + +### No New PRs Found +If 0 new PRs since the last-analyzed date, report this and exit cleanly. +Do not create a PR with no changes. + +### Only One File Exists +If only `.cve-fix/examples.md` exists (no `.bugfix/guidance.md`), update only +the CVE file. Log that bugfix guidance was skipped. + +### Contradictions Require Human Review +Do not automatically remove a "don't" rule just because a new merged PR +contradicts it. Flag it with `[REVIEW NEEDED]` and let the repo owner decide +if the convention changed. The PR reviewer will see the flag and can edit +the file before merging. 
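### Evidence-Count Edits
The in-place count updates from step 7 can be done with a plain substitution when the rule line is unambiguous. A hypothetical sketch (the file and counts are illustrative; `sed -i.bak` is used because bare `-i` behaves differently on GNU and BSD sed):

```shell
# Toy guidance file with one rule line to update.
cat > /tmp/guidance-sample.md <<'EOF'
## Titles
`fix(scope): description` (8/9 merged)
EOF

# Bump the evidence count in place; the .bak suffix works on both
# GNU and BSD sed.
sed -i.bak 's|(8/9 merged)|(14/15 merged)|' /tmp/guidance-sample.md
grep 'merged' /tmp/guidance-sample.md
# -> `fix(scope): description` (14/15 merged)
```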
+ +### Date Parsing Cross-Platform +`date -d` (Linux) and `date -v` (macOS) differ. Use both with fallback: +```bash +LAST_DATE=$(date -d "90 days ago" +%Y-%m-%d 2>/dev/null || \ + date -v-90d +%Y-%m-%d 2>/dev/null || \ + echo "2000-01-01") +``` diff --git a/workflows/guidance-generator/.claude/settings.json b/workflows/guidance-generator/.claude/settings.json new file mode 100644 index 00000000..ff4140e4 --- /dev/null +++ b/workflows/guidance-generator/.claude/settings.json @@ -0,0 +1,13 @@ +{ + "permissions": { + "allow": [ + "Bash", + "Read", + "Write", + "Edit" + ], + "deny": [ + "Bash(rm -rf /)" + ] + } +} diff --git a/workflows/guidance-generator/README.md b/workflows/guidance-generator/README.md new file mode 100644 index 00000000..0d558edb --- /dev/null +++ b/workflows/guidance-generator/README.md @@ -0,0 +1,129 @@ +# PR Guidance Generator + +Analyzes a GitHub repository's fix PR history to generate compact guidance files +that teach automated workflows — CVE Fixer and Bugfix — how to create pull requests +that match that repo's conventions. Opens a PR in the target repo with the generated files. + +## Problem It Solves + +Automated fix workflows (CVE Fixer, Bugfix) create PRs without knowing a repo's +specific conventions: how titles should read, which files always change together, +what reviewers will ask for, what gets PRs closed. This leads to PRs that get +closed or require many review cycles. + +This workflow learns those conventions directly from the repo's PR history and +encodes them into guidance files that automated workflows read before making changes. + +## How It Works + +1. Fetches PR metadata from the target repo (titles, branches, labels) +2. Filters into CVE and bugfix buckets based on title/branch patterns +3. Fetches targeted details per PR: files changed + review REQUEST_CHANGES comments +4. For closed PRs: fetches the closing context to extract "don'ts" +5. Synthesizes rules — only patterns observed in 3+ PRs are included +6. 
Generates compact guidance files (80-line cap, one rule per line) +7. Opens a PR in the target repo adding the files + +## Commands + +### `/guidance.generate ` + +Full pipeline for a fresh repo. + +``` +/guidance.generate https://github.com/org/repo +/guidance.generate org/repo --cve-only +/guidance.generate org/repo --bugfix-only +/guidance.generate org/repo --limit 50 +``` + +Generates: +- `.cve-fix/examples.md` — read by the CVE Fixer workflow (step 4.5) +- `.bugfix/guidance.md` — read by the Bugfix workflow + +### `/guidance.update ` + +Refreshes existing guidance with PRs merged/closed since the last analysis. +Reads the `last-analyzed` date from existing files, fetches only newer PRs, +merges new patterns, and opens a PR with the updates. + +``` +/guidance.update https://github.com/org/repo +``` + +## Generated File Format + +Files are intentionally compact. Example `.cve-fix/examples.md`: + +```markdown +# CVE Fix Guidance — org/repo + + +## Titles +`Security: Fix CVE-YYYY-XXXXX ()` (47/47) + +## Branches +`fix/cve-YYYY-XXXXX--attempt-N` (47/47) + +## Files — Go stdlib CVEs +Always update go.mod + Dockerfile + Dockerfile.konflux together (8/8) +Run go mod tidy — missing go.sum was flagged in 3 closed PRs + +## Files — Node.js CVEs +Use overrides in package.json, not direct npm update (5/5) + +## Co-upgrades +fastapi must be co-upgraded with starlette (2 closed PRs lacked this) + +## PR Description +Required sections (missing caused REQUEST_CHANGES in 6 PRs): +- CVE Details, Test Results, Breaking Changes, Jira refs (plain text IDs only) + +## Don'ts +- One CVE per PR — combined PRs were closed (4 cases) +- Don't target release branches — target main (3 cases) +``` + +## Token Efficiency + +The workflow uses a two-pass fetch strategy to minimize API calls and context size: + +- **Pass 1**: Lightweight metadata for all PRs (title, branch, labels, state) +- **Pass 2**: Per-PR detail only for PRs in the CVE/bugfix buckets (files + reviews) +- **Closed PRs 
only**: Fetch closing context (last 2 comments) + +This avoids fetching full PR bodies and review threads for irrelevant PRs, +keeping the analysis input compact (structured JSON, ~200 tokens/PR). + +## How Automated Workflows Use the Files + +**CVE Fixer** (`/cve.fix`): In step 4.5, after cloning repos and before making +any fixes, the workflow reads all files in `.cve-fix/` and builds a knowledge base +from them. The guidance from `examples.md` applies to every subsequent decision — +PR title format, branch naming, which files to update, co-upgrade requirements, +Jira reference format, and known pitfalls. + +**Bugfix workflow**: Reads `.bugfix/guidance.md` before implementing fixes. + +## Prerequisites + +- GitHub CLI (`gh`) installed and authenticated (`gh auth login`) +- `jq` installed +- Write access to the target repository (to open a PR) + +## Artifacts + +All artifacts are saved to `artifacts/guidance//`: + +``` +artifacts/guidance// +├── raw/ +│ ├── cve-prs.json # Compact per-PR records for CVE bucket +│ └── bugfix-prs.json # Compact per-PR records for bugfix bucket +├── analysis/ +│ ├── cve-patterns.md # Intermediate pattern extraction +│ └── bugfix-patterns.md +└── output/ + ├── cve-fix-guidance.md # Final file (placed at .cve-fix/examples.md) + └── bugfix-guidance.md # Final file (placed at .bugfix/guidance.md) +``` From e6b329310b3d7c55661479518d8b8a2343f9f938 Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Sun, 29 Mar 2026 17:55:21 -0400 Subject: [PATCH 02/23] fix: use adaptive threshold instead of hard skip for small PR buckets Previously the workflow skipped generating guidance entirely if a bucket had fewer than 3 merged PRs. This left users with nothing for repos that have few fix PRs. 
Now uses an adaptive per-rule threshold based on available data: - 10+ merged PRs: require 3+ PRs per rule - 3-9 merged PRs: require 2+ PRs per rule - 1-2 merged PRs: require 1+ PR per rule + limited-data warning in header - 0 merged PRs: skip that file (no data to learn from) Co-Authored-By: Claude Sonnet 4.6 (1M context) --- .../.claude/commands/guidance.generate.md | 30 +++++++++++++++---- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/workflows/guidance-generator/.claude/commands/guidance.generate.md b/workflows/guidance-generator/.claude/commands/guidance.generate.md index 59158cec..c4ff5197 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.generate.md +++ b/workflows/guidance-generator/.claude/commands/guidance.generate.md @@ -17,7 +17,7 @@ Fetching PRs from org/repo... 147 total Fetching per-PR details... Done Synthesizing patterns... - CVE: 14 rules extracted (3-PR threshold applied) + CVE: 14 rules extracted (threshold: 3 PRs, or 1 if limited data) Bugfix: 11 rules extracted Writing guidance files... Done @@ -322,9 +322,16 @@ Required sections (missing caused REQUEST_CHANGES in N PRs): ... ``` -**If a section has no data meeting the 3-PR threshold, omit that section entirely.** +**Threshold rules — adapt based on available data:** +- 10+ merged PRs in bucket → require 3+ PRs per rule (standard threshold) +- 3–9 merged PRs → require 2+ PRs per rule +- 1–2 merged PRs → require 1+ PR per rule; add a `limited-data` warning in the file header + +**If a section has no rules meeting the applicable threshold, omit that section entirely.** Do not write sections with placeholder text or "not enough data" notes — just omit them. +**If a bucket has 0 merged PRs**, skip that guidance file entirely and log why. + **If only one bucket had data** (e.g., no CVE PRs found), only generate the file for the bucket that had data. Log which file was skipped and why. @@ -458,7 +465,7 @@ Done. 
Repository: https://github.com/ Analyzed: CVE PRs ( merged, closed) Bugfix PRs ( merged, closed) -Rules: CVE rules, bugfix rules (3-PR threshold applied) +Rules: CVE rules, bugfix rules (adaptive threshold applied) Files generated: artifacts/guidance//output/cve-fix-guidance.md @@ -493,9 +500,20 @@ Artifacts: artifacts/guidance// ## Notes -### Empty Buckets -If a bucket has fewer than 3 merged PRs, skip that guidance file entirely. -Log: "Skipping CVE guidance — only N merged CVE PRs found (minimum 3 required)." +### Limited Data +Never skip a guidance file just because a bucket has few merged PRs. +Only skip if the bucket has **0 merged PRs**. + +For small datasets, apply an adaptive threshold and add a warning to the file header: + +```markdown + +``` + +This gives the workflow something to work with while signalling to reviewers +that the file should be revisited once more PRs accumulate. + +Log: "CVE bucket has N merged PR(s) — generating with limited-data warning." ### Repos with No Matching PRs If neither bucket has data, the repo likely uses non-standard PR naming. From b0716a2e411f1ecf192e40e04cf6ef175ff9fb4a Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Sun, 29 Mar 2026 18:00:27 -0400 Subject: [PATCH 03/23] fix: sanitize control characters in PR fetch to prevent silent failures PR review comments and closing comments can contain control characters (null bytes, form feeds, etc.) that break jq JSON construction when passed through shell variables via --arg. 
Changes: - Add sanitize_str() helper using tr to strip non-printable chars from title and branch fields at extraction time - Add gsub for U+0000-U+001F control chars inside jq for review bodies and closing comments before they enter shell variables - Add explicit error handling per PR: failed PRs now print a WARNING with the jq error instead of being silently dropped - Both guidance.generate and guidance.update updated Co-Authored-By: Claude Sonnet 4.6 (1M context) --- .../.claude/commands/guidance.generate.md | 42 +++++++++++++++---- .../.claude/commands/guidance.update.md | 40 +++++++++++++++--- 2 files changed, 69 insertions(+), 13 deletions(-) diff --git a/workflows/guidance-generator/.claude/commands/guidance.generate.md b/workflows/guidance-generator/.claude/commands/guidance.generate.md index c4ff5197..e31f2930 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.generate.md +++ b/workflows/guidance-generator/.claude/commands/guidance.generate.md @@ -141,18 +141,25 @@ For closed PRs, also fetch the last 2 comments (closing context). Process each bucket the same way. Replace `$META_FILE` and `$OUT_FILE` accordingly. 
```bash +# Strip control characters from a string (keeps printable ASCII + tab + newline) +sanitize_str() { + tr -cd '[:print:]\t\n' +} + fetch_pr_details() { local META_FILE="$1" local OUT_FILE="$2" local COUNT=$(jq 'length' "$META_FILE") + local FAILED=0 echo "[]" > "$OUT_FILE" for i in $(seq 0 $((COUNT - 1))); do NUMBER=$(jq -r ".[$i].number" "$META_FILE") STATE=$(jq -r ".[$i].state" "$META_FILE") - TITLE=$(jq -r ".[$i].title" "$META_FILE") - BRANCH=$(jq -r ".[$i].headRefName" "$META_FILE") + # Sanitize string fields at extraction time to strip control characters + TITLE=$(jq -r ".[$i].title" "$META_FILE" | sanitize_str) + BRANCH=$(jq -r ".[$i].headRefName" "$META_FILE" | sanitize_str) LABELS=$(jq -c "[.[$i].labels[].name]" "$META_FILE") # Fetch files and reviews in one call @@ -161,23 +168,31 @@ fetch_pr_details() { FILES=$(echo "$PR_DETAIL" | jq -c '[.files[].path]') - # Extract only REQUEST_CHANGES review bodies (trimmed to 200 chars) + # Extract REQUEST_CHANGES review bodies — sanitize inside jq before truncating CHANGES_REQ=$(echo "$PR_DETAIL" | jq -c '[ .reviews[] | select(.state == "CHANGES_REQUESTED") | - .body | gsub("\\n"; " ") | .[0:200] + .body | + gsub("[\\u0000-\\u0008\\u000b-\\u001f\\u007f]"; "") | + gsub("\\n|\\r"; " ") | + .[0:200] ]') - # For closed PRs: get last 2 comments for closing context + # For closed PRs: get last 2 comments, sanitize inside jq CLOSE_REASON="null" if [ "$STATE" = "CLOSED" ]; then CLOSE_REASON=$(gh pr view "$NUMBER" --repo "$REPO" \ --json comments \ - --jq '.comments | .[-2:] | map(.body | gsub("\\n"; " ") | .[0:200]) | join(" | ")' \ + --jq '.comments | .[-2:] | map( + .body | + gsub("[\\u0000-\\u0008\\u000b-\\u001f\\u007f]"; "") | + gsub("\\n|\\r"; " ") | + .[0:200] + ) | join(" | ")' \ 2>/dev/null | jq -Rs '.') fi - # Append compact record + # Build compact record — capture jq errors per PR instead of silently dropping RECORD=$(jq -n \ --argjson number "$NUMBER" \ --arg state "$STATE" \ @@ -189,11 +204,22 @@ 
fetch_pr_details() { --argjson close_reason "$CLOSE_REASON" \ '{number: $number, state: $state, title: $title, branch: $branch, labels: $labels, files: $files, - changes_requested: $changes_requested, close_reason: $close_reason}') + changes_requested: $changes_requested, close_reason: $close_reason}' \ + 2>/tmp/guidance-jq-err.txt) + + if [ $? -ne 0 ]; then + echo " WARNING: PR #$NUMBER skipped — jq error: $(cat /tmp/guidance-jq-err.txt)" + FAILED=$((FAILED + 1)) + continue + fi jq --argjson rec "$RECORD" '. + [$rec]' "$OUT_FILE" > "${OUT_FILE}.tmp" \ && mv "${OUT_FILE}.tmp" "$OUT_FILE" done + + if [ "$FAILED" -gt 0 ]; then + echo " WARNING: $FAILED PR(s) skipped due to unparseable content. Check raw data in artifacts." + fi } fetch_pr_details \ diff --git a/workflows/guidance-generator/.claude/commands/guidance.update.md b/workflows/guidance-generator/.claude/commands/guidance.update.md index 4c161955..cdb00c02 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.update.md +++ b/workflows/guidance-generator/.claude/commands/guidance.update.md @@ -178,38 +178,57 @@ echo " Bugfix bucket: $NEW_BUGFIX new PRs" Same as `/guidance.generate` — files + reviews per PR, closing context for closed PRs. 
```bash +# Strip control characters from a string (keeps printable ASCII + tab + newline) +sanitize_str() { + tr -cd '[:print:]\t\n' +} + fetch_pr_details() { local META_FILE="$1" local OUT_FILE="$2" local COUNT=$(jq 'length' "$META_FILE") + local FAILED=0 echo "[]" > "$OUT_FILE" for i in $(seq 0 $((COUNT - 1))); do NUMBER=$(jq -r ".[$i].number" "$META_FILE") STATE=$(jq -r ".[$i].state" "$META_FILE") - TITLE=$(jq -r ".[$i].title" "$META_FILE") - BRANCH=$(jq -r ".[$i].headRefName" "$META_FILE") + # Sanitize string fields at extraction time to strip control characters + TITLE=$(jq -r ".[$i].title" "$META_FILE" | sanitize_str) + BRANCH=$(jq -r ".[$i].headRefName" "$META_FILE" | sanitize_str) LABELS=$(jq -c "[.[$i].labels[].name]" "$META_FILE") PR_DETAIL=$(gh pr view "$NUMBER" --repo "$REPO" \ --json files,reviews 2>/dev/null) FILES=$(echo "$PR_DETAIL" | jq -c '[.files[].path]') + + # Extract REQUEST_CHANGES review bodies — sanitize inside jq before truncating CHANGES_REQ=$(echo "$PR_DETAIL" | jq -c '[ .reviews[] | select(.state == "CHANGES_REQUESTED") | - .body | gsub("\\n"; " ") | .[0:200] + .body | + gsub("[\\u0000-\\u0008\\u000b-\\u001f\\u007f]"; "") | + gsub("\\n|\\r"; " ") | + .[0:200] ]') + # For closed PRs: get last 2 comments, sanitize inside jq CLOSE_REASON="null" if [ "$STATE" = "CLOSED" ]; then CLOSE_REASON=$(gh pr view "$NUMBER" --repo "$REPO" \ --json comments \ - --jq '.comments | .[-2:] | map(.body | gsub("\\n"; " ") | .[0:200]) | join(" | ")' \ + --jq '.comments | .[-2:] | map( + .body | + gsub("[\\u0000-\\u0008\\u000b-\\u001f\\u007f]"; "") | + gsub("\\n|\\r"; " ") | + .[0:200] + ) | join(" | ")' \ 2>/dev/null | jq -Rs '.') fi + # Build compact record — capture jq errors per PR instead of silently dropping RECORD=$(jq -n \ --argjson number "$NUMBER" \ --arg state "$STATE" \ @@ -221,11 +240,22 @@ fetch_pr_details() { --argjson close_reason "$CLOSE_REASON" \ '{number: $number, state: $state, title: $title, branch: $branch, labels: $labels, files: $files, 
- changes_requested: $changes_requested, close_reason: $close_reason}') + changes_requested: $changes_requested, close_reason: $close_reason}' \ + 2>/tmp/guidance-jq-err.txt) + + if [ $? -ne 0 ]; then + echo " WARNING: PR #$NUMBER skipped — jq error: $(cat /tmp/guidance-jq-err.txt)" + FAILED=$((FAILED + 1)) + continue + fi jq --argjson rec "$RECORD" '. + [$rec]' "$OUT_FILE" > "${OUT_FILE}.tmp" \ && mv "${OUT_FILE}.tmp" "$OUT_FILE" done + + if [ "$FAILED" -gt 0 ]; then + echo " WARNING: $FAILED PR(s) skipped due to unparseable content. Check raw data in artifacts." + fi } fetch_pr_details \ From 1e56d83d2db4878dc295404422a8d6b61898ca9d Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Sun, 29 Mar 2026 18:07:50 -0400 Subject: [PATCH 04/23] feat: add --pr flag to guidance.generate and guidance.update Allows users to analyze specific PRs of their choice instead of fetching the full repo history. Usage: /guidance.generate --pr , /guidance.update --pr Accepts full GitHub PR URLs or plain PR numbers, comma-separated. 
Behavior: - Skips bulk gh pr list entirely; fetches only the specified PRs - Still auto-classifies into CVE/bugfix buckets by title/branch pattern - PRs that match neither pattern are included in both buckets with a NOTE, letting Claude classify them during synthesis - Generated file header includes manual-selection field listing the PRs - last-analyzed date is still updated to today in guidance.update Co-Authored-By: Claude Sonnet 4.6 (1M context) --- .../.claude/commands/guidance.generate.md | 163 ++++++++++++++---- .../.claude/commands/guidance.update.md | 134 ++++++++++---- 2 files changed, 231 insertions(+), 66 deletions(-) diff --git a/workflows/guidance-generator/.claude/commands/guidance.generate.md b/workflows/guidance-generator/.claude/commands/guidance.generate.md index e31f2930..6c99833f 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.generate.md +++ b/workflows/guidance-generator/.claude/commands/guidance.generate.md @@ -36,12 +36,17 @@ Artifacts: artifacts/guidance/org-repo/ ``` /guidance.generate [--cve-only] [--bugfix-only] [--limit N] +/guidance.generate --pr [,...] ``` - `repo-url`: Full GitHub URL (e.g., `https://github.com/org/repo`) or `org/repo` - `--cve-only`: Skip bugfix analysis - `--bugfix-only`: Skip CVE analysis - `--limit N`: Max PRs to fetch per bucket (default: 100, min: 20) +- `--pr `: Comma-separated PR URLs or numbers to analyze instead of fetching all PRs. + Skips bulk fetch entirely. Accepts full URLs (`https://github.com/org/repo/pull/123`) + or plain numbers (`123`). The generated file will include a `manual-selection` note + in its header. ## Process @@ -50,6 +55,9 @@ Artifacts: artifacts/guidance/org-repo/ Extract `REPO` in `org/repo` format from the provided URL or slug. If not provided, ask: "What is the GitHub repository URL?" +Parse `--pr` into a comma-separated list of PR numbers. 
Accept both full GitHub +PR URLs and plain numbers: + ```bash # Validate gh auth gh auth status || { echo "ERROR: gh not authenticated. Run 'gh auth login'"; exit 1; } @@ -63,6 +71,24 @@ gh repo view "$REPO" --json name > /dev/null 2>&1 || { # Derive a safe slug for directory names (replace / with -) REPO_SLUG=$(echo "$REPO" | tr '/' '-') +# Parse --pr flag: extract PR numbers from URLs or plain numbers +SPECIFIC_PR_NUMBERS="" +if [ -n "$PR_REFS" ]; then + IFS=',' read -ra PR_LIST <<< "$PR_REFS" + for PR_REF in "${PR_LIST[@]}"; do + PR_REF=$(echo "$PR_REF" | tr -d ' ') + if [[ "$PR_REF" =~ github\.com/[^/]+/[^/]+/pull/([0-9]+) ]]; then + SPECIFIC_PR_NUMBERS="$SPECIFIC_PR_NUMBERS ${BASH_REMATCH[1]}" + elif [[ "$PR_REF" =~ ^[0-9]+$ ]]; then + SPECIFIC_PR_NUMBERS="$SPECIFIC_PR_NUMBERS $PR_REF" + else + echo "WARNING: Could not parse PR reference '$PR_REF' — skipping" + fi + done + SPECIFIC_PR_NUMBERS=$(echo "$SPECIFIC_PR_NUMBERS" | tr -s ' ' | sed 's/^ //') + echo "Manual PR mode: analyzing PR(s) $SPECIFIC_PR_NUMBERS" +fi + # Setup directories mkdir -p "artifacts/guidance/$REPO_SLUG/raw" mkdir -p "artifacts/guidance/$REPO_SLUG/analysis" @@ -72,51 +98,117 @@ mkdir -p "/tmp/guidance-gen/$REPO_SLUG" ### 2. Fetch PR Metadata (Pass 1 — lightweight) -Fetch compact metadata for all recent PRs. No PR bodies, no file lists yet. 
+**If `--pr` was specified**, skip bulk fetch and build the metadata list directly +from the given PR numbers: ```bash LIMIT="${LIMIT:-100}" -gh pr list \ - --repo "$REPO" \ - --state all \ - --limit 200 \ - --json number,title,state,mergedAt,closedAt,labels,headRefName,latestReviews \ - > "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" - -TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json") -echo "Fetched $TOTAL PRs from $REPO" +if [ -n "$SPECIFIC_PR_NUMBERS" ]; then + # Manual mode: fetch metadata only for the specified PRs + echo "[]" > "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" + for NUMBER in $SPECIFIC_PR_NUMBERS; do + PR_META=$(gh pr view "$NUMBER" --repo "$REPO" \ + --json number,title,state,mergedAt,closedAt,labels,headRefName,latestReviews \ + 2>/dev/null) + if [ $? -ne 0 ] || [ -z "$PR_META" ]; then + echo "WARNING: Could not fetch PR #$NUMBER — skipping" + continue + fi + jq --argjson meta "$PR_META" '. + [$meta]' \ + "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/all-prs.json.tmp" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/all-prs.json.tmp" \ + "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" + done + TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json") + echo "Loaded $TOTAL specified PR(s) from $REPO" +else + # Auto mode: bulk fetch all recent PRs + gh pr list \ + --repo "$REPO" \ + --state all \ + --limit 200 \ + --json number,title,state,mergedAt,closedAt,labels,headRefName,latestReviews \ + > "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" + TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json") + echo "Fetched $TOTAL PRs from $REPO" +fi ``` ### 3. Filter into Buckets Use jq to split into CVE and bugfix buckets based on title and branch patterns. -A PR cannot be in both buckets — CVE PRs take priority. + +In **auto mode**: CVE PRs take priority — a PR cannot be in both buckets. 
+In **manual mode (`--pr`)**: classify normally, but if a specified PR matches
+neither pattern, include it in both buckets and let Claude determine during
+synthesis which guidance file it informs. Never silently drop a user-specified PR.
 
```bash
-# CVE bucket: title or branch matches CVE pattern
-jq --argjson limit "$LIMIT" '[
-  .[] | select(
-    (.title | test("CVE-[0-9]{4}-[0-9]+|^[Ss]ecurity:|^fix\\(cve\\):|^Fix CVE"; "i")) or
-    (.headRefName | test("^fix/cve-|^security/cve-"; "i"))
-  )
-] | .[:$limit]' \
-  "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \
-  > "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json"
-
-# Bugfix bucket: title or branch matches bug pattern, exclude CVE PRs
-jq --argjson limit "$LIMIT" '[
-  .[] | select(
+# Parens stay double-escaped: a single \( inside a jq string starts interpolation
+CVE_PATTERN='CVE-[0-9]{4}-[0-9]+|^[Ss]ecurity:|^fix\\(cve\\):|^Fix CVE'
+CVE_BRANCH_PATTERN='^fix/cve-|^security/cve-'
+BUGFIX_PATTERN='^fix[:(]|^bugfix|^bug[[:space:]]fix|closes[[:space:]]#[0-9]+|fixes[[:space:]]#[0-9]+'
+BUGFIX_BRANCH_PATTERN='^(bugfix|fix|bug)/'
+
+if [ -n "$SPECIFIC_PR_NUMBERS" ]; then
+  # Manual mode: classify each PR, fallback to both buckets if unmatched
+  jq '[.[] | select(
+    (.title | test("'"$CVE_PATTERN"'"; "i")) or
+    (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i"))
+  )]' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \
+    > "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json"
+
+  jq '[.[] | select(
     (
-      (.title | test("^fix[:(]|^bugfix|^bug[[:space:]]fix|closes[[:space:]]#[0-9]+|fixes[[:space:]]#[0-9]+"; "i")) or
-      (.headRefName | test("^(bugfix|fix|bug)/"; "i"))
+      (.title | test("'"$BUGFIX_PATTERN"'"; "i")) or
+      (.headRefName | test("'"$BUGFIX_BRANCH_PATTERN"'"; "i"))
    ) and
    (.title | test("CVE-[0-9]{4}-[0-9]+"; "i") | not) and
    (.headRefName | test("^fix/cve-"; "i") | not)
-  )
-] | .[:$limit]' \
-  "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \
-  > "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json"
+  )]' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \
+    > "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json"
+
+  # Any PR that matched
neither bucket: add to both with a warning + UNMATCHED=$(jq '[.[] | select( + ((.title | test("'"$CVE_PATTERN"'"; "i")) or (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i")) | not) and + ((.title | test("'"$BUGFIX_PATTERN"'"; "i")) or (.headRefName | test("'"$BUGFIX_BRANCH_PATTERN"'"; "i")) | not) + )]' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json") + UNMATCHED_COUNT=$(echo "$UNMATCHED" | jq 'length') + if [ "$UNMATCHED_COUNT" -gt 0 ]; then + UNMATCHED_NUMS=$(echo "$UNMATCHED" | jq -r '.[].number' | tr '\n' ',' | sed 's/,$//') + echo " NOTE: PR(s) #$UNMATCHED_NUMS did not match CVE or bugfix patterns — included in both buckets for Claude to classify" + jq --argjson extra "$UNMATCHED" '. + $extra' \ + "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" > "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json.tmp" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json.tmp" "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" + jq --argjson extra "$UNMATCHED" '. + $extra' \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" > "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json.tmp" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json.tmp" "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" + fi +else + # Auto mode: strict filtering, CVE takes priority + jq --argjson limit "$LIMIT" '[ + .[] | select( + (.title | test("'"$CVE_PATTERN"'"; "i")) or + (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i")) + ) + ] | .[:$limit]' \ + "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" + + jq --argjson limit "$LIMIT" '[ + .[] | select( + ( + (.title | test("'"$BUGFIX_PATTERN"'"; "i")) or + (.headRefName | test("'"$BUGFIX_BRANCH_PATTERN"'"; "i")) + ) and + (.title | test("CVE-[0-9]{4}-[0-9]+"; "i") | not) and + (.headRefName | test("^fix/cve-"; "i") | not) + ) + ] | .[:$limit]' \ + "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" +fi CVE_TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") CVE_MERGED=$(jq 
'[.[] | select(.state == "MERGED")] | length' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") @@ -290,7 +382,16 @@ From the analysis files, generate the final guidance files. - Evidence counts are inline and terse: `(N/M merged)`, `(N closed PRs)` - No full PR examples — only the distilled pattern -**CVE guidance file template** — write to `artifacts/guidance//output/cve-fix-guidance.md`: +**CVE guidance file template** — write to `artifacts/guidance//output/cve-fix-guidance.md`. + +When in manual PR mode, the header must note which PRs were used: + +```markdown +# CVE Fix Guidance — + +``` + +In auto mode, omit the `manual-selection` field: ```markdown # CVE Fix Guidance — diff --git a/workflows/guidance-generator/.claude/commands/guidance.update.md b/workflows/guidance-generator/.claude/commands/guidance.update.md index cdb00c02..2df1c018 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.update.md +++ b/workflows/guidance-generator/.claude/commands/guidance.update.md @@ -35,9 +35,14 @@ Updating files and creating PR... https://github.com/org/repo/pull/103 ``` /guidance.update +/guidance.update --pr [,...] ``` - `repo-url`: Full GitHub URL or `org/repo` slug +- `--pr `: Comma-separated PR URLs or numbers to merge into existing guidance + instead of fetching all PRs since the last-analyzed date. Accepts full URLs + (`https://github.com/org/repo/pull/123`) or plain numbers (`123`). The + `last-analyzed` date in the file header is still updated to today. 
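+
+For example (the repo slug and PR numbers shown are hypothetical):
+
+```
+/guidance.update org/repo --pr 123,456
+/guidance.update org/repo --pr https://github.com/org/repo/pull/123
+```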
## Process @@ -52,6 +57,25 @@ gh repo view "$REPO" --json name > /dev/null 2>&1 || { } REPO_SLUG=$(echo "$REPO" | tr '/' '-') + +# Parse --pr flag: extract PR numbers from URLs or plain numbers +SPECIFIC_PR_NUMBERS="" +if [ -n "$PR_REFS" ]; then + IFS=',' read -ra PR_LIST <<< "$PR_REFS" + for PR_REF in "${PR_LIST[@]}"; do + PR_REF=$(echo "$PR_REF" | tr -d ' ') + if [[ "$PR_REF" =~ github\.com/[^/]+/[^/]+/pull/([0-9]+) ]]; then + SPECIFIC_PR_NUMBERS="$SPECIFIC_PR_NUMBERS ${BASH_REMATCH[1]}" + elif [[ "$PR_REF" =~ ^[0-9]+$ ]]; then + SPECIFIC_PR_NUMBERS="$SPECIFIC_PR_NUMBERS $PR_REF" + else + echo "WARNING: Could not parse PR reference '$PR_REF' — skipping" + fi + done + SPECIFIC_PR_NUMBERS=$(echo "$SPECIFIC_PR_NUMBERS" | tr -s ' ' | sed 's/^ //') + echo "Manual PR mode: merging PR(s) $SPECIFIC_PR_NUMBERS into existing guidance" +fi + mkdir -p "artifacts/guidance/$REPO_SLUG/raw" mkdir -p "artifacts/guidance/$REPO_SLUG/analysis" mkdir -p "artifacts/guidance/$REPO_SLUG/output" @@ -119,54 +143,94 @@ fi echo "Fetching PRs since $LAST_DATE..." ``` -### 3. Fetch New PRs Since Last Analysis (Pass 1) +### 3. Fetch New PRs (Pass 1) + +**If `--pr` was specified**, skip the date-based bulk fetch and load only the given PRs: ```bash -gh pr list \ - --repo "$REPO" \ - --state all \ - --limit 200 \ - --search "created:>$LAST_DATE" \ - --json number,title,state,mergedAt,closedAt,labels,headRefName,latestReviews \ - > "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" - -NEW_TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json") -echo "Fetched $NEW_TOTAL new PRs since $LAST_DATE" - -if [ "$NEW_TOTAL" -eq 0 ]; then - echo "No new PRs found since $LAST_DATE. Guidance files are already up to date." 
- rm -rf "/tmp/guidance-gen/$REPO_SLUG" - exit 0 +if [ -n "$SPECIFIC_PR_NUMBERS" ]; then + # Manual mode: fetch only specified PRs + echo "[]" > "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" + for NUMBER in $SPECIFIC_PR_NUMBERS; do + PR_META=$(gh pr view "$NUMBER" --repo "$REPO" \ + --json number,title,state,mergedAt,closedAt,labels,headRefName,latestReviews \ + 2>/dev/null) + if [ $? -ne 0 ] || [ -z "$PR_META" ]; then + echo "WARNING: Could not fetch PR #$NUMBER — skipping" + continue + fi + jq --argjson meta "$PR_META" '. + [$meta]' \ + "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json.tmp" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json.tmp" \ + "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" + done + NEW_TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json") + echo "Loaded $NEW_TOTAL specified PR(s)" +else + # Auto mode: fetch all PRs since last-analyzed date + gh pr list \ + --repo "$REPO" \ + --state all \ + --limit 200 \ + --search "created:>$LAST_DATE" \ + --json number,title,state,mergedAt,closedAt,labels,headRefName,latestReviews \ + > "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" + NEW_TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json") + echo "Fetched $NEW_TOTAL new PRs since $LAST_DATE" + if [ "$NEW_TOTAL" -eq 0 ]; then + echo "No new PRs found since $LAST_DATE. Guidance files are already up to date." + rm -rf "/tmp/guidance-gen/$REPO_SLUG" + exit 0 + fi fi ``` ### 4. Filter New PRs into Buckets -Apply the same filters as `/guidance.generate`: +In **auto mode**: CVE PRs take priority. In **manual mode (`--pr`)**: if a +specified PR matches neither pattern, include it in both buckets for Claude to classify. 
```bash
-# CVE bucket
-jq '[
-  .[] | select(
-    (.title | test("CVE-[0-9]{4}-[0-9]+|^[Ss]ecurity:|^fix\\(cve\\):|^Fix CVE"; "i")) or
-    (.headRefName | test("^fix/cve-|^security/cve-"; "i"))
-  )
-]' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" \
+# Parens stay double-escaped: a single \( inside a jq string starts interpolation
+CVE_PATTERN='CVE-[0-9]{4}-[0-9]+|^[Ss]ecurity:|^fix\\(cve\\):|^Fix CVE'
+CVE_BRANCH_PATTERN='^fix/cve-|^security/cve-'
+BUGFIX_PATTERN='^fix[:(]|^bugfix|^bug[[:space:]]fix|closes[[:space:]]#[0-9]+|fixes[[:space:]]#[0-9]+'
+BUGFIX_BRANCH_PATTERN='^(bugfix|fix|bug)/'
+
+jq '[.[] | select(
+  (.title | test("'"$CVE_PATTERN"'"; "i")) or
+  (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i"))
+)]' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" \
   > "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json"
 
-# Bugfix bucket (excluding CVE)
-jq '[
-  .[] | select(
-    (
-      (.title | test("^fix[:(]|^bugfix|^bug[[:space:]]fix|closes[[:space:]]#[0-9]+|fixes[[:space:]]#[0-9]+"; "i")) or
-      (.headRefName | test("^(bugfix|fix|bug)/"; "i"))
-    ) and
-    (.title | test("CVE-[0-9]{4}-[0-9]+"; "i") | not) and
-    (.headRefName | test("^fix/cve-"; "i") | not)
-  )
-]' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" \
+jq '[.[] | select(
+  (
+    (.title | test("'"$BUGFIX_PATTERN"'"; "i")) or
+    (.headRefName | test("'"$BUGFIX_BRANCH_PATTERN"'"; "i"))
+  ) and
+  (.title | test("CVE-[0-9]{4}-[0-9]+"; "i") | not) and
+  (.headRefName | test("^fix/cve-"; "i") | not)
+)]' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" \
   > "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json"
 
+# In manual mode: add unmatched PRs to both buckets
+if [ -n "$SPECIFIC_PR_NUMBERS" ]; then
+  UNMATCHED=$(jq '[.[] | select(
+    ((.title | test("'"$CVE_PATTERN"'"; "i")) or (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i")) | not) and
+    ((.title | test("'"$BUGFIX_PATTERN"'"; "i")) or (.headRefName | test("'"$BUGFIX_BRANCH_PATTERN"'"; "i")) | not)
+  )]' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json")
+  UNMATCHED_COUNT=$(echo "$UNMATCHED" | jq 'length')
+  if [ "$UNMATCHED_COUNT" -gt 0 ]; then
+
UNMATCHED_NUMS=$(echo "$UNMATCHED" | jq -r '.[].number' | tr '\n' ',' | sed 's/,$//') + echo " NOTE: PR(s) #$UNMATCHED_NUMS did not match CVE or bugfix patterns — included in both buckets" + for META_FILE in "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json"; do + jq --argjson extra "$UNMATCHED" '. + $extra' "$META_FILE" > "${META_FILE}.tmp" \ + && mv "${META_FILE}.tmp" "$META_FILE" + done + fi +fi + NEW_CVE=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json") NEW_BUGFIX=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json") echo " CVE bucket: $NEW_CVE new PRs" From 7118bb92081f009cc01a79d669b3129a0b1b41ea Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Sun, 29 Mar 2026 18:10:30 -0400 Subject: [PATCH 05/23] docs: update README and ambient.json with --pr flag and adaptive threshold - README: document --pr flag usage for both commands, add rule threshold table, note --pr mode skips Pass 1 bulk fetch - ambient.json: update description, systemPrompt core principles, and startupPrompt to reflect --pr flag, adaptive threshold, and control character sanitization Co-Authored-By: Claude Sonnet 4.6 (1M context) --- .../guidance-generator/.ambient/ambient.json | 6 +-- workflows/guidance-generator/README.md | 38 +++++++++++++++++-- 2 files changed, 38 insertions(+), 6 deletions(-) diff --git a/workflows/guidance-generator/.ambient/ambient.json b/workflows/guidance-generator/.ambient/ambient.json index e2862ac8..850efe46 100644 --- a/workflows/guidance-generator/.ambient/ambient.json +++ b/workflows/guidance-generator/.ambient/ambient.json @@ -1,8 +1,8 @@ { "name": "PR Guidance Generator", - "description": "Analyze a repository's merged and closed fix PRs to generate compact guidance files that teach automated workflows (CVE Fixer, Bugfix) how to create PRs that match that repo's conventions.", - "systemPrompt": "You are a PR pattern analyst for the Ambient Code Platform. 
Your role is to help teams generate and maintain guidance files that teach automated fix workflows how to create pull requests matching their repository's conventions.\n\nKEY RESPONSIBILITIES:\n- Fetch and analyze historical fix PRs from a GitHub repository\n- Extract patterns from merged PRs (what works) and closed PRs (what to avoid)\n- Generate compact, high-signal guidance files — no fluff, no verbose examples\n- Create pull requests in target repositories with the generated guidance files\n- Update existing guidance files with patterns from new PRs\n\nWORKFLOW METHODOLOGY:\n1. GENERATE - Analyze a repo's PR history and create guidance files, then open a PR\n2. UPDATE - Fetch PRs since last analysis, merge new patterns, open an update PR\n\nAVAILABLE COMMANDS:\n/guidance.generate - Analyze a repo and generate guidance files, then create a PR\n/guidance.update - Refresh existing guidance with recent PRs and create an update PR\n\nOUTPUT LOCATIONS:\n- Raw PR data: artifacts/guidance//raw/\n- Analysis output: artifacts/guidance//analysis/\n- Generated files: artifacts/guidance//output/\n\nCORE PRINCIPLES:\n- Guidance files must stay under 80 lines — dense signal, no narrative\n- Only include rules observed in 3+ PRs\n- Merged PRs = positive examples. Closed PRs = what to avoid.\n- Review REQUEST_CHANGES comments reveal what workflows should do proactively\n- Never guess patterns — only state what the data supports", - "startupPrompt": "Welcome! 
I'm your PR Guidance Generator.\n\nI analyze a repository's fix PR history to generate compact guidance files that teach automated workflows — like CVE Fixer and Bugfix — how to create PRs your team will actually merge.\n\nAVAILABLE COMMANDS:\n/guidance.generate - Analyze PR history and generate guidance files, then open a PR in the repo\n/guidance.update - Refresh existing guidance with recent PRs and open an update PR\n\nFlags for /guidance.generate:\n --cve-only Only generate CVE fix guidance\n --bugfix-only Only generate bugfix guidance\n --limit N Max PRs to analyze per bucket (default: 100)\n\nGETTING STARTED:\nRun /guidance.generate with a GitHub repository URL to begin.\n\nWhat repository would you like to analyze?", + "description": "Analyze a repository's merged and closed fix PRs to generate compact guidance files that teach automated workflows (CVE Fixer, Bugfix) how to create PRs that match that repo's conventions. Supports full history analysis or targeted analysis of specific PRs.", + "systemPrompt": "You are a PR pattern analyst for the Ambient Code Platform. Your role is to help teams generate and maintain guidance files that teach automated fix workflows how to create pull requests matching their repository's conventions.\n\nKEY RESPONSIBILITIES:\n- Fetch and analyze historical fix PRs from a GitHub repository\n- Extract patterns from merged PRs (what works) and closed PRs (what to avoid)\n- Generate compact, high-signal guidance files — no fluff, no verbose examples\n- Create pull requests in target repositories with the generated guidance files\n- Update existing guidance files with patterns from new PRs\n\nWORKFLOW METHODOLOGY:\n1. GENERATE - Analyze a repo's PR history (or specific PRs) and create guidance files, then open a PR\n2. 
UPDATE - Fetch PRs since last analysis (or specific PRs), merge new patterns, open an update PR\n\nAVAILABLE COMMANDS:\n/guidance.generate - Analyze a repo and generate guidance files, then create a PR\n/guidance.update - Refresh existing guidance with recent PRs and create an update PR\n\nBoth commands accept --pr <number>[,<number>...] to analyze specific PRs instead of fetching all.\n\nOUTPUT LOCATIONS:\n- Raw PR data: artifacts/guidance/<repo-slug>/raw/\n- Analysis output: artifacts/guidance/<repo-slug>/analysis/\n- Generated files: artifacts/guidance/<repo-slug>/output/\n\nCORE PRINCIPLES:\n- Guidance files must stay under 80 lines — dense signal, no narrative\n- Use adaptive rule threshold based on bucket size: 3+ PRs (10+ merged), 2+ PRs (3-9 merged), 1+ PR with limited-data warning (1-2 merged). Skip file only if 0 merged PRs.\n- In --pr mode: analyze only the specified PRs. Never drop a user-specified PR even if it does not match CVE or bugfix bucket patterns — include it in both buckets and let synthesis classify it.\n- Merged PRs = positive examples. Closed PRs = what to avoid.\n- Review REQUEST_CHANGES comments reveal what workflows should do proactively.\n- Never guess patterns — only state what the PR data supports.\n- Sanitize control characters from all PR text fields before JSON construction to prevent silent failures.",
+  "startupPrompt": "Welcome! 
I'm your PR Guidance Generator.\n\nI analyze a repository's fix PR history to generate compact guidance files that teach automated workflows — like CVE Fixer and Bugfix — how to create PRs your team will actually merge.\n\nAVAILABLE COMMANDS:\n/guidance.generate - Analyze PR history and generate guidance files, then open a PR\n/guidance.update - Refresh existing guidance with recent PRs and open an update PR\n\nFlags (both commands):\n  --pr <pr>,<pr>   Analyze specific PRs instead of fetching all (URLs or numbers, comma-separated)\n\nFlags for /guidance.generate only:\n  --cve-only     Only generate CVE fix guidance\n  --bugfix-only  Only generate bugfix guidance\n  --limit N      Max PRs to analyze per bucket (default: 100)\n\nEXAMPLES:\n  /guidance.generate org/repo\n  /guidance.generate org/repo --pr https://github.com/org/repo/pull/42,https://github.com/org/repo/pull/87\n  /guidance.update org/repo --pr 103\n\nWhat repository would you like to analyze?",
   "results": {
     "Generated Guidance": "artifacts/guidance/**/output/*.md",
     "PR Analysis": "artifacts/guidance/**/analysis/*.md"
diff --git a/workflows/guidance-generator/README.md b/workflows/guidance-generator/README.md
index 0d558edb..edc01f5e 100644
--- a/workflows/guidance-generator/README.md
+++ b/workflows/guidance-generator/README.md
@@ -20,7 +20,7 @@ encodes them into guidance files that automated workflows read before making cha
 2. Filters into CVE and bugfix buckets based on title/branch patterns
 3. Fetches targeted details per PR: files changed + review REQUEST_CHANGES comments
 4. For closed PRs: fetches the closing context to extract "don'ts"
-5. Synthesizes rules — only patterns observed in 3+ PRs are included
+5. Synthesizes rules using an adaptive threshold based on available data
 6. Generates compact guidance files (80-line cap, one rule per line)
 7. 
Opens a PR in the target repo adding the files
@@ -28,15 +28,25 @@ encodes them into guidance files that automated workflows read before making cha
 
 ### `/guidance.generate <repo-url>`
 
-Full pipeline for a fresh repo.
+Full pipeline for a fresh repo. Analyzes all recent fix PRs automatically,
+or analyzes only the specific PRs you choose with `--pr`.
 
 ```
 /guidance.generate https://github.com/org/repo
 /guidance.generate org/repo --cve-only
 /guidance.generate org/repo --bugfix-only
 /guidance.generate org/repo --limit 50
+/guidance.generate org/repo --pr https://github.com/org/repo/pull/42,https://github.com/org/repo/pull/87
+/guidance.generate org/repo --pr 42,87
 ```
 
+Flags:
+- `--cve-only` / `--bugfix-only`: generate only one of the two guidance files
+- `--limit N`: cap the number of PRs fetched per bucket (default: 100)
+- `--pr <list>`: comma-separated PR URLs or numbers — skips bulk fetch and
+  analyzes only these PRs. Useful for seeding guidance from a curated set of
+  representative PRs. The generated file header records which PRs were used.
+
 Generates:
 - `.cve-fix/examples.md` — read by the CVE Fixer workflow (step 4.5)
 - `.bugfix/guidance.md` — read by the Bugfix workflow
@@ -49,8 +59,15 @@ merges new patterns, and opens a PR with the updates.
 
 ```
 /guidance.update https://github.com/org/repo
+/guidance.update org/repo --pr https://github.com/org/repo/pull/103
+/guidance.update org/repo --pr 103,104
 ```
 
+Flags:
+- `--pr <list>`: instead of fetching all PRs since the last-analyzed date,
+  merge only the specified PRs into existing guidance. The `last-analyzed`
+  date is still updated to today.
+
 ## Generated File Format
 
 Files are intentionally compact.
Example `.cve-fix/examples.md`: @@ -84,11 +101,26 @@ Required sections (missing caused REQUEST_CHANGES in 6 PRs): - Don't target release branches — target main (3 cases) ``` +## Rule Threshold + +Rules use an adaptive threshold based on how much data is available in each bucket: + +| Merged PRs in bucket | Min PRs per rule | +|----------------------|-----------------| +| 10+ | 3 | +| 3–9 | 2 | +| 1–2 | 1 + `WARNING: limited data` in header | +| 0 | File skipped entirely | + +This means the workflow always produces something useful, even for repos with +few fix PRs — while flagging low-confidence output clearly. + ## Token Efficiency The workflow uses a two-pass fetch strategy to minimize API calls and context size: -- **Pass 1**: Lightweight metadata for all PRs (title, branch, labels, state) +- **Pass 1**: Lightweight metadata for all PRs (title, branch, labels, state). + In `--pr` mode this pass is skipped — only the specified PRs are fetched. - **Pass 2**: Per-PR detail only for PRs in the CVE/bugfix buckets (files + reviews) - **Closed PRs only**: Fetch closing context (last 2 comments) From 4768e210622add5652bb55a8e257d37e93a0dd0f Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Mon, 30 Mar 2026 09:17:25 -0400 Subject: [PATCH 06/23] fix: treat 80-line cap as soft target, never silently drop rules on update MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The hard 80-line enforcement in guidance.update was problematic: with a file already at 74 lines, adding 3 new rules would silently drop the lowest-evidence existing rules — including ones deliberately added via --pr. Changes: - guidance.generate: 80 lines is now a formatting target, not a hard cap. All rules meeting the evidence threshold are always included. - guidance.update: remove silent drop logic entirely. New rules are always appended. 
If the file exceeds 80 lines, the line count is flagged in the PR description with a suggestion to rebuild via /guidance.generate. - README: add Line Count Behaviour section explaining the distinction. - Success criteria updated in both command files. Co-Authored-By: Claude Sonnet 4.6 (1M context) --- .../.claude/commands/guidance.generate.md | 8 +- .../.claude/commands/guidance.update.md | 27 +- workflows/guidance-generator/README.md | 13 + .../analysis/bugfix-patterns.md | 44 + .../analysis/cve-patterns.md | 41 + .../output/bugfix-guidance.md | 34 + .../output/cve-fix-guidance.md | 21 + .../eval-hub-eval-hub/raw/bugfix-prs.json | 885 ++++++++++++++++++ .../eval-hub-eval-hub/raw/cve-prs.json | 67 ++ 9 files changed, 1133 insertions(+), 7 deletions(-) create mode 100644 workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/bugfix-patterns.md create mode 100644 workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/cve-patterns.md create mode 100644 workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/bugfix-guidance.md create mode 100644 workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/cve-fix-guidance.md create mode 100644 workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/bugfix-prs.json create mode 100644 workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/cve-prs.json diff --git a/workflows/guidance-generator/.claude/commands/guidance.generate.md b/workflows/guidance-generator/.claude/commands/guidance.generate.md index 6c99833f..0f4c4eed 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.generate.md +++ b/workflows/guidance-generator/.claude/commands/guidance.generate.md @@ -376,11 +376,13 @@ DONT_multiple_cves: One CVE per PR — N closed PRs rejected for combining From the analysis files, generate the final guidance files. 
-**Hard constraints:** -- Maximum 80 lines per file +**Formatting constraints:** +- Target 80 lines per file — this is a guideline for fresh generation, not a hard truncation - No narrative paragraphs — one rule per line or a tight code block - Evidence counts are inline and terse: `(N/M merged)`, `(N closed PRs)` - No full PR examples — only the distilled pattern +- If the synthesized output naturally exceeds 80 lines (many strong patterns), + include all rules that meet the threshold. Note the line count in the PR description. **CVE guidance file template** — write to `artifacts/guidance//output/cve-fix-guidance.md`. @@ -619,7 +621,7 @@ Artifacts: artifacts/guidance// - [ ] Per-PR details fetched (files + review REQUEST_CHANGES) - [ ] Closed PRs have closing context fetched - [ ] Patterns synthesized with 3-PR minimum applied -- [ ] Guidance files are under 80 lines each +- [ ] Guidance files aim for ~80 lines (all threshold-passing rules included regardless) - [ ] Files written to artifacts/output/ - [ ] PR created in target repo with correct files in .cve-fix/ and .bugfix/ - [ ] /tmp cleaned up diff --git a/workflows/guidance-generator/.claude/commands/guidance.update.md b/workflows/guidance-generator/.claude/commands/guidance.update.md index 2df1c018..ae90cfcf 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.update.md +++ b/workflows/guidance-generator/.claude/commands/guidance.update.md @@ -378,9 +378,27 @@ Read the cloned guidance files and apply the changes from step 6. - Update the merged/closed counts in the header comment - Do NOT reorder existing rules — preserve the file structure -After editing, verify the file is still under 80 lines. If adding new rules -would push it over 80 lines, prioritize: keep all don'ts, keep rules with -highest evidence counts, drop rules with lowest counts (below 5%). +After editing, count the lines in each file. Never drop existing rules to +make room — always append new rules in full. 
If the file now exceeds 80 lines,
+note it but do not truncate:
+
+```bash
+CVE_LINES=$(wc -l < "$CVE_FILE")
+BUGFIX_LINES=$(wc -l < "$BUGFIX_FILE")
+
+# Use $'\n' for a real newline; "\n" inside double quotes stays a literal
+# backslash-n and would leak into the PR description.
+OVERSIZE_NOTE=""
+if [ "$CVE_LINES" -gt 80 ]; then
+  echo "  NOTE: .cve-fix/examples.md is now ${CVE_LINES} lines (target: 80)"
+  OVERSIZE_NOTE="${OVERSIZE_NOTE}"$'\n'"- \`.cve-fix/examples.md\` is ${CVE_LINES} lines. Consider running \`/guidance.generate\` to rebuild and consolidate."
+fi
+if [ "$BUGFIX_LINES" -gt 80 ]; then
+  echo "  NOTE: .bugfix/guidance.md is now ${BUGFIX_LINES} lines (target: 80)"
+  OVERSIZE_NOTE="${OVERSIZE_NOTE}"$'\n'"- \`.bugfix/guidance.md\` is ${BUGFIX_LINES} lines. Consider running \`/guidance.generate\` to rebuild and consolidate."
+fi
+```
+
+Include `$OVERSIZE_NOTE` in the PR description if non-empty so the reviewer
+knows the file has grown and can decide whether to trigger a full rebuild.
 
 **Update the header:**
 ```
@@ -500,7 +518,8 @@ Artifacts: artifacts/guidance/<repo-slug>/
 - [ ] Per-PR details fetched for new PRs
 - [ ] New patterns synthesized (new rules, updated counts, contradictions flagged)
 - [ ] Existing files updated in-place (no rewrites, structure preserved)
-- [ ] Both files remain under 80 lines
+- [ ] No existing rules were dropped — only additions and count updates
+- [ ] Files exceeding 80 lines are flagged in PR description (not silently truncated)
 - [ ] Header timestamps updated
 - [ ] PR created in target repo
 - [ ] /tmp cleaned up
diff --git a/workflows/guidance-generator/README.md b/workflows/guidance-generator/README.md
index edc01f5e..ec20b6ae 100644
--- a/workflows/guidance-generator/README.md
+++ b/workflows/guidance-generator/README.md
@@ -115,6 +115,19 @@ Rules use an adaptive threshold based on how much data is available in each buck
 This means the workflow always produces something useful, even for repos with
 few fix PRs — while flagging low-confidence output clearly.
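The adaptive threshold maps the number of merged PRs in a bucket to the minimum
evidence count a rule needs before it is written out. A minimal shell sketch of
that mapping (the `rule_threshold` helper name is illustrative, not part of the
workflow's command files):

```shell
# Map merged-PR count in a bucket to the minimum number of PRs a rule
# must be observed in before it is included in the guidance file.
rule_threshold() {
  merged=$1
  if [ "$merged" -ge 10 ]; then
    echo 3   # plenty of data: strict 3-PR threshold
  elif [ "$merged" -ge 3 ]; then
    echo 2   # moderate data
  elif [ "$merged" -ge 1 ]; then
    echo 1   # limited data: also add "WARNING: limited data" to the header
  else
    echo 0   # 0 merged PRs: skip the file entirely
  fi
}

rule_threshold 48   # e.g. a large bugfix bucket → prints 3
rule_threshold 1    # e.g. a single-PR CVE bucket → prints 1
```

The caller decides what a threshold of 0 means (skip) and when to emit the
limited-data warning; only the count-to-threshold mapping is shown here.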
+## Line Count Behaviour + +The 80-line target applies differently depending on the command: + +**`/guidance.generate`** — treats 80 lines as a formatting target for new files. +All rules that meet the evidence threshold are included regardless. If the natural +output exceeds 80 lines, all rules are kept and the line count is noted in the PR. + +**`/guidance.update`** — never drops existing rules to stay under 80 lines. +New rules are always appended in full. If the file grows past 80 lines, the PR +description flags it with a suggestion to run `/guidance.generate` to rebuild +and consolidate the guidance from scratch with the full updated history. + ## Token Efficiency The workflow uses a two-pass fetch strategy to minimize API calls and context size: diff --git a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/bugfix-patterns.md b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/bugfix-patterns.md new file mode 100644 index 00000000..7db724c0 --- /dev/null +++ b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/bugfix-patterns.md @@ -0,0 +1,44 @@ +# Bugfix PR Pattern Analysis — eval-hub/eval-hub + + +## Data Summary +- 48 merged, 4 closed bugfix PRs analyzed +- Threshold applied: 3+ PRs per rule + +## TITLE_FORMAT +Primary: `fix: ` (39/48 merged) +Scoped: `fix(): ` (9/48 merged) +No other prefix patterns observed. + +## SCOPE_VALUES +sidecar, collections, storage, k8s, cli, build, cancel, ci, mlflow (9 scoped PRs) + +## BRANCH_FORMAT +No strict convention. 
Patterns: +- Short descriptive (no prefix): ~33/48 (adapter-mode, storage-fix, tenant-filter) +- `fix-`: ~10/48 (fix-sidecar-shutdown, fix-lighteval) +- `fix/`: ~5/48 (fix/cancel-should-fail-unit-benchmarks) + +## TEST_FILES_REQUIRED +Go source changes almost always paired with `_test.go` counterparts (~36/48 merged) +- evaluations.go + evaluations_test.go (6+ PRs) +- job_builders.go + job_builders_test.go (6+ PRs) +- collections.go + collections_test.go (3+ PRs) + +## FEATURE_TESTS +PRs touching handler/runtime behavior include BDD test files (10+ PRs): +- tests/features/*.feature +- tests/features/step_definitions_test.go + +## CO_CHANGE_k8s +job_builders.go always paired with job_builders_test.go (6/6 PRs) + +## CO_CHANGE_storage +SQL storage changes always paired with *_test.go files (5/5 PRs) + +## LABELS +kind/fix: 11/48 PRs (23%) — optional, inconsistently applied + +## DONT_docs_only_via_fix_branch +2 closed PRs rejected for docs-only content via fix/ branches: +- #369 closed "not required", #368 closed "not relevant here" diff --git a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/cve-patterns.md b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/cve-patterns.md new file mode 100644 index 00000000..23367d2c --- /dev/null +++ b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/cve-patterns.md @@ -0,0 +1,41 @@ +# CVE PR Pattern Analysis — eval-hub/eval-hub + + + +## Data Summary +- Merged: #384 ("fix(cve): Update Go stdlib to 1.25.8 - fix 3 vulnerabilities") +- Closed: #383, #382, #379 (all Go stdlib upgrade attempts targeting 1.25.8) +- Open: #390 ("fix(cve): upgrade Go from 1.25.8 to 1.26.1") + +All 5 CVE PRs address Go stdlib vulnerabilities. This repo is a Go project. 
+ +## TITLE_FORMAT +`fix(cve): to - ` (5/5 CVE PRs) +Examples: + - "fix(cve): Update Go stdlib to 1.25.8 - fix 3 vulnerabilities" ← merged + - "fix(cve): upgrade Go from 1.25.8 to 1.26.1" + - "fix(cve): update Go toolchain to 1.25.8 for stdlib CVE fixes" + +## BRANCH_FORMAT +Manual: `fix/cve--go-stdlib--attempt-N` (3/3 manual CVE PRs) +Bot-generated: `ambient/cve-fixer-nightly-` (1/1 bot PR) + +## FILES_GO_CVE +Merged PR (#384) changed: go.mod only +Open PR (#390) changes: Containerfile + go.mod + go.sum ← more complete +Closed #383: Containerfile + go.mod (missing go.sum) +Closed #382, #379: go.mod + go.sum (missing Containerfile) + +Close reason for #382: "I am still working on this - I will get it to also update the other files" +→ Suggests Containerfile + go.mod + go.sum should all be updated together (1 closed PR rejection) + +## PROACTIVE_go_sum +go.sum should accompany go.mod changes for dependency updates. +Evidence: 3 of 5 PRs include go.sum; the 2 that don't are older attempts. + +## DONT_incomplete_files +Don't update only go.mod or only go.mod+go.sum — update Containerfile too. +(1 explicit rejection: #382 closed for missing files) + +## LABELS +No labels applied to any CVE PRs (0/5 — labels not used for CVE fixes) diff --git a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/bugfix-guidance.md b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/bugfix-guidance.md new file mode 100644 index 00000000..09172e5a --- /dev/null +++ b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/bugfix-guidance.md @@ -0,0 +1,34 @@ +# Bugfix Guidance — eval-hub/eval-hub + + +## Titles +`fix: ` (39/48 merged) +`fix(): ` (9/48 merged — use when change is scoped) + +## Branches +No strict convention enforced. 
Common patterns: +- Short descriptive: `adapter-mode`, `storage-fix`, `tenant-filter` (~33/48) +- Prefixed: `fix-` (~10/48) or `fix/` (~5/48) + +## Scope Values +Use fix() when the change is confined to one subsystem (from 9 merged PRs): +`sidecar` | `collections` | `storage` | `k8s` | `cli` | `build` | `cancel` | `ci` | `mlflow` + +## Test Requirements +Go source file changes must include corresponding `_test.go` files (~36/48 merged): +- `evaluations.go` → `evaluations_test.go` (6+ PRs) +- `job_builders.go` → `job_builders_test.go` (6+ PRs) +- `collections.go` → `collections_test.go` (3+ PRs) +- `loader.go` → `loader_test.go` (2+ PRs) + +Handler/runtime changes should also update BDD feature files (10+ PRs): +- `tests/features/*.feature` +- `tests/features/step_definitions_test.go` +- `tests/kubernetes/features/` for k8s runtime changes (4+ PRs) + +## Co-Changes +- `internal/runtimes/k8s/job_builders.go` → always include `job_builders_test.go` (6/6 PRs) +- `internal/storage/sql/*.go` → always include `*_test.go` counterpart (5/5 PRs) + +## Don'ts +- Don't submit docs-only changes via fix/ branches — 2 closed PRs rejected as "not required" / "not relevant here" (#369, #368) diff --git a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/cve-fix-guidance.md b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/cve-fix-guidance.md new file mode 100644 index 00000000..fc32402f --- /dev/null +++ b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/cve-fix-guidance.md @@ -0,0 +1,21 @@ +# CVE Fix Guidance — eval-hub/eval-hub + + +## Titles +`fix(cve): to - fix vulnerabilities` (5/5 CVE PRs) +- "fix(cve): Update Go stdlib to 1.25.8 - fix 3 vulnerabilities" ← merged +- "fix(cve): upgrade Go from 1.25.8 to 1.26.1" ← open + +## Branches +Manual: `fix/cve--go-stdlib--attempt-N` (3/3 manual CVE PRs) +Bot-generated: `ambient/cve-fixer-nightly-` + +## Files — Go stdlib CVE +Update all three files together (1 
closed PR rejected for updating only some): +- `go.mod` — toolchain/dependency version +- `go.sum` — dependency checksums +- `Containerfile` — base image / Go toolchain version + +## Don'ts +- Don't update go.mod+go.sum without also updating Containerfile (1 rejection: #382 closed "I will get it to also update the other files") +- Don't update only Containerfile+go.mod without go.sum (pattern from closed #383) diff --git a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/bugfix-prs.json b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/bugfix-prs.json new file mode 100644 index 00000000..98eafe30 --- /dev/null +++ b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/bugfix-prs.json @@ -0,0 +1,885 @@ +[ + { + "number": 388, + "state": "OPEN", + "title": "fix(deps): update Go toolchain from 1.25.8 to 1.26.1", + "branch": "ambient/cve-fixer-nightly-1774675800", + "labels": [], + "files": [ + "Containerfile", + "go.mod", + "go.sum" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 383, + "state": "CLOSED", + "title": "fix(cve): update Go to 1.25.8 to address 3 stdlib CVEs", + "branch": "ambient/cve-fixer-nightly-1774600830", + "labels": [], + "files": [ + "Containerfile", + "go.mod" + ], + "changes_requested": [], + "close_reason": "
📝 Walkthrough ## Walkthrough The Go toolchain version has been updated | ## [Codecov](https://app.codecov.io/gh/eval-hub/eval-hub/pull/383?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=eval-hub) R" + }, + { + "number": 374, + "state": "MERGED", + "title": "fix: remove github action for Ambient CVE fixer", + "branch": "fix-cve-fixer", + "labels": [], + "files": [ + ".cve-fixer.json", + ".github/workflows/daily-cve-fixer.yml" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 372, + "state": "MERGED", + "title": "fix: the ambient cli fails with download error", + "branch": "fix-ambient-cli", + "labels": [], + "files": [ + ".github/workflows/daily-cve-fixer.yml" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 369, + "state": "CLOSED", + "title": "docs: Add adapter model URL usage guide", + "branch": "fix/RHOAIENG-54862-model-url-ignored", + "labels": [], + "files": [ + "docs/adapter-model-url-usage.md" + ], + "changes_requested": [], + "close_reason": "## [Codecov](https://app.codecov.io/gh/eval-hub/eval-hub/pull/369?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=eval-hub) R | Closing this for now as not required." + }, + { + "number": 368, + "state": "CLOSED", + "title": "docs: Add MLflow artifact logging guide for adapters", + "branch": "fix/RHOAIENG-54539-mlflow-artifacts", + "labels": [], + "files": [ + "docs/mlflow-artifact-logging-guide.md" + ], + "changes_requested": [], + "close_reason": "## [Codecov](https://app.codecov.io/gh/eval-hub/eval-hub/pull/368?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=eval-hub) R | Closing this as not relevant here." 
+ }, + { + "number": 367, + "state": "CLOSED", + "title": "fix: Pass mlflow_experiment_id to adapters for run creation", + "branch": "fix/RHOAIENG-54869-mlflow-run-id", + "labels": [], + "files": [ + "docs/mlflow-run-id-integration.md", + "internal/eval_hub/runtimes/shared/jobspec.go" + ], + "changes_requested": [], + "close_reason": "
📝 Walkthrough ## Walkthrough This PR introduces MLflow integration sup" + }, + { + "number": 366, + "state": "MERGED", + "title": "fix(sidecar): exit cleanly on graceful shutdown by implementing Is() on ServerClosedError", + "branch": "fix-sidecar-shutdown", + "labels": [ + "kind/fix" + ], + "files": [ + "cmd/eval_runtime_sidecar/main.go", + "internal/eval_hub/server/server.go", + "internal/eval_runtime_sidecar/server/server.go", + "internal/eval_runtime_sidecar/server/server_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 363, + "state": "MERGED", + "title": "fix: set 'EVALHUB_MODE' env var on adapter contaienr", + "branch": "adapter-mode", + "labels": [], + "files": [ + "internal/eval_hub/runtimes/k8s/job_builders.go", + "internal/eval_hub/runtimes/k8s/job_builders_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 359, + "state": "MERGED", + "title": "fix: Update resource requests and limits for lighteval", + "branch": "fix-lighteval", + "labels": [ + "kind/fix" + ], + "files": [ + "config/providers/lighteval.yaml" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 354, + "state": "MERGED", + "title": "fix: allow '/mlflow-artifacts' path in proxy", + "branch": "sidecar-mlflow", + "labels": [], + "files": [ + "internal/eval_hub/runtimes/k8s/job_builders_test.go", + "internal/eval_runtime_sidecar/handlers/handlers.go", + "internal/eval_runtime_sidecar/handlers/handlers_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 348, + "state": "MERGED", + "title": "fix: TLS disabled in tests", + "branch": "sidecar-tests", + "labels": [], + "files": [ + "internal/eval_runtime_sidecar/handlers/handlers_test.go", + "internal/eval_runtime_sidecar/proxy/http_client.go", + "internal/eval_runtime_sidecar/proxy/http_client_test.go", + "internal/eval_runtime_sidecar/server/server_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + 
"number": 347, + "state": "MERGED", + "title": "fix: sidecar container termination", + "branch": "sidecar-termination", + "labels": [], + "files": [ + "cmd/eval-runtime-sidecar/main.go", + "internal/eval_hub/runtimes/k8s/examples/eval-job.yaml", + "internal/eval_hub/runtimes/k8s/job_builders.go", + "internal/eval_hub/runtimes/k8s/job_builders_test.go", + "internal/eval_runtime_sidecar/termination/watch.go", + "internal/eval_runtime_sidecar/termination/watch_test.go", + "tests/kubernetes/features/kubernetes_resources.feature", + "tests/kubernetes/features/step_definitions_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 320, + "state": "MERGED", + "title": "fix: tidy collections APIs", + "branch": "collection-traces", + "labels": [], + "files": [ + "cmd/eval_hub/server/server_test.go", + "internal/handlers/collections.go", + "internal/handlers/collections_test.go", + "internal/handlers/evaluations.go", + "internal/handlers/helpers.go", + "internal/handlers/providers.go", + "tests/features/collections.feature", + "tests/features/evaluations.feature" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 319, + "state": "MERGED", + "title": "fix: remove the tenant check in the storage layer and simplify the delete evaluation", + "branch": "storage-fix", + "labels": [], + "files": [ + "internal/storage/sql/collections.go", + "internal/storage/sql/evaluations.go", + "internal/storage/sql/providers.go", + "internal/storage/sql/sql.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 318, + "state": "MERGED", + "title": "fix: added experiment tag context: eval-hub", + "branch": "marius-experiment-tag", + "labels": [], + "files": [ + "internal/mlflow/mlflow.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 315, + "state": "MERGED", + "title": "fix: remove system resources from the runtime", + "branch": "system-lists", + "labels": [], + "files": [ + 
"Containerfile", + "README-GO.md", + "README.md", + "VERSION", + "cmd/eval_hub/main.go", + "cmd/eval_hub/server/server_test.go", + "go.mod", + "go.sum", + "internal/common/common.go", + "internal/common/resolve_provider_test.go", + "internal/config/loader_test.go", + "internal/handlers/collections_test.go", + "internal/handlers/health_test.go", + "internal/runtimes/k8s/k8s_runtime.go", + "internal/runtimes/k8s/k8s_runtime_test.go", + "internal/runtimes/k8s/k8s_runtime_unit_test.go", + "internal/runtimes/local/local_runtime.go", + "internal/runtimes/local/local_runtime_test.go", + "internal/runtimes/runtime.go", + "internal/runtimes/shared/runtime_utils.go", + "internal/runtimes/shared/runtime_utils_test.go", + "internal/validation/validator.go", + "internal/validation/validator_test.go", + "tests/features/evaluations.feature", + "tests/features/providers.feature", + "tests/features/step_definitions_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 306, + "state": "MERGED", + "title": "fix(collections): resolve system-defined collections in GET endpoint and event callbacks", + "branch": "missing-collections", + "labels": [ + "kind/fix" + ], + "files": [ + "internal/abstractions/storage.go", + "internal/handlers/collections.go", + "internal/handlers/evaluations.go", + "internal/handlers/evaluations_test.go", + "internal/runtimes/k8s/k8s_runtime.go", + "internal/runtimes/k8s/k8s_runtime_unit_test.go", + "internal/runtimes/local/local_runtime.go", + "internal/runtimes/local/local_runtime_test.go", + "internal/storage/sql/evaluations.go", + "internal/storage/sql/evaluations_test.go", + "tests/features/step_definitions_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 303, + "state": "MERGED", + "title": "fix: clean up the storage abstraction", + "branch": "storage-cleanup", + "labels": [], + "files": [ + "cmd/eval_hub/main.go", + "cmd/eval_hub/server/server_test.go", + 
"internal/abstractions/storage.go", + "internal/config/loader.go", + "internal/config/loader_test.go", + "internal/handlers/collections.go", + "internal/handlers/collections_test.go", + "internal/handlers/providers.go", + "internal/handlers/providers_test.go", + "internal/runtimes/k8s/k8s_runtime_unit_test.go", + "internal/runtimes/local/local_runtime_test.go", + "internal/storage/sql/collections.go", + "internal/storage/sql/providers.go", + "internal/storage/sql/providers_test.go", + "internal/validation/validator.go", + "internal/validation/validator_test.go", + "pkg/api/common.go", + "tests/features/step_definitions_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 302, + "state": "MERGED", + "title": "fix: add missing mapstructure annotations and validate system providers and collections", + "branch": "system-validation", + "labels": [], + "files": [ + "config/collections/leaderboard-v2.yaml", + "internal/config/loader.go", + "internal/config/loader_test.go", + "internal/handlers/collections_test.go", + "pkg/api/collections.go", + "pkg/api/common.go", + "pkg/api/evaluations.go", + "pkg/api/providers.go", + "tests/features/collections.feature" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 296, + "state": "MERGED", + "title": "fix: Remove the job_id,evalhub_url envs from job pod", + "branch": "follow-up", + "labels": [], + "files": [ + "internal/runtimes/k8s/examples/eval-job.yaml", + "internal/runtimes/k8s/job_builders.go", + "internal/runtimes/k8s/job_config.go", + "internal/runtimes/k8s/k8s_runtime.go", + "tests/kubernetes/features/kubernetes_resources.feature", + "tests/kubernetes/features/step_definitions_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 291, + "state": "MERGED", + "title": "fix: fix jobs from a collection", + "branch": "job-with-collection", + "labels": [], + "files": [ + "cmd/eval_hub/server/authorization.go", + 
"cmd/eval_hub/server/execution_context.go", + "cmd/eval_hub/server/server.go", + "internal/common/common.go", + "internal/constants/log_fields.go", + "internal/handlers/collections.go", + "internal/handlers/evaluations_test.go", + "internal/messages/messages.go", + "internal/runtimes/k8s/job_builders_test.go", + "internal/runtimes/k8s/k8s_runtime_unit_test.go", + "internal/serviceerrors/error_test.go", + "internal/storage/common/common.go", + "internal/storage/sql/collections.go", + "internal/storage/sql/evaluations.go", + "pkg/api/evaluations.go", + "tests/features/evaluations.feature", + "tests/features/step_definitions_test.go", + "tests/features/test_data/collection_pass_criteria_from_provider_test.json", + "tests/features/test_data/evaluation_job_pass_criteria_from_provider_test.json", + "tests/features/test_data/evaluation_job_with_collection.json", + "tests/features/test_data/patch_collection_benchmark.json", + "tests/features/test_data/patch_collection_benchmark_full.json", + "tests/features/test_data/patch_collection_description.json", + "tests/features/test_data/patch_collection_invalid.json", + "tests/features/test_data/patch_collection_name.json" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 290, + "state": "MERGED", + "title": "fix: update S3 test data init image handling in job configuration", + "branch": "fix-init-container-default", + "labels": [ + "kind/fix" + ], + "files": [ + "internal/runtimes/k8s/job_builders.go", + "internal/runtimes/k8s/job_builders_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 287, + "state": "MERGED", + "title": "fix: return bad request instead of 403 if the request is missing requ…", + "branch": "marius", + "labels": [], + "files": [ + "auth/authorization.go", + "auth/rules.go", + "auth/rules_test.go", + "auth/testdata/rbac_mixed.yaml", + "cmd/eval_hub/server/authorization.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 
284, + "state": "MERGED", + "title": "fix: test-all-coverage failure fixes", + "branch": "fix-tests", + "labels": [], + "files": [ + "tests/features/evaluations.feature" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 282, + "state": "MERGED", + "title": "fix(storage): scope GetEvaluationJobs list query to WithTenant", + "branch": "tenant-filter", + "labels": [ + "kind/fix" + ], + "files": [ + "internal/storage/sql/evaluations.go", + "internal/storage/sql/evaluations_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 281, + "state": "MERGED", + "title": "fix(k8s): resolve namespace using tenant from evaluation resource", + "branch": "ns-fix", + "labels": [ + "kind/fix" + ], + "files": [ + "internal/runtimes/k8s/job_config.go", + "internal/runtimes/k8s/job_config_test.go", + "internal/runtimes/k8s/k8s_runtime.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 275, + "state": "MERGED", + "title": "fix: auth rules not enforced", + "branch": "fix-sar", + "labels": [ + "kind/fix" + ], + "files": [ + "auth/rules.go", + "internal/config/loader.go", + "internal/runtimes/k8s/examples/eval-job.yaml", + "internal/runtimes/k8s/job_config.go", + "tests/kubernetes/features/step_definitions_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 271, + "state": "MERGED", + "title": "fix(cli): eval-hub-server cli not picking up args", + "branch": "fix/eval-hub-server-cli-args", + "labels": [], + "files": [ + "python-server/DEVELOPMENT.md", + "python-server/evalhub_server/main.py", + "python-server/pyproject.toml", + "python-server/tests/test_main.py", + "python-server/uv.lock" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 266, + "state": "MERGED", + "title": "fix: more fixes related to searching lists of resources", + "branch": "list-tests", + "labels": [], + "files": [ + "Makefile", + "internal/handlers/providers.go", + 
"internal/storage/sql/collections.go", + "internal/storage/sql/evaluations.go", + "internal/storage/sql/postgres/statements.go", + "internal/storage/sql/providers.go", + "internal/storage/sql/shared/statements.go", + "internal/storage/sql/sqlite/statements.go", + "tests/features/evaluations.feature", + "tests/features/step_definitions_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 265, + "state": "MERGED", + "title": "fix(build): fix windows platform build", + "branch": "fix/windows-build", + "labels": [], + "files": [ + "auth/testdata/rbac_mixed.yaml", + "internal/runtimes/local/local_runtime.go", + "internal/runtimes/local/process_unix.go", + "internal/runtimes/local/process_windows.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 263, + "state": "MERGED", + "title": "fix: added get verb for mlflow experiments + auth config loading", + "branch": "marius", + "labels": [], + "files": [ + "cmd/eval_hub/main.go", + "cmd/eval_hub/server/server.go", + "config/auth.yaml" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 255, + "state": "MERGED", + "title": "fix(cancel): cancellation of job should reflect at benchmark level", + "branch": "fix/cancel-should-fail-unit-benchmarks", + "labels": [], + "files": [ + ".coderabbit.yaml", + ".gitignore", + "Makefile", + "internal/storage/sql/evaluations.go", + "internal/storage/sql/evaluations_test.go", + "pkg/api/evaluations.go", + "pkg/api/evaluations_test.go", + "tests/features/evaluations.feature" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 247, + "state": "MERGED", + "title": "fix: fix the cancellation code to always call the runtime (unless already cancelled)", + "branch": "evaluations-cancel", + "labels": [], + "files": [ + "internal/handlers/evaluations.go", + "internal/handlers/evaluations_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 245, + "state": 
"MERGED", + "title": "fix: loadProvider ignores explicit config dir for provider file reads", + "branch": "config-fix", + "labels": [ + "kind/fix", + "area/configuration" + ], + "files": [ + "internal/config/loader.go", + "internal/config/loader_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 243, + "state": "MERGED", + "title": "fix: Remove AUTH related envs from job pod", + "branch": "model_auth_fixes", + "labels": [], + "files": [ + "internal/runtimes/k8s/job_builders.go", + "internal/runtimes/k8s/job_builders_test.go", + "internal/runtimes/k8s/k8s_runtime_test.go", + "internal/runtimes/k8s/k8s_runtime_unit_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 240, + "state": "MERGED", + "title": "fix: fix the update logic (and error message) when a job is in a final state", + "branch": "update-return-code", + "labels": [], + "files": [ + "internal/handlers/evaluations.go", + "internal/messages/messages.go", + "internal/storage/sql/evaluations.go", + "internal/storage/sql/evaluations_test.go", + "tests/features/evaluations.feature" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 234, + "state": "MERGED", + "title": "fix: fix code so that the reported caller is not the helper function", + "branch": "log-caller", + "labels": [], + "files": [ + "Makefile", + "cmd/eval_hub/server/execution_context.go", + "internal/logging/logger.go", + "package-lock.json" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 231, + "state": "MERGED", + "title": "fix: add filtering by tenant_id", + "branch": "query-evaluations", + "labels": [], + "files": [ + "cmd/eval_hub/server/execution_context.go", + "cmd/eval_hub/server/server.go", + "internal/abstractions/storage.go", + "internal/handlers/evaluations.go", + "internal/storage/sql/collections.go", + "internal/storage/sql/evaluations.go", + "internal/storage/sql/evaluations_test.go", + 
"internal/storage/sql/helper.go", + "tests/features/evaluations.feature", + "tests/features/step_definitions_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 226, + "state": "MERGED", + "title": "fix: benchmark results are stored per index", + "branch": "benchmark-index", + "labels": [], + "files": [ + "internal/storage/common/common.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 225, + "state": "MERGED", + "title": "fix: local runtime directory structure with benchmark_index", + "branch": "fix-local-dir-path", + "labels": [], + "files": [ + "docs/src/components/schemas/BenchmarkResult.yaml", + "docs/src/components/schemas/BenchmarkTest.yaml", + "docs/src/components/schemas/EvaluationJobResults.yaml", + "docs/src/components/schemas/EvaluationTest.yaml", + "docs/src/components/schemas/ModelAuth.yaml", + "internal/runtimes/local/local_runtime.go", + "internal/runtimes/local/local_runtime_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 223, + "state": "MERGED", + "title": "fix: remove redundant message and update swagger", + "branch": "redundant-message", + "labels": [], + "files": [ + "docs/index-private.html", + "docs/index-public.html", + "docs/index.html", + "docs/openapi-internal.json", + "docs/openapi-internal.yaml", + "docs/openapi.json", + "docs/openapi.yaml", + "docs/src/components/schemas/EvaluationResource.yaml", + "internal/storage/sql/evaluations.go", + "pkg/api/evaluations.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 217, + "state": "MERGED", + "title": "fix: run k8s benchmark creation sequentially", + "branch": "kube_fixes", + "labels": [], + "files": [ + "internal/runtimes/k8s/k8s_runtime.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 216, + "state": "MERGED", + "title": "fix: Detach runtime context from request cancellation", + "branch": "context_fix", + "labels": [], + "files": [ 
+ "internal/runtimes/k8s/k8s_runtime.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 214, + "state": "MERGED", + "title": "fix: name kube resources as -", + "branch": "guid_fix", + "labels": [], + "files": [ + "internal/runtimes/k8s/job_builders.go", + "internal/runtimes/k8s/job_builders_test.go", + "internal/runtimes/k8s/job_config.go", + "internal/runtimes/k8s/k8s_helper.go", + "internal/runtimes/k8s/k8s_runtime.go", + "internal/runtimes/k8s/k8s_runtime_test.go", + "internal/runtimes/k8s/k8s_runtime_unit_test.go", + "tests/kubernetes/features/kubernetes_resources.feature", + "tests/kubernetes/features/step_definitions_test.go", + "tests/kubernetes/features/test_data/evaluation_job_multi_benchmark.json" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 203, + "state": "MERGED", + "title": "fix(ci): install redocly from package-lock.json in CI", + "branch": "tarilabs-20260219-inCIuseLOCK", + "labels": [], + "files": [ + ".github/workflows/ci.yml" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 200, + "state": "MERGED", + "title": "fix(mlflow): re-read auth token from disk on each request to handle k8s token rotation", + "branch": "token-refresh", + "labels": [ + "kind/fix" + ], + "files": [ + "internal/config/mlflow_config.go", + "internal/mlflow/mlflow.go", + "pkg/mlflowclient/client.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 199, + "state": "MERGED", + "title": "fix: add annotations for the kube resources", + "branch": "k8s_name", + "labels": [], + "files": [ + "internal/runtimes/k8s/examples/eval-job-configmap.yaml", + "internal/runtimes/k8s/examples/eval-job.yaml", + "internal/runtimes/k8s/job_builders.go", + "internal/runtimes/k8s/job_builders_test.go", + "internal/runtimes/k8s/k8s_runtime_test.go", + "internal/runtimes/k8s/k8s_runtime_unit_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 198, + "state": 
"MERGED", + "title": "fix: remove CA bundle as default", + "branch": "certs-fix", + "labels": [ + "kind/fix" + ], + "files": [ + "internal/runtimes/k8s/job_builders.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 197, + "state": "MERGED", + "title": "fix: collision in Job naming", + "branch": "k8s_name", + "labels": [], + "files": [ + "internal/runtimes/k8s/job_builders.go", + "internal/runtimes/k8s/job_builders_test.go", + "internal/runtimes/k8s/k8s_runtime.go", + "internal/runtimes/k8s/k8s_runtime_test.go", + "internal/runtimes/k8s/k8s_runtime_unit_test.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 189, + "state": "MERGED", + "title": "fix: Add missing provider id from Job Spec", + "branch": "provider-id-fix", + "labels": [ + "kind/fix" + ], + "files": [ + "internal/runtimes/k8s/job_config.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 185, + "state": "MERGED", + "title": "fix: validate benchmark/provider IDs and status events", + "branch": "few_fixes", + "labels": [], + "files": [ + "cmd/eval_hub/server/server_test.go", + "internal/handlers/evaluations.go", + "internal/handlers/evaluations_test.go", + "internal/handlers/providers_test.go", + "internal/messages/messages.go", + "pkg/api/evaluations.go" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 183, + "state": "MERGED", + "title": "fix: do not rollback when not needed", + "branch": "eval-sql-txn", + "labels": [], + "files": [ + "Makefile", + "internal/abstractions/storage.go", + "internal/handlers/evaluations.go", + "internal/storage/sql/evaluations.go", + "package-lock.json", + "package.json", + "pkg/api/evaluations.go" + ], + "changes_requested": [], + "close_reason": null + } +] \ No newline at end of file diff --git a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/cve-prs.json 
b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/cve-prs.json new file mode 100644 index 00000000..ce53332b --- /dev/null +++ b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/cve-prs.json @@ -0,0 +1,67 @@ +[ + { + "number": 390, + "state": "OPEN", + "title": "fix(cve): upgrade Go from 1.25.8 to 1.26.1", + "branch": "fix/cve-2026-go-stdlib-upgrade-attempt-1", + "labels": [], + "files": [ + "Containerfile", + "go.mod", + "go.sum" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 384, + "state": "MERGED", + "title": "fix(cve): Update Go stdlib to 1.25.8 - fix 3 vulnerabilities", + "branch": "fix/cve-2026-go-stdlib-vulns-attempt-1", + "labels": [], + "files": [ + "go.mod" + ], + "changes_requested": [], + "close_reason": null + }, + { + "number": 383, + "state": "CLOSED", + "title": "fix(cve): update Go to 1.25.8 to address 3 stdlib CVEs", + "branch": "ambient/cve-fixer-nightly-1774600830", + "labels": [], + "files": [ + "Containerfile", + "go.mod" + ], + "changes_requested": [], + "close_reason": "
📝 Walkthrough ## Walkthrough The Go toolchain version has been updated | ## [Codecov](https://app.codecov.io/gh/eval-hub/eval-hub/pull/383?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=eval-hub) R" + }, + { + "number": 382, + "state": "CLOSED", + "title": "fix(cve): update Go stdlib to 1.25.8 for CVE fixes", + "branch": "fix/cve-go-stdlib-1.25.8-attempt-1", + "labels": [], + "files": [ + "go.mod", + "go.sum" + ], + "changes_requested": [], + "close_reason": "## [Codecov](https://app.codecov.io/gh/eval-hub/eval-hub/pull/382?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=eval-hub) R | I am still working on this - I will get it to also update the other files" + }, + { + "number": 379, + "state": "CLOSED", + "title": "fix(cve): Update Go toolchain to 1.25.8 for stdlib CVE fixes", + "branch": "fix/cve-go-stdlib-go1.25.8-attempt-1", + "labels": [], + "files": [ + "go.mod", + "go.sum" + ], + "changes_requested": [], + "close_reason": "
📝 Walkthrough ## Walkthrough The `go.mod` file is updated to include a | ## [Codecov](https://app.codecov.io/gh/eval-hub/eval-hub/pull/379?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=eval-hub) R" + } +] \ No newline at end of file From a6895eb464b8f4f75e76fb66ed7bfe648c8aaad6 Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Mon, 30 Mar 2026 09:17:55 -0400 Subject: [PATCH 07/23] chore: add .gitignore and remove committed test artifacts Artifacts are generated output and should not be tracked in the repo, following the same convention as the cve-fixer workflow. Co-Authored-By: Claude Sonnet 4.6 (1M context) --- workflows/guidance-generator/.gitignore | 2 + .../analysis/bugfix-patterns.md | 44 - .../analysis/cve-patterns.md | 41 - .../output/bugfix-guidance.md | 34 - .../output/cve-fix-guidance.md | 21 - .../eval-hub-eval-hub/raw/bugfix-prs.json | 885 ------------------ .../eval-hub-eval-hub/raw/cve-prs.json | 67 -- 7 files changed, 2 insertions(+), 1092 deletions(-) create mode 100644 workflows/guidance-generator/.gitignore delete mode 100644 workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/bugfix-patterns.md delete mode 100644 workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/cve-patterns.md delete mode 100644 workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/bugfix-guidance.md delete mode 100644 workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/cve-fix-guidance.md delete mode 100644 workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/bugfix-prs.json delete mode 100644 workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/cve-prs.json diff --git a/workflows/guidance-generator/.gitignore b/workflows/guidance-generator/.gitignore new file mode 100644 index 00000000..bc94e122 --- /dev/null +++ b/workflows/guidance-generator/.gitignore @@ -0,0 +1,2 @@ +# PR 
Guidance Generator artifacts - generated output, not tracked in repo +artifacts/ diff --git a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/bugfix-patterns.md b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/bugfix-patterns.md deleted file mode 100644 index 7db724c0..00000000 --- a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/bugfix-patterns.md +++ /dev/null @@ -1,44 +0,0 @@ -# Bugfix PR Pattern Analysis — eval-hub/eval-hub - - -## Data Summary -- 48 merged, 4 closed bugfix PRs analyzed -- Threshold applied: 3+ PRs per rule - -## TITLE_FORMAT -Primary: `fix: <description>` (39/48 merged) -Scoped: `fix(<scope>): <description>` (9/48 merged) -No other prefix patterns observed. - -## SCOPE_VALUES -sidecar, collections, storage, k8s, cli, build, cancel, ci, mlflow (9 scoped PRs) - -## BRANCH_FORMAT -No strict convention. Patterns: -- Short descriptive (no prefix): ~33/48 (adapter-mode, storage-fix, tenant-filter) -- `fix-<topic>`: ~10/48 (fix-sidecar-shutdown, fix-lighteval) -- `fix/<topic>`: ~5/48 (fix/cancel-should-fail-unit-benchmarks) - -## TEST_FILES_REQUIRED -Go source changes almost always paired with `_test.go` counterparts (~36/48 merged) -- evaluations.go + evaluations_test.go (6+ PRs) -- job_builders.go + job_builders_test.go (6+ PRs) -- collections.go + collections_test.go (3+ PRs) - -## FEATURE_TESTS -PRs touching handler/runtime behavior include BDD test files (10+ PRs): -- tests/features/*.feature -- tests/features/step_definitions_test.go - -## CO_CHANGE_k8s -job_builders.go always paired with job_builders_test.go (6/6 PRs) - -## CO_CHANGE_storage -SQL storage changes always paired with *_test.go files (5/5 PRs) - -## LABELS -kind/fix: 11/48 PRs (23%) — optional, inconsistently applied - -## DONT_docs_only_via_fix_branch -2 closed PRs rejected for docs-only content via fix/ branches: -- #369 closed "not required", #368 closed "not relevant here" diff --git
a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/cve-patterns.md b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/cve-patterns.md deleted file mode 100644 index 23367d2c..00000000 --- a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/analysis/cve-patterns.md +++ /dev/null @@ -1,41 +0,0 @@ -# CVE PR Pattern Analysis — eval-hub/eval-hub - - - -## Data Summary -- Merged: #384 ("fix(cve): Update Go stdlib to 1.25.8 - fix 3 vulnerabilities") -- Closed: #383, #382, #379 (all Go stdlib upgrade attempts targeting 1.25.8) -- Open: #390 ("fix(cve): upgrade Go from 1.25.8 to 1.26.1") - -All 5 CVE PRs address Go stdlib vulnerabilities. This repo is a Go project. - -## TITLE_FORMAT -`fix(cve): <change> to <version> - <summary>` (5/5 CVE PRs) -Examples: - - "fix(cve): Update Go stdlib to 1.25.8 - fix 3 vulnerabilities" ← merged - - "fix(cve): upgrade Go from 1.25.8 to 1.26.1" - - "fix(cve): update Go toolchain to 1.25.8 for stdlib CVE fixes" - -## BRANCH_FORMAT -Manual: `fix/cve-<id>-go-stdlib-<detail>-attempt-N` (3/3 manual CVE PRs) -Bot-generated: `ambient/cve-fixer-nightly-<timestamp>` (1/1 bot PR) - -## FILES_GO_CVE -Merged PR (#384) changed: go.mod only -Open PR (#390) changes: Containerfile + go.mod + go.sum ← more complete -Closed #383: Containerfile + go.mod (missing go.sum) -Closed #382, #379: go.mod + go.sum (missing Containerfile) - -Close reason for #382: "I am still working on this - I will get it to also update the other files" -→ Suggests Containerfile + go.mod + go.sum should all be updated together (1 closed PR rejection) - -## PROACTIVE_go_sum -go.sum should accompany go.mod changes for dependency updates. -Evidence: 3 of 5 PRs include go.sum; the 2 that don't are older attempts. - -## DONT_incomplete_files -Don't update only go.mod or only go.mod+go.sum — update Containerfile too.
-(1 explicit rejection: #382 closed for missing files) - -## LABELS -No labels applied to any CVE PRs (0/5 — labels not used for CVE fixes) diff --git a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/bugfix-guidance.md b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/bugfix-guidance.md deleted file mode 100644 index 09172e5a..00000000 --- a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/bugfix-guidance.md +++ /dev/null @@ -1,34 +0,0 @@ -# Bugfix Guidance — eval-hub/eval-hub - - -## Titles -`fix: <description>` (39/48 merged) -`fix(<scope>): <description>` (9/48 merged — use when change is scoped) - -## Branches -No strict convention enforced. Common patterns: -- Short descriptive: `adapter-mode`, `storage-fix`, `tenant-filter` (~33/48) -- Prefixed: `fix-<topic>` (~10/48) or `fix/<topic>` (~5/48) - -## Scope Values -Use fix(<scope>) when the change is confined to one subsystem (from 9 merged PRs): -`sidecar` | `collections` | `storage` | `k8s` | `cli` | `build` | `cancel` | `ci` | `mlflow` - -## Test Requirements -Go source file changes must include corresponding `_test.go` files (~36/48 merged): -- `evaluations.go` → `evaluations_test.go` (6+ PRs) -- `job_builders.go` → `job_builders_test.go` (6+ PRs) -- `collections.go` → `collections_test.go` (3+ PRs) -- `loader.go` → `loader_test.go` (2+ PRs) - -Handler/runtime changes should also update BDD feature files (10+ PRs): -- `tests/features/*.feature` -- `tests/features/step_definitions_test.go` -- `tests/kubernetes/features/` for k8s runtime changes (4+ PRs) - -## Co-Changes -- `internal/runtimes/k8s/job_builders.go` → always include `job_builders_test.go` (6/6 PRs) - -  `internal/storage/sql/*.go` → always include `*_test.go` counterpart (5/5 PRs) - -## Don'ts -- Don't submit docs-only changes via fix/ branches — 2 closed PRs rejected as "not required" / "not relevant here" (#369, #368) diff --git a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/cve-fix-guidance.md
b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/cve-fix-guidance.md deleted file mode 100644 index fc32402f..00000000 --- a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/output/cve-fix-guidance.md +++ /dev/null @@ -1,21 +0,0 @@ -# CVE Fix Guidance — eval-hub/eval-hub - - -## Titles -`fix(cve): <change> to <version> - fix <N> vulnerabilities` (5/5 CVE PRs) -- "fix(cve): Update Go stdlib to 1.25.8 - fix 3 vulnerabilities" ← merged -- "fix(cve): upgrade Go from 1.25.8 to 1.26.1" ← open - -## Branches -Manual: `fix/cve-<id>-go-stdlib-<detail>-attempt-N` (3/3 manual CVE PRs) -Bot-generated: `ambient/cve-fixer-nightly-<timestamp>` - -## Files — Go stdlib CVE -Update all three files together (1 closed PR rejected for updating only some): -- `go.mod` — toolchain/dependency version -- `go.sum` — dependency checksums -- `Containerfile` — base image / Go toolchain version - -## Don'ts -- Don't update go.mod+go.sum without also updating Containerfile (1 rejection: #382 closed "I will get it to also update the other files") -- Don't update only Containerfile+go.mod without go.sum (pattern from closed #383) diff --git a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/bugfix-prs.json b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/bugfix-prs.json deleted file mode 100644 index 98eafe30..00000000 --- a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/bugfix-prs.json +++ /dev/null @@ -1,885 +0,0 @@ -[ - { - "number": 388, - "state": "OPEN", - "title": "fix(deps): update Go toolchain from 1.25.8 to 1.26.1", - "branch": "ambient/cve-fixer-nightly-1774675800", - "labels": [], - "files": [ - "Containerfile", - "go.mod", - "go.sum" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 383, - "state": "CLOSED", - "title": "fix(cve): update Go to 1.25.8 to address 3 stdlib CVEs", - "branch": "ambient/cve-fixer-nightly-1774600830", - "labels": [], - "files": [ - "Containerfile", - "go.mod" - ], -
"changes_requested": [], - "close_reason": "
📝 Walkthrough ## Walkthrough The Go toolchain version has been updated | ## [Codecov](https://app.codecov.io/gh/eval-hub/eval-hub/pull/383?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=eval-hub) R" - }, - { - "number": 374, - "state": "MERGED", - "title": "fix: remove github action for Ambient CVE fixer", - "branch": "fix-cve-fixer", - "labels": [], - "files": [ - ".cve-fixer.json", - ".github/workflows/daily-cve-fixer.yml" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 372, - "state": "MERGED", - "title": "fix: the ambient cli fails with download error", - "branch": "fix-ambient-cli", - "labels": [], - "files": [ - ".github/workflows/daily-cve-fixer.yml" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 369, - "state": "CLOSED", - "title": "docs: Add adapter model URL usage guide", - "branch": "fix/RHOAIENG-54862-model-url-ignored", - "labels": [], - "files": [ - "docs/adapter-model-url-usage.md" - ], - "changes_requested": [], - "close_reason": "## [Codecov](https://app.codecov.io/gh/eval-hub/eval-hub/pull/369?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=eval-hub) R | Closing this for now as not required." - }, - { - "number": 368, - "state": "CLOSED", - "title": "docs: Add MLflow artifact logging guide for adapters", - "branch": "fix/RHOAIENG-54539-mlflow-artifacts", - "labels": [], - "files": [ - "docs/mlflow-artifact-logging-guide.md" - ], - "changes_requested": [], - "close_reason": "## [Codecov](https://app.codecov.io/gh/eval-hub/eval-hub/pull/368?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=eval-hub) R | Closing this as not relevant here." 
- }, - { - "number": 367, - "state": "CLOSED", - "title": "fix: Pass mlflow_experiment_id to adapters for run creation", - "branch": "fix/RHOAIENG-54869-mlflow-run-id", - "labels": [], - "files": [ - "docs/mlflow-run-id-integration.md", - "internal/eval_hub/runtimes/shared/jobspec.go" - ], - "changes_requested": [], - "close_reason": "
📝 Walkthrough ## Walkthrough This PR introduces MLflow integration sup" - }, - { - "number": 366, - "state": "MERGED", - "title": "fix(sidecar): exit cleanly on graceful shutdown by implementing Is() on ServerClosedError", - "branch": "fix-sidecar-shutdown", - "labels": [ - "kind/fix" - ], - "files": [ - "cmd/eval_runtime_sidecar/main.go", - "internal/eval_hub/server/server.go", - "internal/eval_runtime_sidecar/server/server.go", - "internal/eval_runtime_sidecar/server/server_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 363, - "state": "MERGED", - "title": "fix: set 'EVALHUB_MODE' env var on adapter contaienr", - "branch": "adapter-mode", - "labels": [], - "files": [ - "internal/eval_hub/runtimes/k8s/job_builders.go", - "internal/eval_hub/runtimes/k8s/job_builders_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 359, - "state": "MERGED", - "title": "fix: Update resource requests and limits for lighteval", - "branch": "fix-lighteval", - "labels": [ - "kind/fix" - ], - "files": [ - "config/providers/lighteval.yaml" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 354, - "state": "MERGED", - "title": "fix: allow '/mlflow-artifacts' path in proxy", - "branch": "sidecar-mlflow", - "labels": [], - "files": [ - "internal/eval_hub/runtimes/k8s/job_builders_test.go", - "internal/eval_runtime_sidecar/handlers/handlers.go", - "internal/eval_runtime_sidecar/handlers/handlers_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 348, - "state": "MERGED", - "title": "fix: TLS disabled in tests", - "branch": "sidecar-tests", - "labels": [], - "files": [ - "internal/eval_runtime_sidecar/handlers/handlers_test.go", - "internal/eval_runtime_sidecar/proxy/http_client.go", - "internal/eval_runtime_sidecar/proxy/http_client_test.go", - "internal/eval_runtime_sidecar/server/server_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - 
"number": 347, - "state": "MERGED", - "title": "fix: sidecar container termination", - "branch": "sidecar-termination", - "labels": [], - "files": [ - "cmd/eval-runtime-sidecar/main.go", - "internal/eval_hub/runtimes/k8s/examples/eval-job.yaml", - "internal/eval_hub/runtimes/k8s/job_builders.go", - "internal/eval_hub/runtimes/k8s/job_builders_test.go", - "internal/eval_runtime_sidecar/termination/watch.go", - "internal/eval_runtime_sidecar/termination/watch_test.go", - "tests/kubernetes/features/kubernetes_resources.feature", - "tests/kubernetes/features/step_definitions_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 320, - "state": "MERGED", - "title": "fix: tidy collections APIs", - "branch": "collection-traces", - "labels": [], - "files": [ - "cmd/eval_hub/server/server_test.go", - "internal/handlers/collections.go", - "internal/handlers/collections_test.go", - "internal/handlers/evaluations.go", - "internal/handlers/helpers.go", - "internal/handlers/providers.go", - "tests/features/collections.feature", - "tests/features/evaluations.feature" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 319, - "state": "MERGED", - "title": "fix: remove the tenant check in the storage layer and simplify the delete evaluation", - "branch": "storage-fix", - "labels": [], - "files": [ - "internal/storage/sql/collections.go", - "internal/storage/sql/evaluations.go", - "internal/storage/sql/providers.go", - "internal/storage/sql/sql.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 318, - "state": "MERGED", - "title": "fix: added experiment tag context: eval-hub", - "branch": "marius-experiment-tag", - "labels": [], - "files": [ - "internal/mlflow/mlflow.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 315, - "state": "MERGED", - "title": "fix: remove system resources from the runtime", - "branch": "system-lists", - "labels": [], - "files": [ - 
"Containerfile", - "README-GO.md", - "README.md", - "VERSION", - "cmd/eval_hub/main.go", - "cmd/eval_hub/server/server_test.go", - "go.mod", - "go.sum", - "internal/common/common.go", - "internal/common/resolve_provider_test.go", - "internal/config/loader_test.go", - "internal/handlers/collections_test.go", - "internal/handlers/health_test.go", - "internal/runtimes/k8s/k8s_runtime.go", - "internal/runtimes/k8s/k8s_runtime_test.go", - "internal/runtimes/k8s/k8s_runtime_unit_test.go", - "internal/runtimes/local/local_runtime.go", - "internal/runtimes/local/local_runtime_test.go", - "internal/runtimes/runtime.go", - "internal/runtimes/shared/runtime_utils.go", - "internal/runtimes/shared/runtime_utils_test.go", - "internal/validation/validator.go", - "internal/validation/validator_test.go", - "tests/features/evaluations.feature", - "tests/features/providers.feature", - "tests/features/step_definitions_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 306, - "state": "MERGED", - "title": "fix(collections): resolve system-defined collections in GET endpoint and event callbacks", - "branch": "missing-collections", - "labels": [ - "kind/fix" - ], - "files": [ - "internal/abstractions/storage.go", - "internal/handlers/collections.go", - "internal/handlers/evaluations.go", - "internal/handlers/evaluations_test.go", - "internal/runtimes/k8s/k8s_runtime.go", - "internal/runtimes/k8s/k8s_runtime_unit_test.go", - "internal/runtimes/local/local_runtime.go", - "internal/runtimes/local/local_runtime_test.go", - "internal/storage/sql/evaluations.go", - "internal/storage/sql/evaluations_test.go", - "tests/features/step_definitions_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 303, - "state": "MERGED", - "title": "fix: clean up the storage abstraction", - "branch": "storage-cleanup", - "labels": [], - "files": [ - "cmd/eval_hub/main.go", - "cmd/eval_hub/server/server_test.go", - 
"internal/abstractions/storage.go", - "internal/config/loader.go", - "internal/config/loader_test.go", - "internal/handlers/collections.go", - "internal/handlers/collections_test.go", - "internal/handlers/providers.go", - "internal/handlers/providers_test.go", - "internal/runtimes/k8s/k8s_runtime_unit_test.go", - "internal/runtimes/local/local_runtime_test.go", - "internal/storage/sql/collections.go", - "internal/storage/sql/providers.go", - "internal/storage/sql/providers_test.go", - "internal/validation/validator.go", - "internal/validation/validator_test.go", - "pkg/api/common.go", - "tests/features/step_definitions_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 302, - "state": "MERGED", - "title": "fix: add missing mapstructure annotations and validate system providers and collections", - "branch": "system-validation", - "labels": [], - "files": [ - "config/collections/leaderboard-v2.yaml", - "internal/config/loader.go", - "internal/config/loader_test.go", - "internal/handlers/collections_test.go", - "pkg/api/collections.go", - "pkg/api/common.go", - "pkg/api/evaluations.go", - "pkg/api/providers.go", - "tests/features/collections.feature" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 296, - "state": "MERGED", - "title": "fix: Remove the job_id,evalhub_url envs from job pod", - "branch": "follow-up", - "labels": [], - "files": [ - "internal/runtimes/k8s/examples/eval-job.yaml", - "internal/runtimes/k8s/job_builders.go", - "internal/runtimes/k8s/job_config.go", - "internal/runtimes/k8s/k8s_runtime.go", - "tests/kubernetes/features/kubernetes_resources.feature", - "tests/kubernetes/features/step_definitions_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 291, - "state": "MERGED", - "title": "fix: fix jobs from a collection", - "branch": "job-with-collection", - "labels": [], - "files": [ - "cmd/eval_hub/server/authorization.go", - 
"cmd/eval_hub/server/execution_context.go", - "cmd/eval_hub/server/server.go", - "internal/common/common.go", - "internal/constants/log_fields.go", - "internal/handlers/collections.go", - "internal/handlers/evaluations_test.go", - "internal/messages/messages.go", - "internal/runtimes/k8s/job_builders_test.go", - "internal/runtimes/k8s/k8s_runtime_unit_test.go", - "internal/serviceerrors/error_test.go", - "internal/storage/common/common.go", - "internal/storage/sql/collections.go", - "internal/storage/sql/evaluations.go", - "pkg/api/evaluations.go", - "tests/features/evaluations.feature", - "tests/features/step_definitions_test.go", - "tests/features/test_data/collection_pass_criteria_from_provider_test.json", - "tests/features/test_data/evaluation_job_pass_criteria_from_provider_test.json", - "tests/features/test_data/evaluation_job_with_collection.json", - "tests/features/test_data/patch_collection_benchmark.json", - "tests/features/test_data/patch_collection_benchmark_full.json", - "tests/features/test_data/patch_collection_description.json", - "tests/features/test_data/patch_collection_invalid.json", - "tests/features/test_data/patch_collection_name.json" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 290, - "state": "MERGED", - "title": "fix: update S3 test data init image handling in job configuration", - "branch": "fix-init-container-default", - "labels": [ - "kind/fix" - ], - "files": [ - "internal/runtimes/k8s/job_builders.go", - "internal/runtimes/k8s/job_builders_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 287, - "state": "MERGED", - "title": "fix: return bad request instead of 403 if the request is missing requ…", - "branch": "marius", - "labels": [], - "files": [ - "auth/authorization.go", - "auth/rules.go", - "auth/rules_test.go", - "auth/testdata/rbac_mixed.yaml", - "cmd/eval_hub/server/authorization.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 
284, - "state": "MERGED", - "title": "fix: test-all-coverage failure fixes", - "branch": "fix-tests", - "labels": [], - "files": [ - "tests/features/evaluations.feature" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 282, - "state": "MERGED", - "title": "fix(storage): scope GetEvaluationJobs list query to WithTenant", - "branch": "tenant-filter", - "labels": [ - "kind/fix" - ], - "files": [ - "internal/storage/sql/evaluations.go", - "internal/storage/sql/evaluations_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 281, - "state": "MERGED", - "title": "fix(k8s): resolve namespace using tenant from evaluation resource", - "branch": "ns-fix", - "labels": [ - "kind/fix" - ], - "files": [ - "internal/runtimes/k8s/job_config.go", - "internal/runtimes/k8s/job_config_test.go", - "internal/runtimes/k8s/k8s_runtime.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 275, - "state": "MERGED", - "title": "fix: auth rules not enforced", - "branch": "fix-sar", - "labels": [ - "kind/fix" - ], - "files": [ - "auth/rules.go", - "internal/config/loader.go", - "internal/runtimes/k8s/examples/eval-job.yaml", - "internal/runtimes/k8s/job_config.go", - "tests/kubernetes/features/step_definitions_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 271, - "state": "MERGED", - "title": "fix(cli): eval-hub-server cli not picking up args", - "branch": "fix/eval-hub-server-cli-args", - "labels": [], - "files": [ - "python-server/DEVELOPMENT.md", - "python-server/evalhub_server/main.py", - "python-server/pyproject.toml", - "python-server/tests/test_main.py", - "python-server/uv.lock" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 266, - "state": "MERGED", - "title": "fix: more fixes related to searching lists of resources", - "branch": "list-tests", - "labels": [], - "files": [ - "Makefile", - "internal/handlers/providers.go", - 
"internal/storage/sql/collections.go", - "internal/storage/sql/evaluations.go", - "internal/storage/sql/postgres/statements.go", - "internal/storage/sql/providers.go", - "internal/storage/sql/shared/statements.go", - "internal/storage/sql/sqlite/statements.go", - "tests/features/evaluations.feature", - "tests/features/step_definitions_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 265, - "state": "MERGED", - "title": "fix(build): fix windows platform build", - "branch": "fix/windows-build", - "labels": [], - "files": [ - "auth/testdata/rbac_mixed.yaml", - "internal/runtimes/local/local_runtime.go", - "internal/runtimes/local/process_unix.go", - "internal/runtimes/local/process_windows.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 263, - "state": "MERGED", - "title": "fix: added get verb for mlflow experiments + auth config loading", - "branch": "marius", - "labels": [], - "files": [ - "cmd/eval_hub/main.go", - "cmd/eval_hub/server/server.go", - "config/auth.yaml" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 255, - "state": "MERGED", - "title": "fix(cancel): cancellation of job should reflect at benchmark level", - "branch": "fix/cancel-should-fail-unit-benchmarks", - "labels": [], - "files": [ - ".coderabbit.yaml", - ".gitignore", - "Makefile", - "internal/storage/sql/evaluations.go", - "internal/storage/sql/evaluations_test.go", - "pkg/api/evaluations.go", - "pkg/api/evaluations_test.go", - "tests/features/evaluations.feature" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 247, - "state": "MERGED", - "title": "fix: fix the cancellation code to always call the runtime (unless already cancelled)", - "branch": "evaluations-cancel", - "labels": [], - "files": [ - "internal/handlers/evaluations.go", - "internal/handlers/evaluations_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 245, - "state": 
"MERGED", - "title": "fix: loadProvider ignores explicit config dir for provider file reads", - "branch": "config-fix", - "labels": [ - "kind/fix", - "area/configuration" - ], - "files": [ - "internal/config/loader.go", - "internal/config/loader_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 243, - "state": "MERGED", - "title": "fix: Remove AUTH related envs from job pod", - "branch": "model_auth_fixes", - "labels": [], - "files": [ - "internal/runtimes/k8s/job_builders.go", - "internal/runtimes/k8s/job_builders_test.go", - "internal/runtimes/k8s/k8s_runtime_test.go", - "internal/runtimes/k8s/k8s_runtime_unit_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 240, - "state": "MERGED", - "title": "fix: fix the update logic (and error message) when a job is in a final state", - "branch": "update-return-code", - "labels": [], - "files": [ - "internal/handlers/evaluations.go", - "internal/messages/messages.go", - "internal/storage/sql/evaluations.go", - "internal/storage/sql/evaluations_test.go", - "tests/features/evaluations.feature" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 234, - "state": "MERGED", - "title": "fix: fix code so that the reported caller is not the helper function", - "branch": "log-caller", - "labels": [], - "files": [ - "Makefile", - "cmd/eval_hub/server/execution_context.go", - "internal/logging/logger.go", - "package-lock.json" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 231, - "state": "MERGED", - "title": "fix: add filtering by tenant_id", - "branch": "query-evaluations", - "labels": [], - "files": [ - "cmd/eval_hub/server/execution_context.go", - "cmd/eval_hub/server/server.go", - "internal/abstractions/storage.go", - "internal/handlers/evaluations.go", - "internal/storage/sql/collections.go", - "internal/storage/sql/evaluations.go", - "internal/storage/sql/evaluations_test.go", - 
"internal/storage/sql/helper.go", - "tests/features/evaluations.feature", - "tests/features/step_definitions_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 226, - "state": "MERGED", - "title": "fix: benchmark results are stored per index", - "branch": "benchmark-index", - "labels": [], - "files": [ - "internal/storage/common/common.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 225, - "state": "MERGED", - "title": "fix: local runtime directory structure with benchmark_index", - "branch": "fix-local-dir-path", - "labels": [], - "files": [ - "docs/src/components/schemas/BenchmarkResult.yaml", - "docs/src/components/schemas/BenchmarkTest.yaml", - "docs/src/components/schemas/EvaluationJobResults.yaml", - "docs/src/components/schemas/EvaluationTest.yaml", - "docs/src/components/schemas/ModelAuth.yaml", - "internal/runtimes/local/local_runtime.go", - "internal/runtimes/local/local_runtime_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 223, - "state": "MERGED", - "title": "fix: remove redundant message and update swagger", - "branch": "redundant-message", - "labels": [], - "files": [ - "docs/index-private.html", - "docs/index-public.html", - "docs/index.html", - "docs/openapi-internal.json", - "docs/openapi-internal.yaml", - "docs/openapi.json", - "docs/openapi.yaml", - "docs/src/components/schemas/EvaluationResource.yaml", - "internal/storage/sql/evaluations.go", - "pkg/api/evaluations.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 217, - "state": "MERGED", - "title": "fix: run k8s benchmark creation sequentially", - "branch": "kube_fixes", - "labels": [], - "files": [ - "internal/runtimes/k8s/k8s_runtime.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 216, - "state": "MERGED", - "title": "fix: Detach runtime context from request cancellation", - "branch": "context_fix", - "labels": [], - "files": [ 
- "internal/runtimes/k8s/k8s_runtime.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 214, - "state": "MERGED", - "title": "fix: name kube resources as -", - "branch": "guid_fix", - "labels": [], - "files": [ - "internal/runtimes/k8s/job_builders.go", - "internal/runtimes/k8s/job_builders_test.go", - "internal/runtimes/k8s/job_config.go", - "internal/runtimes/k8s/k8s_helper.go", - "internal/runtimes/k8s/k8s_runtime.go", - "internal/runtimes/k8s/k8s_runtime_test.go", - "internal/runtimes/k8s/k8s_runtime_unit_test.go", - "tests/kubernetes/features/kubernetes_resources.feature", - "tests/kubernetes/features/step_definitions_test.go", - "tests/kubernetes/features/test_data/evaluation_job_multi_benchmark.json" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 203, - "state": "MERGED", - "title": "fix(ci): install redocly from package-lock.json in CI", - "branch": "tarilabs-20260219-inCIuseLOCK", - "labels": [], - "files": [ - ".github/workflows/ci.yml" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 200, - "state": "MERGED", - "title": "fix(mlflow): re-read auth token from disk on each request to handle k8s token rotation", - "branch": "token-refresh", - "labels": [ - "kind/fix" - ], - "files": [ - "internal/config/mlflow_config.go", - "internal/mlflow/mlflow.go", - "pkg/mlflowclient/client.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 199, - "state": "MERGED", - "title": "fix: add annotations for the kube resources", - "branch": "k8s_name", - "labels": [], - "files": [ - "internal/runtimes/k8s/examples/eval-job-configmap.yaml", - "internal/runtimes/k8s/examples/eval-job.yaml", - "internal/runtimes/k8s/job_builders.go", - "internal/runtimes/k8s/job_builders_test.go", - "internal/runtimes/k8s/k8s_runtime_test.go", - "internal/runtimes/k8s/k8s_runtime_unit_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 198, - "state": 
"MERGED", - "title": "fix: remove CA bundle as default", - "branch": "certs-fix", - "labels": [ - "kind/fix" - ], - "files": [ - "internal/runtimes/k8s/job_builders.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 197, - "state": "MERGED", - "title": "fix: collision in Job naming", - "branch": "k8s_name", - "labels": [], - "files": [ - "internal/runtimes/k8s/job_builders.go", - "internal/runtimes/k8s/job_builders_test.go", - "internal/runtimes/k8s/k8s_runtime.go", - "internal/runtimes/k8s/k8s_runtime_test.go", - "internal/runtimes/k8s/k8s_runtime_unit_test.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 189, - "state": "MERGED", - "title": "fix: Add missing provider id from Job Spec", - "branch": "provider-id-fix", - "labels": [ - "kind/fix" - ], - "files": [ - "internal/runtimes/k8s/job_config.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 185, - "state": "MERGED", - "title": "fix: validate benchmark/provider IDs and status events", - "branch": "few_fixes", - "labels": [], - "files": [ - "cmd/eval_hub/server/server_test.go", - "internal/handlers/evaluations.go", - "internal/handlers/evaluations_test.go", - "internal/handlers/providers_test.go", - "internal/messages/messages.go", - "pkg/api/evaluations.go" - ], - "changes_requested": [], - "close_reason": null - }, - { - "number": 183, - "state": "MERGED", - "title": "fix: do not rollback when not needed", - "branch": "eval-sql-txn", - "labels": [], - "files": [ - "Makefile", - "internal/abstractions/storage.go", - "internal/handlers/evaluations.go", - "internal/storage/sql/evaluations.go", - "package-lock.json", - "package.json", - "pkg/api/evaluations.go" - ], - "changes_requested": [], - "close_reason": null - } -] \ No newline at end of file diff --git a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/cve-prs.json 
b/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/cve-prs.json
deleted file mode 100644
index ce53332b..00000000
--- a/workflows/guidance-generator/artifacts/guidance/eval-hub-eval-hub/raw/cve-prs.json
+++ /dev/null
@@ -1,67 +0,0 @@
-[
-  {
-    "number": 390,
-    "state": "OPEN",
-    "title": "fix(cve): upgrade Go from 1.25.8 to 1.26.1",
-    "branch": "fix/cve-2026-go-stdlib-upgrade-attempt-1",
-    "labels": [],
-    "files": [
-      "Containerfile",
-      "go.mod",
-      "go.sum"
-    ],
-    "changes_requested": [],
-    "close_reason": null
-  },
-  {
-    "number": 384,
-    "state": "MERGED",
-    "title": "fix(cve): Update Go stdlib to 1.25.8 - fix 3 vulnerabilities",
-    "branch": "fix/cve-2026-go-stdlib-vulns-attempt-1",
-    "labels": [],
-    "files": [
-      "go.mod"
-    ],
-    "changes_requested": [],
-    "close_reason": null
-  },
-  {
-    "number": 383,
-    "state": "CLOSED",
-    "title": "fix(cve): update Go to 1.25.8 to address 3 stdlib CVEs",
-    "branch": "ambient/cve-fixer-nightly-1774600830",
-    "labels": [],
-    "files": [
-      "Containerfile",
-      "go.mod"
-    ],
-    "changes_requested": [],
-    "close_reason": "📝 Walkthrough ## Walkthrough The Go toolchain version has been updated | ## [Codecov](https://app.codecov.io/gh/eval-hub/eval-hub/pull/383?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=eval-hub) R"
-  },
-  {
-    "number": 382,
-    "state": "CLOSED",
-    "title": "fix(cve): update Go stdlib to 1.25.8 for CVE fixes",
-    "branch": "fix/cve-go-stdlib-1.25.8-attempt-1",
-    "labels": [],
-    "files": [
-      "go.mod",
-      "go.sum"
-    ],
-    "changes_requested": [],
-    "close_reason": "## [Codecov](https://app.codecov.io/gh/eval-hub/eval-hub/pull/382?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=eval-hub) R | I am still working on this - I will get it to also update the other files"
-  },
-  {
-    "number": 379,
-    "state": "CLOSED",
-    "title": "fix(cve): Update Go toolchain to 1.25.8 for stdlib CVE fixes",
-    "branch": "fix/cve-go-stdlib-go1.25.8-attempt-1",
-    "labels": [],
-    "files": [
-      "go.mod",
-      "go.sum"
-    ],
-    "changes_requested": [],
-    "close_reason": "
📝 Walkthrough ## Walkthrough The `go.mod` file is updated to include a | ## [Codecov](https://app.codecov.io/gh/eval-hub/eval-hub/pull/379?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=eval-hub) R"
-  }
-]
\ No newline at end of file

From d3027f9638db415063cb1d9c857193d763ad81f Mon Sep 17 00:00:00 2001
From: Vaishnavi-Modi
Date: Mon, 30 Mar 2026 09:27:18 -0400
Subject: [PATCH 08/23] feat: add multi-repo support to guidance.generate and guidance.update
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Both commands now accept multiple repos — space-separated, comma-separated,
or mixed. Each repo is processed independently in a loop; one failure does
not abort the others.

Changes:
- Arguments: accept [ ...] or ,,...
- Step 1: parse all repo refs into REPOS array (normalize URLs, deduplicate)
- --pr flag: full URLs apply only to their matching repo; plain numbers apply
  to all repos via REPO_SPECIFIC_PRS + GLOBAL_PR_NUMBERS maps
- Steps 2-8 (generate) / 2-9 (update): wrapped in for REPO in REPOS loop
- Per-repo cleanup inside the loop; failure skips to next repo
- Summary: lists all repos with PR URLs, skips, and failures
- README and ambient.json updated with multi-repo examples

Co-Authored-By: Claude Sonnet 4.6 (1M context)
---
 .../guidance-generator/.ambient/ambient.json |   6 +-
 .../.claude/commands/guidance.generate.md    | 179 ++++++++++++------
 .../.claude/commands/guidance.update.md      | 152 ++++++++++-----
 workflows/guidance-generator/README.md       |  36 ++--
 4 files changed, 252 insertions(+), 121 deletions(-)

diff --git a/workflows/guidance-generator/.ambient/ambient.json b/workflows/guidance-generator/.ambient/ambient.json
index 850efe46..7efa1786 100644
--- a/workflows/guidance-generator/.ambient/ambient.json
+++ b/workflows/guidance-generator/.ambient/ambient.json
@@ -1,8 +1,8 @@
 {
   "name": "PR Guidance Generator",
-  "description": "Analyze a
repository's merged and closed fix PRs to generate compact guidance files that teach automated workflows (CVE Fixer, Bugfix) how to create PRs that match that repo's conventions. Supports full history analysis or targeted analysis of specific PRs.", - "systemPrompt": "You are a PR pattern analyst for the Ambient Code Platform. Your role is to help teams generate and maintain guidance files that teach automated fix workflows how to create pull requests matching their repository's conventions.\n\nKEY RESPONSIBILITIES:\n- Fetch and analyze historical fix PRs from a GitHub repository\n- Extract patterns from merged PRs (what works) and closed PRs (what to avoid)\n- Generate compact, high-signal guidance files — no fluff, no verbose examples\n- Create pull requests in target repositories with the generated guidance files\n- Update existing guidance files with patterns from new PRs\n\nWORKFLOW METHODOLOGY:\n1. GENERATE - Analyze a repo's PR history (or specific PRs) and create guidance files, then open a PR\n2. UPDATE - Fetch PRs since last analysis (or specific PRs), merge new patterns, open an update PR\n\nAVAILABLE COMMANDS:\n/guidance.generate - Analyze a repo and generate guidance files, then create a PR\n/guidance.update - Refresh existing guidance with recent PRs and create an update PR\n\nBoth commands accept --pr [,...] to analyze specific PRs instead of fetching all.\n\nOUTPUT LOCATIONS:\n- Raw PR data: artifacts/guidance//raw/\n- Analysis output: artifacts/guidance//analysis/\n- Generated files: artifacts/guidance//output/\n\nCORE PRINCIPLES:\n- Guidance files must stay under 80 lines — dense signal, no narrative\n- Use adaptive rule threshold based on bucket size: 3+ PRs (10+ merged), 2+ PRs (3-9 merged), 1+ PR with limited-data warning (1-2 merged). Skip file only if 0 merged PRs.\n- In --pr mode: analyze only the specified PRs. 
Never drop a user-specified PR even if it does not match CVE or bugfix bucket patterns — include it in both buckets and let synthesis classify it.\n- Merged PRs = positive examples. Closed PRs = what to avoid.\n- Review REQUEST_CHANGES comments reveal what workflows should do proactively.\n- Never guess patterns — only state what the PR data supports.\n- Sanitize control characters from all PR text fields before JSON construction to prevent silent failures.", - "startupPrompt": "Welcome! I'm your PR Guidance Generator.\n\nI analyze a repository's fix PR history to generate compact guidance files that teach automated workflows — like CVE Fixer and Bugfix — how to create PRs your team will actually merge.\n\nAVAILABLE COMMANDS:\n/guidance.generate - Analyze PR history and generate guidance files, then open a PR\n/guidance.update - Refresh existing guidance with recent PRs and open an update PR\n\nFlags (both commands):\n --pr , Analyze specific PRs instead of fetching all (URLs or numbers, comma-separated)\n\nFlags for /guidance.generate only:\n --cve-only Only generate CVE fix guidance\n --bugfix-only Only generate bugfix guidance\n --limit N Max PRs to analyze per bucket (default: 100)\n\nEXAMPLES:\n /guidance.generate org/repo\n /guidance.generate org/repo --pr https://github.com/org/repo/pull/42,https://github.com/org/repo/pull/87\n /guidance.update org/repo --pr 103\n\nWhat repository would you like to analyze?", + "description": "Analyze merged and closed fix PRs across one or more repositories to generate compact guidance files that teach automated workflows (CVE Fixer, Bugfix) how to create PRs matching each repo's conventions. Each repo is processed independently.", + "systemPrompt": "You are a PR pattern analyst for the Ambient Code Platform. 
Your role is to help teams generate and maintain guidance files that teach automated fix workflows how to create pull requests matching their repository's conventions.\n\nKEY RESPONSIBILITIES:\n- Fetch and analyze historical fix PRs from one or more GitHub repositories\n- Process each repository independently — one failure must not abort others\n- Extract patterns from merged PRs (what works) and closed PRs (what to avoid)\n- Generate compact, high-signal guidance files — no fluff, no verbose examples\n- Create one pull request per repository with the generated guidance files\n- Update existing guidance files with patterns from new PRs\n- Print a final summary listing all PR URLs and any failures\n\nWORKFLOW METHODOLOGY:\n1. GENERATE - Parse multiple repos, loop over each: analyze PR history (or specific PRs), create guidance files, open a PR per repo\n2. UPDATE - Parse multiple repos, loop over each: fetch new PRs (or specific PRs), merge patterns, open an update PR per repo\n\nAVAILABLE COMMANDS:\n/guidance.generate [ ...] - Generate guidance files for one or more repos\n/guidance.update [ ...] - Refresh existing guidance for one or more repos\n\nBoth commands accept repos space-separated or comma-separated (or mixed).\nBoth commands accept --pr [,...] to analyze specific PRs.\nFull PR URLs apply only to their matching repo; plain numbers apply to all repos.\n\nOUTPUT LOCATIONS (per repo):\n- Raw PR data: artifacts/guidance//raw/\n- Analysis output: artifacts/guidance//analysis/\n- Generated files: artifacts/guidance//output/\n\nCORE PRINCIPLES:\n- Process each repo independently in a loop — never let one repo failure abort others\n- Guidance files target ~80 lines; never drop rules to enforce the limit\n- Use adaptive rule threshold: 3+ PRs (large bucket), 2+ (medium), 1+ with limited-data warning (small). Skip file only if 0 merged PRs.\n- In --pr mode: never drop a user-specified PR even if it does not match bucket patterns\n- Merged PRs = positive examples. 
Closed PRs = what to avoid.\n- Review REQUEST_CHANGES comments reveal what workflows should do proactively.\n- Never guess patterns — only state what the PR data supports.\n- Sanitize control characters from all PR text fields before JSON construction.", + "startupPrompt": "Welcome! I'm your PR Guidance Generator.\n\nI analyze fix PR history across one or more repositories to generate compact guidance files that teach automated workflows — like CVE Fixer and Bugfix — how to create PRs your team will actually merge. Each repo gets its own guidance files and its own PR.\n\nAVAILABLE COMMANDS:\n/guidance.generate - Analyze PR history and generate guidance files\n/guidance.update - Refresh existing guidance with recent PRs\n\nREPO INPUT — space-separated, comma-separated, or mixed:\n /guidance.generate org/repo1 org/repo2\n /guidance.generate org/repo1,org/repo2,org/repo3\n\nFLAGS (both commands):\n --pr , Analyze specific PRs (full URLs or numbers, comma-separated)\n Full URLs apply to their repo only; plain numbers apply to all\n\nFLAGS (guidance.generate only):\n --cve-only Only generate CVE fix guidance\n --bugfix-only Only generate bugfix guidance\n --limit N Max PRs per bucket per repo (default: 100)\n\nWhat repositories would you like to analyze?", "results": { "Generated Guidance": "artifacts/guidance/**/output/*.md", "PR Analysis": "artifacts/guidance/**/analysis/*.md" diff --git a/workflows/guidance-generator/.claude/commands/guidance.generate.md b/workflows/guidance-generator/.claude/commands/guidance.generate.md index 0f4c4eed..ea8f734a 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.generate.md +++ b/workflows/guidance-generator/.claude/commands/guidance.generate.md @@ -35,65 +35,116 @@ Artifacts: artifacts/guidance/org-repo/ ## Arguments ``` -/guidance.generate [--cve-only] [--bugfix-only] [--limit N] -/guidance.generate --pr [,...] +/guidance.generate [ ...] [--cve-only] [--bugfix-only] [--limit N] +/guidance.generate [,,...] 
[--cve-only] [--bugfix-only] [--limit N] +/guidance.generate <repo-url> [<repo-url> ...] --pr <ref>[,<ref>...] ``` -- `repo-url`: Full GitHub URL (e.g., `https://github.com/org/repo`) or `org/repo` -- `--cve-only`: Skip bugfix analysis -- `--bugfix-only`: Skip CVE analysis -- `--limit N`: Max PRs to fetch per bucket (default: 100, min: 20) -- `--pr <refs>`: Comma-separated PR URLs or numbers to analyze instead of fetching all PRs. - Skips bulk fetch entirely. Accepts full URLs (`https://github.com/org/repo/pull/123`) - or plain numbers (`123`). The generated file will include a `manual-selection` note - in its header. +- `repo-url`: One or more repos — space-separated or comma-separated (or both). + Accepts full GitHub URLs (`https://github.com/org/repo`) or `org/repo` slugs. + Each repo is processed independently and gets its own PR. +- `--cve-only`: Skip bugfix analysis for all repos +- `--bugfix-only`: Skip CVE analysis for all repos +- `--limit N`: Max PRs to fetch per bucket per repo (default: 100, min: 20) +- `--pr <refs>`: Comma-separated PR URLs or numbers. Full URLs + (`https://github.com/org/repo/pull/123`) are applied only to their matching repo. + Plain numbers (`123`) are applied to all repos. ## Process ### 1. Parse Arguments and Validate -Extract `REPO` in `org/repo` format from the provided URL or slug. -If not provided, ask: "What is the GitHub repository URL?" - -Parse `--pr` into a comma-separated list of PR numbers. Accept both full GitHub -PR URLs and plain numbers: +Parse all repo references (space-separated, comma-separated, or mixed) and +`--pr` into structured data. Validate `gh` auth once before the loop. ```bash -# Validate gh auth +# Validate gh auth once gh auth status || { echo "ERROR: gh not authenticated. Run 'gh auth login'"; exit 1; } -# Validate repo exists and is accessible -gh repo view "$REPO" --json name > /dev/null 2>&1 || { - echo "ERROR: Cannot access $REPO. Check URL and permissions." 
- exit 1 +# Normalize repo args: replace commas with spaces, strip GitHub URL prefix, +# deduplicate, and collect into REPOS array +normalize_repo() { + local REF="$1" + if [[ "$REF" =~ github\.com/([a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+) ]]; then + echo "${BASH_REMATCH[1]}" + elif [[ "$REF" =~ ^[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+$ ]]; then + echo "$REF" + else + echo "WARNING: Cannot parse repo '$REF' — skipping" >&2 + echo "" + fi } -# Derive a safe slug for directory names (replace / with -) -REPO_SLUG=$(echo "$REPO" | tr '/' '-') +REPOS=() +for RAW in $(echo "$REPO_ARGS" | tr ',' ' '); do + NORMALIZED=$(normalize_repo "$RAW") + [ -n "$NORMALIZED" ] && REPOS+=("$NORMALIZED") +done + +# Deduplicate +REPOS=($(printf '%s\n' "${REPOS[@]}" | awk '!seen[$0]++')) + +if [ ${#REPOS[@]} -eq 0 ]; then + echo "ERROR: No valid repository references provided." + echo "Usage: /guidance.generate org/repo1 org/repo2" + exit 1 +fi + +echo "Repos to process (${#REPOS[@]}):" +for R in "${REPOS[@]}"; do echo " - $R"; done + +# Parse --pr: full URLs map to their repo; plain numbers apply to all repos +declare -A REPO_SPECIFIC_PRS # keyed by "org/repo", value = space-separated PR numbers +GLOBAL_PR_NUMBERS="" # plain numbers — applied to every repo -# Parse --pr flag: extract PR numbers from URLs or plain numbers -SPECIFIC_PR_NUMBERS="" if [ -n "$PR_REFS" ]; then - IFS=',' read -ra PR_LIST <<< "$PR_REFS" + IFS=',' read -ra PR_LIST <<< "$(echo "$PR_REFS" | tr ' ' ',')" for PR_REF in "${PR_LIST[@]}"; do PR_REF=$(echo "$PR_REF" | tr -d ' ') - if [[ "$PR_REF" =~ github\.com/[^/]+/[^/]+/pull/([0-9]+) ]]; then - SPECIFIC_PR_NUMBERS="$SPECIFIC_PR_NUMBERS ${BASH_REMATCH[1]}" + if [[ "$PR_REF" =~ github\.com/([a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+)/pull/([0-9]+) ]]; then + PR_REPO="${BASH_REMATCH[1]}" + PR_NUM="${BASH_REMATCH[2]}" + REPO_SPECIFIC_PRS["$PR_REPO"]="${REPO_SPECIFIC_PRS[$PR_REPO]:-} $PR_NUM" elif [[ "$PR_REF" =~ ^[0-9]+$ ]]; then - SPECIFIC_PR_NUMBERS="$SPECIFIC_PR_NUMBERS $PR_REF" + 
GLOBAL_PR_NUMBERS="$GLOBAL_PR_NUMBERS $PR_REF" else echo "WARNING: Could not parse PR reference '$PR_REF' — skipping" fi done - SPECIFIC_PR_NUMBERS=$(echo "$SPECIFIC_PR_NUMBERS" | tr -s ' ' | sed 's/^ //') - echo "Manual PR mode: analyzing PR(s) $SPECIFIC_PR_NUMBERS" + GLOBAL_PR_NUMBERS=$(echo "$GLOBAL_PR_NUMBERS" | tr -s ' ' | sed 's/^ //') fi -# Setup directories -mkdir -p "artifacts/guidance/$REPO_SLUG/raw" -mkdir -p "artifacts/guidance/$REPO_SLUG/analysis" -mkdir -p "artifacts/guidance/$REPO_SLUG/output" -mkdir -p "/tmp/guidance-gen/$REPO_SLUG" +# Accumulators for the final summary +PR_RESULTS=() # "org/repo -> " +FAILED_REPOS=() # "org/repo -> " +``` + +--- +> **Steps 2–8 repeat for each repo in `${REPOS[@]}`.** + +```bash +for REPO in "${REPOS[@]}"; do + echo "" + echo "=== $REPO ===" + + # Validate this repo is accessible; skip on failure rather than aborting all + if ! gh repo view "$REPO" --json name > /dev/null 2>&1; then + echo " ERROR: Cannot access $REPO — skipping" + FAILED_REPOS+=("$REPO -> cannot access repository") + continue + fi + + REPO_SLUG=$(echo "$REPO" | tr '/' '-') + + # Combine repo-specific --pr numbers with global plain numbers for this repo + SPECIFIC_PR_NUMBERS="${REPO_SPECIFIC_PRS[$REPO]:-} $GLOBAL_PR_NUMBERS" + SPECIFIC_PR_NUMBERS=$(echo "$SPECIFIC_PR_NUMBERS" | tr -s ' ' | sed 's/^ //') + [ -n "$SPECIFIC_PR_NUMBERS" ] && echo " Manual PR mode: PR(s) $SPECIFIC_PR_NUMBERS" + + mkdir -p "artifacts/guidance/$REPO_SLUG/raw" + mkdir -p "artifacts/guidance/$REPO_SLUG/analysis" + mkdir -p "artifacts/guidance/$REPO_SLUG/output" + mkdir -p "/tmp/guidance-gen/$REPO_SLUG" ``` ### 2. Fetch PR Metadata (Pass 1 — lightweight) @@ -578,31 +629,42 @@ echo "PR created: $PR_URL" report the error clearly. Tell the user to create the PR manually and provide the branch name. -### 8. Cleanup +### 8. 
Cleanup (per repo) ```bash -cd / -rm -rf "/tmp/guidance-gen/$REPO_SLUG" -echo "Cleaned up /tmp/guidance-gen/$REPO_SLUG" + cd / + rm -rf "/tmp/guidance-gen/$REPO_SLUG" + + # Collect result for final summary + if [ -n "${PR_URL:-}" ]; then + PR_RESULTS+=("$REPO -> $PR_URL") + else + FAILED_REPOS+=("$REPO -> PR creation failed (see output above)") + fi + +done # end of per-repo loop ``` ### 9. Print Summary +Print one entry per repo, then a totals line. + ``` -Done. +Done. Processed repo(s). -Repository: https://github.com/ -Analyzed: CVE PRs ( merged, closed) - Bugfix PRs ( merged, closed) -Rules: CVE rules, bugfix rules (adaptive threshold applied) +org/repo1 + CVE: 12 rules | Bugfix: 9 rules + PR: https://github.com/org/repo1/pull/88 -Files generated: - artifacts/guidance//output/cve-fix-guidance.md - artifacts/guidance//output/bugfix-guidance.md +org/repo2 + CVE: skipped (0 merged CVE PRs) + Bugfix: 7 rules + PR: https://github.com/org/repo2/pull/41 -PR: +org/repo3 — FAILED: cannot access repository -Artifacts: artifacts/guidance// +--- +PRs created: | Failed: ``` ## Output @@ -617,15 +679,16 @@ Artifacts: artifacts/guidance// ## Success Criteria -- [ ] Both buckets filtered from PR metadata -- [ ] Per-PR details fetched (files + review REQUEST_CHANGES) -- [ ] Closed PRs have closing context fetched -- [ ] Patterns synthesized with 3-PR minimum applied -- [ ] Guidance files aim for ~80 lines (all threshold-passing rules included regardless) -- [ ] Files written to artifacts/output/ -- [ ] PR created in target repo with correct files in .cve-fix/ and .bugfix/ -- [ ] /tmp cleaned up -- [ ] PR URL printed to console +- [ ] All repos parsed from input (space and comma separated) +- [ ] gh auth validated once before the loop +- [ ] Each repo processed independently — one failure does not abort others +- [ ] Per-repo: both buckets filtered from PR metadata +- [ ] Per-repo: per-PR details fetched (files + review REQUEST_CHANGES) +- [ ] Per-repo: patterns synthesized with 
adaptive threshold +- [ ] Per-repo: guidance files written to artifacts/guidance/<repo-slug>/output/ +- [ ] Per-repo: PR created in target repo +- [ ] Per-repo: /tmp cleaned up after PR creation +- [ ] Final summary lists all repos with PR URLs and any failures ## Notes diff --git a/workflows/guidance-generator/.claude/commands/guidance.update.md b/workflows/guidance-generator/.claude/commands/guidance.update.md index ae90cfcf..e68560db 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.update.md +++ b/workflows/guidance-generator/.claude/commands/guidance.update.md @@ -34,52 +34,103 @@ Updating files and creating PR... https://github.com/org/repo/pull/103 ## Arguments ``` -/guidance.update <repo-url> -/guidance.update <repo-url> --pr <ref>[,<ref>...] +/guidance.update <repo-url> [<repo-url> ...] +/guidance.update <repo-url>[,<repo-url>,...] +/guidance.update <repo-url> [<repo-url> ...] --pr <ref>[,<ref>...] ``` -- `repo-url`: Full GitHub URL or `org/repo` slug -- `--pr <refs>`: Comma-separated PR URLs or numbers to merge into existing guidance - instead of fetching all PRs since the last-analyzed date. Accepts full URLs - (`https://github.com/org/repo/pull/123`) or plain numbers (`123`). The - `last-analyzed` date in the file header is still updated to today. +- `repo-url`: One or more repos — space-separated or comma-separated (or both). + Each repo is updated independently and gets its own PR. +- `--pr <refs>`: Comma-separated PR URLs or numbers. Full URLs are applied only to + their matching repo. Plain numbers are applied to all repos. The `last-analyzed` + date is still updated to today in all files. ## Process ### 1. Parse Arguments and Validate ```bash +# Validate gh auth once gh auth status || { echo "ERROR: gh not authenticated. 
Run 'gh auth login'"; exit 1; } -gh repo view "$REPO" --json name > /dev/null 2>&1 || { - echo "ERROR: Cannot access $REPO" - exit 1 +# Normalize repo args: replace commas with spaces, strip GitHub URL prefix, deduplicate +normalize_repo() { + local REF="$1" + if [[ "$REF" =~ github\.com/([a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+) ]]; then + echo "${BASH_REMATCH[1]}" + elif [[ "$REF" =~ ^[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+$ ]]; then + echo "$REF" + else + echo "WARNING: Cannot parse repo '$REF' — skipping" >&2 + echo "" + fi } -REPO_SLUG=$(echo "$REPO" | tr '/' '-') +REPOS=() +for RAW in $(echo "$REPO_ARGS" | tr ',' ' '); do + NORMALIZED=$(normalize_repo "$RAW") + [ -n "$NORMALIZED" ] && REPOS+=("$NORMALIZED") +done + +REPOS=($(printf '%s\n' "${REPOS[@]}" | awk '!seen[$0]++')) + +if [ ${#REPOS[@]} -eq 0 ]; then + echo "ERROR: No valid repository references provided." + exit 1 +fi + +echo "Repos to process (${#REPOS[@]}):" +for R in "${REPOS[@]}"; do echo " - $R"; done + +# Parse --pr: full URLs map to their repo; plain numbers apply to all repos +declare -A REPO_SPECIFIC_PRS +GLOBAL_PR_NUMBERS="" -# Parse --pr flag: extract PR numbers from URLs or plain numbers -SPECIFIC_PR_NUMBERS="" if [ -n "$PR_REFS" ]; then - IFS=',' read -ra PR_LIST <<< "$PR_REFS" + IFS=',' read -ra PR_LIST <<< "$(echo "$PR_REFS" | tr ' ' ',')" for PR_REF in "${PR_LIST[@]}"; do PR_REF=$(echo "$PR_REF" | tr -d ' ') - if [[ "$PR_REF" =~ github\.com/[^/]+/[^/]+/pull/([0-9]+) ]]; then - SPECIFIC_PR_NUMBERS="$SPECIFIC_PR_NUMBERS ${BASH_REMATCH[1]}" + if [[ "$PR_REF" =~ github\.com/([a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+)/pull/([0-9]+) ]]; then + PR_REPO="${BASH_REMATCH[1]}" + PR_NUM="${BASH_REMATCH[2]}" + REPO_SPECIFIC_PRS["$PR_REPO"]="${REPO_SPECIFIC_PRS[$PR_REPO]:-} $PR_NUM" elif [[ "$PR_REF" =~ ^[0-9]+$ ]]; then - SPECIFIC_PR_NUMBERS="$SPECIFIC_PR_NUMBERS $PR_REF" + GLOBAL_PR_NUMBERS="$GLOBAL_PR_NUMBERS $PR_REF" else echo "WARNING: Could not parse PR reference '$PR_REF' — skipping" fi done - 
SPECIFIC_PR_NUMBERS=$(echo "$SPECIFIC_PR_NUMBERS" | tr -s ' ' | sed 's/^ //') - echo "Manual PR mode: merging PR(s) $SPECIFIC_PR_NUMBERS into existing guidance" + GLOBAL_PR_NUMBERS=$(echo "$GLOBAL_PR_NUMBERS" | tr -s ' ' | sed 's/^ //') fi -mkdir -p "artifacts/guidance/$REPO_SLUG/raw" -mkdir -p "artifacts/guidance/$REPO_SLUG/analysis" -mkdir -p "artifacts/guidance/$REPO_SLUG/output" -mkdir -p "/tmp/guidance-gen/$REPO_SLUG" +PR_RESULTS=() +FAILED_REPOS=() +``` + +--- +> **Steps 2–9 repeat for each repo in `${REPOS[@]}`.** + +```bash +for REPO in "${REPOS[@]}"; do + echo "" + echo "=== $REPO ===" + + if ! gh repo view "$REPO" --json name > /dev/null 2>&1; then + echo " ERROR: Cannot access $REPO — skipping" + FAILED_REPOS+=("$REPO -> cannot access repository") + continue + fi + + REPO_SLUG=$(echo "$REPO" | tr '/' '-') + + SPECIFIC_PR_NUMBERS="${REPO_SPECIFIC_PRS[$REPO]:-} $GLOBAL_PR_NUMBERS" + SPECIFIC_PR_NUMBERS=$(echo "$SPECIFIC_PR_NUMBERS" | tr -s ' ' | sed 's/^ //') + [ -n "$SPECIFIC_PR_NUMBERS" ] && echo " Manual PR mode: PR(s) $SPECIFIC_PR_NUMBERS" + + mkdir -p "artifacts/guidance/$REPO_SLUG/raw" + mkdir -p "artifacts/guidance/$REPO_SLUG/analysis" + mkdir -p "artifacts/guidance/$REPO_SLUG/output" + mkdir -p "/tmp/guidance-gen/$REPO_SLUG" ``` ### 2. Read Existing Guidance Files from Repository @@ -480,25 +531,39 @@ PR_URL=$(gh pr create \ echo "PR created: $PR_URL" ``` -### 9. Cleanup +### 9. Cleanup (per repo) ```bash -cd / -rm -rf "/tmp/guidance-gen/$REPO_SLUG" + cd / + rm -rf "/tmp/guidance-gen/$REPO_SLUG" + + if [ -n "${PR_URL:-}" ]; then + PR_RESULTS+=("$REPO -> $PR_URL") + else + FAILED_REPOS+=("$REPO -> PR creation failed (see output above)") + fi + +done # end of per-repo loop ``` ### 10. Print Summary ``` -Done. +Done. Processed repo(s). 
+ +org/repo1 + New PRs analyzed: 8 CVE, 12 bugfix (since 2026-01-15) + Changes: 2 new rules, 3 counts updated, 1 contradiction flagged + PR: https://github.com/org/repo1/pull/103 -Repository: https://github.com/ -New PRs: CVE PRs, bugfix PRs (since ) -Changes: new rules, counts updated, contradictions flagged +org/repo2 + No guidance files found — run /guidance.generate first + SKIPPED -PR: +org/repo3 — FAILED: cannot access repository -Artifacts: artifacts/guidance// +--- +PRs created: | Skipped: | Failed: ``` ## Output @@ -513,17 +578,18 @@ Artifacts: artifacts/guidance// ## Success Criteria -- [ ] Existing guidance files found and last-analyzed date extracted -- [ ] New PRs fetched since last-analyzed date -- [ ] Per-PR details fetched for new PRs -- [ ] New patterns synthesized (new rules, updated counts, contradictions flagged) -- [ ] Existing files updated in-place (no rewrites, structure preserved) -- [ ] No existing rules were dropped — only additions and count updates -- [ ] Files exceeding 80 lines are flagged in PR description (not silently truncated) -- [ ] Header timestamps updated -- [ ] PR created in target repo -- [ ] /tmp cleaned up -- [ ] PR URL printed to console +- [ ] All repos parsed from input (space and comma separated) +- [ ] gh auth validated once before the loop +- [ ] Each repo processed independently — one failure does not abort others +- [ ] Per-repo: existing guidance files found and last-analyzed date extracted +- [ ] Per-repo: new PRs fetched (date-based or --pr specific) +- [ ] Per-repo: new patterns synthesized (new rules, updated counts, contradictions flagged) +- [ ] Per-repo: files updated in-place, no existing rules dropped +- [ ] Per-repo: files exceeding 80 lines flagged in PR description +- [ ] Per-repo: header timestamps updated +- [ ] Per-repo: PR created in target repo +- [ ] Per-repo: /tmp cleaned up +- [ ] Final summary lists all repos with PR URLs, skips, and failures ## Notes diff --git 
a/workflows/guidance-generator/README.md b/workflows/guidance-generator/README.md index ec20b6ae..fcadfce3 100644 --- a/workflows/guidance-generator/README.md +++ b/workflows/guidance-generator/README.md @@ -32,20 +32,20 @@ Full pipeline for a fresh repo. Analyzes all recent fix PRs automatically, or analyze specific PRs of your choice with `--pr`. ``` -/guidance.generate https://github.com/org/repo -/guidance.generate org/repo --cve-only -/guidance.generate org/repo --bugfix-only -/guidance.generate org/repo --limit 50 -/guidance.generate org/repo --pr https://github.com/org/repo/pull/42,https://github.com/org/repo/pull/87 -/guidance.generate org/repo --pr 42,87 +/guidance.generate org/repo1 org/repo2 org/repo3 +/guidance.generate org/repo1,org/repo2,org/repo3 +/guidance.generate org/repo1 org/repo2 --cve-only +/guidance.generate org/repo1,org/repo2 --pr 42,https://github.com/org/repo2/pull/87 ``` +Each repo is processed independently and gets its own PR. One repo failing does +not stop the others. A summary of all PR URLs is printed at the end. + Flags: -- `--cve-only` / `--bugfix-only`: generate only one of the two guidance files -- `--limit N`: cap the number of PRs fetched per bucket (default: 100) -- `--pr `: comma-separated PR URLs or numbers — skips bulk fetch and - analyzes only these PRs. Useful for seeding guidance from a curated set of - representative PRs. The generated file header records which PRs were used. +- `--cve-only` / `--bugfix-only`: generate only one of the two guidance files (all repos) +- `--limit N`: cap PRs fetched per bucket per repo (default: 100) +- `--pr `: comma-separated PR URLs or numbers — skips bulk fetch. Full URLs + are applied only to their matching repo; plain numbers apply to all repos. Generates: - `.cve-fix/examples.md` — read by the CVE Fixer workflow (step 4.5) @@ -58,15 +58,17 @@ Reads the `last-analyzed` date from existing files, fetches only newer PRs, merges new patterns, and opens a PR with the updates. 
``` -/guidance.update https://github.com/org/repo -/guidance.update org/repo --pr https://github.com/org/repo/pull/103 -/guidance.update org/repo --pr 103,104 +/guidance.update org/repo1 org/repo2 +/guidance.update org/repo1,org/repo2 +/guidance.update org/repo1 org/repo2 --pr 103,https://github.com/org/repo2/pull/104 ``` +Each repo is updated independently and gets its own PR. + Flags: -- `--pr `: instead of fetching all PRs since the last-analyzed date, - merge only the specified PRs into existing guidance. The `last-analyzed` - date is still updated to today. +- `--pr `: merge only the specified PRs instead of fetching all PRs since + the last-analyzed date. Full URLs apply to their matching repo; plain numbers + apply to all repos. The `last-analyzed` date is still updated to today. ## Generated File Format From c0fb140f41222f5d5e1aab75ba13e07db17f57a9 Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Mon, 30 Mar 2026 09:51:55 -0400 Subject: [PATCH 09/23] feat: --pr space/comma, --cve-only/--bugfix-only on update, agent startupPrompt 1. --pr accepts space-separated, comma-separated, or mixed refs on both commands (bash already handled this via tr; docs now reflect it) 2. guidance.update gains --cve-only and --bugfix-only flags: - Parsed in Step 1 alongside other flags - Skips reading the unwanted guidance file in Step 2 - Zeroes out the skipped bucket's meta file after filtering so all subsequent steps (fetch details, synthesize, merge) naturally skip it 3. startupPrompt rewritten as an agent directive: instructs the agent to ask the user which repos and which command, and to help them choose between generate vs update based on whether guidance files already exist. Replaces the previous canned greeting. 
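A minimal sketch of the tr-based handling of mixed space/comma refs mentioned above, written as a standalone helper (the function name `normalize_refs` is illustrative, not taken from the workflow):

```shell
# Illustrative helper: collapse mixed space/comma-separated refs into a
# single space-separated list, the same effect the commands get from tr
# before splitting.
normalize_refs() {
  echo "$1" | tr ',' ' ' | tr -s ' ' | sed 's/^ //; s/ $//'
}

normalize_refs "42,87 https://github.com/org/repo/pull/99"
# -> 42 87 https://github.com/org/repo/pull/99
```

Either separator style, or a mix of both, yields the same token list, which is why both commands can accept all three forms.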
Co-Authored-By: Claude Sonnet 4.6 (1M context) --- .../guidance-generator/.ambient/ambient.json | 4 +-- .../.claude/commands/guidance.generate.md | 6 ++-- .../.claude/commands/guidance.update.md | 35 ++++++++++++++----- workflows/guidance-generator/README.md | 17 +++++---- 4 files changed, 43 insertions(+), 19 deletions(-) diff --git a/workflows/guidance-generator/.ambient/ambient.json b/workflows/guidance-generator/.ambient/ambient.json index 7efa1786..6f25a442 100644 --- a/workflows/guidance-generator/.ambient/ambient.json +++ b/workflows/guidance-generator/.ambient/ambient.json @@ -1,8 +1,8 @@ { "name": "PR Guidance Generator", "description": "Analyze merged and closed fix PRs across one or more repositories to generate compact guidance files that teach automated workflows (CVE Fixer, Bugfix) how to create PRs matching each repo's conventions. Each repo is processed independently.", - "systemPrompt": "You are a PR pattern analyst for the Ambient Code Platform. Your role is to help teams generate and maintain guidance files that teach automated fix workflows how to create pull requests matching their repository's conventions.\n\nKEY RESPONSIBILITIES:\n- Fetch and analyze historical fix PRs from one or more GitHub repositories\n- Process each repository independently — one failure must not abort others\n- Extract patterns from merged PRs (what works) and closed PRs (what to avoid)\n- Generate compact, high-signal guidance files — no fluff, no verbose examples\n- Create one pull request per repository with the generated guidance files\n- Update existing guidance files with patterns from new PRs\n- Print a final summary listing all PR URLs and any failures\n\nWORKFLOW METHODOLOGY:\n1. GENERATE - Parse multiple repos, loop over each: analyze PR history (or specific PRs), create guidance files, open a PR per repo\n2. 
UPDATE - Parse multiple repos, loop over each: fetch new PRs (or specific PRs), merge patterns, open an update PR per repo\n\nAVAILABLE COMMANDS:\n/guidance.generate [ ...] - Generate guidance files for one or more repos\n/guidance.update [ ...] - Refresh existing guidance for one or more repos\n\nBoth commands accept repos space-separated or comma-separated (or mixed).\nBoth commands accept --pr [,...] to analyze specific PRs.\nFull PR URLs apply only to their matching repo; plain numbers apply to all repos.\n\nOUTPUT LOCATIONS (per repo):\n- Raw PR data: artifacts/guidance//raw/\n- Analysis output: artifacts/guidance//analysis/\n- Generated files: artifacts/guidance//output/\n\nCORE PRINCIPLES:\n- Process each repo independently in a loop — never let one repo failure abort others\n- Guidance files target ~80 lines; never drop rules to enforce the limit\n- Use adaptive rule threshold: 3+ PRs (large bucket), 2+ (medium), 1+ with limited-data warning (small). Skip file only if 0 merged PRs.\n- In --pr mode: never drop a user-specified PR even if it does not match bucket patterns\n- Merged PRs = positive examples. Closed PRs = what to avoid.\n- Review REQUEST_CHANGES comments reveal what workflows should do proactively.\n- Never guess patterns — only state what the PR data supports.\n- Sanitize control characters from all PR text fields before JSON construction.", - "startupPrompt": "Welcome! I'm your PR Guidance Generator.\n\nI analyze fix PR history across one or more repositories to generate compact guidance files that teach automated workflows — like CVE Fixer and Bugfix — how to create PRs your team will actually merge. 
Each repo gets its own guidance files and its own PR.\n\nAVAILABLE COMMANDS:\n/guidance.generate - Analyze PR history and generate guidance files\n/guidance.update - Refresh existing guidance with recent PRs\n\nREPO INPUT — space-separated, comma-separated, or mixed:\n /guidance.generate org/repo1 org/repo2\n /guidance.generate org/repo1,org/repo2,org/repo3\n\nFLAGS (both commands):\n --pr , Analyze specific PRs (full URLs or numbers, comma-separated)\n Full URLs apply to their repo only; plain numbers apply to all\n\nFLAGS (guidance.generate only):\n --cve-only Only generate CVE fix guidance\n --bugfix-only Only generate bugfix guidance\n --limit N Max PRs per bucket per repo (default: 100)\n\nWhat repositories would you like to analyze?", + "systemPrompt": "You are a PR pattern analyst for the Ambient Code Platform. Your role is to help teams generate and maintain guidance files that teach automated fix workflows how to create pull requests matching their repository's conventions.\n\nKEY RESPONSIBILITIES:\n- Fetch and analyze historical fix PRs from one or more GitHub repositories\n- Process each repository independently — one failure must not abort others\n- Extract patterns from merged PRs (what works) and closed PRs (what to avoid)\n- Generate compact, high-signal guidance files — no fluff, no verbose examples\n- Create one pull request per repository with the generated guidance files\n- Update existing guidance files with patterns from new PRs\n- Print a final summary listing all PR URLs and any failures\n\nWORKFLOW METHODOLOGY:\n1. GENERATE - Parse multiple repos, loop over each: analyze PR history (or specific PRs), create guidance files, open a PR per repo\n2. UPDATE - Parse multiple repos, loop over each: fetch new PRs (or specific PRs), merge patterns, open an update PR per repo\n\nAVAILABLE COMMANDS:\n/guidance.generate [ ...] [--cve-only] [--bugfix-only] [--limit N] [--pr ]\n/guidance.update [ ...] 
[--cve-only] [--bugfix-only] [--pr ]\n\nBoth commands accept repos and --pr refs space-separated, comma-separated, or mixed.\nFull PR URLs in --pr apply only to their matching repo; plain numbers apply to all repos.\n\nOUTPUT LOCATIONS (per repo):\n- Raw PR data: artifacts/guidance//raw/\n- Analysis output: artifacts/guidance//analysis/\n- Generated files: artifacts/guidance//output/\n\nCORE PRINCIPLES:\n- Process each repo independently in a loop — never let one repo failure abort others\n- Guidance files target ~80 lines; never drop rules to enforce the limit\n- Use adaptive rule threshold: 3+ PRs (large bucket), 2+ (medium), 1+ with limited-data warning (small). Skip file only if 0 merged PRs.\n- In --pr mode: never drop a user-specified PR even if it does not match bucket patterns\n- Merged PRs = positive examples. Closed PRs = what to avoid.\n- Review REQUEST_CHANGES comments reveal what workflows should do proactively.\n- Never guess patterns — only state what the PR data supports.\n- Sanitize control characters from all PR text fields before JSON construction.", + "startupPrompt": "Ask the user which repository or repositories they want to analyze, and whether they want to generate new guidance files or update existing ones. If they are unsure, ask whether their repos already have .cve-fix/examples.md or .bugfix/guidance.md — if yes, suggest /guidance.update; if no, suggest /guidance.generate. Keep the introduction short: one sentence describing what the workflow does, then a concise list of the two commands and their key flags. 
Do not use marketing language or a canned greeting.", "results": { "Generated Guidance": "artifacts/guidance/**/output/*.md", "PR Analysis": "artifacts/guidance/**/analysis/*.md" diff --git a/workflows/guidance-generator/.claude/commands/guidance.generate.md b/workflows/guidance-generator/.claude/commands/guidance.generate.md index ea8f734a..9c42a26e 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.generate.md +++ b/workflows/guidance-generator/.claude/commands/guidance.generate.md @@ -46,9 +46,9 @@ Artifacts: artifacts/guidance/org-repo/ - `--cve-only`: Skip bugfix analysis for all repos - `--bugfix-only`: Skip CVE analysis for all repos - `--limit N`: Max PRs to fetch per bucket per repo (default: 100, min: 20) -- `--pr `: Comma-separated PR URLs or numbers. Full URLs - (`https://github.com/org/repo/pull/123`) are applied only to their matching repo. - Plain numbers (`123`) are applied to all repos. +- `--pr `: PR URLs or numbers — space-separated, comma-separated, or mixed. + Full URLs (`https://github.com/org/repo/pull/123`) are applied only to their + matching repo. Plain numbers (`123`) are applied to all repos. ## Process diff --git a/workflows/guidance-generator/.claude/commands/guidance.update.md b/workflows/guidance-generator/.claude/commands/guidance.update.md index e68560db..ebeedb01 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.update.md +++ b/workflows/guidance-generator/.claude/commands/guidance.update.md @@ -34,16 +34,18 @@ Updating files and creating PR... https://github.com/org/repo/pull/103 ## Arguments ``` -/guidance.update [ ...] -/guidance.update [,,...] -/guidance.update [ ...] --pr [,...] +/guidance.update [ ...] [--cve-only] [--bugfix-only] +/guidance.update [,,...] [--cve-only] [--bugfix-only] +/guidance.update [ ...] --pr [ ...] ``` - `repo-url`: One or more repos — space-separated or comma-separated (or both). Each repo is updated independently and gets its own PR. 
-- `--pr `: Comma-separated PR URLs or numbers. Full URLs are applied only to - their matching repo. Plain numbers are applied to all repos. The `last-analyzed` - date is still updated to today in all files. +- `--cve-only`: Only update `.cve-fix/examples.md` — skip bugfix guidance. +- `--bugfix-only`: Only update `.bugfix/guidance.md` — skip CVE guidance. +- `--pr `: PR URLs or numbers — space-separated, comma-separated, or mixed. + Full URLs are applied only to their matching repo. Plain numbers are applied to + all repos. The `last-analyzed` date is still updated to today in all files. ## Process @@ -103,6 +105,16 @@ if [ -n "$PR_REFS" ]; then GLOBAL_PR_NUMBERS=$(echo "$GLOBAL_PR_NUMBERS" | tr -s ' ' | sed 's/^ //') fi +# Parse scope flags (apply to all repos) +CVE_ONLY=false +BUGFIX_ONLY=false +[ "${CVE_ONLY_FLAG:-}" = "true" ] && CVE_ONLY=true +[ "${BUGFIX_ONLY_FLAG:-}" = "true" ] && BUGFIX_ONLY=true +if $CVE_ONLY && $BUGFIX_ONLY; then + echo "ERROR: --cve-only and --bugfix-only are mutually exclusive." + exit 1 +fi + PR_RESULTS=() FAILED_REPOS=() ``` @@ -151,7 +163,10 @@ FOUND_CVE=false FOUND_BUGFIX=false LAST_DATE="" -if [ -f "$CVE_FILE" ]; then +$BUGFIX_ONLY && echo " --bugfix-only: skipping CVE guidance" +$CVE_ONLY && echo " --cve-only: skipping bugfix guidance" + +if [ -f "$CVE_FILE" ] && ! $BUGFIX_ONLY; then FOUND_CVE=true # Extract date from: CVE_DATE=$(grep -m1 'last-analyzed:' "$CVE_FILE" | \ @@ -160,7 +175,7 @@ if [ -f "$CVE_FILE" ]; then LAST_DATE="$CVE_DATE" fi -if [ -f "$BUGFIX_FILE" ]; then +if [ -f "$BUGFIX_FILE" ] && ! 
$CVE_ONLY; then FOUND_BUGFIX=true BUGFIX_DATE=$(grep -m1 'last-analyzed:' "$BUGFIX_FILE" | \ grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' | head -1) @@ -282,6 +297,10 @@ if [ -n "$SPECIFIC_PR_NUMBERS" ]; then fi fi +# Zero out skipped buckets so subsequent steps treat them as empty +$BUGFIX_ONLY && echo "[]" > "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" +$CVE_ONLY && echo "[]" > "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json" + NEW_CVE=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json") NEW_BUGFIX=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json") echo " CVE bucket: $NEW_CVE new PRs" diff --git a/workflows/guidance-generator/README.md b/workflows/guidance-generator/README.md index fcadfce3..82ce7209 100644 --- a/workflows/guidance-generator/README.md +++ b/workflows/guidance-generator/README.md @@ -44,8 +44,9 @@ not stop the others. A summary of all PR URLs is printed at the end. Flags: - `--cve-only` / `--bugfix-only`: generate only one of the two guidance files (all repos) - `--limit N`: cap PRs fetched per bucket per repo (default: 100) -- `--pr `: comma-separated PR URLs or numbers — skips bulk fetch. Full URLs - are applied only to their matching repo; plain numbers apply to all repos. +- `--pr `: space-separated, comma-separated, or mixed PR URLs or numbers — + skips bulk fetch. Full URLs apply only to their matching repo; plain numbers + apply to all repos. Generates: - `.cve-fix/examples.md` — read by the CVE Fixer workflow (step 4.5) @@ -60,15 +61,19 @@ merges new patterns, and opens a PR with the updates. ``` /guidance.update org/repo1 org/repo2 /guidance.update org/repo1,org/repo2 -/guidance.update org/repo1 org/repo2 --pr 103,https://github.com/org/repo2/pull/104 +/guidance.update org/repo1 org/repo2 --cve-only +/guidance.update org/repo1 org/repo2 --pr 103 https://github.com/org/repo2/pull/104 ``` Each repo is updated independently and gets its own PR. 
Flags: -- `--pr `: merge only the specified PRs instead of fetching all PRs since - the last-analyzed date. Full URLs apply to their matching repo; plain numbers - apply to all repos. The `last-analyzed` date is still updated to today. +- `--cve-only`: only update `.cve-fix/examples.md`, skip bugfix guidance. +- `--bugfix-only`: only update `.bugfix/guidance.md`, skip CVE guidance. +- `--pr `: space-separated, comma-separated, or mixed PR URLs or numbers. + Merges only the specified PRs instead of fetching all PRs since the last-analyzed + date. Full URLs apply to their matching repo; plain numbers apply to all repos. + The `last-analyzed` date is still updated to today. ## Generated File Format From 954610545792d63f20859bfe7553c042ec7ecbf7 Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Tue, 31 Mar 2026 09:16:04 -0400 Subject: [PATCH 10/23] feat: add commit fallback analysis for repos with few or no fix PRs For any bucket with fewer than 3 merged PRs, the workflow now scans recent commits as a supplementary signal source. This makes the workflow useful for brand new repos or repos that land fixes directly on branches. 
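The source-aware evidence thresholds this commit describes can be sketched as a single predicate (the function and its argument shape are illustrative, not workflow code):

```shell
# Illustrative predicate: does a candidate rule have enough evidence?
# Commit-only evidence needs 5; mixed evidence needs 3 total with at
# least 1 PR; PR-only evidence uses the bucket's adaptive threshold
# (3, 2, or 1 depending on bucket size).
rule_meets_threshold() {
  local PRS="$1" COMMITS="$2" PR_THRESHOLD="$3"
  if [ "$PRS" -eq 0 ]; then
    [ "$COMMITS" -ge 5 ]
  elif [ "$COMMITS" -gt 0 ]; then
    [ $((PRS + COMMITS)) -ge 3 ]
  else
    [ "$PRS" -ge "$PR_THRESHOLD" ]
  fi
}
```

Under these rules a bucket with 2 merged PRs and 1 matching commit clears the mixed bar, while 4 commits alone do not.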
Implementation: - New step 3.5 (generate) / 4.5 (update): fetch_commit_fallback() - Triggered when merged PR count < 3 in a bucket - Skipped entirely when --pr is specified (user chose the data) - Fetches up to 100 commit messages (lightweight, no file data) - Filters by CVE/bugfix message patterns - Fetches file list for up to 50 matched commits (one API call each) - Builds compact records with source: "commit" field - Saves to artifacts/guidance//raw/-commits.json - Commit records merged into PR detail files via jq -s before synthesis - Synthesis step updated with source-aware thresholds: - PRs: 3/2/1 depending on bucket size (existing adaptive threshold) - Commits only: 5 minimum per rule - Mixed: 3 total with at least 1 PR - Evidence notation: (N commits), (N/M merged PRs + K commits) - Commits cannot populate Don'ts section (no rejection signal) - Header gains cve-commits/bugfix-commits field when fallback was used Co-Authored-By: Claude Sonnet 4.6 (1M context) --- .../.claude/commands/guidance.generate.md | 169 ++++++++++++++++-- .../.claude/commands/guidance.update.md | 145 ++++++++++++++- 2 files changed, 288 insertions(+), 26 deletions(-) diff --git a/workflows/guidance-generator/.claude/commands/guidance.generate.md b/workflows/guidance-generator/.claude/commands/guidance.generate.md index 9c42a26e..d8ceb555 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.generate.md +++ b/workflows/guidance-generator/.claude/commands/guidance.generate.md @@ -276,6 +276,101 @@ echo " Bugfix bucket: $BUGFIX_TOTAL PRs ($BUGFIX_MERGED merged, $BUGFIX_CLOSED If both buckets are empty, report this clearly and exit — the repo may not have recognizable fix PR naming conventions. Suggest the user check PR title patterns. +### 3.5. Fetch Commit Fallback + +For any bucket with fewer than 3 merged PRs, scan recent commits as a supplementary +signal source. Skip this step entirely if `--pr` was specified (user chose the data). 
+ +```bash +# Fetch commit fallback for a bucket if merged PR count < 3 +# Args: BUCKET_LABEL META_FILE OUT_FILE MSG_PATTERN +fetch_commit_fallback() { + local LABEL="$1" + local META_FILE="$2" + local OUT_FILE="$3" + local MSG_PATTERN="$4" + + echo "[]" > "$OUT_FILE" + + # Skip if manual PR mode — user chose the data explicitly + [ -n "$SPECIFIC_PR_NUMBERS" ] && return + + local MERGED_COUNT + MERGED_COUNT=$(jq '[.[] | select(.state == "MERGED")] | length' "$META_FILE") + + if [ "$MERGED_COUNT" -ge 3 ]; then + return # Enough PR data — no fallback needed + fi + + echo " $LABEL bucket: $MERGED_COUNT merged PRs — scanning commits as fallback..." + + # Fetch up to 100 recent commit messages (lightweight — no file data yet) + gh api "repos/$REPO/commits?per_page=100" \ + --jq '.[] | {sha: .sha, message: .commit.message}' \ + > "/tmp/guidance-gen/$REPO_SLUG/${LABEL}-commits-raw.jsonl" 2>/dev/null + + local SAMPLED=0 + local MAX_COMMITS=50 + + while IFS= read -r LINE && [ "$SAMPLED" -lt "$MAX_COMMITS" ]; do + local SHA MSG_RAW TITLE + + SHA=$(echo "$LINE" | jq -r '.sha') + MSG_RAW=$(echo "$LINE" | jq -r '.message' | sanitize_str) + TITLE=$(echo "$MSG_RAW" | head -1) + + # Filter by message pattern for this bucket + echo "$TITLE" | grep -qiE "$MSG_PATTERN" || continue + + # Fetch file list for this commit (targeted — only for matched commits) + local FILES + FILES=$(gh api "repos/$REPO/commits/$SHA" \ + --jq '[.files[].filename]' 2>/dev/null || echo "[]") + + local BODY + BODY=$(echo "$MSG_RAW" | tail -n +2 | tr '\n' ' ' | cut -c1-300) + + local RECORD + RECORD=$(jq -n \ + --arg sha "$SHA" \ + --arg title "$TITLE" \ + --arg body "$BODY" \ + --argjson files "$FILES" \ + '{source: "commit", sha: $sha, state: "MERGED", + title: $title, branch: "", labels: [], + files: $files, changes_requested: [], close_reason: null, + commit_body: $body}' 2>/tmp/guidance-jq-err.txt) + + if [ $? 
-ne 0 ]; then + echo " WARNING: commit $SHA skipped — $(cat /tmp/guidance-jq-err.txt)" + continue + fi + + jq --argjson rec "$RECORD" '. + [$rec]' "$OUT_FILE" > "${OUT_FILE}.tmp" \ + && mv "${OUT_FILE}.tmp" "$OUT_FILE" + SAMPLED=$((SAMPLED + 1)) + + done < "/tmp/guidance-gen/$REPO_SLUG/${LABEL}-commits-raw.jsonl" + + local COMMIT_COUNT + COMMIT_COUNT=$(jq 'length' "$OUT_FILE") + echo " Found $COMMIT_COUNT matching $LABEL commits" + + # Save to artifacts for transparency + cp "$OUT_FILE" "artifacts/guidance/$REPO_SLUG/raw/${LABEL}-commits.json" +} + +fetch_commit_fallback "cve" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-commits.json" \ + "CVE-[0-9]{4}-[0-9]+|^security:|^fix\(cve\):|^Fix CVE" + +fetch_commit_fallback "bugfix" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-commits.json" \ + "^fix[:(]|^bugfix|^bug fix|fixes[[:space:]]#[0-9]+|closes[[:space:]]#[0-9]+" +``` + ### 4. Fetch Per-PR Details (Pass 2 — targeted) For each PR in both buckets, fetch only: file paths changed and review data. 
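The per-PR record trimming can be sketched with a plain `jq` filter — the sample JSON below is illustrative, standing in for the shape `gh pr view --json files,reviews` returns:

```shell
# Illustrative sample of one PR's files + reviews payload
cat > /tmp/pr-reviews-sample.json <<'EOF'
{"files":[{"path":"go.mod"},{"path":"go.sum"}],
 "reviews":[{"state":"APPROVED","body":"lgtm"},
            {"state":"CHANGES_REQUESTED","body":"please also update go.sum"}]}
EOF

# Keep only what synthesis needs: changed paths and CHANGES_REQUESTED bodies
jq '{files: [.files[].path],
     changes_requested: [.reviews[]
       | select(.state == "CHANGES_REQUESTED") | .body]}' \
  /tmp/pr-reviews-sample.json
```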
@@ -373,7 +468,22 @@ fetch_pr_details \ "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" \ "/tmp/guidance-gen/$REPO_SLUG/bugfix-details.json" -# Save raw data to artifacts for reference +# Merge commit fallback records into the detail files +jq -s '.[0] + .[1]' \ + "/tmp/guidance-gen/$REPO_SLUG/cve-details.json" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-commits.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/cve-details-merged.json" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/cve-details-merged.json" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-details.json" + +jq -s '.[0] + .[1]' \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-details.json" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-commits.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/bugfix-details-merged.json" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/bugfix-details-merged.json" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-details.json" + +# Save to artifacts for reference cp "/tmp/guidance-gen/$REPO_SLUG/cve-details.json" \ "artifacts/guidance/$REPO_SLUG/raw/cve-prs.json" cp "/tmp/guidance-gen/$REPO_SLUG/bugfix-details.json" \ @@ -385,25 +495,43 @@ cp "/tmp/guidance-gen/$REPO_SLUG/bugfix-details.json" \ Read `cve-details.json` and `bugfix-details.json` from the artifacts. Analyze them as the agent — do NOT write a script for this step. -**For each bucket, identify patterns across the PR records. Apply these rules:** +**Records have two sources — treat them differently:** -**Inclusion threshold**: Only include a rule if it appears in 3 or more PRs. -State the evidence count inline: `(8/9 merged PRs)`. +Records with no `source` field (or `source != "commit"`) are PR records. +Records with `source: "commit"` came from the commit fallback and have no +`changes_requested` or `close_reason` data. -**What to extract:** +**Inclusion thresholds by source:** -From merged PRs: -- **Title format**: What template do titles follow? Extract the pattern. 
- Example: `Security: Fix CVE-YYYY-XXXXX ()` or `fix(): ` +| Source | Min occurrences per rule | +|--------|--------------------------| +| Merged PRs (10+ in bucket) | 3 | +| Merged PRs (3–9 in bucket) | 2 | +| Merged PRs (1–2 in bucket) | 1 | +| Commits only | 5 | +| Mixed (PRs + commits) | 3 total, at least 1 PR | + +**What to extract from PR records:** +- **Title format**: What template do titles follow? - **Branch format**: What naming pattern do branches use? -- **Files changed**: Which files appear together most often? Are there always-together groups? +- **Files changed**: Which files appear together most often? - **Labels**: What labels are consistently applied? - **Co-changes**: When package A changes, does package B always change too? -- **From changes_requested**: What did reviewers ask for that wasn't there? These are proactive rules. +- **From changes_requested**: What reviewers asked for — these become proactive rules. +- **From close_reason + changes_requested**: Why PRs were rejected — these become "don'ts". + +**What to extract from commit records (no reviewer signal available):** +- **Message format**: Title line pattern, body structure, trailers (`Co-authored-by:`, `Fixes #`) +- **Files changed**: Which files appear together in fix commits +- **Co-changes**: Package co-upgrade patterns visible in file sets + +**Commit-only rules cannot populate the "Don'ts" section** — there is no rejection +signal from commits. If a bucket is commit-only, omit the Don'ts section entirely. -From closed PRs: -- **close_reason + changes_requested**: Why was the PR closed/rejected? Each reason becomes a "don't". -- Look for patterns across multiple closed PRs — single-occurrence rejections are excluded. 
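The source-aware thresholds in the table above can be sketched as a small decision helper (`rule_threshold` is a hypothetical name, not part of the workflow):

```shell
# Hypothetical sketch of the source-aware inclusion threshold:
# given the bucket's merged-PR total plus a rule's PR-sourced and
# commit-sourced occurrence counts, decide whether the rule is included.
rule_threshold() {  # args: BUCKET_MERGED PR_HITS COMMIT_HITS
  local MERGED=$1 PR_HITS=$2 COMMITS=$3
  if [ "$PR_HITS" -gt 0 ] && [ "$COMMITS" -gt 0 ]; then
    # Mixed evidence: 3 total, with at least 1 PR (guaranteed by this branch)
    [ $((PR_HITS + COMMITS)) -ge 3 ] && echo include || echo exclude
  elif [ "$PR_HITS" -gt 0 ]; then
    # PR-only: adaptive minimum based on bucket size
    local MIN=3
    [ "$MERGED" -lt 10 ] && MIN=2
    [ "$MERGED" -lt 3 ] && MIN=1
    [ "$PR_HITS" -ge "$MIN" ] && echo include || echo exclude
  else
    # Commit-only: higher bar, since there is no reviewer signal
    [ "$COMMITS" -ge 5 ] && echo include || echo exclude
  fi
}

rule_threshold 4 2 0   # 3–9 merged PRs in bucket → min 2 → include
rule_threshold 0 0 4   # commits only → min 5 → exclude
rule_threshold 4 1 2   # mixed: 3 total with ≥1 PR → include
```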
+**Evidence notation:** +- PR-sourced: `(8/9 merged PRs)` +- Commit-sourced: `(7 commits)` +- Mixed: `(3/4 merged PRs + 5 commits)` **Output of synthesis step:** Write an intermediate analysis file per bucket: @@ -415,9 +543,9 @@ artifacts/guidance//analysis/bugfix-patterns.md Each analysis file is a structured list: ``` -TITLE_FORMAT: "Security: Fix CVE-YYYY-XXXXX ()" (N/N merged) -BRANCH_FORMAT: "fix/cve-YYYY-XXXXX--attempt-N" (N/N merged) -FILES_GO_STDLIB: go.mod + Dockerfile + Dockerfile.konflux (N/N Go CVE PRs) +TITLE_FORMAT: "Security: Fix CVE-YYYY-XXXXX ()" (3/4 merged PRs + 6 commits) +BRANCH_FORMAT: "fix/cve-YYYY-XXXXX--attempt-N" (3/4 merged PRs) +FILES_GO_STDLIB: go.mod + Dockerfile + Dockerfile.konflux (8 commits) PROACTIVE_go_sum: Include go.sum — flagged missing in N closed PRs DONT_multiple_cves: One CVE per PR — N closed PRs rejected for combining ... @@ -444,7 +572,14 @@ When in manual PR mode, the header must note which PRs were used: ``` -In auto mode, omit the `manual-selection` field: +When commit fallback was used, add a `commit-fallback` count to the header: + +```markdown +# CVE Fix Guidance — + +``` + +In auto mode with no fallback needed, omit the `cve-commits` field: ```markdown # CVE Fix Guidance — diff --git a/workflows/guidance-generator/.claude/commands/guidance.update.md b/workflows/guidance-generator/.claude/commands/guidance.update.md index ebeedb01..36ebba64 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.update.md +++ b/workflows/guidance-generator/.claude/commands/guidance.update.md @@ -307,6 +307,93 @@ echo " CVE bucket: $NEW_CVE new PRs" echo " Bugfix bucket: $NEW_BUGFIX new PRs" ``` +### 4.5. Fetch Commit Fallback + +For any bucket with fewer than 3 new merged PRs since the last-analyzed date, +scan recent commits as supplementary signal. Skip if `--pr` was specified. 
+ +```bash +fetch_commit_fallback() { + local LABEL="$1" + local META_FILE="$2" + local OUT_FILE="$3" + local MSG_PATTERN="$4" + + echo "[]" > "$OUT_FILE" + + [ -n "$SPECIFIC_PR_NUMBERS" ] && return + + local MERGED_COUNT + MERGED_COUNT=$(jq '[.[] | select(.state == "MERGED")] | length' "$META_FILE") + + if [ "$MERGED_COUNT" -ge 3 ]; then + return + fi + + echo " $LABEL bucket: $MERGED_COUNT new merged PRs — scanning commits as fallback..." + + gh api "repos/$REPO/commits?per_page=100" \ + --jq '.[] | {sha: .sha, message: .commit.message}' \ + > "/tmp/guidance-gen/$REPO_SLUG/${LABEL}-commits-raw.jsonl" 2>/dev/null + + local SAMPLED=0 + local MAX_COMMITS=50 + + while IFS= read -r LINE && [ "$SAMPLED" -lt "$MAX_COMMITS" ]; do + local SHA MSG_RAW TITLE + + SHA=$(echo "$LINE" | jq -r '.sha') + MSG_RAW=$(echo "$LINE" | jq -r '.message' | sanitize_str) + TITLE=$(echo "$MSG_RAW" | head -1) + + echo "$TITLE" | grep -qiE "$MSG_PATTERN" || continue + + local FILES + FILES=$(gh api "repos/$REPO/commits/$SHA" \ + --jq '[.files[].filename]' 2>/dev/null || echo "[]") + + local BODY + BODY=$(echo "$MSG_RAW" | tail -n +2 | tr '\n' ' ' | cut -c1-300) + + local RECORD + RECORD=$(jq -n \ + --arg sha "$SHA" \ + --arg title "$TITLE" \ + --arg body "$BODY" \ + --argjson files "$FILES" \ + '{source: "commit", sha: $sha, state: "MERGED", + title: $title, branch: "", labels: [], + files: $files, changes_requested: [], close_reason: null, + commit_body: $body}' 2>/tmp/guidance-jq-err.txt) + + if [ $? -ne 0 ]; then + echo " WARNING: commit $SHA skipped — $(cat /tmp/guidance-jq-err.txt)" + continue + fi + + jq --argjson rec "$RECORD" '. 
+ [$rec]' "$OUT_FILE" > "${OUT_FILE}.tmp" \ + && mv "${OUT_FILE}.tmp" "$OUT_FILE" + SAMPLED=$((SAMPLED + 1)) + + done < "/tmp/guidance-gen/$REPO_SLUG/${LABEL}-commits-raw.jsonl" + + local COMMIT_COUNT + COMMIT_COUNT=$(jq 'length' "$OUT_FILE") + echo " Found $COMMIT_COUNT matching $LABEL commits" + cp "$OUT_FILE" "artifacts/guidance/$REPO_SLUG/raw/${LABEL}-commits.json" +} + +fetch_commit_fallback "cve" \ + "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-commits.json" \ + "CVE-[0-9]{4}-[0-9]+|^security:|^fix\(cve\):|^Fix CVE" + +fetch_commit_fallback "bugfix" \ + "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-commits.json" \ + "^fix[:(]|^bugfix|^bug fix|fixes[[:space:]]#[0-9]+|closes[[:space:]]#[0-9]+" +``` + ### 5. Fetch Per-PR Details (Pass 2) Same as `/guidance.generate` — files + reviews per PR, closing context for closed PRs. @@ -400,6 +487,21 @@ fetch_pr_details \ "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json" \ "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details.json" +# Merge commit fallback records into the detail files +jq -s '.[0] + .[1]' \ + "/tmp/guidance-gen/$REPO_SLUG/new-cve-details.json" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-commits.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/new-cve-details-merged.json" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/new-cve-details-merged.json" \ + "/tmp/guidance-gen/$REPO_SLUG/new-cve-details.json" + +jq -s '.[0] + .[1]' \ + "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details.json" \ + "/tmp/guidance-gen/$REPO_SLUG/bugfix-commits.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details-merged.json" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details-merged.json" \ + "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details.json" + cp "/tmp/guidance-gen/$REPO_SLUG/new-cve-details.json" \ "artifacts/guidance/$REPO_SLUG/raw/new-cve-prs.json" cp "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details.json" \ @@ -410,24 +512,49 @@ cp 
"/tmp/guidance-gen/$REPO_SLUG/new-bugfix-details.json" \ Read both the new PR detail files AND the existing guidance files. -As the agent, analyze the new PR data for patterns. For each pattern found: +As the agent, analyze the new detail records. Records with `source: "commit"` +came from the commit fallback — treat them differently from PR records: + +**Thresholds for new rules:** + +| Source | Min for a new rule | +|--------|--------------------| +| Merged PRs | 3 (or 2 if bucket had <10, or 1 if <3) | +| Commits only | 5 | +| Mixed (PRs + commits) | 3 total, at least 1 PR | + +**What commit records contribute:** +- Message/title format patterns +- File co-change patterns +- Commit body trailer conventions (`Co-authored-by:`, `Fixes #`, etc.) + +**What commit records do NOT contribute:** +- Don'ts section (no rejection signal) +- Reviewer expectation rules (no `changes_requested` data) + +**Evidence notation for new rules:** +- PR only: `(3/4 merged PRs)` +- Commit only: `(6 commits)` +- Mixed: `(2/3 merged PRs + 4 commits)` + +For each pattern found in the combined data: -**A. New rule** — a pattern seen in 3+ of the new PRs that does not already -exist in the guidance file. Add it to the appropriate section. +**A. New rule** — meets the threshold above and does not already exist in the +guidance file. Add it to the appropriate section. -**B. Reinforced rule** — a pattern that already exists in the guidance file. -Update its evidence count. For example: `(8/9 merged)` → `(14/15 merged)`. +**B. Reinforced rule** — already exists in the guidance file. +Update evidence count: `(8/9 merged)` → `(14/15 merged)` or add commit count: +`(8/9 merged)` → `(8/9 merged PRs + 5 commits)`. **C. Contradicting rule** — a pattern in new merged PRs that directly contradicts -a "don't" in the existing guidance file (e.g., a merged PR combined two CVEs despite -the guidance saying not to). Flag this with a comment in the guidance file: +a "don't". 
Flag it: ``` - [REVIEW NEEDED] Multiple CVEs per PR — previously flagged as a don't, but PR #N was merged combining CVEs. Policy may have changed. (N/N new merged) ``` -**D. New don't** — a pattern from newly closed PRs (3+ cases) not already in the -don'ts section. Add it. +**D. New don't** — pattern from newly closed PRs (3+ cases). Commits cannot +produce new don'ts. Add only PR-sourced rejections here. Write findings to: - `artifacts/guidance//analysis/cve-update-patterns.md` From 50ff8685ed554ddd14d02518dbbd5c3aa36b0653 Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Tue, 31 Mar 2026 09:32:26 -0400 Subject: [PATCH 11/23] fix: expand CVE pattern to catch dep/bump PRs and commits with security body scan MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The previous pattern only matched explicit CVE numbers, Security: titles, and fix(cve): prefixes — missing security patches that arrive through dependency updates (Dependabot, Renovate) or version bumps. Changes: - Expand CVE_PATTERN to include dep/bump title patterns: ^Bump , ^deps(, ^build(deps), ^chore.*upgrade, ^chore.*bump - Expand CVE_BRANCH_PATTERN to include: ^dependabot/, ^renovate/ - Add GHSA- (GitHub Security Advisory IDs) to explicit pass-through patterns Two-stage filtering to avoid flooding the bucket with all dep updates: - Dep-pattern matches without explicit CVE/GHSA in title undergo a body scan - gh pr view --json body fetched only for those PRs (targeted, not all) - If body lacks security keywords (CVE-, GHSA-, security, vulnerab) → dropped - Explicit CVE/GHSA/Security titles always pass through unconditionally Commit fallback: same logic but body check uses MSG_RAW (already fetched), so zero extra API calls needed for commits. Applies to both guidance.generate (Step 3) and guidance.update (Step 4), and to the fetch_commit_fallback calls in both Step 3.5 / 4.5. 
Co-Authored-By: Claude Sonnet 4.6 (1M context) --- .../.claude/commands/guidance.generate.md | 56 ++++++++++++++++--- .../.claude/commands/guidance.update.md | 48 ++++++++++++++-- 2 files changed, 92 insertions(+), 12 deletions(-) diff --git a/workflows/guidance-generator/.claude/commands/guidance.generate.md b/workflows/guidance-generator/.claude/commands/guidance.generate.md index d8ceb555..497c2c3e 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.generate.md +++ b/workflows/guidance-generator/.claude/commands/guidance.generate.md @@ -197,10 +197,17 @@ neither pattern, include it in both buckets and let Claude determine during synthesis which guidance file it informs. Never silently drop a user-specified PR. ```bash -CVE_PATTERN='CVE-[0-9]{4}-[0-9]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE' -CVE_BRANCH_PATTERN='^fix/cve-|^security/cve-' +# Explicit CVE/security signals — pass through unconditionally +CVE_EXPLICIT='CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE' +# Dependency/version bump patterns — may contain security patches; require body scan +CVE_DEP_PATTERN='^[Bb]ump |^deps\(|^build\(deps\)|^chore.*upgrade|^chore.*bump' +# Combined: either explicit or dep pattern matches the CVE bucket initially +CVE_PATTERN="${CVE_EXPLICIT}|${CVE_DEP_PATTERN}" +CVE_BRANCH_PATTERN='^fix/cve-|^security/cve-|^dependabot/|^renovate/' BUGFIX_PATTERN='^fix[:(]|^bugfix|^bug[[:space:]]fix|closes[[:space:]]#[0-9]+|fixes[[:space:]]#[0-9]+' BUGFIX_BRANCH_PATTERN='^(bugfix|fix|bug)/' +# Keyword that confirms a dep-pattern match is security-relevant +SECURITY_BODY='CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|security|vulnerab|security.advisory' if [ -n "$SPECIFIC_PR_NUMBERS" ]; then # Manual mode: classify each PR, fallback to both buckets if unmatched @@ -215,8 +222,8 @@ if [ -n "$SPECIFIC_PR_NUMBERS" ]; then (.title | test("'"$BUGFIX_PATTERN"'"; "i")) or (.headRefName | test("'"$BUGFIX_BRANCH_PATTERN"'"; "i")) ) and - (.title | 
test("CVE-[0-9]{4}-[0-9]+"; "i") | not) and - (.headRefName | test("^fix/cve-"; "i") | not) + (.title | test("'"$CVE_PATTERN"'"; "i") | not) and + (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i") | not) )]' "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \ > "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" @@ -253,14 +260,38 @@ else (.title | test("'"$BUGFIX_PATTERN"'"; "i")) or (.headRefName | test("'"$BUGFIX_BRANCH_PATTERN"'"; "i")) ) and - (.title | test("CVE-[0-9]{4}-[0-9]+"; "i") | not) and - (.headRefName | test("^fix/cve-"; "i") | not) + (.title | test("'"$CVE_PATTERN"'"; "i") | not) and + (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i") | not) ) ] | .[:$limit]' \ "/tmp/guidance-gen/$REPO_SLUG/all-prs.json" \ > "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" fi +# Body scan: for dep-pattern matches without an explicit CVE/GHSA in the title, +# fetch the PR body and verify it contains a security indicator. +# Explicit CVE/GHSA/Security titles pass through unconditionally. +# Only runs in auto mode — manual --pr mode trusts the user's selection. +if [ -z "$SPECIFIC_PR_NUMBERS" ]; then + DEP_ONLY_NUMS=$(jq -r '[.[] | select( + (.title | test("'"$CVE_DEP_PATTERN"'"; "i")) and + (.title | test("'"$CVE_EXPLICIT"'"; "i") | not) + ) | .number] | .[]' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") + + for PR_NUM in $DEP_ONLY_NUMS; do + BODY=$(gh pr view "$PR_NUM" --repo "$REPO" --json body \ + --jq '.body // ""' 2>/dev/null | sanitize_str) + if ! 
echo "$BODY" | grep -qiE "$SECURITY_BODY"; then + echo " Dropped PR #$PR_NUM from CVE bucket — dep update with no security signal in body" + jq --argjson n "$PR_NUM" '[.[] | select(.number != $n)]' \ + "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json.tmp" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json.tmp" \ + "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" + fi + done +fi + CVE_TOTAL=$(jq 'length' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") CVE_MERGED=$(jq '[.[] | select(.state == "MERGED")] | length' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") CVE_CLOSED=$(jq '[.[] | select(.state == "CLOSED")] | length' "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json") @@ -322,6 +353,17 @@ fetch_commit_fallback() { # Filter by message pattern for this bucket echo "$TITLE" | grep -qiE "$MSG_PATTERN" || continue + # For dep/bump commits without an explicit CVE/GHSA in the title, + # verify the commit body contains a security indicator. + # MSG_RAW already contains the full message — no extra API call needed. + if echo "$TITLE" | grep -qiE "^[Bb]ump |^deps\(|^build\(deps\)|^chore.*upgrade|^chore.*bump"; then + if ! echo "$TITLE" | grep -qiE "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):"; then + if ! 
echo "$MSG_RAW" | grep -qiE "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|security|vulnerab"; then + continue # dep update with no security signal — skip + fi + fi + fi + # Fetch file list for this commit (targeted — only for matched commits) local FILES FILES=$(gh api "repos/$REPO/commits/$SHA" \ @@ -363,7 +405,7 @@ fetch_commit_fallback() { fetch_commit_fallback "cve" \ "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" \ "/tmp/guidance-gen/$REPO_SLUG/cve-commits.json" \ - "CVE-[0-9]{4}-[0-9]+|^security:|^fix\(cve\):|^Fix CVE" + "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE|^[Bb]ump |^deps\(|^build\(deps\)|^chore.*upgrade|^chore.*bump" fetch_commit_fallback "bugfix" \ "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" \ diff --git a/workflows/guidance-generator/.claude/commands/guidance.update.md b/workflows/guidance-generator/.claude/commands/guidance.update.md index 36ebba64..5d8acfee 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.update.md +++ b/workflows/guidance-generator/.claude/commands/guidance.update.md @@ -258,10 +258,15 @@ In **auto mode**: CVE PRs take priority. In **manual mode (`--pr`)**: if a specified PR matches neither pattern, include it in both buckets for Claude to classify. 
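The two-stage filter — explicit CVE/GHSA titles pass unconditionally, while dep/bump titles need a security keyword in the body — can be sketched as follows; `classify` is a hypothetical helper for illustration:

```shell
# Stage 1: explicit security signals in the title always pass.
CVE_EXPLICIT='CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:'
# Stage 2: Dependabot/Renovate-style titles require a body scan.
CVE_DEP='^[Bb]ump |^deps\(|^build\(deps\)'

classify() {  # args: TITLE BODY → prints keep or drop
  local TITLE="$1" BODY="$2"
  if echo "$TITLE" | grep -qiE "$CVE_EXPLICIT"; then
    echo keep; return
  fi
  if echo "$TITLE" | grep -qiE "$CVE_DEP"; then
    # Dep update: keep only if the body carries a security indicator
    echo "$BODY" | grep -qiE 'CVE-|GHSA-|security|vulnerab' \
      && echo keep || echo drop
    return
  fi
  echo drop
}

classify "Security: Fix CVE-2024-1234 in libfoo" ""               # keep
classify "Bump lodash from 4.17.20 to 4.17.21" "fixes GHSA-xyz"   # keep
classify "Bump eslint from 8 to 9" "routine update"               # drop
```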
```bash -CVE_PATTERN='CVE-[0-9]{4}-[0-9]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE' -CVE_BRANCH_PATTERN='^fix/cve-|^security/cve-' +# Explicit CVE/security signals — pass through unconditionally +CVE_EXPLICIT='CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE' +# Dependency/version bump patterns — require body scan to confirm security relevance +CVE_DEP_PATTERN='^[Bb]ump |^deps\(|^build\(deps\)|^chore.*upgrade|^chore.*bump' +CVE_PATTERN="${CVE_EXPLICIT}|${CVE_DEP_PATTERN}" +CVE_BRANCH_PATTERN='^fix/cve-|^security/cve-|^dependabot/|^renovate/' BUGFIX_PATTERN='^fix[:(]|^bugfix|^bug[[:space:]]fix|closes[[:space:]]#[0-9]+|fixes[[:space:]]#[0-9]+' BUGFIX_BRANCH_PATTERN='^(bugfix|fix|bug)/' +SECURITY_BODY='CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|security|vulnerab|security.advisory' jq '[.[] | select( (.title | test("'"$CVE_PATTERN"'"; "i")) or @@ -274,8 +279,8 @@ jq '[.[] | select( (.title | test("'"$BUGFIX_PATTERN"'"; "i")) or (.headRefName | test("'"$BUGFIX_BRANCH_PATTERN"'"; "i")) ) and - (.title | test("CVE-[0-9]{4}-[0-9]+"; "i") | not) and - (.headRefName | test("^fix/cve-"; "i") | not) + (.title | test("'"$CVE_PATTERN"'"; "i") | not) and + (.headRefName | test("'"$CVE_BRANCH_PATTERN"'"; "i") | not) )]' "/tmp/guidance-gen/$REPO_SLUG/new-all-prs.json" \ > "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json" @@ -297,6 +302,29 @@ if [ -n "$SPECIFIC_PR_NUMBERS" ]; then fi fi +# Body scan: for dep-pattern matches without an explicit CVE/GHSA title, +# verify the PR body contains a security indicator before keeping it. +# Only runs in auto mode — manual --pr mode trusts the user's selection. 
+if [ -z "$SPECIFIC_PR_NUMBERS" ]; then + DEP_ONLY_NUMS=$(jq -r '[.[] | select( + (.title | test("'"$CVE_DEP_PATTERN"'"; "i")) and + (.title | test("'"$CVE_EXPLICIT"'"; "i") | not) + ) | .number] | .[]' "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json") + + for PR_NUM in $DEP_ONLY_NUMS; do + BODY=$(gh pr view "$PR_NUM" --repo "$REPO" --json body \ + --jq '.body // ""' 2>/dev/null | sanitize_str) + if ! echo "$BODY" | grep -qiE "$SECURITY_BODY"; then + echo " Dropped PR #$PR_NUM from CVE bucket — dep update with no security signal in body" + jq --argjson n "$PR_NUM" '[.[] | select(.number != $n)]' \ + "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" \ + > "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json.tmp" \ + && mv "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json.tmp" \ + "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" + fi + done +fi + # Zero out skipped buckets so subsequent steps treat them as empty $BUGFIX_ONLY && echo "[]" > "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" $CVE_ONLY && echo "[]" > "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json" @@ -348,6 +376,16 @@ fetch_commit_fallback() { echo "$TITLE" | grep -qiE "$MSG_PATTERN" || continue + # For dep/bump commits without explicit CVE/GHSA in title, verify body has security signal. + # MSG_RAW already contains the full message — no extra API call needed. + if echo "$TITLE" | grep -qiE "^[Bb]ump |^deps\(|^build\(deps\)|^chore.*upgrade|^chore.*bump"; then + if ! echo "$TITLE" | grep -qiE "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):"; then + if ! 
echo "$MSG_RAW" | grep -qiE "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|security|vulnerab"; then + continue # dep update with no security signal — skip + fi + fi + fi + local FILES FILES=$(gh api "repos/$REPO/commits/$SHA" \ --jq '[.files[].filename]' 2>/dev/null || echo "[]") @@ -386,7 +424,7 @@ fetch_commit_fallback() { fetch_commit_fallback "cve" \ "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" \ "/tmp/guidance-gen/$REPO_SLUG/cve-commits.json" \ - "CVE-[0-9]{4}-[0-9]+|^security:|^fix\(cve\):|^Fix CVE" + "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE|^[Bb]ump |^deps\(|^build\(deps\)|^chore.*upgrade|^chore.*bump" fetch_commit_fallback "bugfix" \ "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json" \ From c81f45677ffec729b60492e951395e7abce7c5bb Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Tue, 31 Mar 2026 09:46:51 -0400 Subject: [PATCH 12/23] fix: remove ^chore.*upgrade and ^chore.*bump from CVE dep patterns MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit These patterns were too broad — they matched non-security chore commits like 'chore: upgrade golangci-lint' (a linter tool upgrade) regardless of whether the body scan had anything to flag. The remaining dep patterns are specifically Dependabot/Renovate formats: ^Bump - "Bump X from Y to Z" (Dependabot standard) ^deps\( - "deps(go): bump ..." (conventional commits) ^build\(deps\) - "build(deps): bump ..." (another Dependabot format) These are narrow enough to be intentional dependency updates. The body scan (checking for CVE-, GHSA-, security, vulnerab) then gates whether they're security-relevant before including them in the CVE bucket. 
Changed in 3 places in each command file: - CVE_DEP_PATTERN definition - Inline body check condition in fetch_commit_fallback - fetch_commit_fallback call MSG_PATTERN argument Co-Authored-By: Claude Sonnet 4.6 (1M context) --- .../.claude/commands/guidance.generate.md | 6 +++--- .../guidance-generator/.claude/commands/guidance.update.md | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/workflows/guidance-generator/.claude/commands/guidance.generate.md b/workflows/guidance-generator/.claude/commands/guidance.generate.md index 497c2c3e..0ddb861a 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.generate.md +++ b/workflows/guidance-generator/.claude/commands/guidance.generate.md @@ -200,7 +200,7 @@ synthesis which guidance file it informs. Never silently drop a user-specified P # Explicit CVE/security signals — pass through unconditionally CVE_EXPLICIT='CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE' # Dependency/version bump patterns — may contain security patches; require body scan -CVE_DEP_PATTERN='^[Bb]ump |^deps\(|^build\(deps\)|^chore.*upgrade|^chore.*bump' +CVE_DEP_PATTERN='^[Bb]ump |^deps\(|^build\(deps\)' # Combined: either explicit or dep pattern matches the CVE bucket initially CVE_PATTERN="${CVE_EXPLICIT}|${CVE_DEP_PATTERN}" CVE_BRANCH_PATTERN='^fix/cve-|^security/cve-|^dependabot/|^renovate/' @@ -356,7 +356,7 @@ fetch_commit_fallback() { # For dep/bump commits without an explicit CVE/GHSA in the title, # verify the commit body contains a security indicator. # MSG_RAW already contains the full message — no extra API call needed. - if echo "$TITLE" | grep -qiE "^[Bb]ump |^deps\(|^build\(deps\)|^chore.*upgrade|^chore.*bump"; then + if echo "$TITLE" | grep -qiE "^[Bb]ump |^deps\(|^build\(deps\)"; then if ! echo "$TITLE" | grep -qiE "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):"; then if ! 
echo "$MSG_RAW" | grep -qiE "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|security|vulnerab"; then continue # dep update with no security signal — skip @@ -405,7 +405,7 @@ fetch_commit_fallback() { fetch_commit_fallback "cve" \ "/tmp/guidance-gen/$REPO_SLUG/cve-meta.json" \ "/tmp/guidance-gen/$REPO_SLUG/cve-commits.json" \ - "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE|^[Bb]ump |^deps\(|^build\(deps\)|^chore.*upgrade|^chore.*bump" + "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE|^[Bb]ump |^deps\(|^build\(deps\)" fetch_commit_fallback "bugfix" \ "/tmp/guidance-gen/$REPO_SLUG/bugfix-meta.json" \ diff --git a/workflows/guidance-generator/.claude/commands/guidance.update.md b/workflows/guidance-generator/.claude/commands/guidance.update.md index 5d8acfee..daafca1a 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.update.md +++ b/workflows/guidance-generator/.claude/commands/guidance.update.md @@ -261,7 +261,7 @@ specified PR matches neither pattern, include it in both buckets for Claude to c # Explicit CVE/security signals — pass through unconditionally CVE_EXPLICIT='CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE' # Dependency/version bump patterns — require body scan to confirm security relevance -CVE_DEP_PATTERN='^[Bb]ump |^deps\(|^build\(deps\)|^chore.*upgrade|^chore.*bump' +CVE_DEP_PATTERN='^[Bb]ump |^deps\(|^build\(deps\)' CVE_PATTERN="${CVE_EXPLICIT}|${CVE_DEP_PATTERN}" CVE_BRANCH_PATTERN='^fix/cve-|^security/cve-|^dependabot/|^renovate/' BUGFIX_PATTERN='^fix[:(]|^bugfix|^bug[[:space:]]fix|closes[[:space:]]#[0-9]+|fixes[[:space:]]#[0-9]+' @@ -378,7 +378,7 @@ fetch_commit_fallback() { # For dep/bump commits without explicit CVE/GHSA in title, verify body has security signal. # MSG_RAW already contains the full message — no extra API call needed. 
- if echo "$TITLE" | grep -qiE "^[Bb]ump |^deps\(|^build\(deps\)|^chore.*upgrade|^chore.*bump"; then + if echo "$TITLE" | grep -qiE "^[Bb]ump |^deps\(|^build\(deps\)"; then if ! echo "$TITLE" | grep -qiE "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):"; then if ! echo "$MSG_RAW" | grep -qiE "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|security|vulnerab"; then continue # dep update with no security signal — skip @@ -424,7 +424,7 @@ fetch_commit_fallback() { fetch_commit_fallback "cve" \ "/tmp/guidance-gen/$REPO_SLUG/new-cve-meta.json" \ "/tmp/guidance-gen/$REPO_SLUG/cve-commits.json" \ - "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE|^[Bb]ump |^deps\(|^build\(deps\)|^chore.*upgrade|^chore.*bump" + "CVE-[0-9]{4}-[0-9]+|GHSA-[a-zA-Z0-9-]+|^[Ss]ecurity:|^fix\(cve\):|^Fix CVE|^[Bb]ump |^deps\(|^build\(deps\)" fetch_commit_fallback "bugfix" \ "/tmp/guidance-gen/$REPO_SLUG/new-bugfix-meta.json" \ From 7c50c6bcf76cda8a5f198ba04e7c20f5a3102b6a Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Wed, 1 Apr 2026 14:42:07 -0400 Subject: [PATCH 13/23] feat: fork-aware push and PR creation for repos without write access Previously the workflow used 'git push origin' which fails if the user doesn't have write access to the target repo, leaving a hard failure with no recovery path. New behavior (same in both guidance.generate and guidance.update): 1. Resolve GH_USER from gh api user (or app installation endpoint for bots) 2. Attempt direct push to origin 3. If direct push fails and gh is authenticated: a. Check for existing fork: gh repo list GH_USER --fork --json ... filtering by parent.owner.login + parent.name to match the upstream b. If no fork: gh repo fork --clone=false; sleep 3 for provisioning c. If fork creation also fails: print URL, tell user to create manually d. Add fork remote; push to fork 4. 
Create PR: - Direct push: standard gh pr create (no --head needed) - Fork push: gh pr create --repo UPSTREAM --head FORK_OWNER:BRANCH 5. If not authenticated at all: print exact manual commands and continue to next repo (does not abort the whole run) Fork detection uses the correct GitHub API field path: .parent.owner.login and .parent.name (not .parent.nameWithOwner which does not exist in the API response) Co-Authored-By: Claude Sonnet 4.6 (1M context) --- .../.claude/commands/guidance.generate.md | 90 +++++++++++++++---- .../.claude/commands/guidance.update.md | 80 ++++++++++++++--- 2 files changed, 142 insertions(+), 28 deletions(-) diff --git a/workflows/guidance-generator/.claude/commands/guidance.generate.md b/workflows/guidance-generator/.claude/commands/guidance.generate.md index 0ddb861a..0d978be9 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.generate.md +++ b/workflows/guidance-generator/.claude/commands/guidance.generate.md @@ -739,14 +739,7 @@ $([ -f "$BUGFIX_OUTPUT" ] && echo " - .bugfix/guidance.md (Bugfix conventions)" Co-Authored-By: PR Guidance Generator " -# Push -git push origin "$BRANCH_NAME" -``` - -**Create PR using gh:** - -```bash -# Count stats for PR body +# Build PR body CVE_MERGED_COUNT=$(jq '[.[] | select(.state == "MERGED")] | length' \ "$OLDPWD/artifacts/guidance/$REPO_SLUG/raw/cve-prs.json" 2>/dev/null || echo 0) CVE_CLOSED_COUNT=$(jq '[.[] | select(.state == "CLOSED")] | length' \ @@ -793,19 +786,82 @@ Generated by PR Guidance Generator workflow EOF ) -PR_URL=$(gh pr create \ - --repo "$REPO" \ - --base "$(gh repo view "$REPO" --json defaultBranchRef --jq '.defaultBranchRef.name')" \ - --title "chore: add automated PR guidance files" \ - --body "$PR_BODY") +# Fork-aware push and PR creation +UPSTREAM_OWNER="${REPO%%/*}" +REPO_NAME="${REPO##*/}" +DEFAULT_BRANCH=$(gh repo view "$REPO" --json defaultBranchRef --jq '.defaultBranchRef.name') +GH_USER=$(gh api user --jq .login 2>/dev/null || \ + gh api 
/installation/repositories --jq '.repositories[0].owner.login' 2>/dev/null || \ + echo "") + +git config user.name "${GH_USER:-guidance-generator}" +git config user.email "${GH_USER:-guidance}@users.noreply.github.com" + +FORK_PUSH=false +FORK_OWNER="" + +# Attempt 1: direct push to upstream +if git push origin "$BRANCH_NAME" 2>/tmp/guidance-push-err.txt; then + echo " Pushed to upstream directly" +elif [ -n "$GH_USER" ]; then + # Attempt 2: find or create a fork + echo " Direct push failed — checking for fork of $REPO..." + FORK=$(gh repo list "$GH_USER" --fork --json nameWithOwner,parent \ + --jq ".[] | select(.parent.owner.login == \"$UPSTREAM_OWNER\" and .parent.name == \"$REPO_NAME\") | .nameWithOwner" \ + 2>/dev/null) + + if [ -z "$FORK" ]; then + echo " No fork found — creating fork..." + if gh repo fork "$REPO" --clone=false 2>/dev/null; then + sleep 3 # give GitHub time to provision the fork + FORK="$GH_USER/$REPO_NAME" + echo " Fork created: $FORK" + else + echo " ERROR: Could not create fork automatically." + echo " Create one manually at: https://github.com/$REPO/fork" + echo " Then re-run: /guidance.generate $REPO" + FAILED_REPOS+=("$REPO -> fork creation failed; create at https://github.com/$REPO/fork and re-run") + cd /; rm -rf "/tmp/guidance-gen/$REPO_SLUG"; continue + fi + else + echo " Found existing fork: $FORK" + fi + + FORK_OWNER="${FORK%%/*}" + git remote add fork "https://github.com/$FORK.git" 2>/dev/null || \ + git remote set-url fork "https://github.com/$FORK.git" + git push fork "$BRANCH_NAME" + FORK_PUSH=true +else + # No gh auth and direct push failed — provide manual fallback + echo " ERROR: Push failed and gh is not authenticated." + echo " Manual steps to submit this PR:" + echo " 1. Fork https://github.com/$REPO" + echo " 2. git -C /tmp/guidance-gen/$REPO_SLUG/repo remote add fork https://github.com/YOUR_USER/$REPO_NAME.git" + echo " 3. git -C /tmp/guidance-gen/$REPO_SLUG/repo push fork $BRANCH_NAME" + echo " 4. 
Open PR: https://github.com/$REPO/compare/$BRANCH_NAME" + FAILED_REPOS+=("$REPO -> push failed, no gh auth; see manual steps above") + cd /; rm -rf "/tmp/guidance-gen/$REPO_SLUG"; continue +fi +# Create PR +if $FORK_PUSH; then + PR_URL=$(gh pr create \ + --repo "$REPO" \ + --base "$DEFAULT_BRANCH" \ + --head "$FORK_OWNER:$BRANCH_NAME" \ + --title "chore: add automated PR guidance files" \ + --body "$PR_BODY") +else + PR_URL=$(gh pr create \ + --repo "$REPO" \ + --base "$DEFAULT_BRANCH" \ + --title "chore: add automated PR guidance files" \ + --body "$PR_BODY") +fi echo "PR created: $PR_URL" ``` -**If PR creation fails** (no push access, auth issue), save the branch state and -report the error clearly. Tell the user to create the PR manually and provide the -branch name. - ### 8. Cleanup (per repo) ```bash diff --git a/workflows/guidance-generator/.claude/commands/guidance.update.md b/workflows/guidance-generator/.claude/commands/guidance.update.md index daafca1a..5e0e10d0 100644 --- a/workflows/guidance-generator/.claude/commands/guidance.update.md +++ b/workflows/guidance-generator/.claude/commands/guidance.update.md @@ -668,12 +668,7 @@ Changes: Co-Authored-By: PR Guidance Generator " -git push origin "$BRANCH_NAME" -``` - -Construct PR body summarizing what changed: - -```bash +# Build PR body PR_BODY=$(cat </dev/null || \ + gh api /installation/repositories --jq '.repositories[0].owner.login' 2>/dev/null || \ + echo "") + +FORK_PUSH=false +FORK_OWNER="" + +# Attempt 1: direct push to upstream +if git push origin "$BRANCH_NAME" 2>/tmp/guidance-push-err.txt; then + echo " Pushed to upstream directly" +elif [ -n "$GH_USER" ]; then + # Attempt 2: find or create a fork + echo " Direct push failed — checking for fork of $REPO..." 
+ FORK=$(gh repo list "$GH_USER" --fork --json nameWithOwner,parent \ + --jq ".[] | select(.parent.owner.login == \"$UPSTREAM_OWNER\" and .parent.name == \"$REPO_NAME\") | .nameWithOwner" \ + 2>/dev/null) + + if [ -z "$FORK" ]; then + echo " No fork found — creating fork..." + if gh repo fork "$REPO" --clone=false 2>/dev/null; then + sleep 3 + FORK="$GH_USER/$REPO_NAME" + echo " Fork created: $FORK" + else + echo " ERROR: Could not create fork automatically." + echo " Create one manually at: https://github.com/$REPO/fork" + echo " Then re-run: /guidance.update $REPO" + FAILED_REPOS+=("$REPO -> fork creation failed; create at https://github.com/$REPO/fork and re-run") + cd /; rm -rf "/tmp/guidance-gen/$REPO_SLUG"; continue + fi + else + echo " Found existing fork: $FORK" + fi + FORK_OWNER="${FORK%%/*}" + git remote add fork "https://github.com/$FORK.git" 2>/dev/null || \ + git remote set-url fork "https://github.com/$FORK.git" + git push fork "$BRANCH_NAME" + FORK_PUSH=true +else + echo " ERROR: Push failed and gh is not authenticated." + echo " Manual steps to submit this PR:" + echo " 1. Fork https://github.com/$REPO" + echo " 2. git -C $CLONE_DIR remote add fork https://github.com/YOUR_USER/$REPO_NAME.git" + echo " 3. git -C $CLONE_DIR push fork $BRANCH_NAME" + echo " 4. 
Open PR: https://github.com/$REPO/compare/$BRANCH_NAME" + FAILED_REPOS+=("$REPO -> push failed, no gh auth; see manual steps above") + cd /; rm -rf "/tmp/guidance-gen/$REPO_SLUG"; continue +fi + +# Create PR +if $FORK_PUSH; then + PR_URL=$(gh pr create \ + --repo "$REPO" \ + --base "$DEFAULT_BRANCH" \ + --head "$FORK_OWNER:$BRANCH_NAME" \ + --title "chore: update PR guidance files ($TODAY)" \ + --body "$PR_BODY") +else + PR_URL=$(gh pr create \ + --repo "$REPO" \ + --base "$DEFAULT_BRANCH" \ + --title "chore: update PR guidance files ($TODAY)" \ + --body "$PR_BODY") +fi echo "PR created: $PR_URL" ``` From 537fa37bb868e854b8693304bd659f29e31e4097 Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi Date: Fri, 3 Apr 2026 14:11:13 -0400 Subject: [PATCH 14/23] feat: add rhoai-manager workflow Copies the rhoai-manager workflow from angaduom/workflows (add-rhoai-workflow branch). Includes commands: - /oc-login - Login to OpenShift cluster - /rhoai-install - Fresh RHOAI installation - /rhoai-update - Update RHOAI to latest nightly (supports EA and stable channels) - /rhoai-uninstall - Uninstall RHOAI - /rhoai-version - Check installed RHOAI version - /odh-pr-tracker - Track ODH PRs in RHOAI builds Co-Authored-By: Claude Sonnet 4.6 (1M context) --- workflows/rhoai-manager/.ambient/ambient.json | 12 + .../.claude/commands/oc-login.md | 355 ++++++++++ .../.claude/commands/odh-pr-tracker.md | 118 ++++ .../.claude/commands/rhoai-install.md | 555 +++++++++++++++ .../.claude/commands/rhoai-uninstall.md | 436 ++++++++++++ .../.claude/commands/rhoai-update.md | 642 ++++++++++++++++++ .../.claude/commands/rhoai-version.md | 169 +++++ workflows/rhoai-manager/README.md | 271 ++++++++ 8 files changed, 2558 insertions(+) create mode 100644 workflows/rhoai-manager/.ambient/ambient.json create mode 100644 workflows/rhoai-manager/.claude/commands/oc-login.md create mode 100644 workflows/rhoai-manager/.claude/commands/odh-pr-tracker.md create mode 100644 
workflows/rhoai-manager/.claude/commands/rhoai-install.md create mode 100644 workflows/rhoai-manager/.claude/commands/rhoai-uninstall.md create mode 100644 workflows/rhoai-manager/.claude/commands/rhoai-update.md create mode 100644 workflows/rhoai-manager/.claude/commands/rhoai-version.md create mode 100644 workflows/rhoai-manager/README.md diff --git a/workflows/rhoai-manager/.ambient/ambient.json b/workflows/rhoai-manager/.ambient/ambient.json new file mode 100644 index 00000000..6ae44abc --- /dev/null +++ b/workflows/rhoai-manager/.ambient/ambient.json @@ -0,0 +1,12 @@ +{ + "name": "RHOAI Manager", + "description": "Comprehensive workflow for managing Red Hat OpenShift AI lifecycle: installation, updates, version detection, and uninstallation.", + "systemPrompt": "You are an AI assistant specialized in managing the complete lifecycle of RHOAI (Red Hat OpenShift AI) installations.\n\n# Your Role\n\nYou help automate the process of:\n1. Logging into OpenShift clusters\n2. Installing RHOAI from scratch\n3. Detecting RHOAI version and build information\n4. Updating RHOAI to latest nightly builds\n5. 
Uninstalling RHOAI completely\n\n# Available Commands\n\nYou have access to the following commands:\n\n## /oc-login\nLogin to OpenShift cluster using credentials from Ambient session:\n- Checks for required credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD)\n- Automatically installs oc CLI if not available\n- Executes login to the cluster\n- Verifies connection and displays cluster info\n\n## /rhoai-install\nInstall RHOAI from scratch on a cluster:\n- Sets up OLM catalog source for nightly builds\n- Creates operator namespace and subscription\n- Waits for operator installation to complete\n- Creates DataScienceCluster with component configuration\n- Verifies all components are healthy\n- Provides installation summary\n\n## /rhoai-version\nDetect RHOAI version and build information:\n- Checks RHOAI operator subscription and CSV\n- Reports DataScienceCluster status and components\n- Lists all component images with SHA digests\n- Provides comprehensive version summary\n\n## /rhoai-update\nUpdates RHOAI to the latest nightly build:\n- Verifies current version\n- Updates the OLM catalog source\n- Monitors the operator upgrade\n- Handles forced reinstall when component images update without CSV version change\n- Verifies component reconciliation\n- Provides status updates\n\n## /rhoai-uninstall\nCompletely uninstall RHOAI from an OpenShift cluster:\n- Supports graceful or forceful uninstall\n- Options to keep CRDs and/or user resources\n- Removes operator, custom resources, webhooks, namespaces\n- Verifies complete cleanup\n\n# Workflow Phases\n\n## Phase 0: Connect to Cluster\n- Login to OpenShift cluster using /oc-login\n- Verify cluster access and permissions\n\n## Phase 1: Install or Update RHOAI\n- For fresh installations: Use /rhoai-install\n- For existing installations: Use /rhoai-update\n- Verify successful installation/update\n- Document version changes\n\n## Phase 2: Version Management\n- Use /rhoai-version to check current installation\n- Track component 
versions and build information\n- Monitor for updates\n\n## Phase 3: Cleanup\n- Use /rhoai-uninstall when decommissioning\n- Options to preserve CRDs or user resources\n\n# Output Locations\n\n- Installation Reports: `artifacts/rhoai-manager/reports/*.md`\n- Update Reports: `artifacts/rhoai-manager/reports/*.md`\n- Version Info: `artifacts/rhoai-manager/version/*.md`\n- Execution Logs: `artifacts/rhoai-manager/logs/*.log`\n\n# Prerequisites\n\n- OpenShift cluster (version 4.12+)\n- `oc` CLI installed (auto-installed if missing)\n- Cluster credentials configured in Ambient session (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD)\n- Cluster admin permissions\n", + "startupPrompt": "Welcome to the RHOAI Manager Workflow!\n\nI'll help you manage the complete lifecycle of Red Hat OpenShift AI installations.\n\n## What I Can Do\n\n- **Login to Cluster**: Use `/oc-login` to connect to your OpenShift cluster\n- **Install RHOAI**: Use `/rhoai-install` to deploy RHOAI from scratch\n- **Check Version**: Use `/rhoai-version` to detect current version and build info\n- **Update RHOAI**: Use `/rhoai-update` to upgrade to the latest nightly build\n- **Uninstall RHOAI**: Use `/rhoai-uninstall` to completely remove RHOAI from the cluster\n\n## Getting Started\n\nJust tell me what you'd like to do:\n- \"Login to my cluster\"\n- \"Install RHOAI on the cluster\"\n- \"Check current RHOAI version\"\n- \"Update RHOAI to latest nightly\"\n- \"Uninstall RHOAI\"\n\n**Note**: Make sure your cluster credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD) are configured in your Ambient session.\n\nLet's get started!", + "results": { + "Installation Reports": "artifacts/rhoai-manager/reports/*.md", + "Update Reports": "artifacts/rhoai-manager/reports/*.md", + "Version Info": "artifacts/rhoai-manager/version/*.md", + "Execution Logs": "artifacts/rhoai-manager/logs/*.log" + } +} diff --git a/workflows/rhoai-manager/.claude/commands/oc-login.md b/workflows/rhoai-manager/.claude/commands/oc-login.md new file 
mode 100644 index 00000000..f7d1d321 --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/oc-login.md @@ -0,0 +1,355 @@ +# /oc-login - Login to OpenShift Cluster + +Login to an OpenShift cluster using credentials configured in the Ambient session. + +## Command Usage + +- `/oc-login` - Login to OpenShift cluster using session credentials + +## When to Use This Command + +This command is triggered when the user runs: +- `/oc-login` - Login to the configured OpenShift cluster +- Or when asked to "login to cluster", "connect to OpenShift", etc. + +## Prerequisites + +The following credentials should be configured in the Ambient session: +1. `OCP_SERVER` - OpenShift cluster API server URL (e.g., `https://api.cluster.example.com:6443`) +2. `OCP_USERNAME` - OpenShift username +3. `OCP_PASSWORD` - OpenShift password + +These are typically configured as environment variables in the Ambient session. + +## How It Works + +The command uses the `oc` CLI tool to authenticate to the OpenShift cluster. + +### Step 1: Check for Required Credentials + +First, verify that all required credentials are available: + +```bash +# Check if credentials are set +if [ -z "$OCP_SERVER" ]; then + echo "❌ OCP_SERVER not set" +fi + +if [ -z "$OCP_USERNAME" ]; then + echo "❌ OCP_USERNAME not set" +fi + +if [ -z "$OCP_PASSWORD" ]; then + echo "❌ OCP_PASSWORD not set" +fi +``` + +**If credentials are missing:** +- Inform the user which credentials are missing +- Ask them to configure the credentials in their Ambient session +- Do not proceed with login + +### Step 2: Install oc CLI if Not Available + +Automatically install the `oc` command if not available: + +```bash +# Check if oc is installed +if ! command -v oc &> /dev/null; then + echo "📦 oc CLI not found. Installing automatically..." 
+ + # Download oc CLI for Linux + curl -LO https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz + + # Extract the binary + tar -xzf openshift-client-linux.tar.gz + + # Move to /usr/local/bin for global access + sudo mv oc /usr/local/bin/ + sudo mv kubectl /usr/local/bin/ + + # Make executable + sudo chmod +x /usr/local/bin/oc + sudo chmod +x /usr/local/bin/kubectl + + # Clean up + rm -f openshift-client-linux.tar.gz README.md + + echo "✅ oc CLI installed successfully" +fi + +# Show oc version +oc version --client +``` + +**What happens:** +- Automatically detects if `oc` is not installed +- Downloads the latest stable OpenShift CLI for Linux +- Installs it to `/usr/local/bin` for system-wide access +- Continues to login without user intervention + +### Step 3: Login to OpenShift Cluster + +Execute the login command: + +```bash +# Login to OpenShift cluster +oc login \ + --username="$OCP_USERNAME" \ + --password="$OCP_PASSWORD" \ + --server="$OCP_SERVER" \ + --insecure-skip-tls-verify=true +``` + +**Important flags:** +- `--username` - OpenShift username from session +- `--password` - OpenShift password from session +- `--server` - Cluster API server URL +- `--insecure-skip-tls-verify=true` - Skip TLS certificate validation (useful for development clusters) + +**Note on TLS verification:** +- For production clusters with valid certificates, you can remove `--insecure-skip-tls-verify=true` +- For development/test clusters with self-signed certificates, this flag is necessary + +### Step 4: Verify Login Success + +After login, verify the connection: + +```bash +# Check who is logged in +oc whoami + +# Get cluster info +oc cluster-info + +# Show current project +oc project +``` + +Expected output: +- `oc whoami` returns the username +- `oc cluster-info` shows cluster details +- `oc project` shows the current/default project + +### Step 5: Display Cluster Information + +Provide useful information about the cluster: + +```bash +# 
Show OpenShift version +oc version + +# List available projects (limit to first 10) +oc get projects --no-headers | head -10 + +# Show current context +oc config current-context +``` + +This helps the user understand what cluster they're connected to. + +## Handling Different Scenarios + +### Scenario A: Successful Login + +1. Execute login command +2. Verify with `oc whoami` +3. Display cluster information +4. Report: "✅ Successfully logged into OpenShift cluster as `username`" + +### Scenario B: Invalid Credentials + +If login fails due to wrong username/password: + +```bash +# Login will fail with error like: +# error: unable to log in: invalid username/password +``` + +**Response:** +- Report: "❌ Login failed: Invalid username or password" +- Ask user to verify their credentials in the Ambient session +- Suggest checking if credentials have expired + +### Scenario C: Unreachable Server + +If the cluster server is unreachable: + +```bash +# Login will fail with error like: +# error: dial tcp: lookup api.cluster.example.com: no such host +# or: error: dial tcp: i/o timeout +``` + +**Response:** +- Report: "❌ Login failed: Cannot reach cluster server" +- Verify the OCP_SERVER URL is correct +- Check network connectivity +- Suggest checking if VPN is required + +### Scenario D: Already Logged In + +If already logged into the cluster: + +```bash +# Check current login status first +if oc whoami &> /dev/null; then + current_user=$(oc whoami) + current_server=$(oc whoami --show-server) + + if [ "$current_server" = "$OCP_SERVER" ]; then + echo "ℹ️ Already logged into $OCP_SERVER as $current_user" + # Ask if user wants to re-login + fi +fi +``` + +**Response:** +- Inform user they're already logged in +- Show current username and server +- Ask if they want to re-login (in case credentials changed) + +### Scenario E: Multiple Clusters + +If the user has multiple cluster contexts: + +```bash +# List all contexts +oc config get-contexts + +# Show current context +oc config 
current-context +``` + +**Response:** +- Show available contexts +- Highlight which one is currently active +- Explain that login will switch to the new cluster + +## Security Considerations + +1. **Password Handling** + - Never echo or display the password + - Use environment variables to pass credentials + - Don't log passwords in command output + +2. **Session Storage** + - Login tokens are stored in `~/.kube/config` + - Tokens typically expire after a period (e.g., 24 hours) + - Re-login may be required if session expires + +3. **TLS Verification** + - For production: Use valid certificates and remove `--insecure-skip-tls-verify` + - For development: `--insecure-skip-tls-verify=true` is acceptable + - Warn users if skipping TLS verification + +## Example Interactions + +### Example 1: First Time Login + +**User**: `/oc-login` + +**Claude**: +1. Checks for credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD) +2. Verifies `oc` CLI is installed +3. Executes login command +4. Reports: "✅ Successfully logged into OpenShift cluster at https://api.cluster.example.com:6443 as admin" +5. Displays cluster version and current project + +### Example 2: Missing Credentials + +**User**: `/oc-login` + +**Claude**: +1. Checks for credentials +2. Finds OCP_PASSWORD is not set +3. Reports: "❌ Cannot login: OCP_PASSWORD is not configured in your Ambient session" +4. Provides instructions on setting up credentials + +### Example 3: Already Logged In + +**User**: `/oc-login` + +**Claude**: +1. Checks current login status +2. Finds user is already logged in +3. Reports: "ℹ️ Already logged into https://api.cluster.example.com:6443 as admin. Do you want to re-login?" +4. Waits for user confirmation + +### Example 4: Login After Session Expiry + +**User**: `/oc-login` + +**Claude**: +1. Attempts to verify current session +2. Finds token has expired +3. Reports: "⚠️ Previous session expired. Logging in again..." +4. Executes fresh login +5. 
Reports: "✅ Successfully logged in" + +## Common Issues and Troubleshooting + +### Issue 1: "command not found: oc" + +**Cause**: OpenShift CLI is not installed + +**Solution**: This command automatically installs `oc` CLI if not found. If you encounter this error, it means the automatic installation failed. Check: +- Do you have sudo permissions? +- Is the network connection working? +- Can you access https://mirror.openshift.com/? + +The command will automatically download and install oc CLI from: +``` +https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz +``` + +### Issue 2: "error: x509: certificate signed by unknown authority" + +**Cause**: Cluster uses self-signed certificate + +**Solution**: Use `--insecure-skip-tls-verify=true` flag (already included in the command) + +### Issue 3: "error: unable to connect to server: dial tcp: i/o timeout" + +**Cause**: Network connectivity issue or wrong server URL + +**Solution**: +- Verify OCP_SERVER URL is correct +- Check if VPN connection is required +- Test network connectivity: `curl -k $OCP_SERVER/healthz` + +### Issue 4: "You must be logged in to the server (Unauthorized)" + +**Cause**: Session token expired + +**Solution**: Run `/oc-login` again to refresh the session + +## Integration with Other Commands + +This command is often used before other commands: + +``` +/oc-login # Login first +/rhoai-update # Then update RHOAI +``` + +The `/rhoai-update` command assumes you're already logged into the cluster. + +## Success Criteria + +The login is successful when: +- ✅ `oc login` command completes without error +- ✅ `oc whoami` returns the expected username +- ✅ `oc cluster-info` shows cluster details +- ✅ `oc get projects` can list projects (permissions allowing) + +## Output Format + +Always provide: +1. **Status** - Success or failure of login +2. **Username** - Who you're logged in as +3. **Server** - Which cluster you're connected to +4. 
**Cluster Info** - OpenShift version and current project +5. **Any warnings** - TLS verification status, session expiry, etc. + +Keep the user informed about the login process and cluster state. diff --git a/workflows/rhoai-manager/.claude/commands/odh-pr-tracker.md b/workflows/rhoai-manager/.claude/commands/odh-pr-tracker.md new file mode 100644 index 00000000..0fa33c7d --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/odh-pr-tracker.md @@ -0,0 +1,118 @@ +# /odh-pr-tracker - Check if ODH PRs are in the RHOAI Build + +Check whether one or more ODH (Open Data Hub) pull requests have been pulled into the latest RHOAI build. + +## Purpose + +When developers merge changes into an `opendatahub-io/` upstream, those changes don't automatically appear in RHOAI images. The RHOAI team periodically syncs upstream commits into their `red-hat-data-services/` fork and pins a specific commit in the build config. This command tells you whether a given ODH PR has made it through that pipeline. + +Works for any component tracked in the RHOAI build config — odh-dashboard, eval-hub, or anything else. + +## How It Works + +ODH changes flow like this: +1. PR merged into `opendatahub-io/` (upstream) +2. RHOAI team syncs upstream into `red-hat-data-services/` (fork) +3. Build config (`red-hat-data-services/RHOAI-Build-Config`) is updated with the pinned commit +4. Konflux builds the image from that pinned commit + +"Is my PR in RHOAI?" = is the PR's merge commit an ancestor of the commit currently pinned in the RHOAI build config? + +## Prerequisites + +- `gh` CLI authenticated with access to `red-hat-data-services` org + +## Steps + +For each PR URL provided by the user (e.g. `https://github.com/opendatahub-io/eval-hub/pull/123`): + +### 1. 
Get the PR merge commit + +Parse the PR URL to extract the upstream org/repo and PR number, then: + +```bash +gh pr view <pr_number> --repo <upstream_org>/<repo> \ + --json mergeCommit,mergedAt,state,title +``` + +If `state` is not `"MERGED"`, report it as unmerged and skip further checks. + +### 2. Find the RHOAI-pinned commit for this repo + +Fetch the full build config map: + +```bash +curl -sf https://raw.githubusercontent.com/red-hat-data-services/RHOAI-Build-Config/rhoai-3.4/catalog/catalog_build_args.map +``` + +The fork URL is almost always `red-hat-data-services/<repo>` (same repo name, different org). Find the line: + +``` +<COMPONENT>_GIT_URL=https://github.com/red-hat-data-services/<repo> +``` + +There may be multiple components pointing to the same repo (e.g. odh-dashboard has several modular-arch entries). Pick the one most relevant — for dashboard use `ODH_DASHBOARD_GIT_URL`, otherwise take the first match. Then swap `_GIT_URL` → `_GIT_COMMIT` to get the pinned SHA. + +Example for eval-hub: +``` +ODH_EVAL_HUB_GIT_URL=https://github.com/red-hat-data-services/eval-hub +ODH_EVAL_HUB_GIT_COMMIT=1aad0fe1... +``` + +### 3. Compare the two commits + +```bash +gh api "repos/red-hat-data-services/<repo>/compare/<pr_merge_commit>...<rhoai_commit>" \ + --jq '{status: .status, behind_by: .behind_by}' +``` + +Interpret the result: +- `status: "ahead"` and `behind_by: 0` → PR commit IS an ancestor of the RHOAI commit → **included** ✅ +- `status: "diverged"` or `behind_by > 0` → PR is NOT yet in the RHOAI build → **not included** ❌ +- `status: "behind"` → RHOAI is behind the PR commit → **not included** ❌ +- `status: "identical"` → same commit → **included** ✅ + +The merge commit SHA is the same in both repos because the fork mirrors upstream commits directly (not rebased). + +### 4.
Output a clear summary + +For each PR: + +``` +PR #<number>: [<upstream_org>/<repo>] + Merged: <mergedAt> + RHOAI build at: <rhoai_commit_short> (rhoai-3.4 branch) + Status: ✅ Included in latest RHOAI build + — or — + ❌ NOT yet in RHOAI build +``` + +If multiple PRs were provided, check all of them and summarize together. + +## Notes + +- The `rhoai-3.4` branch is the active release branch as of early 2026. If it no longer exists, check `https://github.com/red-hat-data-services/RHOAI-Build-Config` for the current branch and use that instead. +- If the repo name differs between upstream and the RH fork, the `_GIT_URL` lookup will still find it — just grep for the fork URL directly. +- This checks what's in the **build config**, not what's on a specific cluster. To check a deployed cluster, also compare the cluster's running image against the build config. + +## Example Usage + +**User**: `/odh-pr-tracker https://github.com/opendatahub-io/odh-dashboard/pull/6959` + +**Claude**: +1. Gets merge commit `f754568f` for PR #6959 in `opendatahub-io/odh-dashboard` +2. Finds `ODH_DASHBOARD_GIT_URL=.../odh-dashboard` → grabs `ODH_DASHBOARD_GIT_COMMIT=297a39d8` +3. Compares: status `ahead`, `behind_by: 0` → included +4. Reports: ✅ PR #6959 is included in the latest RHOAI build + +**User**: `/odh-pr-tracker https://github.com/opendatahub-io/eval-hub/pull/42` + +**Claude**: +1. Gets merge commit for PR #42 in `opendatahub-io/eval-hub` +2. Finds `ODH_EVAL_HUB_GIT_URL=.../eval-hub` → grabs `ODH_EVAL_HUB_GIT_COMMIT=1aad0fe1` +3. Compares commits in `red-hat-data-services/eval-hub` +4. Reports result + +**User**: `/odh-pr-tracker https://github.com/opendatahub-io/odh-dashboard/pull/6959 https://github.com/opendatahub-io/eval-hub/pull/42` + +Claude checks both PRs and reports status for each.
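The four odh-pr-tracker steps can be collapsed into a single script. A minimal sketch, assuming `gh` is authenticated, `curl` is available, and the `rhoai-3.4` build-config branch is still current; the `parse_pr_url` and `check_pr` helper names are illustrative, not part of any existing tooling:

```bash
#!/usr/bin/env bash
# Sketch: check whether an ODH PR's merge commit is pinned in the RHOAI build config.
MAP_URL="https://raw.githubusercontent.com/red-hat-data-services/RHOAI-Build-Config/rhoai-3.4/catalog/catalog_build_args.map"

# Split a PR URL like https://github.com/<org>/<repo>/pull/<n> into "org repo n"
parse_pr_url() {
  local url="${1%/}"
  echo "$(echo "$url" | cut -d/ -f4) $(echo "$url" | cut -d/ -f5) $(echo "$url" | cut -d/ -f7)"
}

check_pr() {
  local org repo num merge_sha map prefix pinned status
  read -r org repo num <<<"$(parse_pr_url "$1")"

  # Step 1: merge commit of the upstream PR (empty if the PR is not merged)
  merge_sha=$(gh pr view "$num" --repo "$org/$repo" --json mergeCommit,state \
    --jq 'if .state == "MERGED" then .mergeCommit.oid else empty end')
  if [ -z "$merge_sha" ]; then echo "PR #$num: not merged, skipping"; return; fi

  # Step 2: the commit pinned for this repo in the build config
  map=$(curl -sf "$MAP_URL")
  prefix=$(grep -m1 "_GIT_URL=https://github.com/red-hat-data-services/$repo\$" <<<"$map" | sed 's/_GIT_URL=.*//')
  pinned=$(grep -m1 "^${prefix}_GIT_COMMIT=" <<<"$map" | cut -d= -f2)

  # Step 3: ancestor check (base = PR merge commit, head = pinned commit);
  # "ahead" or "identical" means the PR commit is an ancestor of the pinned commit
  status=$(gh api "repos/red-hat-data-services/$repo/compare/$merge_sha...$pinned" --jq .status)
  case "$status" in
    ahead|identical) echo "✅ PR #$num is in the RHOAI build (pinned: ${pinned:0:8})" ;;
    *)               echo "❌ PR #$num is NOT yet in the RHOAI build (status: $status)" ;;
  esac
}

# Step 4: summarize each PR passed on the command line
for pr_url in "$@"; do check_pr "$pr_url"; done
```

Error handling for a component missing from the map or a network failure is left out for brevity; the real command should report those cases explicitly rather than print an empty status.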
diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-install.md b/workflows/rhoai-manager/.claude/commands/rhoai-install.md new file mode 100644 index 00000000..5b9ef10b --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/rhoai-install.md @@ -0,0 +1,555 @@ +# /rhoai-install - Install RHOAI on OpenShift Cluster + +Install Red Hat OpenShift AI (RHOAI) on an OpenShift cluster using OLM (Operator Lifecycle Manager). + +## Command Usage + +### Development/Nightly Builds (default) +```bash +/rhoai-install # Latest dev catalog (3.4 beta) +/rhoai-install channel=beta # Explicit beta channel +/rhoai-install image=quay.io/modh/rhoai-catalog:latest-release-3.5 # Custom image +``` + +### GA Production Releases +```bash +/rhoai-install catalog=redhat-operators # GA catalog, stable channel +/rhoai-install catalog=redhat-operators channel=fast # GA catalog, fast channel +/rhoai-install catalog=redhat-operators channel=stable # GA catalog, stable channel +``` + +### Combined Parameters +```bash +/rhoai-install catalog=rhoai-catalog-dev channel=beta image=quay.io/modh/rhoai-catalog:custom +``` + +## Catalog Types + +| Catalog | Description | Use Case | +|---------|-------------|----------| +| `rhoai-catalog-dev` (default) | Development nightly builds | Testing EA/nightly builds | +| `redhat-operators` | Red Hat certified GA releases | Production deployments | + +## Available Channels + +| Channel | Description | Catalog Type | +|---------|-------------|--------------| +| `beta` (default) | Latest EA/nightly builds | rhoai-catalog-dev | +| `fast` | Early GA releases | redhat-operators | +| `stable` | Stable GA releases | redhat-operators | + +## Prerequisites + +Before running this command: +1. **Cluster access**: Logged into OpenShift cluster with cluster-admin privileges (use `/oc-login`) +2. **Tools installed**: `oc` CLI and `jq` must be available +3. 
**No existing RHOAI**: This command is for fresh installations only
+
+## Process
+
+### Step 1: Parse Input Arguments
+
+```bash
+# Default values
+CATALOG_SOURCE="rhoai-catalog-dev"
+CATALOG_IMAGE=""
+CHANNEL="beta"
+CUSTOM_IMAGE_OVERRIDE=""
+USER_SET_CHANNEL=false
+
+# Parse key=value arguments
+for arg in "$@"; do
+  case "$arg" in
+    catalog=*)
+      CATALOG_SOURCE="${arg#*=}"
+      ;;
+    channel=*)
+      CHANNEL="${arg#*=}"
+      USER_SET_CHANNEL=true
+      ;;
+    image=*)
+      CUSTOM_IMAGE_OVERRIDE="${arg#*=}"
+      ;;
+    *)
+      echo "⚠️ Unknown parameter: $arg (expected: catalog=, channel=, or image=)"
+      ;;
+  esac
+done
+
+# Smart defaults based on catalog type
+if [[ "$CATALOG_SOURCE" == "rhoai-catalog-dev" ]]; then
+  # Development catalog - use custom image or default
+  if [[ -n "$CUSTOM_IMAGE_OVERRIDE" ]]; then
+    CATALOG_IMAGE="$CUSTOM_IMAGE_OVERRIDE"
+  else
+    CATALOG_IMAGE="quay.io/modh/rhoai-catalog:latest-release-3.4"
+  fi
+  CATALOG_NAMESPACE="openshift-marketplace"
+  USE_CUSTOM_CATALOG=true
+
+  echo "📦 Catalog: Development (rhoai-catalog-dev)"
+  echo "   Image: $CATALOG_IMAGE"
+  echo "   Channel: $CHANNEL"
+
+elif [[ "$CATALOG_SOURCE" == "redhat-operators" ]]; then
+  # GA catalog - uses built-in Red Hat operators catalog
+  CATALOG_IMAGE=""
+  CATALOG_NAMESPACE="openshift-marketplace"
+  USE_CUSTOM_CATALOG=false
+
+  # The beta default only makes sense for the dev catalog; default the GA
+  # catalog to the stable channel unless the user explicitly chose one
+  if [[ "$USER_SET_CHANNEL" == "false" ]]; then
+    CHANNEL="stable"
+  fi
+
+  echo "📦 Catalog: GA Production (redhat-operators)"
+  echo "   Channel: $CHANNEL"
+
+  if [[ -n "$CUSTOM_IMAGE_OVERRIDE" ]]; then
+    echo "⚠️ WARNING: image parameter ignored for redhat-operators catalog (uses built-in catalog)"
+  fi
+
+else
+  echo "❌ ERROR: Unknown catalog '$CATALOG_SOURCE'"
+  echo "   Supported: rhoai-catalog-dev, redhat-operators"
+  exit 1
+fi
+```
+
+**Parameter Summary:**
+- `catalog` - Catalog source to use (default: `rhoai-catalog-dev`)
+- `channel` - Subscription channel (default: `beta` for the dev catalog, `stable` for `redhat-operators`)
+- `image` - Custom catalog image (only for rhoai-catalog-dev)
+
+### Step 2: Verify Cluster Access
+
+```bash
+# Small helper used by the remaining steps
+die() { echo "❌ ERROR: $*" >&2; exit 1; }
+
+# Check prerequisites
+command -v oc &>/dev/null || die "oc command not found"
+command -v jq &>/dev/null || die 
"jq command not found" +oc whoami &>/dev/null || die "Not logged into an OpenShift cluster" + +echo "Logged in as: $(oc whoami)" +echo "Cluster: $(oc whoami --show-server)" + +# Verify RHOAI is not already installed +if oc get csv -n redhat-ods-operator 2>/dev/null | grep -q rhods-operator; then + die "RHOAI is already installed. Use /rhoai-update to update existing installation." +fi +``` + +### Step 3: Create Operator Namespace + +```bash +OPERATOR_NAMESPACE="redhat-ods-operator" + +# Create namespace if it doesn't exist +if ! oc get namespace "$OPERATOR_NAMESPACE" &>/dev/null; then + oc create namespace "$OPERATOR_NAMESPACE" + echo "✅ Created namespace: $OPERATOR_NAMESPACE" +else + echo "✅ Namespace already exists: $OPERATOR_NAMESPACE" +fi +``` + +### Step 4: Create CatalogSource (if using custom catalog) + +```bash +if [[ "$USE_CUSTOM_CATALOG" == "true" ]]; then + echo "Creating custom CatalogSource: $CATALOG_SOURCE" + + cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: CatalogSource +metadata: + name: $CATALOG_SOURCE + namespace: $CATALOG_NAMESPACE +spec: + displayName: "Red Hat OpenShift AI Dev Catalog" + image: $CATALOG_IMAGE + publisher: Red Hat + sourceType: grpc + updateStrategy: + registryPoll: + interval: 30m +EOF + + echo "✅ CatalogSource created: $CATALOG_SOURCE" + + # Wait for catalog to be ready + echo "Waiting for CatalogSource to be ready..." 
+ TIMEOUT=300 + INTERVAL=10 + ELAPSED=0 + + while [[ $ELAPSED -lt $TIMEOUT ]]; do + CATALOG_STATE=$(oc get catalogsource "$CATALOG_SOURCE" -n "$CATALOG_NAMESPACE" \ + -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null || echo "") + + if [[ "$CATALOG_STATE" == "READY" ]]; then + echo "✅ CatalogSource is READY" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo " CatalogSource state: ${CATALOG_STATE:-Unknown} (${ELAPSED}s/${TIMEOUT}s)" + done + + [[ "$CATALOG_STATE" == "READY" ]] || echo "⚠️ WARNING: CatalogSource not READY after ${TIMEOUT}s" +else + echo "Using built-in catalog: $CATALOG_SOURCE" +fi +``` + +### Step 5: Create OperatorGroup + +```bash +# Create OperatorGroup in operator namespace +cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1 +kind: OperatorGroup +metadata: + name: rhods-operator + namespace: $OPERATOR_NAMESPACE +spec: + targetNamespaces: + - $OPERATOR_NAMESPACE +EOF + +echo "✅ OperatorGroup created" +``` + +### Step 6: Create Subscription + +```bash +# Create Subscription +cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: Subscription +metadata: + name: rhods-operator + namespace: $OPERATOR_NAMESPACE +spec: + channel: $CHANNEL + installPlanApproval: Automatic + name: rhods-operator + source: $CATALOG_SOURCE + sourceNamespace: $CATALOG_NAMESPACE +EOF + +echo "✅ Subscription created" +echo " Channel: $CHANNEL" +echo " Source: $CATALOG_SOURCE" + +sleep 5 +``` + +This creates: +- **Namespace**: `redhat-ods-operator` +- **CatalogSource**: Custom catalog (if using dev catalog) or uses built-in `redhat-operators` +- **Subscription**: `rhods-operator` pointing to the chosen catalog +- **OperatorGroup**: For the operator namespace + +### Step 7: Wait for Operator CSV + +```bash +# Wait up to 600 seconds for CSV to reach Succeeded +CSV_PHASE="" +TIMEOUT=600 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CSV_LINE=$(oc get csv -n redhat-ods-operator 
2>/dev/null | grep rhods-operator | grep -v Replacing || echo "") + + if [[ -n "$CSV_LINE" ]]; then + CSV_NAME=$(echo "$CSV_LINE" | awk "{print \$1}") + CSV_PHASE=$(echo "$CSV_LINE" | awk "{print \$NF}") + echo "CSV: $CSV_NAME, Phase: $CSV_PHASE" + + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + echo "✅ Operator installed successfully" + break + fi + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for rhods-operator CSV... (${ELAPSED}s/${TIMEOUT}s)" +done + +[[ "$CSV_PHASE" == "Succeeded" ]] || die "Operator did not reach Succeeded phase within ${TIMEOUT}s" +``` + +### Step 8: Create DataScienceCluster + +```bash +# Wait for DSCInitialization +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get dscinitializations default-dsci &>/dev/null; then + echo "✅ DSCInitialization found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for DSCInitialization... (${ELAPSED}s/${TIMEOUT}s)" +done + +oc get dscinitializations default-dsci &>/dev/null || die "DSCInitialization not found within ${TIMEOUT}s" + +# Extract DSC from CSV initialization-resource +CSV_NAME=$(oc get csv -n redhat-ods-operator 2>/dev/null | awk '/rhods-operator/{print $1; exit}') +if [[ -n "$CSV_NAME" ]]; then + oc get csv "$CSV_NAME" -n redhat-ods-operator \ + -o jsonpath='{.metadata.annotations.operatorframework\.io/initialization-resource}' \ + > /tmp/default-dsc.json + + oc apply -f /tmp/default-dsc.json + echo "✅ DSC created from CSV initialization-resource" +else + die "Cannot find rhods-operator CSV in redhat-ods-operator namespace" +fi +``` + +### Step 9: Configure DSC Components + +```bash +# Wait for DSC to exist +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get datasciencecluster default-dsc &>/dev/null; then + echo "✅ DataScienceCluster found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for DataScienceCluster... 
(${ELAPSED}s/${TIMEOUT}s)" +done + +# Patch DSC to enable required components +cat > /tmp/dsc-components-patch.yaml << 'YAML' +spec: + components: + aipipelines: + managementState: Managed + argoWorkflowsControllers: + managementState: Managed + llamastackoperator: + managementState: Managed + mlflowoperator: + managementState: Managed + trainer: + managementState: Removed +YAML + +oc patch datasciencecluster default-dsc --type merge --patch-file /tmp/dsc-components-patch.yaml || \ + die "Failed to patch DataScienceCluster" + +echo "✅ DSC component configuration applied:" +echo " - aipipelines: Managed (with argoWorkflowsControllers)" +echo " - llamastackoperator: Managed" +echo " - mlflowoperator: Managed" +echo " - trainer: Removed (requires JobSet operator)" + +sleep 5 +``` + +**Why these components?** +- `aipipelines`: For AI/ML pipelines with Argo Workflows +- `llamastackoperator`: For Llama Stack server deployments +- `mlflowoperator`: For ML experiment tracking +- `trainer`: Removed (requires JobSet operator, not available by default) + +### Step 10: Wait for DSC Ready + +```bash +# Wait for DataScienceCluster to be Ready +TIMEOUT=600 +INTERVAL=15 +ELAPSED=0 +DSC_PHASE="" + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + DSC_PHASE=$(oc get datasciencecluster -o jsonpath="{.items[0].status.phase}" 2>/dev/null || echo "Unknown") + echo "DSC phase: $DSC_PHASE" + + if [[ "$DSC_PHASE" == "Ready" ]]; then + echo "✅ DataScienceCluster is Ready" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for DataScienceCluster... 
(${ELAPSED}s/${TIMEOUT}s)" +done + +if [[ "$DSC_PHASE" != "Ready" ]]; then + echo "⚠️ WARNING: DSC is not Ready after ${TIMEOUT}s (current: ${DSC_PHASE:-Unknown})" + echo "Not-ready components:" + oc get dsc default-dsc -o json 2>/dev/null | \ + jq -r '.status.conditions[] | select(.status=="False") | select(.message | test("Removed") | not) | " \(.type): \(.message)"' 2>/dev/null || true +fi +``` + +### Step 11: Wait for Dashboard + +```bash +# Wait for dashboard deployment to be ready +TIMEOUT=300 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + READY=$(oc get deployment rhods-dashboard -n redhat-ods-applications -o jsonpath="{.status.readyReplicas}" 2>/dev/null || echo "0") + DESIRED=$(oc get deployment rhods-dashboard -n redhat-ods-applications -o jsonpath="{.spec.replicas}" 2>/dev/null || echo "0") + + if [[ "$READY" -gt 0 && "$READY" -eq "$DESIRED" ]]; then + echo "✅ Dashboard deployment is ready" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for dashboard deployment... (${ELAPSED}s/${TIMEOUT}s)" +done + +echo "Dashboard containers:" +oc get deployment rhods-dashboard -n redhat-ods-applications \ + -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\n"}{end}' 2>/dev/null || \ + echo " Dashboard deployment not found" +``` + +### Step 12: Configure Dashboard Features + +```bash +# Wait for OdhDashboardConfig to exist +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + echo "✅ OdhDashboardConfig found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for OdhDashboardConfig... (${ELAPSED}s/${TIMEOUT}s)" +done + +if ! 
oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + echo "⚠️ WARNING: OdhDashboardConfig not found yet, feature flags will be configured when available" +else + # Enable feature flags + oc patch odhdashboardconfig odh-dashboard-config -n redhat-ods-applications --type merge -p '{ + "spec": { + "dashboardConfig": { + "automl": true, + "autorag": true, + "genAiStudio": true + } + } + }' || { + echo "⚠️ WARNING: Failed to patch dashboard config, feature flags may need manual configuration" + } + + echo "✅ Dashboard feature flags configured:" + echo " - automl: enabled" + echo " - autorag: enabled" + echo " - genAiStudio: enabled" + + # Restart dashboard to pick up changes + echo "Restarting dashboard to apply feature flag changes..." + oc rollout restart deployment rhods-dashboard -n redhat-ods-applications 2>/dev/null || true + sleep 3 +fi +``` + +### Step 13: Verify Installation + +```bash +echo "" +echo "=== Installation Summary ===" + +# Show CSV +echo "" +echo "CSV:" +oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator || echo " WARNING: CSV not found" + +# Show Dashboard URL +echo "" +echo "Dashboard:" +DASHBOARD_ROUTE=$(oc get route rhods-dashboard -n redhat-ods-applications -o jsonpath='{.spec.host}' 2>/dev/null || echo "") +if [[ -n "$DASHBOARD_ROUTE" ]]; then + echo " https://$DASHBOARD_ROUTE" +else + echo " WARNING: Dashboard route not found yet" +fi + +echo "" +echo "✅ RHOAI installation complete!" 
+``` + +## Output + +The command creates a report at `artifacts/rhoai-manager/reports/install-report-[timestamp].md` with: +- Installation parameters (catalog source, channel, image) +- Operator CSV details +- DataScienceCluster status +- Configured components +- Dashboard URL +- Feature flags enabled + +## Usage Examples + +### Development/Testing +```bash +# Install latest dev build (default: beta channel, dev catalog) +/rhoai-install + +# Install from dev catalog with custom image +/rhoai-install image=quay.io/modh/rhoai-catalog:latest-release-3.5 + +# Install from dev catalog with specific channel +/rhoai-install channel=beta +``` + +### Production GA +```bash +# Install from GA catalog (stable channel) +/rhoai-install catalog=redhat-operators channel=stable + +# Install from GA catalog (fast channel for early GA releases) +/rhoai-install catalog=redhat-operators channel=fast + +# Install from GA catalog with default stable channel +/rhoai-install catalog=redhat-operators +``` + +Or simply ask: +- "Install RHOAI from dev catalog" +- "Install RHOAI from production catalog" +- "Set up RHOAI on my cluster" +- "Install latest RHOAI nightly" + +## Common Issues + +**Problem:** CSV stuck in "Installing" phase +**Solution:** Check operator pod logs in `redhat-ods-operator` namespace + +**Problem:** DSC not reaching Ready +**Solution:** Check component conditions with `oc get dsc default-dsc -o yaml | yq '.status.conditions'` + +**Problem:** Dashboard not accessible +**Solution:** Verify route exists and check dashboard pod logs in `redhat-ods-applications` + +## Next Steps + +After installation: +1. Access the dashboard at the URL shown in the output +2. Configure user access and permissions +3. Deploy models and workbenches +4. Set up data connections + +To update RHOAI to a newer version, use `/rhoai-update`. 
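Steps 4 and 7-12 above all repeat the same `TIMEOUT`/`INTERVAL`/`ELAPSED` polling loop. A reusable helper, sketched here as a possible refactor rather than part of the command itself, captures that pattern once:

```bash
# Generic polling helper mirroring the wait loops used in the steps above.
# Usage: wait_for "<label>" <timeout_seconds> <interval_seconds> <check-command...>
wait_for() {
  local label="$1" timeout="$2" interval="$3"
  shift 3
  local elapsed=0
  while (( elapsed < timeout )); do
    if "$@" &>/dev/null; then
      echo "✅ ${label} is ready"
      return 0
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
    echo "Waiting for ${label}... (${elapsed}s/${timeout}s)"
  done
  echo "⚠️ WARNING: ${label} not ready after ${timeout}s"
  return 1
}

# Example (hypothetical check, matching Step 4): wait until the
# CatalogSource reports READY.
# wait_for "CatalogSource" 300 10 \
#   sh -c '[ "$(oc get catalogsource rhoai-catalog-dev -n openshift-marketplace \
#     -o jsonpath="{.status.connectionState.lastObservedState}")" = "READY" ]'
```

The check is any command whose exit status signals readiness, so the same helper covers CSV phase, DSC phase, and deployment-replica checks.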
diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-uninstall.md b/workflows/rhoai-manager/.claude/commands/rhoai-uninstall.md new file mode 100644 index 00000000..97ca5d8d --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/rhoai-uninstall.md @@ -0,0 +1,436 @@ +# /rhoai-uninstall - Uninstall RHOAI from Cluster + +Completely uninstall Red Hat OpenShift AI (RHOAI) from an OpenShift cluster, removing all related resources. + +## Purpose + +This command performs a comprehensive cleanup of RHOAI, removing the operator, custom resources, CRDs, and all related namespaces. + +## Prerequisites + +- Must be logged into an OpenShift cluster (use `/oc-login` first if needed) +- Cluster admin permissions required +- RHOAI must be installed on the cluster + +## Command Usage + +- `/rhoai-uninstall` - Standard uninstall (forceful cleanup) +- `/rhoai-uninstall graceful` - Graceful uninstall followed by forceful cleanup +- `/rhoai-uninstall keep-crds` - Uninstall but keep CRDs +- `/rhoai-uninstall keep-all` - Keep CRDs and user resources (projects, models, etc.) + +## Uninstall Options + +### Standard Uninstall (Default) +Forcefully removes all RHOAI resources including: +- Operator and subscriptions +- Custom resources (DSC, DSCInitialization, etc.) 
+- CRDs +- Namespaces +- User resources (data science projects, models, workbenches) + +### Graceful Uninstall +Attempts graceful removal first, then forceful cleanup: +- Allows RHOAI to clean up resources in proper order +- Runs finalizers correctly +- Falls back to forceful cleanup if graceful fails + +### Keep CRDs +Removes RHOAI but keeps the CRDs installed + +### Keep All +Keeps both CRDs and user resources: +- Data science projects remain +- User models, workbenches, connections preserved +- Useful for reinstalling RHOAI without losing user work + +## Uninstall Process + +### Step 1: Verify Cluster Access + +Check that you're logged into the cluster with admin permissions: + +```bash +# Verify login +oc whoami + +# Verify admin permissions +oc auth can-i delete namespace +``` + +If not logged in or lacking permissions, stop and inform the user. + +### Step 2: Check Current RHOAI Installation + +Verify RHOAI is installed: + +```bash +# Check for RHOAI operator namespace +oc get namespace redhat-ods-operator 2>/dev/null + +# Check for RHOAI operator +oc get csv -n redhat-ods-operator | grep rhods-operator + +# Check for DataScienceCluster +oc get datasciencecluster -A +``` + +Report what's found and confirm with user before proceeding. + +### Step 3: Graceful Uninstall (if requested) + +If graceful uninstall is requested: + +```bash +# Create the deletion ConfigMap +oc create configmap delete-self-managed-odh -n redhat-ods-operator + +# Label it to trigger graceful deletion +oc label configmap/delete-self-managed-odh \ + api.openshift.com/addon-managed-odh-delete=true \ + -n redhat-ods-operator + +# Wait for redhat-ods-applications namespace to be removed (up to 5 minutes) +echo "Waiting for graceful deletion to complete (max 5 minutes)..." 
+if oc wait --for=delete --timeout=300s namespace redhat-ods-applications 2>/dev/null; then + echo "✅ Graceful deletion completed successfully" +else + echo "⚠️ Graceful deletion timed out or failed, proceeding with forceful cleanup" +fi + +# Clean up the ConfigMap +oc delete configmap delete-self-managed-odh -n redhat-ods-operator --ignore-not-found +``` + +### Step 4: Delete RHOAI Custom Resources + +Remove all RHOAI custom resources before deleting CRDs: + +```bash +# Delete DataScienceCluster +echo "Deleting DataScienceCluster resources..." +oc get datasciencecluster -A -o custom-columns=:metadata.name,:metadata.namespace --no-headers | \ + while read name namespace; do + oc patch datasciencecluster $name -n $namespace --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true + oc delete datasciencecluster $name -n $namespace --timeout=60s --ignore-not-found + done + +# Delete DSCInitialization +echo "Deleting DSCInitialization resources..." +oc get dscinitialization -A -o custom-columns=:metadata.name,:metadata.namespace --no-headers | \ + while read name namespace; do + oc patch dscinitialization $name -n $namespace --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true + oc delete dscinitialization $name -n $namespace --timeout=60s --ignore-not-found + done + +# Delete Notebooks (they often have finalizers) +echo "Deleting Notebook resources..." +oc get notebooks.kubeflow.org -A -o custom-columns=:metadata.name,:metadata.namespace --no-headers | \ + while read name namespace; do + oc patch notebooks.kubeflow.org $name -n $namespace --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true + oc delete notebooks.kubeflow.org $name -n $namespace --timeout=60s --ignore-not-found + done + +# Delete InferenceServices +echo "Deleting InferenceService resources..." +oc delete inferenceservices.serving.kserve.io --all -A --ignore-not-found --timeout=60s + +# Delete ServingRuntimes +echo "Deleting ServingRuntime resources..." 
+oc delete servingruntimes.serving.kserve.io --all -A --ignore-not-found --timeout=60s + +# Delete DataSciencePipelinesApplications +echo "Deleting DataSciencePipelinesApplication resources..." +oc delete datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io --all -A --ignore-not-found --timeout=60s +``` + +### Step 5: Delete Webhooks + +Remove validating and mutating webhooks that may block deletion: + +```bash +# Delete RHOAI-related validating webhooks +echo "Deleting validating webhooks..." +oc get validatingwebhookconfiguration -o json | \ + jq -r '.items[] | select(.metadata.name | test("odh|rhods|opendatahub|kserve")) | .metadata.name' | \ + xargs -r oc delete validatingwebhookconfiguration + +# Delete RHOAI-related mutating webhooks +echo "Deleting mutating webhooks..." +oc get mutatingwebhookconfiguration -o json | \ + jq -r '.items[] | select(.metadata.name | test("odh|rhods|opendatahub|kserve")) | .metadata.name' | \ + xargs -r oc delete mutatingwebhookconfiguration +``` + +### Step 6: Delete RHOAI Operator + +Remove the operator subscription and CSV: + +```bash +# Delete subscription +echo "Deleting RHOAI operator subscription..." +oc delete subscription rhods-operator -n redhat-ods-operator --ignore-not-found --timeout=60s + +# Delete CSV +echo "Deleting ClusterServiceVersion..." +CSV_NAME=$(oc get csv -n redhat-ods-operator -o custom-columns=:metadata.name --no-headers | grep rhods-operator) +if [ -n "$CSV_NAME" ]; then + oc delete csv $CSV_NAME -n redhat-ods-operator --ignore-not-found --timeout=60s +fi + +# Delete catalog source if it's a dev catalog +echo "Checking for dev catalog sources..." +if oc get catalogsource rhoai-catalog-dev -n openshift-marketplace &>/dev/null; then + echo "Deleting rhoai-catalog-dev..." 
+  oc delete catalogsource rhoai-catalog-dev -n openshift-marketplace --ignore-not-found
+fi
+```
+
+### Step 7: Delete Namespaces
+
+Remove all RHOAI-related namespaces:
+
+```bash
+# List of RHOAI namespaces
+NAMESPACES="redhat-ods-operator redhat-ods-applications redhat-ods-applications-auth-provider redhat-ods-monitoring rhods-notebooks rhoai-model-registries"
+
+for ns in $NAMESPACES; do
+  if oc get namespace $ns &>/dev/null; then
+    echo "Deleting namespace: $ns"
+
+    # Delete all resources in the namespace first
+    oc delete all --all -n $ns --ignore-not-found --timeout=30s 2>/dev/null || true
+
+    # Delete the namespace
+    oc delete namespace $ns --ignore-not-found --timeout=60s || true
+
+    # If stuck in Terminating, clear spec.finalizers via the finalize
+    # subresource (a plain merge patch of namespace finalizers is not honored)
+    if oc get namespace $ns -o jsonpath='{.status.phase}' 2>/dev/null | grep -q "Terminating"; then
+      echo "  Namespace stuck in Terminating, clearing finalizers..."
+      oc get namespace $ns -o json | jq '.spec.finalizers=[]' | \
+        oc replace --raw "/api/v1/namespaces/$ns/finalize" -f - 2>/dev/null || true
+    fi
+  fi
+done
+```
+
+### Step 8: Delete CRDs (unless keep-crds or keep-all)
+
+If the user didn't request to keep CRDs:
+
+```bash
+echo "Deleting RHOAI CRDs..." 
+ +# Core RHOAI CRDs +oc delete crd datascienceclusters.datasciencecluster.opendatahub.io --ignore-not-found +oc delete crd dscinitializations.dscinitialization.opendatahub.io --ignore-not-found +oc delete crd acceleratorprofiles.dashboard.opendatahub.io --ignore-not-found +oc delete crd hardwareprofiles.dashboard.opendatahub.io --ignore-not-found +oc delete crd odhapplications.dashboard.opendatahub.io --ignore-not-found +oc delete crd odhdashboardconfigs.opendatahub.io --ignore-not-found +oc delete crd odhdocuments.dashboard.opendatahub.io --ignore-not-found +oc delete crd modelregistries.modelregistry.opendatahub.io --ignore-not-found + +# KServe CRDs +oc delete crd inferenceservices.serving.kserve.io --ignore-not-found +oc delete crd servingruntimes.serving.kserve.io --ignore-not-found +oc delete crd inferencegraphs.serving.kserve.io --ignore-not-found + +# Notebook CRDs (remove finalizers first) +oc get notebooks.kubeflow.org -A -o custom-columns=:metadata.name,:metadata.namespace --no-headers | \ + while read name namespace; do + oc patch notebooks.kubeflow.org $name -n $namespace --type=merge -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true + done +oc delete crd notebooks.kubeflow.org --ignore-not-found + +# DataSciencePipelinesApplications +oc delete crd datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io --ignore-not-found + +# All CRDs labeled by RHOAI operator +oc delete crd -l operators.coreos.com/rhods-operator.redhat-ods-operator --ignore-not-found + +# Ray CRDs +oc delete crd rayclusters.ray.io --ignore-not-found +oc delete crd rayjobs.ray.io --ignore-not-found +oc delete crd rayservices.ray.io --ignore-not-found + +# CodeFlare CRDs +oc delete crd appwrappers.workload.codeflare.dev --ignore-not-found + +# TrustyAI CRDs +oc delete crd trustyaiservices.trustyai.opendatahub.io --ignore-not-found +``` + +### Step 9: Clean Up User Resources (if keep-all not requested) + +Remove user data science projects and resources: 
+ +```bash +# Find and delete data science project namespaces +echo "Looking for user data science projects..." +USER_PROJECTS=$(oc get namespaces -l opendatahub.io/dashboard=true -o custom-columns=:metadata.name --no-headers) + +if [ -n "$USER_PROJECTS" ]; then + echo "Found user projects: $USER_PROJECTS" + for project in $USER_PROJECTS; do + echo " Deleting project: $project" + oc delete namespace $project --ignore-not-found --timeout=60s || true + done +else + echo "No user data science projects found" +fi +``` + +### Step 10: Verify Cleanup + +Check that all resources have been removed: + +```bash +# Check for remaining RHOAI namespaces +echo "Checking for remaining RHOAI namespaces..." +REMAINING_NS=$(oc get namespaces | grep -E "redhat-ods|rhods|rhoai" || echo "") +if [ -n "$REMAINING_NS" ]; then + echo "⚠️ Some namespaces still exist:" + echo "$REMAINING_NS" +else + echo "✅ All RHOAI namespaces removed" +fi + +# Check for RHOAI CRDs +echo "Checking for remaining RHOAI CRDs..." +REMAINING_CRDS=$(oc get crd | grep -E "opendatahub|kubeflow|kserve" || echo "") +if [ -n "$REMAINING_CRDS" ]; then + echo "⚠️ Some CRDs still exist:" + echo "$REMAINING_CRDS" +else + echo "✅ All RHOAI CRDs removed" +fi + +# Check for operator +echo "Checking for RHOAI operator..." +REMAINING_CSV=$(oc get csv -A | grep rhods-operator || echo "") +if [ -n "$REMAINING_CSV" ]; then + echo "⚠️ RHOAI operator still exists:" + echo "$REMAINING_CSV" +else + echo "✅ RHOAI operator removed" +fi +``` + +### Step 11: Report Summary + +Provide a summary of what was removed: + +``` +✅ RHOAI Uninstall Complete! + +Removed: +- RHOAI Operator +- DataScienceCluster and DSCInitialization +- All RHOAI namespaces +- Custom Resources (notebooks, inference services, etc.) +[- CRDs (if not kept)] +[- User data science projects (if not kept)] + +The cluster is now clean and ready for a fresh RHOAI installation if needed. +``` + +## Important Warnings + +**Before running this command, warn the user:** + +1. 
**⚠️ Data Loss Warning** + - This will DELETE all RHOAI resources including user workbenches, models, and data + - User should backup any important work first + - Cannot be undone + +2. **⚠️ Cluster Access Required** + - Requires cluster-admin permissions + - Will modify cluster-wide resources (CRDs, webhooks) + +3. **⚠️ Downtime Warning** + - Any running workloads will be terminated + - Data science pipelines will be stopped + - Active model servers will be shut down + +## Example Interactions + +### Example 1: Standard Uninstall + +**User**: `/rhoai-uninstall` + +**Claude**: +1. Checks cluster access and RHOAI installation +2. Warns about data loss and asks for confirmation +3. Deletes custom resources +4. Removes webhooks +5. Deletes operator +6. Removes namespaces +7. Deletes CRDs +8. Reports: "✅ RHOAI completely removed from cluster" + +### Example 2: Graceful Uninstall + +**User**: `/rhoai-uninstall graceful` + +**Claude**: +1. Creates deletion ConfigMap +2. Waits for graceful deletion (up to 5 minutes) +3. If graceful succeeds, cleans up remaining resources +4. If graceful fails/times out, proceeds with forceful cleanup +5. Reports final status + +### Example 3: Keep User Resources + +**User**: `/rhoai-uninstall keep-all` + +**Claude**: +1. Removes RHOAI operator and core resources +2. Keeps CRDs installed +3. Preserves user data science projects +4. Reports: "✅ RHOAI operator removed. CRDs and user projects preserved." 
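The four modes under Command Usage are described in prose only. A minimal argument-parsing sketch (flag names are illustrative, not part of the command) shows how the steps above could branch on them:

```bash
# Hedged sketch: map /rhoai-uninstall arguments to flags the uninstall
# steps can branch on. Variable and function names are illustrative.
parse_uninstall_args() {
  GRACEFUL=false
  KEEP_CRDS=false
  KEEP_USER_RESOURCES=false
  local arg
  for arg in "$@"; do
    case "$arg" in
      graceful)  GRACEFUL=true ;;
      keep-crds) KEEP_CRDS=true ;;
      keep-all)  KEEP_CRDS=true; KEEP_USER_RESOURCES=true ;;
      *) echo "⚠️ Unknown option: $arg (expected: graceful, keep-crds, or keep-all)" ;;
    esac
  done
}

# The steps then gate on these flags, e.g.:
# [[ "$GRACEFUL" == "true" ]] && run_graceful_uninstall            # Step 3
# [[ "$KEEP_CRDS" == "true" ]] || delete_rhoai_crds                # Step 8
# [[ "$KEEP_USER_RESOURCES" == "true" ]] || delete_user_projects   # Step 9
```

Note that `keep-all` implies `keep-crds`: user workbenches and models are custom resources, so they cannot survive if their CRDs are deleted.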
+
+## Troubleshooting
+
+### Issue 1: Namespaces Stuck in Terminating
+
+**Cause**: Finalizers or webhooks blocking deletion
+
+**Solution**:
+```bash
+# Clear spec.finalizers via the finalize subresource
+# (a plain merge patch of namespace finalizers is not honored)
+oc get namespace <ns-name> -o json | jq '.spec.finalizers=[]' | \
+  oc replace --raw "/api/v1/namespaces/<ns-name>/finalize" -f -
+
+# Delete RHOAI-related webhooks only — `--all` would also remove
+# unrelated webhooks (service mesh, cert-manager, etc.)
+oc get validatingwebhookconfiguration -o name | \
+  grep -E "odh|rhods|opendatahub|kserve" | xargs -r oc delete
+oc get mutatingwebhookconfiguration -o name | \
+  grep -E "odh|rhods|opendatahub|kserve" | xargs -r oc delete
+```
+
+### Issue 2: CRDs Won't Delete
+
+**Cause**: Custom resources still exist
+
+**Solution**: Delete all custom resources first, removing finalizers if needed
+
+### Issue 3: Permission Denied
+
+**Cause**: Insufficient permissions
+
+**Solution**: Must be cluster-admin. Check with `oc auth can-i delete namespace`
+
+## Integration with Other Commands
+
+Typical workflow:
+```
+/oc-login          # Login to cluster
+/rhoai-version     # Check what's installed
+/rhoai-uninstall   # Remove RHOAI
+```
+
+## Success Criteria
+
+Uninstall is successful when:
+- ✅ All RHOAI namespaces deleted
+- ✅ RHOAI operator removed
+- ✅ CRDs deleted (unless kept)
+- ✅ No RHOAI webhooks remain
+- ✅ `oc get csv -A | grep rhods-operator` returns nothing
diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-update.md b/workflows/rhoai-manager/.claude/commands/rhoai-update.md
new file mode 100644
index 00000000..c07b07b9
--- /dev/null
+++ b/workflows/rhoai-manager/.claude/commands/rhoai-update.md
@@ -0,0 +1,642 @@
+# /rhoai-update - Update RHOAI to Newer Build
+
+Update an existing Red Hat OpenShift AI (RHOAI) installation to a newer nightly build or version. 
+ +## Command Usage + +- `/rhoai-update` - Update to latest available nightly (currently 3.4, preserves current channel) +- `/rhoai-update 3.4` - Update to RHOAI 3.4 (preserves current channel) +- `/rhoai-update 3.4-ea.2` - Update to RHOAI 3.4 EA build 2 +- `/rhoai-update 3.4 -c beta` - Update to 3.4 and change channel to beta +- `/rhoai-update 3.3 -c stable-3.3` - Update to 3.3 and change to stable-3.3 channel +- `/rhoai-update 3.4@sha256:abc123...` - Update to 3.4 with specific SHA digest + +## Available Channels + +| Channel | Description | Use Case | +|---------|-------------|----------| +| `beta` | Latest EA builds | Testing 3.4.0-ea.x builds | +| `stable` | Latest GA release across all versions | Production stable | +| `stable-3.4` | RHOAI 3.4.x GA | Latest 3.4 GA nightly (recommended) | +| `stable-3.3` | RHOAI 3.3.x GA | Stable 3.3 releases | + +## Prerequisites + +Before running this command: +1. **Existing RHOAI**: RHOAI must already be installed (use `/rhoai-install` for fresh installations) +2. **Cluster access**: Logged into OpenShift cluster with cluster-admin privileges (use `/oc-login`) +3. 
**Tools installed**: `oc` CLI and `jq` must be available + +## Process + +### Step 1: Parse Input Arguments + +```bash +# Default values +VERSION_ARG="" +CHANNEL="" # Will be set from existing subscription if not specified +USER_SPECIFIED_CHANNEL=false + +# Parse arguments +while [[ $# -gt 0 ]]; do + case $1 in + -c|--channel) + CHANNEL="$2" + USER_SPECIFIED_CHANNEL=true + shift 2 + ;; + *) + VERSION_ARG="$1" + shift + ;; + esac +done + +# Build image URL +if [[ -z "$VERSION_ARG" ]]; then + IMAGE="quay.io/rhoai/rhoai-fbc-fragment:rhoai-3.4" + echo "No version specified, defaulting to RHOAI 3.4" +elif [[ "$VERSION_ARG" == *"/"* ]]; then + IMAGE="$VERSION_ARG" +elif [[ "$VERSION_ARG" == rhoai-* ]]; then + IMAGE="quay.io/rhoai/rhoai-fbc-fragment:${VERSION_ARG}" +else + IMAGE="quay.io/rhoai/rhoai-fbc-fragment:rhoai-${VERSION_ARG}" +fi + +echo "Target image: $IMAGE" +``` + +### Step 2: Verify Cluster Access and Existing Installation + +```bash +# Check prerequisites +command -v oc &>/dev/null || die "oc command not found" +command -v jq &>/dev/null || die "jq command not found" +oc whoami &>/dev/null || die "Not logged into an OpenShift cluster" + +echo "Logged in as: $(oc whoami)" +echo "Cluster: $(oc whoami --show-server)" + +# Verify RHOAI is already installed +if ! oc get csv -n redhat-ods-operator 2>/dev/null | grep -q rhods-operator; then + die "RHOAI is not installed. Use /rhoai-install for fresh installation." +fi + +echo "✅ Detected existing RHOAI installation" +``` + +### Step 3: Handle Channel Preservation/Change + +```bash +# Get existing channel from subscription +EXISTING_CHANNEL=$(oc get subscription -n redhat-ods-operator -o jsonpath='{.items[0].spec.channel}' 2>/dev/null || echo "") + +if [[ -n "$EXISTING_CHANNEL" ]]; then + echo "Current channel: $EXISTING_CHANNEL" + + if [[ "$USER_SPECIFIED_CHANNEL" == "true" && "$CHANNEL" != "$EXISTING_CHANNEL" ]]; then + echo "" + echo "⚠️ WARNING: Channel change requested!" 
+ echo " Current channel: $EXISTING_CHANNEL" + echo " New channel: $CHANNEL" + echo " Changing channels may cause unexpected upgrades or downgrades!" + echo "" + + # In interactive mode, prompt user + # In automated mode, preserve existing channel for safety + if [[ -t 0 ]]; then + read -p "Do you want to CHANGE the channel? [y/N] " -n 1 -r + echo + if [[ ! $REPLY =~ ^[Yy]$ ]]; then + echo "Preserving existing channel: $EXISTING_CHANNEL" + CHANNEL="$EXISTING_CHANNEL" + fi + else + echo "Automated mode: Preserving existing channel for safety" + CHANNEL="$EXISTING_CHANNEL" + fi + else + # User didn't specify channel, preserve existing + CHANNEL="$EXISTING_CHANNEL" + echo "Preserving existing channel: $CHANNEL" + fi +else + # No existing channel found, use beta as default + [[ -z "$CHANNEL" ]] && CHANNEL="beta" + echo "No existing channel found, using: $CHANNEL" +fi + +echo "Target channel: $CHANNEL" +``` + +### Step 4: Clone olminstall Repository + +```bash +OLMINSTALL_REPO="https://gitlab.cee.redhat.com/data-hub/olminstall.git" +OLMINSTALL_DIR="/tmp/olminstall" + +if [ -d "$OLMINSTALL_DIR" ]; then + echo "Updating existing clone..." + git -C "$OLMINSTALL_DIR" pull --rebase --quiet 2>/dev/null || true +else + echo "Cloning from $OLMINSTALL_REPO..." + git clone --quiet "$OLMINSTALL_REPO" "$OLMINSTALL_DIR" +fi + +[[ -d "$OLMINSTALL_DIR" ]] || die "Failed to clone olminstall" +echo "olminstall ready" +``` + +### Step 5: Update RHOAI Catalog + +```bash +cd "$OLMINSTALL_DIR" +bash setup.sh -t operator -i "$IMAGE" -u "$CHANNEL" +``` + +This updates: +- **CatalogSource**: `rhoai-catalog-dev` with new image +- **Subscription**: May update to new channel if specified + +### Step 6: Force Catalog Refresh + +```bash +# Force catalog to pull fresh image by deleting the pod +echo "Forcing catalog refresh to ensure latest component images..." 
+ +CATALOG_POD=$(oc get pod -n openshift-marketplace -l olm.catalogSource=rhoai-catalog-dev -o name 2>/dev/null | head -1) + +if [[ -n "$CATALOG_POD" ]]; then + echo "Deleting catalog pod to force fresh image pull..." + oc delete "$CATALOG_POD" -n openshift-marketplace 2>/dev/null || true + + # Wait for new catalog pod to be ready + TIMEOUT=120 + INTERVAL=5 + ELAPSED=0 + + while [[ $ELAPSED -lt $TIMEOUT ]]; do + NEW_POD=$(oc get pod -n openshift-marketplace -l olm.catalogSource=rhoai-catalog-dev -o jsonpath="{.items[0].status.phase}" 2>/dev/null || echo "") + + if [[ "$NEW_POD" == "Running" ]]; then + echo "✅ Catalog refreshed with latest image" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for new catalog pod... (${ELAPSED}s/${TIMEOUT}s)" + done + + if [[ "$NEW_POD" != "Running" ]]; then + echo "⚠️ WARNING: Catalog pod not ready, image comparison may use stale data" + fi +else + echo "ℹ️ Catalog pod not found yet, will be created fresh" +fi +``` + +### Step 7: Wait for Operator CSV + +```bash +# Wait up to 600 seconds for CSV to reach Succeeded +CSV_PHASE="" +TIMEOUT=600 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CSV_LINE=$(oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator | grep -v Replacing || echo "") + + if [[ -n "$CSV_LINE" ]]; then + CSV_NAME=$(echo "$CSV_LINE" | awk "{print \$1}") + CSV_PHASE=$(echo "$CSV_LINE" | awk "{print \$NF}") + echo "CSV: $CSV_NAME, Phase: $CSV_PHASE" + + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + echo "✅ Operator CSV is in Succeeded state" + break + fi + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for rhods-operator CSV... 
(${ELAPSED}s/${TIMEOUT}s)" +done + +[[ "$CSV_PHASE" == "Succeeded" ]] || die "Operator did not reach Succeeded phase within ${TIMEOUT}s" +``` + +### Step 8: Check for Newer Component Images (Critical for Updates) + +```bash +echo "" +echo "=== Checking for Newer Component Images in Catalog ===" + +# Verify catalog source is using the target image +CATALOG_SOURCE_IMAGE=$(oc get catalogsource rhoai-catalog-dev -n openshift-marketplace -o jsonpath='{.spec.image}' 2>/dev/null || echo "") + +if [[ -n "$CATALOG_SOURCE_IMAGE" ]]; then + echo "CatalogSource image: $CATALOG_SOURCE_IMAGE" + + if [[ "$CATALOG_SOURCE_IMAGE" != "$IMAGE" ]]; then + echo "⚠️ WARNING: CatalogSource image doesn't match target!" + echo " Expected: $IMAGE" + echo " Actual: $CATALOG_SOURCE_IMAGE" + fi +else + echo "⚠️ WARNING: Could not verify CatalogSource image" +fi + +# Get current CSV +CURRENT_CSV=$(oc get csv -n redhat-ods-operator -o jsonpath='{.items[0].metadata.name}' 2>/dev/null | grep rhods-operator) + +# Get catalog pod +CATALOG_POD=$(oc get pod -n openshift-marketplace -l olm.catalogSource=rhoai-catalog-dev -o name 2>/dev/null | head -1) + +if [[ -z "$CATALOG_POD" ]]; then + echo "ℹ️ Catalog pod not found, skipping image comparison" +else + echo "Comparing all component images between CSV and catalog..." 
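+
+  # Note on the comparison below: image references are matched by digest using
+  # bash suffix stripping. Minimal illustration with a hypothetical reference:
+  #
+  #   ref="registry.example.com/odh/dashboard@sha256:abcd1234ef"
+  #   echo "${ref##*@}"   # -> sha256:abcd1234ef
+  #
+  # The first 20 characters of that digest are what gets printed in the
+  # "Current"/"Catalog" lines further down.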
+ + # Get all relatedImages from current CSV + CURRENT_IMAGES=$(oc get csv "$CURRENT_CSV" -n redhat-ods-operator -o json 2>/dev/null | \ + jq -r '.spec.relatedImages[] | "\(.name)|\(.image)"' 2>/dev/null || echo "") + + if [[ -z "$CURRENT_IMAGES" ]]; then + echo "⚠️ Could not retrieve current CSV images" + else + # Get catalog.yaml content once + CATALOG_YAML=$(oc exec -n openshift-marketplace "$CATALOG_POD" -- cat /configs/rhods-operator/catalog.yaml 2>/dev/null || echo "") + + if [[ -z "$CATALOG_YAML" ]]; then + echo "⚠️ Could not retrieve catalog images" + else + IMAGES_DIFFER=false + DIFF_COUNT=0 + + # Compare each image + while IFS='|' read -r img_name img_url; do + [[ -z "$img_name" ]] && continue + + # Extract catalog image for this component + CATALOG_IMAGE=$(echo "$CATALOG_YAML" | grep -A 1 "name: $img_name" | grep "image:" | awk '{print $3}' || echo "") + + if [[ -n "$CATALOG_IMAGE" && "$img_url" != "$CATALOG_IMAGE" ]]; then + # Extract just the digest for cleaner output + CURRENT_DIGEST="${img_url##*@}" + CATALOG_DIGEST="${CATALOG_IMAGE##*@}" + + # Only report if digests actually differ (not just registry URLs) + if [[ "$CURRENT_DIGEST" != "$CATALOG_DIGEST" ]]; then + echo "⚠️ Newer image found: $img_name" + echo " Current: ${CURRENT_DIGEST:0:20}..." + echo " Catalog: ${CATALOG_DIGEST:0:20}..." + IMAGES_DIFFER=true + DIFF_COUNT=$((DIFF_COUNT + 1)) + fi + fi + done <<< "$CURRENT_IMAGES" + + if [[ "$IMAGES_DIFFER" == "true" ]]; then + echo "" + echo "Found $DIFF_COUNT component image(s) with newer versions in catalog." + echo "CSV version is unchanged, but component images have been updated." + echo "Forcing subscription reinstall to pick up newer images..." 
+ echo "" + + # Trigger forced reinstall - SEE STEP 9 BELOW + else + echo "✅ All component images are up to date" + fi + fi + fi +fi +``` + +**Why this matters:** +- OLM may not automatically update if CSV version hasn't changed +- Component images can be updated in the catalog without CSV version bump +- Without forced reinstall, you'd be running old component images + +### Step 9: Perform Forced Reinstall (If Newer Images Found) + +This step only runs if newer component images were detected in Step 8. + +```bash +# Get current subscription info +SUB_NAME=$(oc get subscription -n redhat-ods-operator -o jsonpath='{.items[0].metadata.name}') +CSV_NAME=$(oc get csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator -o jsonpath='{.items[0].metadata.name}') +CURRENT_CHANNEL=$(oc get subscription -n redhat-ods-operator -o jsonpath='{.items[0].spec.channel}') + +echo "Current subscription: $SUB_NAME" +echo "Current CSV: $CSV_NAME" +echo "Current channel: $CURRENT_CHANNEL" + +# Delete CSV +echo "Deleting CSV..." +oc delete csv "$CSV_NAME" -n redhat-ods-operator || true +sleep 10 + +# Delete subscription +echo "Deleting subscription..." +oc delete subscription "$SUB_NAME" -n redhat-ods-operator || true +sleep 5 + +# Recreate subscription with same channel +echo "Recreating subscription (channel: $CURRENT_CHANNEL)..." +cat > /tmp/subscription-rhoai.yaml << YAML +apiVersion: operators.coreos.com/v1alpha1 +kind: Subscription +metadata: + name: rhoai-operator-dev + namespace: redhat-ods-operator +spec: + channel: ${CURRENT_CHANNEL} + installPlanApproval: Automatic + name: rhods-operator + source: rhoai-catalog-dev + sourceNamespace: openshift-marketplace +YAML + +oc apply -f /tmp/subscription-rhoai.yaml + +# Wait for new install plan +echo "Waiting for new install plan..." +sleep 15 + +# Wait for CSV to be installed +echo "Waiting for CSV to be installed from updated catalog..." 
+TIMEOUT=300 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CSV_PHASE=$(oc get csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator -o jsonpath="{.items[0].status.phase}" 2>/dev/null || echo "") + NEW_CSV_NAME=$(oc get csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator -o jsonpath="{.items[0].metadata.name}" 2>/dev/null || echo "") + + echo "CSV: $NEW_CSV_NAME, Phase: ${CSV_PHASE:-Pending}" + + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + echo "✅ CSV reinstalled successfully" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for CSV after reinstall... (${ELAPSED}s/${TIMEOUT}s)" +done + +[[ "$CSV_PHASE" == "Succeeded" ]] || die "CSV did not reach Succeeded after forced reinstall" + +# Verify new images +echo "" +echo "=== Verifying New Component Images ===" +NEW_AUTOML=$(oc get csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator -o jsonpath='{.spec.relatedImages[?(@.name=="odh_mod_arch_automl_image")].image}' 2>/dev/null || echo "") +NEW_AUTORAG=$(oc get csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator -o jsonpath='{.spec.relatedImages[?(@.name=="odh_mod_arch_autorag_image")].image}' 2>/dev/null || echo "") + +[[ -n "$NEW_AUTOML" ]] && echo "AutoML: ${NEW_AUTOML##*@}" +[[ -n "$NEW_AUTORAG" ]] && echo "AutoRAG: ${NEW_AUTORAG##*@}" + +echo "✅ Operator reinstalled with newer component images" +``` + +### Step 10: Configure DSC Components + +```bash +# Wait for DSC to exist +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get datasciencecluster default-dsc &>/dev/null; then + echo "✅ DataScienceCluster found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for DataScienceCluster... (${ELAPSED}s/${TIMEOUT}s)" +done + +if ! 
oc get datasciencecluster default-dsc &>/dev/null; then + echo "⚠️ WARNING: DSC not found. You may need to create it manually." +else + # Patch DSC to enable required components + cat > /tmp/dsc-components-patch.yaml << 'YAML' +spec: + components: + aipipelines: + managementState: Managed + argoWorkflowsControllers: + managementState: Managed + llamastackoperator: + managementState: Managed + mlflowoperator: + managementState: Managed + trainer: + managementState: Removed +YAML + + oc patch datasciencecluster default-dsc --type merge --patch-file /tmp/dsc-components-patch.yaml || \ + die "Failed to patch DataScienceCluster" + + echo "✅ DSC component configuration applied:" + echo " - aipipelines: Managed (with argoWorkflowsControllers)" + echo " - llamastackoperator: Managed" + echo " - mlflowoperator: Managed" + echo " - trainer: Removed (requires JobSet operator)" + + sleep 5 +fi +``` + +### Step 11: Wait for DSC Ready + +```bash +# Wait for DataScienceCluster to be Ready +TIMEOUT=600 +INTERVAL=15 +ELAPSED=0 +DSC_PHASE="" + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + DSC_PHASE=$(oc get datasciencecluster -o jsonpath="{.items[0].status.phase}" 2>/dev/null || echo "Unknown") + echo "DSC phase: $DSC_PHASE" + + if [[ "$DSC_PHASE" == "Ready" ]]; then + echo "✅ DataScienceCluster is Ready" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for DataScienceCluster... 
(${ELAPSED}s/${TIMEOUT}s)" +done + +if [[ "$DSC_PHASE" != "Ready" ]]; then + echo "⚠️ WARNING: DSC is not Ready after ${TIMEOUT}s (current: ${DSC_PHASE:-Unknown})" + echo "Not-ready components:" + oc get dsc default-dsc -o json 2>/dev/null | \ + jq -r '.status.conditions[] | select(.status=="False") | select(.message | test("Removed") | not) | " \(.type): \(.message)"' 2>/dev/null || true +fi +``` + +### Step 12: Wait for Dashboard + +```bash +# Wait for dashboard deployment to be ready +TIMEOUT=300 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + READY=$(oc get deployment rhods-dashboard -n redhat-ods-applications -o jsonpath="{.status.readyReplicas}" 2>/dev/null || echo "0") + DESIRED=$(oc get deployment rhods-dashboard -n redhat-ods-applications -o jsonpath="{.spec.replicas}" 2>/dev/null || echo "0") + + if [[ "$READY" -gt 0 && "$READY" -eq "$DESIRED" ]]; then + echo "✅ Dashboard deployment is ready" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for dashboard deployment... (${ELAPSED}s/${TIMEOUT}s)" +done + +if [[ "$READY" -lt "$DESIRED" ]]; then + echo "⚠️ WARNING: Dashboard deployment not fully ready" +fi + +echo "Dashboard containers:" +oc get deployment rhods-dashboard -n redhat-ods-applications \ + -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\n"}{end}' 2>/dev/null || \ + echo " Dashboard deployment not found" +``` + +### Step 13: Configure Dashboard Features + +```bash +# Wait for OdhDashboardConfig to exist +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + echo "✅ OdhDashboardConfig found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo "Waiting for OdhDashboardConfig... (${ELAPSED}s/${TIMEOUT}s)" +done + +if ! 
oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + echo "⚠️ WARNING: OdhDashboardConfig not found yet, feature flags will be configured when available" +else + # Enable feature flags + oc patch odhdashboardconfig odh-dashboard-config -n redhat-ods-applications --type merge -p '{ + "spec": { + "dashboardConfig": { + "automl": true, + "autorag": true, + "genAiStudio": true + } + } + }' || { + echo "⚠️ WARNING: Failed to patch dashboard config, feature flags may need manual configuration" + } + + echo "✅ Dashboard feature flags configured:" + echo " - automl: enabled" + echo " - autorag: enabled" + echo " - genAiStudio: enabled" + + # Restart dashboard to pick up changes + echo "Restarting dashboard to apply feature flag changes..." + oc rollout restart deployment rhods-dashboard -n redhat-ods-applications 2>/dev/null || true + sleep 3 +fi +``` + +### Step 14: Verify Update + +```bash +echo "" +echo "=== Update Summary ===" + +# Show CSV +echo "" +echo "CSV:" +oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator || echo " WARNING: CSV not found" + +# Show Dashboard URL +echo "" +echo "Dashboard:" +DASHBOARD_ROUTE=$(oc get route rhods-dashboard -n redhat-ods-applications -o jsonpath='{.spec.host}' 2>/dev/null || echo "") +if [[ -n "$DASHBOARD_ROUTE" ]]; then + echo " https://$DASHBOARD_ROUTE" +else + echo " WARNING: Dashboard route not found yet" +fi + +echo "" +echo "✅ RHOAI update complete!" 
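+
+# Write the update report described under "Output" below. This is a sketch:
+# it assumes the variables set in earlier steps (IMAGE, CHANNEL, DSC_PHASE,
+# DASHBOARD_ROUTE); extend the fields as needed (CSV old/new, image diff,
+# whether a forced reinstall ran, etc.).
+REPORT_DIR="artifacts/rhoai-update/reports"
+mkdir -p "$REPORT_DIR"
+REPORT_FILE="$REPORT_DIR/update-report-$(date +%Y%m%d-%H%M%S).md"
+{
+  echo "# RHOAI Update Report"
+  echo ""
+  echo "- Target image: ${IMAGE:-unknown}"
+  echo "- Channel: ${CHANNEL:-unknown}"
+  echo "- DSC phase: ${DSC_PHASE:-unknown}"
+  echo "- Dashboard: https://${DASHBOARD_ROUTE:-route-not-found}"
+} > "$REPORT_FILE"
+echo "Report saved to $REPORT_FILE"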
+``` + +## Output + +The command creates a report at `artifacts/rhoai-update/reports/update-report-[timestamp].md` with: +- Update parameters (version, channel, image) +- Operator CSV details (old vs new) +- Component image comparison results +- Whether forced reinstall was performed +- DataScienceCluster status +- Dashboard URL + +## Usage Examples + +```bash +# Update to latest RHOAI (preserves current channel) +/rhoai-update + +# Update to RHOAI 3.4 EA build 2 +/rhoai-update 3.4-ea.2 + +# Update to RHOAI 3.3 stable and change channel +/rhoai-update 3.3 -c stable-3.3 + +# Update with specific SHA digest +/rhoai-update 3.4@sha256:abc123def456... +``` + +Or simply ask: +- "Update RHOAI to latest" +- "Upgrade to RHOAI 3.4" +- "Update RHOAI to latest nightly" + +## Common Issues + +**Problem:** Component images not updating even though catalog was updated +**Solution:** This is expected - the forced reinstall (Step 9) handles this automatically + +**Problem:** Channel change warning appears +**Solution:** Confirm you want to change channels, or let it preserve the existing channel + +**Problem:** DSC components revert to default after update +**Solution:** The command re-applies component configuration in Step 10 + +**Problem:** Dashboard shows old features after update +**Solution:** Feature flags are re-applied in Step 13, dashboard pod is restarted + +## Next Steps + +After updating: +1. Verify all workloads are still running +2. Check dashboard for new features +3. Test model deployments +4. Review component logs for any errors + +To check current RHOAI version and build info, use `/rhoai-version`. 
diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-version.md b/workflows/rhoai-manager/.claude/commands/rhoai-version.md new file mode 100644 index 00000000..87877110 --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/rhoai-version.md @@ -0,0 +1,169 @@ +# /rhoai-version - Detect RHOAI Version and Build Information + +Detect the Red Hat OpenShift AI (RHOAI) version and build information installed on the currently connected OpenShift cluster. + +## Purpose + +This command provides comprehensive version information about the RHOAI installation including operator version, component status, and deployed image digests. + +## Prerequisites + +- Must be logged into an OpenShift cluster (use `/oc-login` first if needed) +- RHOAI must be installed on the cluster + +## Steps + +### 1. Verify OpenShift Login + +Run `oc whoami` and `oc whoami --show-server` to confirm you are logged into an OpenShift cluster. If not logged in, stop and inform the user they need to authenticate first with `/oc-login`. + +### 2. Detect RHOAI Operator Subscription + +Run the following to extract subscription details directly (avoid `-o yaml` as it produces excessive output): + +```bash +oc get subscriptions.operators.coreos.com rhods-operator -n redhat-ods-operator -o jsonpath='Channel: {.spec.channel}{"\n"}Source: {.spec.source}{"\n"}Approval: {.spec.installPlanApproval}{"\n"}Current CSV: {.status.currentCSV}{"\n"}Installed CSV: {.status.installedCSV}{"\n"}Starting CSV: {.spec.startingCSV}{"\n"}' 2>/dev/null +``` + +If no subscription found in `redhat-ods-operator`, check `openshift-operators`: + +```bash +oc get subscriptions.operators.coreos.com -n openshift-operators -o jsonpath='{range .items[?(@.spec.name=="rhods-operator")]}Channel: {.spec.channel}{"\n"}Source: {.spec.source}{"\n"}Approval: {.spec.installPlanApproval}{"\n"}Current CSV: {.status.currentCSV}{"\n"}{end}' 2>/dev/null +``` + +### 3. 
Check ClusterServiceVersion (CSV) + +Get only the RHOAI operator CSV (filter by name to avoid noisy output): + +```bash +oc get csv -n redhat-ods-operator -o custom-columns=NAME:.metadata.name,DISPLAY:.spec.displayName,VERSION:.spec.version,PHASE:.status.phase 2>/dev/null | grep -E 'NAME|rhods-operator' +``` + +### 4. Check DataScienceCluster + +**Do NOT use `-o yaml` for the full DSC resource** — it is very large and the jsonpath for nested dynamic component keys does not extract cleanly. + +Instead, use these targeted commands: + +**Get component managementState values:** +```bash +oc get datasciencecluster default-dsc -o json 2>/dev/null | python3 -c " +import sys, json +dsc = json.load(sys.stdin) +comps = dsc.get('spec', {}).get('components', {}) +for name, cfg in sorted(comps.items()): + state = cfg.get('managementState', 'Unknown') if isinstance(cfg, dict) else 'Unknown' + print(f' {name}: {state}') +" +``` + +**Get status conditions:** +```bash +oc get datasciencecluster default-dsc -o jsonpath='{range .status.conditions[*]}{.type}: {.status} ({.reason}){"\n"}{end}' 2>/dev/null +``` + +### 5. Check DSCInitialization + +```bash +oc get dscinitializations default-dsci -o jsonpath='Name: {.metadata.name}{"\n"}Monitoring: {.spec.monitoring.managementState}{"\n"}' 2>/dev/null +``` + +### 6. Extract Operator Image + +```bash +oc get deployment rhods-operator -n redhat-ods-operator -o jsonpath='{.spec.template.spec.containers[*].image}' 2>/dev/null +``` + +If not found, try the ODH deployment name: +```bash +oc get deployment opendatahub-operator-controller-manager -n openshift-operators -o jsonpath='{.spec.template.spec.containers[*].image}' 2>/dev/null +``` + +### 7. Get Component Images (Always Run) + +Collect all deployed component images from `redhat-ods-applications`. This is NOT optional — always include this table. 
+ +```bash +oc get deployments -n redhat-ods-applications -o custom-columns='COMPONENT:.metadata.name,IMAGE:.spec.template.spec.containers[0].image' 2>/dev/null +``` + +Parse each image to extract a short image name and the `sha256` digest. Present as a markdown table: + +``` +| Component | Image | Digest (short) | +|----------------------------------|------------------------------------------------------|-----------------| +| rhods-dashboard | odh-dashboard-rhel9 | sha256:db295f.. | +| kserve-controller-manager | odh-kserve-controller-rhel9 | sha256:e83b4b.. | +| ... | ... | ... | +``` + +To build this table: +- **Component** = the deployment name +- **Image** = the portion after the last `/` and before `@sha256:` (e.g., `odh-dashboard-rhel9`) +- **Digest (short)** = first 8 characters of the sha256 hash + +### 8. Present Summary + +Output a clear summary in this format: + +``` +== RHOAI Version Summary == + +Cluster: <server URL> +Logged in as: <username> + +Operator: + Name: <CSV name> + Version: <version> + Phase: <phase> + Channel: <subscription channel> + Source: <catalog source> + Approval: <install plan approval> + Operator Image: <image reference> + +DataScienceCluster: + Name: default-dsc + Status: <Ready/Not Ready> (<conditions summary>) + Components: + - <component>: <Managed|Removed> + ... + +DSCInitialization: + Name: default-dsci + Monitoring: <Managed/Removed> + +== Component Images (redhat-ods-applications) == + +| Component | Image | Digest (short) | +|-----------|-------|-----------------| +| ... | ... | ... | +``` + +If any resource is not found, note it clearly (e.g., "Not installed" or "Namespace not found") rather than failing silently. + +## Important Notes + +- **Do NOT use `oc get datasciencecluster -o yaml`** — the output is extremely large (hundreds of lines) and jsonpath with dynamic component keys fails to extract cleanly. Use the `python3 -c` approach in Step 4 or targeted jsonpath for conditions. 
+- **Do NOT use `-o yaml` for subscriptions** — use targeted jsonpath to extract only the fields you need. +- **The DSC resource name is `default-dsc`** and DSCI is `default-dsci` on standard RHOAI installs. Always reference by name for reliable extraction. +- **Component keys in `spec.components` are dynamic** and vary by RHOAI version. Do not hardcode a list — iterate over whatever keys exist. +- **Status conditions have changed across versions** — older versions used `Available/Progressing/Degraded/Upgradeable`, newer versions (3.x) use per-component `*Ready` conditions plus `Ready`, `ProvisioningSucceeded`, `ComponentsReady`. Handle both. + +## Example Usage + +**User**: `/rhoai-version` + +**Claude**: +1. Checks if user is logged into cluster +2. Queries RHOAI operator subscription and CSV +3. Checks DataScienceCluster and DSCInitialization status +4. Lists all component images with digests +5. Presents formatted summary with all version information + +## Integration with Other Commands + +This command is useful: +- Before running `/rhoai-update` to know current version +- After running `/rhoai-update` to verify the new version +- For troubleshooting RHOAI installations +- For documenting the current cluster state diff --git a/workflows/rhoai-manager/README.md b/workflows/rhoai-manager/README.md new file mode 100644 index 00000000..0ecaa044 --- /dev/null +++ b/workflows/rhoai-manager/README.md @@ -0,0 +1,271 @@ +# RHOAI Manager + +Comprehensive workflow for managing the complete lifecycle of Red Hat OpenShift AI: installation, updates, version detection, and uninstallation. 
+ +## Overview + +This workflow provides an AI-powered pipeline for: +- Installing RHOAI from scratch on OpenShift clusters +- Updating RHOAI to latest nightly builds +- Detecting version and build information +- Completely uninstalling RHOAI when needed +- Managing cluster connections and authentication + +## Structure + +``` +workflows/rhoai-manager/ +├── .ambient/ +│ └── ambient.json # Workflow configuration +├── .claude/ +│ └── commands/ +│ ├── oc-login.md # OpenShift cluster login command +│ ├── rhoai-install.md # RHOAI installation command +│ ├── rhoai-version.md # RHOAI version detection command +│ ├── rhoai-update.md # RHOAI update command +│ └── rhoai-uninstall.md # RHOAI uninstall command +└── README.md # This file +``` + +## Commands + +### /oc-login + +Login to OpenShift cluster using credentials from Ambient session. + +**What it does:** +- Checks for required credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD) +- Verifies `oc` CLI is installed +- Executes login to the cluster +- Verifies connection and displays cluster info + +**Usage:** +``` +/oc-login +``` + +Or simply ask: +- "Login to my cluster" +- "Connect to OpenShift" +- "Login to OCP" + +**Required Environment Variables:** +- `OCP_SERVER` - OpenShift cluster API URL (e.g., `https://api.cluster.example.com:6443`) +- `OCP_USERNAME` - Your OpenShift username +- `OCP_PASSWORD` - Your OpenShift password + +### /rhoai-install + +Install RHOAI from scratch on an OpenShift cluster. 
+ +**What it does:** +- Sets up OLM catalog source (dev or GA production) +- Creates operator namespace and subscription +- Waits for ClusterServiceVersion (CSV) to be ready +- Creates DataScienceCluster with component configuration +- Patches component states (Managed/Removed) as needed +- Verifies all components are healthy and reconciled +- Provides detailed installation summary + +**Usage:** +```bash +# Development/Nightly builds (default) +/rhoai-install # Latest dev catalog +/rhoai-install channel=beta # Dev catalog, beta channel +/rhoai-install image=quay.io/modh/rhoai-catalog:latest-release-3.5 + +# GA Production releases +/rhoai-install catalog=redhat-operators # GA catalog, stable channel +/rhoai-install catalog=redhat-operators channel=fast # GA catalog, fast channel +``` + +**Parameters:** +- `catalog` - Catalog source (`rhoai-catalog-dev` for nightly, `redhat-operators` for GA) +- `channel` - Subscription channel (`beta`, `fast`, or `stable`) +- `image` - Custom catalog image (only for `rhoai-catalog-dev`) + +Or simply ask: +- "Install RHOAI on the cluster" +- "Deploy RHOAI from production catalog" +- "Set up RHOAI for testing" + +**Prerequisites:** +- OpenShift cluster (version 4.12+) +- Logged into cluster with admin permissions (use `/oc-login`) +- No existing RHOAI installation + +**What gets deployed:** +- **Operator namespace**: `redhat-ods-operator` +- **Application namespace**: `redhat-ods-applications` +- **Monitoring namespace**: `redhat-ods-monitoring` +- **DataScienceCluster**: Custom resource managing all RHOAI components +- **Component operators**: Dashboard, Workbenches, Model Serving, Pipelines, etc. + +**Note:** Defaults to `rhoai-catalog-dev` for nightly builds. Use `catalog=redhat-operators` for GA production releases. + +### /rhoai-version + +Detect RHOAI version and build information. 
+ +**What it does:** +- Checks RHOAI operator subscription and ClusterServiceVersion +- Reports DataScienceCluster status and component states +- Lists all component images with SHA256 digests +- Provides comprehensive version summary + +**Usage:** +``` +/rhoai-version +``` + +Or simply ask: +- "What version of RHOAI is installed?" +- "Check RHOAI version" +- "Show me RHOAI build info" + +**Note:** You must be logged into the cluster first (use `/oc-login`) + +### /rhoai-update + +Updates RHOAI to the latest nightly build. + +**What it does:** +- Checks current RHOAI version and component states +- Updates the OLM catalog source to latest nightly +- Monitors the operator upgrade process +- Handles special scenarios: + - **Scenario E**: Forced subscription reinstall when component images update without CSV version change + - Channel preservation across updates + - Component state preservation (Managed/Removed) +- Verifies component reconciliation +- Reports final status with before/after comparison + +**Usage:** +``` +/rhoai-update +``` + +Or simply ask: +- "Update RHOAI to latest nightly" +- "Upgrade to RHOAI 3.4 nightly" +- "Update RHOAI" + +**Note:** You must be logged into the cluster first (use `/oc-login`) + +**Advanced Features:** +- Detects when component images have newer builds without CSV version changes +- Automatically triggers forced reinstall in these cases +- Preserves DataScienceCluster component configuration across updates +- Waits for all components to reconcile before completing + +### /rhoai-uninstall + +Completely uninstall RHOAI from an OpenShift cluster. + +**What it does:** +- Removes RHOAI operator and subscriptions +- Deletes custom resources (DataScienceCluster, DSCInitialization, etc.) 
+- Cleans up webhooks and finalizers +- Removes RHOAI namespaces +- Deletes CRDs (optional) +- Cleans up user data science projects (optional) + +**Usage:** +``` +/rhoai-uninstall # Standard forceful uninstall +/rhoai-uninstall graceful # Graceful uninstall followed by cleanup +/rhoai-uninstall keep-crds # Keep CRDs installed +/rhoai-uninstall keep-all # Keep CRDs and user resources +``` + +Or simply ask: +- "Uninstall RHOAI from the cluster" +- "Remove RHOAI completely" +- "Clean up RHOAI installation" + +**Warning:** This will delete all RHOAI resources including user workbenches, models, and data. Backup important work first. + +**Note:** You must be logged into the cluster first (use `/oc-login`) and have cluster-admin permissions. + +## Prerequisites + +- OpenShift cluster (version 4.12+) +- `oc` CLI installed (auto-installed if missing) +- Cluster credentials configured in Ambient session: + - `OCP_SERVER` - OpenShift cluster API URL + - `OCP_USERNAME` - Your OpenShift username + - `OCP_PASSWORD` - Your OpenShift password +- Cluster admin permissions + +## Output Artifacts + +All artifacts are stored in `artifacts/rhoai-manager/`: + +- `reports/*.md` - Installation and update reports with version changes +- `version/*.md` - Version detection summaries +- `logs/*.log` - Detailed execution logs + +## Typical Workflows + +### Fresh Installation +``` +1. /oc-login # Connect to cluster +2. /rhoai-install # Install RHOAI from scratch +3. /rhoai-version # Verify installation +``` + +### Regular Updates +``` +1. /oc-login # Connect to cluster +2. /rhoai-version # Check current version +3. /rhoai-update # Update to latest nightly +4. /rhoai-version # Verify new version +``` + +### Decommissioning +``` +1. /oc-login # Connect to cluster +2. 
/rhoai-uninstall # Remove RHOAI completely +``` + +## GitHub Actions Integration + +This workflow is designed to run via GitHub Actions with Ambient: + +```yaml +- name: Update RHOAI to Latest Nightly + uses: ambient-code/ambient-action@v0.0.2 + with: + api-token: ${{ secrets.AMBIENT_API_TOKEN }} + workflow: workflows/rhoai-manager + prompt: Update RHOAI to the latest nightly build +``` + +## Technical Details + +### Catalog Source +- Uses `rhoai-catalog-dev` for nightly builds +- Image: `quay.io/modh/rhoai-catalog:latest-release-3.4` +- Updates trigger operator upgrades automatically + +### Component Management +- DataScienceCluster manages component states (Managed/Removed) +- Component states preserved across updates +- Individual component image tracking with SHA digests + +### Update Scenarios +The workflow handles several update scenarios: +- **Normal CSV upgrade**: Operator version changes +- **Forced reinstall**: Component images update without CSV version change +- **Channel migration**: Updates preserve subscription channel +- **Catalog refresh**: Forces OLM to re-evaluate available updates + +## Future Enhancements + +- [ ] Automated test suite execution after updates +- [ ] Test result parsing and analysis +- [ ] JIRA integration for issue updates +- [ ] Slack/email notifications +- [ ] Rollback capabilities +- [ ] Pre-upgrade validation checks +- [ ] Multi-cluster support From 920e5d9669d39073d73179dd5e8d8c0a4cf50f43 Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi <vamodi@redhat.com> Date: Fri, 3 Apr 2026 14:27:33 -0400 Subject: [PATCH 15/23] feat: add odh-install and odh-update commands to rhoai-manager workflow New commands for installing and updating Open Data Hub (ODH) nightlies: - /odh-install: Fresh ODH install via OLM using odh-stable-nightly catalog - Creates CatalogSource, Subscription, DSCInitialization, DataScienceCluster - No olminstall needed (uses openshift-operators global OperatorGroup) - /odh-update: Update existing ODH to latest nightly 
- Updates CatalogSource, forces catalog pod refresh - Auto-detects new CSV version (OLM handles upgrade naturally) - Falls back to forced reinstall if only component images changed Key differences from RHOAI: - Package: opendatahub-operator (vs rhods-operator) - Namespace: openshift-operators (vs redhat-ods-operator) - Catalog: quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly - Channel: fast Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> --- .../.claude/commands/odh-install.md | 280 ++++++++++++++++++ .../.claude/commands/odh-update.md | 244 +++++++++++++++ 2 files changed, 524 insertions(+) create mode 100644 workflows/rhoai-manager/.claude/commands/odh-install.md create mode 100644 workflows/rhoai-manager/.claude/commands/odh-update.md diff --git a/workflows/rhoai-manager/.claude/commands/odh-install.md b/workflows/rhoai-manager/.claude/commands/odh-install.md new file mode 100644 index 00000000..680f0a6c --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/odh-install.md @@ -0,0 +1,280 @@ +# /odh-install - Install Open Data Hub on OpenShift Cluster + +Install Open Data Hub (ODH) on an OpenShift cluster using OLM (Operator Lifecycle Manager). 
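At a glance, the command applies four resources in order: CatalogSource, Subscription, DSCInitialization, DataScienceCluster. The read-only sketch below checks that chain on any cluster; the resource names are this command's defaults from the steps that follow, and the `command -v oc` guard is our addition for environments where the CLI is absent.

```shell
# Verify the resource chain created by /odh-install (default names).
# Read-only; prints "skipped" when the oc CLI is not on PATH.
if command -v oc >/dev/null 2>&1; then
  oc get catalogsource/odh-catalog -n openshift-marketplace 2>/dev/null || true
  oc get subscription/opendatahub-operator -n openshift-operators 2>/dev/null || true
  oc get dscinitialization/default-dsci 2>/dev/null || true
  oc get datasciencecluster/default-dsc 2>/dev/null || true
  CHAIN_CHECK="checked"
else
  CHAIN_CHECK="skipped"
fi
echo "chain check: $CHAIN_CHECK"
```

If any of the four resources is missing, the corresponding install step below is the one to re-run or debug.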
+ +## Command Usage + +```bash +/odh-install # Latest stable nightly (default) +/odh-install channel=fast # Explicit fast channel +/odh-install image=quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly +/odh-install channel=fast image=quay.io/opendatahub/opendatahub-operator-catalog:latest +``` + +## Available Tags + +| Image Tag | Description | Use Case | +|-----------|-------------|----------| +| `odh-stable-nightly` (default) | Daily nightly from main branch | Testing latest ODH builds | +| `latest` | Most recent CI build (any branch) | Bleeding edge | +| `odh-stable` | Latest stable release | Stable deployments | + +## Available Channels + +| Channel | Description | +|---------|-------------| +| `fast` (default) | Frequent releases tracking main | +| `stable` | Stable releases only | + +## Key Differences from RHOAI + +| | RHOAI | ODH | +|-|-------|-----| +| Package | `rhods-operator` | `opendatahub-operator` | +| Operator namespace | `redhat-ods-operator` | `openshift-operators` | +| App namespace | `redhat-ods-applications` | `opendatahub` | +| Catalog image | `quay.io/rhoai/rhoai-fbc-fragment` | `quay.io/opendatahub/opendatahub-operator-catalog` | +| Default channel | `stable-3.4` / `beta` | `fast` | + +## Prerequisites + +1. **Cluster access**: Logged into OpenShift cluster with cluster-admin privileges (use `/oc-login`) +2. **Tools installed**: `oc` CLI must be available +3. 
**No existing ODH**: For fresh installations only (use `/odh-update` to update) + +## Process + +### Step 1: Parse Input Arguments + +```bash +CATALOG_IMAGE="quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly" +CHANNEL="fast" + +for arg in "$@"; do + case "$arg" in + channel=*) + CHANNEL="${arg#*=}" + ;; + image=*) + CATALOG_IMAGE="${arg#*=}" + ;; + *) + echo "Unknown parameter: $arg (expected: channel= or image=)" + ;; + esac +done + +echo "Catalog image: $CATALOG_IMAGE" +echo "Channel: $CHANNEL" +``` + +### Step 2: Verify Cluster Access + +```bash +oc whoami &>/dev/null || { echo "ERROR: Not logged into OpenShift cluster"; exit 1; } +echo "Logged in as: $(oc whoami)" +echo "Cluster: $(oc whoami --show-server)" + +# Check not already installed +if oc get csv -n openshift-operators 2>/dev/null | grep -q opendatahub-operator; then + echo "ERROR: ODH already installed. Use /odh-update to update." + exit 1 +fi +echo "No existing ODH installation detected" +``` + +### Step 3: Create CatalogSource + +```bash +echo "Creating ODH CatalogSource..." +cat << EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: CatalogSource +metadata: + name: odh-catalog + namespace: openshift-marketplace +spec: + sourceType: grpc + image: ${CATALOG_IMAGE} + displayName: Open Data Hub + publisher: ODH Community + updateStrategy: + registryPoll: + interval: 15m + grpcPodConfig: + securityContextConfig: restricted +EOF + +# Wait for catalog pod to be running +TIMEOUT=120 +ELAPSED=0 +while [[ $ELAPSED -lt $TIMEOUT ]]; do + PHASE=$(oc get pod -n openshift-marketplace -l olm.catalogSource=odh-catalog \ + -o jsonpath='{.items[0].status.phase}' 2>/dev/null || echo "") + if [[ "$PHASE" == "Running" ]]; then + echo "CatalogSource ready" + break + fi + sleep 5 + ELAPSED=$((ELAPSED + 5)) + echo "Waiting for catalog pod... 
(${ELAPSED}s/${TIMEOUT}s)" +done +``` + +### Step 4: Create Subscription + +ODH installs into `openshift-operators` which already has a global OperatorGroup — no need to create one. + +```bash +echo "Creating ODH Subscription..." +cat << EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: Subscription +metadata: + name: opendatahub-operator + namespace: openshift-operators +spec: + channel: ${CHANNEL} + name: opendatahub-operator + source: odh-catalog + sourceNamespace: openshift-marketplace + installPlanApproval: Automatic +EOF +``` + +### Step 5: Wait for Operator CSV + +```bash +TIMEOUT=600 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CSV_LINE=$(oc get csv -n openshift-operators 2>/dev/null | grep opendatahub-operator || echo "") + if [[ -n "$CSV_LINE" ]]; then + CSV_NAME=$(echo "$CSV_LINE" | awk '{print $1}') + CSV_PHASE=$(echo "$CSV_LINE" | awk '{print $NF}') + echo "CSV: $CSV_NAME, Phase: $CSV_PHASE" + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + echo "ODH operator installed successfully" + break + fi + fi + sleep 10 + ELAPSED=$((ELAPSED + 10)) + echo "Waiting for CSV... (${ELAPSED}s/${TIMEOUT}s)" +done + +[[ "$CSV_PHASE" == "Succeeded" ]] || { echo "ERROR: CSV did not reach Succeeded"; exit 1; } +``` + +### Step 6: Create DSCInitialization + +```bash +echo "Creating DSCInitialization..." +cat << EOF | oc apply -f - +apiVersion: dscinitialization.opendatahub.io/v1 +kind: DSCInitialization +metadata: + name: default-dsci +spec: + applicationsNamespace: opendatahub + monitoring: + managementState: Managed + namespace: opendatahub + trustedCABundle: + managementState: Managed + devFlags: + logMode: production +EOF +sleep 10 +``` + +### Step 7: Create DataScienceCluster + +```bash +echo "Creating DataScienceCluster..." 
+cat << EOF | oc apply -f - +apiVersion: datasciencecluster.opendatahub.io/v1 +kind: DataScienceCluster +metadata: + name: default-dsc +spec: + components: + dashboard: + managementState: Managed + workbenches: + managementState: Managed + datasciencepipelines: + managementState: Managed + kserve: + managementState: Managed + serving: + managementState: Removed + modelmeshserving: + managementState: Managed + ray: + managementState: Managed + kueue: + managementState: Managed + trainingoperator: + managementState: Managed + trustyai: + managementState: Managed + modelregistry: + managementState: Managed + feastoperator: + managementState: Managed +EOF +``` + +### Step 8: Wait for DSC Ready + +```bash +TIMEOUT=600 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + READY=$(oc get datasciencecluster default-dsc \ + -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || echo "") + echo "DSC Ready: ${READY:-Unknown}" + if [[ "$READY" == "True" ]]; then + echo "DataScienceCluster is Ready" + break + fi + sleep 15 + ELAPSED=$((ELAPSED + 15)) + echo "Waiting for DSC... (${ELAPSED}s/${TIMEOUT}s)" +done +``` + +### Step 9: Verify Installation + +```bash +echo "" +echo "=== ODH Installation Summary ===" +echo "" +echo "CSV:" +oc get csv -n openshift-operators | grep opendatahub-operator + +echo "" +echo "DSC Status:" +oc get datasciencecluster default-dsc \ + -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.status}{"\n"}{end}' | grep -v "False" + +echo "" +echo "Dashboard:" +DASHBOARD=$(oc get route odh-dashboard -n opendatahub -o jsonpath='{.spec.host}' 2>/dev/null || echo "Not ready yet") +echo " https://$DASHBOARD" + +echo "" +echo "ODH installation complete!" 
+``` + +## Common Issues + +| Problem | Solution | +|---------|----------| +| CSV stuck in `Installing` | Check operator pod logs: `oc logs -n openshift-operators -l name=opendatahub-operator` | +| DSC not Ready | Check components: `oc get dsc default-dsc -o yaml \| grep -A5 conditions` | +| Feast label selector error | Delete old deployment: `oc delete deployment feast-operator-controller-manager -n opendatahub` | +| Catalog pod not starting | Check image pull: `oc describe pod -n openshift-marketplace -l olm.catalogSource=odh-catalog` | diff --git a/workflows/rhoai-manager/.claude/commands/odh-update.md b/workflows/rhoai-manager/.claude/commands/odh-update.md new file mode 100644 index 00000000..e1f103cd --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/odh-update.md @@ -0,0 +1,244 @@ +# /odh-update - Update Open Data Hub to Latest Nightly + +Update an existing ODH installation to the latest nightly build or a specific version. + +## Command Usage + +```bash +/odh-update # Pull latest odh-stable-nightly +/odh-update image=quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly +/odh-update image=quay.io/opendatahub/opendatahub-operator-catalog:latest +``` + +## Available Image Tags + +| Tag | Updated | Use Case | +|-----|---------|----------| +| `odh-stable-nightly` (default) | Daily at midnight UTC | Pull latest nightly | +| `latest` | On every push | Bleeding edge | +| `odh-stable` | Stable releases | Stable deployments | + +## How ODH Updates Work + +ODH nightlies typically bump the CSV version daily (unlike RHOAI stable which keeps the same version). This means: +- **Updating the CatalogSource + refreshing the catalog pod** is usually enough +- OLM detects the new CSV version and auto-creates an InstallPlan +- No forced reinstall needed in most cases (unlike RHOAI) + +If the CSV version doesn't change (component images only), this command handles the forced reinstall automatically. + +## Prerequisites + +1. 
**Existing ODH**: ODH must already be installed (use `/odh-install` for fresh installations) +2. **Cluster access**: Logged into OpenShift cluster with cluster-admin privileges (use `/oc-login`) + +## Process + +### Step 1: Parse Input Arguments + +```bash +CATALOG_IMAGE="quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly" + +for arg in "$@"; do + case "$arg" in + image=*) + CATALOG_IMAGE="${arg#*=}" + ;; + *) + echo "Unknown parameter: $arg (expected: image=)" + ;; + esac +done + +echo "Target catalog image: $CATALOG_IMAGE" +``` + +### Step 2: Verify Cluster Access and Existing Installation + +```bash +oc whoami &>/dev/null || { echo "ERROR: Not logged into OpenShift cluster"; exit 1; } +echo "Logged in as: $(oc whoami)" +echo "Cluster: $(oc whoami --show-server)" + +CSV_LINE=$(oc get csv -n openshift-operators 2>/dev/null | grep opendatahub-operator || echo "") +[[ -n "$CSV_LINE" ]] || { echo "ERROR: ODH not installed. Use /odh-install first."; exit 1; } + +CURRENT_CSV=$(echo "$CSV_LINE" | awk '{print $1}') +CURRENT_CHANNEL=$(oc get subscription opendatahub-operator -n openshift-operators \ + -o jsonpath='{.spec.channel}' 2>/dev/null || echo "fast") + +echo "Current CSV: $CURRENT_CSV" +echo "Current channel: $CURRENT_CHANNEL (will be preserved)" +``` + +### Step 3: Update CatalogSource + +```bash +echo "Updating ODH CatalogSource to: $CATALOG_IMAGE" +oc patch catalogsource odh-catalog -n openshift-marketplace --type=merge \ + -p "{\"spec\":{\"image\":\"${CATALOG_IMAGE}\"}}" 2>&1 || { + # CatalogSource may not exist yet, create it + cat << EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: CatalogSource +metadata: + name: odh-catalog + namespace: openshift-marketplace +spec: + sourceType: grpc + image: ${CATALOG_IMAGE} + displayName: Open Data Hub + publisher: ODH Community + updateStrategy: + registryPoll: + interval: 15m +EOF +} +``` + +### Step 4: Force Catalog Refresh + +```bash +echo "Forcing catalog pod to pull latest 
image..." +oc delete pod -n openshift-marketplace -l olm.catalogSource=odh-catalog 2>/dev/null || true + +TIMEOUT=120 +ELAPSED=0 +while [[ $ELAPSED -lt $TIMEOUT ]]; do + PHASE=$(oc get pod -n openshift-marketplace -l olm.catalogSource=odh-catalog \ + -o jsonpath='{.items[0].status.phase}' 2>/dev/null || echo "") + if [[ "$PHASE" == "Running" ]]; then + echo "Catalog refreshed with latest image" + break + fi + sleep 5 + ELAPSED=$((ELAPSED + 5)) + echo "Waiting for catalog pod... (${ELAPSED}s/${TIMEOUT}s)" +done +``` + +### Step 5: Wait for OLM to Detect New Version + +OLM polls the catalog every 15 minutes but also reacts within ~30s of the catalog pod coming up. + +```bash +echo "Waiting for OLM to detect new CSV version..." +sleep 30 + +NEW_CSV_LINE=$(oc get csv -n openshift-operators 2>/dev/null | grep opendatahub-operator || echo "") +NEW_CSV=$(echo "$NEW_CSV_LINE" | awk '{print $1}') + +if [[ "$NEW_CSV" != "$CURRENT_CSV" ]]; then + echo "New CSV detected: $NEW_CSV (was: $CURRENT_CSV)" + echo "OLM is auto-upgrading..." +else + echo "CSV version unchanged: $CURRENT_CSV" + echo "Checking for newer component images in catalog..." + + # Get catalog operator image + CATALOG_POD=$(oc get pod -n openshift-marketplace -l olm.catalogSource=odh-catalog -o name | head -1) + CATALOG_OP=$(oc exec -n openshift-marketplace $CATALOG_POD -- \ + sh -c "grep -B1 'odh_rhel9_operator_image\|manager_image' /configs/opendatahub-operator/catalog.yaml 2>/dev/null | grep 'image:' | tail -1 | awk '{print \$3}'" 2>/dev/null || echo "") + DEPLOYED_OP=$(oc get deployment opendatahub-operator-controller-manager -n openshift-operators \ + -o jsonpath='{.spec.template.spec.containers[0].image}' 2>/dev/null || echo "") + + if [[ -n "$CATALOG_OP" && "$DEPLOYED_OP" != "$CATALOG_OP" ]]; then + echo "Newer component images found — performing forced reinstall..." 
+ + SUB=$(oc get subscription opendatahub-operator -n openshift-operators \ + -o jsonpath='{.metadata.name}' 2>/dev/null) + oc delete csv "$CURRENT_CSV" -n openshift-operators 2>&1 || true + sleep 5 + oc delete subscription "$SUB" -n openshift-operators 2>&1 || true + sleep 5 + + cat << EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: Subscription +metadata: + name: opendatahub-operator + namespace: openshift-operators +spec: + channel: ${CURRENT_CHANNEL} + name: opendatahub-operator + source: odh-catalog + sourceNamespace: openshift-marketplace + installPlanApproval: Automatic +EOF + echo "Subscription recreated — waiting for new CSV..." + else + echo "All component images are up to date — no reinstall needed" + fi +fi +``` + +### Step 6: Wait for CSV to Succeed + +```bash +TIMEOUT=600 +ELAPSED=0 +CSV_PHASE="" + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CSV_LINE=$(oc get csv -n openshift-operators 2>/dev/null | grep opendatahub-operator | grep -v Replacing || echo "") + if [[ -n "$CSV_LINE" ]]; then + CSV_NAME=$(echo "$CSV_LINE" | awk '{print $1}') + CSV_PHASE=$(echo "$CSV_LINE" | awk '{print $NF}') + echo "CSV: $CSV_NAME, Phase: $CSV_PHASE" + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + echo "ODH operator updated successfully" + break + fi + fi + sleep 10 + ELAPSED=$((ELAPSED + 10)) + echo "Waiting for CSV... 
(${ELAPSED}s/${TIMEOUT}s)" +done + +[[ "$CSV_PHASE" == "Succeeded" ]] || echo "WARNING: CSV not yet Succeeded — check manually" +``` + +### Step 7: Verify DSC Still Ready + +```bash +sleep 15 +READY=$(oc get datasciencecluster default-dsc \ + -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || echo "") + +echo "" +echo "=== ODH Update Summary ===" +echo "" +echo "CSV:" +oc get csv -n openshift-operators | grep opendatahub-operator + +echo "" +echo "Catalog image: $CATALOG_IMAGE" +echo "DSC Ready: ${READY:-Unknown}" + +if [[ "$READY" != "True" ]]; then + echo "" + echo "DSC not yet Ready — not-ready components:" + oc get datasciencecluster default-dsc \ + -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.status}{" ("}{.reason}{")\n"}{end}' \ + 2>/dev/null | grep -v "True\|Removed" || true +fi + +echo "" +echo "ODH update complete!" +``` + +## Pulling the Latest Nightly Daily + +Since `odh-stable-nightly` is rebuilt every day at midnight UTC, just re-run: + +```bash +/odh-update +``` + +Or manually: +```bash +# Refresh catalog pod to pull latest nightly +oc delete pod -n openshift-marketplace -l olm.catalogSource=odh-catalog +``` + +OLM will detect the new CSV version and auto-upgrade within ~30 seconds. 
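## Update Path at a Glance

The branching in Step 5 reduces to three outcomes. The helper below is a reference sketch of that logic only; the function name `update_path` and the version strings in the examples are ours, not part of the workflow.

```shell
# Decide which update path applies after a catalog refresh, mirroring Step 5:
#   CSV name changed             -> OLM auto-upgrades via a new InstallPlan
#   same CSV, new operator image -> forced reinstall (delete CSV + Subscription)
#   otherwise                    -> nothing to do
update_path() {
  local old_csv=$1 new_csv=$2 catalog_img=$3 deployed_img=$4
  if [[ "$new_csv" != "$old_csv" ]]; then
    echo "olm-upgrade"
  elif [[ -n "$catalog_img" && "$catalog_img" != "$deployed_img" ]]; then
    echo "forced-reinstall"
  else
    echo "up-to-date"
  fi
}

# Hypothetical version strings, for illustration only:
update_path opendatahub-operator.v2.30.0 opendatahub-operator.v2.31.0 "" ""
# -> olm-upgrade
update_path opendatahub-operator.v2.30.0 opendatahub-operator.v2.30.0 \
  "quay.io/example/operator@sha256:new" "quay.io/example/operator@sha256:old"
# -> forced-reinstall
```

Note that the `up-to-date` branch also covers the case where the catalog image could not be read from the catalog pod (empty `catalog_img`), matching Step 5's conservative behavior of not reinstalling when it cannot compare images.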
From aa081f2b4aead4348db556f4938fd57ac4373754 Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi <vamodi@redhat.com> Date: Fri, 3 Apr 2026 14:31:44 -0400 Subject: [PATCH 16/23] feat: detect RHOAI conflict in odh-install, add odh-uninstall command MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit odh-install: Add RHOAI detection — if rhods-operator CSV found, block install and tell user to run /rhoai-uninstall first with clear message explaining why (shared cluster-scoped DataScienceCluster CRD) odh-uninstall: New command to cleanly remove ODH including: - DataScienceCluster and DSCInitialization - Subscription, CSV, and CatalogSource - User namespaces and resources (optional keep-crds / keep-all flags) - CRDs (unless keep-crds) - Finalizer cleanup for stuck namespaces Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> --- .../.claude/commands/odh-install.md | 21 ++- .../.claude/commands/odh-uninstall.md | 171 ++++++++++++++++++ 2 files changed, 190 insertions(+), 2 deletions(-) create mode 100644 workflows/rhoai-manager/.claude/commands/odh-uninstall.md diff --git a/workflows/rhoai-manager/.claude/commands/odh-install.md b/workflows/rhoai-manager/.claude/commands/odh-install.md index 680f0a6c..f7059b3c 100644 --- a/workflows/rhoai-manager/.claude/commands/odh-install.md +++ b/workflows/rhoai-manager/.claude/commands/odh-install.md @@ -75,12 +75,29 @@ oc whoami &>/dev/null || { echo "ERROR: Not logged into OpenShift cluster"; exit echo "Logged in as: $(oc whoami)" echo "Cluster: $(oc whoami --show-server)" -# Check not already installed +# Check if RHOAI is installed — RHOAI and ODH cannot coexist +if oc get csv -n redhat-ods-operator 2>/dev/null | grep -q rhods-operator; then + RHOAI_CSV=$(oc get csv -n redhat-ods-operator --no-headers 2>/dev/null | grep rhods-operator | awk '{print $1}') + echo "" + echo "ERROR: RHOAI is installed on this cluster ($RHOAI_CSV)" + echo "" + echo "RHOAI and ODH cannot coexist — 
they both manage the same" + echo "cluster-scoped DataScienceCluster CRD and overlapping operators." + echo "" + echo "To install ODH, first uninstall RHOAI:" + echo " /rhoai-uninstall" + echo "" + echo "Then re-run:" + echo " /odh-install" + exit 1 +fi + +# Check if ODH is already installed if oc get csv -n openshift-operators 2>/dev/null | grep -q opendatahub-operator; then echo "ERROR: ODH already installed. Use /odh-update to update." exit 1 fi -echo "No existing ODH installation detected" +echo "No existing ODH or RHOAI installation detected — proceeding" ``` ### Step 3: Create CatalogSource diff --git a/workflows/rhoai-manager/.claude/commands/odh-uninstall.md b/workflows/rhoai-manager/.claude/commands/odh-uninstall.md new file mode 100644 index 00000000..11eba01c --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/odh-uninstall.md @@ -0,0 +1,171 @@ +# /odh-uninstall - Uninstall Open Data Hub from Cluster + +Completely uninstall Open Data Hub (ODH) from an OpenShift cluster, removing all related resources. + +## Command Usage + +```bash +/odh-uninstall # Standard uninstall (removes everything) +/odh-uninstall keep-crds # Uninstall but keep CRDs +/odh-uninstall keep-all # Keep CRDs and user resources (projects, models, etc.) +``` + +## Uninstall Options + +| Option | Removes Operator | Removes CRDs | Removes User Resources | +|--------|-----------------|--------------|----------------------| +| (default) | Yes | Yes | Yes | +| `keep-crds` | Yes | No | Yes | +| `keep-all` | Yes | No | No | + +## Prerequisites + +1. **Cluster access**: Logged into OpenShift cluster with cluster-admin privileges (use `/oc-login`) +2. 
**ODH installed**: ODH must be installed on the cluster + +## Process + +### Step 1: Parse Arguments and Verify + +```bash +KEEP_CRDS=false +KEEP_ALL=false + +for arg in "$@"; do + case "$arg" in + keep-crds) KEEP_CRDS=true ;; + keep-all) KEEP_CRDS=true; KEEP_ALL=true ;; + *) echo "Unknown option: $arg (valid: keep-crds, keep-all)" ;; + esac +done + +oc whoami &>/dev/null || { echo "ERROR: Not logged into OpenShift cluster"; exit 1; } +echo "Logged in as: $(oc whoami)" +echo "Cluster: $(oc whoami --show-server)" +echo "" +echo "Uninstall options: keep-crds=$KEEP_CRDS keep-all=$KEEP_ALL" + +# Verify ODH is installed +if ! oc get csv -n openshift-operators 2>/dev/null | grep -q opendatahub-operator; then + echo "ODH does not appear to be installed on this cluster" + exit 0 +fi + +ODH_CSV=$(oc get csv -n openshift-operators --no-headers 2>/dev/null | grep opendatahub-operator | awk '{print $1}') +echo "Found ODH: $ODH_CSV" +``` + +### Step 2: Delete DataScienceCluster and DSCInitialization + +```bash +echo "" +echo "=== Step 2: Removing DataScienceCluster and DSCInitialization ===" + +oc delete datasciencecluster --all --timeout=60s 2>/dev/null || true +oc delete dscinitializations.dscinitialization.opendatahub.io --all --timeout=60s 2>/dev/null || true +sleep 10 +``` + +### Step 3: Delete CSV and Subscription + +```bash +echo "" +echo "=== Step 3: Removing ODH operator subscription and CSV ===" + +oc delete subscription opendatahub-operator -n openshift-operators 2>/dev/null || true +oc delete csv "$ODH_CSV" -n openshift-operators 2>/dev/null || true + +# Remove catalog source +oc delete catalogsource odh-catalog -n openshift-marketplace 2>/dev/null || true +sleep 10 +``` + +### Step 4: Remove User Resources (unless keep-all) + +```bash +if [[ "$KEEP_ALL" != "true" ]]; then + echo "" + echo "=== Step 4: Removing user resources ===" + + # Delete data science projects + for ns in $(oc get namespace -l opendatahub.io/dashboard=true -o name 2>/dev/null); do + echo 
"Deleting namespace: $ns" + oc delete $ns --timeout=60s 2>/dev/null || true + done + + # Remove finalizers from any stuck resources + for crd in notebooks.kubeflow.org inferenceservices.serving.kserve.io \ + datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io; do + oc get $crd -A -o name 2>/dev/null | while read res; do + oc patch $res --type=json -p '[{"op":"remove","path":"/metadata/finalizers"}]' 2>/dev/null || true + done + done +else + echo "=== Step 4: Skipping user resources (keep-all) ===" +fi +``` + +### Step 5: Remove ODH Namespace + +```bash +echo "" +echo "=== Step 5: Removing ODH application namespace ===" + +if [[ "$KEEP_ALL" != "true" ]]; then + oc delete namespace opendatahub --timeout=120s 2>/dev/null || { + echo "Namespace stuck — removing finalizers..." + oc get namespace opendatahub -o json 2>/dev/null | \ + python3 -c "import sys,json; d=json.load(sys.stdin); d['spec']['finalizers']=[]; print(json.dumps(d))" | \ + oc replace --raw /api/v1/namespaces/opendatahub/finalize -f - 2>/dev/null || true + } +fi +``` + +### Step 6: Remove CRDs (unless keep-crds) + +```bash +if [[ "$KEEP_CRDS" != "true" ]]; then + echo "" + echo "=== Step 6: Removing ODH CRDs ===" + + # Get all CRDs owned by ODH + ODH_CRDS=$(oc get crd -o name 2>/dev/null | grep -E \ + "opendatahub|datasciencecluster|dscinitialization|featuretracker|datasciencepipeline" || true) + + for crd in $ODH_CRDS; do + echo "Deleting CRD: $crd" + oc delete $crd --timeout=30s 2>/dev/null || true + done +else + echo "=== Step 6: Skipping CRD removal (keep-crds) ===" +fi +``` + +### Step 7: Verify Cleanup + +```bash +echo "" +echo "=== Uninstall Complete ===" +echo "" + +# Check for remaining resources +REMAINING_CSV=$(oc get csv -n openshift-operators 2>/dev/null | grep opendatahub || echo "") +REMAINING_NS=$(oc get namespace opendatahub 2>/dev/null || echo "") + +if [[ -z "$REMAINING_CSV" && -z "$REMAINING_NS" ]]; then + echo "ODH successfully removed" +else + [[ -n 
"$REMAINING_CSV" ]] && echo "WARNING: CSV still present: $REMAINING_CSV" + [[ -n "$REMAINING_NS" ]] && echo "WARNING: Namespace 'opendatahub' still present" +fi + +echo "" +echo "To install ODH again: /odh-install" +echo "To install RHOAI: /rhoai-install" +``` + +## Notes + +- ODH and RHOAI share cluster-scoped CRDs — uninstalling ODH clears the path for RHOAI installation +- If the `opendatahub` namespace gets stuck on termination, the command attempts to remove its finalizers automatically +- User data (notebooks, pipelines, models) in data science project namespaces is deleted by default — use `keep-all` to preserve it From e15ffa258870ee568b7991042da2296622a1ea8f Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi <vamodi@redhat.com> Date: Fri, 3 Apr 2026 14:33:59 -0400 Subject: [PATCH 17/23] fix: detect ODH conflict in rhoai-install, clarify odh-uninstall for RHOAI switch MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit rhoai-install: Add ODH detection — if opendatahub-operator CSV found, block install and direct user to /odh-uninstall first odh-uninstall: Add explicit guidance that default (no flags) is required when switching to RHOAI — keep-crds/keep-all would leave conflicting CRDs Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> --- .../.claude/commands/odh-uninstall.md | 15 +++++++++++++-- .../.claude/commands/rhoai-install.md | 17 +++++++++++++++++ 2 files changed, 30 insertions(+), 2 deletions(-) diff --git a/workflows/rhoai-manager/.claude/commands/odh-uninstall.md b/workflows/rhoai-manager/.claude/commands/odh-uninstall.md index 11eba01c..808a014e 100644 --- a/workflows/rhoai-manager/.claude/commands/odh-uninstall.md +++ b/workflows/rhoai-manager/.claude/commands/odh-uninstall.md @@ -164,8 +164,19 @@ echo "To install ODH again: /odh-install" echo "To install RHOAI: /rhoai-install" ``` +## Switching from ODH to RHOAI + +If you want to install RHOAI after ODH, use the **default** uninstall (no 
flags): + +```bash +/odh-uninstall +/rhoai-install +``` + +Do **not** use `keep-crds` or `keep-all` when switching to RHOAI — RHOAI installs its own versions of the shared CRDs (`DataScienceCluster`, etc.) and leftover ODH CRDs will conflict. + ## Notes -- ODH and RHOAI share cluster-scoped CRDs — uninstalling ODH clears the path for RHOAI installation +- ODH and RHOAI share cluster-scoped CRDs (`DataScienceCluster`, `DSCInitialization`) — they cannot coexist - If the `opendatahub` namespace gets stuck on termination, the command attempts to remove its finalizers automatically -- User data (notebooks, pipelines, models) in data science project namespaces is deleted by default — use `keep-all` to preserve it +- User data (notebooks, pipelines, models) in data science project namespaces is deleted by default — use `keep-all` to preserve it (note: ODH user data is not compatible with RHOAI namespaces) diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-install.md b/workflows/rhoai-manager/.claude/commands/rhoai-install.md index 5b9ef10b..d4c69e06 100644 --- a/workflows/rhoai-manager/.claude/commands/rhoai-install.md +++ b/workflows/rhoai-manager/.claude/commands/rhoai-install.md @@ -125,6 +125,23 @@ oc whoami &>/dev/null || die "Not logged into an OpenShift cluster" echo "Logged in as: $(oc whoami)" echo "Cluster: $(oc whoami --show-server)" +# Check if ODH is installed — RHOAI and ODH cannot coexist +if oc get csv -n openshift-operators 2>/dev/null | grep -q opendatahub-operator; then + ODH_CSV=$(oc get csv -n openshift-operators --no-headers 2>/dev/null | grep opendatahub-operator | awk '{print $1}') + echo "" + echo "ERROR: ODH (Open Data Hub) is installed on this cluster ($ODH_CSV)" + echo "" + echo "RHOAI and ODH cannot coexist — they both manage the same" + echo "cluster-scoped DataScienceCluster CRD and overlapping operators." 
+ echo "" + echo "To install RHOAI, first uninstall ODH:" + echo " /odh-uninstall" + echo "" + echo "Then re-run:" + echo " /rhoai-install" + die "ODH must be uninstalled before installing RHOAI" +fi + # Verify RHOAI is not already installed if oc get csv -n redhat-ods-operator 2>/dev/null | grep -q rhods-operator; then die "RHOAI is already installed. Use /rhoai-update to update existing installation." From e6c9427c7ac670883ff60b31f5dd5dfe9dc23528 Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi <vamodi@redhat.com> Date: Fri, 3 Apr 2026 14:37:02 -0400 Subject: [PATCH 18/23] docs: update README and ambient.json for ODH commands - README: add ODH commands section, coexistence warning, switch workflows, updated structure tree with all 9 command files - ambient.json: update name/description/systemPrompt/startupPrompt to include ODH install/update/uninstall and coexistence guidance Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> --- workflows/rhoai-manager/.ambient/ambient.json | 6 +- workflows/rhoai-manager/README.md | 334 ++++++++---------- 2 files changed, 151 insertions(+), 189 deletions(-) diff --git a/workflows/rhoai-manager/.ambient/ambient.json b/workflows/rhoai-manager/.ambient/ambient.json index 6ae44abc..f3642a77 100644 --- a/workflows/rhoai-manager/.ambient/ambient.json +++ b/workflows/rhoai-manager/.ambient/ambient.json @@ -1,8 +1,8 @@ { "name": "RHOAI Manager", - "description": "Comprehensive workflow for managing Red Hat OpenShift AI lifecycle: installation, updates, version detection, and uninstallation.", - "systemPrompt": "You are an AI assistant specialized in managing the complete lifecycle of RHOAI (Red Hat OpenShift AI) installations.\n\n# Your Role\n\nYou help automate the process of:\n1. Logging into OpenShift clusters\n2. Installing RHOAI from scratch\n3. Detecting RHOAI version and build information\n4. Updating RHOAI to latest nightly builds\n5. 
Uninstalling RHOAI completely\n\n# Available Commands\n\nYou have access to the following commands:\n\n## /oc-login\nLogin to OpenShift cluster using credentials from Ambient session:\n- Checks for required credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD)\n- Automatically installs oc CLI if not available\n- Executes login to the cluster\n- Verifies connection and displays cluster info\n\n## /rhoai-install\nInstall RHOAI from scratch on a cluster:\n- Sets up OLM catalog source for nightly builds\n- Creates operator namespace and subscription\n- Waits for operator installation to complete\n- Creates DataScienceCluster with component configuration\n- Verifies all components are healthy\n- Provides installation summary\n\n## /rhoai-version\nDetect RHOAI version and build information:\n- Checks RHOAI operator subscription and CSV\n- Reports DataScienceCluster status and components\n- Lists all component images with SHA digests\n- Provides comprehensive version summary\n\n## /rhoai-update\nUpdates RHOAI to the latest nightly build:\n- Verifies current version\n- Updates the OLM catalog source\n- Monitors the operator upgrade\n- Handles forced reinstall when component images update without CSV version change\n- Verifies component reconciliation\n- Provides status updates\n\n## /rhoai-uninstall\nCompletely uninstall RHOAI from an OpenShift cluster:\n- Supports graceful or forceful uninstall\n- Options to keep CRDs and/or user resources\n- Removes operator, custom resources, webhooks, namespaces\n- Verifies complete cleanup\n\n# Workflow Phases\n\n## Phase 0: Connect to Cluster\n- Login to OpenShift cluster using /oc-login\n- Verify cluster access and permissions\n\n## Phase 1: Install or Update RHOAI\n- For fresh installations: Use /rhoai-install\n- For existing installations: Use /rhoai-update\n- Verify successful installation/update\n- Document version changes\n\n## Phase 2: Version Management\n- Use /rhoai-version to check current installation\n- Track component 
versions and build information\n- Monitor for updates\n\n## Phase 3: Cleanup\n- Use /rhoai-uninstall when decommissioning\n- Options to preserve CRDs or user resources\n\n# Output Locations\n\n- Installation Reports: `artifacts/rhoai-manager/reports/*.md`\n- Update Reports: `artifacts/rhoai-manager/reports/*.md`\n- Version Info: `artifacts/rhoai-manager/version/*.md`\n- Execution Logs: `artifacts/rhoai-manager/logs/*.log`\n\n# Prerequisites\n\n- OpenShift cluster (version 4.12+)\n- `oc` CLI installed (auto-installed if missing)\n- Cluster credentials configured in Ambient session (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD)\n- Cluster admin permissions\n", - "startupPrompt": "Welcome to the RHOAI Manager Workflow!\n\nI'll help you manage the complete lifecycle of Red Hat OpenShift AI installations.\n\n## What I Can Do\n\n- **Login to Cluster**: Use `/oc-login` to connect to your OpenShift cluster\n- **Install RHOAI**: Use `/rhoai-install` to deploy RHOAI from scratch\n- **Check Version**: Use `/rhoai-version` to detect current version and build info\n- **Update RHOAI**: Use `/rhoai-update` to upgrade to the latest nightly build\n- **Uninstall RHOAI**: Use `/rhoai-uninstall` to completely remove RHOAI from the cluster\n\n## Getting Started\n\nJust tell me what you'd like to do:\n- \"Login to my cluster\"\n- \"Install RHOAI on the cluster\"\n- \"Check current RHOAI version\"\n- \"Update RHOAI to latest nightly\"\n- \"Uninstall RHOAI\"\n\n**Note**: Make sure your cluster credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD) are configured in your Ambient session.\n\nLet's get started!", + "description": "Comprehensive workflow for managing Red Hat OpenShift AI (RHOAI) and Open Data Hub (ODH) lifecycle: installation, updates, version detection, and uninstallation.", + "systemPrompt": "You are an AI assistant specialized in managing the complete lifecycle of RHOAI (Red Hat OpenShift AI) and ODH (Open Data Hub) installations.\n\n# Your Role\n\nYou help automate the process 
of:\n1. Logging into OpenShift clusters\n2. Installing RHOAI or ODH from scratch\n3. Detecting RHOAI version and build information\n4. Updating RHOAI or ODH to latest nightly builds\n5. Uninstalling RHOAI or ODH completely\n6. Switching between RHOAI and ODH safely\n\n# Important: RHOAI and ODH Cannot Coexist\n\nRHOAI and ODH share cluster-scoped CRDs (DataScienceCluster, DSCInitialization) and overlapping operators. They CANNOT be installed on the same cluster at the same time.\n\n- To switch from RHOAI to ODH: run /rhoai-uninstall first, then /odh-install\n- To switch from ODH to RHOAI: run /odh-uninstall first, then /rhoai-install\n- Both /rhoai-install and /odh-install detect the other and block with a clear error message\n\n# Available Commands\n\n## /oc-login\nLogin to OpenShift cluster using credentials from Ambient session:\n- Checks for required credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD)\n- Automatically installs oc CLI if not available\n- Executes login to the cluster\n- Verifies connection and displays cluster info\n\n## /rhoai-install\nInstall RHOAI from scratch on a cluster:\n- Detects and blocks if ODH is installed (directs to /odh-uninstall first)\n- Sets up OLM catalog source for nightly or GA builds\n- Creates operator namespace and subscription\n- Waits for operator installation to complete\n- Creates DataScienceCluster with component configuration\n- Verifies all components are healthy\n\n## /rhoai-version\nDetect RHOAI version and build information:\n- Checks RHOAI operator subscription and CSV\n- Reports DataScienceCluster status and components\n- Lists all component images with SHA digests\n\n## /rhoai-update\nUpdates RHOAI to the latest nightly build:\n- Verifies current version and preserves channel\n- Updates the OLM catalog source\n- Handles forced reinstall when component images update without CSV version change\n- Verifies component reconciliation\n\n## /rhoai-uninstall\nCompletely uninstall RHOAI from an OpenShift cluster:\n- 
Supports graceful or forceful uninstall\n- Options to keep CRDs and/or user resources\n- Removes operator, custom resources, webhooks, namespaces\n\n## /odh-install\nInstall Open Data Hub (ODH) nightly builds on a cluster:\n- Detects and blocks if RHOAI is installed (directs to /rhoai-uninstall first)\n- Creates CatalogSource using odh-stable-nightly floating tag\n- Creates Subscription in openshift-operators (uses existing global OperatorGroup)\n- Creates DSCInitialization and DataScienceCluster\n- Default catalog: quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly\n- Default channel: fast\n\n## /odh-update\nUpdate ODH to the latest nightly build:\n- Updates CatalogSource, forces catalog pod refresh\n- OLM auto-upgrades when CSV version changes (typical for ODH nightlies)\n- Falls back to forced reinstall if only component images changed\n\n## /odh-uninstall\nCompletely uninstall ODH from an OpenShift cluster:\n- Removes DataScienceCluster, DSCInitialization, subscription, CSV, CatalogSource\n- Options: keep-crds, keep-all\n- Use default (no flags) when switching to RHOAI\n\n# Workflow Phases\n\n## Phase 0: Connect to Cluster\n- Login to OpenShift cluster using /oc-login\n\n## Phase 1: Install or Update\n- Fresh RHOAI: /rhoai-install\n- Fresh ODH: /odh-install\n- Update RHOAI: /rhoai-update\n- Update ODH: /odh-update\n\n## Phase 2: Version Management\n- Check RHOAI: /rhoai-version\n\n## Phase 3: Cleanup / Switch\n- Remove RHOAI: /rhoai-uninstall\n- Remove ODH: /odh-uninstall\n\n# Output Locations\n\n- Installation Reports: artifacts/rhoai-manager/reports/*.md\n- Version Info: artifacts/rhoai-manager/version/*.md\n- Execution Logs: artifacts/rhoai-manager/logs/*.log\n\n# Prerequisites\n\n- OpenShift cluster (version 4.12+)\n- Cluster credentials in Ambient session (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD)\n- Cluster admin permissions\n", + "startupPrompt": "Welcome to the RHOAI Manager Workflow!\n\nI manage the complete lifecycle of Red Hat 
OpenShift AI (RHOAI) and Open Data Hub (ODH) installations.\n\n## RHOAI Commands\n\n- `/rhoai-install` - Install RHOAI from scratch (nightly or GA)\n- `/rhoai-update` - Update to latest nightly\n- `/rhoai-version` - Check current version and build info\n- `/rhoai-uninstall` - Remove RHOAI completely\n\n## ODH Commands\n\n- `/odh-install` - Install ODH nightly (odh-stable-nightly, fast channel)\n- `/odh-update` - Update ODH to latest nightly\n- `/odh-uninstall` - Remove ODH completely\n\n## Cluster Connection\n\n- `/oc-login` - Connect to your OpenShift cluster\n\n## Important Note\n\nRHOAI and ODH **cannot coexist** on the same cluster. To switch between them, uninstall one before installing the other. Both install commands detect the other and will guide you.\n\n**Getting started**: Make sure your cluster credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD) are configured in your Ambient session, then use /oc-login.\n\nWhat would you like to do?", "results": { "Installation Reports": "artifacts/rhoai-manager/reports/*.md", "Update Reports": "artifacts/rhoai-manager/reports/*.md", diff --git a/workflows/rhoai-manager/README.md b/workflows/rhoai-manager/README.md index 0ecaa044..dbd3adbf 100644 --- a/workflows/rhoai-manager/README.md +++ b/workflows/rhoai-manager/README.md @@ -1,30 +1,39 @@ # RHOAI Manager -Comprehensive workflow for managing the complete lifecycle of Red Hat OpenShift AI: installation, updates, version detection, and uninstallation. +Comprehensive workflow for managing the complete lifecycle of Red Hat OpenShift AI (RHOAI) and Open Data Hub (ODH): installation, updates, version detection, and uninstallation. 
## Overview This workflow provides an AI-powered pipeline for: -- Installing RHOAI from scratch on OpenShift clusters -- Updating RHOAI to latest nightly builds -- Detecting version and build information -- Completely uninstalling RHOAI when needed +- Installing RHOAI or ODH from scratch on OpenShift clusters +- Updating RHOAI or ODH to latest nightly builds +- Detecting RHOAI version and build information +- Completely uninstalling RHOAI or ODH when needed - Managing cluster connections and authentication +- Safely switching between RHOAI and ODH + +## Important: RHOAI and ODH Cannot Coexist + +RHOAI and ODH share cluster-scoped CRDs (`DataScienceCluster`, `DSCInitialization`) and overlapping operators. They **cannot** be installed on the same cluster simultaneously. Both `/rhoai-install` and `/odh-install` detect the other and block with a clear error. ## Structure ``` workflows/rhoai-manager/ ├── .ambient/ -│ └── ambient.json # Workflow configuration +│ └── ambient.json # Workflow configuration ├── .claude/ │ └── commands/ -│ ├── oc-login.md # OpenShift cluster login command -│ ├── rhoai-install.md # RHOAI installation command -│ ├── rhoai-version.md # RHOAI version detection command -│ ├── rhoai-update.md # RHOAI update command -│ └── rhoai-uninstall.md # RHOAI uninstall command -└── README.md # This file +│ ├── oc-login.md # OpenShift cluster login +│ ├── rhoai-install.md # RHOAI installation +│ ├── rhoai-version.md # RHOAI version detection +│ ├── rhoai-update.md # RHOAI update to latest nightly +│ ├── rhoai-uninstall.md # RHOAI uninstall +│ ├── odh-install.md # ODH installation +│ ├── odh-update.md # ODH update to latest nightly +│ ├── odh-uninstall.md # ODH uninstall +│ └── odh-pr-tracker.md # Track ODH PRs in RHOAI builds +└── README.md # This file ``` ## Commands @@ -33,239 +42,192 @@ workflows/rhoai-manager/ Login to OpenShift cluster using credentials from Ambient session. 
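As a concrete illustration of the login that `/oc-login` is described as performing with the documented session variables, a minimal sketch follows. The exact invocation is not shown in this README; the `build_login_cmd` helper and the flag-free form are assumptions based only on the documented `OCP_SERVER`, `OCP_USERNAME`, and `OCP_PASSWORD` variables.

```shell
# Illustrative sketch of the login /oc-login performs, assembled from the
# documented Ambient session variables. build_login_cmd is a hypothetical
# helper used here so the command can be inspected before execution.
build_login_cmd() {
    server="$1"; user="$2"
    # The password is supplied separately at execution time; never echo it.
    printf 'oc login %s -u %s -p <OCP_PASSWORD>' "$server" "$user"
}

build_login_cmd "https://api.cluster.example.com:6443" "kubeadmin"
# prints: oc login https://api.cluster.example.com:6443 -u kubeadmin -p <OCP_PASSWORD>
```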
-**What it does:** -- Checks for required credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD) -- Verifies `oc` CLI is installed -- Executes login to the cluster -- Verifies connection and displays cluster info +**Usage:** `/oc-login` + +**Required env vars:** `OCP_SERVER`, `OCP_USERNAME`, `OCP_PASSWORD` + +--- + +### /rhoai-install + +Install RHOAI from scratch on an OpenShift cluster. **Usage:** -``` -/oc-login +```bash +/rhoai-install # Latest dev nightly (default) +/rhoai-install channel=stable-3.4 # GA stable-3.4 channel +/rhoai-install catalog=redhat-operators # GA production catalog ``` -Or simply ask: -- "Login to my cluster" -- "Connect to OpenShift" -- "Login to OCP" +**Prerequisite:** No existing RHOAI **or ODH** installation (detected automatically). -**Required Environment Variables:** -- `OCP_SERVER` - OpenShift cluster API URL (e.g., `https://api.cluster.example.com:6443`) -- `OCP_USERNAME` - Your OpenShift username -- `OCP_PASSWORD` - Your OpenShift password +**What gets deployed:** +- Operator namespace: `redhat-ods-operator` +- Application namespace: `redhat-ods-applications` +- DataScienceCluster with all components -### /rhoai-install +--- -Install RHOAI from scratch on an OpenShift cluster. +### /rhoai-update -**What it does:** -- Sets up OLM catalog source (dev or GA production) -- Creates operator namespace and subscription -- Waits for ClusterServiceVersion (CSV) to be ready -- Creates DataScienceCluster with component configuration -- Patches component states (Managed/Removed) as needed -- Verifies all components are healthy and reconciled -- Provides detailed installation summary +Update RHOAI to the latest nightly or GA build. 
**Usage:** ```bash -# Development/Nightly builds (default) -/rhoai-install # Latest dev catalog -/rhoai-install channel=beta # Dev catalog, beta channel -/rhoai-install image=quay.io/modh/rhoai-catalog:latest-release-3.5 - -# GA Production releases -/rhoai-install catalog=redhat-operators # GA catalog, stable channel -/rhoai-install catalog=redhat-operators channel=fast # GA catalog, fast channel +/rhoai-update # Pull latest (preserves current channel) +/rhoai-update 3.4 -c stable-3.4 # Update with explicit channel ``` -**Parameters:** -- `catalog` - Catalog source (`rhoai-catalog-dev` for nightly, `redhat-operators` for GA) -- `channel` - Subscription channel (`beta`, `fast`, or `stable`) -- `image` - Custom catalog image (only for `rhoai-catalog-dev`) +**Features:** Preserves channel, auto-detects newer component images, forces reinstall if needed. -Or simply ask: -- "Install RHOAI on the cluster" -- "Deploy RHOAI from production catalog" -- "Set up RHOAI for testing" +--- -**Prerequisites:** -- OpenShift cluster (version 4.12+) -- Logged into cluster with admin permissions (use `/oc-login`) -- No existing RHOAI installation +### /rhoai-version -**What gets deployed:** -- **Operator namespace**: `redhat-ods-operator` -- **Application namespace**: `redhat-ods-applications` -- **Monitoring namespace**: `redhat-ods-monitoring` -- **DataScienceCluster**: Custom resource managing all RHOAI components -- **Component operators**: Dashboard, Workbenches, Model Serving, Pipelines, etc. +Check installed RHOAI version, CSV, catalog digest, and all component image SHAs. -**Note:** Defaults to `rhoai-catalog-dev` for nightly builds. Use `catalog=redhat-operators` for GA production releases. +**Usage:** `/rhoai-version` -### /rhoai-version +--- -Detect RHOAI version and build information. 
+### /rhoai-uninstall -**What it does:** -- Checks RHOAI operator subscription and ClusterServiceVersion -- Reports DataScienceCluster status and component states -- Lists all component images with SHA256 digests -- Provides comprehensive version summary +Completely uninstall RHOAI from an OpenShift cluster. **Usage:** +```bash +/rhoai-uninstall # Remove everything (use this before installing ODH) +/rhoai-uninstall graceful # Graceful then forceful cleanup +/rhoai-uninstall keep-crds # Keep CRDs +/rhoai-uninstall keep-all # Keep CRDs and user resources ``` -/rhoai-version -``` - -Or simply ask: -- "What version of RHOAI is installed?" -- "Check RHOAI version" -- "Show me RHOAI build info" -**Note:** You must be logged into the cluster first (use `/oc-login`) +--- -### /rhoai-update - -Updates RHOAI to the latest nightly build. +### /odh-install -**What it does:** -- Checks current RHOAI version and component states -- Updates the OLM catalog source to latest nightly -- Monitors the operator upgrade process -- Handles special scenarios: - - **Scenario E**: Forced subscription reinstall when component images update without CSV version change - - Channel preservation across updates - - Component state preservation (Managed/Removed) -- Verifies component reconciliation -- Reports final status with before/after comparison +Install Open Data Hub (ODH) nightly on an OpenShift cluster. **Usage:** -``` -/rhoai-update +```bash +/odh-install # odh-stable-nightly catalog, fast channel (default) +/odh-install channel=fast image=quay.io/opendatahub/opendatahub-operator-catalog:latest ``` -Or simply ask: -- "Update RHOAI to latest nightly" -- "Upgrade to RHOAI 3.4 nightly" -- "Update RHOAI" +**Prerequisite:** No existing ODH **or RHOAI** installation (detected automatically). 
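The automatic conflict detection described above is not reproduced in this README; as a sketch of how such a guard could work, assuming detection via the `rhods-operator` CSV name, consider the following. The `check_rhoai_conflict` helper and the CSV-based check are illustrative assumptions, not taken from the actual command files.

```shell
# Hypothetical sketch of the mutual-exclusion guard for /odh-install.
# csv_names is expected to hold the output of something like:
#   oc get csv -A -o custom-columns=NAME:.metadata.name --no-headers
check_rhoai_conflict() {
    csv_names="$1"
    if printf '%s\n' "$csv_names" | grep -q '^rhods-operator'; then
        echo "ERROR: RHOAI is installed. Run /rhoai-uninstall first, then retry /odh-install." >&2
        return 1
    fi
    return 0
}
```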
-**Note:** You must be logged into the cluster first (use `/oc-login`) +**Key differences from RHOAI:** -**Advanced Features:** -- Detects when component images have newer builds without CSV version changes -- Automatically triggers forced reinstall in these cases -- Preserves DataScienceCluster component configuration across updates -- Waits for all components to reconcile before completing +| | RHOAI | ODH | +|-|-------|-----| +| Package | `rhods-operator` | `opendatahub-operator` | +| Operator namespace | `redhat-ods-operator` | `openshift-operators` | +| App namespace | `redhat-ods-applications` | `opendatahub` | +| Default channel | `stable-3.4` / `beta` | `fast` | +| Nightly tag | `rhoai-3.4` (floating) | `odh-stable-nightly` (floating) | -### /rhoai-uninstall +--- -Completely uninstall RHOAI from an OpenShift cluster. +### /odh-update -**What it does:** -- Removes RHOAI operator and subscriptions -- Deletes custom resources (DataScienceCluster, DSCInitialization, etc.) -- Cleans up webhooks and finalizers -- Removes RHOAI namespaces -- Deletes CRDs (optional) -- Cleans up user data science projects (optional) +Update ODH to the latest nightly build. **Usage:** -``` -/rhoai-uninstall # Standard forceful uninstall -/rhoai-uninstall graceful # Graceful uninstall followed by cleanup -/rhoai-uninstall keep-crds # Keep CRDs installed -/rhoai-uninstall keep-all # Keep CRDs and user resources +```bash +/odh-update # Pull latest odh-stable-nightly +/odh-update image=quay.io/opendatahub/opendatahub-operator-catalog:latest ``` -Or simply ask: -- "Uninstall RHOAI from the cluster" -- "Remove RHOAI completely" -- "Clean up RHOAI installation" +**Note:** ODH nightlies typically bump the CSV version daily, so OLM auto-upgrades without a forced reinstall in most cases. -**Warning:** This will delete all RHOAI resources including user workbenches, models, and data. Backup important work first. 
+--- -**Note:** You must be logged into the cluster first (use `/oc-login`) and have cluster-admin permissions. +### /odh-uninstall -## Prerequisites +Completely uninstall ODH from an OpenShift cluster. -- OpenShift cluster (version 4.12+) -- `oc` CLI installed (auto-installed if missing) -- Cluster credentials configured in Ambient session: - - `OCP_SERVER` - OpenShift cluster API URL - - `OCP_USERNAME` - Your OpenShift username - - `OCP_PASSWORD` - Your OpenShift password -- Cluster admin permissions +**Usage:** +```bash +/odh-uninstall # Remove everything (use this before installing RHOAI) +/odh-uninstall keep-crds # Keep CRDs +/odh-uninstall keep-all # Keep CRDs and user resources +``` -## Output Artifacts +**Note:** Use the default (no flags) when switching to RHOAI — `keep-crds` or `keep-all` would leave conflicting CRDs. -All artifacts are stored in `artifacts/rhoai-manager/`: +--- -- `reports/*.md` - Installation and update reports with version changes -- `version/*.md` - Version detection summaries -- `logs/*.log` - Detailed execution logs +### /odh-pr-tracker + +Track whether an ODH pull request has been included in the latest RHOAI build. + +**Usage:** `/odh-pr-tracker <pr-number>` + +--- ## Typical Workflows -### Fresh Installation +### Fresh RHOAI Installation ``` -1. /oc-login # Connect to cluster -2. /rhoai-install # Install RHOAI from scratch -3. /rhoai-version # Verify installation +1. /oc-login +2. /rhoai-install +3. /rhoai-version ``` -### Regular Updates +### Fresh ODH Installation ``` -1. /oc-login # Connect to cluster -2. /rhoai-version # Check current version -3. /rhoai-update # Update to latest nightly -4. /rhoai-version # Verify new version +1. /oc-login +2. /odh-install +3. /rhoai-version # (check via version command — ODH has no dedicated version command yet) ``` -### Decommissioning +### Pull Latest Nightly (RHOAI) ``` -1. /oc-login # Connect to cluster -2. /rhoai-uninstall # Remove RHOAI completely +1. /oc-login +2. /rhoai-update +3. 
/rhoai-version ``` -## GitHub Actions Integration +### Pull Latest Nightly (ODH) +``` +1. /oc-login +2. /odh-update +``` -This workflow is designed to run via GitHub Actions with Ambient: +### Switch from RHOAI to ODH +``` +1. /oc-login +2. /rhoai-uninstall # Standard uninstall (removes CRDs) +3. /odh-install +``` -```yaml -- name: Update RHOAI to Latest Nightly - uses: ambient-code/ambient-action@v0.0.2 - with: - api-token: ${{ secrets.AMBIENT_API_TOKEN }} - workflow: workflows/rhoai-manager - prompt: Update RHOAI to the latest nightly build +### Switch from ODH to RHOAI +``` +1. /oc-login +2. /odh-uninstall # Standard uninstall (removes CRDs) +3. /rhoai-install ``` -## Technical Details - -### Catalog Source -- Uses `rhoai-catalog-dev` for nightly builds -- Image: `quay.io/modh/rhoai-catalog:latest-release-3.4` -- Updates trigger operator upgrades automatically - -### Component Management -- DataScienceCluster manages component states (Managed/Removed) -- Component states preserved across updates -- Individual component image tracking with SHA digests - -### Update Scenarios -The workflow handles several update scenarios: -- **Normal CSV upgrade**: Operator version changes -- **Forced reinstall**: Component images update without CSV version change -- **Channel migration**: Updates preserve subscription channel -- **Catalog refresh**: Forces OLM to re-evaluate available updates - -## Future Enhancements - -- [ ] Automated test suite execution after updates -- [ ] Test result parsing and analysis -- [ ] JIRA integration for issue updates -- [ ] Slack/email notifications -- [ ] Rollback capabilities -- [ ] Pre-upgrade validation checks -- [ ] Multi-cluster support +### Decommission +``` +1. /oc-login +2. 
/rhoai-uninstall # or /odh-uninstall +``` + +## Prerequisites + +- OpenShift cluster (version 4.12+) +- `oc` CLI installed (auto-installed if missing) +- Cluster credentials configured in Ambient session: + - `OCP_SERVER` - OpenShift cluster API URL + - `OCP_USERNAME` - Your OpenShift username + - `OCP_PASSWORD` - Your OpenShift password +- Cluster admin permissions + +## Output Artifacts + +All artifacts are stored in `artifacts/rhoai-manager/`: + +- `reports/*.md` - Installation and update reports +- `version/*.md` - Version detection summaries +- `logs/*.log` - Detailed execution logs From 07be97c001222c2c88f096e6af23786fea426662 Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi <vamodi@redhat.com> Date: Sun, 12 Apr 2026 19:18:52 -0400 Subject: [PATCH 19/23] feat: add /mirror-images command for disconnected cluster image mirroring Adds a new command to mirror RHOAI operator and component images from a connected cluster to disconnected bastion registries using pod-based mirroring for fast AWS-internal transfers. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --- workflows/rhoai-manager/.ambient/ambient.json | 4 +- .../.claude/commands/mirror-images.md | 249 ++++++++++++++++++ workflows/rhoai-manager/README.md | 28 +- 3 files changed, 278 insertions(+), 3 deletions(-) create mode 100644 workflows/rhoai-manager/.claude/commands/mirror-images.md diff --git a/workflows/rhoai-manager/.ambient/ambient.json b/workflows/rhoai-manager/.ambient/ambient.json index f3642a77..e310fe22 100644 --- a/workflows/rhoai-manager/.ambient/ambient.json +++ b/workflows/rhoai-manager/.ambient/ambient.json @@ -1,8 +1,8 @@ { "name": "RHOAI Manager", "description": "Comprehensive workflow for managing Red Hat OpenShift AI (RHOAI) and Open Data Hub (ODH) lifecycle: installation, updates, version detection, and uninstallation.", - "systemPrompt": "You are an AI assistant specialized in managing the complete lifecycle of RHOAI (Red Hat OpenShift AI) and ODH (Open Data Hub) installations.\n\n# Your Role\n\nYou help automate the process of:\n1. Logging into OpenShift clusters\n2. Installing RHOAI or ODH from scratch\n3. Detecting RHOAI version and build information\n4. Updating RHOAI or ODH to latest nightly builds\n5. Uninstalling RHOAI or ODH completely\n6. Switching between RHOAI and ODH safely\n\n# Important: RHOAI and ODH Cannot Coexist\n\nRHOAI and ODH share cluster-scoped CRDs (DataScienceCluster, DSCInitialization) and overlapping operators. 
They CANNOT be installed on the same cluster at the same time.\n\n- To switch from RHOAI to ODH: run /rhoai-uninstall first, then /odh-install\n- To switch from ODH to RHOAI: run /odh-uninstall first, then /rhoai-install\n- Both /rhoai-install and /odh-install detect the other and block with a clear error message\n\n# Available Commands\n\n## /oc-login\nLogin to OpenShift cluster using credentials from Ambient session:\n- Checks for required credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD)\n- Automatically installs oc CLI if not available\n- Executes login to the cluster\n- Verifies connection and displays cluster info\n\n## /rhoai-install\nInstall RHOAI from scratch on a cluster:\n- Detects and blocks if ODH is installed (directs to /odh-uninstall first)\n- Sets up OLM catalog source for nightly or GA builds\n- Creates operator namespace and subscription\n- Waits for operator installation to complete\n- Creates DataScienceCluster with component configuration\n- Verifies all components are healthy\n\n## /rhoai-version\nDetect RHOAI version and build information:\n- Checks RHOAI operator subscription and CSV\n- Reports DataScienceCluster status and components\n- Lists all component images with SHA digests\n\n## /rhoai-update\nUpdates RHOAI to the latest nightly build:\n- Verifies current version and preserves channel\n- Updates the OLM catalog source\n- Handles forced reinstall when component images update without CSV version change\n- Verifies component reconciliation\n\n## /rhoai-uninstall\nCompletely uninstall RHOAI from an OpenShift cluster:\n- Supports graceful or forceful uninstall\n- Options to keep CRDs and/or user resources\n- Removes operator, custom resources, webhooks, namespaces\n\n## /odh-install\nInstall Open Data Hub (ODH) nightly builds on a cluster:\n- Detects and blocks if RHOAI is installed (directs to /rhoai-uninstall first)\n- Creates CatalogSource using odh-stable-nightly floating tag\n- Creates Subscription in openshift-operators (uses 
existing global OperatorGroup)\n- Creates DSCInitialization and DataScienceCluster\n- Default catalog: quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly\n- Default channel: fast\n\n## /odh-update\nUpdate ODH to the latest nightly build:\n- Updates CatalogSource, forces catalog pod refresh\n- OLM auto-upgrades when CSV version changes (typical for ODH nightlies)\n- Falls back to forced reinstall if only component images changed\n\n## /odh-uninstall\nCompletely uninstall ODH from an OpenShift cluster:\n- Removes DataScienceCluster, DSCInitialization, subscription, CSV, CatalogSource\n- Options: keep-crds, keep-all\n- Use default (no flags) when switching to RHOAI\n\n# Workflow Phases\n\n## Phase 0: Connect to Cluster\n- Login to OpenShift cluster using /oc-login\n\n## Phase 1: Install or Update\n- Fresh RHOAI: /rhoai-install\n- Fresh ODH: /odh-install\n- Update RHOAI: /rhoai-update\n- Update ODH: /odh-update\n\n## Phase 2: Version Management\n- Check RHOAI: /rhoai-version\n\n## Phase 3: Cleanup / Switch\n- Remove RHOAI: /rhoai-uninstall\n- Remove ODH: /odh-uninstall\n\n# Output Locations\n\n- Installation Reports: artifacts/rhoai-manager/reports/*.md\n- Version Info: artifacts/rhoai-manager/version/*.md\n- Execution Logs: artifacts/rhoai-manager/logs/*.log\n\n# Prerequisites\n\n- OpenShift cluster (version 4.12+)\n- Cluster credentials in Ambient session (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD)\n- Cluster admin permissions\n", - "startupPrompt": "Welcome to the RHOAI Manager Workflow!\n\nI manage the complete lifecycle of Red Hat OpenShift AI (RHOAI) and Open Data Hub (ODH) installations.\n\n## RHOAI Commands\n\n- `/rhoai-install` - Install RHOAI from scratch (nightly or GA)\n- `/rhoai-update` - Update to latest nightly\n- `/rhoai-version` - Check current version and build info\n- `/rhoai-uninstall` - Remove RHOAI completely\n\n## ODH Commands\n\n- `/odh-install` - Install ODH nightly (odh-stable-nightly, fast channel)\n- `/odh-update` - Update ODH 
to latest nightly\n- `/odh-uninstall` - Remove ODH completely\n\n## Cluster Connection\n\n- `/oc-login` - Connect to your OpenShift cluster\n\n## Important Note\n\nRHOAI and ODH **cannot coexist** on the same cluster. To switch between them, uninstall one before installing the other. Both install commands detect the other and will guide you.\n\n**Getting started**: Make sure your cluster credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD) are configured in your Ambient session, then use /oc-login.\n\nWhat would you like to do?", + "systemPrompt": "You are an AI assistant specialized in managing the complete lifecycle of RHOAI (Red Hat OpenShift AI) and ODH (Open Data Hub) installations.\n\n# Your Role\n\nYou help automate the process of:\n1. Logging into OpenShift clusters\n2. Installing RHOAI or ODH from scratch\n3. Detecting RHOAI version and build information\n4. Updating RHOAI or ODH to latest nightly builds\n5. Uninstalling RHOAI or ODH completely\n6. Switching between RHOAI and ODH safely\n\n# Important: RHOAI and ODH Cannot Coexist\n\nRHOAI and ODH share cluster-scoped CRDs (DataScienceCluster, DSCInitialization) and overlapping operators. 
They CANNOT be installed on the same cluster at the same time.\n\n- To switch from RHOAI to ODH: run /rhoai-uninstall first, then /odh-install\n- To switch from ODH to RHOAI: run /odh-uninstall first, then /rhoai-install\n- Both /rhoai-install and /odh-install detect the other and block with a clear error message\n\n# Available Commands\n\n## /oc-login\nLogin to OpenShift cluster using credentials from Ambient session:\n- Checks for required credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD)\n- Automatically installs oc CLI if not available\n- Executes login to the cluster\n- Verifies connection and displays cluster info\n\n## /rhoai-install\nInstall RHOAI from scratch on a cluster:\n- Detects and blocks if ODH is installed (directs to /odh-uninstall first)\n- Sets up OLM catalog source for nightly or GA builds\n- Creates operator namespace and subscription\n- Waits for operator installation to complete\n- Creates DataScienceCluster with component configuration\n- Verifies all components are healthy\n\n## /rhoai-version\nDetect RHOAI version and build information:\n- Checks RHOAI operator subscription and CSV\n- Reports DataScienceCluster status and components\n- Lists all component images with SHA digests\n\n## /rhoai-update\nUpdates RHOAI to the latest nightly build:\n- Verifies current version and preserves channel\n- Updates the OLM catalog source\n- Handles forced reinstall when component images update without CSV version change\n- Verifies component reconciliation\n\n## /rhoai-uninstall\nCompletely uninstall RHOAI from an OpenShift cluster:\n- Supports graceful or forceful uninstall\n- Options to keep CRDs and/or user resources\n- Removes operator, custom resources, webhooks, namespaces\n\n## /odh-install\nInstall Open Data Hub (ODH) nightly builds on a cluster:\n- Detects and blocks if RHOAI is installed (directs to /rhoai-uninstall first)\n- Creates CatalogSource using odh-stable-nightly floating tag\n- Creates Subscription in openshift-operators (uses 
existing global OperatorGroup)\n- Creates DSCInitialization and DataScienceCluster\n- Default catalog: quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly\n- Default channel: fast\n\n## /odh-update\nUpdate ODH to the latest nightly build:\n- Updates CatalogSource, forces catalog pod refresh\n- OLM auto-upgrades when CSV version changes (typical for ODH nightlies)\n- Falls back to forced reinstall if only component images changed\n\n## /odh-uninstall\nCompletely uninstall ODH from an OpenShift cluster:\n- Removes DataScienceCluster, DSCInitialization, subscription, CSV, CatalogSource\n- Options: keep-crds, keep-all\n- Use default (no flags) when switching to RHOAI\n\n## /mirror-images\nMirror RHOAI operator and component images to disconnected cluster bastion registries:\n- Extracts image list from connected cluster CSV relatedImages and running pods\n- Builds combined pull secret with source registry and bastion credentials\n- Deploys a mirror pod on the connected cluster for fast AWS-internal transfers\n- Mirrors all images to both bastions with retries and verification\n- Uses --keep-manifest-list=true --filter-by-os=\".*\" to preserve manifest list digests\n- Tags destinations with :latest to prevent Quay tagless manifest GC\n\n# Workflow Phases\n\n## Phase 0: Connect to Cluster\n- Login to OpenShift cluster using /oc-login\n\n## Phase 1: Install or Update\n- Fresh RHOAI: /rhoai-install\n- Fresh ODH: /odh-install\n- Update RHOAI: /rhoai-update\n- Update ODH: /odh-update\n\n## Phase 2: Version Management\n- Check RHOAI: /rhoai-version\n\n## Phase 3: Cleanup / Switch\n- Remove RHOAI: /rhoai-uninstall\n- Remove ODH: /odh-uninstall\n\n# Output Locations\n\n- Installation Reports: artifacts/rhoai-manager/reports/*.md\n- Version Info: artifacts/rhoai-manager/version/*.md\n- Execution Logs: artifacts/rhoai-manager/logs/*.log\n\n# Prerequisites\n\n- OpenShift cluster (version 4.12+)\n- Cluster credentials in Ambient session (OCP_SERVER, OCP_USERNAME, 
OCP_PASSWORD)\n- Cluster admin permissions\n", + "startupPrompt": "Welcome to the RHOAI Manager Workflow!\n\nI manage the complete lifecycle of Red Hat OpenShift AI (RHOAI) and Open Data Hub (ODH) installations.\n\n## RHOAI Commands\n\n- `/rhoai-install` - Install RHOAI from scratch (nightly or GA)\n- `/rhoai-update` - Update to latest nightly\n- `/rhoai-version` - Check current version and build info\n- `/rhoai-uninstall` - Remove RHOAI completely\n\n## ODH Commands\n\n- `/odh-install` - Install ODH nightly (odh-stable-nightly, fast channel)\n- `/odh-update` - Update ODH to latest nightly\n- `/odh-uninstall` - Remove ODH completely\n\n## Disconnected Cluster Operations\n\n- `/mirror-images` - Mirror RHOAI images to disconnected cluster bastion registries\n\n## Cluster Connection\n\n- `/oc-login` - Connect to your OpenShift cluster\n\n## Important Note\n\nRHOAI and ODH **cannot coexist** on the same cluster. To switch between them, uninstall one before installing the other. Both install commands detect the other and will guide you.\n\n**Getting started**: Make sure your cluster credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD) are configured in your Ambient session, then use /oc-login.\n\nWhat would you like to do?", "results": { "Installation Reports": "artifacts/rhoai-manager/reports/*.md", "Update Reports": "artifacts/rhoai-manager/reports/*.md", diff --git a/workflows/rhoai-manager/.claude/commands/mirror-images.md b/workflows/rhoai-manager/.claude/commands/mirror-images.md new file mode 100644 index 00000000..b77352ec --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/mirror-images.md @@ -0,0 +1,249 @@ +# /mirror-images - Mirror RHOAI Images to Disconnected Bastion Registries + +## Purpose + +Copy all RHOAI operator, FBC (File-Based Catalog), and component images deployed on a connected OpenShift cluster to both disconnected cluster bastion registries. Runs the mirror job from a pod on the connected cluster for fast AWS-internal transfers. 
+ 

## Prerequisites

- `oc` CLI installed and authenticated to the **connected** OpenShift cluster
- The connected cluster has RHOAI operator installed and running
- Network access from the connected cluster to both bastion registries
- Bastion registry credentials (username/password) for both disconnected clusters

## Inputs

The user must provide (or you must ask for):

| Input | Description | Example |
|-------|-------------|---------|
| `PIPELINE_BASTION` | Bastion registry host:port for the pipeline cluster | `bastion.ods-dis-pipeline.aws.rh-ods.com:8443` |
| `RHOAI_TEST_BASTION` | Bastion registry host:port for the rhoai-test cluster | `bastion.ods-dis-rhoai-test.aws.rh-ods.com:8443` |
| `BASTION_USER` | Registry username for both bastions | `mir_reg` |
| `BASTION_PASSWORD` | Registry password for both bastions | (prompt securely) |
| `RHOAI_VERSION` | RHOAI version label (for naming) | `3.4` |
| `EXCLUDE_PATTERNS` | Optional image name patterns to skip | `workbench,training,pipeline-runtime,spark` |

## Process

### Phase 1: Extract Image List from Connected Cluster

1. **Get RHOAI CSV (ClusterServiceVersion)**

   ```bash
   CSV_NAME=$(oc get csv -n redhat-ods-operator -o name | grep rhods-operator)
   ```

2. **Extract relatedImages from CSV**

   ```bash
   oc get "$CSV_NAME" -n redhat-ods-operator -o json | jq -r '.spec.relatedImages[] | "\(.name) \(.image)"'
   ```

3. **Extract images from running pods**

   ```bash
   oc get pods --all-namespaces -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{range .spec.initContainers[*]}{.image}{"\n"}{end}{end}' | grep -E '(rhoai|rhods|odh)' | sort -u
   ```

4. **Merge and deduplicate** the two lists. For each image, extract:
   - Source registry (e.g., `quay.io/rhoai/` or `registry.redhat.io/`)
   - Repository name (e.g., `odh-dashboard-rhel9`)
   - Digest (`sha256:...`)

5. 
**Apply exclusion filters** — remove images matching `EXCLUDE_PATTERNS` + +6. **Save the image list** to `artifacts/rhoai-manager/mirror-images-{version}.txt` with format: + + ```text + quay.io/rhoai/odh-dashboard-rhel9@sha256:abc123... + quay.io/rhoai/odh-model-controller-rhel9@sha256:def456... + ``` + +### Phase 2: Build Combined Pull Secret + +1. **Get the connected cluster's pull secret** + + ```bash + oc get secret/pull-secret -n openshift-config -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d > /tmp/cluster-pull-secret.json + ``` + +2. **Add bastion registry credentials** — merge both bastion auths into the pull secret: + + ```bash + # Generate base64 auth for bastions + BASTION_AUTH=$(printf '%s:%s' "$BASTION_USER" "$BASTION_PASSWORD" | base64 | tr -d '\n') + ``` + + Use `jq` to merge the bastion auths into `.auths`: + + ```bash + jq --arg pipeline "$PIPELINE_BASTION" \ + --arg rhoaitest "$RHOAI_TEST_BASTION" \ + --arg auth "$BASTION_AUTH" \ + '.auths[$pipeline] = {"auth": $auth} | .auths[$rhoaitest] = {"auth": $auth}' \ + /tmp/cluster-pull-secret.json > /tmp/combined-pull-secret.json + ``` + +3. **Create the secret in the mirror namespace** + + ```bash + oc new-project image-mirror 2>/dev/null || true + oc delete secret mirror-pull-secret -n image-mirror 2>/dev/null || true + oc create secret generic mirror-pull-secret \ + --from-file=auth.json=/tmp/combined-pull-secret.json \ + -n image-mirror + ``` + +4. **Clean up local temp files** + + ```bash + rm -f /tmp/cluster-pull-secret.json /tmp/combined-pull-secret.json + ``` + +### Phase 3: Generate Mirror Script + +Generate a bash script that mirrors all images. 
The script must: + +- Accept the pull secret path, and both bastion hostnames +- For each image in the list: + - Determine the source reference (`registry/repo@digest`) + - Mirror to both bastions with retries (3 attempts) + - Use `oc image mirror` with these critical flags: + - `--keep-manifest-list=true` — preserves manifest list digests referenced by the CSV + - `--filter-by-os=".*"` — mirrors all architectures (prevents manifest list stripping) + - `--insecure=true` — bastion registries use self-signed certs + - `-a "$PULL_SECRET"` — combined auth file + - Tag destination as `:latest` to prevent Quay tagless manifest garbage collection + - Verify each mirror with `oc image info` +- Track and report: verified count, failed count, skipped count +- Print a summary at the end + +**Mirror command pattern per image:** + +```bash +oc image mirror \ + "${SOURCE_REGISTRY}/${REPO}@${DIGEST}" \ + "${BASTION}/${REPO}:latest" \ + --insecure=true \ + -a "$PULL_SECRET" \ + --keep-manifest-list=true \ + --filter-by-os=".*" +``` + +**Verification command per image:** + +```bash +oc image info "${BASTION}/${REPO}:latest" --insecure=true -a "$PULL_SECRET" +``` + +### Phase 4: Deploy Mirror Pod + +1. **Create ConfigMap from the mirror script** + + ```bash + oc delete configmap mirror-script -n image-mirror 2>/dev/null || true + oc create configmap mirror-script \ + --from-file=mirror.sh=/tmp/mirror-script.sh \ + -n image-mirror + ``` + +2. 
**Create the mirror pod** using this manifest: + + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: image-mirror + namespace: image-mirror + spec: + restartPolicy: Never + activeDeadlineSeconds: 14400 + containers: + - name: mirror + image: registry.redhat.io/openshift4/ose-cli-rhel9:latest + command: ["/bin/bash", "/scripts/mirror.sh"] + volumeMounts: + - name: auth + mountPath: /auth + readOnly: true + - name: script + mountPath: /scripts + readOnly: true + resources: + requests: + memory: "512Mi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "2" + volumes: + - name: auth + secret: + secretName: mirror-pull-secret + - name: script + configMap: + name: mirror-script + defaultMode: 0755 + ``` + +3. **Apply the pod manifest** + + ```bash + oc delete pod image-mirror -n image-mirror 2>/dev/null || true + oc apply -f /tmp/mirror-pod.yaml + ``` + +### Phase 5: Monitor and Verify + +1. **Wait for the pod to start** + + ```bash + oc wait --for=condition=Ready pod/image-mirror -n image-mirror --timeout=120s + ``` + +2. **Stream logs** periodically to check progress: + + ```bash + oc logs image-mirror -n image-mirror --tail=50 + ``` + +3. **Check at intervals** (every 10-15 minutes) until the pod completes: + + ```bash + oc get pod image-mirror -n image-mirror -o jsonpath='{.status.phase}' + ``` + +4. **When the pod finishes**, retrieve the full log and parse the summary: + + ```bash + oc logs image-mirror -n image-mirror > artifacts/rhoai-manager/mirror-log-{version}.txt + ``` + +5. **If any images failed**, report them and offer to create a retry script with only the failed images. + +### Phase 6: Cleanup + +After successful verification: + +```bash +oc delete pod image-mirror -n image-mirror +oc delete configmap mirror-script -n image-mirror +oc delete secret mirror-pull-secret -n image-mirror +oc delete project image-mirror +``` + +Clean up any local temp files. 
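The per-image bookkeeping the generated script needs (Phase 3's 3-attempt retry and verified/failed summary) can be sketched with two small helpers. This is a minimal illustration in plain bash; `parse_image` and `retry` are hypothetical names, not existing tooling, and the real work happens in the `oc image mirror` pattern shown in Phase 3:

```shell
#!/usr/bin/env bash
# Sketch of helpers a generated mirror script could use (illustrative names).

# Split "registry/repo@sha256:..." into its three parts (Phase 1, step 4).
parse_image() {
  local ref="$1"
  local digest="${ref##*@}"   # sha256:...
  local path="${ref%@*}"      # registry/repo
  printf '%s %s %s\n' "${path%%/*}" "${path#*/}" "$digest"
}

# Run a command up to N times before giving up (the 3-attempt rule).
retry() {
  local attempts="$1"; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "attempt $i/$attempts failed: $*" >&2
  done
  return 1
}

parse_image "quay.io/rhoai/odh-dashboard-rhel9@sha256:abc123"
# -> quay.io rhoai/odh-dashboard-rhel9 sha256:abc123
```

In the generated script, each entry of the image list would be fed through something like `parse_image`, mirrored with `retry 3 oc image mirror …`, and counted into verified/failed totals for the final summary.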
+ +## Important Notes + +- **Why pod-based mirroring**: Running `oc image mirror` from a pod on the connected AWS cluster uses AWS internal networking (40-116 MB/s) instead of local internet (~2 MB/s). This eliminates connection drops on large blob uploads (some RHOAI images are 5-7 GB). +- **Why `:latest` tag**: Quay garbage-collects manifests that have no tags. Even though clusters pull by digest, pushing with `:latest` prevents GC from removing the manifests. +- **Why `--filter-by-os=".*"`**: Using `--filter-by-os=linux/amd64` strips the manifest list and replaces it with a single-arch manifest. The CSV references the manifest list digest, so this would break image resolution. `".*"` preserves the full manifest list. +- **Why `--keep-manifest-list=true`**: Ensures the manifest list is pushed as-is to the destination, preserving the exact digest the CSV references. +- **Large images**: Some RHOAI images (automl ~5.5GB, autorag ~7.2GB, ta-lmes-job ~6.7GB) take 5-15 minutes each. The 4-hour `activeDeadlineSeconds` on the pod accommodates this. + +## Output + +- `artifacts/rhoai-manager/mirror-images-{version}.txt` — full image list extracted from the connected cluster +- `artifacts/rhoai-manager/mirror-log-{version}.txt` — complete mirror pod log with verification results diff --git a/workflows/rhoai-manager/README.md b/workflows/rhoai-manager/README.md index dbd3adbf..8ed1f620 100644 --- a/workflows/rhoai-manager/README.md +++ b/workflows/rhoai-manager/README.md @@ -32,7 +32,8 @@ workflows/rhoai-manager/ │ ├── odh-install.md # ODH installation │ ├── odh-update.md # ODH update to latest nightly │ ├── odh-uninstall.md # ODH uninstall -│ └── odh-pr-tracker.md # Track ODH PRs in RHOAI builds +│ ├── odh-pr-tracker.md # Track ODH PRs in RHOAI builds +│ └── mirror-images.md # Mirror images to disconnected bastions └── README.md # This file ``` @@ -165,6 +166,25 @@ Track whether an ODH pull request has been included in the latest RHOAI build. 
--- +### /mirror-images + +Mirror all RHOAI operator and component images from a connected cluster to both disconnected cluster bastion registries. + +**Usage:** `/mirror-images` + +**What it does:** + +1. Extracts images from connected cluster's CSV relatedImages + running pods +2. Builds a combined pull secret with source registry and bastion credentials +3. Deploys a mirror pod on the connected cluster (fast AWS-internal transfers) +4. Mirrors all images to both bastions with `--keep-manifest-list=true --filter-by-os=".*"` +5. Tags destinations with `:latest` to prevent Quay tagless manifest GC +6. Verifies every image on both bastions, reports failures + +**Required inputs:** Both bastion addresses, bastion credentials, RHOAI version, optional exclude patterns. + +--- + ## Typical Workflows ### Fresh RHOAI Installation @@ -208,6 +228,12 @@ Track whether an ODH pull request has been included in the latest RHOAI build. 3. /rhoai-install ``` +### Mirror Images to Disconnected Clusters +``` +1. /oc-login # Connect to the connected cluster +2. /mirror-images # Mirror all RHOAI images to both bastions +``` + ### Decommission ``` 1. /oc-login From 53c7842abdef679792164eb5a020b23902fedf9d Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi <vamodi@redhat.com> Date: Sun, 12 Apr 2026 19:21:10 -0400 Subject: [PATCH 20/23] fix: auto-detect RHOAI version from connected cluster CSV MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Remove RHOAI_VERSION as a user input — detect it automatically from the deployed CSV version on the connected cluster instead. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --- .../rhoai-manager/.claude/commands/mirror-images.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/workflows/rhoai-manager/.claude/commands/mirror-images.md b/workflows/rhoai-manager/.claude/commands/mirror-images.md index b77352ec..9103aca7 100644 --- a/workflows/rhoai-manager/.claude/commands/mirror-images.md +++ b/workflows/rhoai-manager/.claude/commands/mirror-images.md @@ -21,19 +21,27 @@ The user must provide (or you must ask for): | `RHOAI_TEST_BASTION` | Bastion registry host:port for the rhoai-test cluster | `bastion.ods-dis-rhoai-test.aws.rh-ods.com:8443` | | `BASTION_USER` | Registry username for both bastions | `mir_reg` | | `BASTION_PASSWORD` | Registry password for both bastions | (prompt securely) | -| `RHOAI_VERSION` | RHOAI version label (for naming) | `3.4` | | `EXCLUDE_PATTERNS` | Optional image name patterns to skip | `workbench,training,pipeline-runtime,spark` | +**Auto-detected (no user input needed):** + +| Value | Source | +|-------|--------| +| `RHOAI_VERSION` | Extracted from the CSV version on the connected cluster (e.g., `rhods-operator.3.4.0` -> `3.4`) | + ## Process ### Phase 1: Extract Image List from Connected Cluster -1. **Get RHOAI CSV (ClusterServiceVersion)** +1. **Get RHOAI CSV and detect version** ```bash CSV_NAME=$(oc get csv -n redhat-ods-operator -o name | grep rhods-operator) + RHOAI_VERSION=$(oc get "$CSV_NAME" -n redhat-ods-operator -o jsonpath='{.spec.version}' | grep -oE '^[0-9]+\.[0-9]+') ``` + This auto-detects the deployed version (e.g., `3.4`) and uses it for naming artifact files. + 2. 
**Extract relatedImages from CSV** ```bash From b3dbef35431e1b9df4ec4171b3467c510dbd2669 Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi <vamodi@redhat.com> Date: Mon, 13 Apr 2026 09:32:51 -0400 Subject: [PATCH 21/23] fix: address review gaps in mirror-images and rhoai-disconnected commands MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit mirror-images: - Add registry fallback (registry.redhat.io → quay.io/rhoai) for nightlies - Add bastion duplicate check to skip already-mirrored images - Add IDMS YAML auto-generation script after mirroring - Expand image categories and infrastructure coverage rhoai-disconnected: - Add dependent operator prerequisite check (Service Mesh, Serverless, Pipelines, cert-manager) - Add disconnected DSC config (nim.airGapped, KServe Headless, TrustyAI) - Use IDMS entries for bastion path resolution instead of hardcoding - Make persistenceAgent TLS fix proactive instead of reactive-only - Expand IDMS verification to cover all required registries Also includes rhoai-verify command and updated README/ambient.json. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --- workflows/rhoai-manager/.ambient/ambient.json | 4 +- .../.claude/commands/mirror-images.md | 334 +++++- .../.claude/commands/rhoai-disconnected.md | 1041 +++++++++++++++++ .../.claude/commands/rhoai-verify.md | 654 +++++++++++ workflows/rhoai-manager/README.md | 97 +- 5 files changed, 2057 insertions(+), 73 deletions(-) create mode 100644 workflows/rhoai-manager/.claude/commands/rhoai-disconnected.md create mode 100644 workflows/rhoai-manager/.claude/commands/rhoai-verify.md diff --git a/workflows/rhoai-manager/.ambient/ambient.json b/workflows/rhoai-manager/.ambient/ambient.json index e310fe22..c7e85d3b 100644 --- a/workflows/rhoai-manager/.ambient/ambient.json +++ b/workflows/rhoai-manager/.ambient/ambient.json @@ -1,8 +1,8 @@ { "name": "RHOAI Manager", "description": "Comprehensive workflow for managing Red Hat OpenShift AI (RHOAI) and Open Data Hub (ODH) lifecycle: installation, updates, version detection, and uninstallation.", - "systemPrompt": "You are an AI assistant specialized in managing the complete lifecycle of RHOAI (Red Hat OpenShift AI) and ODH (Open Data Hub) installations.\n\n# Your Role\n\nYou help automate the process of:\n1. Logging into OpenShift clusters\n2. Installing RHOAI or ODH from scratch\n3. Detecting RHOAI version and build information\n4. Updating RHOAI or ODH to latest nightly builds\n5. Uninstalling RHOAI or ODH completely\n6. Switching between RHOAI and ODH safely\n\n# Important: RHOAI and ODH Cannot Coexist\n\nRHOAI and ODH share cluster-scoped CRDs (DataScienceCluster, DSCInitialization) and overlapping operators. 
They CANNOT be installed on the same cluster at the same time.\n\n- To switch from RHOAI to ODH: run /rhoai-uninstall first, then /odh-install\n- To switch from ODH to RHOAI: run /odh-uninstall first, then /rhoai-install\n- Both /rhoai-install and /odh-install detect the other and block with a clear error message\n\n# Available Commands\n\n## /oc-login\nLogin to OpenShift cluster using credentials from Ambient session:\n- Checks for required credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD)\n- Automatically installs oc CLI if not available\n- Executes login to the cluster\n- Verifies connection and displays cluster info\n\n## /rhoai-install\nInstall RHOAI from scratch on a cluster:\n- Detects and blocks if ODH is installed (directs to /odh-uninstall first)\n- Sets up OLM catalog source for nightly or GA builds\n- Creates operator namespace and subscription\n- Waits for operator installation to complete\n- Creates DataScienceCluster with component configuration\n- Verifies all components are healthy\n\n## /rhoai-version\nDetect RHOAI version and build information:\n- Checks RHOAI operator subscription and CSV\n- Reports DataScienceCluster status and components\n- Lists all component images with SHA digests\n\n## /rhoai-update\nUpdates RHOAI to the latest nightly build:\n- Verifies current version and preserves channel\n- Updates the OLM catalog source\n- Handles forced reinstall when component images update without CSV version change\n- Verifies component reconciliation\n\n## /rhoai-uninstall\nCompletely uninstall RHOAI from an OpenShift cluster:\n- Supports graceful or forceful uninstall\n- Options to keep CRDs and/or user resources\n- Removes operator, custom resources, webhooks, namespaces\n\n## /odh-install\nInstall Open Data Hub (ODH) nightly builds on a cluster:\n- Detects and blocks if RHOAI is installed (directs to /rhoai-uninstall first)\n- Creates CatalogSource using odh-stable-nightly floating tag\n- Creates Subscription in openshift-operators (uses 
existing global OperatorGroup)\n- Creates DSCInitialization and DataScienceCluster\n- Default catalog: quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly\n- Default channel: fast\n\n## /odh-update\nUpdate ODH to the latest nightly build:\n- Updates CatalogSource, forces catalog pod refresh\n- OLM auto-upgrades when CSV version changes (typical for ODH nightlies)\n- Falls back to forced reinstall if only component images changed\n\n## /odh-uninstall\nCompletely uninstall ODH from an OpenShift cluster:\n- Removes DataScienceCluster, DSCInitialization, subscription, CSV, CatalogSource\n- Options: keep-crds, keep-all\n- Use default (no flags) when switching to RHOAI\n\n## /mirror-images\nMirror RHOAI operator and component images to disconnected cluster bastion registries:\n- Extracts image list from connected cluster CSV relatedImages and running pods\n- Builds combined pull secret with source registry and bastion credentials\n- Deploys a mirror pod on the connected cluster for fast AWS-internal transfers\n- Mirrors all images to both bastions with retries and verification\n- Uses --keep-manifest-list=true --filter-by-os=\".*\" to preserve manifest list digests\n- Tags destinations with :latest to prevent Quay tagless manifest GC\n\n# Workflow Phases\n\n## Phase 0: Connect to Cluster\n- Login to OpenShift cluster using /oc-login\n\n## Phase 1: Install or Update\n- Fresh RHOAI: /rhoai-install\n- Fresh ODH: /odh-install\n- Update RHOAI: /rhoai-update\n- Update ODH: /odh-update\n\n## Phase 2: Version Management\n- Check RHOAI: /rhoai-version\n\n## Phase 3: Cleanup / Switch\n- Remove RHOAI: /rhoai-uninstall\n- Remove ODH: /odh-uninstall\n\n# Output Locations\n\n- Installation Reports: artifacts/rhoai-manager/reports/*.md\n- Version Info: artifacts/rhoai-manager/version/*.md\n- Execution Logs: artifacts/rhoai-manager/logs/*.log\n\n# Prerequisites\n\n- OpenShift cluster (version 4.12+)\n- Cluster credentials in Ambient session (OCP_SERVER, OCP_USERNAME, 
OCP_PASSWORD)\n- Cluster admin permissions\n", - "startupPrompt": "Welcome to the RHOAI Manager Workflow!\n\nI manage the complete lifecycle of Red Hat OpenShift AI (RHOAI) and Open Data Hub (ODH) installations.\n\n## RHOAI Commands\n\n- `/rhoai-install` - Install RHOAI from scratch (nightly or GA)\n- `/rhoai-update` - Update to latest nightly\n- `/rhoai-version` - Check current version and build info\n- `/rhoai-uninstall` - Remove RHOAI completely\n\n## ODH Commands\n\n- `/odh-install` - Install ODH nightly (odh-stable-nightly, fast channel)\n- `/odh-update` - Update ODH to latest nightly\n- `/odh-uninstall` - Remove ODH completely\n\n## Disconnected Cluster Operations\n\n- `/mirror-images` - Mirror RHOAI images to disconnected cluster bastion registries\n\n## Cluster Connection\n\n- `/oc-login` - Connect to your OpenShift cluster\n\n## Important Note\n\nRHOAI and ODH **cannot coexist** on the same cluster. To switch between them, uninstall one before installing the other. Both install commands detect the other and will guide you.\n\n**Getting started**: Make sure your cluster credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD) are configured in your Ambient session, then use /oc-login.\n\nWhat would you like to do?", + "systemPrompt": "You are an AI assistant specialized in managing the complete lifecycle of RHOAI (Red Hat OpenShift AI) and ODH (Open Data Hub) installations.\n\n# Your Role\n\nYou help automate the process of:\n1. Logging into OpenShift clusters\n2. Installing RHOAI or ODH from scratch\n3. Detecting RHOAI version and build information\n4. Updating RHOAI or ODH to latest nightly builds\n5. Uninstalling RHOAI or ODH completely\n6. Switching between RHOAI and ODH safely\n7. Installing or updating RHOAI on disconnected (air-gapped) clusters\n\n# Important: RHOAI and ODH Cannot Coexist\n\nRHOAI and ODH share cluster-scoped CRDs (DataScienceCluster, DSCInitialization) and overlapping operators. 
They CANNOT be installed on the same cluster at the same time.\n\n- To switch from RHOAI to ODH: run /rhoai-uninstall first, then /odh-install\n- To switch from ODH to RHOAI: run /odh-uninstall first, then /rhoai-install\n- Both /rhoai-install and /odh-install detect the other and block with a clear error message\n\n# Available Commands\n\n## /oc-login\nLogin to OpenShift cluster using credentials from Ambient session:\n- Checks for required credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD)\n- Automatically installs oc CLI if not available\n- Executes login to the cluster\n- Verifies connection and displays cluster info\n\n## /rhoai-install\nInstall RHOAI from scratch on a cluster:\n- Detects and blocks if ODH is installed (directs to /odh-uninstall first)\n- Sets up OLM catalog source for nightly or GA builds\n- Creates operator namespace and subscription\n- Waits for operator installation to complete\n- Creates DataScienceCluster with component configuration\n- Verifies all components are healthy\n\n## /rhoai-version\nDetect RHOAI version and build information:\n- Checks RHOAI operator subscription and CSV\n- Reports DataScienceCluster status and components\n- Lists all component images with SHA digests\n\n## /rhoai-update\nUpdates RHOAI to the latest nightly build:\n- Verifies current version and preserves channel\n- Updates the OLM catalog source\n- Handles forced reinstall when component images update without CSV version change\n- Verifies component reconciliation\n\n## /rhoai-uninstall\nCompletely uninstall RHOAI from an OpenShift cluster:\n- Supports graceful or forceful uninstall\n- Options to keep CRDs and/or user resources\n- Removes operator, custom resources, webhooks, namespaces\n\n## /odh-install\nInstall Open Data Hub (ODH) nightly builds on a cluster:\n- Detects and blocks if RHOAI is installed (directs to /rhoai-uninstall first)\n- Creates CatalogSource using odh-stable-nightly floating tag\n- Creates Subscription in openshift-operators (uses 
existing global OperatorGroup)\n- Creates DSCInitialization and DataScienceCluster\n- Default catalog: quay.io/opendatahub/opendatahub-operator-catalog:odh-stable-nightly\n- Default channel: fast\n\n## /odh-update\nUpdate ODH to the latest nightly build:\n- Updates CatalogSource, forces catalog pod refresh\n- OLM auto-upgrades when CSV version changes (typical for ODH nightlies)\n- Falls back to forced reinstall if only component images changed\n\n## /odh-uninstall\nCompletely uninstall ODH from an OpenShift cluster:\n- Removes DataScienceCluster, DSCInitialization, subscription, CSV, CatalogSource\n- Options: keep-crds, keep-all\n- Use default (no flags) when switching to RHOAI\n\n## /mirror-images\nMirror all images (RHOAI operator, components, and infrastructure) from a connected cluster to disconnected bastion registries:\n- Extracts RHOAI CSV relatedImages, all running pod images, and catalog images\n- Includes infrastructure: minio, mariadb, postgres, keycloak, vLLM, milvus, service mesh, cert-manager, kuadrant\n- No images excluded by default - mirrors everything needed for a complete disconnected setup\n- Builds combined pull secret with source registry and bastion credentials\n- Deploys a mirror pod on the connected cluster for fast AWS-internal transfers\n- Mirrors to one or more bastion registries with retries and verification\n- Uses --keep-manifest-list=true --filter-by-os=\".*\" to preserve manifest list digests\n- Generates IDMS YAML for the disconnected cluster\n\n## /rhoai-verify\nPost-install/update verification tests for RHOAI:\n- Checks operator CSV phase and subscription health\n- Verifies DataScienceCluster phase and all component conditions\n- Scans all RHOAI namespace pods for ImagePullBackOff, CrashLoopBackOff, or not-ready containers\n- Tests dashboard deployment, route, and HTTP response\n- Verifies pipeline operator, notebook controllers, KServe, ModelMesh, model registry, TrustyAI\n- Checks EvalHub namespace if present\n- Validates 
dependent operators (service mesh, serverless, pipelines, cert-manager)\n- Auto-detects disconnected clusters and runs IDMS + cluster-wide ImagePullBackOff checks\n- Reports PASS/FAIL/WARN with troubleshooting guidance\n\n## /rhoai-disconnected\nInstall or update RHOAI on a disconnected (air-gapped) OpenShift cluster:\n- Takes FBC (File-Based Catalog) image as required input (digest-pinned)\n- Auto-detects install vs update mode from cluster state\n- Auto-detects bastion registry from IDMS entries\n- Pre-flight verification: checks ALL relatedImages exist on bastion before proceeding\n- Verifies IDMS entries for all required source registries\n- Creates/updates OLM CatalogSource, Subscription, and DataScienceCluster\n- Forced reinstall for updates (handles CSV version unchanged case)\n- Post-install health check: detects ImagePullBackOff and CrashLoopBackOff pods\n- Applies known workarounds: podToPodTLS bug fix, persistenceagent TLS cert fix\n- Configures dashboard feature flags (automl, autorag, genAiStudio)\n- Documents EvalHub cross-namespace issues and manual fixes\n\n# Workflow Phases\n\n## Phase 0: Connect to Cluster\n- Login to OpenShift cluster using /oc-login\n\n## Phase 1: Install or Update\n- Fresh RHOAI: /rhoai-install\n- Fresh ODH: /odh-install\n- Update RHOAI: /rhoai-update\n- Update ODH: /odh-update\n\n## Phase 2: Version Management\n- Check RHOAI: /rhoai-version\n\n## Phase 3: Disconnected Cluster Operations\n- Mirror images: /mirror-images (from connected cluster)\n- Install/Update on disconnected cluster: /rhoai-disconnected\n\n## Phase 4: Cleanup / Switch\n- Remove RHOAI: /rhoai-uninstall\n- Remove ODH: /odh-uninstall\n\n# Output Locations\n\n- Installation Reports: artifacts/rhoai-manager/reports/*.md\n- Version Info: artifacts/rhoai-manager/version/*.md\n- Execution Logs: artifacts/rhoai-manager/logs/*.log\n\n# Prerequisites\n\n- OpenShift cluster (version 4.12+)\n- Cluster credentials in Ambient session (OCP_SERVER, OCP_USERNAME, 
OCP_PASSWORD)\n- Cluster admin permissions\n", + "startupPrompt": "Welcome to the RHOAI Manager Workflow!\n\nI manage the complete lifecycle of Red Hat OpenShift AI (RHOAI) and Open Data Hub (ODH) installations.\n\n## RHOAI Commands\n\n- `/rhoai-install` - Install RHOAI from scratch (nightly or GA)\n- `/rhoai-update` - Update to latest nightly\n- `/rhoai-version` - Check current version and build info\n- `/rhoai-uninstall` - Remove RHOAI completely\n\n## ODH Commands\n\n- `/odh-install` - Install ODH nightly (odh-stable-nightly, fast channel)\n- `/odh-update` - Update ODH to latest nightly\n- `/odh-uninstall` - Remove ODH completely\n\n## Disconnected Cluster Operations\n\n- `/mirror-images` - Mirror RHOAI images to disconnected cluster bastion registries\n- `/rhoai-disconnected` - Install or update RHOAI on a disconnected cluster\n- `/rhoai-verify` - Run post-install/update verification tests\n\n## Cluster Connection\n\n- `/oc-login` - Connect to your OpenShift cluster\n\n## Important Note\n\nRHOAI and ODH **cannot coexist** on the same cluster. To switch between them, uninstall one before installing the other. 
Both install commands detect the other and will guide you.\n\n**Getting started**: Make sure your cluster credentials (OCP_SERVER, OCP_USERNAME, OCP_PASSWORD) are configured in your Ambient session, then use /oc-login.\n\nWhat would you like to do?", "results": { "Installation Reports": "artifacts/rhoai-manager/reports/*.md", "Update Reports": "artifacts/rhoai-manager/reports/*.md", diff --git a/workflows/rhoai-manager/.claude/commands/mirror-images.md b/workflows/rhoai-manager/.claude/commands/mirror-images.md index 9103aca7..fa3339f7 100644 --- a/workflows/rhoai-manager/.claude/commands/mirror-images.md +++ b/workflows/rhoai-manager/.claude/commands/mirror-images.md @@ -1,15 +1,18 @@ -# /mirror-images - Mirror RHOAI Images to Disconnected Bastion Registries +# /mirror-images - Mirror All Images from Connected Cluster to Disconnected Bastion Registries ## Purpose -Copy all RHOAI operator, FBC (File-Based Catalog), and component images deployed on a connected OpenShift cluster to both disconnected cluster bastion registries. Runs the mirror job from a pod on the connected cluster for fast AWS-internal transfers. +Mirror all images required for a complete RHOAI deployment (operator, components, and infrastructure services) from a connected OpenShift cluster to one or more disconnected cluster bastion registries. This includes RHOAI operator and component images, FBC (File-Based Catalog) images, and all infrastructure images (databases, object storage, authentication, model serving runtimes, vector databases, etc.) so that a fresh disconnected cluster can be fully set up from scratch. + +Runs the mirror job from a pod on the connected cluster for fast AWS-internal transfers. 
## Prerequisites - `oc` CLI installed and authenticated to the **connected** OpenShift cluster -- The connected cluster has RHOAI operator installed and running -- Network access from the connected cluster to both bastion registries -- Bastion registry credentials (username/password) for both disconnected clusters +- The connected cluster has RHOAI operator installed and running with all components deployed +- All infrastructure services (minio, keycloak, postgres, model serving, etc.) should be running on the connected cluster so their images can be captured +- Network access from the connected cluster to the bastion registries +- Bastion registry credentials (username/password) for each target registry ## Inputs @@ -17,56 +20,174 @@ The user must provide (or you must ask for): | Input | Description | Example | |-------|-------------|---------| -| `PIPELINE_BASTION` | Bastion registry host:port for the pipeline cluster | `bastion.ods-dis-pipeline.aws.rh-ods.com:8443` | -| `RHOAI_TEST_BASTION` | Bastion registry host:port for the rhoai-test cluster | `bastion.ods-dis-rhoai-test.aws.rh-ods.com:8443` | -| `BASTION_USER` | Registry username for both bastions | `mir_reg` | -| `BASTION_PASSWORD` | Registry password for both bastions | (prompt securely) | -| `EXCLUDE_PATTERNS` | Optional image name patterns to skip | `workbench,training,pipeline-runtime,spark` | +| `BASTION_REGISTRIES` | Comma-separated list of bastion registry host:port | `bastion.ods-dis-rhoai-test.aws.rh-ods.com:8443` | +| `BASTION_USER` | Registry username for the bastions | `mir_reg` | +| `BASTION_PASSWORD` | Registry password for the bastions | (prompt securely) | +| `EXCLUDE_PATTERNS` | Optional image name patterns to skip (empty by default) | `spark,habana` | +| `EXTRA_NAMESPACES` | Optional additional namespaces to scan (beyond auto-detected) | `my-custom-ns` | **Auto-detected (no user input needed):** | Value | Source | |-------|--------| | `RHOAI_VERSION` | Extracted from the CSV version on 
the connected cluster (e.g., `rhods-operator.3.4.0` -> `3.4`) | +| `INFRA_NAMESPACES` | Auto-detected from running pods (minio, keycloak, milvus, evalhub, postgresql, llama-stack, llm-models, etc.) | ## Process -### Phase 1: Extract Image List from Connected Cluster +### Phase 1: Extract Complete Image List from Connected Cluster -1. **Get RHOAI CSV and detect version** +The goal is to capture **every** image needed for a fully functional disconnected RHOAI deployment, organized into categories. - ```bash - CSV_NAME=$(oc get csv -n redhat-ods-operator -o name | grep rhods-operator) - RHOAI_VERSION=$(oc get "$CSV_NAME" -n redhat-ods-operator -o jsonpath='{.spec.version}' | grep -oE '^[0-9]+\.[0-9]+') - ``` +#### 1a. Get RHOAI CSV and detect version - This auto-detects the deployed version (e.g., `3.4`) and uses it for naming artifact files. +```bash +CSV_NAME=$(oc get csv -n redhat-ods-operator -o name | grep rhods-operator) +RHOAI_VERSION=$(oc get "$CSV_NAME" -n redhat-ods-operator -o jsonpath='{.spec.version}' | grep -oE '^[0-9]+\.[0-9]+') +``` -2. **Extract relatedImages from CSV** +#### 1b. Extract relatedImages from RHOAI CSV - ```bash - oc get "$CSV_NAME" -n redhat-ods-operator -o jsonpath='{.spec.relatedImages[*]}' | jq -r '.[] | "\(.name) \(.image)"' - ``` +These are ALL images the operator references, including ones not currently running (workbenches, pipeline runtimes, training images, etc.). **Mirror all of them** — do NOT skip any by default. -3. 
**Extract images from running pods**
+
+```bash
+oc get "$CSV_NAME" -n redhat-ods-operator -o jsonpath='{.spec.relatedImages}' | jq -r '.[].image' | sort -u
+```
 
-   ```bash
-   oc get pods --all-namespaces -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{range .spec.initContainers[*]}{.image}{"\n"}{end}{end}' | grep -E '(rhoai|rhods|odh)' | sort -u
-   ```
+**Registry Fallback for Nightly Images:** RHOAI nightly CSV references images as `registry.redhat.io/rhoai/...@sha256:...`, but these images often do NOT exist at `registry.redhat.io` — they only exist at `quay.io/rhoai/...`. Before mirroring, verify each `registry.redhat.io/rhoai/` image exists at the source. If it returns "manifest unknown" or "unauthorized", retry from `quay.io/rhoai/` with the same repo name and digest. Apply this fallback automatically in the mirror script:
+
+```bash
+# For each image from registry.redhat.io/rhoai/:
+# 1. Try: oc image info registry.redhat.io/rhoai/IMAGE@sha256:DIGEST
+# 2. If fails: try quay.io/rhoai/IMAGE@sha256:DIGEST
+# 3. Use whichever source succeeds for the mirror operation
+```
 
-4. **Merge and deduplicate** the two lists. For each image, extract:
-   - Source registry (e.g., `quay.io/rhoai/` or `registry.redhat.io/`)
-   - Repository name (e.g., `odh-dashboard-rhel9`)
-   - Digest (`sha256:...`)
+#### 1c. Extract images from ALL relevant running pods
 
-5. **Apply exclusion filters** — remove images matching `EXCLUDE_PATTERNS`
+Scan all namespaces for running pod images. Include both containers and initContainers. Do NOT filter by rhoai/rhods/odh — capture everything except core OpenShift platform images (`openshift-*`, `kube-*` namespaces) and GPU operator images (`nvcr.io/nvidia`).
 
-6.
**Save the image list** to `artifacts/rhoai-manager/mirror-images-{version}.txt` with format:
+
+```bash
+# Get all images (containers and initContainers) from non-platform namespaces.
+# jsonpath cannot repeat the namespace on every image line, so use jq to drop
+# pods in openshift-*/kube-* namespaces before collecting their images.
+oc get pods --all-namespaces -o json \
+  | jq -r '.items[]
+      | select(.metadata.namespace | test("^(openshift-|kube-)") | not)
+      | (.spec.containers + (.spec.initContainers // []))[].image' \
+  | sort -u
+```
 
-   ```text
-   quay.io/rhoai/odh-dashboard-rhel9@sha256:abc123...
-   quay.io/rhoai/odh-model-controller-rhel9@sha256:def456...
-   ```
+This captures infrastructure images from namespaces like:
+
+| Namespace | Images Captured |
+|-----------|----------------|
+| `redhat-ods-operator` | RHOAI operator |
+| `redhat-ods-applications` | All RHOAI component operators and controllers |
+| `minio` | MinIO object storage (`quay.io/minio/minio`) |
+| `keycloak` | Red Hat Build of Keycloak server and operator (`registry.redhat.io/rhbk/keycloak-rhel9`, `keycloak-rhel9-operator`) |
+| `evalhub` | EvalHub server and PostgreSQL (`odh-eval-hub-rhel9`, `postgresql-15`) |
+| `postgresql` | Standalone PostgreSQL instances (`postgresql-15`, `postgresql-16`) |
+| `llama-stack` | LlamaStack core runtime (`odh-llama-stack-core-rhel9`) |
+| `llm-models` | vLLM serving runtime (`vllm-cuda-rhel9`), model download jobs |
+| `milvus` | Milvus vector database (`milvusdb/milvus`) |
+| `ai-pipelines`, `test`, `zj` | DSP pipeline components, MariaDB (`mariadb-105`), Argo workflow controller, service mesh proxy |
+| `cert-manager` | Cert-manager and operator |
+| `kuadrant-system` | Authorino, Limitador, DNS operators (API gateway) |
+| `rhoai-model-registries` | Model registry and its PostgreSQL |
+| `tenant` | TrustyAI LMEval job runner |
+
+#### 1d. Extract images from CatalogSources
+
+```bash
+oc get catalogsource --all-namespaces -o jsonpath='{range .items[*]}{.spec.image}{"\n"}{end}' | sort -u
+```
+
+#### 1e.
Extract images from RHOAI Dashboard config (module architecture images) + +These are images referenced by the Dashboard for model arch features (AutoML, AutoRAG, EvalHub, GenAI, MaaS, MLflow, Model Registry). They appear as running pods on the connected cluster but are important to capture explicitly: + +```bash +# These are typically in the CSV relatedImages but verify by checking running mod-arch pods +oc get pods --all-namespaces -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' | grep 'mod-arch' | sort -u +``` + +#### 1f. Merge and deduplicate + +Combine all image lists from 1b, 1c, 1d, and 1e. Deduplicate by full image reference (registry/repo@digest). For each image, extract: +- Source registry (e.g., `registry.redhat.io/rhoai/`, `quay.io/minio/`, `milvusdb/`) +- Repository name (e.g., `odh-dashboard-rhel9`, `minio`, `milvus`) +- Digest (`sha256:...`) or tag + +#### 1g. Apply exclusion filters (only if user specified) + +Only remove images matching user-provided `EXCLUDE_PATTERNS`. **No images are excluded by default.** + +#### 1h. Check for images already on bastion (skip duplicates) + +Before mirroring, check each image against the bastion to avoid re-mirroring images that already exist. This significantly speeds up incremental mirrors (e.g., when only a few images changed in a nightly build): + +```bash +# For each image, compute the bastion destination path and check if it exists: +BASTION_DEST="${BASTION}/${DEST_REPO}@${DIGEST}" +if oc image info "$BASTION_DEST" --insecure=true -a "$PULL_SECRET" &>/dev/null; then + echo "SKIP (already on bastion): $BASTION_DEST" + SKIPPED_COUNT=$((SKIPPED_COUNT + 1)) + continue +fi +``` + +Report the skip count in the summary. This check adds ~1-2 seconds per image but can save hours of mirroring for unchanged images. + +#### 1i. 
Filter out images that don't need mirroring + +Skip images that: +- Are already on the bastion registry (`bastion.*:8443/`) +- Are from `nvcr.io/nvidia` (GPU operator images managed separately) +- Are from `quay.io/openshift-release-dev` (OCP platform images managed by OCP mirroring) + +#### 1j. Save the image list + +Save to `artifacts/rhoai-manager/mirror-images-{version}.txt` with format: + +```text +# RHOAI Operator and Components +registry.redhat.io/rhoai/odh-rhel9-operator@sha256:abc123... +registry.redhat.io/rhoai/odh-dashboard-rhel9@sha256:def456... +... + +# Infrastructure: Databases +registry.redhat.io/rhel9/mariadb-105@sha256:... +registry.redhat.io/rhel9/postgresql-15@sha256:... +registry.redhat.io/rhel9/postgresql-16@sha256:... + +# Infrastructure: Object Storage +quay.io/minio/minio@sha256:... + +# Infrastructure: Authentication +registry.redhat.io/rhbk/keycloak-rhel9@sha256:... +registry.redhat.io/rhbk/keycloak-rhel9-operator@sha256:... + +# Infrastructure: Model Serving +registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9@sha256:... + +# Infrastructure: Vector Database +milvusdb/milvus@sha256:... + +# Infrastructure: Service Mesh +registry.redhat.io/openshift-service-mesh/proxyv2-rhel9@sha256:... + +# Infrastructure: Cert Manager +registry.redhat.io/cert-manager/jetstack-cert-manager-rhel9@sha256:... + +# Infrastructure: API Gateway (Kuadrant) +registry.redhat.io/rhcl-1/authorino-rhel9@sha256:... + +# FBC Catalog +quay.io/rhoai/rhoai-fbc-fragment@sha256:... + +# Base Images +registry.redhat.io/ubi9/nginx-126@sha256:... +``` + +Print a summary showing the count of images per category. ### Phase 2: Build Combined Pull Secret @@ -76,24 +197,29 @@ The user must provide (or you must ask for): oc get secret/pull-secret -n openshift-config -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d > /tmp/cluster-pull-secret.json ``` -2. **Add bastion registry credentials** — merge both bastion auths into the pull secret: +2. 
**Add bastion registry credentials** — merge each bastion auth into the pull secret: ```bash # Generate base64 auth for bastions BASTION_AUTH=$(printf '%s:%s' "$BASTION_USER" "$BASTION_PASSWORD" | base64 | tr -d '\n') ``` - Use `jq` to merge the bastion auths into `.auths`: + Use `jq` to merge all bastion auths into `.auths`: ```bash - jq --arg pipeline "$PIPELINE_BASTION" \ - --arg rhoaitest "$RHOAI_TEST_BASTION" \ - --arg auth "$BASTION_AUTH" \ - '.auths[$pipeline] = {"auth": $auth} | .auths[$rhoaitest] = {"auth": $auth}' \ - /tmp/cluster-pull-secret.json > /tmp/combined-pull-secret.json + # Build jq expression dynamically for each bastion + JQ_EXPR='.' + for BASTION in ${BASTION_REGISTRIES//,/ }; do + JQ_EXPR="$JQ_EXPR | .auths[\"$BASTION\"] = {\"auth\": \"$BASTION_AUTH\"}" + done + jq "$JQ_EXPR" /tmp/cluster-pull-secret.json > /tmp/combined-pull-secret.json ``` -3. **Create the secret in the mirror namespace** +3. **Add auth for third-party registries** that may require authentication (docker.io, quay.io): + + If the connected cluster's pull secret already has auth for these registries, it will be included automatically. If images from registries like `docker.io` or `milvusdb` (Docker Hub) need mirroring, the pull secret must include Docker Hub credentials. Check and warn if missing. + +4. **Create the secret in the mirror namespace** ```bash oc new-project image-mirror 2>/dev/null || true @@ -103,7 +229,7 @@ The user must provide (or you must ask for): -n image-mirror ``` -4. **Clean up local temp files** +5. **Clean up local temp files** ```bash rm -f /tmp/cluster-pull-secret.json /tmp/combined-pull-secret.json @@ -113,26 +239,45 @@ The user must provide (or you must ask for): Generate a bash script that mirrors all images. 
The script must: -- Accept the pull secret path, and both bastion hostnames +- Accept the pull secret path and bastion hostnames as arguments - For each image in the list: - - Determine the source reference (`registry/repo@digest`) - - Mirror to both bastions with retries (3 attempts) + - Determine the source reference (`registry/repo@digest` or `registry/repo:tag`) + - Compute the destination path based on the source registry: + - `registry.redhat.io/rhoai/foo` -> `BASTION/rhoai/foo` + - `quay.io/minio/minio` -> `BASTION/minio/minio` + - `registry.redhat.io/rhbk/foo` -> `BASTION/rhbk/foo` + - `registry.redhat.io/rhel9/foo` -> `BASTION/rhel9/foo` + - `milvusdb/milvus` -> `BASTION/milvusdb/milvus` (Docker Hub library) + - `quay.io/opendatahub/foo` -> `BASTION/opendatahub/foo` + - `docker.io/library/foo` -> `BASTION/library/foo` + - Mirror to all bastion registries with retries (3 attempts per image) - Use `oc image mirror` with these critical flags: - - `--keep-manifest-list=true` — preserves manifest list digests referenced by the CSV - - `--filter-by-os=".*"` — mirrors all architectures (prevents manifest list stripping) - - `--insecure=true` — bastion registries use self-signed certs - - `-a "$PULL_SECRET"` — combined auth file + - `--keep-manifest-list=true` -- preserves manifest list digests referenced by the CSV + - `--filter-by-os=".*"` -- mirrors all architectures (prevents manifest list stripping) + - `--insecure=true` -- bastion registries use self-signed certs + - `-a "$PULL_SECRET"` -- combined auth file - Tag destination as `:latest` to prevent Quay tagless manifest garbage collection - Verify each mirror with `oc image info` -- Track and report: verified count, failed count, skipped count -- Print a summary at the end + - Handle images with tags (not digests) by using `skopeo copy` as fallback if `oc image mirror` fails +- Track and report: verified count, failed count, skipped count, per category +- Print a summary at the end with per-category 
breakdown **Mirror command pattern per image:** ```bash +# For digest-referenced images oc image mirror \ "${SOURCE_REGISTRY}/${REPO}@${DIGEST}" \ - "${BASTION}/${REPO}:latest" \ + "${BASTION}/${DEST_REPO}:latest" \ + --insecure=true \ + -a "$PULL_SECRET" \ + --keep-manifest-list=true \ + --filter-by-os=".*" + +# For tag-referenced images (e.g., milvusdb/milvus:v2.5.4) +oc image mirror \ + "${SOURCE_IMAGE}" \ + "${BASTION}/${DEST_REPO}:${TAG}" \ --insecure=true \ -a "$PULL_SECRET" \ --keep-manifest-list=true \ @@ -142,7 +287,7 @@ oc image mirror \ **Verification command per image:** ```bash -oc image info "${BASTION}/${REPO}:latest" --insecure=true -a "$PULL_SECRET" +oc image info "${BASTION}/${DEST_REPO}:latest" --insecure=true -a "$PULL_SECRET" ``` ### Phase 4: Deploy Mirror Pod @@ -228,7 +373,44 @@ oc image info "${BASTION}/${REPO}:latest" --insecure=true -a "$PULL_SECRET" oc logs image-mirror -n image-mirror > artifacts/rhoai-manager/mirror-log-{version}.txt ``` -5. **If any images failed**, report them and offer to create a retry script with only the failed images. +5. **If any images failed**, report them by category and offer to create a retry script with only the failed images. + +### Phase 5b: Generate IDMS YAML + +After mirroring completes, generate the ImageDigestMirrorSet YAML from the list of source registries that were mirrored. This YAML must be applied to the disconnected cluster so it knows to pull from the bastion instead of the original source. 
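Concretely, each source prefix maps to its bastion mirror by stripping the registry hostname and re-rooting the remainder under the bastion. A standalone sketch of that mapping (the bastion host below is illustrative):

```shell
# Mirror-path computation used for IDMS entries: drop the registry hostname
# from a source prefix and re-root it under the bastion registry.
BASTION="bastion.example.com:8443"   # illustrative host:port

mirror_for() {
  local prefix="$1"
  echo "${BASTION}/$(echo "$prefix" | sed -E 's|^[^/]+/||')"
}

mirror_for "registry.redhat.io/rhoai"   # -> bastion.example.com:8443/rhoai
mirror_for "quay.io/minio"              # -> bastion.example.com:8443/minio
```

For a Docker Hub shorthand prefix such as `milvusdb` there is no hostname to strip, so normalize it to `docker.io/milvusdb` before applying this mapping.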
+
+```bash
+# Extract unique source registry prefixes from the mirrored image list
+# Group by registry/namespace (e.g., registry.redhat.io/rhoai, quay.io/minio, milvusdb)
+SOURCE_PREFIXES=$(cat "artifacts/rhoai-manager/mirror-images-${RHOAI_VERSION}.txt" | grep -v '^#' | grep -v '^$' \
+  | sed -E 's|^([^/]+/[^/@]+).*|\1|' | sort -u)
+
+# Generate IDMS YAML
+cat > artifacts/rhoai-manager/mirror-idms-${RHOAI_VERSION}.yaml << 'HEADER'
+apiVersion: config.openshift.io/v1
+kind: ImageDigestMirrorSet
+metadata:
+  name: rhoai-mirror
+spec:
+  imageDigestMirrors:
+HEADER
+
+for prefix in $SOURCE_PREFIXES; do
+  # Compute bastion mirror path (strip registry hostname)
+  MIRROR_PATH=$(echo "$prefix" | sed -E 's|^[^/]+/||')
+  cat >> artifacts/rhoai-manager/mirror-idms-${RHOAI_VERSION}.yaml << EOF
+  - source: $prefix
+    mirrors:
+      - ${BASTION}/${MIRROR_PATH}
+    mirrorSourcePolicy: NeverContactSource
+EOF
+done
+
+echo "IDMS YAML saved to: artifacts/rhoai-manager/mirror-idms-${RHOAI_VERSION}.yaml"
+echo "Apply to disconnected cluster: oc apply -f artifacts/rhoai-manager/mirror-idms-${RHOAI_VERSION}.yaml"
+```
+
+**Important:** For Docker Hub images without an explicit registry (e.g., `milvusdb/milvus`), the IDMS source should use the full Docker Hub URL: `docker.io/milvusdb`. For images under `docker.io/library/`, use `docker.io/library`.
 
 ### Phase 6: Cleanup
 
@@ -243,15 +425,49 @@ oc delete project image-mirror
 
 Clean up any local temp files.
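The per-category counts that Phase 1j and the mirror summary report can be produced straight from the saved list. A minimal sketch (the sample list is inlined for illustration; the real file is `artifacts/rhoai-manager/mirror-images-{version}.txt`):

```shell
# Count images under each '# Category' header of the Phase 1j list format.
summarize_image_list() {
  awk '
    /^# /            { cat = substr($0, 3); next }   # category header line
    NF && $0 !~ /^#/ { count[cat]++ }                # image line under current category
    END { for (c in count) printf "%s: %d\n", c, count[c] }
  ' "$1"
}

# Demo with a tiny inlined sample
cat > /tmp/sample-image-list.txt <<'EOF'
# RHOAI Operator and Components
registry.redhat.io/rhoai/odh-rhel9-operator@sha256:aaa
registry.redhat.io/rhoai/odh-dashboard-rhel9@sha256:bbb

# Infrastructure: Object Storage
quay.io/minio/minio@sha256:ccc
EOF

summarize_image_list /tmp/sample-image-list.txt | sort
# Infrastructure: Object Storage: 1
# RHOAI Operator and Components: 2
```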
+## Image Categories Reference + +The following table lists all image categories that must be mirrored for a complete disconnected RHOAI deployment: + +| Category | Source Registry | Example Images | Notes | +|----------|----------------|----------------|-------| +| RHOAI Operator | `registry.redhat.io/rhoai/` | `odh-rhel9-operator`, `odh-operator-bundle` | Core operator | +| RHOAI Components | `registry.redhat.io/rhoai/` | `odh-dashboard-rhel9`, `odh-kserve-controller-rhel9`, `odh-notebook-controller-rhel9`, all `odh-*` images | All CSV relatedImages | +| FBC Catalog | `quay.io/rhoai/` or `quay.io/modh/` | `rhoai-fbc-fragment`, `rhoai-catalog` | OLM catalog source | +| Module Architecture | `registry.redhat.io/rhoai/` | `odh-mod-arch-automl-rhel9`, `odh-mod-arch-autorag-rhel9`, `odh-mod-arch-eval-hub-rhel9`, `odh-mod-arch-gen-ai-rhel9`, `odh-mod-arch-maas-rhel9`, `odh-mod-arch-mlflow-rhel9`, `odh-mod-arch-model-registry-rhel9` | Dashboard module images | +| Model Serving Runtime | `registry.redhat.io/rhaii-early-access/` | `vllm-cuda-rhel9` | vLLM CUDA runtime | +| LlamaStack | `registry.redhat.io/rhoai/` | `odh-llama-stack-core-rhel9`, `odh-llama-stack-k8s-operator-rhel9` | LLM orchestration | +| EvalHub | `registry.redhat.io/rhoai/` | `odh-eval-hub-rhel9`, `odh-ta-lmes-job-rhel9` | Evaluation hub + LMEval job | +| TrustyAI | `registry.redhat.io/rhoai/` | `odh-trustyai-service-operator-rhel9` | AI explainability | +| MariaDB | `registry.redhat.io/rhel9/` | `mariadb-105` | DSP metadata store | +| PostgreSQL | `registry.redhat.io/rhel9/` | `postgresql-15`, `postgresql-16` | EvalHub, Model Registry DBs | +| MinIO | `quay.io/minio/` | `minio` | S3-compatible object storage | +| Keycloak | `registry.redhat.io/rhbk/` | `keycloak-rhel9`, `keycloak-rhel9-operator` | Authentication (LlamaStack, etc.) 
| +| Milvus | `milvusdb/` (Docker Hub) | `milvus` | Vector database for RAG | +| Service Mesh | `registry.redhat.io/openshift-service-mesh/` | `proxyv2-rhel9`, `istio-pilot-rhel9`, `istio-proxyv2-rhel9`, `istio-rhel9-operator` | Envoy sidecar for pipelines | +| Cert Manager | `registry.redhat.io/cert-manager/` | `jetstack-cert-manager-rhel9`, `cert-manager-operator-rhel9` | TLS certificate management | +| Kuadrant/API Gateway | `registry.redhat.io/rhcl-1/` | `authorino-rhel9`, `limitador-rhel9`, `rhcl-rhel9-operator`, `rhcl-console-plugin-rhel9`, `dns-rhel9-operator` | API auth and rate limiting | +| Model Registry | `registry.redhat.io/rhoai/` | `odh-model-registry-rhel9`, `odh-model-registry-operator-rhel9` | ML model registry | +| DSP Components | `registry.redhat.io/rhoai/` | `odh-ml-pipelines-api-server-v2-rhel9`, `odh-ml-pipelines-persistenceagent-v2-rhel9`, `odh-ml-pipelines-scheduledworkflow-v2-rhel9`, `odh-mlmd-grpc-server-rhel9`, `odh-data-science-pipelines-argo-workflowcontroller-rhel9` | Data Science Pipelines | +| Base Images | `registry.redhat.io/ubi9/` | `nginx-126` | Dashboard web server | +| ODH Components | `quay.io/opendatahub/` | `odh-model-controller` | Upstream ODH images | +| Kube Auth Proxy | `registry.redhat.io/rhoai/` | `odh-kube-auth-proxy-rhel9` | Auth proxy for RHOAI services | +| Metadata/Perf | `registry.redhat.io/rhoai/` | `odh-model-metadata-collection-rhel9`, `odh-model-performance-data-rhel9` | Telemetry images | + ## Important Notes - **Why pod-based mirroring**: Running `oc image mirror` from a pod on the connected AWS cluster uses AWS internal networking (40-116 MB/s) instead of local internet (~2 MB/s). This eliminates connection drops on large blob uploads (some RHOAI images are 5-7 GB). - **Why `:latest` tag**: Quay garbage-collects manifests that have no tags. Even though clusters pull by digest, pushing with `:latest` prevents GC from removing the manifests. 
- **Why `--filter-by-os=".*"`**: Using `--filter-by-os=linux/amd64` strips the manifest list and replaces it with a single-arch manifest. The CSV references the manifest list digest, so this would break image resolution. `".*"` preserves the full manifest list. - **Why `--keep-manifest-list=true`**: Ensures the manifest list is pushed as-is to the destination, preserving the exact digest the CSV references. -- **Large images**: Some RHOAI images (automl ~5.5GB, autorag ~7.2GB, ta-lmes-job ~6.7GB) take 5-15 minutes each. The 4-hour `activeDeadlineSeconds` on the pod accommodates this. +- **Why mirror ALL CSV relatedImages**: Previously, workbench, training, pipeline-runtime, and spark images were excluded by default. This caused failures when users tried to create workbenches or run training jobs on the disconnected cluster. Mirror everything by default. +- **Docker Hub images (milvusdb)**: These images may require Docker Hub credentials in the pull secret. The connected cluster may or may not have these. If `oc image mirror` fails for Docker Hub images, the script should warn and continue, reporting them as needing manual attention. +- **Tag-based images**: Some images (e.g., `milvusdb/milvus:v2.5.4`, `quay.io/opendatahub/odh-model-controller:odh-model-serving-api-stable`) use tags instead of digest references. These need special handling since `--keep-manifest-list` may not apply. Mirror them with the original tag preserved. +- **Large images**: Some RHOAI images (automl ~5.5GB, autorag ~7.2GB, ta-lmes-job ~6.7GB, vllm-cuda ~8GB) take 5-15 minutes each. The 4-hour `activeDeadlineSeconds` on the pod accommodates this. +- **IDMS requirements**: After mirroring, the disconnected cluster needs ImageDigestMirrorSet entries for all source registries. 
Registries commonly needing IDMS entries: `registry.redhat.io/rhoai`, `registry.redhat.io/rhbk`, `registry.redhat.io/rhel9`, `registry.redhat.io/rhcl-1`, `registry.redhat.io/cert-manager`, `quay.io/minio`, `quay.io/opendatahub`, `milvusdb` (Docker Hub). The mirror script should output the required IDMS YAML for any registries that were mirrored. ## Output -- `artifacts/rhoai-manager/mirror-images-{version}.txt` — full image list extracted from the connected cluster -- `artifacts/rhoai-manager/mirror-log-{version}.txt` — complete mirror pod log with verification results +- `artifacts/rhoai-manager/mirror-images-{version}.txt` -- categorized image list extracted from the connected cluster +- `artifacts/rhoai-manager/mirror-log-{version}.txt` -- complete mirror pod log with verification results +- `artifacts/rhoai-manager/mirror-idms-{version}.yaml` -- ImageDigestMirrorSet YAML for the disconnected cluster (generated from the mirrored image list) diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-disconnected.md b/workflows/rhoai-manager/.claude/commands/rhoai-disconnected.md new file mode 100644 index 00000000..04d707c7 --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/rhoai-disconnected.md @@ -0,0 +1,1041 @@ +# /rhoai-disconnected - Install or Update RHOAI on a Disconnected OpenShift Cluster + +Install or update Red Hat OpenShift AI (RHOAI) on a disconnected (air-gapped) OpenShift cluster. This command handles the unique requirements of disconnected environments: verifying images exist on the bastion registry, using digest-pinned FBC catalogs, applying known workarounds for disconnected-specific issues, and validating all pods can pull their images. + +## Command Usage + +```bash +# Install RHOAI on a fresh disconnected cluster +/rhoai-disconnected install fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:fe1157d5... + +# Update existing RHOAI to a new build +/rhoai-disconnected update fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:abc123... 
+ +# Auto-detect install vs update +/rhoai-disconnected fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:fe1157d5... + +# With explicit bastion and channel +/rhoai-disconnected fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:fe1157d5... bastion=bastion.example.com:8443 channel=stable-3.4 +``` + +## Inputs + +| Input | Required | Description | Example | +|-------|----------|-------------|---------| +| `fbc` | **Yes** | FBC (File-Based Catalog) image reference. Must include `@sha256:` digest. This is the **source** reference (IDMS rewrites to bastion). | `quay.io/rhoai/rhoai-fbc-fragment@sha256:fe1157d5...` | +| `bastion` | No (auto-detected) | Bastion registry host:port. Auto-detected from IDMS if not specified. | `bastion.ods-dis-rhoai-test.aws.rh-ods.com:8443` | +| `channel` | No | OLM subscription channel. Default: `stable-3.4` for install, preserved for update. | `stable-3.4`, `beta` | +| `install` / `update` | No | Force install or update mode. Auto-detected if omitted. | | + +**Auto-detected:** + +| Value | Source | +|-------|--------| +| `BASTION` | Extracted from IDMS entries for `quay.io/rhoai` or `registry.redhat.io/rhoai` | +| `MODE` | `install` if no RHOAI CSV exists, `update` if one does | +| `RHOAI_VERSION` | Extracted from the CSV version after install/update | + +## Prerequisites + +1. Logged into the **disconnected** OpenShift cluster with cluster-admin privileges (`/oc-login`) +2. `oc` CLI and `jq` available +3. FBC image and ALL component images already mirrored to the bastion (use `/mirror-images` on the connected cluster first) +4. IDMS (ImageDigestMirrorSet) entries configured for all source registries +5. No ODH installation on the cluster (RHOAI and ODH cannot coexist) +6. 
**Dependent operators installed** — RHOAI DSC requires these operators to fully reconcile:
+   - Red Hat OpenShift Service Mesh (provides `DestinationRule` CRD — required for KServe/gateway)
+   - Red Hat OpenShift Serverless (provides `KnativeServing` — required for KServe)
+   - Red Hat OpenShift Pipelines (provides Tekton — required for DSP)
+   - cert-manager for Red Hat OpenShift (provides `Certificate` CRD — required for TLS)
+
+## Process
+
+### Step 1: Parse Input Arguments
+
+```bash
+# Small helper used by every step below
+die() { echo "ERROR: $*" >&2; exit 1; }
+
+# Defaults
+FBC_IMAGE=""
+BASTION=""
+CHANNEL=""
+MODE=""   # install or update, auto-detected if empty
+
+# Parse key=value arguments
+for arg in "$@"; do
+  case "$arg" in
+    fbc=*) FBC_IMAGE="${arg#*=}" ;;
+    bastion=*) BASTION="${arg#*=}" ;;
+    channel=*) CHANNEL="${arg#*=}" ;;
+    install) MODE="install" ;;
+    update) MODE="update" ;;
+  esac
+done
+
+# Validate FBC image is provided and uses digest
+if [[ -z "$FBC_IMAGE" ]]; then
+  die "FBC image is required. Usage: /rhoai-disconnected fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:..."
+fi
+
+if [[ "$FBC_IMAGE" != *"@sha256:"* ]]; then
+  echo "WARNING: FBC image should use @sha256: digest for reproducibility on disconnected clusters."
+  echo "         Provided: $FBC_IMAGE"
+  echo "         Floating tags may resolve to different images if the bastion cache is stale."
+fi
+```
+
+### Step 2: Verify Cluster Access and Detect Mode
+
+```bash
+command -v oc &>/dev/null || die "oc command not found"
+command -v jq &>/dev/null || die "jq command not found"
+oc whoami &>/dev/null || die "Not logged into an OpenShift cluster"
+
+echo "Logged in as: $(oc whoami)"
+echo "Cluster: $(oc whoami --show-server)"
+
+# Check ODH conflict
+if oc get csv -n openshift-operators 2>/dev/null | grep -q opendatahub-operator; then
+  die "ODH is installed. Uninstall ODH first with /odh-uninstall before installing RHOAI."
+fi + +# Auto-detect install vs update +if [[ -z "$MODE" ]]; then + if oc get csv -n redhat-ods-operator 2>/dev/null | grep -q rhods-operator; then + MODE="update" + echo "Detected existing RHOAI installation -> UPDATE mode" + else + MODE="install" + echo "No existing RHOAI installation -> INSTALL mode" + fi +fi + +# Set default channel +if [[ -z "$CHANNEL" ]]; then + if [[ "$MODE" == "update" ]]; then + CHANNEL=$(oc get subscription -n redhat-ods-operator -o jsonpath='{.items[0].spec.channel}' 2>/dev/null || echo "stable-3.4") + echo "Preserving existing channel: $CHANNEL" + else + CHANNEL="stable-3.4" + echo "Using default channel: $CHANNEL" + fi +fi +``` + +### Step 2b: Verify Dependent Operators + +RHOAI DSC cannot fully reconcile without dependent operators. Missing operators cause specific component failures (e.g., KServe fails without Service Mesh, DSP fails without Pipelines). Check and warn early. + +```bash +echo "" +echo "=== Checking Dependent Operators ===" + +MISSING_DEPS=() + +# Service Mesh — required for KServe gateway (DestinationRule CRD) +if oc get crd destinationrules.networking.istio.io &>/dev/null; then + echo " Service Mesh: OK" +else + MISSING_DEPS+=("Red Hat OpenShift Service Mesh (DestinationRule CRD missing — KServe gateway will fail)") +fi + +# Serverless — required for KServe (KnativeServing) +if oc get crd knativeservings.operator.knative.dev &>/dev/null; then + echo " Serverless: OK" +else + MISSING_DEPS+=("Red Hat OpenShift Serverless (KnativeServing CRD missing — KServe will fail)") +fi + +# Pipelines — required for Data Science Pipelines (Tekton) +if oc get crd pipelines.tekton.dev &>/dev/null; then + echo " Pipelines: OK" +else + MISSING_DEPS+=("Red Hat OpenShift Pipelines (Tekton CRD missing — DSP will fail)") +fi + +# Cert Manager — required for TLS certificate management +if oc get crd certificates.cert-manager.io &>/dev/null; then + echo " Cert Manager: OK" +else + MISSING_DEPS+=("cert-manager for Red Hat OpenShift 
(Certificate CRD missing — TLS cert management will fail)") +fi + +if [[ ${#MISSING_DEPS[@]} -gt 0 ]]; then + echo "" + echo "WARNING: ${#MISSING_DEPS[@]} dependent operator(s) are missing:" + for dep in "${MISSING_DEPS[@]}"; do + echo " - $dep" + done + echo "" + echo "RHOAI will install but DSC may not fully reconcile." + echo "Install these operators from the disconnected catalog before proceeding, or continue with partial functionality." + echo "" + echo "Continuing in 10 seconds... (Ctrl+C to cancel)" + sleep 10 +fi +``` + +### Step 3: Auto-Detect Bastion from IDMS + +```bash +if [[ -z "$BASTION" ]]; then + # Extract bastion from IDMS entries for rhoai source + BASTION=$(oc get imagedigestmirrorset -o jsonpath='{range .items[*]}{range .spec.imageDigestMirrors[*]}{.source}{"|"}{.mirrors[0]}{"\n"}{end}{end}' 2>/dev/null \ + | grep 'registry.redhat.io/rhoai' \ + | head -1 \ + | awk -F'|' '{print $2}' \ + | sed 's|/rhoai$||') + + if [[ -z "$BASTION" ]]; then + die "Could not auto-detect bastion from IDMS. Provide it explicitly: bastion=host:port" + fi + + echo "Auto-detected bastion: $BASTION" +fi +``` + +### Step 4: Pre-Flight Image Verification + +This is the critical step that prevents the ImagePullBackOff failures seen on disconnected clusters. Verify that the FBC image and key component images exist on the bastion BEFORE proceeding. + +```bash +echo "" +echo "=== Pre-Flight Image Verification ===" +echo "Checking that required images exist on bastion: $BASTION" + +PULL_SECRET_JSON=$(oc get secret/pull-secret -n openshift-config -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d) +TMPFILE=$(mktemp) +chmod 600 "$TMPFILE" +trap 'rm -f "$TMPFILE"' EXIT +echo "$PULL_SECRET_JSON" > "$TMPFILE" + +MISSING_IMAGES=() +VERIFIED_COUNT=0 + +# 4a. 
Verify FBC image on bastion +# Compute bastion FBC path from the source FBC reference +FBC_REPO=$(echo "$FBC_IMAGE" | sed 's|@sha256:.*||' | awk -F'/' '{print $NF}') +FBC_DIGEST=$(echo "$FBC_IMAGE" | grep -oE 'sha256:[a-f0-9]+') + +# The FBC may be mirrored under different paths depending on IDMS config +# Try the IDMS-mapped path first, then common paths +FBC_BASTION_CANDIDATES=( + "${BASTION}/rhoai/${FBC_REPO}@${FBC_DIGEST}" + "${BASTION}/catalogs/${FBC_REPO}@${FBC_DIGEST}" + "${BASTION}/modh/${FBC_REPO}@${FBC_DIGEST}" +) + +FBC_FOUND=false +for candidate in "${FBC_BASTION_CANDIDATES[@]}"; do + if oc image info "$candidate" --insecure=true -a "$TMPFILE" &>/dev/null; then + echo "FBC image verified: $candidate" + FBC_FOUND=true + FBC_BASTION_REF="$candidate" + break + fi +done + +if [[ "$FBC_FOUND" != "true" ]]; then + MISSING_IMAGES+=("FBC: $FBC_IMAGE") + echo "MISSING: FBC image not found on bastion" +fi + +# 4b. Extract relatedImages from the FBC catalog and verify key images +# Render the catalog from the FBC image to get the CSV's relatedImages +echo "" +echo "Extracting relatedImages from FBC catalog..." + +# Create a temporary pod to read the FBC catalog content +CATALOG_CONTENT=$(oc run fbc-verify --image="$FBC_IMAGE" --restart=Never \ + --command -- cat /configs/rhods-operator/catalog.yaml 2>/dev/null && \ + oc logs fbc-verify 2>/dev/null; oc delete pod fbc-verify --force 2>/dev/null || true) + +# If pod-based extraction fails (common on disconnected), use the CatalogSource approach: +# Create a temporary CatalogSource, wait for it, then query via the catalog pod +if [[ -z "$CATALOG_CONTENT" ]]; then + echo "Direct extraction failed, using CatalogSource approach..." 
+ + # Create temp CatalogSource + cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: CatalogSource +metadata: + name: rhoai-catalog-verify + namespace: openshift-marketplace +spec: + displayName: "RHOAI Verify (temp)" + image: $FBC_IMAGE + sourceType: grpc +EOF + + # Wait for catalog pod to be ready + TIMEOUT=120 + ELAPSED=0 + while [[ $ELAPSED -lt $TIMEOUT ]]; do + CATALOG_STATE=$(oc get catalogsource rhoai-catalog-verify -n openshift-marketplace \ + -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null || echo "") + if [[ "$CATALOG_STATE" == "READY" ]]; then + break + fi + sleep 5 + ELAPSED=$((ELAPSED + 5)) + done + + if [[ "$CATALOG_STATE" == "READY" ]]; then + CATALOG_POD=$(oc get pod -n openshift-marketplace -l olm.catalogSource=rhoai-catalog-verify -o name 2>/dev/null | head -1) + if [[ -n "$CATALOG_POD" ]]; then + CATALOG_CONTENT=$(oc exec -n openshift-marketplace "$CATALOG_POD" -- cat /configs/rhods-operator/catalog.yaml 2>/dev/null || echo "") + fi + fi +fi + +# 4c. Parse relatedImages and verify each on bastion +# Pre-fetch full IDMS source-to-mirror mappings for path resolution +IDMS_SOURCES_FULL=$(oc get imagedigestmirrorset -o jsonpath='{range .items[*]}{range .spec.imageDigestMirrors[*]}{.source}{"|"}{.mirrors[0]}{"\n"}{end}{end}' 2>/dev/null || echo "") + +if [[ -n "$CATALOG_CONTENT" ]]; then + # Extract all image references from the catalog + RELATED_IMAGES=$(echo "$CATALOG_CONTENT" | grep -oE 'registry\.[^"]+@sha256:[a-f0-9]+|quay\.io[^"]+@sha256:[a-f0-9]+' | sort -u) + + TOTAL_IMAGES=$(echo "$RELATED_IMAGES" | wc -l | tr -d ' ') + echo "Found $TOTAL_IMAGES relatedImages in FBC catalog" + echo "Verifying each image exists on bastion..." 
+ + while IFS= read -r img; do + [[ -z "$img" ]] && continue + + # Compute bastion path using IDMS entries to find the correct mirror path + # Extract source prefix from the image (e.g., registry.redhat.io/rhoai from registry.redhat.io/rhoai/odh-dashboard-rhel9@sha256:abc) + IMG_SOURCE_PREFIX=$(echo "$img" | sed -E 's|/[^/]+@sha256:.*||') + IMG_NAME_DIGEST=$(echo "$img" | sed -E "s|^${IMG_SOURCE_PREFIX}/||") + + # Look up the mirror path from IDMS for this source prefix + IDMS_MIRROR=$(echo "$IDMS_SOURCES_FULL" | grep "^${IMG_SOURCE_PREFIX}|" | head -1 | awk -F'|' '{print $2}') + + if [[ -n "$IDMS_MIRROR" ]]; then + BASTION_IMG="${IDMS_MIRROR}/${IMG_NAME_DIGEST}" + else + # Fallback: strip registry hostname, prepend bastion + IMG_PATH=$(echo "$img" | sed -E 's|^[^/]+/||') + BASTION_IMG="${BASTION}/${IMG_PATH}" + fi + + if oc image info "$BASTION_IMG" --insecure=true -a "$TMPFILE" &>/dev/null; then + VERIFIED_COUNT=$((VERIFIED_COUNT + 1)) + else + MISSING_IMAGES+=("$img") + fi + done <<< "$RELATED_IMAGES" +else + echo "WARNING: Could not extract relatedImages from FBC. Skipping image verification." + echo "Proceed with caution - pods may fail with ImagePullBackOff if images are missing." +fi + +# Clean up temp CatalogSource +oc delete catalogsource rhoai-catalog-verify -n openshift-marketplace 2>/dev/null || true + +# 4d. Report results +echo "" +echo "=== Pre-Flight Results ===" +echo "Verified: $VERIFIED_COUNT images" +echo "Missing: ${#MISSING_IMAGES[@]} images" + +if [[ ${#MISSING_IMAGES[@]} -gt 0 ]]; then + echo "" + echo "MISSING IMAGES:" + for img in "${MISSING_IMAGES[@]}"; do + echo " $img" + done + echo "" + echo "ERROR: ${#MISSING_IMAGES[@]} images are missing from the bastion registry." + echo "Run /mirror-images on the connected cluster to mirror these images first." 
+ die "Pre-flight image verification failed" +fi + +echo "All images verified on bastion" +``` + +### Step 5: Verify IDMS Entries + +```bash +echo "" +echo "=== Verifying IDMS Entries ===" + +# Check that IDMS entries exist for all source registries used by RHOAI +REQUIRED_SOURCES=( + "registry.redhat.io/rhoai" + "registry.redhat.io/rhel9" + "registry.redhat.io/ubi9" + "registry.redhat.io/openshift-service-mesh" + "registry.redhat.io/rhbk" + "registry.redhat.io/cert-manager" + "registry.redhat.io/rhcl-1" + "registry.redhat.io/rhaii-early-access" + "quay.io/rhoai" + "quay.io/minio" + "quay.io/opendatahub" + "docker.io/milvusdb" +) + +IDMS_SOURCES=$(oc get imagedigestmirrorset -o jsonpath='{range .items[*]}{range .spec.imageDigestMirrors[*]}{.source}{"\n"}{end}{end}' 2>/dev/null | sort -u) + +MISSING_IDMS=() +for source in "${REQUIRED_SOURCES[@]}"; do + if echo "$IDMS_SOURCES" | grep -q "$source"; then + echo " IDMS OK: $source" + else + MISSING_IDMS+=("$source") + echo " IDMS MISSING: $source" + fi +done + +if [[ ${#MISSING_IDMS[@]} -gt 0 ]]; then + echo "" + echo "WARNING: ${#MISSING_IDMS[@]} IDMS entries are missing." + echo "Pods pulling from these registries will fail with ImagePullBackOff." + echo "The IDMS YAML can be generated by /mirror-images." + echo "" + echo "Continuing anyway - but watch for ImagePullBackOff errors." 
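+
+  # Illustrative shape of the IDMS entry that /mirror-images generates for one
+  # missing source (the bastion hostname here is an example, not a real value):
+  #   apiVersion: config.openshift.io/v1
+  #   kind: ImageDigestMirrorSet
+  #   spec:
+  #     imageDigestMirrors:
+  #       - source: registry.redhat.io/rhoai
+  #         mirrors:
+  #           - bastion.example.com:8443/rhoai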
+fi +``` + +### Step 6: Create or Update CatalogSource + +```bash +echo "" +echo "=== Setting Up OLM Catalog ===" + +# Use the FBC image reference directly - IDMS handles rewriting to bastion +cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: CatalogSource +metadata: + name: rhoai-catalog-dev + namespace: openshift-marketplace +spec: + displayName: "Red Hat OpenShift AI" + image: $FBC_IMAGE + publisher: Red Hat + sourceType: grpc + updateStrategy: + registryPoll: + interval: 30m +EOF + +echo "CatalogSource created/updated with image: $FBC_IMAGE" + +# Force catalog pod refresh to ensure it picks up the new image +CATALOG_POD=$(oc get pod -n openshift-marketplace -l olm.catalogSource=rhoai-catalog-dev -o name 2>/dev/null | head -1) +if [[ -n "$CATALOG_POD" ]]; then + echo "Deleting old catalog pod to force image refresh..." + oc delete "$CATALOG_POD" -n openshift-marketplace 2>/dev/null || true +fi + +# Wait for catalog to be READY +TIMEOUT=180 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CATALOG_STATE=$(oc get catalogsource rhoai-catalog-dev -n openshift-marketplace \ + -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null || echo "") + + if [[ "$CATALOG_STATE" == "READY" ]]; then + echo "CatalogSource is READY" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) + echo " CatalogSource state: ${CATALOG_STATE:-Unknown} (${ELAPSED}s/${TIMEOUT}s)" +done + +[[ "$CATALOG_STATE" == "READY" ]] || die "CatalogSource not READY after ${TIMEOUT}s. Check that the FBC image is accessible on the bastion." +``` + +### Step 7: Install - Create Namespace, OperatorGroup, Subscription (Install mode only) + +```bash +if [[ "$MODE" == "install" ]]; then + OPERATOR_NAMESPACE="redhat-ods-operator" + + # Create namespace + if ! 
oc get namespace "$OPERATOR_NAMESPACE" &>/dev/null; then + oc create namespace "$OPERATOR_NAMESPACE" + echo "Created namespace: $OPERATOR_NAMESPACE" + fi + + # Create OperatorGroup + cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1 +kind: OperatorGroup +metadata: + name: rhods-operator + namespace: $OPERATOR_NAMESPACE +spec: + targetNamespaces: + - $OPERATOR_NAMESPACE +EOF + + echo "OperatorGroup created" + + # Create Subscription + cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: Subscription +metadata: + name: rhoai-operator-dev + namespace: $OPERATOR_NAMESPACE +spec: + channel: $CHANNEL + installPlanApproval: Automatic + name: rhods-operator + source: rhoai-catalog-dev + sourceNamespace: openshift-marketplace +EOF + + echo "Subscription created (channel: $CHANNEL)" +fi +``` + +### Step 8: Update - Forced Reinstall to Pick Up New Images (Update mode only) + +On disconnected clusters, OLM may not auto-update if only component images changed (CSV version unchanged). Force a reinstall. 
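+
+As a sanity check before forcing the reinstall, the digest of the operator image that is currently deployed can be compared against what the refreshed catalog serves. This is a sketch only; the jsonpath assumes the operator is the first deployment and container listed in the CSV's install strategy:
+
+```bash
+# Sketch: print the digest of the currently deployed operator image so it can
+# be compared with the refreshed catalog's bundle. If the digests match, the
+# forced reinstall is effectively a no-op.
+CSV_NOW=$(oc get csv -n redhat-ods-operator 2>/dev/null | awk '/rhods-operator/{print $1; exit}')
+DEPLOYED_IMG=$(oc get csv "$CSV_NOW" -n redhat-ods-operator \
+  -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].image}' 2>/dev/null || echo "")
+echo "Deployed operator image digest: $(grep -oE 'sha256:[a-f0-9]+' <<<"$DEPLOYED_IMG" || echo unknown)"
+```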
+ +```bash +if [[ "$MODE" == "update" ]]; then + echo "" + echo "=== Forcing Operator Reinstall ===" + + # Record current state + OLD_CSV=$(oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator | grep -v Replacing | awk '{print $1}') + SUB_NAME=$(oc get subscription -n redhat-ods-operator -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) + + echo "Current CSV: $OLD_CSV" + echo "Current subscription: $SUB_NAME" + + # Delete CSV to force OLM to reinstall from updated catalog + if [[ -n "$OLD_CSV" ]]; then + echo "Deleting CSV: $OLD_CSV" + oc delete csv "$OLD_CSV" -n redhat-ods-operator || true + sleep 10 + fi + + # Delete and recreate subscription + if [[ -n "$SUB_NAME" ]]; then + echo "Deleting subscription: $SUB_NAME" + oc delete subscription "$SUB_NAME" -n redhat-ods-operator || true + sleep 5 + fi + + # Recreate subscription pointing to updated catalog + cat <<EOF | oc apply -f - +apiVersion: operators.coreos.com/v1alpha1 +kind: Subscription +metadata: + name: rhoai-operator-dev + namespace: redhat-ods-operator +spec: + channel: $CHANNEL + installPlanApproval: Automatic + name: rhods-operator + source: rhoai-catalog-dev + sourceNamespace: openshift-marketplace +EOF + + echo "Subscription recreated (channel: $CHANNEL)" +fi +``` + +### Step 9: Wait for Operator CSV + +```bash +echo "" +echo "=== Waiting for Operator CSV ===" + +CSV_PHASE="" +TIMEOUT=600 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + CSV_LINE=$(oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator | grep -v Replacing || echo "") + + if [[ -n "$CSV_LINE" ]]; then + CSV_NAME=$(echo "$CSV_LINE" | awk '{print $1}') + CSV_PHASE=$(echo "$CSV_LINE" | awk '{print $NF}') + echo "CSV: $CSV_NAME, Phase: $CSV_PHASE" + + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + echo "Operator CSV installed successfully" + break + fi + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) +done + +[[ "$CSV_PHASE" == "Succeeded" ]] || die "Operator did not reach Succeeded 
phase within ${TIMEOUT}s" + +# Extract version +RHOAI_VERSION=$(oc get csv "$CSV_NAME" -n redhat-ods-operator -o jsonpath='{.spec.version}' 2>/dev/null | grep -oE '^[0-9]+\.[0-9]+') +echo "RHOAI Version: $RHOAI_VERSION" +``` + +### Step 10: Create/Configure DataScienceCluster + +```bash +echo "" +echo "=== Configuring DataScienceCluster ===" + +# Wait for DSCInitialization +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get dscinitializations default-dsci &>/dev/null; then + echo "DSCInitialization found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) +done + +# For install mode, create DSC from CSV initialization-resource +if [[ "$MODE" == "install" ]]; then + CSV_NAME=$(oc get csv -n redhat-ods-operator 2>/dev/null | awk '/rhods-operator/{print $1; exit}') + if [[ -n "$CSV_NAME" ]]; then + oc get csv "$CSV_NAME" -n redhat-ods-operator \ + -o jsonpath='{.metadata.annotations.operatorframework\.io/initialization-resource}' \ + > /tmp/default-dsc.json + oc apply -f /tmp/default-dsc.json + echo "DSC created from CSV initialization-resource" + rm -f /tmp/default-dsc.json + fi +fi + +# Wait for DSC to exist +TIMEOUT=120 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get datasciencecluster default-dsc &>/dev/null; then + echo "DataScienceCluster found" + break + fi + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) +done + +# Patch DSC to enable required components with disconnected-specific settings +cat > /tmp/dsc-patch.yaml << 'YAML' +spec: + components: + aipipelines: + managementState: Managed + argoWorkflowsControllers: + managementState: Managed + kserve: + serving: + managementState: Managed + rawDeploymentServiceConfig: Headless + nim: + managementState: Managed + airGapped: true + llamastackoperator: + managementState: Managed + mlflowoperator: + managementState: Managed + trustyai: + managementState: Managed + trainer: + managementState: Removed +YAML + +oc patch 
datasciencecluster default-dsc --type merge --patch-file /tmp/dsc-patch.yaml || \
+  die "Failed to patch DataScienceCluster"
+
+echo "DSC component configuration applied:"
+echo "  - aipipelines: Managed (with argoWorkflowsControllers)"
+echo "  - kserve: Managed (rawDeploymentServiceConfig: Headless for disconnected)"
+echo "  - nim: Managed (airGapped: true for disconnected)"
+echo "  - llamastackoperator: Managed"
+echo "  - mlflowoperator: Managed"
+echo "  - trustyai: Managed"
+echo "  - trainer: Removed (requires JobSet operator)"
+
+rm -f /tmp/dsc-patch.yaml
+```
+
+### Step 11: Wait for DSC Ready
+
+```bash
+echo ""
+echo "=== Waiting for DataScienceCluster ==="
+
+TIMEOUT=600
+INTERVAL=15
+ELAPSED=0
+DSC_PHASE=""
+
+while [[ $ELAPSED -lt $TIMEOUT ]]; do
+  DSC_PHASE=$(oc get datasciencecluster -o jsonpath='{.items[0].status.phase}' 2>/dev/null || echo "Unknown")
+  echo "DSC phase: $DSC_PHASE"
+
+  if [[ "$DSC_PHASE" == "Ready" ]]; then
+    echo "DataScienceCluster is Ready"
+    break
+  fi
+
+  sleep "$INTERVAL"
+  ELAPSED=$((ELAPSED + INTERVAL))
+done
+
+if [[ "$DSC_PHASE" != "Ready" ]]; then
+  echo "WARNING: DSC is not Ready after ${TIMEOUT}s (current: ${DSC_PHASE:-Unknown})"
+  echo "Not-ready components:"
+  # Guard against conditions with no message field (jq's test() errors on null)
+  oc get dsc default-dsc -o json 2>/dev/null | \
+    jq -r '.status.conditions[] | select(.status=="False") | select((.message // "") | test("Removed") | not) | "  \(.type): \(.message // "")"' 2>/dev/null || true
+fi
+```
+
+### Step 12: Post-Install/Update Health Check - Verify No ImagePullBackOff
+
+This is critical for disconnected clusters. After the operator reconciles, check ALL pods in RHOAI namespaces for ImagePullBackOff or ErrImagePull errors.
+
+```bash
+echo ""
+echo "=== Post-Install Health Check ==="
+echo "Waiting 60 seconds for operator to reconcile pods..."
+sleep 60 + +PROBLEM_PODS=() + +# Check pods in all RHOAI-related namespaces +for ns in redhat-ods-operator redhat-ods-applications; do + PODS=$(oc get pods -n "$ns" --no-headers 2>/dev/null || echo "") + + while IFS= read -r line; do + [[ -z "$line" ]] && continue + POD_NAME=$(echo "$line" | awk '{print $1}') + STATUS=$(echo "$line" | awk '{print $3}') + + if [[ "$STATUS" == "ImagePullBackOff" || "$STATUS" == "ErrImagePull" ]]; then + # Get the failing image + FAILING_IMAGE=$(oc get pod "$POD_NAME" -n "$ns" -o jsonpath='{range .status.containerStatuses[*]}{.state.waiting.message}{"\n"}{end}' 2>/dev/null | grep -oE 'image "[^"]+"' | head -1) + PROBLEM_PODS+=("$ns/$POD_NAME: $STATUS ($FAILING_IMAGE)") + elif [[ "$STATUS" == "CrashLoopBackOff" ]]; then + PROBLEM_PODS+=("$ns/$POD_NAME: $STATUS") + fi + done <<< "$PODS" +done + +if [[ ${#PROBLEM_PODS[@]} -gt 0 ]]; then + echo "" + echo "WARNING: ${#PROBLEM_PODS[@]} pods have issues:" + for pod in "${PROBLEM_PODS[@]}"; do + echo " $pod" + done + echo "" + echo "For ImagePullBackOff: The image is missing from the bastion. Run /mirror-images to mirror it." + echo "For CrashLoopBackOff: Check pod logs for root cause (may be the podToPodTLS bug - see Step 13)." +else + echo "All pods in RHOAI namespaces are running normally" +fi +``` + +### Step 13: Apply Known Disconnected Workarounds + +#### 13a. podToPodTLS Bug Workaround + +In some RHOAI nightly builds, the DSP operator sets `--caCertPath` flag in pipeline component deployments, but the binary only supports `--mlPipelineServiceTLSCert`. This causes CrashLoopBackOff for `scheduledworkflow` and other pipeline pods with error: `flag provided but not defined: -caCertPath`. + +The workaround is to set `podToPodTLS: false` on all DataSciencePipelinesApplication (DSPA) CRs. This must be applied AFTER the operator creates the DSPA resources. + +```bash +echo "" +echo "=== Applying Known Disconnected Workarounds ===" + +# 13a. 
Check for and fix podToPodTLS bug +# Only apply if pipeline components are enabled and DSPAs exist +DSPA_LIST=$(oc get datasciencepipelinesapplication --all-namespaces --no-headers 2>/dev/null || echo "") + +if [[ -n "$DSPA_LIST" ]]; then + echo "Found DataSciencePipelinesApplication resources. Checking for podToPodTLS bug..." + + while IFS= read -r line; do + [[ -z "$line" ]] && continue + DSPA_NS=$(echo "$line" | awk '{print $1}') + DSPA_NAME=$(echo "$line" | awk '{print $2}') + + # Check if any pipeline pods are in CrashLoopBackOff with caCertPath error + CRASH_PODS=$(oc get pods -n "$DSPA_NS" --no-headers 2>/dev/null | grep CrashLoopBackOff || echo "") + + if [[ -n "$CRASH_PODS" ]]; then + # Check logs for the specific caCertPath error + for crash_pod in $(echo "$CRASH_PODS" | awk '{print $1}'); do + if oc logs "$crash_pod" -n "$DSPA_NS" --tail=5 2>/dev/null | grep -q "caCertPath"; then + echo " Found podToPodTLS bug in $DSPA_NS/$DSPA_NAME" + echo " Applying workaround: podToPodTLS=false" + oc patch datasciencepipelinesapplication "$DSPA_NAME" -n "$DSPA_NS" \ + --type='merge' -p '{"spec":{"podToPodTLS":false}}' + fi + done + fi + + # Also proactively set podToPodTLS=false to prevent the issue + CURRENT_TLS=$(oc get datasciencepipelinesapplication "$DSPA_NAME" -n "$DSPA_NS" \ + -o jsonpath='{.spec.podToPodTLS}' 2>/dev/null || echo "") + if [[ "$CURRENT_TLS" != "false" ]]; then + echo " Setting podToPodTLS=false on $DSPA_NS/$DSPA_NAME (proactive)" + oc patch datasciencepipelinesapplication "$DSPA_NAME" -n "$DSPA_NS" \ + --type='merge' -p '{"spec":{"podToPodTLS":false}}' + fi + done <<< "$DSPA_LIST" +else + echo "No DSPAs found (pipelines not yet configured). podToPodTLS workaround will need to be applied after creating DSPAs." + echo " Command: oc patch datasciencepipelinesapplication <name> -n <namespace> --type='merge' -p '{\"spec\":{\"podToPodTLS\":false}}'" +fi +``` + +#### 13b. 
PersistenceAgent TLS Certificate Fix (Proactive + Reactive) + +The pipeline persistenceagent may fail with `x509: certificate signed by unknown authority` when connecting to the pipeline API server. This happens because the trusted CA bundle doesn't include the OpenShift service-ca that signed the pipeline API server cert. + +**Proactive fix:** Apply the service-ca to ALL DSPA trusted CA configmaps immediately, before waiting for a crash. This prevents the issue entirely. + +```bash +# 13b. Proactively fix persistenceagent TLS for all DSPAs +SERVICE_CA=$(oc get configmap openshift-service-ca.crt -n openshift-config-managed \ + -o jsonpath='{.data.service-ca\.crt}' 2>/dev/null || echo "") + +if [[ -n "$SERVICE_CA" && -n "$DSPA_LIST" ]]; then + while IFS= read -r line; do + [[ -z "$line" ]] && continue + DSPA_NS=$(echo "$line" | awk '{print $1}') + DSPA_NAME=$(echo "$line" | awk '{print $2}') + + CM_NAME="dsp-trusted-ca-${DSPA_NAME}" + + # Wait for the configmap to be created by the operator (up to 60s) + TIMEOUT=60 + ELAPSED=0 + while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get configmap "$CM_NAME" -n "$DSPA_NS" &>/dev/null; then + break + fi + sleep 5 + ELAPSED=$((ELAPSED + 5)) + done + + if oc get configmap "$CM_NAME" -n "$DSPA_NS" &>/dev/null; then + CURRENT_CA=$(oc get configmap "$CM_NAME" -n "$DSPA_NS" -o jsonpath='{.data.dsp-ca\.crt}' 2>/dev/null || echo "") + + if [[ -n "$CURRENT_CA" ]] && ! 
echo "$CURRENT_CA" | grep -q "openshift-service-serving-signer"; then + echo " Proactively appending service-ca to $CM_NAME in $DSPA_NS" + COMBINED_CA="${CURRENT_CA} +${SERVICE_CA}" + TMPCA=$(mktemp) + chmod 600 "$TMPCA" + echo "$COMBINED_CA" > "$TMPCA" + oc create configmap "$CM_NAME" -n "$DSPA_NS" \ + --from-file=dsp-ca.crt="$TMPCA" \ + --dry-run=client -o yaml | oc replace -f - + rm -f "$TMPCA" + echo " Service-ca appended to $CM_NAME" + + # Restart persistenceagent if it exists (may or may not be running yet) + PA_POD=$(oc get pods -n "$DSPA_NS" --no-headers 2>/dev/null | grep persistenceagent | awk '{print $1}') + if [[ -n "$PA_POD" ]]; then + oc delete pod "$PA_POD" -n "$DSPA_NS" 2>/dev/null || true + echo " Restarted persistenceagent pod" + fi + else + echo " $CM_NAME in $DSPA_NS already has service-ca (or empty)" + fi + else + echo " WARNING: $CM_NAME not found in $DSPA_NS after ${TIMEOUT}s. Will need manual fix after DSPA creates it." + fi + done <<< "$DSPA_LIST" +elif [[ -z "$SERVICE_CA" ]]; then + echo " WARNING: Could not retrieve openshift-service-ca.crt — persistenceagent TLS fix skipped" +fi +``` + +### Step 14: Configure Dashboard Features + +```bash +echo "" +echo "=== Configuring Dashboard ===" + +# Wait for dashboard +TIMEOUT=300 +INTERVAL=10 +ELAPSED=0 + +while [[ $ELAPSED -lt $TIMEOUT ]]; do + READY=$(oc get deployment rhods-dashboard -n redhat-ods-applications -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo "0") + DESIRED=$(oc get deployment rhods-dashboard -n redhat-ods-applications -o jsonpath='{.spec.replicas}' 2>/dev/null || echo "0") + + if [[ "$READY" -gt 0 && "$READY" -eq "$DESIRED" ]]; then + echo "Dashboard deployment is ready ($READY/$DESIRED)" + break + fi + + sleep "$INTERVAL" + ELAPSED=$((ELAPSED + INTERVAL)) +done + +# Wait for OdhDashboardConfig +TIMEOUT=120 +ELAPSED=0 +while [[ $ELAPSED -lt $TIMEOUT ]]; do + if oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + break + fi + 
sleep 10 + ELAPSED=$((ELAPSED + 10)) +done + +# Enable feature flags +if oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + oc patch odhdashboardconfig odh-dashboard-config -n redhat-ods-applications --type merge -p '{ + "spec": { + "dashboardConfig": { + "automl": true, + "autorag": true, + "genAiStudio": true + } + } + }' 2>/dev/null || echo "WARNING: Failed to patch dashboard config" + + echo "Dashboard feature flags configured (automl, autorag, genAiStudio)" + + # Restart dashboard to pick up changes + oc rollout restart deployment rhods-dashboard -n redhat-ods-applications 2>/dev/null || true +else + echo "WARNING: OdhDashboardConfig not found. Feature flags will need manual configuration." +fi +``` + +### Step 15: Final Verification + +```bash +echo "" +echo "==========================================" +echo " RHOAI ${MODE^^} Summary (Disconnected)" +echo "==========================================" + +# CSV info +echo "" +echo "Operator CSV:" +oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator || echo " WARNING: CSV not found" + +# Version +echo "" +CSV_NAME=$(oc get csv -n redhat-ods-operator 2>/dev/null | awk '/rhods-operator/{print $1; exit}') +if [[ -n "$CSV_NAME" ]]; then + VERSION=$(oc get csv "$CSV_NAME" -n redhat-ods-operator -o jsonpath='{.spec.version}' 2>/dev/null) + echo "RHOAI Version: $VERSION" +fi + +# FBC image +echo "FBC Image: $FBC_IMAGE" +echo "Channel: $CHANNEL" +echo "Bastion: $BASTION" + +# DSC status +echo "" +echo "DataScienceCluster:" +DSC_PHASE=$(oc get datasciencecluster -o jsonpath='{.items[0].status.phase}' 2>/dev/null || echo "Unknown") +echo " Phase: $DSC_PHASE" + +# Dashboard URL +echo "" +echo "Dashboard:" +DASHBOARD_ROUTE=$(oc get route rhods-dashboard -n redhat-ods-applications -o jsonpath='{.spec.host}' 2>/dev/null || echo "") +if [[ -n "$DASHBOARD_ROUTE" ]]; then + echo " https://$DASHBOARD_ROUTE" +else + echo " Route not found yet" +fi + +# Pod health summary 
+echo "" +echo "Pod Health (RHOAI namespaces):" +for ns in redhat-ods-operator redhat-ods-applications; do + TOTAL=$(oc get pods -n "$ns" --no-headers 2>/dev/null | wc -l | tr -d ' ') + RUNNING=$(oc get pods -n "$ns" --no-headers 2>/dev/null | grep Running | wc -l | tr -d ' ') + ISSUES=$(oc get pods -n "$ns" --no-headers 2>/dev/null | grep -cE 'ImagePullBackOff|ErrImagePull|CrashLoopBackOff' | tr -d ' ') + echo " $ns: $RUNNING/$TOTAL running, $ISSUES with issues" +done + +echo "" +if [[ "$DSC_PHASE" == "Ready" ]]; then + echo "RHOAI ${MODE} on disconnected cluster complete!" +else + echo "RHOAI ${MODE} completed but DSC is not fully Ready." + echo "Check pod status and apply workarounds if needed." +fi +``` + +## Known Issues and Workarounds + +### 1. podToPodTLS CrashLoopBackOff (DSP Components) + +**Symptom:** Pipeline pods (`scheduledworkflow`, `persistenceagent`) crash with `flag provided but not defined: -caCertPath` + +**Cause:** RHOAI nightly build bug -- operator sets `--caCertPath` in deployment spec but the binary only supports `--mlPipelineServiceTLSCert` + +**Fix:** Applied automatically in Step 13a. For new DSPAs created after install: +```bash +oc patch datasciencepipelinesapplication <name> -n <namespace> --type='merge' -p '{"spec":{"podToPodTLS":false}}' +``` + +### 2. PersistenceAgent x509 Certificate Error + +**Symptom:** `persistenceagent` crashes with `x509: certificate signed by unknown authority` when connecting to `ds-pipeline-*.svc.cluster.local:8888` + +**Cause:** The DSP trusted CA configmap has Mozilla CA bundle but NOT the OpenShift service-ca that signed the pipeline API server cert + +**Fix:** Applied automatically in Step 13b. Manual fix: +```bash +# Get the service-ca +SERVICE_CA=$(oc get configmap openshift-service-ca.crt -n openshift-config-managed -o jsonpath='{.data.service-ca\.crt}') +# Append to the existing DSP CA configmap +``` + +### 3. 
Missing Images on Bastion + +**Symptom:** Multiple pods in `ImagePullBackOff` state after install/update + +**Cause:** Not all RHOAI images were mirrored to the bastion before install/update + +**Prevention:** Step 4 (pre-flight verification) catches this before proceeding. Always run `/mirror-images` on the connected cluster first. + +### 4. EvalHub Cross-Namespace Issues + +**Symptom:** EvalHub evaluation jobs fail when running in a different namespace than `evalhub` + +**Cause:** EvalHub operator creates K8s Jobs in the target namespace but doesn't create the required ServiceAccount (`evalhub-evalhub-job`) or ConfigMap (`evalhub-service-ca`) there + +**Fix:** Manually create the SA and copy the ConfigMap: +```bash +oc create sa evalhub-evalhub-job -n <target-namespace> +oc adm policy add-role-to-user edit system:serviceaccount:<target-namespace>:evalhub-evalhub-job -n <target-namespace> +oc get configmap evalhub-service-ca -n evalhub -o json | \ + jq 'del(.metadata.namespace,.metadata.resourceVersion,.metadata.uid,.metadata.creationTimestamp,.metadata.managedFields,.metadata.ownerReferences)' | \ + oc create -n <target-namespace> -f - +``` + +## Output + +The command creates a report at `artifacts/rhoai-manager/reports/disconnected-{install|update}-report-[timestamp].md` with: +- FBC image reference and digest +- Pre-flight verification results +- Operator CSV details +- DataScienceCluster status +- Pod health check results +- Workarounds applied +- Dashboard URL diff --git a/workflows/rhoai-manager/.claude/commands/rhoai-verify.md b/workflows/rhoai-manager/.claude/commands/rhoai-verify.md new file mode 100644 index 00000000..469c8dc7 --- /dev/null +++ b/workflows/rhoai-manager/.claude/commands/rhoai-verify.md @@ -0,0 +1,654 @@ +# /rhoai-verify - Post-Install/Update Verification Tests for RHOAI + +Run a comprehensive suite of verification tests after RHOAI install or update to confirm all components are healthy and functional. 
Works on both connected and disconnected clusters. + +## Command Usage + +```bash +# Run all tests +/rhoai-verify + +# Run specific test categories +/rhoai-verify quick # Operator + DSC + pod health only +/rhoai-verify full # All tests including smoke tests +``` + +## Inputs + +| Input | Required | Default | Description | +|-------|----------|---------|-------------| +| `quick` / `full` | No | `full` | Test scope | + +## Process + +### Step 1: Initialize Test Report + +```bash +REPORT_FILE="artifacts/rhoai-manager/reports/verify-$(date +%Y%m%d-%H%M%S).md" +mkdir -p artifacts/rhoai-manager/reports + +PASS_COUNT=0 +FAIL_COUNT=0 +WARN_COUNT=0 + +pass() { PASS_COUNT=$((PASS_COUNT + 1)); echo " PASS: $1"; } +fail() { FAIL_COUNT=$((FAIL_COUNT + 1)); echo " FAIL: $1"; } +warn() { WARN_COUNT=$((WARN_COUNT + 1)); echo " WARN: $1"; } + +echo "=== RHOAI Post-Install/Update Verification ===" +echo "Cluster: $(oc whoami --show-server 2>/dev/null)" +echo "User: $(oc whoami 2>/dev/null)" +echo "Date: $(date -u +%Y-%m-%dT%H:%M:%SZ)" +echo "" +``` + +### Step 2: Operator Health + +Verify the RHOAI operator CSV is installed and in Succeeded phase. + +```bash +echo "=== Test 1: Operator Health ===" + +# 2a. Check CSV +CSV_LINE=$(oc get csv -n redhat-ods-operator 2>/dev/null | grep rhods-operator | grep -v Replacing || echo "") + +if [[ -z "$CSV_LINE" ]]; then + fail "No RHOAI CSV found in redhat-ods-operator namespace" +else + CSV_NAME=$(echo "$CSV_LINE" | awk '{print $1}') + CSV_PHASE=$(echo "$CSV_LINE" | awk '{print $NF}') + CSV_VERSION=$(oc get csv "$CSV_NAME" -n redhat-ods-operator -o jsonpath='{.spec.version}' 2>/dev/null) + + if [[ "$CSV_PHASE" == "Succeeded" ]]; then + pass "CSV $CSV_NAME is Succeeded (version: $CSV_VERSION)" + else + fail "CSV $CSV_NAME phase is $CSV_PHASE (expected: Succeeded)" + fi +fi + +# 2b. 
Check Subscription +SUB=$(oc get subscription -n redhat-ods-operator -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "") +if [[ -n "$SUB" ]]; then + SUB_STATE=$(oc get subscription "$SUB" -n redhat-ods-operator -o jsonpath='{.status.state}' 2>/dev/null || echo "Unknown") + SUB_CHANNEL=$(oc get subscription "$SUB" -n redhat-ods-operator -o jsonpath='{.spec.channel}' 2>/dev/null || echo "Unknown") + SUB_SOURCE=$(oc get subscription "$SUB" -n redhat-ods-operator -o jsonpath='{.spec.source}' 2>/dev/null || echo "Unknown") + + if [[ "$SUB_STATE" == "AtLatestKnown" ]]; then + pass "Subscription $SUB state: $SUB_STATE (channel: $SUB_CHANNEL, source: $SUB_SOURCE)" + else + warn "Subscription $SUB state: $SUB_STATE (expected: AtLatestKnown)" + fi +else + fail "No RHOAI subscription found" +fi + +# 2c. Check CatalogSource +CATALOG=$(oc get subscription "$SUB" -n redhat-ods-operator -o jsonpath='{.spec.source}' 2>/dev/null || echo "") +if [[ -n "$CATALOG" ]]; then + CATALOG_STATE=$(oc get catalogsource "$CATALOG" -n openshift-marketplace \ + -o jsonpath='{.status.connectionState.lastObservedState}' 2>/dev/null || echo "Unknown") + + if [[ "$CATALOG_STATE" == "READY" ]]; then + pass "CatalogSource $CATALOG is READY" + else + fail "CatalogSource $CATALOG state: $CATALOG_STATE (expected: READY)" + fi +fi + +echo "" +``` + +### Step 3: DataScienceCluster Health + +Verify DSC exists and is in Ready phase with all managed components healthy. + +```bash +echo "=== Test 2: DataScienceCluster Health ===" + +# 3a. Check DSCInitialization +DSCI_PHASE=$(oc get dscinitializations default-dsci -o jsonpath='{.status.phase}' 2>/dev/null || echo "NotFound") +if [[ "$DSCI_PHASE" == "Ready" ]]; then + pass "DSCInitialization phase: Ready" +else + fail "DSCInitialization phase: $DSCI_PHASE (expected: Ready)" +fi + +# 3b. 
Check DSC phase +DSC_PHASE=$(oc get datasciencecluster -o jsonpath='{.items[0].status.phase}' 2>/dev/null || echo "NotFound") +if [[ "$DSC_PHASE" == "Ready" ]]; then + pass "DataScienceCluster phase: Ready" +else + fail "DataScienceCluster phase: $DSC_PHASE (expected: Ready)" +fi + +# 3c. Check individual component conditions +DSC_CONDITIONS=$(oc get datasciencecluster -o json 2>/dev/null | \ + jq -r '.items[0].status.conditions[] | "\(.type)|\(.status)|\(.message // "")"' 2>/dev/null || echo "") + +if [[ -n "$DSC_CONDITIONS" ]]; then + while IFS='|' read -r ctype cstatus cmsg; do + [[ -z "$ctype" ]] && continue + # Skip conditions that are about Removed components + if echo "$cmsg" | grep -qi "removed"; then + continue + fi + if [[ "$cstatus" == "True" ]]; then + pass "Component $ctype: Ready" + else + fail "Component $ctype: Not Ready ($cmsg)" + fi + done <<< "$DSC_CONDITIONS" +fi + +echo "" +``` + +### Step 4: Pod Health Across RHOAI Namespaces + +Check all pods in RHOAI-related namespaces for failures. + +```bash +echo "=== Test 3: Pod Health ===" + +RHOAI_NAMESPACES="redhat-ods-operator redhat-ods-applications redhat-ods-monitoring" + +for ns in $RHOAI_NAMESPACES; do + # Skip if namespace doesn't exist + if ! 
oc get namespace "$ns" &>/dev/null; then + continue + fi + + PODS=$(oc get pods -n "$ns" --no-headers 2>/dev/null || echo "") + if [[ -z "$PODS" ]]; then + warn "No pods found in $ns" + continue + fi + + TOTAL=0 + RUNNING=0 + COMPLETED=0 + ISSUES=0 + ISSUE_DETAILS="" + + while IFS= read -r line; do + [[ -z "$line" ]] && continue + TOTAL=$((TOTAL + 1)) + POD_NAME=$(echo "$line" | awk '{print $1}') + STATUS=$(echo "$line" | awk '{print $3}') + READY=$(echo "$line" | awk '{print $2}') + + case "$STATUS" in + Running) + # Check if all containers are ready + READY_NUM=$(echo "$READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$READY" | cut -d/ -f2) + if [[ "$READY_NUM" == "$TOTAL_NUM" ]]; then + RUNNING=$((RUNNING + 1)) + else + ISSUES=$((ISSUES + 1)) + ISSUE_DETAILS="${ISSUE_DETAILS}\n $POD_NAME: Running but not ready ($READY)" + fi + ;; + Completed|Succeeded) + COMPLETED=$((COMPLETED + 1)) + ;; + ImagePullBackOff|ErrImagePull) + ISSUES=$((ISSUES + 1)) + ISSUE_DETAILS="${ISSUE_DETAILS}\n $POD_NAME: $STATUS (missing image on registry)" + ;; + CrashLoopBackOff) + ISSUES=$((ISSUES + 1)) + ISSUE_DETAILS="${ISSUE_DETAILS}\n $POD_NAME: $STATUS (check logs: oc logs $POD_NAME -n $ns)" + ;; + *) + ISSUES=$((ISSUES + 1)) + ISSUE_DETAILS="${ISSUE_DETAILS}\n $POD_NAME: $STATUS" + ;; + esac + done <<< "$PODS" + + if [[ $ISSUES -eq 0 ]]; then + pass "$ns: $RUNNING running, $COMPLETED completed, $TOTAL total" + else + fail "$ns: $ISSUES pods with issues out of $TOTAL total" + echo -e "$ISSUE_DETAILS" + fi +done + +echo "" +``` + +### Step 5: Dashboard Accessibility + +Verify the RHOAI dashboard is reachable and responding. + +```bash +echo "=== Test 4: Dashboard Accessibility ===" + +# 5a. 
Check deployment +DASH_READY=$(oc get deployment rhods-dashboard -n redhat-ods-applications \ + -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo "0") +DASH_DESIRED=$(oc get deployment rhods-dashboard -n redhat-ods-applications \ + -o jsonpath='{.spec.replicas}' 2>/dev/null || echo "0") + +if [[ "$DASH_READY" -gt 0 && "$DASH_READY" -eq "$DASH_DESIRED" ]]; then + pass "Dashboard deployment ready ($DASH_READY/$DASH_DESIRED replicas)" +else + fail "Dashboard deployment not ready ($DASH_READY/$DASH_DESIRED replicas)" +fi + +# 5b. Check route exists +DASH_ROUTE=$(oc get route rhods-dashboard -n redhat-ods-applications \ + -o jsonpath='{.spec.host}' 2>/dev/null || echo "") + +if [[ -n "$DASH_ROUTE" ]]; then + pass "Dashboard route exists: https://$DASH_ROUTE" +else + fail "Dashboard route not found" +fi + +# 5c. HTTP health check (expect 403 or 200 — both mean dashboard is responding) +if [[ -n "$DASH_ROUTE" ]]; then + HTTP_CODE=$(/usr/bin/curl -sk -o /dev/null -w '%{http_code}' "https://$DASH_ROUTE" 2>/dev/null || echo "000") + + if [[ "$HTTP_CODE" == "200" || "$HTTP_CODE" == "403" || "$HTTP_CODE" == "302" ]]; then + pass "Dashboard HTTP response: $HTTP_CODE (responding)" + else + fail "Dashboard HTTP response: $HTTP_CODE (expected 200, 302, or 403)" + fi +fi + +# 5d. 
Check dashboard feature flags +if oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications &>/dev/null; then + AUTOML=$(oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications \ + -o jsonpath='{.spec.dashboardConfig.automl}' 2>/dev/null || echo "unset") + AUTORAG=$(oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications \ + -o jsonpath='{.spec.dashboardConfig.autorag}' 2>/dev/null || echo "unset") + GENAISTUDIO=$(oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications \ + -o jsonpath='{.spec.dashboardConfig.genAiStudio}' 2>/dev/null || echo "unset") + + echo " Dashboard features: automl=$AUTOML, autorag=$AUTORAG, genAiStudio=$GENAISTUDIO" +fi + +echo "" +``` + +### Step 6: Pipeline (Data Science Pipelines) Readiness + +Verify the DSP operator and controllers are running. If DSPAs exist, verify their health. + +```bash +echo "=== Test 5: Data Science Pipelines ===" + +# 6a. Check DSP operator deployment +DSP_OPERATOR=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "data-science-pipelines-operator" || echo "") + +if [[ -n "$DSP_OPERATOR" ]]; then + DSP_NAME=$(echo "$DSP_OPERATOR" | awk '{print $1}') + DSP_READY=$(echo "$DSP_OPERATOR" | awk '{print $2}') + READY_NUM=$(echo "$DSP_READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$DSP_READY" | cut -d/ -f2) + + if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then + pass "DSP operator deployment ready ($DSP_READY)" + else + fail "DSP operator deployment not ready ($DSP_READY)" + fi +else + warn "DSP operator deployment not found (pipelines may be set to Removed)" +fi + +# 6b. 
Check existing DSPAs +DSPA_LIST=$(oc get datasciencepipelinesapplication --all-namespaces --no-headers 2>/dev/null || echo "") + +if [[ -n "$DSPA_LIST" ]]; then + while IFS= read -r line; do + [[ -z "$line" ]] && continue + DSPA_NS=$(echo "$line" | awk '{print $1}') + DSPA_NAME=$(echo "$line" | awk '{print $2}') + DSPA_READY=$(echo "$line" | awk '{print $NF}') + + # Check DSPA status + DSPA_PHASE=$(oc get datasciencepipelinesapplication "$DSPA_NAME" -n "$DSPA_NS" \ + -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || echo "Unknown") + + if [[ "$DSPA_PHASE" == "True" ]]; then + pass "DSPA $DSPA_NS/$DSPA_NAME: Ready" + else + fail "DSPA $DSPA_NS/$DSPA_NAME: Not Ready" + fi + + # Check podToPodTLS (known issue) + POD_TLS=$(oc get datasciencepipelinesapplication "$DSPA_NAME" -n "$DSPA_NS" \ + -o jsonpath='{.spec.podToPodTLS}' 2>/dev/null || echo "unset") + if [[ "$POD_TLS" != "false" ]]; then + warn "DSPA $DSPA_NS/$DSPA_NAME: podToPodTLS=$POD_TLS (set to false if pipeline pods crash with caCertPath error)" + fi + + # Check pipeline pods in that namespace + CRASH_PODS=$(oc get pods -n "$DSPA_NS" --no-headers 2>/dev/null | grep -E "CrashLoopBackOff|ImagePullBackOff" || echo "") + if [[ -n "$CRASH_PODS" ]]; then + fail "DSPA $DSPA_NS has crashing/failing pods:" + echo "$CRASH_PODS" | while read -r pline; do + echo " $(echo "$pline" | awk '{print $1, $3}')" + done + fi + done <<< "$DSPA_LIST" +else + echo " No DSPAs configured yet (create one to test pipelines)" +fi + +echo "" +``` + +### Step 7: Workbench / Notebook Controller Readiness + +```bash +echo "=== Test 6: Workbench / Notebook Controller ===" + +# 7a. 
Check notebook controller +NB_CONTROLLER=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "notebook-controller" | head -1 || echo "") + +if [[ -n "$NB_CONTROLLER" ]]; then + NB_NAME=$(echo "$NB_CONTROLLER" | awk '{print $1}') + NB_READY=$(echo "$NB_CONTROLLER" | awk '{print $2}') + READY_NUM=$(echo "$NB_READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$NB_READY" | cut -d/ -f2) + + if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then + pass "Notebook controller ready ($NB_READY)" + else + fail "Notebook controller not ready ($NB_READY)" + fi +else + warn "Notebook controller deployment not found" +fi + +# 7b. Check ODH notebook controller +ODH_NB=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "odh-notebook-controller" | head -1 || echo "") + +if [[ -n "$ODH_NB" ]]; then + ODH_NB_NAME=$(echo "$ODH_NB" | awk '{print $1}') + ODH_NB_READY=$(echo "$ODH_NB" | awk '{print $2}') + READY_NUM=$(echo "$ODH_NB_READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$ODH_NB_READY" | cut -d/ -f2) + + if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then + pass "ODH notebook controller ready ($ODH_NB_READY)" + else + fail "ODH notebook controller not ready ($ODH_NB_READY)" + fi +fi + +# 7c. Check workbench namespace +WB_NS=$(oc get datasciencecluster -o jsonpath='{.items[0].spec.components.workbenches.workbenchNamespace}' 2>/dev/null || echo "rhods-notebooks") +if oc get namespace "$WB_NS" &>/dev/null; then + pass "Workbench namespace $WB_NS exists" +else + warn "Workbench namespace $WB_NS not found" +fi + +echo "" +``` + +### Step 8: Model Serving Readiness (KServe / ModelMesh) + +```bash +echo "=== Test 7: Model Serving ===" + +# 8a. 
Check KServe controller +KSERVE=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "kserve-controller" | head -1 || echo "") + +if [[ -n "$KSERVE" ]]; then + KS_READY=$(echo "$KSERVE" | awk '{print $2}') + READY_NUM=$(echo "$KS_READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$KS_READY" | cut -d/ -f2) + + if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then + pass "KServe controller ready ($KS_READY)" + else + fail "KServe controller not ready ($KS_READY)" + fi +else + warn "KServe controller not found (kserve may be Removed)" +fi + +# 8b. Check ModelMesh controller +MODELMESH=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "modelmesh-controller" | head -1 || echo "") + +if [[ -n "$MODELMESH" ]]; then + MM_READY=$(echo "$MODELMESH" | awk '{print $2}') + READY_NUM=$(echo "$MM_READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$MM_READY" | cut -d/ -f2) + + if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then + pass "ModelMesh controller ready ($MM_READY)" + else + fail "ModelMesh controller not ready ($MM_READY)" + fi +else + echo " ModelMesh controller not found (may not be deployed)" +fi + +# 8c. Check ServingRuntimes exist +SR_COUNT=$(oc get servingruntimes -n redhat-ods-applications --no-headers 2>/dev/null | wc -l | tr -d ' ') +if [[ "$SR_COUNT" -gt 0 ]]; then + pass "Found $SR_COUNT ServingRuntime(s) in redhat-ods-applications" +else + warn "No ServingRuntimes found in redhat-ods-applications" +fi + +# 8d. 
Check InferenceServices across cluster +IS_COUNT=$(oc get inferenceservice --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' ') +if [[ "$IS_COUNT" -gt 0 ]]; then + echo " Found $IS_COUNT InferenceService(s) across cluster" + # Check each for readiness + oc get inferenceservice --all-namespaces --no-headers 2>/dev/null | while read -r line; do + IS_NS=$(echo "$line" | awk '{print $1}') + IS_NAME=$(echo "$line" | awk '{print $2}') + IS_READY=$(echo "$line" | awk '{print $NF}') + echo " $IS_NS/$IS_NAME: $IS_READY" + done +else + echo " No InferenceServices deployed" +fi + +echo "" +``` + +### Step 9: Model Registry Readiness + +```bash +echo "=== Test 8: Model Registry ===" + +MR_OPERATOR=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "model-registry-operator" | head -1 || echo "") + +if [[ -n "$MR_OPERATOR" ]]; then + MR_READY=$(echo "$MR_OPERATOR" | awk '{print $2}') + READY_NUM=$(echo "$MR_READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$MR_READY" | cut -d/ -f2) + + if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then + pass "Model Registry operator ready ($MR_READY)" + else + fail "Model Registry operator not ready ($MR_READY)" + fi +else + warn "Model Registry operator not found (may be set to Removed)" +fi + +# Check registry namespace +MR_NS=$(oc get datasciencecluster -o jsonpath='{.items[0].spec.components.modelregistry.registriesNamespace}' 2>/dev/null || echo "") +if [[ -n "$MR_NS" ]]; then + if oc get namespace "$MR_NS" &>/dev/null; then + pass "Model Registry namespace $MR_NS exists" + else + warn "Model Registry namespace $MR_NS not found" + fi +fi + +echo "" +``` + +### Step 10: TrustyAI / EvalHub Readiness + +```bash +echo "=== Test 9: TrustyAI / EvalHub ===" + +# 10a. 
TrustyAI operator +TRUSTYAI=$(oc get deployment -n redhat-ods-applications --no-headers 2>/dev/null | grep "trustyai" | head -1 || echo "") + +if [[ -n "$TRUSTYAI" ]]; then + TA_READY=$(echo "$TRUSTYAI" | awk '{print $2}') + READY_NUM=$(echo "$TA_READY" | cut -d/ -f1) + TOTAL_NUM=$(echo "$TA_READY" | cut -d/ -f2) + + if [[ "$READY_NUM" == "$TOTAL_NUM" && "$READY_NUM" -gt 0 ]]; then + pass "TrustyAI operator ready ($TA_READY)" + else + fail "TrustyAI operator not ready ($TA_READY)" + fi +else + warn "TrustyAI operator not found (may be set to Removed)" +fi + +# 10b. Check EvalHub namespace and resources +if oc get namespace evalhub &>/dev/null; then + EVALHUB_PODS=$(oc get pods -n evalhub --no-headers 2>/dev/null || echo "") + EH_TOTAL=$(echo "$EVALHUB_PODS" | grep -c '.' || echo "0") + EH_RUNNING=$(echo "$EVALHUB_PODS" | grep -c "Running" || echo "0") + EH_ISSUES=$(echo "$EVALHUB_PODS" | grep -cE "CrashLoopBackOff|ImagePullBackOff|Error" || echo "0") + + if [[ "$EH_ISSUES" -eq 0 && "$EH_RUNNING" -gt 0 ]]; then + pass "EvalHub namespace: $EH_RUNNING/$EH_TOTAL pods running" + elif [[ "$EH_ISSUES" -gt 0 ]]; then + fail "EvalHub namespace: $EH_ISSUES pods with issues" + else + warn "EvalHub namespace exists but no running pods" + fi + + # Check EvalHub route + EH_ROUTE=$(oc get route -n evalhub --no-headers 2>/dev/null | head -1 | awk '{print $2}' || echo "") + if [[ -n "$EH_ROUTE" ]]; then + pass "EvalHub route: https://$EH_ROUTE" + fi +else + echo " EvalHub namespace not found (not configured)" +fi + +echo "" +``` + +### Step 11: Dependent Operator Health + +Check that key dependent operators (service mesh, serverless, pipelines, cert-manager) are installed and healthy. 
+ +```bash +echo "=== Test 10: Dependent Operators ===" + +DEPENDENT_OPERATORS=( + "servicemeshoperator" + "openshift-pipelines-operator-rh" + "serverless-operator" + "openshift-cert-manager-operator" +) + +for op in "${DEPENDENT_OPERATORS[@]}"; do + OP_CSV=$(oc get csv --all-namespaces 2>/dev/null | grep "$op" | grep -v Replacing | head -1 || echo "") + + if [[ -n "$OP_CSV" ]]; then + OP_PHASE=$(echo "$OP_CSV" | awk '{print $NF}') + OP_NAME=$(echo "$OP_CSV" | awk '{print $2}') + if [[ "$OP_PHASE" == "Succeeded" ]]; then + pass "$op ($OP_NAME): Succeeded" + else + warn "$op ($OP_NAME): $OP_PHASE" + fi + else + warn "$op: not installed" + fi +done + +echo "" +``` + +### Step 12: Disconnected-Specific Checks (auto-detected) + +If running on a disconnected cluster (detected by IDMS presence), run additional checks. + +```bash +echo "=== Test 11: Disconnected Cluster Checks ===" + +IDMS_COUNT=$(oc get imagedigestmirrorset --no-headers 2>/dev/null | wc -l | tr -d ' ') + +if [[ "$IDMS_COUNT" -gt 0 ]]; then + echo " Detected disconnected cluster ($IDMS_COUNT IDMS entries)" + + # Check IDMS entries for key RHOAI sources + REQUIRED_SOURCES=("registry.redhat.io/rhoai" "registry.redhat.io/rhel9" "registry.redhat.io/ubi9") + IDMS_SOURCES=$(oc get imagedigestmirrorset -o jsonpath='{range .items[*]}{range .spec.imageDigestMirrors[*]}{.source}{"\n"}{end}{end}' 2>/dev/null | sort -u) + + for source in "${REQUIRED_SOURCES[@]}"; do + if echo "$IDMS_SOURCES" | grep -q "$source"; then + pass "IDMS entry exists for $source" + else + fail "IDMS entry missing for $source" + fi + done + + # Check for any ImagePullBackOff across ALL namespaces (not just RHOAI) + IPB_PODS=$(oc get pods --all-namespaces --no-headers 2>/dev/null | grep -E "ImagePullBackOff|ErrImagePull" | head -10 || echo "") + if [[ -n "$IPB_PODS" ]]; then + IPB_COUNT=$(echo "$IPB_PODS" | wc -l | tr -d ' ') + warn "$IPB_COUNT pods with ImagePullBackOff across cluster (may indicate missing mirrored images)" + echo "$IPB_PODS" 
| while read -r line; do + echo " $(echo "$line" | awk '{print $1"/"$2": "$4}')" + done + else + pass "No ImagePullBackOff pods across cluster" + fi +else + echo " Connected cluster detected (no IDMS entries) — skipping disconnected checks" +fi + +echo "" +``` + +### Step 13: Test Summary + +```bash +echo "==========================================" +echo " RHOAI Verification Summary" +echo "==========================================" +echo "" +echo " PASS: $PASS_COUNT" +echo " FAIL: $FAIL_COUNT" +echo " WARN: $WARN_COUNT" +echo "" + +if [[ $FAIL_COUNT -eq 0 ]]; then + echo " Result: ALL TESTS PASSED" +else + echo " Result: $FAIL_COUNT FAILURE(S) DETECTED" + echo "" + echo " Troubleshooting:" + echo " - ImagePullBackOff: Run /mirror-images to mirror missing images" + echo " - CrashLoopBackOff: Check pod logs (may need podToPodTLS workaround)" + echo " - DSC not Ready: Check component conditions with: oc get dsc -o yaml" + echo " - CSV not Succeeded: Check InstallPlan and operator logs" +fi + +echo "" +echo "Cluster: $(oc whoami --show-server 2>/dev/null)" +echo "RHOAI Version: ${CSV_VERSION:-Unknown}" +``` + +Write the test results to the report file in markdown format for archival. 
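The report write-out can be sketched as follows. This is a minimal sketch, not the command's actual implementation: the counter values and `CSV_VERSION` below are placeholders standing in for the live `PASS_COUNT`/`FAIL_COUNT`/`WARN_COUNT` totals collected by the pass/fail/warn helpers, and the exact report layout is up to the workflow.

```bash
#!/usr/bin/env bash
# Sketch of the report write-out. Counter values are placeholders --
# in a real run they come from the pass/fail/warn helpers above.
PASS_COUNT=12
FAIL_COUNT=1
WARN_COUNT=3
CSV_VERSION="rhods-operator.3.4.0"  # placeholder version string

REPORT_DIR="artifacts/rhoai-manager/reports"
mkdir -p "$REPORT_DIR"
REPORT="$REPORT_DIR/verify-$(date +%Y%m%d-%H%M%S).md"

# Assemble the markdown report from the summary counters.
{
  echo "# RHOAI Verification Report"
  echo ""
  echo "- RHOAI version: ${CSV_VERSION:-Unknown}"
  echo "- Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo ""
  echo "| Result | Count |"
  echo "|--------|-------|"
  echo "| PASS   | $PASS_COUNT |"
  echo "| FAIL   | $FAIL_COUNT |"
  echo "| WARN   | $WARN_COUNT |"
} > "$REPORT"

echo "Report written to $REPORT"
```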
+ +## Output + +Report saved to `artifacts/rhoai-manager/reports/verify-[timestamp].md` with: +- Cluster info and RHOAI version +- Per-test PASS/FAIL/WARN results +- Summary counts +- Troubleshooting guidance for failures diff --git a/workflows/rhoai-manager/README.md b/workflows/rhoai-manager/README.md index 8ed1f620..55406ab2 100644 --- a/workflows/rhoai-manager/README.md +++ b/workflows/rhoai-manager/README.md @@ -33,7 +33,9 @@ workflows/rhoai-manager/ │ ├── odh-update.md # ODH update to latest nightly │ ├── odh-uninstall.md # ODH uninstall │ ├── odh-pr-tracker.md # Track ODH PRs in RHOAI builds -│ └── mirror-images.md # Mirror images to disconnected bastions +│ ├── mirror-images.md # Mirror images to disconnected bastions +│ ├── rhoai-disconnected.md # Install/update RHOAI on disconnected clusters +│ └── rhoai-verify.md # Post-install/update verification tests └── README.md # This file ``` @@ -168,20 +170,84 @@ Track whether an ODH pull request has been included in the latest RHOAI build. ### /mirror-images -Mirror all RHOAI operator and component images from a connected cluster to both disconnected cluster bastion registries. +Mirror all images needed for a complete disconnected RHOAI deployment from a connected cluster to one or more bastion registries. Includes RHOAI operator, all components, and infrastructure services. **Usage:** `/mirror-images` **What it does:** -1. Extracts images from connected cluster's CSV relatedImages + running pods -2. Builds a combined pull secret with source registry and bastion credentials -3. Deploys a mirror pod on the connected cluster (fast AWS-internal transfers) -4. Mirrors all images to both bastions with `--keep-manifest-list=true --filter-by-os=".*"` -5. Tags destinations with `:latest` to prevent Quay tagless manifest GC -6. Verifies every image on both bastions, reports failures +1. Extracts images from connected cluster's CSV relatedImages (all of them, no exclusions by default) +2. 
Scans all relevant namespaces for running pod images (minio, keycloak, postgres, milvus, vLLM, service mesh, etc.) +3. Captures catalog source images and module architecture images +4. Builds a combined pull secret with source registry and bastion credentials +5. Deploys a mirror pod on the connected cluster (fast AWS-internal transfers) +6. Mirrors all images to each bastion with `--keep-manifest-list=true --filter-by-os=".*"` +7. Tags destinations with `:latest` to prevent Quay tagless manifest GC +8. Verifies every image on each bastion, reports failures by category +9. Generates IDMS (ImageDigestMirrorSet) YAML for the disconnected cluster -**Required inputs:** Both bastion addresses, bastion credentials, RHOAI version, optional exclude patterns. +**Required inputs:** Bastion registry address(es), bastion credentials. RHOAI version is auto-detected. Optional exclude patterns (empty by default). + +--- + +### /rhoai-disconnected + +Install or update RHOAI on a disconnected (air-gapped) OpenShift cluster using a digest-pinned FBC catalog image. + +**Usage:** +```bash +/rhoai-disconnected fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:fe1157d5... +/rhoai-disconnected install fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:... +/rhoai-disconnected update fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:... +/rhoai-disconnected fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:... bastion=host:8443 channel=stable-3.4 +``` + +**Required input:** `fbc=<image@sha256:digest>` — the FBC catalog image (must be already mirrored to bastion via `/mirror-images`). + +**Optional inputs:** `bastion=<host:port>` (auto-detected from IDMS), `channel=<channel>` (default: `stable-3.4`), `install`/`update` (auto-detected). + +**What it does:** + +1. Auto-detects install vs update mode and bastion registry from IDMS +2. **Pre-flight verification**: checks that the FBC image and ALL relatedImages exist on the bastion before proceeding +3. Verifies IDMS entries cover all required source registries +4. 
Creates/updates OLM CatalogSource, namespace, OperatorGroup, and Subscription +5. For updates: forces CSV reinstall to pick up new component images +6. Waits for operator CSV and DataScienceCluster to reach Ready state +7. Post-install health check: detects ImagePullBackOff and CrashLoopBackOff pods +8. Applies known workarounds (podToPodTLS bug, persistenceagent TLS cert) +9. Configures dashboard feature flags (automl, autorag, genAiStudio) + +**Prerequisite:** All images mirrored to bastion (use `/mirror-images` on connected cluster first). IDMS configured on disconnected cluster. + +--- + +### /rhoai-verify + +Run post-install/update verification tests to confirm all RHOAI components are healthy and functional. + +**Usage:** +```bash +/rhoai-verify # Run all tests (default: full) +/rhoai-verify quick # Operator + DSC + pod health only +/rhoai-verify full # All tests including smoke tests +``` + +**What it checks:** + +1. Operator health — CSV phase, subscription state, CatalogSource readiness +2. DataScienceCluster — phase, component conditions +3. Pod health — scans all RHOAI namespaces for ImagePullBackOff, CrashLoopBackOff, not-ready containers +4. Dashboard — deployment readiness, route existence, HTTP response +5. Data Science Pipelines — DSP operator, DSPA health, podToPodTLS status +6. Workbenches — notebook controller, ODH notebook controller, workbench namespace +7. Model Serving — KServe controller, ModelMesh controller, ServingRuntimes, InferenceServices +8. Model Registry — operator readiness, registry namespace +9. TrustyAI / EvalHub — TrustyAI operator, EvalHub namespace/pods/route +10. Dependent operators — service mesh, serverless, pipelines, cert-manager +11. Disconnected checks (auto-detected) — IDMS entries, cluster-wide ImagePullBackOff scan + +**Output:** Report at `artifacts/rhoai-manager/reports/verify-[timestamp].md` with PASS/FAIL/WARN summary and troubleshooting guidance. 
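Most of these checks reduce to the same comparison on a deployment's `READY` column (`N/M`): all containers ready, and at least one replica. A minimal sketch of that pattern — the `check_ready` helper name is illustrative, not part of the command:

```bash
# Illustrative helper (not part of the command itself): succeeds when a
# READY column value like "2/2" means fully ready -- ready count equals
# desired count and is greater than zero.
check_ready() {
  local ready="$1"
  local ready_num total_num
  ready_num=$(echo "$ready" | cut -d/ -f1)
  total_num=$(echo "$ready" | cut -d/ -f2)
  [[ "$ready_num" == "$total_num" && "$ready_num" -gt 0 ]]
}

check_ready "2/2" && echo "2/2: ready"
check_ready "1/2" || echo "1/2: not ready"
check_ready "0/0" || echo "0/0: not ready"
```

Note that `0/0` is treated as not ready: a scaled-to-zero deployment trivially matches ready == desired, so the `-gt 0` guard is what keeps it from passing.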
--- @@ -191,7 +257,7 @@ Mirror all RHOAI operator and component images from a connected cluster to both ``` 1. /oc-login 2. /rhoai-install -3. /rhoai-version +3. /rhoai-verify ``` ### Fresh ODH Installation @@ -205,7 +271,7 @@ Mirror all RHOAI operator and component images from a connected cluster to both ``` 1. /oc-login 2. /rhoai-update -3. /rhoai-version +3. /rhoai-verify ``` ### Pull Latest Nightly (ODH) @@ -231,7 +297,14 @@ Mirror all RHOAI operator and component images from a connected cluster to both ### Mirror Images to Disconnected Clusters ``` 1. /oc-login # Connect to the connected cluster -2. /mirror-images # Mirror all RHOAI images to both bastions +2. /mirror-images # Mirror all RHOAI + infrastructure images to bastion(s) +``` + +### Install/Update RHOAI on Disconnected Cluster +``` +1. /oc-login # Connect to the disconnected cluster +2. /rhoai-disconnected fbc=quay.io/rhoai/rhoai-fbc-fragment@sha256:... +3. /rhoai-verify # Verify everything is healthy ``` ### Decommission From 00ba9ad1aeea26e06f90f9394704510be001c081 Mon Sep 17 00:00:00 2001 From: Vaishnavi-Modi <vamodi@redhat.com> Date: Tue, 14 Apr 2026 12:05:42 -0400 Subject: [PATCH 22/23] feat: add summary display section to mirror-images command Add a formatted summary table displayed after mirroring completes, showing total/verified/skipped/failed counts, duration, per-category breakdown, artifact paths, and next steps. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --- .../.claude/commands/mirror-images.md | 37 +++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/workflows/rhoai-manager/.claude/commands/mirror-images.md b/workflows/rhoai-manager/.claude/commands/mirror-images.md index fa3339f7..5dbc5743 100644 --- a/workflows/rhoai-manager/.claude/commands/mirror-images.md +++ b/workflows/rhoai-manager/.claude/commands/mirror-images.md @@ -471,3 +471,40 @@ The following table lists all image categories that must be mirrored for a compl - `artifacts/rhoai-manager/mirror-images-{version}.txt` -- categorized image list extracted from the connected cluster - `artifacts/rhoai-manager/mirror-log-{version}.txt` -- complete mirror pod log with verification results - `artifacts/rhoai-manager/mirror-idms-{version}.yaml` -- ImageDigestMirrorSet YAML for the disconnected cluster (generated from the mirrored image list) + +## Summary Display + +After mirroring completes, display a summary table to the user in this format: + +``` +**RHOAI v{version} Image Mirror — Complete** + +| Metric | Value | +|--------|-------| +| Total images | {total} | +| Verified | {verified} | +| Skipped (already on bastion) | {skipped} | +| Failed | {failed} | +| Duration | {duration} | +| Target | `{bastion_registry}` | + +**Image Breakdown:** + +| Category | Count | +|----------|-------| +| RHOAI Operator and Components | {count} | +| Model Serving Runtimes (vLLM) | {count} | +| Infrastructure Dependencies | {count} | +| FBC Catalog | {count} | +| Base Images | {count} | + +**Artifacts saved:** +- `artifacts/rhoai-manager/mirror-images-{version}.txt` — categorized image list +- `artifacts/rhoai-manager/mirror-log-{version}.txt` — full mirror log ({line_count} lines) +- `artifacts/rhoai-manager/mirror-idms-{version}.yaml` — ImageDigestMirrorSet YAML for disconnected cluster + +**Next step:** Apply the IDMS on the disconnected cluster: +oc apply -f 
artifacts/rhoai-manager/mirror-idms-{version}.yaml +``` + +If any images failed, append a **Failed Images** section listing them by category with their full image references. From df95c2a6f64424ef32ccbb4789bb4499fdb9243a Mon Sep 17 00:00:00 2001 From: vmrh21 <bot@ambient-code.local> Date: Thu, 16 Apr 2026 23:10:46 +0000 Subject: [PATCH 23/23] feat: generate .cve-fix/examples.md guidance for Observability repos - Generate .cve-fix/examples.md guidance for all 14 Observability repos - Patterns extracted from merged CVE/dependency PRs in stolostron org - Repos with <3 CVE PRs include org-wide patterns as fallback Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --- .../.cve-fix/stolostron-grafana/examples.md | 30 +++++++++++++++ .../stolostron-kube-rbac-proxy/examples.md | 36 ++++++++++++++++++ .../stolostron-kube-state-metrics/examples.md | 33 ++++++++++++++++ .../stolostron-memcached-exporter/examples.md | 30 +++++++++++++++ .../examples.md | 30 +++++++++++++++ .../examples.md | 32 ++++++++++++++++ .../stolostron-node-exporter/examples.md | 28 ++++++++++++++ .../examples.md | 30 +++++++++++++++ .../stolostron-observatorium/examples.md | 30 +++++++++++++++ .../examples.md | 32 ++++++++++++++++ .../examples.md | 30 +++++++++++++++ .../stolostron-prometheus/examples.md | 33 ++++++++++++++++ .../examples.md | 32 ++++++++++++++++ .../.cve-fix/stolostron-thanos/examples.md | 38 +++++++++++++++++++ 14 files changed, 444 insertions(+) create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-grafana/examples.md create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-kube-rbac-proxy/examples.md create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-kube-state-metrics/examples.md create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-memcached-exporter/examples.md create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-addon/examples.md create mode 100755 
workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-operator/examples.md create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-node-exporter/examples.md create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-observatorium-operator/examples.md create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-observatorium/examples.md create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-prometheus-alertmanager/examples.md create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-prometheus-operator/examples.md create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-prometheus/examples.md create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-thanos-receive-controller/examples.md create mode 100755 workflows/cve-fixer/.cve-fix/stolostron-thanos/examples.md diff --git a/workflows/cve-fixer/.cve-fix/stolostron-grafana/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-grafana/examples.md new file mode 100755 index 00000000..3af72243 --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-grafana/examples.md @@ -0,0 +1,30 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 0 --> + +<!-- Insufficient PR history for full pattern extraction. + Update with /guidance.update after more CVE fixes are merged. --> + +## Titles +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (common across stolostron org) +- `fix(cve): CVE-YYYY-XXXXX - <package>` (conventional commit style, also used in org) + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (common across stolostron org) + - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-kube-rbac-proxy/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-kube-rbac-proxy/examples.md new file mode 100755 index 00000000..390e1c3d --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-kube-rbac-proxy/examples.md @@ -0,0 +1,36 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 10 --> + +## Titles +- `fix(cve): CVE-YYYY-XXXXX - <package>` (5/30 merged PRs) + - e.g. `fix(cve): CVE-2026-33186 - google.golang.org/grpc` + - e.g. `fix(cve): CVE-2026-33186 - google.golang.org/grpc` +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (5/30 merged PRs) + - e.g. `Security: Fix CVE-2026-33186 (gRPC-Go) - release-2.13` + - e.g. `Security: Fix CVE-2026-33186 (gRPC-Go) - release-2.14` + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (10/30 merged PRs) + - e.g. `fix/cve-2026-33186-grpc-go-backplane-2.10-attempt-1` + - e.g. `fix/cve-2026-33186-grpc-go-backplane-2.9-attempt-1` +- `dependabot/<ecosystem>/<pkg>-<version>` (2/30 merged PRs) + - e.g. `dependabot/go_modules/golang.org/x/net-0.38.0` + - e.g. `dependabot/go_modules/golang.org/x/oauth2-0.27.0` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates +- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps) + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-kube-state-metrics/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-kube-state-metrics/examples.md new file mode 100755 index 00000000..a7408bb0 --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-kube-state-metrics/examples.md @@ -0,0 +1,33 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 4 --> + +## Titles +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/15 merged PRs) + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (4/15 merged PRs) + - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1` + - e.g. `fix/cve-2026-33186-grpc-go-release-2.15-attempt-1` +- `dependabot/<ecosystem>/<pkg>-<version>` (2/15 merged PRs) + - e.g. `dependabot/go_modules/github.com/golang-jwt/jwt/v5-5.2.2` + - e.g. `dependabot/go_modules/golang.org/x/crypto-0.35.0` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates +- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps) + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-memcached-exporter/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-memcached-exporter/examples.md new file mode 100755 index 00000000..3af72243 --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-memcached-exporter/examples.md @@ -0,0 +1,30 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 0 --> + +<!-- Insufficient PR history for full pattern extraction. + Update with /guidance.update after more CVE fixes are merged. --> + +## Titles +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (common across stolostron org) +- `fix(cve): CVE-YYYY-XXXXX - <package>` (conventional commit style, also used in org) + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (common across stolostron org) + - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-addon/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-addon/examples.md new file mode 100755 index 00000000..c0e621f5 --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-addon/examples.md @@ -0,0 +1,30 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 4 --> + +## Titles +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/10 merged PRs) + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (4/10 merged PRs) + - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1` + - e.g. `fix/cve-2026-33186-grpc-go-release-2.15-attempt-1` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates +- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps) + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-operator/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-operator/examples.md new file mode 100755 index 00000000..72e15064 --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-multicluster-observability-operator/examples.md @@ -0,0 +1,32 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 4 --> + +## Titles +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/12 merged PRs) + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (4/12 merged PRs) + - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1` + - e.g. `fix/cve-2026-33186-grpc-go-release-2.15-attempt-1` +- `dependabot/<ecosystem>/<pkg>-<version>` (1/12 merged PRs) + - e.g. `dependabot/go_modules/go.opentelemetry.io/otel/sdk-1.40.0` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates +- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps) + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-node-exporter/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-node-exporter/examples.md new file mode 100755 index 00000000..8cc1dc04 --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-node-exporter/examples.md @@ -0,0 +1,28 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 0 --> + +## Titles +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (common across stolostron org) +- `fix(cve): CVE-YYYY-XXXXX - <package>` (conventional commit style, also used in org) + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (common across stolostron org) + - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates +- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps) + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-observatorium-operator/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-observatorium-operator/examples.md new file mode 100755 index 00000000..b1b8a4a2 --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-observatorium-operator/examples.md @@ -0,0 +1,30 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 0 --> + +## Titles +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (common across stolostron org) +- `fix(cve): CVE-YYYY-XXXXX - <package>` (conventional commit style, also used in org) + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (common across stolostron org) + - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates +- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps) +- `vendor/` directory is vendored — run `go mod vendor` after dependency changes + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- This repo vendors dependencies — run `go mod vendor` after `go mod tidy` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-observatorium/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-observatorium/examples.md new file mode 100755 index 00000000..6debae0b --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-observatorium/examples.md @@ -0,0 +1,30 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 4 --> + +## Titles +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/19 merged PRs) + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (4/19 merged PRs) + - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1` + - e.g. `fix/cve-2026-33186-grpc-go-release-2.15-attempt-1` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates +- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps) + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-prometheus-alertmanager/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-prometheus-alertmanager/examples.md new file mode 100755 index 00000000..07ff82db --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-prometheus-alertmanager/examples.md @@ -0,0 +1,32 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 5 --> + +## Titles +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/26 merged PRs) + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` +- `Other CVE title format` (1/26 merged PRs) + - e.g. `[release-2.10] fix: CVE-2023-45288 ensure golang/x/net is 0.23+` + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (4/26 merged PRs) + - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1` + - e.g. `fix/cve-2026-33186-grpc-go-release-2.15-attempt-1` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates +- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps) + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-prometheus-operator/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-prometheus-operator/examples.md new file mode 100755 index 00000000..f6c23f29 --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-prometheus-operator/examples.md @@ -0,0 +1,30 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 4 --> + +## Titles +- `fix(cve): CVE-YYYY-XXXXX - <package>` (4/22 merged PRs) + - e.g. `fix(cve): CVE-2026-33186 - google.golang.org/grpc [release-2.17]` + - e.g. `fix(cve): CVE-2026-33186 - google.golang.org/grpc [release-2.16]` + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (4/22 merged PRs) + - e.g. `fix/cve-2026-33186-grpc-release-2.17-attempt-1` + - e.g. `fix/cve-2026-33186-grpc-release-2.16-attempt-1` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates +- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps) + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-prometheus/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-prometheus/examples.md new file mode 100755 index 00000000..a1bf7586 --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-prometheus/examples.md @@ -0,0 +1,33 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 4 --> + +## Titles +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/27 merged PRs) + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (4/27 merged PRs) + - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1` + - e.g. `fix/cve-2026-33186-grpc-go-release-2.15-attempt-1` +- `dependabot/<ecosystem>/<pkg>-<version>` (3/27 merged PRs) + - e.g. `dependabot/go_modules/github.com/golang-jwt/jwt/v5-5.2.2` + - e.g. `dependabot/go_modules/golang.org/x/crypto-0.35.0` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates +- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps) + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-thanos-receive-controller/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-thanos-receive-controller/examples.md new file mode 100755 index 00000000..8dee3c2c --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-thanos-receive-controller/examples.md @@ -0,0 +1,32 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 4 --> + +## Titles +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (4/17 merged PRs) + - e.g. `Security: Fix CVE-2026-33186 (grpc-go) - release-2.15` + - e.g. `Security: Fix CVE-2026-33186 (grpc-go) - release-2.14` + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (4/17 merged PRs) + - e.g. `fix/cve-2026-33186-grpc-go-release-2.15` + - e.g. `fix/cve-2026-33186-grpc-go-release-2.14` +- `dependabot/<ecosystem>/<pkg>-<version>` (1/17 merged PRs) + - e.g. `dependabot/go_modules/golang.org/x/crypto-0.35.0` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates +- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps) + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. 
`release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI diff --git a/workflows/cve-fixer/.cve-fix/stolostron-thanos/examples.md b/workflows/cve-fixer/.cve-fix/stolostron-thanos/examples.md new file mode 100755 index 00000000..b51136f2 --- /dev/null +++ b/workflows/cve-fixer/.cve-fix/stolostron-thanos/examples.md @@ -0,0 +1,38 @@ +<!-- last-analyzed: 2026-04-16 | cve-merged: 6 --> + +## Titles +- `Security: Fix CVE-YYYY-XXXXX (<package>)` (3/35 merged PRs) + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` + - e.g. `Security: Fix CVE-2026-33186 (grpc-go)` +- `Other CVE title format` (2/35 merged PRs) + - e.g. `fix: [release-2.10] CVE-2023-45288 ensure golang/x/net is 0.23+` + - e.g. `CVE-2023-45288 ensure golang/x/net is 0.23+` +- `Bump <pkg> from X to Y to fix CVE-YYYY-XXXXX` (1/35 merged PRs) + - e.g. `Bump google.golang.org/grpc to v1.79.3 to fix CVE-2026-33186` + +## Branches +- `fix/cve-<id>-<pkg>-<branch>-attempt-N` (4/35 merged PRs) + - e.g. `fix/cve-2026-33186-grpc-go-release-2.14-attempt-1` + - e.g. `fix/cve-2026-33186-grpc-go-release-2.17-attempt-1` +- `dependabot/<ecosystem>/<pkg>-<version>` (3/35 merged PRs) + - e.g. `dependabot/go_modules/golang.org/x/crypto-0.35.0` + - e.g. 
`dependabot/go_modules/github.com/golang-jwt/jwt/v5-5.2.2` + +## Files +- `go.mod` + `go.sum` always change together for Go dependency updates +- `Dockerfile` / `Containerfile.operator` may also be updated (Go version bumps) + +## Co-upgrades +- When bumping a Go dependency, always run `go mod tidy` to update `go.sum` +- Go version bumps (`go.mod` directive) often require updating `Dockerfile` / `Containerfile.operator` + +## PR Description +- Include CVE ID, severity, and affected package in description +- Reference the target branch (e.g. `release-2.16`) when targeting non-default branches +- Include test results section +- For multi-branch fixes, create separate PRs per branch (not a single PR) + +## Don'ts +- ❌ Do not combine multiple CVE fixes in a single PR +- ❌ Do not target the wrong release branch (verify `--base` matches intended branch) +- ❌ Do not skip `go mod tidy` — incomplete `go.sum` updates will fail CI
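
These per-repo files all describe one shared stolostron convention for branch names (`fix/cve-<id>-<pkg>-<branch>-attempt-N`) and titles (`Security: Fix CVE-YYYY-XXXXX (<package>)`). A minimal sketch of how a workflow might derive both — the helper itself, its variable names, and the placeholder inputs are illustrative and not part of the generated files:

```shell
# Hypothetical helper showing the naming conventions captured above.
# Inputs are placeholders; real workflows would supply them per CVE.
cve="CVE-2026-33186"
pkg="grpc-go"          # short package alias used in branch names
base="release-2.14"    # target release branch (must match `--base` on the PR)
attempt=1

# Branch: fix/cve-<id>-<pkg>-<branch>-attempt-N
id=$(printf '%s' "$cve" | tr '[:upper:]' '[:lower:]' | sed 's/^cve-//')
branch="fix/cve-${id}-${pkg}-${base}-attempt-${attempt}"

# Title: Security: Fix CVE-YYYY-XXXXX (<package>)
title="Security: Fix ${cve} (${pkg})"

echo "$branch"   # fix/cve-2026-33186-grpc-go-release-2.14-attempt-1
echo "$title"    # Security: Fix CVE-2026-33186 (grpc-go)
```

One branch and one title per CVE per release branch, matching the "separate PRs per branch" rule in the Don'ts sections.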