IU HPC Agent Skill

A Claude Code skill that turns Claude into an expert assistant for Indiana University's Big Red 200 and Quartz HPC systems. Verified against on-cluster data (March 2026).

What It Does

When a user asks anything about IU's HPC systems, Claude reads the relevant skill files and provides accurate, actionable answers with copy-paste commands and templates. No more digging through scattered KB articles.

Example interactions:

"I need to train a 7B parameter model" -- estimates GPU memory, walltime, fair-share cost, and generates a ready-to-submit SBATCH script
"My job failed with OOM" -- walks through the diagnostic flowchart (host vs GPU OOM, exit codes, fixes)
"How do I get started on Big Red 200?" -- provides the full onboarding path: access, SSH, conda, first job
"I have 10,000 files to process" -- recommends job arrays with the right partition, throttling, and thread settings

Architecture

SKILL.md (89 lines)          <-- Always loaded. Routes to the right file.
  |
  +-- 00-quickstart.md        New user onboarding (access, SSH, conda, first job)
  +-- 01-hardware.md          BR200 + Quartz specs, GPU topology, network
  +-- 02-slurm.md             Partitions, billing, QOS, scheduler, backfill
  +-- 03-gpu-optimization.md  DDP, FSDP, mixed precision, checkpointing, NCCL
  +-- 04-cpu-optimization.md  NUMA topology, SMT, BLAS threading, compiler flags
  +-- 05-storage-envs.md      Storage tiers, I/O tuning, conda, modules, containers
  +-- 06-templates.md         10 copy-paste SBATCH templates
  +-- 07-recipes-tips.md      Workflows, data transfer, tips, efficiency checklist
  +-- 08-access-support.md    Allocations, ACCESS, Jetstream2, training, support
  +-- 09-resource-estimator.md GPU memory formulas, walltime, sizing cheat sheet
  +-- 10-troubleshooting.md   Diagnostic flowcharts for every failure mode

Token-efficient by design: Claude loads only the 89-line router by default, then reads 1-2 topic files on demand. A typical query uses ~250 lines of context instead of loading all 2,000+ lines of the skill.

What's Covered

Category	Highlights
Hardware	BR200 (640 CPU + 66 GPU nodes, A100-40GB, NVLink mesh), Quartz (92 CPU + 22 V100 + 12 H100 nodes)
SLURM	All partitions, billing weights (1 GPU = 16x CPU), QOS limits, scheduler config, backfill strategies
GPU Training	DDP/FSDP decision tree, multi-node launch, BF16, activation checkpointing, gradient accumulation, PyTorch checklist
CPU Optimization	Zen 2 NUMA topology (8 NUMA domains per compute node), SMT guidance, BLAS thread control, MKL-on-AMD fix, compiler flags
Storage	5 tiers (Home/Slate/Scratch/SDA/Geode-Project), Lustre tuning, local tmpfs staging, small-file avoidance, Globus endpoints
Templates	GPU training, CPU preprocessing, HP sweep arrays, multi-node distributed, interactive, Jupyter, pipeline submission, memory-intensive
Resource Estimation	GPU memory formulas (16 bytes/param for FP32 training), model size tables (10M to 70B), walltime benchmarks, fair-share cost calculator
Troubleshooting	Exit code decoder, OOM (host + GPU), PENDING diagnosis, NCCL errors, filesystem issues, CUDA mismatches, low GPU utilization
Onboarding	7-step quickstart from zero to running GPU job, common first-day mistakes, Research Desktop (RED) GUI option
Access & Training	RT Projects, ACCESS allocations, Jetstream2, 6+ training courses, Wednesday office hours
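
As a rough illustration of the sizing arithmetic behind the Resource Estimation and SLURM rows, the sketch below applies the 16 bytes/param figure and the 1 GPU = 16x CPU billing weight. The function names are hypothetical and not part of the skill. The memory estimate covers model states only (FP32 weights, gradients, and two Adam moments at 4 bytes each) and ignores activations, so treat it as a lower bound. The billing function assumes the weights are simply added and leaves out any memory weight, which may not match the cluster's actual configuration.

```shell
#!/usr/bin/env bash

# Model-state memory in decimal GB: params (in millions) * bytes per param.
# 16 bytes/param is the FP32 training figure from the table above.
estimate_train_mem_gb() {
    local params_millions=$1
    local bytes_per_param=${2:-16}
    # millions of params * bytes/param = MB; divide by 1000 for GB
    echo $(( params_millions * bytes_per_param / 1000 ))
}

# Smallest GPU count whose combined memory covers the estimate (ceiling division).
# Default of 40 GB corresponds to the A100-40GB listed in the Hardware row.
min_gpus_needed() {
    local need_gb=$1
    local gpu_gb=${2:-40}
    echo $(( (need_gb + gpu_gb - 1) / gpu_gb ))
}

# Billing units for a request, ASSUMING CPU and GPU weights are summed
# and 1 GPU counts as 16 CPUs. Memory weighting is not modeled.
billing_units() {
    local cpus=$1 gpus=$2
    echo $(( cpus + 16 * gpus ))
}
```

For example, `estimate_train_mem_gb 7000` gives 112 for a 7B-parameter model, and `min_gpus_needed 112` gives 3, which only holds if the model states are sharded across GPUs (for example with FSDP).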

Installation

Copy the skill directory into your Claude Code skills path:

# For a single user (create the skills directory first if it does not exist)
mkdir -p ~/.claude/skills
cp -r iu-hpc/ ~/.claude/skills/iu-hpc/

# Or clone the repo
git clone https://github.com/AmitSubhash/iu-hpc-agent.git ~/.claude/skills/iu-hpc/

Setup

Add 3 lines to your ~/.bashrc on BR200/Quartz:

export SLURM_ACCOUNT="your_account_id"   # from projects.rt.iu.edu
export PROJ="/N/slate/$USER/your_project"
export ENVS="/N/slate/$USER/envs"

All templates use <account> as the value of SBATCH -A and $USER in paths. Replace <account> with your own account ID before submitting.
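
One way to do the replacement is a small helper that fills in the placeholder from the SLURM_ACCOUNT variable set above. This is a minimal sketch, not something shipped with the skill: `fill_account` is a hypothetical name, and it assumes the template contains the literal text `<account>`.

```shell
#!/usr/bin/env bash

# Print a template with every literal <account> replaced by a real account ID.
# The ID comes from the second argument, or from $SLURM_ACCOUNT if omitted.
fill_account() {
    local template=$1
    local account=${2:-${SLURM_ACCOUNT:?set SLURM_ACCOUNT or pass an account ID}}
    # '|' as the sed delimiter avoids clashes if the ID ever contains '/'
    sed "s|<account>|${account}|g" "$template"
}
```

Usage would look like `fill_account template.sbatch > my_job.sbatch`, where `template.sbatch` is a placeholder filename. Alternatively, passing `-A "$SLURM_ACCOUNT"` on the sbatch command line overrides the `#SBATCH -A` directive in the script.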

Verification

All hardware specs, SLURM configuration, and billing weights were verified by SSHing into BR200 compute nodes (March 2026). To re-verify:

ssh br200
sinfo -o "%20P %10a %10l %6D %8c %10m %G"         # partitions
sacctmgr show qos format=Name,Priority,MaxWall      # QOS
scontrol show config | grep -E "Priority|Backfill"   # scheduler

Contributing

Hardware data should be verified on-cluster before updating
Templates must include set -euo pipefail, explicit --mem, module purge, and thread control vars
Keep each file under 300 lines
No hardcoded usernames, account IDs, or personal paths -- use $USER and <account>
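
Two of these rules are easy to check mechanically before opening a pull request. The sketch below is one possible pre-submit check, not a script that ships with the repo. The function names are hypothetical, and the path check only looks for a literal username under /N/slate, so it will miss other locations.

```shell
#!/usr/bin/env bash

# List .md files longer than the limit (default 300 lines).
# Returns non-zero if any file is over the limit.
check_line_limits() {
    local dir=${1:-.} limit=${2:-300} status=0 file lines
    for file in "$dir"/*.md; do
        [ -e "$file" ] || continue
        # $(( ... )) strips the padding some wc implementations add
        lines=$(( $(wc -l < "$file") ))
        if [ "$lines" -gt "$limit" ]; then
            echo "$(basename "$file"): $lines lines (limit $limit)"
            status=1
        fi
    done
    return "$status"
}

# Print lines with a literal name after /N/slate/ (i.e. not $USER).
# Returns non-zero if any such line is found.
check_no_hardcoded_paths() {
    local dir=${1:-.}
    ! grep -nE '/N/slate/[A-Za-z0-9_]+' "$dir"/*.md
}
```

Running `check_line_limits . && check_no_hardcoded_paths .` from the skill directory prints nothing and exits 0 when both rules hold.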
