GDND is a proactive GPU health monitoring and fault isolation system for Kubernetes clusters. It runs as a DaemonSet on all GPU nodes, detects unhealthy GPUs through multi-level detection, and automatically isolates faulty nodes via Taint/Cordon mechanisms.
-
Three-tier Detection Pipeline
- L1 Passive Detection (30s): NVML queries, XID error scanning, zombie process detection
- L2 Active Detection (5min): CUDA 128x128 matrix multiplication micro-benchmark
- L3 PCIe Detection (24h, optional): PCIe bandwidth testing
-
Health State Machine:
HEALTHYβSUSPECTEDβUNHEALTHYβISOLATED -
Automatic Isolation: Cordon nodes, apply taints, evict pods (configurable)
-
Prometheus Metrics: Full observability with
gdnd_gpu_status, temperature, utilization metrics -
Lightweight: Target image size < 50MB, minimal resource footprint (10m CPU, 32Mi memory)
-
Extensible: Device abstraction layer supports NVIDIA GPUs and Huawei Ascend NPUs
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β GDND DaemonSet β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β L1 Passive β β L2 Active β β L3 PCIe β Detectors β
β β (30s) β β (5min) β β (24h) β β
β ββββββββ¬βββββββ ββββββββ¬βββββββ ββββββββ¬βββββββ β
β β β β β
β ββββββββββββββββββΌβββββββββββββββββ β
β βΌ β
β βββββββββββββββββββββββββ β
β β Health State Machine β β
β β HEALTHY β SUSPECTED β β
β β β UNHEALTHY β ISOLATEDβ β
β βββββββββββββ¬ββββββββββββ β
β β β
β ββββββββββββββββββΌβββββββββββββββββ β
β βΌ βΌ βΌ β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β Cordon β β Taint β β Alert β Actions β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
- Kubernetes cluster 1.25+
- NVIDIA GPU nodes with drivers installed
kubectlconfigured to access your cluster
# Install from local chart
helm install gdnd ./release/rust/gdnd/chart \
--namespace kube-system \
--set config.dryRun=true # Start in dry-run mode for safety
# After verifying logs, disable dry-run
helm upgrade gdnd ./release/rust/gdnd/chart \
--namespace kube-system \
--set config.dryRun=falsecd release/rust/gdnd/deploy
# Apply RBAC
kubectl apply -f rbac.yaml
# Apply ConfigMap
kubectl apply -f configmap.yaml
# Deploy DaemonSet
kubectl apply -f daemonset.yaml# Check DaemonSet status
kubectl get daemonset gdnd -n kube-system
# View logs
kubectl logs -l app.kubernetes.io/name=gdnd -n kube-system -f
# Check metrics
kubectl port-forward -n kube-system daemonset/gdnd 9100:9100
curl http://localhost:9100/metrics | grep gdnd_gpu| Parameter | Description | Default |
|---|---|---|
device_type |
Device type: auto, nvidia, ascend |
auto |
l1_interval |
L1 passive detection interval | 30s |
l2_interval |
L2 active detection interval | 5m |
health.failure_threshold |
Consecutive failures before UNHEALTHY | 3 |
health.fatal_xids |
Fatal XID codes for immediate isolation | [31, 43, 48, 79] |
health.temperature_threshold |
Temperature threshold (Celsius) | 85 |
isolation.cordon |
Whether to cordon unhealthy nodes | true |
isolation.evict_pods |
Whether to evict pods | false |
isolation.taint_key |
Taint key | nvidia.com/gpu-health |
isolation.taint_effect |
Taint effect | NoSchedule |
dry_run |
Log actions without executing | false |
device_type: auto
l1_interval: 30s
l2_interval: 5m
health:
failure_threshold: 3
fatal_xids: [31, 43, 48, 79]
temperature_threshold: 85
active_check_timeout: 5s
isolation:
cordon: true
evict_pods: false
taint_key: nvidia.com/gpu-health
taint_value: failed
taint_effect: NoSchedule
metrics:
enabled: true
port: 9100
dry_run: falseThese XID errors trigger immediate GPU isolation:
| XID | Description |
|---|---|
| 31 | GPU memory page fault / MMU fault |
| 43 | GPU stopped processing |
| 48 | Double Bit ECC Error |
| 79 | GPU has fallen off the bus |
| Metric | Type | Labels | Description |
|---|---|---|---|
gdnd_gpu_status |
Gauge | gpu, uuid, name | Health status (0=healthy, 1=suspected, 2=unhealthy, 3=isolated) |
gdnd_gpu_temperature_celsius |
Gauge | gpu | GPU temperature |
gdnd_gpu_utilization_percent |
Gauge | gpu | GPU utilization |
gdnd_gpu_memory_used_bytes |
Gauge | gpu | GPU memory used |
gdnd_check_duration_seconds |
Histogram | level, gpu | Detection check duration |
gdnd_check_failures_total |
Counter | level, gpu, reason | Total detection failures |
gdnd_isolation_actions_total |
Counter | action | Total isolation actions |
gdnd_gpu_count |
Gauge | - | Number of GPUs detected |
- Rust 1.75+
- CUDA Toolkit 12.2+ (for gpu-check binary)
cd src/rust/gdnd
# Check compilation
cargo check
# Run tests
cargo test
# Build release binary
cargo build --release
# Run locally (dry-run mode)
cargo run -- --config configs/config.yaml --node-name test-node --dry-runcd release/rust/gdnd
# Build release binaries
./build.sh
# Build Docker image
./build.sh --dockersrc/rust/gdnd/
βββ gdnd/ # Main binary
β βββ src/
β βββ main.rs # Entry point
β βββ config.rs # Configuration
β βββ cli.rs # CLI arguments
βββ gdnd-core/ # Core detection logic
β βββ src/
β βββ device/ # Device abstraction
β β βββ interface.rs # DeviceInterface trait
β β βββ nvidia.rs # NVIDIA implementation
β β βββ mock.rs # Mock for testing
β βββ detection/ # Detectors
β β βββ l1_passive.rs
β β βββ l2_active.rs
β βββ state_machine.rs # Health state machine
β βββ scheduler.rs # Detection scheduler
β βββ metrics.rs # Prometheus metrics
βββ gdnd-k8s/ # Kubernetes integration
β βββ src/
β βββ client.rs # K8s client
β βββ node_ops.rs # Node operations
βββ gpu-check/ # CUDA micro-benchmark
βββ gpu_check.cu # 128x128 matrix multiply
release/rust/gdnd/
βββ build.sh # Build script
βββ chart/ # Helm chart
βββ configs/ # Production configs
βββ deploy/ # K8s manifests
| Feature | GDND | Node Problem Detector | DIY Scripts |
|---|---|---|---|
| GPU-specific detection | β XID, ECC, driver deadlock | β Generic | Varies |
| Active health check | β CUDA matrix mul | β | Varies |
| Automatic isolation | β Cordon + Taint | ||
| Image size | < 50MB | ~100MB | Varies |
| Configuration | Simple YAML | Complex | Custom |
| Prometheus metrics | β Built-in | β | Manual |
- Core Rust implementation with NVIDIA GPU support
- L1 Passive Detection (NVML/npu-smi, XID scanning, zombie process detection)
- L2 Active Detection (CUDA/AscendCL micro-benchmark)
- Health State Machine (HEALTHY β SUSPECTED β UNHEALTHY β ISOLATED)
- Kubernetes integration (Cordon/Taint/Evict)
- Prometheus metrics
- Helm Chart deployment
- Huawei Ascend NPU full support (2026-01-21)
- L3 PCIe bandwidth test (framework ready, 80% complete)
- Real hardware integration testing
- ECC error detection enhancement
- Grafana dashboard templates
- AlertManager integration
- Node auto-recovery (GPU reset, ISOLATED β HEALTHY)
- Multi-GPU per-device isolation
Contributions are welcome! Please feel free to submit issues and pull requests.
- Fork the repository
- Create your feature branch (
git checkout -b feature/amazing-feature) - Run tests (
cargo test) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- nvml-wrapper - Rust bindings for NVIDIA NVML
- kube-rs - Kubernetes client for Rust
- prometheus-rs - Prometheus client for Rust