iannil/gpu-dead-node-detector
GDND - GPU Dead Node Detector


English | 中文

GDND is a proactive GPU health monitoring and fault isolation system for Kubernetes clusters. It runs as a DaemonSet on all GPU nodes, detects unhealthy GPUs through multi-level detection, and automatically isolates faulty nodes via Taint/Cordon mechanisms.

Features

  • Three-tier Detection Pipeline

    • L1 Passive Detection (30s): NVML queries, XID error scanning, zombie process detection
    • L2 Active Detection (5min): CUDA 128x128 matrix multiplication micro-benchmark
    • L3 PCIe Detection (24h, optional): PCIe bandwidth testing
  • Health State Machine: HEALTHY → SUSPECTED → UNHEALTHY → ISOLATED

  • Automatic Isolation: Cordon nodes, apply taints, evict pods (configurable)

  • Prometheus Metrics: Exposes per-GPU health status (gdnd_gpu_status), temperature, utilization, and memory usage

  • Lightweight: Target image size < 50MB, minimal resource footprint (10m CPU, 32Mi memory)

  • Extensible: Device abstraction layer supports NVIDIA GPUs and Huawei Ascend NPUs
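The L2 check listed above is a 128x128 matrix multiplication. One common way to turn such a multiplication into a health check is a known-answer test: multiply inputs whose product is known in advance and compare. The sketch below illustrates that idea on the CPU in Rust. It is not the project's gpu_check.cu, and the all-ones inputs, the tolerance, and the function names are assumptions made for illustration.

```rust
// Known-answer matrix check, run on the CPU purely for illustration.
// GDND's real L2 check is a CUDA kernel; names and inputs here are hypothetical.

const N: usize = 128;

/// Multiply two N x N row-major matrices.
fn matmul(a: &[f32], b: &[f32]) -> Vec<f32> {
    let mut c = vec![0.0f32; N * N];
    for i in 0..N {
        for k in 0..N {
            let aik = a[i * N + k];
            for j in 0..N {
                c[i * N + j] += aik * b[k * N + j];
            }
        }
    }
    c
}

/// True when every element of `c` is within `tol` of `expected`.
/// A NaN element fails the comparison, so it counts as a failed check.
fn all_close(c: &[f32], expected: f32, tol: f32) -> bool {
    c.iter().all(|&x| (x - expected).abs() <= tol)
}

fn main() {
    // With all-ones inputs, every element of the product equals N.
    let a = vec![1.0f32; N * N];
    let b = vec![1.0f32; N * N];
    let c = matmul(&a, &b);
    let passed = all_close(&c, N as f32, 1e-3);
    println!("known-answer check passed: {passed}");
}
```

A device that returns wrong values, NaNs, or no answer within the timeout would fail such a check even when passive queries still look normal.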

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         GDND DaemonSet                          │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │ L1 Passive  │  │ L2 Active   │  │ L3 PCIe     │  Detectors   │
│  │ (30s)       │  │ (5min)      │  │ (24h)       │              │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘              │
│         │                │                │                     │
│         └────────────────┼────────────────┘                     │
│                          ▼                                      │
│              ┌───────────────────────┐                          │
│              │  Health State Machine │                          │
│              │  HEALTHY → SUSPECTED  │                          │
│              │ → UNHEALTHY → ISOLATED│                          │
│              └───────────┬───────────┘                          │
│                          │                                      │
│         ┌────────────────┼────────────────┐                     │
│         ▼                ▼                ▼                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │   Cordon    │  │    Taint    │  │    Alert    │  Actions     │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
└─────────────────────────────────────────────────────────────────┘
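The state machine in the diagram can be sketched in a few lines of Rust. This is a minimal illustration, not the code in state_machine.rs: the type and method names are hypothetical, and two behaviours are assumptions: a passing check returns a SUSPECTED GPU to HEALTHY, and ISOLATED is terminal (auto-recovery is listed as planned in the roadmap).

```rust
// Hypothetical sketch of HEALTHY -> SUSPECTED -> UNHEALTHY -> ISOLATED.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Healthy,
    Suspected,
    Unhealthy,
    Isolated,
}

#[derive(Debug, Clone, Copy)]
enum CheckResult {
    Pass,
    Fail,
    Fatal, // e.g. a fatal XID was seen
}

struct GpuHealth {
    state: Health,
    consecutive_failures: u32,
    failure_threshold: u32,
}

impl GpuHealth {
    fn new(failure_threshold: u32) -> Self {
        Self { state: Health::Healthy, consecutive_failures: 0, failure_threshold }
    }

    /// Feed one detection result and return the resulting state.
    fn observe(&mut self, result: CheckResult) -> Health {
        match (self.state, result) {
            // Assumption: no way back from ISOLATED in this sketch.
            (Health::Isolated, _) => {}
            // Fatal results skip the failure counter entirely.
            (_, CheckResult::Fatal) => self.state = Health::Unhealthy,
            // UNHEALTHY waits for the isolation step.
            (Health::Unhealthy, _) => {}
            (_, CheckResult::Pass) => {
                self.consecutive_failures = 0;
                self.state = Health::Healthy;
            }
            (_, CheckResult::Fail) => {
                self.consecutive_failures += 1;
                self.state = if self.consecutive_failures >= self.failure_threshold {
                    Health::Unhealthy
                } else {
                    Health::Suspected
                };
            }
        }
        self.state
    }

    /// Called once cordon/taint actions have been applied.
    fn mark_isolated(&mut self) {
        if self.state == Health::Unhealthy {
            self.state = Health::Isolated;
        }
    }
}

fn main() {
    let mut gpu = GpuHealth::new(3);
    let results = [
        CheckResult::Fail,
        CheckResult::Pass,
        CheckResult::Fail,
        CheckResult::Fail,
        CheckResult::Fail,
    ];
    for result in results {
        println!("{:?} -> {:?}", result, gpu.observe(result));
    }
    gpu.mark_isolated();
    println!("final state: {:?}", gpu.state);

    let mut other = GpuHealth::new(3);
    println!("fatal -> {:?}", other.observe(CheckResult::Fatal));
}
```

Counting consecutive failures rather than total failures means a single transient error only moves a GPU to SUSPECTED, while a fatal result bypasses the counter.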

Quick Start

Prerequisites

  • Kubernetes cluster 1.25+
  • NVIDIA GPU nodes with drivers installed
  • kubectl configured to access your cluster

Install with Helm (Recommended)

# Install from local chart
helm install gdnd ./release/rust/gdnd/chart \
  --namespace kube-system \
  --set config.dryRun=true  # Start in dry-run mode for safety

# After verifying logs, disable dry-run
helm upgrade gdnd ./release/rust/gdnd/chart \
  --namespace kube-system \
  --set config.dryRun=false

Install with kubectl

cd release/rust/gdnd/deploy

# Apply RBAC
kubectl apply -f rbac.yaml

# Apply ConfigMap
kubectl apply -f configmap.yaml

# Deploy DaemonSet
kubectl apply -f daemonset.yaml

Verify Installation

# Check DaemonSet status
kubectl get daemonset gdnd -n kube-system

# View logs
kubectl logs -l app.kubernetes.io/name=gdnd -n kube-system -f

# Check metrics (port-forward blocks, so run curl from a second terminal)
kubectl port-forward -n kube-system daemonset/gdnd 9100:9100
curl http://localhost:9100/metrics | grep gdnd_gpu

Configuration

Key Configuration Options

Parameter | Description | Default
--- | --- | ---
device_type | Device type: auto, nvidia, ascend | auto
l1_interval | L1 passive detection interval | 30s
l2_interval | L2 active detection interval | 5m
health.failure_threshold | Consecutive failures before UNHEALTHY | 3
health.fatal_xids | Fatal XID codes for immediate isolation | [31, 43, 48, 79]
health.temperature_threshold | Temperature threshold (Celsius) | 85
isolation.cordon | Whether to cordon unhealthy nodes | true
isolation.evict_pods | Whether to evict pods | false
isolation.taint_key | Taint key | nvidia.com/gpu-health
isolation.taint_effect | Taint effect | NoSchedule
dry_run | Log actions without executing | false

Example config.yaml

device_type: auto
l1_interval: 30s
l2_interval: 5m

health:
  failure_threshold: 3
  fatal_xids: [31, 43, 48, 79]
  temperature_threshold: 85
  active_check_timeout: 5s

isolation:
  cordon: true
  evict_pods: false
  taint_key: nvidia.com/gpu-health
  taint_value: failed
  taint_effect: NoSchedule

metrics:
  enabled: true
  port: 9100

dry_run: false
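To make the values above concrete, here is a small Rust sketch that interprets the interval strings and renders the taint settings in the key=value:effect form that kubectl taint accepts. It is an illustration only: the function names are hypothetical, and how GDND itself parses durations is not stated here, so the accepted suffixes (s, m, h) are an assumption.

```rust
use std::time::Duration;

/// Parse interval strings such as "30s", "5m", or "24h".
/// Assumption: a whole number followed by a single unit suffix.
fn parse_interval(s: &str) -> Option<Duration> {
    let s = s.trim();
    let unit = s.chars().last()?;
    let number = &s[..s.len() - unit.len_utf8()];
    let n: u64 = number.parse().ok()?;
    let secs = match unit {
        's' => n,
        'm' => n.checked_mul(60)?,
        'h' => n.checked_mul(3600)?,
        _ => return None,
    };
    Some(Duration::from_secs(secs))
}

/// Render a taint as `key=value:effect`.
fn taint_spec(key: &str, value: &str, effect: &str) -> String {
    format!("{key}={value}:{effect}")
}

fn main() {
    let l1 = parse_interval("30s").expect("valid interval");
    let l2 = parse_interval("5m").expect("valid interval");
    println!("L1 every {l1:?}, L2 every {l2:?}");
    println!("{}", taint_spec("nvidia.com/gpu-health", "failed", "NoSchedule"));
}
```

With the example config, the rendered taint is nvidia.com/gpu-health=failed:NoSchedule, which is also the string to look for in kubectl describe node output when checking that isolation took effect.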

Fatal XID Error Codes

These XID errors trigger immediate GPU isolation:

XID | Description
--- | ---
31 | GPU memory page fault / MMU fault
43 | GPU stopped processing
48 | Double Bit ECC Error
79 | GPU has fallen off the bus
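The NVIDIA driver reports XID events in the kernel log on lines containing "NVRM: Xid (...): <code>, ...". The sketch below shows one way such a line could be matched against the fatal list. It is an assumption about the approach, not GDND's scanner: the function names and the sample log lines are hypothetical.

```rust
// Hypothetical XID scanner: extract the code from a kernel log line
// and test it against the fatal list from the table above.

const FATAL_XIDS: [u32; 4] = [31, 43, 48, 79];

/// Extract the XID code from a line like
/// "NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus."
fn parse_xid(line: &str) -> Option<u32> {
    let after_marker = line.split_once("NVRM: Xid (")?.1;
    let after_addr = after_marker.split_once("): ")?.1;
    let digits: String = after_addr
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}

fn is_fatal(xid: u32) -> bool {
    FATAL_XIDS.contains(&xid)
}

fn main() {
    // Sample lines are made up for illustration.
    let lines = [
        "[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.",
        "[1240.1] NVRM: Xid (PCI:0000:3b:00): 13, Graphics Exception",
        "[1241.0] usb 1-1: new high-speed USB device",
    ];
    for line in lines {
        match parse_xid(line) {
            Some(xid) => println!("xid {xid}, fatal: {}", is_fatal(xid)),
            None => println!("no xid in line"),
        }
    }
}
```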

Prometheus Metrics

Metric | Type | Labels | Description
--- | --- | --- | ---
gdnd_gpu_status | Gauge | gpu, uuid, name | Health status (0=healthy, 1=suspected, 2=unhealthy, 3=isolated)
gdnd_gpu_temperature_celsius | Gauge | gpu | GPU temperature
gdnd_gpu_utilization_percent | Gauge | gpu | GPU utilization
gdnd_gpu_memory_used_bytes | Gauge | gpu | GPU memory used
gdnd_check_duration_seconds | Histogram | level, gpu | Detection check duration
gdnd_check_failures_total | Counter | level, gpu, reason | Total detection failures
gdnd_isolation_actions_total | Counter | action | Total isolation actions
gdnd_gpu_count | Gauge | - | Number of GPUs detected
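When reading the /metrics output by hand or from a script, the numeric gdnd_gpu_status value has to be mapped back to a state name. The Rust sketch below parses one sample line in the Prometheus text format and does that mapping. The sample line and the function names are hypothetical, and the parser is deliberately naive: it assumes label values contain no commas or escaped quotes.

```rust
/// Map the gauge value to the state name from the table above.
fn status_name(code: u8) -> Option<&'static str> {
    match code {
        0 => Some("healthy"),
        1 => Some("suspected"),
        2 => Some("unhealthy"),
        3 => Some("isolated"),
        _ => None,
    }
}

/// Parse one `gdnd_gpu_status{...} <value>` line into (gpu label, status code).
fn parse_status_line(line: &str) -> Option<(String, u8)> {
    let rest = line.trim().strip_prefix("gdnd_gpu_status{")?;
    let (labels, value) = rest.split_once('}')?;
    // Pick out the `gpu` label; other labels are ignored.
    let gpu = labels.split(',').find_map(|kv| {
        let (k, v) = kv.split_once('=')?;
        (k.trim() == "gpu").then(|| v.trim().trim_matches('"').to_string())
    })?;
    // Sample values are floats in the text format; an optional timestamp may follow.
    let code = value.split_whitespace().next()?.parse::<f64>().ok()?;
    Some((gpu, code as u8))
}

fn main() {
    // Hypothetical sample line; the uuid and name values are placeholders.
    let line = r#"gdnd_gpu_status{gpu="0",uuid="GPU-hypothetical",name="Example GPU"} 2"#;
    if let Some((gpu, code)) = parse_status_line(line) {
        println!("gpu {gpu}: {}", status_name(code).unwrap_or("unknown"));
    }
}
```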

Development

  • Rust 1.75+
  • CUDA Toolkit 12.2+ (for gpu-check binary)

Build from Source

cd src/rust/gdnd

# Check compilation
cargo check

# Run tests
cargo test

# Build release binary
cargo build --release

# Run locally (dry-run mode)
cargo run -- --config configs/config.yaml --node-name test-node --dry-run

Build Docker Image

cd release/rust/gdnd

# Build release binaries
./build.sh

# Build Docker image
./build.sh --docker

Project Structure

src/rust/gdnd/
├── gdnd/                    # Main binary
│   └── src/
│       ├── main.rs          # Entry point
│       ├── config.rs        # Configuration
│       └── cli.rs           # CLI arguments
├── gdnd-core/               # Core detection logic
│   └── src/
│       ├── device/          # Device abstraction
│       │   ├── interface.rs # DeviceInterface trait
│       │   ├── nvidia.rs    # NVIDIA implementation
│       │   └── mock.rs      # Mock for testing
│       ├── detection/       # Detectors
│       │   ├── l1_passive.rs
│       │   └── l2_active.rs
│       ├── state_machine.rs # Health state machine
│       ├── scheduler.rs     # Detection scheduler
│       └── metrics.rs       # Prometheus metrics
├── gdnd-k8s/                # Kubernetes integration
│   └── src/
│       ├── client.rs        # K8s client
│       └── node_ops.rs      # Node operations
└── gpu-check/               # CUDA micro-benchmark
    └── gpu_check.cu         # 128x128 matrix multiply

release/rust/gdnd/
├── build.sh                 # Build script
├── chart/                   # Helm chart
├── configs/                 # Production configs
└── deploy/                  # K8s manifests

Comparison with Alternatives

Feature | GDND | Node Problem Detector | DIY Scripts
--- | --- | --- | ---
GPU-specific detection | ✅ XID, ECC, driver deadlock | ❌ Generic | Varies
Active health check | ✅ CUDA matrix mul | ❌ | Varies
Automatic isolation | ✅ Cordon + Taint | ⚠️ Manual rules | ⚠️
Image size | < 50MB | ~100MB | Varies
Configuration | Simple YAML | Complex | Custom
Prometheus metrics | ✅ Built-in | ✅ | Manual

Roadmap

Completed (v1.0)

  • Core Rust implementation with NVIDIA GPU support
  • L1 Passive Detection (NVML/npu-smi, XID scanning, zombie process detection)
  • L2 Active Detection (CUDA/AscendCL micro-benchmark)
  • Health State Machine (HEALTHY → SUSPECTED → UNHEALTHY → ISOLATED)
  • Kubernetes integration (Cordon/Taint/Evict)
  • Prometheus metrics
  • Helm Chart deployment
  • Huawei Ascend NPU full support (2026-01-21)

In Progress

  • L3 PCIe bandwidth test (framework ready, 80% complete)
  • Real hardware integration testing

Planned

  • ECC error detection enhancement
  • Grafana dashboard templates
  • AlertManager integration
  • Node auto-recovery (GPU reset, ISOLATED → HEALTHY)
  • Multi-GPU per-device isolation

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Run tests (cargo test)
  4. Commit your changes (git commit -m 'Add amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
