GDND - GPU Dead Node Detector

GDND is a proactive GPU health monitoring and fault isolation system for Kubernetes clusters. It runs as a DaemonSet on all GPU nodes, detects unhealthy GPUs through multi-level detection, and automatically isolates faulty nodes via Taint/Cordon mechanisms.

Features

Three-tier Detection Pipeline
- L1 Passive Detection (30s): NVML queries, XID error scanning, zombie process detection
- L2 Active Detection (5min): CUDA 128x128 matrix multiplication micro-benchmark
- L3 PCIe Detection (24h, optional): PCIe bandwidth testing
Health State Machine: HEALTHY → SUSPECTED → UNHEALTHY → ISOLATED
Automatic Isolation: Cordon nodes, apply taints, evict pods (configurable)
Prometheus Metrics: Full observability with gdnd_gpu_status, temperature, utilization metrics
Lightweight: Target image size < 50MB, minimal resource footprint (10m CPU, 32Mi memory)
Extensible: Device abstraction layer supports NVIDIA GPUs and Huawei Ascend NPUs

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         GDND DaemonSet                          │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │ L1 Passive  │  │ L2 Active   │  │ L3 PCIe     │  Detectors  │
│  │ (30s)       │  │ (5min)      │  │ (24h)       │             │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘             │
│         │                │                │                     │
│         └────────────────┼────────────────┘                     │
│                          ▼                                      │
│              ┌───────────────────────┐                          │
│              │   Health State Machine │                         │
│              │  HEALTHY → SUSPECTED  │                          │
│              │  → UNHEALTHY → ISOLATED│                         │
│              └───────────┬───────────┘                          │
│                          │                                      │
│         ┌────────────────┼────────────────┐                     │
│         ▼                ▼                ▼                     │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │   Cordon    │  │    Taint    │  │    Alert    │  Actions    │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
└─────────────────────────────────────────────────────────────────┘

Quick Start

Prerequisites

Kubernetes cluster 1.25+
NVIDIA GPU nodes with drivers installed
kubectl configured to access your cluster

Install with Helm (Recommended)

# Install from local chart
helm install gdnd ./release/rust/gdnd/chart \
  --namespace kube-system \
  --set config.dryRun=true  # Start in dry-run mode for safety

# After verifying logs, disable dry-run
helm upgrade gdnd ./release/rust/gdnd/chart \
  --namespace kube-system \
  --set config.dryRun=false

Install with kubectl

cd release/rust/gdnd/deploy

# Apply RBAC
kubectl apply -f rbac.yaml

# Apply ConfigMap
kubectl apply -f configmap.yaml

# Deploy DaemonSet
kubectl apply -f daemonset.yaml

Verify Installation

# Check DaemonSet status
kubectl get daemonset gdnd -n kube-system

# View logs
kubectl logs -l app.kubernetes.io/name=gdnd -n kube-system -f

# Check metrics
kubectl port-forward -n kube-system daemonset/gdnd 9100:9100
curl http://localhost:9100/metrics | grep gdnd_gpu

Configuration

Key Configuration Options

Parameter	Description	Default
`device_type`	Device type: `auto`, `nvidia`, `ascend`	`auto`
`l1_interval`	L1 passive detection interval	`30s`
`l2_interval`	L2 active detection interval	`5m`
`health.failure_threshold`	Consecutive failures before UNHEALTHY	`3`
`health.fatal_xids`	Fatal XID codes for immediate isolation	`[31, 43, 48, 79]`
`health.temperature_threshold`	Temperature threshold (Celsius)	`85`
`isolation.cordon`	Whether to cordon unhealthy nodes	`true`
`isolation.evict_pods`	Whether to evict pods	`false`
`isolation.taint_key`	Taint key	`nvidia.com/gpu-health`
`isolation.taint_effect`	Taint effect	`NoSchedule`
`dry_run`	Log actions without executing	`false`

Example config.yaml

device_type: auto
l1_interval: 30s
l2_interval: 5m

health:
  failure_threshold: 3
  fatal_xids: [31, 43, 48, 79]
  temperature_threshold: 85
  active_check_timeout: 5s

isolation:
  cordon: true
  evict_pods: false
  taint_key: nvidia.com/gpu-health
  taint_value: failed
  taint_effect: NoSchedule

metrics:
  enabled: true
  port: 9100

dry_run: false

Fatal XID Error Codes

These XID errors trigger immediate GPU isolation:

XID	Description
31	GPU memory page fault / MMU fault
43	GPU stopped processing
48	Double Bit ECC Error
79	GPU has fallen off the bus

Prometheus Metrics

Metric	Type	Labels	Description
`gdnd_gpu_status`	Gauge	gpu, uuid, name	Health status (0=healthy, 1=suspected, 2=unhealthy, 3=isolated)
`gdnd_gpu_temperature_celsius`	Gauge	gpu	GPU temperature
`gdnd_gpu_utilization_percent`	Gauge	gpu	GPU utilization
`gdnd_gpu_memory_used_bytes`	Gauge	gpu	GPU memory used
`gdnd_check_duration_seconds`	Histogram	level, gpu	Detection check duration
`gdnd_check_failures_total`	Counter	level, gpu, reason	Total detection failures
`gdnd_isolation_actions_total`	Counter	action	Total isolation actions
`gdnd_gpu_count`	Gauge	-	Number of GPUs detected

Development

Rust 1.75+
CUDA Toolkit 12.2+ (for gpu-check binary)

Build from Source

cd src/rust/gdnd

# Check compilation
cargo check

# Run tests
cargo test

# Build release binary
cargo build --release

# Run locally (dry-run mode)
cargo run -- --config configs/config.yaml --node-name test-node --dry-run

Build Docker Image

cd release/rust/gdnd

# Build release binaries
./build.sh

# Build Docker image
./build.sh --docker

Project Structure

src/rust/gdnd/
├── gdnd/                    # Main binary
│   └── src/
│       ├── main.rs          # Entry point
│       ├── config.rs        # Configuration
│       └── cli.rs           # CLI arguments
├── gdnd-core/               # Core detection logic
│   └── src/
│       ├── device/          # Device abstraction
│       │   ├── interface.rs # DeviceInterface trait
│       │   ├── nvidia.rs    # NVIDIA implementation
│       │   └── mock.rs      # Mock for testing
│       ├── detection/       # Detectors
│       │   ├── l1_passive.rs
│       │   └── l2_active.rs
│       ├── state_machine.rs # Health state machine
│       ├── scheduler.rs     # Detection scheduler
│       └── metrics.rs       # Prometheus metrics
├── gdnd-k8s/                # Kubernetes integration
│   └── src/
│       ├── client.rs        # K8s client
│       └── node_ops.rs      # Node operations
└── gpu-check/               # CUDA micro-benchmark
    └── gpu_check.cu         # 128x128 matrix multiply

release/rust/gdnd/
├── build.sh                 # Build script
├── chart/                   # Helm chart
├── configs/                 # Production configs
└── deploy/                  # K8s manifests

Comparison with Alternatives

Feature	GDND	Node Problem Detector	DIY Scripts
GPU-specific detection	✅ XID, ECC, driver deadlock	❌ Generic	Varies
Active health check	✅ CUDA matrix mul	❌	Varies
Automatic isolation	✅ Cordon + Taint	⚠️ Manual rules	⚠️
Image size	< 50MB	~100MB	Varies
Configuration	Simple YAML	Complex	Custom
Prometheus metrics	✅ Built-in	✅	Manual

Roadmap

Completed (v1.0)

Core Rust implementation with NVIDIA GPU support
L1 Passive Detection (NVML/npu-smi, XID scanning, zombie process detection)
L2 Active Detection (CUDA/AscendCL micro-benchmark)
Health State Machine (HEALTHY → SUSPECTED → UNHEALTHY → ISOLATED)
Kubernetes integration (Cordon/Taint/Evict)
Prometheus metrics
Helm Chart deployment
Huawei Ascend NPU full support (2026-01-21)

In Progress

L3 PCIe bandwidth test (framework ready, 80% complete)
Real hardware integration testing

Planned

ECC error detection enhancement
Grafana dashboard templates
AlertManager integration
Node auto-recovery (GPU reset, ISOLATED → HEALTHY)
Multi-GPU per-device isolation

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Run tests (cargo test)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

nvml-wrapper - Rust bindings for NVIDIA NVML
kube-rs - Kubernetes client for Rust
prometheus-rs - Prometheus client for Rust

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
docs		docs
release/rust/gdnd		release/rust/gdnd
src/rust/gdnd		src/rust/gdnd
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
README_CN.md		README_CN.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GDND - GPU Dead Node Detector

Features

Architecture

Quick Start

Prerequisites

Install with Helm (Recommended)

Install with kubectl

Verify Installation

Configuration

Key Configuration Options

Example config.yaml

Fatal XID Error Codes

Prometheus Metrics

Development

Build from Source

Build Docker Image

Project Structure

Comparison with Alternatives

Roadmap

Completed (v1.0)

In Progress

Planned

Contributing

License

Acknowledgments

About

Uh oh!

Releases

Packages

Languages

License

iannil/gpu-dead-node-detector

Folders and files

Latest commit

History

Repository files navigation

GDND - GPU Dead Node Detector

Features

Architecture

Quick Start

Prerequisites

Install with Helm (Recommended)

Install with kubectl

Verify Installation

Configuration

Key Configuration Options

Example config.yaml

Fatal XID Error Codes

Prometheus Metrics

Development

Build from Source

Build Docker Image

Project Structure

Comparison with Alternatives

Roadmap

Completed (v1.0)

In Progress

Planned

Contributing

License

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages