Model Training Gets Stuck, GPU Memory Usage High but Utilization at 0% #658

@wjhme

When training an Orion anomaly detection model, the program hangs during the training phase: GPU memory usage is abnormally high while GPU utilization remains at 0%. Environment details are listed below.

Environment Information
Hardware Configuration:

GPU: 2× NVIDIA A40 (42GB memory per card, 84GB total)

System RAM: 232GB

Software Environment:

TensorFlow version: 2.14.1

Orion version: 0.7.1

CUDA version: 11.8
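To report the symptom as text rather than a screenshot, a small script can flag GPUs whose memory is mostly allocated while utilization is 0%. The following is a minimal sketch, not part of Orion: the helper names (`parse_gpu_rows`, `idle_but_full`, `query_gpus`) and the 80% memory threshold are hypothetical choices for illustration, and it assumes `nvidia-smi` is on the `PATH`. Note that TensorFlow by default reserves nearly all memory on every visible GPU, so high memory usage alone may be expected; the 0% utilization is the more telling part.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,memory.used,memory.total,utilization.gpu"


def parse_gpu_rows(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total, util = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
            "util_pct": int(util),
        })
    return rows


def idle_but_full(rows, mem_fraction=0.8):
    """Return indices of GPUs with 0% utilization and memory use at or above
    mem_fraction of the total (0.8 is an arbitrary illustrative threshold)."""
    return [
        r["index"]
        for r in rows
        if r["util_pct"] == 0
        and r["mem_used_mib"] >= mem_fraction * r["mem_total_mib"]
    ]


def query_gpus():
    """Run nvidia-smi once and return the parsed per-GPU rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_rows(result.stdout)
```

Calling `idle_but_full(query_gpus())` while the training job appears stuck returns the indices of the affected GPUs, which can be pasted into the issue alongside the environment details.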
