Description
Hi @g-braeunlich,
Following Soheyl's suggestion, I’m assigning this issue to you.
I am encountering an issue where training runs on Euler are not deterministic, despite using a fixed seed (--seed 1). As shown in the W&B report, two identical runs on the same node yield different loss curves and different generated designs.
#!/bin/bash
#SBATCH --job-name=cgan_cnn_2d_beams2d
#SBATCH --time=00:45:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=7GB
#SBATCH --gpus=rtx_4090:1
#SBATCH --output=engiopt_cgan_cnn_2d_beams2d_%j.out
#SBATCH --error=engiopt_cgan_cnn_2d_beams2d_%j.err
# Mail notifications disabled (SKIP_SLURM_EMAIL)
mkdir -p "$SCRATCH/logs" "$SCRATCH/datasets" "$SCRATCH/models"
module purge
module load stack/2024-06 gcc/12.2.0 python_cuda/3.11.6 cuda/12.8.0 eth_proxy
source ~/venvs/engineer_assistant/bin/activate
export WANDB_API_KEY="...."
export WANDB_ENTITY="gioelemo-ethz"
export WANDB_PROJECT="engiopt"
export HF_HOME="$SCRATCH/models"
export HF_DATASETS_CACHE="$SCRATCH/datasets"
export HF_TOKEN="..."
cd $HOME/EngiOpt
python engiopt/cgan_cnn_2d/cgan_cnn_2d.py --problem-id "beams2d" --track --save-model --n-epochs 100 --seed 1
echo "Training complete!"
Both training runs were executed on the same Euler node.
The generated designs also look different.
Probably not everything is seeded correctly. See the PyTorch reproducibility notes (https://docs.pytorch.org/docs/stable/notes/randomness.html#reproducibility) and the documentation for torch.use_deterministic_algorithms (https://docs.pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch-use-deterministic-algorithms).
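For reference, here is a rough sketch of the kind of setup those pages describe. It is only an assumption about what might be needed, not what the engiopt trainer currently does: `seed_everything` is a hypothetical helper name, and the `CUBLAS_WORKSPACE_CONFIG` value is one of the options listed in the PyTorch notes. NumPy and PyTorch are imported lazily so the helper also runs where they are not installed.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed the common RNGs and request deterministic kernels."""
    # cuBLAS needs this for deterministic results on CUDA >= 10.2.
    # It must be set before the first CUDA call; setdefault keeps a
    # value that was already exported in the job script.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

    # Python's built-in RNG.
    random.seed(seed)

    # NumPy's global RNG, if NumPy is available.
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    # PyTorch, if available.
    try:
        import torch
    except ImportError:
        return

    # Seeds the default generators on CPU and on all CUDA devices.
    torch.manual_seed(seed)
    # Disable cuDNN autotuning, which may pick different kernels per run.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Raise an error whenever an op without a deterministic implementation is used.
    torch.use_deterministic_algorithms(True)
```

Calling `seed_everything(args.seed)` once at the very start of the training script (here `args.seed` is assumed to hold the `--seed` value) would be the intended usage. If a `DataLoader` with worker processes is involved, the reproducibility notes additionally recommend passing a seeded `torch.Generator` and a `worker_init_fn`, which this sketch does not cover.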
Could you help me investigate if there are specific settings in the engiopt trainer we should adjust to ensure bit-wise reproducibility?
Thanks!
Gioele