Modified aishell/ASR/conformer_ctc/train.py, which implemented multi-machine DDP. #1845
czl66 wants to merge 3 commits into
Merged from latest repo.
Could you describe how to run it for multi-node multi-GPU training?
Yes, here is the code for the main bash file:

    # Positional arguments: node rank, total number of nodes, GPU ids for this node.
    node_rank=$1
    WORLD_SIZE=$2
    export CUDA_VISIBLE_DEVICES=$3
    echo "WORKER INFO:: node_rank=$node_rank, WORLD_SIZE=$WORLD_SIZE, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    # Count the comma-separated GPU ids assigned to this node.
    gpu_num=$(echo "$CUDA_VISIBLE_DEVICES" | awk -F "," '{print NF}')
    DISTRIBUTED_ARGS="
    --nnodes ${WORLD_SIZE:-1} \
    --nproc_per_node $gpu_num \
    --node_rank ${node_rank:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-26669}
    "
    # $DISTRIBUTED_ARGS is left unquoted on purpose so that it splits into separate arguments.
    torchrun $DISTRIBUTED_ARGS ./conformer_ctc/train.py --world-size $gpu_num --max-duration 200 --num-epochs 100

You should also write another script to start the training; it assigns the node rank, the WORLD_SIZE, and the GPUs for each node.
For example, suppose you have 4 machines and each machine has 8 GPUs. If each node is assigned one GPU, there are 32 nodes in total, and you should pass $1=0,1,2,3,...,31, $2=32, and $3='0', '1', '2', ..., '7' one by one. If each node is assigned 2 GPUs, there are 16 nodes in total, and you should pass $1=0,1,2,3,...,15, $2=16, and $3='0,1', '2,3', '4,5', '6,7' respectively.
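The argument assignment described above can be sketched as a small bash helper. This is only an illustration, not part of the PR: the function name `plan_nodes` and its three parameters are hypothetical, "node" is used in the same sense as in this thread (one launch of the main script), and the sketch assumes the number of GPUs per machine is divisible by the number of GPUs per node. It prints one line per node with the three values to pass as `$1`, `$2`, and `$3`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print "<node_rank> <WORLD_SIZE> <CUDA_VISIBLE_DEVICES>"
# for every node, given the machine count, GPUs per machine, and GPUs per node.
# Assumes gpus_per_machine is divisible by gpus_per_node.
plan_nodes() {
  local machines=$1 gpus_per_machine=$2 gpus_per_node=$3
  local nodes_per_machine=$(( gpus_per_machine / gpus_per_node ))
  local world_size=$(( machines * nodes_per_machine ))
  local m n g first devices
  for (( m = 0; m < machines; m++ )); do
    for (( n = 0; n < nodes_per_machine; n++ )); do
      # GPU ids are local to each machine, so they restart at 0 on every machine.
      first=$(( n * gpus_per_node ))
      devices=$first
      for (( g = first + 1; g < first + gpus_per_node; g++ )); do
        devices+=",$g"
      done
      echo "$(( m * nodes_per_machine + n )) $world_size $devices"
    done
  done
}

# 4 machines, 8 GPUs each, 2 GPUs per node: 16 lines, from "0 16 0,1" to "15 16 6,7".
plan_nodes 4 8 2
```

Each printed line corresponds to one invocation of the main script on the machine that owns those GPUs; how the lines are dispatched to the machines (ssh, a job scheduler, and so on) is left open here.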
The single-machine version is also provided:

    export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
    gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
    torchrun --nproc_per_node $gpu_num ./conformer_ctc/train.py --world-size $gpu_num --max-duration 200 --num-epochs 100
…tch-way decoding, faster.
Also, when I used decode.py for ctc_decoding, I found that it was really slow: even after several hours had passed, no recognition result had been generated. So I debugged it and finally found the
There is no need to modify To enable multi-node multi-GPU support, simply modify the train.py file with the following changes: Add |
yeah, you are absolutely right. In addition, I think using barrier() is a must. |
By the way, if you set |
I think there is no need for |
In my work on the aishell conformer_ctc ASR task, I found that the script only supports single-machine multi-GPU training, which is inconvenient for our GPU servers. So I modified train.py; I hope it can be helpful to the icefall community. :)
