Update DDP example #1364
dvrogozh
left a comment
LGTM. @jafraustro: please CC reviewers in the PR description.
The CI is failing for the Distributed examples because something can't find numpy.
Hi, I changed the torch version in the requirements.txt file.
× No solution found when resolving dependencies:
- Replace deprecated launch utility with torchrun (see PyTorch docs: https://pytorch.org/docs/stable/distributed.html#launch-utility)
- Update README to reflect torchrun usage
- Remove main.py (no longer referenced in documentation)
- Update CI to test example.py script instead

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
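For readers unfamiliar with the switch described in the commit message: torchrun passes each worker its rank through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`) rather than a `--local-rank` command-line argument. The following is a minimal sketch of reading them, not the code from this PR; the helper name `read_torchrun_env` and the single-process fallback defaults are assumptions made for illustration.

```python
import os


def read_torchrun_env(env=None):
    """Return (rank, local_rank, world_size) from torchrun-style variables.

    `env` is a hypothetical parameter that lets callers pass a plain dict
    instead of os.environ. The defaults describe a single-process run.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))              # global rank across all nodes
    local_rank = int(env.get("LOCAL_RANK", 0))  # rank within this node
    world_size = int(env.get("WORLD_SIZE", 1))  # total number of workers
    return rank, local_rank, world_size


if __name__ == "__main__":
    rank, local_rank, world_size = read_torchrun_env()
    print(f"rank={rank} local_rank={local_rank} world_size={world_size}")
```

Launched with something like `torchrun --nproc_per_node=2 example.py`, each worker would report its own rank; run directly with `python`, the sketch falls back to the single-process defaults.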
It's failing now with some new errors.
Hello @soumith, the errors occurred because there were not enough GPUs available. To address this, I added a minimum GPU verification step, similar to the approach used in the tensor_parallel_example.py example. This ensures the script only runs when the required number of GPUs is present.
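A minimum-GPU check of the kind described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code added in this PR: the names `count_gpus` and `has_min_gpus` are hypothetical, and it queries `torch.cuda.device_count()` for simplicity, whereas the PR itself targets the accelerator API.

```python
from typing import Optional


def count_gpus() -> int:
    """Return the number of visible CUDA devices, or 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def has_min_gpus(required: int, available: Optional[int] = None) -> bool:
    """Return True when at least `required` devices are available.

    `available` is a hypothetical override so the check can be exercised
    without real hardware; when omitted, the device count is queried.
    """
    if available is None:
        available = count_gpus()
    return available >= required
```

A script would call `has_min_gpus(2)` before initializing the process group and exit early with a message if it returns `False`, so CI runners without enough devices skip the example instead of failing.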
Thank you!
Update DDP to use the accelerator API and switch to torchrun for distributed launches
CC: @dvrogozh, @msaroufim