
Update DDP example #1364

Merged: soumith merged 4 commits into pytorch:main from jafraustro:jafraust/ddp on Jul 14, 2025
jafraustro commented Jul 8, 2025

Update the DDP example to use the accelerator API and switch to torchrun for distributed launches.

CC: @dvrogozh, @msaroufim
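For readers unfamiliar with the accelerator API mentioned in the description, here is a minimal sketch of device selection with `torch.accelerator`. It is not the code from this PR: the function name `select_device` and the fallback to `"cpu"` when PyTorch or an accelerator is missing are assumptions made for illustration.

```python
def select_device(local_rank: int) -> str:
    """Return a device string such as "cuda:0", or "cpu" as a fallback.

    Hypothetical helper, not taken from the PR.
    """
    try:
        import torch
    except ImportError:
        # Assumption: treat a missing PyTorch install as CPU-only.
        return "cpu"

    # torch.accelerator only exists in recent PyTorch releases.
    accel = getattr(torch, "accelerator", None)
    if accel is None or not accel.is_available():
        return "cpu"

    # current_accelerator() returns a torch.device; .type is e.g. "cuda" or "xpu".
    device_type = accel.current_accelerator().type
    # Bind this process to the accelerator that matches its local rank.
    accel.set_device_index(local_rank)
    return f"{device_type}:{local_rank}"
```

Because the device type comes from `current_accelerator()`, the same script can run on different accelerator backends without hard-coding `"cuda"`.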

jafraustro marked this pull request as ready for review July 8, 2025 15:06

dvrogozh left a comment


LGTM, @jafraustro: CC reviewers in the PR description.


soumith commented Jul 10, 2025

the CI is failing for Distributed examples because something can't find numpy

jafraustro (Contributor, Author) commented

> the CI is failing for Distributed examples because something can't find numpy

Hi, I changed the torch version in the requirements.txt file.

× No solution found when resolving dependencies:
╰─▶ Because only torch<=2.7.1 is available and you require torch>=2.8

- Replace deprecated launch utility with torchrun (see PyTorch docs: https://pytorch.org/docs/stable/distributed.html#launch-utility)
- Update README to reflect torchrun usage
- Remove main.py (no longer referenced in documentation)
- Update CI to test example.py script instead

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
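As a rough illustration of what the switch to torchrun means for a script: torchrun exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every worker process, so the script can read them from the environment instead of parsing a launcher-supplied argument. The sketch below is not the PR's code; `LaunchInfo` and `read_launch_info` are hypothetical names, and the single-process defaults are an assumption for running without torchrun.

```python
import os
from typing import Mapping, NamedTuple


class LaunchInfo(NamedTuple):
    """Per-process layout as exported by torchrun (hypothetical container)."""
    rank: int        # global rank across all nodes
    local_rank: int  # rank within the current node
    world_size: int  # total number of processes


def read_launch_info(env: Mapping[str, str] = os.environ) -> LaunchInfo:
    """Parse the variables torchrun sets for each worker.

    Assumption: when a variable is absent, fall back to a
    single-process layout (rank 0, world size 1).
    """
    return LaunchInfo(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )
```

A launch on one node with two workers would then look like `torchrun --nnodes=1 --nproc_per_node=2 example.py`, with each worker calling `read_launch_info()` at startup.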
soumith commented Jul 11, 2025

it's failing now with some new errors

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
jafraustro (Contributor, Author) commented

> it's failing now with some new errors

Hello @soumith,

The errors occurred because there were not enough GPUs available. To address this, I added a minimum GPU verification step, similar to the approach used in the tensor_parallel_example.py example. This ensures the script only runs when the required number of GPUs is present.
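A minimum-device check of the kind described above might look roughly like the following. This is a sketch under assumptions, not the merged code: the function names are hypothetical, the default of two required devices is assumed, and a missing PyTorch install or a missing `torch.accelerator` module is treated as zero devices.

```python
def count_accelerators() -> int:
    """Return the number of visible accelerators, or 0 if none can be queried."""
    try:
        import torch
    except ImportError:
        # Assumption: no PyTorch means no usable devices.
        return 0

    # torch.accelerator only exists in recent PyTorch releases.
    accel = getattr(torch, "accelerator", None)
    if accel is None or not accel.is_available():
        return 0
    return accel.device_count()


def has_min_devices(available: int, required: int = 2) -> bool:
    """True when at least `required` devices are present.

    The default of 2 is an assumption for illustration.
    """
    return available >= required
```

A script could call `has_min_devices(count_accelerators())` before initializing the process group and exit early with a message when it returns `False`, so that CI runners with too few GPUs skip the example instead of failing.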

soumith merged commit f84bcb3 into pytorch:main on Jul 14, 2025
8 checks passed

soumith commented Jul 14, 2025

thank you!
