Skip to content

Update DDP example#14

Closed
jafraustro wants to merge 2 commits into
dvrogozh:mainfrom
jafraustro:jafraust/ddp
Closed

Update DDP example#14
jafraustro wants to merge 2 commits into
dvrogozh:mainfrom
jafraustro:jafraust/ddp

Conversation

@jafraustro
Copy link
Copy Markdown

@jafraustro jafraustro commented Jun 23, 2025

Update DDP to use the accelerator API

  • Switch to torchrun for distributed launches
  • Replace deprecated launch utility with torchrun (see PyTorch docs: https://pytorch.org/docs/stable/distributed.html#launch-utility)
  • Update README to reflect torchrun usage
  • Remove main.py (no longer referenced in documentation)
  • Update CI to test example.py script instead

Copy link
Copy Markdown
Owner

@dvrogozh dvrogozh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For more overall comments:

  1. What is main.py script? It's not documented in readme as it seems.
  2. The test script run_distributed_examples.sh was not adjusted. That's suspicious - does it still work?
  3. Test script is missing testing of example.py

In general, I think that breaking the change into Accelerate API and distributed PRs does not make sense. For distributed these are associated changes. So I would do these in a single PR.

What makes sense for this sample is to break changes into 2:

  1. Switch to torchrun remaining with CUDA
  2. Updating to use new APIs and allow more backends with that

Our issue is that we don't have CUDA environment to fully test this on. So for us combined change is better as we can at least give it a try thru XPU.

Also, did you consider to start with https://github.com/pytorch/examples/tree/main/distributed/tensor_parallelism as it's already switched to torchrun?

Comment thread distributed/ddp/README.md Outdated
Comment thread distributed/ddp/README.md Outdated
Comment thread distributed/ddp/README.md Outdated
Comment thread distributed/ddp/README.md Outdated
Comment thread distributed/ddp/README.md Outdated
Comment thread distributed/ddp/README.md Outdated
Comment thread distributed/ddp/README.md Outdated
Comment thread distributed/ddp/example.py Outdated
print(f"[{os.getpid()}] Initializing process group with: {env_dict}")
dist.init_process_group(backend="nccl")
acc = torch.accelerator.current_accelerator()
vendor_backend = torch.distributed.get_default_backend_for_device(acc)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I don't see documentation on torch.distributed.get_default_backend_for_device at pytorch.org. Is it really missing or I just miss it?

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think it's really missing. I don't see it here: https://docs.pytorch.org/docs/stable/distributed.html and anywhere in the search on pytorch.org documentation.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed:

Let's see if this will get merged... If not, we might not be able to use this API.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still open, but I think it will be eventually merged

@jafraustro
Copy link
Copy Markdown
Author

Hi @dvrogozh

I have implemented the recommended changes and submitted the pull request for the tensor_parallelism example #15 .

Copy link
Copy Markdown
Owner

@dvrogozh dvrogozh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple minor comments, looks good actually.

Note that you might need to fight thru CI unless pytorch#1354 will get merged first. The pin point was the Python version (distributed tests use 3.8 at the moment and blow up on later pytorch versions). Update to 3.10 if observed.

Comment thread distributed/ddp/run_example.sh Outdated
@@ -0,0 +1,10 @@
# To run sample:
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to chmod a+x run_example.sh to make it executable.

Suggested change
# To run sample:
#!/bin/bash
#
# To run sample:

Comment thread distributed/ddp/example.py Outdated
# The main entry point is called directly without using subprocess
spmd_main(args.local_world_size, args.local_rank)
main()

No newline at end of file
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add an empty line/drop tailing spaces

Comment thread distributed/ddp/README.md Outdated
Multiple Data_ or SPMD since the same application runs on all
application but each one operates on different portions of the
training dataset.
## Table of Contents
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToC is arguably needed as the document is not that big + github actually does have a (hidden) ToC for any md document.

- Replace deprecated launch utility with torchrun (see PyTorch docs: https://pytorch.org/docs/stable/distributed.html#launch-utility)
- Update README to reflect torchrun usage
- Remove main.py (no longer referenced in documentation)
- Update CI to test example.py script instead

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
@jafraustro
Copy link
Copy Markdown
Author

I created the PR pytorch#1364

Thanks for your help and feedback

@jafraustro jafraustro closed this Jul 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants