Update DDP example by jafraustro · Pull Request #14 · dvrogozh/examples

jafraustro · 2025-06-23T19:21:16Z

Update DDP to use the accelerator API

Switch to torchrun for distributed launches:

- Replace deprecated launch utility with torchrun (see PyTorch docs: https://pytorch.org/docs/stable/distributed.html#launch-utility)
- Update README to reflect torchrun usage
- Remove main.py (no longer referenced in documentation)
- Update CI to test example.py script instead
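
For context on what the switch means inside the script: torchrun hands rank information to each worker through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`) rather than through a `--local_rank` argument. A minimal sketch of reading them follows; the helper name `read_torchrun_env` and the single-process defaults are illustrative assumptions, not code from this PR.

```python
import os


def read_torchrun_env(environ=None):
    """Collect the rank settings that torchrun exports to each worker.

    `environ` is injectable for testing. The defaults (an assumption of this
    sketch) describe a single-process run, so the script also works when it
    is started without torchrun.
    """
    env = os.environ if environ is None else environ
    return {
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
    }


if __name__ == "__main__":
    # Launched e.g. as: torchrun --nproc_per_node=2 example.py
    # (the process count of 2 is an arbitrary illustration)
    print(read_torchrun_env())
```

Reading the values in one place keeps the rest of the example independent of how the process was launched.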

dvrogozh

Some more overall comments:

- What is the main.py script? It does not seem to be documented in the README.
- The test script run_distributed_examples.sh was not adjusted. That's suspicious: does it still work?
- The test script is missing a test for example.py.

In general, I think that splitting the change into separate accelerator API and distributed PRs does not make sense. For distributed, these are associated changes, so I would do them in a single PR.

What would make sense for this sample is to break the changes into two steps:

- Switch to torchrun while remaining on CUDA
- Update to the new APIs and allow more backends with that

Our issue is that we don't have a CUDA environment to fully test this on. So for us a combined change is better, as we can at least give it a try through XPU.

Also, did you consider starting with https://github.com/pytorch/examples/tree/main/distributed/tensor_parallelism, as it has already been switched to torchrun?

dvrogozh · 2025-06-25T00:06:10Z

        print(f"[{os.getpid()}] Initializing process group with: {env_dict}")  
-        dist.init_process_group(backend="nccl")
+        acc = torch.accelerator.current_accelerator()
+        vendor_backend = torch.distributed.get_default_backend_for_device(acc)


By the way, I don't see documentation on torch.distributed.get_default_backend_for_device at pytorch.org. Is it really missing, or am I just missing it?

No, I think it's really missing. I don't see it at https://docs.pytorch.org/docs/stable/distributed.html or anywhere in the pytorch.org documentation search.

Filed:

docs: add get_default_backend_for_device to distributed documentation pytorch/pytorch#156783

Let's see if this gets merged. If not, we might not be able to use this API.

It is still open, but I think it will eventually be merged.
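
The backend selection from the diff above could be sketched as follows, with a guard for PyTorch builds where `torch.accelerator` or `get_default_backend_for_device` is unavailable. The helper names `current_device_type` and `default_backend`, and the fallback table, are assumptions made for illustration; they are not part of the PR.

```python
# Fallback table (an assumption for this sketch), used only when the
# installed PyTorch does not provide get_default_backend_for_device.
FALLBACK_BACKENDS = {"cpu": "gloo", "cuda": "nccl", "xpu": "xccl"}


def current_device_type():
    """Return the accelerator type PyTorch reports, or "cpu" if it reports none."""
    try:
        import torch
    except ImportError:
        return "cpu"
    accelerator = getattr(torch, "accelerator", None)
    acc = accelerator.current_accelerator() if accelerator is not None else None
    return acc.type if acc is not None else "cpu"


def default_backend(device_type):
    """Map a device type string to a torch.distributed backend name."""
    try:
        import torch.distributed as dist
        helper = getattr(dist, "get_default_backend_for_device", None)
    except ImportError:
        helper = None
    if helper is not None:
        # Preferred path: let PyTorch choose the backend for the device.
        return helper(device_type)
    return FALLBACK_BACKENDS.get(device_type, "gloo")
```

The result would then be passed on, for example as `dist.init_process_group(backend=default_backend(current_device_type()))`, so the script no longer hard-codes "nccl".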

jafraustro · 2025-07-07T20:55:00Z

Hi @dvrogozh

I have implemented the recommended changes and submitted a pull request for the tensor_parallelism example: #15.

dvrogozh

A couple of minor comments; otherwise this looks good.

Note that you might need to fight through CI unless pytorch#1354 gets merged first. The pain point is the Python version (the distributed tests use 3.8 at the moment and blow up on later PyTorch versions). Update to 3.10 if you observe this.

dvrogozh · 2025-07-07T21:54:39Z

@@ -0,0 +1,10 @@
+# To run sample:


Don't forget to chmod a+x run_example.sh to make it executable.

Suggested change

-# To run sample:
+#!/bin/bash
+#
+# To run sample:

dvrogozh · 2025-07-07T21:56:52Z

-    # The main entry point is called directly without using subprocess
-    spmd_main(args.local_world_size, args.local_rank)
+    main()
+


Add an empty line and drop trailing spaces.

dvrogozh · 2025-07-07T21:58:16Z

-Multiple Data_ or SPMD since the same application runs on all
-application but each one operates on different portions of the
-training dataset.
+## Table of Contents


A ToC is arguably not needed, as the document is not that big and GitHub already provides a (hidden) ToC for any md document.

- Replace deprecated launch utility with torchrun (see PyTorch docs: https://pytorch.org/docs/stable/distributed.html#launch-utility)
- Update README to reflect torchrun usage
- Remove main.py (no longer referenced in documentation)
- Update CI to test example.py script instead

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>

jafraustro · 2025-07-08T15:05:17Z

I created the PR pytorch#1364

Thanks for your help and feedback

dvrogozh force-pushed the main branch from 35c0da0 to 6f61614 Compare June 24, 2025 18:54

dvrogozh reviewed Jun 24, 2025

dvrogozh reviewed Jun 25, 2025

jafraustro force-pushed the jafraust/ddp branch from b8abf3d to afe703a Compare July 3, 2025 21:41

dvrogozh reviewed Jul 7, 2025

jafraustro added 2 commits July 8, 2025 07:53

Refactor DDP example to use Accelerator API

4557c8c

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>

jafraustro force-pushed the jafraust/ddp branch from afe703a to 4557c8c Compare July 8, 2025 14:55

jafraustro closed this Jul 8, 2025
