Update DDP example#14
Conversation
dvrogozh
left a comment
There was a problem hiding this comment.
For more overall comments:
- What is
main.pyscript? It's not documented in readme as it seems. - The test script run_distributed_examples.sh was not adjusted. That's suspicious - does it still work?
- Test script is missing testing of
example.py
In general, I think that breaking the change into Accelerate API and distributed PRs does not make sense. For distributed these are associated changes. So I would do these in a single PR.
What makes sense for this sample is to break changes into 2:
- Switch to
torchrunremaining with CUDA - Updating to use new APIs and allow more backends with that
Our issue is that we don't have CUDA environment to fully test this on. So for us combined change is better as we can at least give it a try thru XPU.
Also, did you consider to start with https://github.com/pytorch/examples/tree/main/distributed/tensor_parallelism as it's already switched to torchrun?
| print(f"[{os.getpid()}] Initializing process group with: {env_dict}") | ||
| dist.init_process_group(backend="nccl") | ||
| acc = torch.accelerator.current_accelerator() | ||
| vendor_backend = torch.distributed.get_default_backend_for_device(acc) |
There was a problem hiding this comment.
By the way, I don't see documentation on torch.distributed.get_default_backend_for_device at pytorch.org. Is it really missing or I just miss it?
There was a problem hiding this comment.
No, I think it's really missing. I don't see it here: https://docs.pytorch.org/docs/stable/distributed.html and anywhere in the search on pytorch.org documentation.
There was a problem hiding this comment.
Filed:
Let's see if this will get merged... If not, we might not be able to use this API.
There was a problem hiding this comment.
It is still open, but I think it will be eventually merged
dvrogozh
left a comment
There was a problem hiding this comment.
Couple minor comments, looks good actually.
Note that you might need to fight thru CI unless pytorch#1354 will get merged first. The pin point was the Python version (distributed tests use 3.8 at the moment and blow up on later pytorch versions). Update to 3.10 if observed.
| @@ -0,0 +1,10 @@ | |||
| # To run sample: | |||
There was a problem hiding this comment.
Don't forget to chmod a+x run_example.sh to make it executable.
| # To run sample: | |
| #!/bin/bash | |
| # | |
| # To run sample: |
| # The main entry point is called directly without using subprocess | ||
| spmd_main(args.local_world_size, args.local_rank) | ||
| main() | ||
|
No newline at end of file |
There was a problem hiding this comment.
add an empty line/drop tailing spaces
| Multiple Data_ or SPMD since the same application runs on all | ||
| application but each one operates on different portions of the | ||
| training dataset. | ||
| ## Table of Contents |
There was a problem hiding this comment.
ToC is arguably needed as the document is not that big + github actually does have a (hidden) ToC for any md document.
- Replace deprecated launch utility with torchrun (see PyTorch docs: https://pytorch.org/docs/stable/distributed.html#launch-utility) - Update README to reflect torchrun usage - Remove main.py (no longer referenced in documentation) - Update CI to test example.py script instead Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
|
I created the PR pytorch#1364 Thanks for your help and feedback |
Update DDP to use the accelerator API