Integration with DCP by LucasLLC · Pull Request #978 · pytorch/PiPPy

LucasLLC · 2024-03-18T22:24:55Z

Description

Please read our CONTRIBUTING.md prior to creating your first pull request.

Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Testing out some checkpointing code.

PR description is WIP

Fixes #(issue)

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
[x] New feature (non-breaking change which adds functionality)
This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

Test A
Logs for Test A
Test B
Logs for Test B

Checklist:

Have you added tests that prove your fix is effective or that this feature works?
Has code been commented, particularly in hard-to-understand areas?
Have you made corresponding changes to the documentation?

LucasLLC · 2024-03-18T22:25:44Z

test/test_transformer.py

    return layers


+def pipe_to_sd(pipe):


@wz337, might be interesting in dist state dict

kwen2501 · 2024-03-19T17:26:04Z

Thanks for making it work!
Quick comment:
Do you mind creating a dedicated example for DCP + PP? You can copy the model out (we plan to build a "model hub" for tests, so that would solve the duplicated code problem).

kwen2501 · 2024-03-19T17:28:39Z

test/test_transformer.py

+
+with tempfile.TemporaryDirectory() as tmpdir:
+    #Simulate saving the pipe
+    # Option 1:


I think Option 1 is more likely to be used than Option 2 in a realistic setting. Could you please uncomment this block of code?

kwen2501 · 2024-03-19T17:36:46Z

test/test_transformer.py

+    #     print(f"Saving pipeline stage {stage_idx}")
+    #     stage_mod = pipe.get_stage_module(stage_idx)
+    #     dcp.save(
+    #         {f"stage_{stage_idx}": stage_mod},


Curious: is the dict required by the DCP API? Can a user save stage_mod directly?

why does this matter? i think the DCP api had reasons for interfacing with dict instead of model, adding a new variant that takes model and gets its dict should be possible, but i think it's clearer this way that the only part of the model that gets saved is the dict

Just to be clear: I like saving the state dict too (instead of the module). That's more composable to me.
My question above is: is {f"stage_{stage_idx}": stage_mod} necessary?

wconstab · 2024-03-21T20:44:22Z

test/test_transformer.py


+def pipe_to_sd(pipe):
+    sd = {}
+    for stage_idx in range(pipe.num_stages):


something a little fishy about this proposal (equally so for both option 1 and 2) is that it's not likely you'd want to iterate all the stages in the pipe and load/save them.

Example 1: simple pipeline with 4 gpus
rank0: save/load pipe.submod_0 only
...
Example 2: complex pipeline with 4 gpus, 2 stages per gpu
rank0: save/load pipe.submod_0 and pipe.submod_4
rank1: save/load pipe.submod_1 and pipe.submod_5
...
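The per-rank ownership in the two examples above can be sketched as a small helper. This is only an illustration: `stages_for_rank` is a hypothetical name, and the round-robin layout (rank r owns stages r, r + world_size, ...) is inferred from Example 2 rather than taken from the PiPPy API.

```python
from typing import List


def stages_for_rank(rank: int, world_size: int, stages_per_rank: int = 1) -> List[int]:
    """Return the stage indices one rank would save/load (hypothetical helper).

    Assumes round-robin placement: stage i lives on rank i % world_size.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world_size {world_size}")
    return [rank + k * world_size for k in range(stages_per_rank)]


# Example 1: 4 GPUs, one stage per GPU.
print(stages_for_rank(0, world_size=4))
# Example 2: 4 GPUs, two stages per GPU.
print(stages_for_rank(1, world_size=4, stages_per_rank=2))
```

Each rank would then call save/load only for the indices it gets back, instead of looping over `range(pipe.num_stages)`.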

wconstab · 2024-03-21T20:47:45Z

test/test_transformer.py

+    sd = {}
+    for stage_idx in range(pipe.num_stages):
+        stage_mod = pipe.get_stage_module(stage_idx)
+        sd[f"stage_{stage_idx}"] = stage_mod


not really clear to me why we need to add a prefix at all.

orig model
-----------
Transformer
    embedding
    layers
        0
        1

split model
-----------
submod0
    embedding
    layers
        0
submod1
    layers
        1

There should be no duplication of fqns between submods/stages.

what are we doing about the 'submod_0' part in the fqn? when we do stage_mod = pipe.get_stage_module(stage_idx) does that return us a module that has top level keys like embedding and layers or a module that has a top level key of submod_n?

If the former, can't we just save/load the keys as usual?

If the latter, we can still save/load without a prefix of stage_{idx} i think, but we'll sadly be unable to load the checkpoint into a non-PP model later on if we want to.

Former. @wconstab
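Given that answer, the unprefixed per-stage state dicts should in principle merge back into one dict whose keys match the non-PP model. A minimal sketch of that check, using plain dicts as stand-ins for `state_dict()` outputs; `merge_stage_state_dicts` and the key names are hypothetical and not part of PiPPy or DCP:

```python
from typing import Any, Dict, Iterable


def merge_stage_state_dicts(stage_sds: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-stage state dicts, failing loudly if an FQN appears twice."""
    merged: Dict[str, Any] = {}
    for stage_idx, sd in enumerate(stage_sds):
        duplicates = merged.keys() & sd.keys()
        if duplicates:
            raise KeyError(f"stage {stage_idx} repeats FQNs: {sorted(duplicates)}")
        merged.update(sd)
    return merged


# Stand-ins for the two stages of the split model (values would be tensors).
stage0 = {"embedding.weight": "E", "layers.0.weight": "W0"}
stage1 = {"layers.1.weight": "W1"}

full = merge_stage_state_dicts([stage0, stage1])
print(sorted(full))
```

If the no-duplication assumption holds, the merged keys carry no `stage_{idx}` prefix, so nothing needs to be stripped before loading into the original model.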

kwen2501 · 2024-03-27T02:08:44Z

What's our plan for this PR? @LucasLLC I think we are pretty close to the destination.
Would the following next steps be reasonable?

Move the example to examples/checkpoint, and name it pippy_dcp.py.
Focus on Option 1 (per-stage saving), and clean up the UI. (See comments)
Make the example runnable in a multi-process setting. Today it saves the stages in a for loop; it would be nice if multiple ranks could do their saving simultaneously.

kwen2501 · 2024-03-27T02:13:02Z

For code quality checks, please run:

./format.sh
./check.sh

testing dcp code

2e745d3

LucasLLC requested a review from kwen2501 March 18, 2024 22:25

facebook-github-bot added the cla signed label Mar 18, 2024

LucasLLC requested review from H-Huang and wz337 March 18, 2024 22:25

LucasLLC wants to merge 1 commit into unflatten from dcp_testing


4 participants
