@zfan3-sc
Collaborator

Scope of work done
The current implementation for staggering processes during data loader initialization, which was intended to smooth out memory spikes, does not have the desired effect.

There are two main reasons:

  1. The sleep was not placed immediately before the source of the memory spike (DistSamplingProducer initialization).
  2. A torch.distributed.broadcast_object_list call synchronizes all processes, effectively negating the staggered sleep.
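To illustrate the second point, here is a minimal timing model, not code from this repository. It assumes each rank sleeps for `rank * stagger_s` seconds and, as described above, treats the collective call as a point where no rank proceeds until the last one arrives. The function name and parameters are hypothetical.

```python
from typing import List


def heavy_init_start_times(
    num_ranks: int, stagger_s: float, sync_after_sleep: bool
) -> List[float]:
    """Model when each rank starts its memory-heavy initialization.

    Hypothetical helper for illustration only; it models timing and
    does not call torch.distributed.
    """
    # Each rank wakes from its staggered sleep at rank * stagger_s.
    wake_times = [rank * stagger_s for rank in range(num_ranks)]
    if sync_after_sleep:
        # A synchronizing collective releases every rank only once the
        # slowest rank has arrived, so all ranks start together.
        release = max(wake_times)
        return [release] * num_ranks
    # Without a sync point in between, the stagger is preserved.
    return wake_times


print(heavy_init_start_times(4, 2.0, sync_after_sleep=True))
print(heavy_init_start_times(4, 2.0, sync_after_sleep=False))
```

Under this model, a sync point after the sleep makes every rank begin the heavy initialization at the same moment, so the memory spike is delayed but not spread out.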

In this PR, we move the sleep logic out of init_neighbor_loader_worker and place it directly before DistSamplingProducer initialization, or before the parent DistLoader class’s __init__ which invokes it. This ensures the stagger occurs at the correct point and is no longer nullified by distributed synchronization.
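The placement described above can be sketched as follows. This is an assumption-laden illustration rather than the actual change: `staggered_init`, `seconds_per_rank`, and `init_fn` are hypothetical names, and `init_fn` stands in for whatever constructs the DistSamplingProducer (or calls the parent `__init__` that does).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def staggered_init(
    local_rank: int,
    seconds_per_rank: float,
    init_fn: Callable[[], T],
    sleep_fn: Callable[[float], None] = time.sleep,
) -> T:
    """Sleep for a rank-dependent delay, then run the heavy init.

    Hypothetical helper: the sleep sits immediately before init_fn,
    with no collective call in between that could re-align the ranks.
    """
    delay = local_rank * seconds_per_rank
    if delay > 0:
        sleep_fn(delay)
    # The memory-heavy step runs right after the sleep.
    return init_fn()


# Usage sketch with a stand-in for the real initialization.
producer = staggered_init(
    local_rank=1, seconds_per_rank=0.01, init_fn=lambda: "producer"
)
```

The `sleep_fn` parameter is only there so the delay can be replaced in tests; the point of the sketch is the ordering, with the sleep as the last thing that happens before the initialization.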

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

Collaborator

@kmontemayor2-sc kmontemayor2-sc left a comment

Hmm, for GraphStore (server-client) mode this may still be an issue, but unfortunately we also need to launch the samplers in lock-step there.

Can you add a TODO (assigned to me) around python/gigl/distributed/distributed_neighborloader.py:340 to look into this?

@zfan3-sc
Collaborator Author

kmontemayor2-sc

Added the TODO

Collaborator

@kmontemayor2-sc kmontemayor2-sc left a comment

LGTM, you'll need a second stamp as this is an OSS repo :)

Collaborator

@mkolodner-sc mkolodner-sc left a comment

Thanks a lot Zihao!

@zfan3-sc zfan3-sc enabled auto-merge January 21, 2026 20:21
@zfan3-sc zfan3-sc added this pull request to the merge queue Jan 21, 2026
@zfan3-sc zfan3-sc removed this pull request from the merge queue due to a manual request Jan 21, 2026
@zfan3-sc zfan3-sc added this pull request to the merge queue Jan 21, 2026
Merged via the queue into main with commit 660ecbe Jan 22, 2026
6 checks passed
@zfan3-sc zfan3-sc deleted the zfan3/stagger branch January 22, 2026 00:22