fix process staggering logic to smooth memory spikes #451
kmontemayor2-sc
left a comment
Hmm, for GraphStore (server-client) mode this may still be an issue, but unfortunately we also need to launch the samplers in lock-step.
Can you add a TODO (assigned to me) around python/gigl/distributed/distributed_neighborloader.py:340 to look into this?
Added the TODO.
kmontemayor2-sc
left a comment
LGTM, you'll need a second stamp as this is an OSS repo :)
mkolodner-sc
left a comment
Thanks a lot Zihao!
Scope of work done
The current implementation for staggering processes during data loader initialization, which was intended to smooth out memory spikes, does not have the desired effect.
There are two main reasons:
1. The sleep is not applied at the point where it is needed (DistSamplingProducer initialization).
2. The torch.distributed.broadcast_object_list call synchronizes all processes, effectively negating the staggered sleep.
In this PR, we move the sleep logic out of init_neighbor_loader_worker and place it directly before DistSamplingProducer initialization, or before the parent DistLoader class’s __init__ which invokes it. This ensures the stagger occurs at the correct point and is no longer nullified by distributed synchronization.
Where is the documentation for this feature?: N/A
Did you add automated tests or write a test plan?
Updated Changelog.md? NO
Ready for code review?: NO
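To illustrate the ordering described under "Scope of work done", here is a minimal sketch, not the actual GiGL code. The names stagger_delay_seconds, staggered_init, seconds_per_rank, synchronize and heavy_init are hypothetical. In practice, synchronize would stand in for a collective call such as torch.distributed.broadcast_object_list, and heavy_init for constructing DistSamplingProducer (or calling the parent DistLoader __init__). The sketch assumes a simple linear delay per local rank.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def stagger_delay_seconds(local_rank: int, seconds_per_rank: float) -> float:
    """Return how long a process should wait before its memory-heavy init.

    Assumption: a linear schedule, so rank 0 starts immediately and each
    subsequent rank starts seconds_per_rank later.
    """
    if local_rank < 0 or seconds_per_rank < 0:
        raise ValueError("local_rank and seconds_per_rank must be non-negative")
    return local_rank * seconds_per_rank


def staggered_init(
    local_rank: int,
    seconds_per_rank: float,
    synchronize: Callable[[], None],
    heavy_init: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run the collective step first, then stagger, then do the heavy init."""
    # Any call that synchronizes all processes runs BEFORE the sleep.
    # If it ran after the sleep, every process would wait for the slowest
    # one and then proceed together, cancelling the stagger.
    synchronize()

    # Sleep immediately before the memory-heavy step so processes reach
    # their peak allocation at different times.
    sleep(stagger_delay_seconds(local_rank, seconds_per_rank))

    return heavy_init()
```

The sleep function is injectable only so the ordering can be checked without actually waiting; a real caller would leave the default time.sleep.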