Conversation
Summary of ChangesHello @Jintao-Huang, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses an issue with asynchronous checkpoint saving within the Megatron framework by centralizing and enhancing the checkpoint saving utility. It improves the flexibility and correctness of the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request aims to fix an issue with asynchronous saving in Megatron. The changes primarily involve refactoring the save_mcore_checkpoint function to accept output_dir and async_save parameters, allowing for more flexible checkpointing logic. This is then used in BaseMegatronTrainer to handle merged LoRA checkpoints. However, I've found a critical bug in swift/megatron/trainers/base.py where a variable src_path is used without being defined, which will cause a NameError during execution. I've provided a suggestion to fix this.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request fixes issues with asynchronous checkpoint saving in Megatron. The main change involves updating the latest_checkpointed_iteration.txt tracker file synchronously when a save is scheduled, rather than when it's finalized asynchronously. This prevents losing track of the latest checkpoint if a crash occurs. The PR also introduces a more robust async call management system with a persistent worker option, and refactors the training loop to correctly use these new utilities. Additionally, some redundant code in logging callbacks has been cleaned up. The changes are well-structured and address the async saving problem effectively. I have one minor suggestion to correct a misleading docstring in a new utility function.
| """Check if async calls queue is empty. This result is consistent across ranks. | ||
|
|
||
| Returns: | ||
| bool: True if there is any ongoing async call. |
There was a problem hiding this comment.
The docstring for the return value is incorrect and contradicts the function's implementation and name. The function returns True when there are no ongoing async calls (i.e., the queue is empty), but the docstring states the opposite. This can be misleading for developers using this function.
| bool: True if there is any ongoing async call. | |
| bool: True if there are no ongoing async calls, False otherwise. |
No description provided.