add flight recorder tutorial #3814
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3814
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 7b1c83a with merge base cdc645a.
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
Claude finished @sekyondaMeta's task in 2m 12s
Review of PR #3814
Blocking Issues
1. Wrong file extension: Sphinx will not process this file
The file is named source_suffix = ".rst" There is no MyST-Parser or any markdown extension installed (see The file content is written in Markdown (headers with
2. Missing
Per the repo's contribution guidelines (
Neither of these was done. There is no reference to
3. Missing thumbnail image
New tutorials require a square, high-resolution thumbnail image in
Content & Structural Issues
4. Relationship with existing flight recorder tutorial
There is an existing flight recorder tutorial at
If this is meant to be separate, consider at least cross-linking to the existing tutorial and explaining how the two differ (this one focuses on TorchComms debug server dumps, whereas the existing one focuses on the older env-var-based FR configuration).
5. Local path leaked in Reference section (line 297)
This is a local filesystem path from the author's machine and should be removed.
6. Directory placement question
The file is placed in
7. Demo script (
The tutorial references
8. Table of Contents section is redundant
Lines 16-24 manually list a Table of Contents with anchor links. Sphinx (and RST) auto-generate TOCs via the
Minor Issues
Summary
The tutorial content itself is well-structured and covers a useful debugging workflow. However, this PR has several blocking issues that prevent it from functioning in the build:
These all need to be addressed before this can be merged. |
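To illustrate blocking issue 1 above, here is a minimal sketch of the kind of check that catches a mismatched extension before the build. It is not part of the PR or the repo's tooling. It assumes the Sphinx `source_suffix` is the single string `".rst"` quoted in the review. The function name and the file names are hypothetical.

```python
from pathlib import Path

# Assumption: mirrors the `source_suffix = ".rst"` setting quoted in the review.
SOURCE_SUFFIX = ".rst"


def unprocessed_sources(paths, source_suffix=SOURCE_SUFFIX):
    """Return the paths whose extension differs from source_suffix.

    Files returned here would not be picked up as Sphinx source documents
    under the single-suffix assumption above.
    """
    return [p for p in paths if Path(p).suffix != source_suffix]


# Hypothetical file names, for illustration only.
candidates = ["example_tutorial.md", "another_tutorial.rst"]
print(unprocessed_sources(candidates))
```

A real project may configure `source_suffix` as a list or mapping, in which case the comparison would need to test membership rather than equality.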
fc69c0e to 23b4344
|
@claude review these recent updates to the PR |
|
Claude finished @sekyondaMeta's task in 7m 41s
Review of PR #3814 (Updated)
Issues Resolved Since Last Review
The PR has been significantly updated and addresses several of the blocking issues from the previous review:
Remaining Issue
Link checker CI failure: The link checker is failing because of
Fix: Add
Minor Suggestions (non-blocking)
Summary
The PR is in much better shape compared to the first version. The RST conversion, index entries, cross-reference, and inlined demo script address all the previously identified blocking issues. The only remaining CI failure is the link checker flagging |
Summary: Add a tutorial for debugging single-rank hangs in distributed PyTorch jobs using the TorchComms Flight Recorder, covering both aggregated text dump analysis and per-rank pickle-based CLI detection workflows.
svekars left a comment
LGTM! Thank you for addressing the comments.
|
@svekars how do we merge this pr? |
|
@pytorchbot merge |
|
Mergebot is not configured for this repository. Please use the merge button provided by GitHub. |