Skip to content

[Enhance] Show dataset build progress with cache status#1843

Open
HAOCHENYE wants to merge 1 commit into
mainfrom
log-cache-info
Open

[Enhance] Show dataset build progress with cache status#1843
HAOCHENYE wants to merge 1 commit into
mainfrom
log-cache-info

Conversation

@HAOCHENYE
Copy link
Copy Markdown
Collaborator

Add a tqdm progress bar to build_datasets and expose JsonlDataset.cached
so the bar can report whether each dataset hit the tokenize cache. Keep
each jsonl path bound to its own dataset/tokenize_fn config while
expanding paths, instead of reusing the last loop config across all
paths. Also fix a stale get_rank comparison (get_rank == 0) that left the
token-count debug log permanently disabled.

Add a tqdm progress bar to build_datasets and expose JsonlDataset.cached
so the bar can report whether each dataset hit the tokenize cache. Keep
each jsonl path bound to its own dataset/tokenize_fn config while
expanding paths, instead of reusing the last loop config across all
paths. Also fix a stale get_rank comparison (get_rank == 0) that left the
token-count debug log permanently disabled.
@HAOCHENYE HAOCHENYE requested a review from jayhenry May 26, 2026 21:44
@HAOCHENYE
Copy link
Copy Markdown
Collaborator Author

@claude review

Comment on lines 150 to 151

return datasets
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: The tqdm progress bar is created but never closed. When using manual bar.update(1) instead of iterating over the wrapper, you need an explicit bar.close() after the loop — otherwise the final state may not render correctly (e.g., the cursor stays on the bar line).

Suggested change
return datasets
if bar is not None:
bar.close()
return datasets

@@ -154,7 +165,7 @@ def build_dataloader(
dp_size = dp_mesh.size() if dp_mesh is not None else 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: Good catch fixing get_rank == 0get_rank() == 0 here. However, the same bug exists at line 402 in DataloaderConfig.build:

if self.pack_level != "none" and get_rank == 0:

Since this PR already fixes the deprecated copy, it would be good to fix the active one in DataloaderConfig.build as well — that code path is the one users actually hit.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 26, 2026

Claude: Review Summary

Good set of changes — the loop restructuring correctly fixes the multi-dataset config-sharing bug, the progress bar is a helpful UX improvement, and the get_rank == 0get_rank() == 0 fix is a real bug.

Issues found:

  1. Missing bar.close() — the tqdm bar is manually updated but never closed, which can leave the terminal in an odd state.
  2. Incomplete get_rank fix — the same get_rank == 0 bug (comparing the function object instead of calling it) exists at line 402 in DataloaderConfig.build. Since this PR already addresses the deprecated copy, fixing the active code path too would be worthwhile.

Both are minor fixes. LGTM otherwise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant