[Enhance] Show dataset build progress with cache status#1843
Conversation
Add a tqdm progress bar to build_datasets and expose JsonlDataset.cached so the bar can report whether each dataset hit the tokenize cache. Keep each jsonl path bound to its own dataset/tokenize_fn config while expanding paths, instead of reusing the last loop config across all paths. Also fix a stale get_rank comparison (get_rank == 0) that left the token-count debug log permanently disabled.
|
@claude review |
|
|
||
| return datasets |
There was a problem hiding this comment.
Claude: The tqdm progress bar is created but never closed. When using manual bar.update(1) instead of iterating over the wrapper, you need an explicit bar.close() after the loop — otherwise the final state may not render correctly (e.g., the cursor stays on the bar line).
| return datasets | |
| if bar is not None: | |
| bar.close() | |
| return datasets |
| @@ -154,7 +165,7 @@ def build_dataloader( | |||
| dp_size = dp_mesh.size() if dp_mesh is not None else 1 | |||
There was a problem hiding this comment.
Claude: Good catch fixing get_rank == 0 → get_rank() == 0 here. However, the same bug exists at line 402 in DataloaderConfig.build:
if self.pack_level != "none" and get_rank == 0:Since this PR already fixes the deprecated copy, it would be good to fix the active one in DataloaderConfig.build as well — that code path is the one users actually hit.
|
Claude: Review Summary Good set of changes — the loop restructuring correctly fixes the multi-dataset config-sharing bug, the progress bar is a helpful UX improvement, and the Issues found:
Both are minor fixes. LGTM otherwise. |
Add a tqdm progress bar to build_datasets and expose JsonlDataset.cached
so the bar can report whether each dataset hit the tokenize cache. Keep
each jsonl path bound to its own dataset/tokenize_fn config while
expanding paths, instead of reusing the last loop config across all
paths. Also fix a stale get_rank comparison (get_rank == 0) that left the
token-count debug log permanently disabled.