Fix incomplete stats.json when relaunching a LocalPipelineExecutor job by discobot · Pull Request #491 · huggingface/datatrove

discobot · 2026-06-13T13:49:55Z

Fixes #360.

The incomplete stats.json after a relaunch is specific to
LocalPipelineExecutor (answering the question in the thread). run() sums
the PipelineStats that the tasks it just ran return in memory. Ranks
completed in a previous run are filtered out of ranks_to_run (their
_run_for_rank returns an empty PipelineStats), so their contribution is
dropped, even though each completed rank's stats were correctly persisted to
stats/{rank:05d}.json. The Ray executor and the merge_stats tool already
merge from those persisted files and are not affected.

This PR makes the local executor do the same: after all tasks finish, build
the merged stats from stats/{rank:05d}.json for every local rank instead of
summing the in-memory return values. The files are guaranteed to exist at
that point, since a rank is only marked completed after its stats file is
written and the merge only runs once all remaining tasks succeeded.
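To make the difference concrete, here is a minimal, self-contained sketch of the two aggregation strategies. It does not use datatrove: `write_rank_stats`, `merge_persisted_stats` and `simulate_relaunch` are hypothetical helpers, the per-rank files hold a simplified flat dict rather than a serialized PipelineStats, and the task counts are illustrative only.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def write_rank_stats(stats_dir: Path, rank: int, stats: dict) -> None:
    """Persist one rank's counters to stats/{rank:05d}.json (simplified flat dict)."""
    stats_dir.mkdir(parents=True, exist_ok=True)
    (stats_dir / f"{rank:05d}.json").write_text(json.dumps(stats))


def merge_persisted_stats(stats_dir: Path, world_size: int) -> Counter:
    """Sum the counters of every rank's file, whichever launch produced it."""
    merged = Counter()
    for rank in range(world_size):
        merged.update(json.loads((stats_dir / f"{rank:05d}.json").read_text()))
    return merged


def simulate_relaunch() -> tuple:
    """Return (in-memory docs count, merged-from-disk docs count) after a relaunch."""
    with tempfile.TemporaryDirectory() as tmp:
        stats_dir = Path(tmp) / "stats"
        # First launch: ranks 0 and 1 complete and persist their stats.
        for rank in (0, 1):
            write_rank_stats(stats_dir, rank, {"docs": 5})
        # Relaunch: only the remaining ranks run, so an in-memory sum sees only them.
        in_memory = Counter()
        for rank in (2, 3):
            rank_stats = {"docs": 5}
            write_rank_stats(stats_dir, rank, rank_stats)
            in_memory.update(rank_stats)
        merged = merge_persisted_stats(stats_dir, world_size=4)
        return in_memory["docs"], merged["docs"]
```

In this toy setup the in-memory sum covers only the two relaunched ranks, while the merge from disk covers all four.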

One side effect to be aware of: for fresh runs too, stats.json is now
aggregated per task (e.g. "docs": {"total": 20, "n": 4, ...} instead of
"docs": 20) — the format the ray executor and merge_stats already
produce, so the executors now serialize consistently.
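For anything that reads stats.json downstream, the practical consequence is that a metric may now be an aggregated record where it used to be a bare number. The sketch below shows one way to build and read such a record; `aggregate_metric` and `total_of` are hypothetical helpers, and only the `total` and `n` fields shown in the example above are modelled (the elided fields are left out).

```python
def aggregate_metric(per_task_values: list) -> dict:
    """Combine one metric's per-task values into an aggregated record."""
    return {"total": sum(per_task_values), "n": len(per_task_values)}


def total_of(metric) -> float:
    """Read the total from either a bare number or an aggregated record."""
    return metric["total"] if isinstance(metric, dict) else metric


# Four tasks that each processed 5 documents.
docs = aggregate_metric([5, 5, 5, 5])
```

A reader written like `total_of` keeps working against stats.json files produced both before and after this change.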

Added a regression test that fails one rank on the first launch, relaunches,
and checks that both the returned stats and stats.json cover all tasks
(fails with AssertionError: 10 != 20 before the fix).
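The shape of that scenario can be sketched without datatrove. `ToyExecutor` below is a hypothetical stand-in, not the real LocalPipelineExecutor or the test added in this PR: the `completions/` marker directory, the `fail_ranks` argument and the flat `{"docs": ...}` stats are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


class ToyExecutor:
    """Hypothetical stand-in for a local executor; not datatrove's API."""

    def __init__(self, logging_dir, tasks: int, fail_ranks=()):
        self.logging_dir = Path(logging_dir)
        self.tasks = tasks
        self.fail_ranks = set(fail_ranks)

    def _stats_file(self, rank: int) -> Path:
        return self.logging_dir / "stats" / f"{rank:05d}.json"

    def _marker(self, rank: int) -> Path:
        return self.logging_dir / "completions" / f"{rank:05d}"

    def run(self, docs_per_task: int = 5) -> dict:
        for sub in ("stats", "completions"):
            (self.logging_dir / sub).mkdir(parents=True, exist_ok=True)
        failed = []
        for rank in range(self.tasks):
            if self._marker(rank).exists():
                continue  # completed in an earlier launch: skip, keep its stats file
            if rank in self.fail_ranks:
                failed.append(rank)  # simulated crash: no stats file, no marker
                continue
            # Stats are written before the rank is marked completed.
            self._stats_file(rank).write_text(json.dumps({"docs": docs_per_task}))
            self._marker(rank).touch()
        if failed:
            raise RuntimeError(f"ranks failed: {failed}")
        # Merge from disk so ranks finished in earlier launches are counted too.
        merged = {"docs": 0}
        for rank in range(self.tasks):
            merged["docs"] += json.loads(self._stats_file(rank).read_text())["docs"]
        (self.logging_dir / "stats.json").write_text(json.dumps(merged))
        return merged


def relaunch_scenario() -> tuple:
    """Fail rank 1 on the first launch, relaunch, return (returned, on-disk) stats."""
    with tempfile.TemporaryDirectory() as tmp:
        try:
            ToyExecutor(tmp, tasks=4, fail_ranks={1}).run()
        except RuntimeError:
            pass  # the first launch is expected to fail on rank 1
        returned = ToyExecutor(tmp, tasks=4).run()
        on_disk = json.loads((Path(tmp) / "stats.json").read_text())
        return returned, on_disk
```

Checking both the returned value and the file mirrors the two assertions described above: either one alone could pass while the other still reflects only the relaunched rank.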

Note

Low Risk
Localized change to post-run stats aggregation in the local executor; behavior aligns with existing Ray merge logic and per-rank persistence.

Overview
Fixes incomplete stats.json when a LocalPipelineExecutor job is relaunched after a partial run (e.g. #360).

run() no longer sums in-memory PipelineStats from tasks executed in the current invocation. It now loads and merges stats/{rank:05d}.json for every local rank, matching the Ray executor and merge_stats. Ranks skipped via skip_completed still contribute because their stats were already written when they first completed.

Fresh runs also get the richer aggregated stat shape in stats.json (e.g. per-metric totals with n), consistent with Ray.

A regression test simulates a mid-run failure and relaunch and asserts returned stats and on-disk stats.json count all tasks’ work.

Reviewed by Cursor Bugbot for commit 396f5db.

When a partially completed job was relaunched, the local executor only summed the stats returned by the freshly run tasks, so stats.json was missing everything from tasks completed in previous runs. Build the merged stats from the persisted per-rank stats files instead, like the ray executor already does. Also adds a regression test that fails one rank on the first launch and checks the merged stats after the relaunch.

discobot mentioned this pull request Jun 13, 2026

Wrong stats files at the end when the job is relaunch #360

Open

discobot wants to merge 1 commit into huggingface:main from discobot:fix/360-local-executor-relaunch-stats
