Fix incomplete stats.json when relaunching a LocalPipelineExecutor job #491

Open
discobot wants to merge 1 commit into huggingface:main from discobot:fix/360-local-executor-relaunch-stats

discobot commented Jun 13, 2026

Fixes #360.

The incomplete stats.json after a relaunch is specific to
LocalPipelineExecutor (answering the question in the thread). run() sums
the PipelineStats returned in-memory by the tasks it just ran. Ranks
completed in a previous run are filtered out of ranks_to_run (their
_run_for_rank returns an empty PipelineStats), so their contribution is
dropped, even though each completed rank's stats were correctly persisted to
stats/{rank:05d}.json. The ray executor and the merge_stats tool already
merge from those persisted files and are not affected.
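
To make the failure mode concrete, here is a toy reproduction in plain Python. It does not use datatrove: `run_rank`, `run_summing_in_memory`, the `Counter`-based stats and the 5 docs per rank are all stand-ins assumed for illustration.

```python
from collections import Counter

def run_rank(rank, completed, persisted):
    """Toy per-rank task: a rank completed earlier returns empty stats."""
    if rank in completed:
        return Counter()  # nothing is contributed in memory for a skipped rank
    stats = Counter(docs=5)  # assumed workload: 5 docs per rank
    persisted[rank] = stats  # stats are persisted before the rank is marked completed
    completed.add(rank)
    return stats

def run_summing_in_memory(world_size, completed, persisted):
    """Pre-fix behaviour: sum only what the tasks of this launch returned."""
    total = Counter()
    for rank in range(world_size):
        total += run_rank(rank, completed, persisted)
    return total

# Ranks 0 and 1 finished in a previous launch; ranks 2 and 3 still have to run.
completed = {0, 1}
persisted = {0: Counter(docs=5), 1: Counter(docs=5)}
relaunch_total = run_summing_in_memory(4, completed, persisted)
persisted_total = sum(stats["docs"] for stats in persisted.values())
```

In this sketch the relaunch reports only the work of ranks 2 and 3, while the persisted per-rank stats cover all four ranks.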

This PR makes the local executor do the same: after all tasks finish, build
the merged stats from stats/{rank:05d}.json for every local rank instead of
summing the in-memory return values. The files are guaranteed to exist at
that point, since a rank is only marked completed after its stats file is
written and the merge only runs once all remaining tasks have succeeded.
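
A minimal sketch of the merge-from-files idea, using only the standard library. The `{rank:05d}.json` naming comes from the description above. The helper name `merge_rank_stats`, the flat `{"docs": 5}` file contents and the plain numeric sum are assumptions for illustration, not the executor's actual code or stats format.

```python
import json
import tempfile
from pathlib import Path

def merge_rank_stats(stats_dir: Path, ranks):
    """Sum the numeric counters found in each rank's persisted stats file."""
    merged = {}
    for rank in ranks:
        path = stats_dir / f"{rank:05d}.json"
        with path.open() as f:
            rank_stats = json.load(f)  # raises if a rank's file is missing
        for key, value in rank_stats.items():
            merged[key] = merged.get(key, 0) + value
    return merged

with tempfile.TemporaryDirectory() as tmp:
    stats_dir = Path(tmp) / "stats"
    stats_dir.mkdir()
    # Pretend four ranks each persisted their stats, possibly across launches.
    for rank in range(4):
        (stats_dir / f"{rank:05d}.json").write_text(json.dumps({"docs": 5}))
    merged = merge_rank_stats(stats_dir, range(4))
```

Because the merge reads files rather than return values, it does not matter in which launch a rank finished.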

One side effect to be aware of: for fresh runs too, stats.json is now
aggregated per task (e.g. "docs": {"total": 20, "n": 4, ...} instead of
"docs": 20). This is the format the ray executor and merge_stats already
produce, so the executors now serialize consistently.
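
The difference between the two shapes can be sketched as follows. Only the `total` and `n` fields shown in the example above are computed; the fields elided there are left out. The helper name `aggregate_per_task` and the split of 5 docs across 4 tasks are assumptions chosen to match that example.

```python
def aggregate_per_task(values):
    """Keep the sum and the number of per-task values that went into it."""
    return {"total": sum(values), "n": len(values)}

per_task_docs = [5, 5, 5, 5]  # assumed: four tasks, five docs each

flat = sum(per_task_docs)                       # old shape: "docs": 20
aggregated = aggregate_per_task(per_task_docs)  # new shape: "docs": {"total": 20, "n": 4}
```

Anything that reads stats.json and expects a bare number for a metric would need to read `total` instead.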

Added a regression test that fails one rank on the first launch, relaunches,
and checks that both the returned stats and stats.json cover all tasks
(fails with AssertionError: 10 != 20 before the fix).
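
The shape of such a test can be sketched with a toy executor. This is not the test added in this PR: `launch`, the `completions` marker directory, the two-task setup and the 10 docs per task are hypothetical, picked so that the totals line up with the 10 vs 20 in the assertion message.

```python
import json
import tempfile
from pathlib import Path

DOCS_PER_TASK = 10  # assumed workload per task

def launch(workdir: Path, world_size: int, fail_rank=None):
    """Toy executor: run unfinished ranks, then merge from persisted files."""
    stats_dir = workdir / "stats"
    done_dir = workdir / "completions"  # hypothetical completion markers
    stats_dir.mkdir(parents=True, exist_ok=True)
    done_dir.mkdir(parents=True, exist_ok=True)
    for rank in range(world_size):
        marker = done_dir / f"{rank:05d}"
        if marker.exists():
            continue  # completed in an earlier launch
        if rank == fail_rank:
            raise RuntimeError(f"injected failure on rank {rank}")
        stats_file = stats_dir / f"{rank:05d}.json"
        stats_file.write_text(json.dumps({"docs": DOCS_PER_TASK}))
        marker.touch()  # mark completed only after the stats file is written
    total = sum(
        json.loads((stats_dir / f"{r:05d}.json").read_text())["docs"]
        for r in range(world_size)
    )
    merged = {"docs": total}
    (workdir / "stats.json").write_text(json.dumps(merged))
    return merged

def relaunch_after_failure():
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        try:
            launch(workdir, world_size=2, fail_rank=1)  # first launch: rank 1 fails
        except RuntimeError:
            pass
        returned = launch(workdir, world_size=2)  # relaunch: only rank 1 runs
        on_disk = json.loads((workdir / "stats.json").read_text())
        return returned, on_disk

returned_stats, stats_json = relaunch_after_failure()
```

Checking both the return value and the file on disk guards against the two drifting apart.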


Note

Low Risk
Localized change to post-run stats aggregation in the local executor; behavior aligns with existing Ray merge logic and per-rank persistence.

Overview
Fixes incomplete stats.json when a LocalPipelineExecutor job is relaunched after a partial run (e.g. #360).

run() no longer sums in-memory PipelineStats from tasks executed in the current invocation. It now loads and merges stats/{rank:05d}.json for every local rank, matching the Ray executor and merge_stats. Ranks skipped via skip_completed still contribute because their stats were already written when they first completed.

Fresh runs also get the richer aggregated stat shape in stats.json (e.g. per-metric totals with n), consistent with Ray.

A regression test simulates a mid-run failure and relaunch and asserts returned stats and on-disk stats.json count all tasks’ work.

Reviewed by Cursor Bugbot for commit 396f5db. Bugbot is set up for automated code reviews on this repo.

When a partially completed job was relaunched, the local executor only summed the stats returned by the freshly run tasks, so stats.json was missing everything from tasks completed in previous runs. Build the merged stats from the persisted per-rank stats files instead, like the ray executor already does. Also adds a regression test that fails one rank on the first launch and checks the merged stats after the relaunch.

Development

Successfully merging this pull request may close these issues.

Wrong stats files at the end when the job is relaunch
