Fix incomplete stats.json when relaunching a LocalPipelineExecutor job#491
Open
discobot wants to merge 1 commit into
Open
Fix incomplete stats.json when relaunching a LocalPipelineExecutor job#491discobot wants to merge 1 commit into
discobot wants to merge 1 commit into
Conversation
When a partially completed job was relaunched, the local executor only summed the stats returned by the freshly run tasks, so stats.json was missing everything from tasks completed in previous runs. Build the merged stats from the persisted per-rank stats files instead, like the ray executor already does. Also adds a regression test that fails one rank on the first launch and checks the merged stats after the relaunch.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #360.
The incomplete
stats.jsonafter a relaunch is specific toLocalPipelineExecutor(answering the question in the thread):run()sumsthe
PipelineStatsreturned in-memory by the tasks it just ran, and rankscompleted in a previous run are filtered out of
ranks_to_run(their_run_for_rankreturns an emptyPipelineStats), so their contribution isdropped — even though each completed rank's stats were correctly persisted to
stats/{rank:05d}.json. The ray executor and themerge_statstool alreadymerge from those persisted files and are not affected.
This PR makes the local executor do the same: after all tasks finish, build
the merged stats from
stats/{rank:05d}.jsonfor every local rank instead ofsumming the in-memory return values. The files are guaranteed to exist at
that point, since a rank is only marked completed after its stats file is
written and the merge only runs once all remaining tasks succeeded.
One side effect to be aware of: for fresh runs too,
stats.jsonis nowaggregated per task (e.g.
"docs": {"total": 20, "n": 4, ...}instead of"docs": 20) — the format the ray executor andmerge_statsalreadyproduce, so the executors now serialize consistently.
Added a regression test that fails one rank on the first launch, relaunches,
and checks that both the returned stats and
stats.jsoncover all tasks(fails with
AssertionError: 10 != 20before the fix).Note
Low Risk
Localized change to post-run stats aggregation in the local executor; behavior aligns with existing Ray merge logic and per-rank persistence.
Overview
Fixes incomplete
stats.jsonwhen aLocalPipelineExecutorjob is relaunched after a partial run (e.g. #360).run()no longer sums in-memoryPipelineStatsfrom tasks executed in the current invocation. It now loads and mergesstats/{rank:05d}.jsonfor every local rank, matching the Ray executor andmerge_stats. Ranks skipped viaskip_completedstill contribute because their stats were already written when they first completed.Fresh runs also get the richer aggregated stat shape in
stats.json(e.g. per-metric totals withn), consistent with Ray.A regression test simulates a mid-run failure and relaunch and asserts returned stats and on-disk
stats.jsoncount all tasks’ work.Reviewed by Cursor Bugbot for commit 396f5db. Bugbot is set up for automated code reviews on this repo. Configure here.