
feat: Parquet format and rolling retention for rollout debug saves#24

Merged
nightlessbaron merged 5 commits into prod from feature/rollout-parquet-retain
May 20, 2026


@nightlessbaron nightlessbaron commented May 18, 2026

Adds two new flags that modify the behaviour of --save-debug-rollout-data:

  • --save-rollout-format {pt,parquet}: write snappy-compressed parquet instead of .pt, with rollout_id stored in the schema metadata. The default, pt, keeps the existing behaviour.
  • --save-rollout-retain-last-n N: after each training rollout R, delete the file for step R-N. This is O(1) and needs no directory scan. The default, 0, keeps all files.

Eval rollouts (eval_*) are excluded from the retention window.
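
The retention rule described above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the code in this PR: the helper names `rollout_path` and `prune_after_save` and the `<rollout_id>.<fmt>` file naming are hypothetical. It shows why the delete is O(1) (the stale path is computed directly from the step number) and why `eval_*` files fall outside the window (they are never addressed by an integer step).

```python
import tempfile
from pathlib import Path


def rollout_path(save_dir: Path, rollout_id: int, fmt: str = "pt") -> Path:
    # Hypothetical naming scheme: one file per training step, e.g. 12.pt.
    return save_dir / f"{rollout_id}.{fmt}"


def prune_after_save(save_dir: Path, rollout_id: int, retain_last_n: int,
                     fmt: str = "pt") -> bool:
    """Delete the file for step rollout_id - retain_last_n, if it exists.

    Returns True when a file was removed. retain_last_n == 0 keeps everything.
    """
    if retain_last_n <= 0:
        return False
    stale_id = rollout_id - retain_last_n
    if stale_id < 0:
        return False
    stale = rollout_path(save_dir, stale_id, fmt)
    if stale.exists():
        stale.unlink()  # single path lookup, no directory scan
        return True
    return False


def demo() -> list[str]:
    """Save five training steps with N=2 next to one eval file."""
    with tempfile.TemporaryDirectory() as tmp:
        save_dir = Path(tmp)
        (save_dir / "eval_0.pt").write_bytes(b"")  # never matched by a step id
        for step in range(5):
            rollout_path(save_dir, step).write_bytes(b"")
            prune_after_save(save_dir, step, retain_last_n=2)
        return sorted(p.name for p in save_dir.iterdir())
```

With N=2, only the two most recent training files and the eval file remain after step 4.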

Diagnostics are written by @odp. We might remove them in the future, depending on how compute-intensive they are.

…aves

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nightlessbaron nightlessbaron requested a review from a team May 18, 2026 01:02
@nightlessbaron nightlessbaron changed the base branch from main to prod May 18, 2026 01:03
Comment thread .github/workflows/docker-build.yml Fixed
Comment thread miles/backends/megatron_utils/actor.py Outdated
@nightlessbaron nightlessbaron changed the title Parquet format and rolling retention for rollout debug saves feat: Parquet format and rolling retention for rollout debug saves May 18, 2026
@nightlessbaron nightlessbaron requested a review from odp May 18, 2026 01:09
Comment thread miles/ray/rollout.py
Comment thread miles/ray/rollout.py Outdated

odp commented May 20, 2026

@nightlessbaron in miles/ray/rollout.py

    def _get_rollout_data(self, rollout_id):
        if self.args.load_debug_rollout_data:
            data = torch.load(
                self.args.load_debug_rollout_data.format(rollout_id=rollout_id),
                weights_only=False,
            )["samples"]

The load_debug_rollout_data path still expects a file written by torch.save. Maybe we skip parquet in this PR and ship only --save-rollout-retain-last-n, which helps save disk space. The parquet conversion can be done externally when we need it.

@nightlessbaron
Author

Made _get_rollout_data dispatch on the .parquet extension so load and save stay consistent without an extra flag. Also added a logger.info on delete and replaced the single unlink with a backward walk, so that restarting with a smaller N drains all stale files, not just one.
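
The two changes in this reply can be sketched roughly as follows. This is a hedged illustration, not the merged code: `drain_stale` and `load_rollout` are hypothetical names, the `<rollout_id>.pt` naming is assumed, and stopping the walk at the first missing file is an assumption about how the loop terminates. The loaders are passed in as plain callables so the sketch runs without torch or a parquet library.

```python
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


def drain_stale(save_dir: Path, rollout_id: int, retain_last_n: int,
                suffix: str = ".pt") -> list[int]:
    """Walk backward from rollout_id - retain_last_n, deleting each file found.

    Stops at the first missing file (assumption), so the steady-state cost is
    one unlink plus one failed lookup, while a restart with a smaller N
    removes every leftover file in the contiguous run.
    """
    removed: list[int] = []
    if retain_last_n <= 0:
        return removed
    stale_id = rollout_id - retain_last_n
    while stale_id >= 0:
        path = save_dir / f"{stale_id}{suffix}"
        if not path.exists():
            break
        path.unlink()
        removed.append(stale_id)
        stale_id -= 1
    return removed


def load_rollout(path: str, loaders: Dict[str, Callable[[str], Any]]) -> Any:
    """Choose a loader from the file extension, defaulting to the '.pt' loader."""
    suffix = Path(path).suffix
    loader = loaders.get(suffix, loaders[".pt"])
    return loader(path)
```

For example, if a previous run left files for steps 0 through 9 and the job restarts with N=3, the first call at step 10 removes steps 7 down to 0 in one pass; later calls remove one file each.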

odp previously approved these changes May 20, 2026

@odp odp left a comment

LGTM!

@nightlessbaron nightlessbaron merged commit bc75bec into prod May 20, 2026
14 of 17 checks passed
@nightlessbaron nightlessbaron deleted the feature/rollout-parquet-retain branch May 20, 2026 05:35
3 participants