feat: support offline mode for math training datasets#4

Merged
haizhongzheng merged 1 commit into main from offline-math on May 25, 2026

Conversation

@haizhongzheng
Member

Description

Add an offline-data mode for the math recipes: pre-download every train/eval
dataset once, then load from disk at training time so the AstraFlow service
never has to reach the HuggingFace Hub during a run.

This unblocks training on clusters without internet access. Previously,
datasets were fetched lazily by datasets.load_dataset(...) on first use.

How it works. AgentConfig gains a single data_root knob; when set,
_create_dataset_from_config auto-derives
offline_dir = {data_root}/{name} for every rollout/eval entry that doesn't
already specify one. name is the dict key for eval datasets and the
dataset_fn module basename for the rollout. The per-loader offline_dir
plumbing already existed; this PR adds the orchestrator + a single recipe-level
switch.
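
A minimal sketch of the derivation rule described above, not the code in
service.py. It assumes dataset_fn is a string of the form
"package.module:function" (the real format may differ), and the function
names and example values here are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional


def module_basename(dataset_fn: str) -> str:
    """Return the last dotted component of the module part of dataset_fn.

    Assumption: dataset_fn looks like "package.module:function".
    """
    module_path = dataset_fn.split(":", 1)[0]
    return module_path.rsplit(".", 1)[-1]


def derive_offline_dir(
    data_root: Optional[str],
    name: str,
    explicit: Optional[str] = None,
) -> Optional[str]:
    """Pick the offline directory for one rollout/eval dataset entry."""
    if explicit is not None:
        # An entry that already specifies offline_dir is left untouched.
        return explicit
    if data_root is None:
        # No data_root configured: nothing is derived.
        return None
    # offline_dir = {data_root}/{name}
    return str(PurePosixPath(data_root) / name)


# Eval datasets use their dict key as the name; the rollout dataset uses
# the module basename of its dataset_fn.
eval_dir = derive_offline_dir("data-data/math", "my_eval_set")
rollout_dir = derive_offline_dir(
    "data-data/math", module_basename("my_pkg.datasets.my_train_set:load")
)
```

Under these assumptions, eval_dir is data-data/math/my_eval_set and
rollout_dir is data-data/math/my_train_set. An entry that sets its own
offline_dir always wins over the derived path.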

What's new:

  • astraflow/dataflow/service_config.py: AgentConfig.data_root field
  • astraflow/dataflow/service.py: auto-derivation logic + _module_basename
    helper
  • examples/math/offline/download_math_datasets.py: one-shot downloader
    using each dataset module's existing download_dataset() helper.
    Idempotent; supports --only, --force, --verify; writes
    MANIFEST.json.
  • examples/math/offline/qwen3-8b-m2po-full-offline/: matching recipe
    (same as qwen3-8b-m2po-full but with dataflow.data_root: data-data/math)
  • examples/math/offline/README.md: workflow doc
  • docs/en/recipes/math-offline.md + docs/en/index.rst: new docs page
    under Recipes, immediately after Math

Caveats: model and tokenizer weights are not covered. model_path /
tokenizer_path still resolve via the HF cache. This is documented in the
recipe header and on the docs page.
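
For illustration, the idempotent download-plus-manifest behaviour could look
roughly like the sketch below. This is not the code in
download_math_datasets.py: the download_all function, the shape of the
downloader callables (taking a target directory and returning a row count),
and the manifest layout are all assumptions made for the example. The only
and force parameters stand in for the --only and --force flags; --verify is
not modelled.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def download_all(
    data_root: str,
    downloaders: Dict[str, Callable[[Path], int]],
    only: Optional[Iterable[str]] = None,
    force: bool = False,
) -> Dict[str, Dict[str, int]]:
    """Download each selected dataset once and record row counts."""
    root = Path(data_root)
    root.mkdir(parents=True, exist_ok=True)
    manifest_path = root / "MANIFEST.json"

    # Reuse the existing manifest so reruns keep earlier entries.
    manifest = (
        json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    )
    selected = set(only) if only else set(downloaders)

    for name, download in downloaders.items():
        if name not in selected:
            continue
        target = root / name
        if target.exists() and name in manifest and not force:
            # Idempotent: already on disk and recorded, so skip it.
            continue
        target.mkdir(parents=True, exist_ok=True)
        # Hypothetical contract: the callable returns the number of rows.
        manifest[name] = {"rows": download(target)}

    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Running it twice with the same arguments downloads nothing the second time;
passing force=True re-downloads the selected datasets and overwrites their
manifest entries.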

Pre-download every math train/eval dataset once and load it from disk at
training time, so the AstraFlow service never has to reach the
HuggingFace Hub during a run.

- AgentConfig gains a single ``data_root`` knob; when set,
  ``_create_dataset_from_config`` auto-derives
  ``offline_dir = {data_root}/{name}`` for every rollout/eval entry that
  does not already specify one. Name is the dict key for eval datasets,
  or the dataset_fn module basename for the rollout.
- examples/math/offline/download_math_datasets.py orchestrates a
  one-shot download via each dataset module's existing
  ``download_dataset()`` helper. Idempotent; supports --only, --force,
  --verify; writes MANIFEST.json with row counts.
- examples/math/offline/qwen3-8b-m2po-full-offline/ is the matching
  recipe — same as qwen3-8b-m2po-full but with
  ``dataflow.data_root: data-data/math``.
- docs/en/recipes/math-offline.md added to the Recipes toctree
  immediately after recipes/math, with the download/run workflow.

Smoke-tested end-to-end on 8x H100: all 6 datasets loaded from
data-data/math, eval-at-start ran 4768 items, 3 train steps completed,
weights transferred to RaaS each cycle.
@haizhongzheng haizhongzheng merged commit 93517ce into main May 25, 2026
1 check failed
@haizhongzheng haizhongzheng deleted the offline-math branch May 25, 2026 21:45