feat: support offline mode for math training datasets #4
Merged
Pre-download every math train/eval dataset once and load it from disk at
training time, so the AstraFlow service never has to reach the
HuggingFace Hub during a run.
- AgentConfig gains a single ``data_root`` knob; when set,
``_create_dataset_from_config`` auto-derives
``offline_dir = {data_root}/{name}`` for every rollout/eval entry that
does not already specify one. Name is the dict key for eval datasets,
or the dataset_fn module basename for the rollout.
- examples/math/offline/download_math_datasets.py orchestrates a
one-shot download via each dataset module's existing
``download_dataset()`` helper. Idempotent; supports --only, --force,
--verify; writes MANIFEST.json with row counts.
- examples/math/offline/qwen3-8b-m2po-full-offline/ is the matching
recipe — same as qwen3-8b-m2po-full but with
``dataflow.data_root: data-data/math``.
- docs/en/recipes/math-offline.md added to the Recipes toctree
immediately after recipes/math, with the download/run workflow.
Smoke-tested end-to-end on 8x H100: all 6 datasets loaded from
data-data/math, eval-at-start ran 4768 items, 3 train steps completed,
weights transferred to RaaS each cycle.
Description
Add an offline-data mode for the math recipes: pre-download every train/eval
dataset once, then load from disk at training time so the AstraFlow service
never has to reach the HuggingFace Hub during a run.
This unblocks training on clusters without internet access. Previously, datasets
were fetched lazily by ``datasets.load_dataset(...)`` on first use.
How it works. ``AgentConfig`` gains a single ``data_root`` knob; when set,
``_create_dataset_from_config`` auto-derives ``offline_dir = {data_root}/{name}``
for every rollout/eval entry that doesn't already specify one. ``name`` is the
dict key for eval datasets and the ``dataset_fn`` module basename for the
rollout. The per-loader ``offline_dir`` plumbing already existed; this PR adds
the orchestrator and a single recipe-level switch.
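
The derivation rule described above can be sketched roughly as follows. This is an illustration only, not the code in ``astraflow/dataflow/service.py``: the function names, their signatures, and the example dataset names are hypothetical, and it assumes the rollout's module is available as a dotted path string.

```python
from pathlib import PurePosixPath
from typing import Optional


def module_basename(module_path: str) -> str:
    """Return the last dotted component of a module path (hypothetical helper)."""
    return module_path.rsplit(".", 1)[-1]


def derive_offline_dir(
    data_root: Optional[str],
    name: str,
    explicit: Optional[str] = None,
) -> Optional[str]:
    """Pick the offline directory for one dataset entry.

    An explicitly configured offline_dir always wins. Otherwise the
    directory is {data_root}/{name}, or None when data_root is unset
    (meaning: keep the existing online behaviour).
    """
    if explicit is not None:
        return explicit
    if not data_root:
        return None
    return str(PurePosixPath(data_root) / name)


# Eval datasets: the name is the dict key (made-up key for illustration).
eval_dir = derive_offline_dir("data-data/math", "my_eval_set")

# Rollout dataset: the name is the module basename (made-up module path).
rollout_dir = derive_offline_dir(
    "data-data/math", module_basename("pkg.datasets.my_train_set")
)
```

With these inputs, ``eval_dir`` is ``data-data/math/my_eval_set`` and ``rollout_dir`` is ``data-data/math/my_train_set``; an entry that already sets its own ``offline_dir`` is left untouched.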
What's new:
- ``astraflow/dataflow/service_config.py``: ``AgentConfig.data_root`` field
- ``astraflow/dataflow/service.py``: auto-derivation logic + ``_module_basename`` helper
- ``examples/math/offline/download_math_datasets.py``: one-shot downloader
using each dataset module's existing ``download_dataset()`` helper.
Idempotent; supports ``--only``, ``--force``, ``--verify``; writes ``MANIFEST.json``.
- ``examples/math/offline/qwen3-8b-m2po-full-offline/``: matching recipe
(same as ``qwen3-8b-m2po-full`` but with ``dataflow.data_root: data-data/math``)
- ``examples/math/offline/README.md``: workflow doc
- ``docs/en/recipes/math-offline.md`` + ``docs/en/index.rst``: new docs page
under Recipes, immediately after Math
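
For readers unfamiliar with the pattern, an idempotent downloader with ``--only`` and ``--force`` semantics and a row-count manifest might look roughly like the sketch below. It is not the contents of ``download_math_datasets.py``: the ``download_all`` function, the callable signature standing in for each module's ``download_dataset()`` helper, and the flat name-to-row-count layout of ``MANIFEST.json`` are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional

# Assumed stand-in for a dataset module's download helper:
# it writes the dataset into `target` and returns the number of rows.
Downloader = Callable[[Path], int]


def download_all(
    data_root: str,
    downloaders: Dict[str, Downloader],
    only: Optional[Iterable[str]] = None,
    force: bool = False,
) -> Dict[str, int]:
    """Download each dataset once and record row counts in MANIFEST.json."""
    root = Path(data_root)
    root.mkdir(parents=True, exist_ok=True)
    manifest_path = root / "MANIFEST.json"
    # Start from the existing manifest so reruns keep earlier entries.
    manifest: Dict[str, int] = (
        json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    )
    selected = set(only) if only is not None else None

    for name, download in downloaders.items():
        if selected is not None and name not in selected:
            continue  # restricted by the "only" filter
        target = root / name
        if target.exists() and name in manifest and not force:
            continue  # already on disk and recorded: nothing to do
        target.mkdir(parents=True, exist_ok=True)
        manifest[name] = download(target)

    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Running the function twice with the same arguments performs no second download, which is the property that makes the one-shot step safe to repeat; passing ``force=True`` re-downloads the selected entries.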
Caveats: model + tokenizer weights are not covered; ``model_path``/``tokenizer_path``
still resolve via the HF cache. This is documented in the recipe header and docs page.