Llama-3 experiments #6

Merged
m-wojnar merged 29 commits into main from llm-experiments on Apr 28, 2026
@m-wojnar

Owner

No description provided.

m-wojnar and others added 29 commits April 21, 2026 20:36
- experiments/lm: full torchtitan-based training pipeline (MaxPTrainer,
  maxp_converter, maxp_llama3 scale configs, SLURM launch sweep, debug runner)
- Steps auto-computed as 20× non-embed params via meta-device param count
- Replace torchtitan submodule with torchtitan>=0.2.2 pip dependency;
  remove all sys.path hacks from scripts
- Fix DTensor/Tensor mix in alignment computation (FSDP2 compat)
- Method rename: mp-full→mup-full, mp-noalign→mup-no
- docs/experiments.md: consolidated experiment plan
- pyproject.toml: add torchtitan, trim unused dev deps

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
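
One way to read the "steps auto-computed" item above is as a token budget: count the non-embedding parameters, multiply by 20 to get a target number of training tokens, and divide by the tokens consumed per optimizer step. The sketch below follows that reading, which is an assumption and is not confirmed by the commit message. The helper names, the name-based embedding filter, and the toy parameter sizes are all hypothetical. The actual pipeline is described as counting parameters on the meta device, which this stdlib-only sketch replaces with a plain dict of element counts.

```python
import math


def count_non_embedding(param_sizes):
    """Sum element counts, skipping parameters whose name contains 'embed'.

    The name-based filter is an assumption made for illustration only.
    """
    return sum(n for name, n in param_sizes.items() if "embed" not in name)


def steps_for_budget(non_embed_params, global_batch_size, seq_len, tokens_per_param=20):
    """Number of optimizer steps needed to see tokens_per_param tokens per parameter."""
    token_budget = tokens_per_param * non_embed_params
    tokens_per_step = global_batch_size * seq_len
    return math.ceil(token_budget / tokens_per_step)


# Toy, made-up parameter sizes (name -> number of elements).
sizes = {
    "tok_embeddings.weight": 32_000,
    "layers.0.wq.weight": 4_096,
    "norm.weight": 64,
}

non_embed = count_non_embedding(sizes)  # 4160
steps = steps_for_budget(non_embed, global_batch_size=8, seq_len=128)
print(non_embed, steps)
```

Rounding up with `math.ceil` keeps the run at or above the token budget rather than just below it.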
- README: add installation with cu128/cu130/cpu extras, LLM experiments section
- pyproject: cu128/cu130 extras for CUDA index selection, remove torchaudio
- torchtitan pinned to nightly 1a2fef04 (2026-04-21)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Switch fineweb loaders to streaming=True for multi-worker support
- Use fineweb-edu base dataset across all scales for fair comparison
- Make --batch-size global (divided by WORLD_SIZE for local batch size)
- Add --num-workers (default 8) and --prefetch-factor (default 4) args
- Set OMP_NUM_THREADS=16, HF_DATASETS_OFFLINE=1 in SLURM template
- Remove SCALE_DATASETS dict, default dataset to fineweb-edu

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
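
The global `--batch-size` convention described above can be sketched as follows. The flag names and the defaults for `--num-workers` and `--prefetch-factor` come from the commit message; the function names, the `required=True` choice, and the divisibility check are assumptions added for illustration and are not taken from the repository.

```python
import argparse


def build_parser():
    """CLI with a global batch size plus data-loader tuning flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, required=True,
                        help="global batch size, summed over all ranks")
    parser.add_argument("--num-workers", type=int, default=8)
    parser.add_argument("--prefetch-factor", type=int, default=4)
    return parser


def local_batch_size(global_batch_size, world_size):
    """Per-rank batch size; rejects splits that would silently drop samples."""
    if global_batch_size % world_size != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible "
            f"by world size {world_size}"
        )
    return global_batch_size // world_size


# In a real launch the world size would typically come from the
# WORLD_SIZE environment variable; it is hard-coded here for the demo.
args = build_parser().parse_args(["--batch-size", "256"])
per_rank = local_batch_size(args.batch_size, world_size=8)
print(per_rank, args.num_workers, args.prefetch_factor)
```

Keeping the flag global means the effective batch size stays fixed when the number of GPUs changes, which matters when comparing runs across scales.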
Re-enable async checkpointing via --checkpoint-interval. Resync
optimizer param refs by FQN after no_grad capture forwards so FSDP2
reshard-induced Parameter swaps don't break get_optimizer_state_dict.
Add use_training_activations flag to MaxPConverter config.
Skip runs with existing checkpoints by default; --resume bypasses
the skip to resubmit interrupted training.
Force reshard_after_forward=always to prevent no_grad forwards from
leaving layers unsharded; remove now-unnecessary _capture_and_resync.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
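
The skip-unless-`--resume` behaviour described above can be sketched like this. The `checkpoint` subdirectory name, the `step-100` folder, and the `should_submit` helper are hypothetical; the commit message only states that runs with existing checkpoints are skipped by default and that `--resume` bypasses the skip.

```python
import tempfile
from pathlib import Path


def should_submit(run_dir, resume=False):
    """Return True if the run should be (re)submitted.

    A run counts as already started when its checkpoint directory
    exists and is non-empty (directory layout assumed for illustration).
    """
    checkpoint_dir = Path(run_dir) / "checkpoint"
    has_checkpoint = checkpoint_dir.is_dir() and any(checkpoint_dir.iterdir())
    return resume or not has_checkpoint


with tempfile.TemporaryDirectory() as tmp:
    fresh = Path(tmp) / "fresh-run"
    fresh.mkdir()

    started = Path(tmp) / "started-run"
    (started / "checkpoint" / "step-100").mkdir(parents=True)

    fresh_default = should_submit(fresh)                    # no checkpoint: submit
    started_default = should_submit(started)                # checkpoint found: skip
    started_resume = should_submit(started, resume=True)    # resume overrides the skip

print(fresh_default, started_default, started_resume)
```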
@m-wojnar merged commit 1cfd5f4 into main on Apr 28, 2026
5 checks passed
@m-wojnar deleted the llm-experiments branch on April 28, 2026 15:50