Llama-3 experiments by m-wojnar · Pull Request #6 · m-wojnar/maxP

m-wojnar · 2026-04-21T18:57:30Z

No description provided.

- experiments/lm: full torchtitan-based training pipeline (MaxPTrainer, maxp_converter, maxp_llama3 scale configs, SLURM launch sweep, debug runner)
- Steps auto-computed as 20× non-embed params via meta-device param count
- Replace torchtitan submodule with torchtitan>=0.2.2 pip dependency; remove all sys.path hacks from scripts
- Fix DTensor/Tensor mix in alignment computation (FSDP2 compat)
- Method rename: mp-full→mup-full, mp-noalign→mup-no
- docs/experiments.md: consolidated experiment plan
- pyproject.toml: add torchtitan, trim unused dev deps

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
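The step-count rule in this commit can be illustrated with a small sketch. It assumes the "20×" figure is a token budget (20 tokens per non-embedding parameter) that is divided by the tokens consumed per optimizer step; the commit message does not spell this out. The function name, its arguments, and the round-up behaviour are hypothetical and not taken from the repository. The parameter count is passed in directly here, whereas the commit derives it from a meta-device parameter count.

```python
def training_steps(
    non_embed_params: int,
    global_batch_size: int,
    seq_len: int,
    tokens_per_param: int = 20,
) -> int:
    """Return the number of optimizer steps needed to cover a token budget.

    Assumption (not confirmed by the PR): the budget is
    tokens_per_param * non_embed_params tokens, and each step consumes
    global_batch_size * seq_len tokens.
    """
    token_budget = tokens_per_param * non_embed_params
    tokens_per_step = global_batch_size * seq_len
    # Integer ceiling division, so the budget is always fully covered.
    return -(-token_budget // tokens_per_step)


if __name__ == "__main__":
    # Hypothetical example: 100M non-embedding params, batch 256, sequence 2048.
    print(training_steps(100_000_000, 256, 2048))
```

Under these assumptions, a larger global batch or a longer sequence reduces the step count proportionally, while the token budget itself depends only on model size.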

- README: add installation with cu128/cu130/cpu extras, LLM experiments section
- pyproject: cu128/cu130 extras for CUDA index selection, remove torchaudio
- torchtitan pinned to nightly 1a2fef04 (2026-04-21)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Switch fineweb loaders to streaming=True for multi-worker support
- Use fineweb-edu base dataset across all scales for fair comparison
- Make --batch-size global (divided by WORLD_SIZE for local batch size)
- Add --num-workers (default 8) and --prefetch-factor (default 4) args
- Set OMP_NUM_THREADS=16, HF_DATASETS_OFFLINE=1 in SLURM template
- Remove SCALE_DATASETS dict, default dataset to fineweb-edu

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
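The argument changes in this commit might look roughly like the sketch below. Only the flag names and the defaults (8 and 4) come from the commit message; the helper names, the divisibility check, and the fallback to a world size of 1 when `WORLD_SIZE` is unset are assumptions made for illustration.

```python
import argparse
import os


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # --batch-size is the global batch size, summed over all ranks.
    parser.add_argument("--batch-size", type=int, required=True)
    parser.add_argument("--num-workers", type=int, default=8)
    parser.add_argument("--prefetch-factor", type=int, default=4)
    return parser.parse_args(argv)


def local_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split the global batch evenly across ranks."""
    # Assumption: an uneven split is treated as an error rather than rounded.
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by WORLD_SIZE")
    return global_batch_size // world_size


if __name__ == "__main__":
    args = parse_args()
    # Distributed launchers such as torchrun export WORLD_SIZE.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    print(local_batch_size(args.batch_size, world_size))
```

Keeping the flag global means the effective batch size stays fixed when the number of GPUs changes, which makes runs at different scales easier to compare.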

Re-enable async checkpointing via --checkpoint-interval. Resync optimizer param refs by FQN after no_grad capture forwards so FSDP2 reshard-induced Parameter swaps don't break get_optimizer_state_dict. Add use_training_activations flag to MaxPConverter config.

m-wojnar and others added 29 commits April 21, 2026 20:36

Fix uv CUDA extras: add conflicts to prevent cu128/cu130 co-resolution

6119ab3

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add download_hf_assets script and fix SDPA wrapper

409dbfe

Introduce LlamaParametrizedModule, fix compilation

f82c57a

Fix vocab size in debug run

8bdb167

Fix PL-Grid paths

8c0a1c6

Fix launch script

3bc59f3

Fix launch script

41f817d

Fix launch script

62eb8a1

Fix launch script

97c6e80

Fix launch script

3a3d982

Init parametrization before compile (to enable hooks)

b9deee5

bfloat16 training and fix default port

5707e90

Modify optimizer defaults handling and number of warmup steps

7c1e23b

Fix maxP optimizer param groups

989dbf5

Try to fix torchrun issues

dc6343a

Fix dataset and full activation checkpointing

968c5c2

Fix torch recompile limit

cd4af13

Update training config

c1c252a

Async checkpointing

41c9be9

Add fineweb 100BT dataset

0b86bec

Fix checkpointing

18cefa4

Prefetch dataset

6508da8

Remove checkpointing

bcc54d2

Restore --resume flag in launch_sweep

a926a91

Skip runs with existing checkpoints by default; --resume bypasses the skip to resubmit interrupted training.
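The skip-or-resubmit rule described in this commit can be sketched as follows. The directory layout (a `checkpoint` subdirectory inside each run directory) and both function names are hypothetical; the commit message only states the behaviour, not how `launch_sweep` detects an existing checkpoint.

```python
from pathlib import Path


def has_checkpoint(run_dir: Path) -> bool:
    """Return True if the run directory already contains checkpoint files."""
    ckpt_dir = run_dir / "checkpoint"  # hypothetical layout
    return ckpt_dir.is_dir() and any(ckpt_dir.iterdir())


def should_submit(run_dir: Path, resume: bool = False) -> bool:
    """Decide whether a sweep entry should be submitted.

    By default, runs that already have checkpoints are skipped.
    With resume=True the skip is bypassed, so an interrupted run
    is resubmitted and can continue from its checkpoint.
    """
    if resume:
        return True
    return not has_checkpoint(run_dir)
```

With this rule, re-running the sweep without `--resume` submits only runs that have not started yet, while passing `--resume` also resubmits runs that already have checkpoints.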

Fix FSDP2 stale params via parallelize_llama wrapper

c875060

Force reshard_after_forward=always to prevent no_grad forwards from leaving layers unsharded; remove now-unnecessary _capture_and_resync.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

m-wojnar merged commit 1cfd5f4 into main Apr 28, 2026
5 checks passed

m-wojnar deleted the llm-experiments branch April 28, 2026 15:50

m-wojnar merged 29 commits into main from llm-experiments