1. Environment Info
- OS: Ubuntu 22.04.4 LTS
- SimAI Commit: e5d1251
- Python Version: 3.11.12
2. Steps to Reproduce
Step 1: Generate Workload
- Working Directory: /root/simai/SimAI/aicb/
- Command Executed:
  uv run -m workload_generator.SimAI_training_workload_generator \
    --frame=Megatron \
    --world_size=1472 \
    --tensor_model_parallel_size=8 \
    --pipeline_model_parallel=8 \
    --global_batch=32 \
    --micro_batch=1 \
    --epoch_num=1 \
    --model_name=gpt_13B \
    --hidden_size=5120 \
    --num_layers=40 \
    --seq_length=4096 \
    --num_attention_heads=40 \
    --vocab_size=50257 \
    --max_position_embeddings=4096 \
    --ffn_hidden_size=11008 \
    --dtype=bfloat16 \
    --enable_sequence_parallel \
    --swiglu \
    --make_vocab_size_divisible_by=128 \
    --workload_only \
    --output_filename=two_phase_opt_phase1/ws1472-tp8-pp8 \
    --aiob_enable \
    --comp_filepath=/root/simai/SimAI/aicb/workload/aiob_inputs/A100_A800.txt
- Generated Workload: ws1472-tp8-pp8.txt
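As a side check on the Step 1 parameters (not a confirmed cause of the issue), the sketch below derives the data-parallel size implied by the flags above. It assumes the usual Megatron-style convention that DP size = world_size / (TP x PP) and that the global batch should be divisible by micro_batch x DP; whether SimAI/AICB enforces either rule is an assumption here. The function names are hypothetical and are not part of SimAI.

```python
# Sanity check of the parallelism parameters used in Step 1.
# Assumption: dp = world_size / (tp * pp), as in Megatron-style setups.
# The helper names below are illustrative only, not SimAI APIs.

def derive_data_parallel(world_size: int, tp: int, pp: int) -> int:
    """Return the data-parallel size implied by world_size, TP and PP."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size is not divisible by tp * pp")
    return world_size // model_parallel


def check_batch(global_batch: int, micro_batch: int, dp: int):
    """Return (divisible?, micro-batches per DP rank per step)."""
    per_step = micro_batch * dp
    return global_batch % per_step == 0, global_batch / per_step


# Values copied from the command above.
dp = derive_data_parallel(world_size=1472, tp=8, pp=8)
divisible, micro_batches = check_batch(global_batch=32, micro_batch=1, dp=dp)

print(f"data-parallel size: {dp}")
print(f"global_batch divisible by micro_batch * dp: {divisible}")
print(f"micro-batches per DP rank: {micro_batches:.3f}")
```

Under these assumptions the DP size comes out to 23, and a global batch of 32 is not an integer multiple of micro_batch x DP. This may or may not be related to the large exposed DP communication time, but it could be worth ruling out.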
Step 2: Run Analytical Simulation
- Working Directory: /root/simai/SimAI/
- Command Executed:
  /root/simai/SimAI/bin/SimAI_analytical \
    -w /root/simai/SimAI/aicb/results/workload/two_phase_opt_phase1/ws1472-tp8-pp8.txt \
    -g 1472 \
    -g_p_s 16 \
    -n_p_s 5 \
    -r two_phase_opt_phase1/ws1472-tp8-pp8 \
    -g_type A800 \
    -nic 37.47811737589823 \
    -dp_o 0.5 \
    -tp_o 0.7 \
    -ep_o 0.8 \
    -pp_o 0.5
- Result File: ws1472-tp8-pp8EndToEnd.csv
3. The Issue
In the generated EndToEnd.csv file, the Expose DP comm time is exceptionally large. This occurs consistently across many experiments.