`config.yaml`, as it stands, is a sample of the configuration used when you run `examples/llama_example.py`. It is what we have used in the making of this framework and in our research projects. The configurable options are described below.
- `model`: The directory containing the model and tokenizer. It is assumed that the model and tokenizer files are available in this directory. On the Vector cluster, you can find public models stored under `/model-weights`.
- `enable_wandb_logging`: Whether you would like to use Weights & Biases (W&B) for logging.
- `wandb_config`: The key-value pairs stored under this section are passed directly into the `wandb.init` method when the script runs, so they must be spelled exactly as `wandb.init` expects. The whole config file is also logged as part of the run.
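For instance, the model and W&B options might be filled in as below; the model path, project, and run names are illustrative placeholders, not required values:

```yaml
model: /model-weights/Llama-2-7b-hf   # illustrative; any directory with model and tokenizer files
enable_wandb_logging: True

wandb_config:
  # Everything under this section is forwarded verbatim to wandb.init(...)
  project: my-project        # illustrative
  name: llama-example-run    # illustrative
```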
- `output_dir`: The directory that stores checkpointed states and the final consolidated model.
- `max_seq_len`: The maximum sequence length used during training. This should be less than or equal to the model's maximum possible sequence length.
- `epochs`: The number of epochs to train over.
- `seed`: The random seed. All devices are set to this seed before anything else runs.
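A sketch of these general training options, with illustrative values:

```yaml
output_dir: /scratch/my-user/llama-finetune   # illustrative; holds checkpoints and the final model
max_seq_len: 1024    # must not exceed the model's maximum sequence length
epochs: 1
seed: 11
```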
- `sharding_strategy`: The FSDP sharding strategy to use. This should be one of the options available in PyTorch's FSDP module. For the guidance below, let X = {model parameters, gradients, optimizer states, activations}. A minimal example follows this list.
    - If you're training on a single node and X can fit on a single GPU, use `NO_SHARD`. This is essentially Distributed Data Parallel (DDP): you are only parallelizing data batches across GPUs.
    - If you're training on a single node and X cannot fit on a single GPU, we recommend `FULL_SHARD`. This is regular FSDP, and X (except for activations) is evenly sharded across GPUs. Activations are local to each GPU and are not sharded by FSDP. Check the pseudocode here for more information.
    - If you're training on multiple nodes, we recommend `HYBRID_SHARD`. This is essentially `FULL_SHARD` within a node, with replication across nodes (DDP). The benefit is that the expensive all-gathers and reduce-scatters from FSDP happen over the much faster intranode connectivity, while internode communication is limited to all-reduces for gradients.
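A minimal sketch of this option; the value is one of the strategies discussed above:

```yaml
# Must name a strategy exposed by PyTorch FSDP (torch.distributed.fsdp.ShardingStrategy),
# e.g. NO_SHARD, FULL_SHARD, or HYBRID_SHARD.
sharding_strategy: FULL_SHARD
```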
- `use_mp`: Whether to use mixed precision. This is done using bf16.
- `use_activation_checkpointing`: Whether to use activation checkpointing. This greatly reduces the memory footprint, since only a few intermediate activations are saved during the forward pass and the rest are recomputed on the fly during the backward pass. The compute-versus-memory tradeoff usually makes this worth it.
- `use_flash_attention`: Whether to use Flash Attention. If it is supported for your model in HuggingFace, you can enable this option.
- `low_cpu_mem_usage`: Whether to load the model efficiently. If enabled, the model weights are loaded only once on rank 0 and are broadcast to the rest of the world from the main rank. This prevents CPU memory from exploding when loading large models (e.g. LLaMa-70B).
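These flags are simple booleans; for example (all values illustrative):

```yaml
use_mp: True
use_activation_checkpointing: True
use_flash_attention: True
low_cpu_mem_usage: True
```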
- `lora_peft_config`: Optionally, fine-tune the model using low-rank adaptation (LoRA) via HuggingFace PEFT. Uncomment this section to enable LoRA. All parameters specified under this section are forwarded to `peft.LoraConfig`.
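A hedged sketch of what this section might contain; the keys below are standard `peft.LoraConfig` arguments, and the values are illustrative rather than recommendations:

```yaml
lora_peft_config:
  task_type: CAUSAL_LM
  inference_mode: False
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
```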
- `max_grad_norm`: The maximum gradient norm used for gradient clipping.
- `gradient_accumulation_steps`: The number of training steps that we accumulate gradients for in order to achieve a certain global batch size, where `global_batch_size = n_gpus * micro_batch_size * gradient_accumulation_steps`. Note that the micro-batch size is synonymous with the batch size used per GPU.
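For example, with 4 GPUs, a per-GPU (micro) batch size of 8, and the illustrative setting below, the global batch size works out to 4 * 8 * 4 = 128:

```yaml
max_grad_norm: 1
gradient_accumulation_steps: 4
```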
The keyword parameters under the optimizer section are fed directly into a user-defined optimizer. As with `wandb_config` above, they must be spelled exactly as the optimizer expects. See `examples/llama_example.py` for an example.
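As a sketch, if the user-defined optimizer were AdamW, this section might look like the following; the section name and all values here are illustrative, and the keys must match the optimizer's constructor arguments:

```yaml
optimizer:
  lr: 2.0e-5
  weight_decay: 0.1
  betas: [0.9, 0.95]
  eps: 1.0e-5
```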
- `lr_scheduler_type`: This can either be our custom `plataeu-with-warmup` scheduler or one of the options HuggingFace offers (check the enumeration of `transformers.SchedulerType` for what can be passed here). Note that `plataeu-with-warmup` is the usual reduce-on-plateau scheduler, except that it starts off with a linear warmup for a given number of training steps.
- `warmup_ratio`: The ratio of the total number of training steps that is spent on the linear warmup. This should be between 0 and 1.
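For example (`cosine` is one of the `transformers.SchedulerType` options; both values are illustrative):

```yaml
lr_scheduler_type: cosine
warmup_ratio: 0.05
```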
- `checkpointing_enabled`: Whether to enable state checkpointing during training.
- `logging_steps`: How often evaluation is run using the evaluation dataset.
- `save_frequency`: The frequency at which checkpointing occurs. This must be between 0 and 1.
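Illustrative values for these options:

```yaml
checkpointing_enabled: True
logging_steps: 500     # run evaluation every 500 training steps
save_frequency: 0.25
```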
- `ignore_index`: The integer index used to ignore a given token in the loss calculation. Cross-entropy loss uses `-100` by default.
- `eval_bs`: Per-GPU evaluation batch size.
- `train_bs`: Per-GPU training batch size.
- `train_ds`: Path to the preprocessed training dataset.
- `eval_ds`: Path to the preprocessed evaluation dataset.
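A sketch of the dataset options; the paths are placeholders:

```yaml
ignore_index: -100
train_bs: 8
eval_bs: 8
train_ds: /path/to/preprocessed/train_dataset   # illustrative
eval_ds: /path/to/preprocessed/eval_dataset     # illustrative
```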
- `ignore_index`: The integer index used to ignore a given token in the loss calculation. Cross-entropy loss uses `-100` by default.
- `dataset_format`: Here for forward compatibility.
- `data_field`: The field in the dataset that will be used for training.
- `packing_type`: Either `full` or `partial`. `full` packing concatenates the whole dataset and then chunks it. `partial` packing chunks each individual data point (which can span multiple context lengths). Note: while packing, there are sometimes multiple tokens that should not be broken up; if they are, the decoded text ends up being prepended with `##`. There isn't a fix for this yet, but it is something to keep in mind.
- `overlap`: When we chunk a data point during packing, we can choose to have some overlap between the current chunk and the next chunk. This might help the model understand surrounding context during training (we haven't empirically investigated this, but we keep the option available to users).
- `add_bos_eos_tokens`: Whether to add the `BOS` and `EOS` tokens defined by the respective HuggingFace tokenizer. If packing is used, these are added after packing is done, so that each chunk of size `max_seq_len` has these tokens.
- `from_disk`: Whether the dataset to preprocess will be loaded from disk (the other option is to download it straight from HuggingFace).
- `seperator`: Used for conditional finetuning: in a given data point, everything before the separator will not be used for calculating the loss, and its labels will be set to `ignore_index`. Note: if the separator is not found in a given sequence, the default behavior is to skip that data point so it is not part of the final set.
- `load_path`: The directory containing the HuggingFace dataset we are loading to preprocess.
- `split`: If `load_path` is a dataset dictionary, `split` specifies which key in this dictionary contains the dataset we are preprocessing.
- `save_path`: The directory we will be saving the processed dataset to.
- `truncate`: Whether or not to truncate (instead of pack) data points that exceed `max_seq_len`.
- `pre_pend`: An optional string to prepend to our data points before tokenizing.
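Finally, a hedged sketch of the preprocessing options; every value and path below is illustrative:

```yaml
data_field: text
packing_type: partial                   # or "full"
overlap: 0
add_bos_eos_tokens: True
from_disk: False
seperator: "### Response:"              # only needed for conditional finetuning; illustrative
load_path: /path/to/raw/hf_dataset      # illustrative
split: train
save_path: /path/to/processed_dataset   # illustrative
truncate: False
pre_pend: ""
```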