Open
211 commits
73e87ec
testing
LeonGuertler Aug 14, 2024
ed196a5
testing
LeonGuertler Aug 14, 2024
4b7729e
testing
LeonGuertler Aug 14, 2024
f2f1742
testing
LeonGuertler Aug 14, 2024
65e2359
testing
LeonGuertler Aug 14, 2024
1beac8f
testing
LeonGuertler Aug 14, 2024
22813bd
testing
LeonGuertler Aug 14, 2024
883ce58
testing
LeonGuertler Aug 14, 2024
2986036
testing
LeonGuertler Aug 14, 2024
0402752
testing
LeonGuertler Aug 14, 2024
0523c66
testing
LeonGuertler Aug 14, 2024
e2e7d59
testing
LeonGuertler Aug 14, 2024
cf43624
testing
LeonGuertler Aug 14, 2024
b15af1c
testing
LeonGuertler Aug 14, 2024
1b00837
testing
LeonGuertler Aug 14, 2024
42c2351
testing
LeonGuertler Aug 14, 2024
d00207f
testing
LeonGuertler Aug 14, 2024
8930c2a
fix multi-gpu
LeonGuertler Aug 14, 2024
a047cd1
fix multi-gpu
LeonGuertler Aug 14, 2024
9bdd632
fix multi-gpu
LeonGuertler Aug 14, 2024
cffe7c3
fix multi-gpu
LeonGuertler Aug 14, 2024
f258849
added alternative tokenizers
LeonGuertler Aug 14, 2024
752a42b
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
5816705
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
daa5b73
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
112b207
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
4b76217
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
177bda4
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
e0ebf80
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
97bc582
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
c25bd5f
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
de78424
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
4e8e461
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
a486b5e
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
17fb9e2
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
27bfea7
added byte level accuracy and byte-level perplexity
LeonGuertler Aug 16, 2024
858eb83
fixed byte ppl
LeonGuertler Aug 22, 2024
7f290d9
fixed byte ppl
LeonGuertler Aug 22, 2024
8c1d14c
added faster bpe (maybe buggy)
LeonGuertler Aug 27, 2024
0eac6e5
added subsample tokenizer
LeonGuertler Sep 1, 2024
a18332a
added subsample tokenizer
LeonGuertler Sep 1, 2024
a1c3e43
added subsample tokenizer
LeonGuertler Sep 1, 2024
c326313
added subsample tokenizer
LeonGuertler Sep 1, 2024
e26bcc8
added subsample tokenizer
LeonGuertler Sep 1, 2024
4323949
added subsample tokenizer
LeonGuertler Sep 1, 2024
9334c75
added subsample tokenizer(debugging)
LeonGuertler Sep 1, 2024
761c030
added subsample tokenizer(debugging)
LeonGuertler Sep 1, 2024
7c1152b
added subsample tokenizer(debugging)
LeonGuertler Sep 1, 2024
17b7be4
added subsample tokenizer(debugging)
LeonGuertler Sep 1, 2024
968a3fc
added subsample tokenizer(debugging)
LeonGuertler Sep 1, 2024
84046fe
added subsample tokenizer(debugging)
LeonGuertler Sep 1, 2024
d9c0a3d
added subsample tokenizer(debugging)
LeonGuertler Sep 1, 2024
9c9dbb3
added subsample tokenizer(debugging)
LeonGuertler Sep 1, 2024
082a594
added subsample tokenizer(debugging)
LeonGuertler Sep 1, 2024
3cb8199
should be working now
LeonGuertler Sep 1, 2024
e8eb8da
added wide config
LeonGuertler Sep 1, 2024
2d1cb4f
added new tokenizer type
LeonGuertler Sep 3, 2024
5b7896c
added new tokenizer type (debugging)
LeonGuertler Sep 3, 2024
cbe3c13
added new tokenizer type (debugging)
LeonGuertler Sep 3, 2024
9de54bb
added new tokenizer type (debugging)
LeonGuertler Sep 3, 2024
d3428be
added new tokenizer type (debugging)
LeonGuertler Sep 3, 2024
076faf0
added new tokenizer type (debugging)
LeonGuertler Sep 3, 2024
6368e12
added new tokenizer type (debugging)
LeonGuertler Sep 3, 2024
c97d03b
added c-proj sharing
LeonGuertler Sep 4, 2024
42870ef
added bigger runs
LeonGuertler Sep 6, 2024
1b57985
added bigger runs
LeonGuertler Sep 6, 2024
d9766f9
added bigger runs
LeonGuertler Sep 6, 2024
034f64d
added bigger runs
LeonGuertler Sep 6, 2024
644a473
added bigger runs
LeonGuertler Sep 6, 2024
85369c2
added bigger runs
LeonGuertler Sep 7, 2024
ce05eb1
added bigger runs
LeonGuertler Sep 7, 2024
0538690
debugging
LeonGuertler Sep 7, 2024
06a2912
debugging
LeonGuertler Sep 7, 2024
8b7ceab
debugging
LeonGuertler Sep 7, 2024
b74e21a
debugging
LeonGuertler Sep 7, 2024
4ecedbe
changed LoRA implementation and added MoE LoRA
LeonGuertler Sep 9, 2024
c2dea2b
changed LoRA implementation and added MoE LoRA
LeonGuertler Sep 9, 2024
dc6e0d5
changed LoRA implementation and added MoE LoRA
LeonGuertler Sep 9, 2024
ce858fc
updated
LeonGuertler Sep 9, 2024
564813f
added MoE sharing
LeonGuertler Sep 9, 2024
1c7cb1d
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
84f0afd
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
944fc7f
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
87c5e35
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
8c6b33d
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
e9d9ca8
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
95e14fc
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
98f9f33
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
a70f819
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
900defd
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
0915d91
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
1e0cf19
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
c3d55e7
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
a532e83
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
27c7294
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
093d672
added MoE sharing (debugging)
LeonGuertler Sep 9, 2024
153bad0
debugging
LeonGuertler Sep 9, 2024
e868bce
debugging
LeonGuertler Sep 9, 2024
5bb5388
debugging
LeonGuertler Sep 9, 2024
e401051
speed-up
LeonGuertler Sep 9, 2024
17d3de1
added new datasets
LeonGuertler Sep 10, 2024
f2d83f0
added new datasets
LeonGuertler Sep 10, 2024
3b7cb95
added new datasets
LeonGuertler Sep 10, 2024
2ddb966
added new datasets
LeonGuertler Sep 10, 2024
b23a49c
added new datasets
LeonGuertler Sep 10, 2024
be5d6da
added new datasets
LeonGuertler Sep 10, 2024
9464c62
added new datasets
LeonGuertler Sep 10, 2024
2ae61aa
added new datasets
LeonGuertler Sep 10, 2024
707f2fb
added new datasets
LeonGuertler Sep 10, 2024
fb20461
added new datasets
LeonGuertler Sep 10, 2024
67fc68b
added new datasets
LeonGuertler Sep 10, 2024
0859ff1
added new datasets
LeonGuertler Sep 10, 2024
b4a26d4
added new datasets
LeonGuertler Sep 10, 2024
37631ee
added new datasets
LeonGuertler Sep 10, 2024
a304b24
added new datasets
LeonGuertler Sep 10, 2024
01746a2
added new datasets
LeonGuertler Sep 10, 2024
adc3cb6
added new datasets
LeonGuertler Sep 10, 2024
e7a22be
added new datasets
LeonGuertler Sep 10, 2024
8893966
added new datasets
LeonGuertler Sep 10, 2024
b479b14
added new datasets
LeonGuertler Sep 10, 2024
100b9ff
added new datasets
LeonGuertler Sep 10, 2024
2901ebb
added new datasets
LeonGuertler Sep 10, 2024
5d40f4f
added new datasets
LeonGuertler Sep 10, 2024
8e17a33
added new datasets
LeonGuertler Sep 10, 2024
27ff2a5
updated blend of experts model to global
LeonGuertler Sep 10, 2024
b689980
debugging
LeonGuertler Sep 10, 2024
4f82476
debugging
LeonGuertler Sep 10, 2024
931ed4c
debugging
LeonGuertler Sep 10, 2024
0d43610
debugging
LeonGuertler Sep 10, 2024
2dd48a0
debugging
LeonGuertler Sep 10, 2024
a27d793
debugging
LeonGuertler Sep 10, 2024
dcaca0a
more data
LeonGuertler Sep 10, 2024
c53cb25
new weight init for lora
LeonGuertler Sep 11, 2024
d029528
rewrite
LeonGuertler Sep 12, 2024
45641c2
rewrite (debugging)
LeonGuertler Sep 12, 2024
017a648
rewrite (debugging)
LeonGuertler Sep 12, 2024
c5b495e
rewrite (debugging)
LeonGuertler Sep 12, 2024
2158add
rewrite (debugging)
LeonGuertler Sep 12, 2024
50ff420
debugging
LeonGuertler Sep 13, 2024
a5fc661
debugging
LeonGuertler Sep 13, 2024
3b95d9b
debugging
LeonGuertler Sep 13, 2024
c71e2b7
debugging
LeonGuertler Sep 13, 2024
aeb7ec8
debugging
LeonGuertler Sep 13, 2024
df9521d
debugging
LeonGuertler Sep 13, 2024
1bc63ea
debugging
LeonGuertler Sep 13, 2024
9364bf2
debugging
LeonGuertler Sep 13, 2024
f5fd4f2
debugging
LeonGuertler Sep 13, 2024
c449a1e
debugging
LeonGuertler Sep 13, 2024
c3a197f
debugging
LeonGuertler Sep 13, 2024
6f74f1d
debugging
LeonGuertler Sep 13, 2024
29a5ce9
debugging
LeonGuertler Sep 13, 2024
c8340da
tokenizer -filter to simplify
LeonGuertler Sep 13, 2024
31e1edb
debugging
LeonGuertler Sep 13, 2024
b8a643a
debugging
LeonGuertler Sep 13, 2024
706df1e
debugging
LeonGuertler Sep 13, 2024
9670ae4
debugging
LeonGuertler Sep 13, 2024
11d99ee
debugging
LeonGuertler Sep 13, 2024
a58c3cb
debugging
LeonGuertler Sep 13, 2024
5559f30
debugging
LeonGuertler Sep 13, 2024
f5eece4
debugging
LeonGuertler Sep 13, 2024
a820b5a
debugging
LeonGuertler Sep 13, 2024
19a5638
debugging
LeonGuertler Sep 13, 2024
1f5c8a3
debugging
LeonGuertler Sep 13, 2024
7c8c7bd
debugging
LeonGuertler Sep 13, 2024
f2674a3
training is working
LeonGuertler Sep 13, 2024
d7c4349
added byte-level metrics
LeonGuertler Sep 14, 2024
60f7dd5
updated config
LeonGuertler Sep 14, 2024
a19129e
updated config
LeonGuertler Sep 14, 2024
0ee3ef2
updated config
LeonGuertler Sep 14, 2024
92b8d01
updated config
LeonGuertler Sep 14, 2024
87f6fc3
updated config
LeonGuertler Sep 14, 2024
3d18b74
fixed all evals and WandB
LeonGuertler Sep 14, 2024
cc4143c
Added bytelevel() to pre-tokenizer and decoder of tokenizer in BPE
bobbycxy Sep 15, 2024
27515a0
minor changes
LeonGuertler Sep 16, 2024
fa75fce
minor changes
LeonGuertler Sep 16, 2024
7e42b06
minor changes
LeonGuertler Sep 16, 2024
8fd3e1b
changed baseline setup
LeonGuertler Sep 16, 2024
5bbd766
changed baseline setup
LeonGuertler Sep 16, 2024
9399669
Merge branch 'text-modeling-eval' of https://github.com/LeonGuertler/…
bobbycxy Sep 16, 2024
0e99f91
added max new tokens for standard.yaml
bobbycxy Sep 16, 2024
0499041
minor fixes
LeonGuertler Sep 16, 2024
67ce98d
.
LeonGuertler Sep 16, 2024
1a3585f
added entropy based temperature sampling
bobbycxy Sep 16, 2024
a27c435
revert to rndm idx in dataset
LeonGuertler Sep 16, 2024
c1638e6
fixes
LeonGuertler Sep 16, 2024
8e68d5d
adjust hp
LeonGuertler Sep 16, 2024
b42c737
hp
LeonGuertler Sep 16, 2024
d775812
hp
LeonGuertler Sep 16, 2024
2243502
added baseline configs
LeonGuertler Sep 18, 2024
40e3db2
temporarily removed tokenizers
LeonGuertler Sep 18, 2024
c59749d
adding 10B T for Fineweb-edu
bobbycxy Sep 20, 2024
c324fbe
revise max and warmup iters
bobbycxy Sep 21, 2024
d2c2b3e
revised the max and warmup iters, and switched to 10B dataset.
bobbycxy Sep 23, 2024
d0b0196
.item()
LeonGuertler Oct 10, 2024
0482af2
wip
LeonGuertler Oct 10, 2024
f5ad869
wip
LeonGuertler Oct 10, 2024
0477434
wip
LeonGuertler Oct 10, 2024
74c8e8e
wip
LeonGuertler Oct 10, 2024
8cc578f
wip
LeonGuertler Oct 11, 2024
8a29326
wip
LeonGuertler Oct 11, 2024
8d08948
added adaptive sampler with varentropy. Note the addition of attentio…
bobbycxy Oct 11, 2024
93d3330
updates huggingface code to build but not quite wokring
DylanASHillier Oct 11, 2024
cb7a56e
updated loralinear to fix
DylanASHillier Oct 11, 2024
9a50a96
wip
LeonGuertler Oct 11, 2024
a4721df
wip
LeonGuertler Oct 11, 2024
b061547
wip
LeonGuertler Oct 11, 2024
77fc068
wip
LeonGuertler Oct 11, 2024
f958ffa
partially fixes
DylanASHillier Oct 11, 2024
5555a63
Merge branch 'text-modeling-eval' of https://github.com/LeonGuertler/…
DylanASHillier Oct 11, 2024
33ac888
adds llama config
DylanASHillier Oct 11, 2024
96083ae
data generation strategy
bobbycxy Oct 12, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -168,3 +168,5 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

*.pt
235 changes: 235 additions & 0 deletions Non-code/documentation/README.md
@@ -0,0 +1,235 @@
# Configuration Documentation

This documentation details the required inputs for configuring the model, trainer, and general settings. Experimental architectures are not included here, as they are subject to frequent change.

# Model Configuration (`model`)

Conceptually, the model is split into three parts: the `embedder`, the `core_model` [core_model.py], and the `lm_head`. The `embedder` converts strings to token embeddings (it includes both the tokenizer and the token embedder), the `core_model` does most of the 'thinking' (it contains the actual transformer blocks), and the `lm_head` maps the final hidden state back to the vocabulary. The model configuration should contain the following.

## Core Model `core_model_type`

The name of the core-model architecture to be used. Depending on the selected option, additional sub-configurations may be required. The following core-models are available:

- **generic (link the code)**: A basic transformer model with multiple attn+ffn blocks. This can be used for almost all classic architectures.
  - `num_layers`: The number of transformer blocks in the core model.
  - `ffn`: Configuration for the Feedforward Network (FFN). [Link to the sub-config]
  - `attn`: Configuration for the Attention mechanism. [Link to the sub-config]
  - `ffn_weight_tying`: Whether to tie the weights of the FFN layers across the transformer blocks. _[default: false]_
  - `c_proj_weight_tying`: Whether to tie the weights of the `c_proj` linear layer in the attention block across the transformer blocks. _[default: false]_
- **hf_core (link the code)**: This can be used to load Hugging Face models.
  - `freeze_core`: Whether to freeze the core model. _[default: true]_
  - `todo`

## Embedding Model `embedding_model_type`

The name of the embedding model to be used. The available options are:

- **generic (link the code)**: A simple embedding model using `torch.nn.Embedding` to map the tokenizer indices to learned embeddings.
  - `embedding_weight_tying`: Whether to tie the weights of the embedding with those of the language modeling head. _[default: true]_
  - `embedding_dropout`: The dropout rate to be used in the embedding layer. _[default: 0.0]_

## Tokenizer `tokenizer_type`

The name of the tokenizer to be used. Depending on the tokenizer selected, additional variables might be necessary. The options include:

- **GPT-2 (link the code)**: The GPT-2 tokenizer via tiktoken.
- **o200k_base (link the code)**: The GPT-4o tokenizer via tiktoken.
- **cl100k_base (link the code)**: The GPT-4 tokenizer via tiktoken.
- **p50k_base (link the code)**: The davinci tokenizer via tiktoken.
- **llama_32k (link the code)**: The old Llama tokenizer via Hugging Face.
- **opt_50k (link the code)**: The OPT tokenizer via Hugging Face.
- **mistral_32k (link the code)**: The Mistral tokenizer via Hugging Face.
- **bpe (link the code)**: A custom byte-pair encoding tokenizer using the Hugging Face library.
  - `vocab_size`: The size of the vocabulary.
  - `tokenizer_dataset_name`: The name of the dataset used to train the tokenizer.
  - `tokenizer_simplify`: Whether to simplify the tokenizer by removing foreign symbols and forcing digits to be tokenized individually. _[default: true]_

## Hidden Dimension Size `hidden_dim`

The dimensionality of the hidden layers in the model.

## Context Window `context_window`

The size of the context window (sequence length).

## Language Model Head `lm_head_type`

The name of the language model head to be used. The available options are:

- **generic (link the code)**: A simple linear layer to map the hidden state back to the vocabulary.
  - `lm_head_normalization`: The name of the normalization function to be used. The options are listed here _[TODO: link to normalization]_
  - `lm_head_bias`: true/false whether the bias units in the linear layers should be used. _[default: false]_
  - `lm_head_dropout`: The dropout rate to be used in the language modeling head. _[default: 0.0]_

## Model Shell `model_shell_type`

The model shell combines the embedder, core_model and lm_head into a single model. The available options are:

- **standard (link the code)**: A simple model shell that combines the embedder, core_model and lm_head into a single model.

## Positional encoding `positional_encoding_type`

The type of positional encoding to be used. The options include:

- **learned**: A learned positional encoding applied right after token embedding in the embedder.
- **sincos**: A sinusoidal positional encoding applied right after token embedding in the embedder.
- **rope**: A rotary positional encoding applied in the attention blocks.
- **none**: No positional encoding is applied.
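
As a rough illustration of the **sincos** option, the sketch below computes the classic fixed sinusoidal encoding for a single position. The function name `sincos_position`, the `base` of 10000, and the interleaved sin/cos layout are assumptions made for this example; the repository's implementation may arrange the values differently.

```python
import math


def sincos_position(pos: int, hidden_dim: int, base: float = 10000.0) -> list[float]:
    """Sinusoidal encoding for one position (illustrative sketch only)."""
    enc = []
    for i in range(0, hidden_dim, 2):
        # Each (sin, cos) pair uses a wavelength that grows with the dimension index.
        angle = pos / (base ** (i / hidden_dim))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    # Truncate in case hidden_dim is odd.
    return enc[:hidden_dim]


print(sincos_position(0, 4))
```

The resulting vector would be added to the token embedding at the same position; **learned** replaces this fixed table with trainable parameters.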

## The initialization function `initialization_fn`

The function to be used to initialize the model weights. Options include: _[default: kaiming]_

- **xavier**: Xavier initialization
  - `gain`: The gain value for the initialization _[default: 1.0]_
- **kaiming**: Kaiming initialization
  - `mode`: The mode of the initialization _[default: fan_in]_
  - `nonlinearity`: The nonlinearity function to be used _[default: gelu]_
- **none**: No initialization

# TODO: implement the initialization parameters

# Other sub-configs that might be necessary (as specified above)

### Feed Forward Network `ffn` [feedforward.py]

- **generic (link the code)**: A standard FFN block with two linear layers and an activation in between.

  - `ffn_dim`: The feed-forward dimension (commonly 4x `hidden_dim`).
  - `bias`: true/false whether the bias units in the linear layers should be used. _[default: false]_
  - `ffn_dropout`: The dropout rate to be used at the beginning of the feed-forward network. _[default: 0.0]_
  - `activation`: The name of the activation function to be used. Options include: _[default: gelu]_
    - _gelu_: https://pytorch.org/docs/stable/generated/torch.nn.GELU.html
    - _relu_: https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html
    - _leakyrelu_: https://pytorch.org/docs/stable/generated/torch.nn.LeakyReLU.html
    - _tanh_: https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html
    - _sigmoid_: https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html
    - _silu_: https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html
    - _none_: https://pytorch.org/docs/stable/generated/torch.nn.Identity.html
  - `normalization`: The name of the normalization function to be used. The options are listed here _[TODO: link to normalization]_

- **silu_ffn (link the code)**: Introduces a gating mechanism by applying the SiLU activation function to one linear transformation of the input, then multiplying it element-wise with another linear transformation of the same input. This lets the network modulate features dynamically.
  - `ffn_dim`: The feed-forward dimension.
  - `bias`: true/false whether the bias units in the linear layers should be used. _[default: false]_
  - `normalization`: The name of the normalization function to be used. The options are listed here _[TODO: link to normalization]_
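
To make the **silu_ffn** gating concrete, here is a minimal pure-Python sketch under stated assumptions. The names `silu_gated_ffn`, `w_gate`, `w_up` and `w_down` are hypothetical, biases and normalization are omitted, and the final down-projection back to `hidden_dim` follows the common SwiGLU layout rather than anything confirmed by this repository.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight: list[list[float]], vec: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def silu_gated_ffn(x, w_gate, w_up, w_down):
    """Gated FFN sketch: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]  # SiLU on the first projection
    up = matvec(w_up, x)                         # second projection, no activation
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise gating, size ffn_dim
    return matvec(w_down, hidden)                # assumed projection back to hidden_dim


# Tiny example with hidden_dim = 1 and ffn_dim = 1.
print(silu_gated_ffn([1.0], [[1.0]], [[2.0]], [[1.0]]))
```

Compared with the **generic** block, this variant uses three weight matrices instead of two, which is worth keeping in mind when choosing `ffn_dim` for a fixed parameter budget.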

### Attention `attn` [attention.py]

- **causal (link the code)**: A standard MHA block with a causal mask.

  - `num_kv_heads`: The number of K and V heads.
  - `num_q_heads`: The number of Q heads. If GQA is not intended to be used, this can be skipped and will default to `num_kv_heads`. _[default: num_kv_heads]_
  - `bias`: true/false whether the bias units in the linear layers should be used. _[default: false]_
  - `attn_dropout`: The dropout rate to be used at the beginning of the attention block. _[default: 0.0]_
  - `normalization`: The name of the normalization function to be used. The options are listed here _[TODO: link to normalization]_

- **bidirectional (link the code)**: A standard MHA block without a causal mask.
  - `num_kv_heads`: The number of K and V heads.
  - `num_q_heads`: The number of Q heads. If GQA is not intended to be used, this can be skipped and will default to `num_kv_heads`. _[default: num_kv_heads]_
  - `bias`: true/false whether the bias units in the linear layers should be used. _[default: false]_
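
The relationship between `num_q_heads` and `num_kv_heads` can be sketched as follows. The helper names are hypothetical, and the contiguous grouping of query heads onto K/V heads is a common GQA convention assumed here, not something this documentation specifies.

```python
def resolve_num_q_heads(attn_cfg: dict) -> int:
    """Fall back to num_kv_heads when num_q_heads is not given (plain MHA)."""
    return attn_cfg.get("num_q_heads", attn_cfg["num_kv_heads"])


def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the K/V head index shared by a given query head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    # Assumed layout: consecutive query heads share one K/V head.
    return q_head // group_size


cfg = {"num_kv_heads": 2, "num_q_heads": 8}
n_q = resolve_num_q_heads(cfg)
print([kv_head_for_query_head(q, n_q, cfg["num_kv_heads"]) for q in range(n_q)])
```

With `num_q_heads` equal to `num_kv_heads`, every query head gets its own K/V head and the block reduces to ordinary multi-head attention.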

### Normalization `normalization` [normalization.py]

The name of the normalization to be used. Options include: _[default: rms_norm]_

- **rms_norm**: (Root Mean Square Normalization): Normalizes the input based on its root mean square (RMS) to stabilize training by controlling the variance of the activations.
- **layer_norm**: (Layer Normalization): Normalizes the input across features in each layer to ensure consistent activation distributions and improve convergence.
- **none**: No normalization is applied.
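
The difference between the two normalizations can be seen in a small sketch. It omits the learned scale (and, for layer norm, the learned bias) that real implementations usually include, and the `eps` value is an assumption for illustration.

```python
import math


def rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """Divide by the root mean square; the mean is NOT subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def layer_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


features = [1.0, 2.0, 3.0, 4.0]
print(rms_norm(features))
print(layer_norm(features))
```

RMS normalization skips the mean computation, which makes it slightly cheaper per call; the output keeps the sign pattern of the input, whereas layer norm re-centres it around zero.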

(TODO include an example .yaml file and link to it)
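
Until an example file is linked, the following sketch shows how the options above might fit together, written as a Python dictionary that mirrors a YAML config. All values are illustrative, and the nesting as well as the `ffn_type` and `attn_type` key names are assumptions rather than the repository's confirmed schema.

```python
# Top-level keys documented in the sections above.
REQUIRED_MODEL_KEYS = (
    "core_model_type",
    "embedding_model_type",
    "tokenizer_type",
    "hidden_dim",
    "context_window",
    "lm_head_type",
    "model_shell_type",
    "positional_encoding_type",
    "initialization_fn",
)

# Illustrative values only; `ffn_type` / `attn_type` are hypothetical key names.
example_model_cfg = {
    "core_model_type": "generic",
    "num_layers": 8,
    "ffn": {"ffn_type": "generic", "ffn_dim": 2048, "bias": False,
            "ffn_dropout": 0.0, "activation": "gelu", "normalization": "rms_norm"},
    "attn": {"attn_type": "causal", "num_kv_heads": 8, "bias": False,
             "attn_dropout": 0.0, "normalization": "rms_norm"},
    "embedding_model_type": "generic",
    "tokenizer_type": "cl100k_base",
    "hidden_dim": 512,
    "context_window": 512,
    "lm_head_type": "generic",
    "model_shell_type": "standard",
    "positional_encoding_type": "rope",
    "initialization_fn": "kaiming",
}


def missing_model_keys(model_cfg: dict) -> list[str]:
    """List the documented top-level keys that a config does not define."""
    return [key for key in REQUIRED_MODEL_KEYS if key not in model_cfg]


print(missing_model_keys(example_model_cfg))
```

A check like `missing_model_keys` is only a convenience for catching typos early; it does not validate the sub-configs.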

# Trainer Configuration (`trainer`)

The trainer configuration contains the settings for training the model. There are a number of main trainers to choose from, each with their own sub-configurations. The main trainers are:

## Base Trainer

The base trainer is a standard trainer that can be used for most standard language model training tasks (like autoregressive pre-training, MLM-based pre-training, fine-tuning, etc.). To use the base trainer, just set `trainer_type` to **base_trainer**. The following are the sub-configurations for the base trainer:

- `dataset`: The name of the dataset to be used for training.
- `batch_size`: The batch size to be used for training.
- `gradient_accumulation_steps`: The number of gradient accumulation steps.
- `max_iters`: The maximum number of iterations.
- `decay_lr`: Whether the learning rate should be decayed during training. _[default: true]_
- `lr_decay_iters`: The number of iterations over which the learning rate is decayed. _[default: max_iters]_
- `warmup_iters`: The number of warmup iterations.
- `eval_interval`: The number of steps after which the model goes through the inter-training evaluation. _[default: 2000]_
- `log_interval`: The number of steps after which the model logs the training progress. _[default: 10]_
- `checkpoint_interval`: The number of steps after which the model saves a checkpoint. _[default: 10000]_
- `lr_scheduler`: The learning rate scheduler to be used. Options include:
  - **cosine**: Cosine learning rate decay.
- `dataloader`: The dataloader to be used. Options include:
  - **standard**: Standard dataloader.
- `datasampling`: The data sampling strategy to be used. Options include:
  - **standard**: Standard data sampling.
- `loss_fn`: The loss function to be used. Options include:
  - **cross_entropy**: Cross-entropy loss.
  - **next_token_mlm**: The masked language modeling next token loss.
- `eval`: The evaluation sub-config _[TODO: link to the sub-config]_
- `optimizer`: The optimizer sub-config _[TODO: link to the sub-config]_
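
To show how `warmup_iters`, `lr_decay_iters` and the **cosine** scheduler could interact, here is a sketch of a warmup-then-cosine schedule. It assumes a linear warmup from zero and takes `lr` and `min_lr` from the optimizer sub-config; the function name and the exact warmup shape are assumptions, and the trainer's actual schedule may differ.

```python
import math


def learning_rate(it: int, lr: float = 6e-4, min_lr: float = 6e-6,
                  warmup_iters: int = 100, lr_decay_iters: int = 1000) -> float:
    """Linear warmup followed by cosine decay down to min_lr (sketch)."""
    if it < warmup_iters:
        # Assumed linear warmup starting at zero.
        return lr * it / warmup_iters
    if it >= lr_decay_iters:
        # After the decay window the rate stays at its floor.
        return min_lr
    progress = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (lr - min_lr)


for step in (0, 50, 100, 550, 1000):
    print(step, learning_rate(step))
```

With `decay_lr` disabled, a trainer would presumably skip this function and use `lr` throughout.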

### Evaluation Sub-config (`eval`)

It can be important to estimate model performance during training. To that end, you can specify the evaluation sub-config.

- `mcq_benchmarks`: A list of benchmarks to evaluate on. The available benchmarks are:
  - **winogrande**
  - **hellaswag**
  - **arc_easy**
  - **mmlu**
  - **blimp**
  - **truthful_qa**
  - **piqa**
  - **race_middle**
  - **race_high**
  - **boolq**
  - **openbook_qa_closed**
  - **openbook_qa_open**
  - **copa**
  - **commonsense_qa**
  - **stlm_eval_arc_easy**
  - **stlm_eval_hellaswag**
  - **stlm_eval_truthful_qa**
  - **stlm_eval_winogrande**
- `num_samples`: The number of samples for evaluation. _[default: 1000]_
- `dataset_loss_eval_iters`: The number of iterations after which the dataset loss is evaluated. _[default: 1000]_
- `text_modeling_eval`: true/false whether to evaluate the text modeling capacity. _[default: true]_

### Optimizer Sub-config (`optimizer`)

Each optimizer requires a number of different values to be specified. Here are the optimizer options:

- **AdamW** (`optimizer_name`): The basic AdamW optimizer
  - `lr`: The learning rate _[default: 6e-4]_
  - `min_lr`: The minimum learning rate _[default: 6e-6]_
  - `weight_decay`: The weight decay factor _[default: 0.01]_
  - `beta1`: The beta1 value for the Adam optimizer _[default: 0.9]_
  - `beta2`: The beta2 value for the Adam optimizer _[default: 0.999]_
  - `grad_clip`: The gradient clipping value _[default: 1.0]_

# General Configuration (`general`)

## Logging (`logging`)

Defines logging options.

- **wandb_log**: Boolean flag to enable/disable logging to Weights & Biases. _[default: false]_
- **wandb_project**: Name of the Weights & Biases project. _[default: SuperTinyLanguageModels]_
- **wandb_run_name**: Custom name for the run. Setting it is highly encouraged. If not provided, a name that is as descriptive as possible is used. _[default: None]_

## Paths (`paths`)

Defines paths for saving outputs, data, checkpoints, and evaluations.

- **output_dir**: Directory for output files.
- **data_dir**: Directory for data files.
- **checkpoint_dir**: Directory for checkpoints.
- **eval_dir**: Directory for evaluation files.

## Seed (`seed`)

Defines the random seed used for reproducibility.

## Device (`device`)

Defines the device for training and evaluation. _[default: cuda]_
9 changes: 9 additions & 0 deletions README.md
@@ -1,5 +1,14 @@
# Work in Progress
## TODO
- change the generators to accept custom default values on initialization
- train Llama-3.2 as a value model
- standalone eval script for model (MATH) for easy search and prompting
- add start thought and end thought tokens to Llama
- fix the print_evaluation_results function to be pretty again (maybe return eval results as dicts, and process to wandb string afterwards)

# Super Tiny Language Models


[Model Interfaces](models/README.md) | [Full Configs](configs/full_configs/) | [Intro Paper](https://arxiv.org/abs/2405.14159) | [Discord](https://discord.gg/wwTruDPH)

This GitHub repository presents our research on Super Tiny Language Models (STLMs), aimed at delivering high performance with significantly reduced parameter counts (90-95% smaller) compared to traditional large language models. We explore innovative techniques such as byte-level tokenization with pooling, weight tying, and efficient training strategies. The codebase covers various subproblems, including tokenizer-free models, self-play based training, and alternative training objectives, targeting models with 10M, 50M, and 100M parameters while maintaining competitive performance.
26 changes: 26 additions & 0 deletions check_size.py
@@ -0,0 +1,26 @@
"""
Check model parameter count
"""
import hydra
from models.build_models import build_model
from models.utils import print_model_stats


@hydra.main(config_path="configs/train", config_name="baseline-10m")
def main(cfg):
    # If the config has a single top-level key, unwrap it.
    if len(cfg) == 1:
        cfg = cfg[list(cfg.keys())[0]]
    # Unwrap the "full_configs" group if present.
    if "full_configs" in cfg:
        cfg = cfg["full_configs"]

    # build_model returns a tuple; only the model is needed here
    model, _ = build_model(model_cfg=cfg["model"])

    # print full parameter count
    print_model_stats(model)


if __name__ == "__main__":
    # pylint: disable=no-value-for-parameter
    main()
    # pylint: enable=no-value-for-parameter
85 changes: 0 additions & 85 deletions configs/full_configs/baseline.yaml

This file was deleted.
