Currently using 32 RoPE dims + 0 NoPE dims instead of the intended 16/16 split, due to HuggingFace utility function constraints. With no NoPE dims, we end up with a single key head instead of the intended multi-head setup.
Current Impact:
- Key-value latent space is only being used for value heads
- Model uses only 1 key head (instead of multiple) but still has 8 query heads
- Configuration: 8 query heads, 1 key head, 8 value heads
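To make the head accounting above concrete, here is a minimal sketch. It assumes a DeepSeek-style latent-attention layout, which the issue does not spell out: the RoPE slice of the key is a single vector shared by all query heads, and only the NoPE slice is produced per head from the key-value latent. `HeadLayout` and its methods are hypothetical names for illustration, not part of the model code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadLayout:
    """Hypothetical description of one attention layer's head split."""
    num_heads: int   # query (and value) heads
    rope_dims: int   # per-head dims that receive rotary embeddings
    nope_dims: int   # per-head dims with no positional encoding

    @property
    def head_dim(self) -> int:
        return self.rope_dims + self.nope_dims

    def distinct_key_heads(self) -> int:
        # Assumption: the RoPE key slice is shared across heads, so keys
        # only differ per head if there is a NoPE slice from the latent.
        return self.num_heads if self.nope_dims > 0 else 1

    def latent_feeds_keys(self) -> bool:
        # With 0 NoPE dims the latent contributes nothing to the keys,
        # so it is only used for the value heads.
        return self.nope_dims > 0


current = HeadLayout(num_heads=8, rope_dims=32, nope_dims=0)
intended = HeadLayout(num_heads=8, rope_dims=16, nope_dims=16)

for name, layout in (("current", current), ("intended", intended)):
    print(name, layout.head_dim, layout.distinct_key_heads(),
          layout.latent_feeds_keys())
```

Under these assumptions both layouts keep a 32-dim head, but only the 16/16 split gives each query head its own key and routes the latent into the keys as well as the values.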
Tasks:
References:
- The rope_type is default, here.
- The rope init function gets pulled in from here.
- This is the source of the problem here.
- It looks like it's designed to support this functionality, but the config settings need to be set up correctly.
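For reference, the difference between rotating 32 dims and 16 dims shows up in the size of the inverse-frequency table. The sketch below uses the standard RoPE formula, `inv_freq[i] = base ** (-2i / rotary_dim)`; it is not the HuggingFace init function referenced above, and `rope_inv_freq` and the default `base` of 10000 are assumptions for illustration.

```python
from typing import List


def rope_inv_freq(rotary_dim: int, base: float = 10000.0) -> List[float]:
    """Standard RoPE inverse frequencies for an even number of rotary dims.

    Hypothetical helper: one frequency per rotated pair of dimensions.
    """
    if rotary_dim <= 0 or rotary_dim % 2 != 0:
        raise ValueError("rotary_dim must be a positive even integer")
    return [base ** (-(2 * i) / rotary_dim) for i in range(rotary_dim // 2)]


# Current setup: all 32 head dims are rotated.
full = rope_inv_freq(32)
# Intended setup: only 16 of the 32 head dims are rotated.
partial = rope_inv_freq(16)

print(len(full), len(partial))
```

If the rope init derives the rotary dimension from the full head dimension, it will produce the 32-dim table; getting the 16/16 split means the config has to tell it to rotate only half of each head.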