Fix RoPE dimension split issue for balanced head configurations #7

Description

@chrisjmccormick

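The model currently uses 32 RoPE dims + 0 NoPE dims per head instead of the intended 16/16 split, because of constraints in the HuggingFace RoPE utility function. With 0 NoPE dims, each key consists only of the RoPE part, which is shared across heads, so we are effectively using a single key head instead of the intended multi-head setup.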

Current Impact:

  • The key-value latent space is only used for the value heads; it contributes nothing to the keys
  • The model uses only 1 key head (instead of multiple) but still has 8 query heads
  • Effective configuration: 8 query heads, 1 key head, 8 value heads
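
To make the impact concrete, here is a minimal pure-Python sketch of how per-head keys are assembled in an MLA-style attention layer: a per-head NoPE part concatenated with a RoPE part that is shared by every head. The helper names (`build_keys`, `distinct_key_heads`) are hypothetical and the values are random placeholders; this is not the actual model code, only an illustration of why 0 NoPE dims collapses the keys to one head.

```python
import random


def build_keys(num_heads, qk_nope_head_dim, qk_rope_head_dim, seed=0):
    """Build one token's per-head key vectors (illustrative only).

    Each head gets its own NoPE slice (standing in for the projection
    from the KV latent), followed by a RoPE slice shared by all heads.
    """
    rng = random.Random(seed)
    # One RoPE key vector, reused by every head.
    shared_rope = [rng.random() for _ in range(qk_rope_head_dim)]
    keys = []
    for _ in range(num_heads):
        # Per-head NoPE part; empty when qk_nope_head_dim == 0.
        nope = [rng.random() for _ in range(qk_nope_head_dim)]
        keys.append(tuple(nope + shared_rope))
    return keys


def distinct_key_heads(keys):
    """Count how many heads actually have different key vectors."""
    return len(set(keys))


current = build_keys(num_heads=8, qk_nope_head_dim=0, qk_rope_head_dim=32)
intended = build_keys(num_heads=8, qk_nope_head_dim=16, qk_rope_head_dim=16)

print(distinct_key_heads(current))   # 1: every head sees the same key
print(distinct_key_heads(intended))  # 8: the NoPE part differs per head
```

In both cases each key is 32 dims wide; only the 16/16 split gives the heads distinct keys.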

Tasks:

  • Investigate the HuggingFace RoPE utility function error.
    • Locate the source code (done, see references below).
    • Determine if we can get it working with the right config settings.
  • If not, some options are:
    • Surgical patch to DeepSeekV3 code
    • Replace entire Attention class
    • Create Decoder variant of SubspaceEncoder
  • Test that 16 RoPE / 16 NoPE configuration works correctly
  • Compare performance with current 32/0 setup

References:

  • The rope_type is default, here.
  • The rope init function gets pulled in from here.
  • This is the source of the problem here.
    • It looks like it's designed to support this functionality, but the config settings need to work right.
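
As a starting point for the investigation, the sketch below shows one way the mismatch could arise. It assumes the default RoPE init derives the rotary width from `head_dim` (falling back to `hidden_size // num_attention_heads`) scaled by `partial_rotary_factor`, which is my understanding of some transformers releases and should be checked against the linked source. The function names are hypothetical stand-ins written in plain Python, and `hidden_size=256` is an assumed value chosen only so that 8 heads give 32 dims.

```python
def default_rotary_dim(hidden_size, num_attention_heads,
                       head_dim=None, partial_rotary_factor=1.0):
    """Rotary width under the ASSUMED default-init rule (not the real API)."""
    base_dim = head_dim or hidden_size // num_attention_heads
    return int(base_dim * partial_rotary_factor)


def inv_freq(dim, base=10000.0):
    """Standard RoPE inverse frequencies: one per pair of rotated dims."""
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]


# Assumed sizes: hidden_size=256 is hypothetical, 8 heads is from the issue.
full = default_rotary_dim(hidden_size=256, num_attention_heads=8)
by_head_dim = default_rotary_dim(256, 8, head_dim=16)
by_factor = default_rotary_dim(256, 8, partial_rotary_factor=0.5)

print(full, by_head_dim, by_factor)                 # 32 16 16
print(len(inv_freq(full)), len(inv_freq(by_head_dim)))  # 16 8
```

If the real utility follows a rule like this, getting a 16-dim rotary part would come down to which of these config fields the attention code and the RoPE init each read.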