EXP-28 Phase 0: Feasibility — Profile Gemma 4 31B + Adapt Sheared LLaMA #387

@CalebisGross

Description

Parent: #386 (Project Bespoke epic)

Goal

Validate that structured pruning of Gemma 4 31B is feasible and prepare all tooling for the MI300X pruning session.

Tasks

Research & Code Review

  • Clone/examine Sheared LLaMA codebase
  • Map Sheared LLaMA's pruning mask code to Gemma 4 31B architecture
  • Identify required adaptations: 5:1 sliding/full attention pattern, KV head sharing, dual RoPE
  • Verify Gemma 4 31B layer structure: which layers are sliding window, which are full attention (see the config sketch after this list)
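
A minimal sketch of the layer-structure check, assuming the config exposes a `layer_types` list the way recent Hugging Face Gemma configs do; the model id and the 5:1 fallback pattern are assumptions to verify against the real Gemma 4 release.

```python
# Sketch: list sliding vs. full-attention layers from a Gemma-style config.
# `layer_types` and the model id are assumptions; check the actual Gemma 4 release.
from transformers import AutoConfig

def classify_layers(model_id: str,
                    fallback=("sliding_attention",) * 5 + ("full_attention",)):
    cfg = AutoConfig.from_pretrained(model_id)
    layer_types = getattr(cfg, "layer_types", None)
    if layer_types is None:
        # Fall back to an assumed 5:1 sliding/full repeat if the config has no explicit list.
        layer_types = [fallback[i % len(fallback)] for i in range(cfg.num_hidden_layers)]
    sliding = [i for i, t in enumerate(layer_types) if "sliding" in t]
    full = [i for i, t in enumerate(layer_types) if "sliding" not in t]
    return sliding, full

sliding, full = classify_layers("google/gemma-4-31b")  # hypothetical model id
print(f"sliding ({len(sliding)}): {sliding}")
print(f"full    ({len(full)}): {full}")
```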

Architecture Analysis

  • Load Gemma 4 31B (quantized Q4/Q8 with CPU offload, or on MI300X) and profile (hook sketch after this list):
    • Per-layer activation magnitude on mnemonic encoding tasks
    • Attention entropy per head
    • FFN neuron activation sparsity
    • Gate/residual stream contribution per layer
  • Identify candidate layers for removal (lowest importance)
  • Design target architectures: map 31B (60 layers) → 8B, 4B, 2B, 1.5B shapes
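
A hook-based sketch of the activation profiling, assuming an HF-style module layout (`model.model.layers[i]`, `mlp.act_fn`) like current Gemma releases; the model id, prompt, and dtype/device choices are placeholders, quantization/offload flags are omitted, and per-head attention entropy would need a separate eager-attention pass with `output_attentions=True` (not shown).

```python
# Sketch: per-layer activation statistics via forward hooks on an HF Gemma-style model.
# Module paths (model.model.layers, mlp.act_fn) are assumptions; adjust to the real layout.
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-4-31b"  # hypothetical id; swap in the real checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

stats = defaultdict(lambda: {"act_norm": 0.0, "tokens": 0, "gate_sparsity": []})

def layer_hook(idx):
    # Records the mean L2 norm of each decoder layer's output hidden states.
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        stats[idx]["act_norm"] += hidden.detach().float().norm(dim=-1).sum().item()
        stats[idx]["tokens"] += hidden.shape[0] * hidden.shape[1]
    return hook

def act_hook(idx):
    # Records the fraction of near-zero activations after the MLP gate nonlinearity.
    def hook(module, args, output):
        stats[idx]["gate_sparsity"].append(
            (output.detach().abs() < 1e-3).float().mean().item()
        )
    return hook

handles = []
for i, layer in enumerate(model.model.layers):
    handles.append(layer.register_forward_hook(layer_hook(i)))
    handles.append(layer.mlp.act_fn.register_forward_hook(act_hook(i)))

inputs = tok("example v7 mnemonic encoding prompt", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)

for i in sorted(stats):
    s = stats[i]
    mean_norm = s["act_norm"] / max(s["tokens"], 1)
    mean_sparsity = sum(s["gate_sparsity"]) / max(len(s["gate_sparsity"]), 1)
    print(f"layer {i:02d}  act_norm={mean_norm:8.2f}  gate_sparsity={mean_sparsity:.3f}")

for h in handles:
    h.remove()
```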

MI300X Preparation

  • Write MI300X setup script (install dependencies: Sheared LLaMA, transformers, Gemma 4)
  • Prepare data transfer: v7 encoding dataset → droplet
  • Estimate training time and cost per phase (helper sketch after this list)
  • VRAM budget: verify 31B fits for full fine-tune with gradient checkpointing + Muon optimizer
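
A tiny helper for the time/cost estimate; throughput and pricing must come from a measured run and the actual droplet rate, so the values in the example call are placeholders, not quotes.

```python
# Sketch: back-of-envelope training time and cost for one phase.
# tokens_per_second must come from a measured throughput run on the target hardware,
# and usd_per_hour from the actual MI300X droplet price; example values are placeholders.
def phase_estimate(num_tokens: float, tokens_per_second: float, usd_per_hour: float):
    hours = num_tokens / tokens_per_second / 3600.0
    return hours, hours * usd_per_hour

# Placeholder example: 2B tokens at a hypothetical 5k tok/s on a hypothetical $4/hr droplet.
hours, cost = phase_estimate(num_tokens=2e9, tokens_per_second=5_000, usd_per_hour=4.00)
print(f"~{hours:.1f} GPU-hours, ~${cost:,.0f}")
```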

Sliding/Full Attention Constraint (Carmack flag)

  • Document which of the 60 layers are sliding vs full attention
  • Define pruning constraint: pruned model must preserve at least 1 full-attention layer per N sliding layers for global context (checker sketch below)
  • Test: what happens if you naively remove all full-attention layers? (quality collapse expected)
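
A sketch of the constraint check, reading the constraint as "never more than N consecutive sliding-window layers without an interleaved full-attention layer"; N and the layer-type strings follow the classification sketch above.

```python
# Sketch: verify a proposed keep-list never strings together more than max_sliding_run
# sliding-window layers without a full-attention layer in between.
def check_pruning_constraint(kept_layer_types, max_sliding_run: int = 5) -> bool:
    run = 0
    for layer_type in kept_layer_types:
        if "sliding" in layer_type:
            run += 1
            if run > max_sliding_run:
                return False
        else:
            run = 0  # a full-attention layer resets the sliding run
    return True

# Naively dropping every full-attention layer should fail the check.
assert not check_pruning_constraint(["sliding_attention"] * 12, max_sliding_run=5)
# A preserved 5:1 interleave should pass.
assert check_pruning_constraint((["sliding_attention"] * 5 + ["full_attention"]) * 2)
```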

VRAM Budget Concern

31B params in bf16 ≈ 62GB. With Adam keeping both moments in bf16: params (62GB) + grads (62GB) + optimizer state (2 × 62GB = 124GB) = 248GB > 192GB of MI300X HBM.

Options:

  1. Muon optimizer (momentum-only state for matrices, no second-moment buffer; ~40-50% less optimizer state than Adam) → ~180GB, fits but tight
  2. Gradient checkpointing + activation offload → reduces activation memory
  3. Mixed precision: bf16 params + fp32 optimizer state, partitioned/offloaded with DeepSpeed ZeRO → standard approach, but needs CPU offload to fit on a single GPU
  4. 8-bit Adam (bitsandbytes) → 62 + 62 + 62 = 186GB, fits

Resolve before Phase 1.
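
A quick arithmetic sketch of the options above; bytes-per-parameter figures mirror those assumptions, and activations, temporary buffers, and framework overhead are ignored.

```python
# Sketch: rough VRAM totals for a 31B-parameter full fine-tune, counting only
# params + grads + optimizer state. Bytes/param per component are the assumptions
# stated in the options list, not measured numbers.
PARAMS = 31e9
GB = 1e9

def total_gb(param_bytes: float, grad_bytes: float, opt_bytes: float) -> float:
    return PARAMS * (param_bytes + grad_bytes + opt_bytes) / GB

options = {
    "Adam, bf16 moments":          total_gb(2, 2, 4),      # two bf16 moments -> ~248 GB
    "Muon, bf16 momentum":         total_gb(2, 2, 2),      # single momentum buffer -> ~186 GB
    "8-bit Adam (bitsandbytes)":   total_gb(2, 2, 2),      # two 1-byte moments -> ~186 GB
    "Adam, fp32 master + moments": total_gb(2 + 4, 2, 8),  # DeepSpeed-style, no offload -> ~496 GB
}
for name, gb in options.items():
    verdict = "fits within" if gb <= 192 else "exceeds"
    print(f"{name:28s} {gb:6.0f} GB  ({verdict} 192 GB HBM)")
```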

Definition of Done

  • Sheared LLaMA adaptation code written and tested on a small model (Gemma 4 E2B as proxy)
  • Gemma 4 31B layer importance profile completed
  • Target architectures defined for each pruning level
  • MI300X scripts ready to run
  • VRAM budget resolved

Labels

research (ML research experiments), training (Model training, data, and evaluation)
