EXP-28 Phase 0: Feasibility — Profile Gemma 4 31B + Adapt Sheared LLaMA #387

@CalebisGross

Description

Parent: #386 (Project Bespoke epic)

Goal

Validate that structured pruning of Gemma 4 31B is feasible and prepare all tooling for the MI300X pruning session.

Tasks

Research & Code Review

  • Clone/examine Sheared LLaMA codebase
  • Map Sheared LLaMA's pruning mask code to Gemma 4 31B architecture
  • Identify required adaptations: 5:1 sliding/full attention pattern, KV head sharing, dual RoPE
  • Verify Gemma 4 31B layer structure: which layers are sliding window, which are full attention (see the config sketch after this list)
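
A minimal sketch of the layer-structure check, assuming the config exposes a `layer_types` list the way recent Hugging Face Gemma configs do; the model id and the 5:1 fallback pattern are assumptions to verify against the real Gemma 4 release.

```python
# Sketch: list sliding vs. full-attention layers from a Gemma-style config.
# `layer_types` and the model id are assumptions; check the actual Gemma 4 release.
from transformers import AutoConfig

def classify_layers(model_id: str,
                    fallback=("sliding_attention",) * 5 + ("full_attention",)):
    cfg = AutoConfig.from_pretrained(model_id)
    layer_types = getattr(cfg, "layer_types", None)
    if layer_types is None:
        # Fall back to an assumed 5:1 sliding/full repeat if the config has no explicit list.
        layer_types = [fallback[i % len(fallback)] for i in range(cfg.num_hidden_layers)]
    sliding = [i for i, t in enumerate(layer_types) if "sliding" in t]
    full = [i for i, t in enumerate(layer_types) if "sliding" not in t]
    return sliding, full

sliding, full = classify_layers("google/gemma-4-31b")  # hypothetical model id
print(f"sliding ({len(sliding)}): {sliding}")
print(f"full    ({len(full)}): {full}")
```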

Architecture Analysis

  • Load Gemma 4 31B (quantized Q4/Q8 with CPU offload, or on MI300X) and profile (hook sketch after this list):
    • Per-layer activation magnitude on mnemonic encoding tasks
    • Attention entropy per head
    • FFN neuron activation sparsity
    • Gate/residual stream contribution per layer
  • Identify candidate layers for removal (lowest importance)
  • Design target architectures: map 31B (60 layers) → 8B, 4B, 2B, 1.5B shapes
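
A hook-based sketch of the activation profiling, assuming an HF-style module layout (`model.model.layers[i]`, `mlp.act_fn`) like current Gemma releases; the model id, prompt, and dtype/device choices are placeholders, quantization/offload flags are omitted, and per-head attention entropy would need a separate eager-attention pass with `output_attentions=True` (not shown).

```python
# Sketch: per-layer activation statistics via forward hooks on an HF Gemma-style model.
# Module paths (model.model.layers, mlp.act_fn) are assumptions; adjust to the real layout.
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-4-31b"  # hypothetical id; swap in the real checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

stats = defaultdict(lambda: {"act_norm": 0.0, "tokens": 0, "gate_sparsity": []})

def layer_hook(idx):
    # Records the mean L2 norm of each decoder layer's output hidden states.
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        stats[idx]["act_norm"] += hidden.detach().float().norm(dim=-1).sum().item()
        stats[idx]["tokens"] += hidden.shape[0] * hidden.shape[1]
    return hook

def act_hook(idx):
    # Records the fraction of near-zero activations after the MLP gate nonlinearity.
    def hook(module, args, output):
        stats[idx]["gate_sparsity"].append(
            (output.detach().abs() < 1e-3).float().mean().item()
        )
    return hook

handles = []
for i, layer in enumerate(model.model.layers):
    handles.append(layer.register_forward_hook(layer_hook(i)))
    handles.append(layer.mlp.act_fn.register_forward_hook(act_hook(i)))

inputs = tok("example v7 mnemonic encoding prompt", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)

for i in sorted(stats):
    s = stats[i]
    mean_norm = s["act_norm"] / max(s["tokens"], 1)
    mean_sparsity = sum(s["gate_sparsity"]) / max(len(s["gate_sparsity"]), 1)
    print(f"layer {i:02d}  act_norm={mean_norm:8.2f}  gate_sparsity={mean_sparsity:.3f}")

for h in handles:
    h.remove()
```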

MI300X Preparation

  • Write MI300X setup script (install dependencies: Sheared LLaMA, transformers, Gemma 4)
  • Prepare data transfer: v7 encoding dataset → droplet
  • Estimate training time and cost per phase (helper sketch after this list)
  • VRAM budget: verify 31B fits for full fine-tune with gradient checkpointing + Muon optimizer
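
A tiny helper for the time/cost estimate; throughput and pricing must come from a measured run and the actual droplet rate, so the values in the example call are placeholders, not quotes.

```python
# Sketch: back-of-envelope training time and cost for one phase.
# tokens_per_second must come from a measured throughput run on the target hardware,
# and usd_per_hour from the actual MI300X droplet price; example values are placeholders.
def phase_estimate(num_tokens: float, tokens_per_second: float, usd_per_hour: float):
    hours = num_tokens / tokens_per_second / 3600.0
    return hours, hours * usd_per_hour

# Placeholder example: 2B tokens at a hypothetical 5k tok/s on a hypothetical $4/hr droplet.
hours, cost = phase_estimate(num_tokens=2e9, tokens_per_second=5_000, usd_per_hour=4.00)
print(f"~{hours:.1f} GPU-hours, ~${cost:,.0f}")
```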

Sliding/Full Attention Constraint (Carmack flag)

  • Document which of the 60 layers are sliding vs full attention
  • Define pruning constraint: pruned model must preserve at least 1 full-attention layer per N sliding layers for global context (checker sketch below)
  • Test: what happens if you naively remove all full-attention layers? (quality collapse expected)
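
A sketch of the constraint check, reading the constraint as "never more than N consecutive sliding-window layers without an interleaved full-attention layer"; N and the layer-type strings follow the classification sketch above.

```python
# Sketch: verify a proposed keep-list never strings together more than max_sliding_run
# sliding-window layers without a full-attention layer in between.
def check_pruning_constraint(kept_layer_types, max_sliding_run: int = 5) -> bool:
    run = 0
    for layer_type in kept_layer_types:
        if "sliding" in layer_type:
            run += 1
            if run > max_sliding_run:
                return False
        else:
            run = 0  # a full-attention layer resets the sliding run
    return True

# Naively dropping every full-attention layer should fail the check.
assert not check_pruning_constraint(["sliding_attention"] * 12, max_sliding_run=5)
# A preserved 5:1 interleave should pass.
assert check_pruning_constraint((["sliding_attention"] * 5 + ["full_attention"]) * 2)
```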

VRAM Budget Concern

31B params in bf16 ≈ 62GB. With Adam keeping both moments in bf16: params (62GB) + grads (62GB) + optimizer state (2 × 62GB = 124GB) = 248GB > 192GB of MI300X HBM.

Options:

  1. Muon optimizer (momentum-only state for matrices, no second-moment buffer; ~40-50% less optimizer state than Adam) → ~180GB, fits but tight
  2. Gradient checkpointing + activation offload → reduces activation memory
  3. Mixed precision: bf16 params + fp32 optimizer state, partitioned/offloaded with DeepSpeed ZeRO → standard approach, but needs CPU offload to fit on a single GPU
  4. 8-bit Adam (bitsandbytes) → 62 + 62 + 62 = 186GB, fits

Resolve before Phase 1.
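
A quick arithmetic sketch of the options above; bytes-per-parameter figures mirror those assumptions, and activations, temporary buffers, and framework overhead are ignored.

```python
# Sketch: rough VRAM totals for a 31B-parameter full fine-tune, counting only
# params + grads + optimizer state. Bytes/param per component are the assumptions
# stated in the options list, not measured numbers.
PARAMS = 31e9
GB = 1e9

def total_gb(param_bytes: float, grad_bytes: float, opt_bytes: float) -> float:
    return PARAMS * (param_bytes + grad_bytes + opt_bytes) / GB

options = {
    "Adam, bf16 moments":          total_gb(2, 2, 4),      # two bf16 moments -> ~248 GB
    "Muon, bf16 momentum":         total_gb(2, 2, 2),      # single momentum buffer -> ~186 GB
    "8-bit Adam (bitsandbytes)":   total_gb(2, 2, 2),      # two 1-byte moments -> ~186 GB
    "Adam, fp32 master + moments": total_gb(2 + 4, 2, 8),  # DeepSpeed-style, no offload -> ~496 GB
}
for name, gb in options.items():
    verdict = "fits within" if gb <= 192 else "exceeds"
    print(f"{name:28s} {gb:6.0f} GB  ({verdict} 192 GB HBM)")
```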

Definition of Done

  • Sheared LLaMA adaptation code written and tested on a small model (Gemma 4 E2B as proxy)
  • Gemma 4 31B layer importance profile completed
  • Target architectures defined for each pruning level
  • MI300X scripts ready to run
  • VRAM budget resolved

Labels

research (ML research experiments), training (Model training, data, and evaluation)
