
epic: Project Bespoke — Extract mnemonic's own LLM from Gemma 4 31B via structured pruning #386

Description

@CalebisGross

Vision

Mnemonic's "brain" is currently 99% someone else's model (Qwen 3.5 2B) with 1% of our adapters on top. Project Bespoke extracts a purpose-built model that exists because of mnemonic — smaller, faster, and genuinely ours.

The core insight from the Lottery Ticket Hypothesis (Frankle & Carbin, 2019): dense pretrained models contain sparse subnetworks that, trained in isolation, match full-model performance. Applied at scale via Sheared LLaMA (Xia et al., ICLR 2024): structured pruning with learned masks reduced LLaMA2-7B to 1.3B at ~3% of from-scratch training cost, outperforming open models of comparable size.

Target: Take a ~7-9B dense pretrained model → extract a 1.5-2B structured subnetwork optimized for mnemonic's tasks → export as a standalone GGUF → deploy as the daemon's native model.

Why This Matters

| Property | Current (Qwen 2B + spokes) | Bespoke 1.5B |
|---|---|---|
| Identity | "Qwen with adapters" | "Mnemonic's model" |
| Total params | 2B + 25M adapters | ~1.5B standalone |
| Inference VRAM | ~3GB | ~1-1.5GB |
| Encoding latency | ~20s | ~5-8s (est.) |
| Daemon + training | Mutually exclusive (VRAM) | Can coexist |
| Architecture | Frozen base + hooks | Clean single model |
| Spoke overhead | Per-layer injection at inference | Baked in (or minimal) |

Architecture Decision: Dense vs Hybrid Base

Critical finding: Qwen 3.5 9B uses a hybrid architecture (Gated DeltaNet + Gated Attention + Sparse MoE). The structured pruning methods referenced here (Sheared LLaMA, SliceGPT, LLM-Pruner) all assume dense attention + dense FFN, so adapting them to DeltaNet/MoE would be novel research.

Decision needed (Phase 0): Evaluate these candidates:

  1. Qwen 3.5 9B (hybrid) — Richest representations but DeltaNet/MoE complicates pruning
  2. Qwen 2.5 7B (dense) — Pure attention, Sheared LLaMA code works directly, but older generation
  3. Qwen 3.5 4B (dense?) — Need to verify architecture. Smaller starting point = less aggressive pruning needed (4B → 1.5B = 2.7:1 vs 9B → 1.5B = 6:1; see the parameter-count sketch after this list)
  4. Other dense 7-9B models — LLaMA 3 8B, Gemma 2 9B (pure attention, well-studied)
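
To make the ratio comparison concrete, here is a rough parameter-count helper (a sketch only; vocabulary size, GQA head sharing, and norm/bias terms are assumptions that shift the totals by a few hundred million):

```python
def approx_params(layers: int, hidden: int, ffn: int, vocab: int) -> float:
    """Very rough dense-transformer count: tied embeddings, no GQA discount, no norms."""
    attn = 4 * hidden * hidden   # Q, K, V, O projections
    mlp = 3 * hidden * ffn       # gated FFN: gate, up, down projections
    return layers * (attn + mlp) + vocab * hidden

# Phase 2 target shape: ~20 layers, hidden 2048, FFN 5504, ~150K vocab (assumed)
target = approx_params(layers=20, hidden=2048, ffn=5504, vocab=150_000)
print(f"target shape ≈ {target / 1e9:.2f}B params")  # ≈1.3B under these assumptions

# Pruning ratios for the candidate bases against the nominal 1.5B target
for name, base_params in [("9B", 9e9), ("7B", 7e9), ("4B", 4e9)]:
    print(f"{name} -> 1.5B  ratio ≈ {base_params / 1.5e9:.1f}:1")
```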

Research References

  • The Lottery Ticket Hypothesis (Frankle & Carbin, 2019) — foundational theory.
  • Sheared LLaMA (Xia et al., ICLR 2024) — primary method. 7B → 1.3B via targeted structured pruning + continued pretraining; code released.
  • SliceGPT (Ashkboos et al., ICLR 2024) — post-training PCA-based width reduction, no retraining. Good for an initial 20-25% reduction.
  • Wanda (Sun et al., ICLR 2024) — fast pruning via weight × activation magnitude. Unstructured/N:M sparsity only, not applicable for architecture extraction.
  • LLM-Pruner (Ma et al., NeurIPS 2023) — dependency-graph-aware structured pruning + LoRA recovery.

Phases

Phase 0: Feasibility & Architecture Selection (EXP-28)

Goal: Pick the base model and validate that structured pruning works on it.

  • Profile Qwen 3.5 9B, 4B, and one dense alternative (LLaMA 3 8B or Gemma 2 9B) on mnemonic encoding tasks
  • Verify architecture details: which are dense attention, which are hybrid/MoE
  • Run SliceGPT (no retraining) on each candidate at 25% reduction — measure encoding quality retention (evaluation sketch after this list)
  • Estimate MI300X compute budget for full Sheared-LLaMA-style pruning on each
  • Decide: which base model to prune
  • Hardware: MI300X for profiling, local 7800 XT for quality evaluation
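
A minimal sketch of the quality-retention measurement for the SliceGPT step above, assuming the sliced checkpoint was produced separately with the released SliceGPT code; the dataset helper and the sliced-model path are placeholders for mnemonic's v7 encoding data:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def eval_loss(model_path: str, texts: list[str], device: str = "cuda") -> float:
    """Average causal-LM loss over held-out encoding samples."""
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.bfloat16).to(device).eval()
    total = 0.0
    with torch.no_grad():
        for text in texts:
            batch = tok(text, return_tensors="pt", truncation=True,
                        max_length=2048).to(device)
            total += model(**batch, labels=batch["input_ids"]).loss.item()
    return total / max(len(texts), 1)

# samples = load_v7_encoding_samples()                       # hypothetical helper
# base   = eval_loss("Qwen/Qwen2.5-7B", samples)
# sliced = eval_loss("./qwen2.5-7b-sliced-25pct", samples)   # output of SliceGPT
# print(f"base {base:.3f}  sliced {sliced:.3f}  delta {sliced - base:+.3f}")
```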

Phase 1: Full Fine-Tune Baseline (MI300X)

Goal: Establish quality ceiling and collect importance metrics.

  • Full fine-tune chosen model (all params unfrozen) on v7 encoding dataset
  • Collect per-layer importance metrics during training: gradient magnitude, activation variance, attention entropy (hook sketch after this list)
  • Evaluate: encoding quality (7 faithfulness metrics), stress test, eval loss
  • This becomes the quality ceiling — the pruned model must approach this
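
A sketch of the importance-metric collection above, using forward hooks plus a post-backward pass over gradients. Layer access assumes a LLaMA/Qwen-style `model.model.layers` layout; attention entropy additionally requires `output_attentions=True` and is omitted here:

```python
import torch
from collections import defaultdict

grad_norms = defaultdict(list)   # per-layer gradient magnitude over training
act_vars = defaultdict(list)     # per-layer activation variance over training

def register_activation_hooks(model):
    handles = []
    for i, layer in enumerate(model.model.layers):
        def fwd_hook(module, inputs, output, idx=i):
            hidden = output[0] if isinstance(output, tuple) else output
            act_vars[idx].append(hidden.detach().float().var().item())
        handles.append(layer.register_forward_hook(fwd_hook))
    return handles

def log_grad_norms(model):
    """Call after loss.backward() and before optimizer.step()."""
    for i, layer in enumerate(model.model.layers):
        sq = sum(p.grad.norm().item() ** 2
                 for p in layer.parameters() if p.grad is not None)
        grad_norms[i].append(sq ** 0.5)
```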

Phase 2: Structured Pruning (MI300X)

Goal: Find the minimal architecture that maintains encoding quality.

Following Sheared LLaMA methodology:

  • Define target shape: ~1.5B params (~20 layers, hidden 2048, 16 heads, FFN 5504)
  • Learn pruning masks: joint optimization of task loss + pruning objective (~3K steps; mask sketch after this list)
  • Progressive targets: evaluate at 4B, 3B, 2B, 1.5B — find the quality cliff
  • For each target size, record: which layers survived, which heads, which FFN dimensions
  • Key question: Does the pruned 1.5B from 9B beat the existing full 2B on encoding? If not, pruning is not worth it.
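
A minimal sketch of the learned-mask idea (one gate per attention head and per FFN channel, trained against task loss plus a sparsity term). Sheared LLaMA itself uses hard-concrete sampling and Lagrangian constraints on the target shape; this shows only the skeleton:

```python
import torch
import torch.nn as nn

class StructuredMasks(nn.Module):
    def __init__(self, n_layers: int, n_heads: int, ffn_dim: int):
        super().__init__()
        # logits start positive so every head/channel begins near "kept"
        self.head_logits = nn.Parameter(torch.full((n_layers, n_heads), 2.0))
        self.ffn_logits = nn.Parameter(torch.full((n_layers, ffn_dim), 2.0))

    def forward(self):
        return torch.sigmoid(self.head_logits), torch.sigmoid(self.ffn_logits)

    def sparsity_loss(self, target_head_frac: float, target_ffn_frac: float):
        head_mask, ffn_mask = self.forward()
        # push the average kept-fraction toward the target shape
        return ((head_mask.mean() - target_head_frac) ** 2
                + (ffn_mask.mean() - target_ffn_frac) ** 2)

# During pruning: multiply head outputs and FFN channels by the soft masks, then
#   loss = task_loss + lam * masks.sparsity_loss(0.5, 0.4)
# and threshold the masks at the end to obtain the discrete surviving structure.
```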

Phase 3: Continued Pretraining (MI300X)

Goal: Recover quality lost during pruning.

  • Continue pretraining the pruned model on mnemonic's encoding data
  • Dynamic batch loading (per Sheared LLaMA): weight the data mix by per-domain loss (sketch after this list)
  • Evaluate after every 1K steps: faithfulness metrics, stress test
  • Target: match or exceed Phase 1 quality ceiling on encoding tasks
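
A sketch of the dynamic batch-loading rule (sample more from domains whose current loss sits furthest above a reference loss). Domain names and reference values below are placeholders; Sheared LLaMA derives its reference losses from a scaling-law fit:

```python
import numpy as np

def domain_sampling_weights(current_loss: dict[str, float],
                            reference_loss: dict[str, float],
                            temperature: float = 1.0) -> dict[str, float]:
    """Softmax over per-domain excess loss -> sampling probabilities for the next batch."""
    domains = list(current_loss)
    excess = np.array([max(current_loss[d] - reference_loss[d], 0.0) for d in domains])
    probs = np.exp(excess / temperature)
    probs /= probs.sum()
    return dict(zip(domains, probs))

# weights = domain_sampling_weights(
#     current_loss={"encoding": 1.9, "synthesis": 1.4, "retrieval": 1.2},
#     reference_loss={"encoding": 1.5, "synthesis": 1.3, "retrieval": 1.2})
# -> encoding gets the largest share of the next training batch
```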

Phase 4: Lottery Ticket Validation

Goal: Test whether initialization matters (the core LTH claim).

  • Take the pruned architecture from Phase 2
  • Variant A: Keep trained weights (standard pruning)
  • Variant B: Reset surviving weights to the original pretrained initialization, then rerun the encoding training (sketch after this list)
  • Variant C: Random initialization (ablation — should fail, confirming pretrained init matters)
  • Compare A vs B vs C on all metrics
  • If B ≈ A: the initialization IS the value, confirming true lottery ticket
  • If A >> B: the trained weights matter more, standard pruning is sufficient
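
A rough sketch of how Variant B could be assembled: copy weights for the surviving layers from the original pretrained checkpoint into the pruned architecture. Width-pruned matrices would additionally need slicing by the kept head/FFN indices, which is omitted; the model id and kept-layer list are placeholders from Phase 2:

```python
import torch
from transformers import AutoModelForCausalLM

def build_variant_b(pruned_model, original_id: str, kept_layers: list[int]):
    """Reset the pruned architecture to the original pretrained initialization."""
    original = AutoModelForCausalLM.from_pretrained(original_id, torch_dtype=torch.bfloat16)
    for new_idx, old_idx in enumerate(kept_layers):
        src = original.model.layers[old_idx].state_dict()
        dst = pruned_model.model.layers[new_idx]
        dst_shapes = {k: v.shape for k, v in dst.state_dict().items()}
        # copy only tensors whose shapes survived pruning unchanged; width-pruned
        # matrices need slicing by the surviving head/FFN indices instead
        compatible = {k: v for k, v in src.items()
                      if k in dst_shapes and dst_shapes[k] == v.shape}
        dst.load_state_dict(compatible, strict=False)
    return pruned_model
```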

Phase 5: Export & Local Deployment

Goal: Standalone bespoke model running in the daemon.

  • Export pruned model as GGUF (standard format, no adapter hooks needed)
  • Benchmark on RX 7800 XT: tok/s, VRAM, encoding latency (benchmark sketch after this list)
  • Target: >200 tok/s, <1.5GB VRAM, <10s per encoding
  • Integration test with mnemonic daemon (replace llama-server model)
  • Lifecycle test: full 8-phase lifecycle with bespoke model
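
A rough throughput check for the exported GGUF via llama-cpp-python (assuming a ROCm or Vulkan build of llama.cpp on the 7800 XT; the model path and prompt are placeholders):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="./bespoke-1.5b-q8_0.gguf", n_gpu_layers=-1, n_ctx=4096)

prompt = "Encode the following memory: ..."   # representative encoding prompt
start = time.perf_counter()
out = llm(prompt, max_tokens=512, temperature=0.0)
elapsed = time.perf_counter() - start

gen = out["usage"]["completion_tokens"]
print(f"{gen} tokens in {elapsed:.1f}s -> {gen / elapsed:.0f} tok/s")
```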

Phase 6: Felix-LM Integration

Goal: Add task-specific spokes to the bespoke post.

  • Train spoke adapters on the pruned model (encoding, synthesis, retrieval)
  • Test hot-swap capability: switch between spoke sets at inference (adapter-swap sketch after this list)
  • This is the full Felix-LM vision: small bespoke post + swappable spoke tools
  • Final benchmark: the complete system
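
A sketch of the hot-swap on the HF side using PEFT named adapters (spoke paths are placeholders; the GGUF deployment path would instead export each spoke as a llama.cpp-compatible LoRA):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("./bespoke-1.5b")
model = PeftModel.from_pretrained(base, "./spokes/encoding", adapter_name="encoding")
model.load_adapter("./spokes/synthesis", adapter_name="synthesis")
model.load_adapter("./spokes/retrieval", adapter_name="retrieval")

model.set_adapter("encoding")    # active spoke for the encoding phase
# ... run encoding ...
model.set_adapter("retrieval")   # switch spokes without reloading the base model
```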

Hardware Plan

| Phase | Hardware | Estimated Time | Estimated Cost |
|---|---|---|---|
| Phase 0 | MI300X droplet + local | 1-2 days | ~$20-30 |
| Phase 1 | MI300X droplet | 4-8 hours | ~$10-20 |
| Phase 2 | MI300X droplet | 8-16 hours | ~$20-40 |
| Phase 3 | MI300X droplet | 8-24 hours | ~$20-50 |
| Phase 4 | MI300X droplet | 4-8 hours | ~$10-20 |
| Phase 5 | Local 7800 XT | 1-2 hours | $0 |
| Phase 6 | Local 7800 XT | 4-8 hours | $0 |

Total estimated: ~$80-160 in MI300X compute, 1-2 weeks of research time.

Success Criteria

  1. Quality: Pruned model matches or exceeds current Qwen 2B + spokes on all 7 faithfulness metrics + 7/7 stress test
  2. Speed: >200 tok/s inference on RX 7800 XT (current: 95 tok/s)
  3. Size: <1.5GB VRAM for inference (current: ~3GB)
  4. Identity: Standalone GGUF with no external model dependency — this is mnemonic's model

Risks

  1. Hybrid architecture complexity — If the chosen model is DeltaNet/MoE, pruning tool adaptation could take weeks
  2. Quality cliff — The encoding task might not have enough signal to guide structural pruning decisions, leading to arbitrary cuts
  3. Diminishing returns — If the pruned 1.5B doesn't beat the existing full 2B, the entire effort is wasted. Phase 2 has an explicit go/no-go gate for this.
  4. MI300X cost — Iterative pruning experiments could exceed budget. Set hard $150 cap.

Relationship to Other Work

  • EXP-26 (v7 data training) should complete first — establishes the data quality baseline
  • EXP-27 (Qwen 3.5 4B) explores a larger base model, may inform Phase 0 architecture selection
  • EXP-25 (faithfulness probe, #381) — can Qwen 2B spokes learn to encode diverse inputs? Resolved by EXP-25/26; provides the evaluation framework for Project Bespoke
  • Felix-LM design paper (~/Projects/felixlm/docs/felix_lm_design.tex) — this is the path to realizing the post-and-spoke vision

Metadata

Labels

priority:high (Important, fix soon)
research (ML research experiments)
training (Model training, data, and evaluation)
