
epic: Project Bespoke — Extract mnemonic's own LLM from Gemma 4 31B via structured pruning #386

Description

@CalebisGross

Vision

Mnemonic's "brain" is currently 99% someone else's model (Qwen 3.5 2B) with 1% of our adapters on top. Project Bespoke extracts a purpose-built model that exists because of mnemonic — smaller, faster, and genuinely ours.

The core insight from the Lottery Ticket Hypothesis (Frankle & Carbin, 2019): dense pretrained models contain sparse subnetworks that, trained in isolation, match full-model performance. Applied at scale via Sheared LLaMA (Xia et al., ICLR 2024): structured pruning with learned masks reduced LLaMA2-7B to 1.3B at ~3% of from-scratch training cost, outperforming open models of comparable size.

Target: Take a ~7-9B dense pretrained model → extract a 1.5-2B structured subnetwork optimized for mnemonic's tasks → export as a standalone GGUF → deploy as the daemon's native model.

Why This Matters

| Property | Current (Qwen 2B + spokes) | Bespoke 1.5B |
|---|---|---|
| Identity | "Qwen with adapters" | "Mnemonic's model" |
| Total params | 2B + 25M adapters | ~1.5B standalone |
| Inference VRAM | ~3GB | ~1-1.5GB |
| Encoding latency | ~20s | ~5-8s (est.) |
| Daemon + training | Mutually exclusive (VRAM) | Can coexist |
| Architecture | Frozen base + hooks | Clean single model |
| Spoke overhead | Per-layer injection at inference | Baked in (or minimal) |

Architecture Decision: Dense vs Hybrid Base

Critical finding: Qwen 3.5 9B uses a hybrid architecture (Gated DeltaNet + Gated Attention + Sparse MoE). The structured pruning methods referenced here (Sheared LLaMA, SliceGPT, LLM-Pruner) all assume dense attention + dense FFN, so adapting them to DeltaNet/MoE would be novel research.

Decision needed (Phase 0): Evaluate these candidates:

  1. Qwen 3.5 9B (hybrid) — Richest representations but DeltaNet/MoE complicates pruning
  2. Qwen 2.5 7B (dense) — Pure attention, Sheared LLaMA code works directly, but older generation
  3. Qwen 3.5 4B (dense?) — Need to verify architecture. Smaller starting point = less aggressive pruning needed (4B → 1.5B = 2.7:1 vs 9B → 1.5B = 6:1; see the parameter-count sketch after this list)
  4. Other dense 7-9B models — LLaMA 3 8B, Gemma 2 9B (pure attention, well-studied)
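
To make the ratio comparison concrete, here is a rough parameter-count helper (a sketch only; vocabulary size, GQA head sharing, and norm/bias terms are assumptions that shift the totals by a few hundred million):

```python
def approx_params(layers: int, hidden: int, ffn: int, vocab: int) -> float:
    """Very rough dense-transformer count: tied embeddings, no GQA discount, no norms."""
    attn = 4 * hidden * hidden   # Q, K, V, O projections
    mlp = 3 * hidden * ffn       # gated FFN: gate, up, down projections
    return layers * (attn + mlp) + vocab * hidden

# Phase 2 target shape: ~20 layers, hidden 2048, FFN 5504, ~150K vocab (assumed)
target = approx_params(layers=20, hidden=2048, ffn=5504, vocab=150_000)
print(f"target shape ≈ {target / 1e9:.2f}B params")  # ≈1.3B under these assumptions

# Pruning ratios for the candidate bases against the nominal 1.5B target
for name, base_params in [("9B", 9e9), ("7B", 7e9), ("4B", 4e9)]:
    print(f"{name} -> 1.5B  ratio ≈ {base_params / 1.5e9:.1f}:1")
```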

Research References

  • The Lottery Ticket Hypothesis (Frankle & Carbin, 2019) — foundational theory.
  • Sheared LLaMA (Xia et al., ICLR 2024) — primary method. 7B → 1.3B via targeted structured pruning + continued pretraining; code released.
  • SliceGPT (Ashkboos et al., ICLR 2024) — post-training PCA-based width reduction, no retraining. Good for an initial 20-25% reduction.
  • Wanda (Sun et al., ICLR 2024) — fast pruning via weight × activation magnitude. Unstructured/N:M sparsity only, not applicable for architecture extraction.
  • LLM-Pruner (Ma et al., NeurIPS 2023) — dependency-graph-aware structured pruning + LoRA recovery.

Phases

Phase 0: Feasibility & Architecture Selection (EXP-28)

Goal: Pick the base model and validate that structured pruning works on it.

  • Profile Qwen 3.5 9B, 4B, and one dense alternative (LLaMA 3 8B or Gemma 2 9B) on mnemonic encoding tasks
  • Verify architecture details: which are dense attention, which are hybrid/MoE
  • Run SliceGPT (no retraining) on each candidate at 25% reduction — measure encoding quality retention (evaluation sketch after this list)
  • Estimate MI300X compute budget for full Sheared-LLaMA-style pruning on each
  • Decide: which base model to prune
  • Hardware: MI300X for profiling, local 7800 XT for quality evaluation
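
A minimal sketch of the quality-retention measurement for the SliceGPT step above, assuming the sliced checkpoint was produced separately with the released SliceGPT code; the dataset helper and the sliced-model path are placeholders for mnemonic's v7 encoding data:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def eval_loss(model_path: str, texts: list[str], device: str = "cuda") -> float:
    """Average causal-LM loss over held-out encoding samples."""
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.bfloat16).to(device).eval()
    total = 0.0
    with torch.no_grad():
        for text in texts:
            batch = tok(text, return_tensors="pt", truncation=True,
                        max_length=2048).to(device)
            total += model(**batch, labels=batch["input_ids"]).loss.item()
    return total / max(len(texts), 1)

# samples = load_v7_encoding_samples()                       # hypothetical helper
# base   = eval_loss("Qwen/Qwen2.5-7B", samples)
# sliced = eval_loss("./qwen2.5-7b-sliced-25pct", samples)   # output of SliceGPT
# print(f"base {base:.3f}  sliced {sliced:.3f}  delta {sliced - base:+.3f}")
```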

Phase 1: Full Fine-Tune Baseline (MI300X)

Goal: Establish quality ceiling and collect importance metrics.

  • Full fine-tune chosen model (all params unfrozen) on v7 encoding dataset
  • Collect per-layer importance metrics during training: gradient magnitude, activation variance, attention entropy (hook sketch after this list)
  • Evaluate: encoding quality (7 faithfulness metrics), stress test, eval loss
  • This becomes the quality ceiling — the pruned model must approach this
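
A sketch of the importance-metric collection above, using forward hooks plus a post-backward pass over gradients. Layer access assumes a LLaMA/Qwen-style `model.model.layers` layout; attention entropy additionally requires `output_attentions=True` and is omitted here:

```python
import torch
from collections import defaultdict

grad_norms = defaultdict(list)   # per-layer gradient magnitude over training
act_vars = defaultdict(list)     # per-layer activation variance over training

def register_activation_hooks(model):
    handles = []
    for i, layer in enumerate(model.model.layers):
        def fwd_hook(module, inputs, output, idx=i):
            hidden = output[0] if isinstance(output, tuple) else output
            act_vars[idx].append(hidden.detach().float().var().item())
        handles.append(layer.register_forward_hook(fwd_hook))
    return handles

def log_grad_norms(model):
    """Call after loss.backward() and before optimizer.step()."""
    for i, layer in enumerate(model.model.layers):
        sq = sum(p.grad.norm().item() ** 2
                 for p in layer.parameters() if p.grad is not None)
        grad_norms[i].append(sq ** 0.5)
```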

Phase 2: Structured Pruning (MI300X)

Goal: Find the minimal architecture that maintains encoding quality.

Following Sheared LLaMA methodology:

  • Define target shape: ~1.5B params (~20 layers, hidden 2048, 16 heads, FFN 5504)
  • Learn pruning masks: joint optimization of task loss + pruning objective (~3K steps; mask sketch after this list)
  • Progressive targets: evaluate at 4B, 3B, 2B, 1.5B — find the quality cliff
  • For each target size, record: which layers survived, which heads, which FFN dimensions
  • Key question: Does the pruned 1.5B from 9B beat the existing full 2B on encoding? If not, pruning is not worth it.
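
A minimal sketch of the learned-mask idea (one gate per attention head and per FFN channel, trained against task loss plus a sparsity term). Sheared LLaMA itself uses hard-concrete sampling and Lagrangian constraints on the target shape; this shows only the skeleton:

```python
import torch
import torch.nn as nn

class StructuredMasks(nn.Module):
    def __init__(self, n_layers: int, n_heads: int, ffn_dim: int):
        super().__init__()
        # logits start positive so every head/channel begins near "kept"
        self.head_logits = nn.Parameter(torch.full((n_layers, n_heads), 2.0))
        self.ffn_logits = nn.Parameter(torch.full((n_layers, ffn_dim), 2.0))

    def forward(self):
        return torch.sigmoid(self.head_logits), torch.sigmoid(self.ffn_logits)

    def sparsity_loss(self, target_head_frac: float, target_ffn_frac: float):
        head_mask, ffn_mask = self.forward()
        # push the average kept-fraction toward the target shape
        return ((head_mask.mean() - target_head_frac) ** 2
                + (ffn_mask.mean() - target_ffn_frac) ** 2)

# During pruning: multiply head outputs and FFN channels by the soft masks, then
#   loss = task_loss + lam * masks.sparsity_loss(0.5, 0.4)
# and threshold the masks at the end to obtain the discrete surviving structure.
```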

Phase 3: Continued Pretraining (MI300X)

Goal: Recover quality lost during pruning.

  • Continue pretraining the pruned model on mnemonic's encoding data
  • Dynamic batch loading (per Sheared LLaMA): weight the data mix by per-domain loss (sketch after this list)
  • Evaluate after every 1K steps: faithfulness metrics, stress test
  • Target: match or exceed Phase 1 quality ceiling on encoding tasks
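
A sketch of the dynamic batch-loading rule (sample more from domains whose current loss sits furthest above a reference loss). Domain names and reference values below are placeholders; Sheared LLaMA derives its reference losses from a scaling-law fit:

```python
import numpy as np

def domain_sampling_weights(current_loss: dict[str, float],
                            reference_loss: dict[str, float],
                            temperature: float = 1.0) -> dict[str, float]:
    """Softmax over per-domain excess loss -> sampling probabilities for the next batch."""
    domains = list(current_loss)
    excess = np.array([max(current_loss[d] - reference_loss[d], 0.0) for d in domains])
    probs = np.exp(excess / temperature)
    probs /= probs.sum()
    return dict(zip(domains, probs))

# weights = domain_sampling_weights(
#     current_loss={"encoding": 1.9, "synthesis": 1.4, "retrieval": 1.2},
#     reference_loss={"encoding": 1.5, "synthesis": 1.3, "retrieval": 1.2})
# -> encoding gets the largest share of the next training batch
```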

Phase 4: Lottery Ticket Validation

Goal: Test whether initialization matters (the core LTH claim).

  • Take the pruned architecture from Phase 2
  • Variant A: Keep trained weights (standard pruning)
  • Variant B: Reset surviving weights to the original pretrained initialization, then rerun the encoding training (sketch after this list)
  • Variant C: Random initialization (ablation — should fail, confirming pretrained init matters)
  • Compare A vs B vs C on all metrics
  • If B ≈ A: the initialization IS the value, confirming true lottery ticket
  • If A >> B: the trained weights matter more, standard pruning is sufficient
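
A rough sketch of how Variant B could be assembled: copy weights for the surviving layers from the original pretrained checkpoint into the pruned architecture. Width-pruned matrices would additionally need slicing by the kept head/FFN indices, which is omitted; the model id and kept-layer list are placeholders from Phase 2:

```python
import torch
from transformers import AutoModelForCausalLM

def build_variant_b(pruned_model, original_id: str, kept_layers: list[int]):
    """Reset the pruned architecture to the original pretrained initialization."""
    original = AutoModelForCausalLM.from_pretrained(original_id, torch_dtype=torch.bfloat16)
    for new_idx, old_idx in enumerate(kept_layers):
        src = original.model.layers[old_idx].state_dict()
        dst = pruned_model.model.layers[new_idx]
        dst_shapes = {k: v.shape for k, v in dst.state_dict().items()}
        # copy only tensors whose shapes survived pruning unchanged; width-pruned
        # matrices need slicing by the surviving head/FFN indices instead
        compatible = {k: v for k, v in src.items()
                      if k in dst_shapes and dst_shapes[k] == v.shape}
        dst.load_state_dict(compatible, strict=False)
    return pruned_model
```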

Phase 5: Export & Local Deployment

Goal: Standalone bespoke model running in the daemon.

  • Export pruned model as GGUF (standard format, no adapter hooks needed)
  • Benchmark on RX 7800 XT: tok/s, VRAM, encoding latency (benchmark sketch after this list)
  • Target: >200 tok/s, <1.5GB VRAM, <10s per encoding
  • Integration test with mnemonic daemon (replace llama-server model)
  • Lifecycle test: full 8-phase lifecycle with bespoke model
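
A rough throughput check for the exported GGUF via llama-cpp-python (assuming a ROCm or Vulkan build of llama.cpp on the 7800 XT; the model path and prompt are placeholders):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="./bespoke-1.5b-q8_0.gguf", n_gpu_layers=-1, n_ctx=4096)

prompt = "Encode the following memory: ..."   # representative encoding prompt
start = time.perf_counter()
out = llm(prompt, max_tokens=512, temperature=0.0)
elapsed = time.perf_counter() - start

gen = out["usage"]["completion_tokens"]
print(f"{gen} tokens in {elapsed:.1f}s -> {gen / elapsed:.0f} tok/s")
```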

Phase 6: Felix-LM Integration

Goal: Add task-specific spokes to the bespoke post.

  • Train spoke adapters on the pruned model (encoding, synthesis, retrieval)
  • Test hot-swap capability: switch between spoke sets at inference (adapter-swap sketch after this list)
  • This is the full Felix-LM vision: small bespoke post + swappable spoke tools
  • Final benchmark: the complete system
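
A sketch of the hot-swap on the HF side using PEFT named adapters (spoke paths are placeholders; the GGUF deployment path would instead export each spoke as a llama.cpp-compatible LoRA):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("./bespoke-1.5b")
model = PeftModel.from_pretrained(base, "./spokes/encoding", adapter_name="encoding")
model.load_adapter("./spokes/synthesis", adapter_name="synthesis")
model.load_adapter("./spokes/retrieval", adapter_name="retrieval")

model.set_adapter("encoding")    # active spoke for the encoding phase
# ... run encoding ...
model.set_adapter("retrieval")   # switch spokes without reloading the base model
```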

Hardware Plan

| Phase | Hardware | Estimated Time | Estimated Cost |
|---|---|---|---|
| Phase 0 | MI300X droplet + local | 1-2 days | ~$20-30 |
| Phase 1 | MI300X droplet | 4-8 hours | ~$10-20 |
| Phase 2 | MI300X droplet | 8-16 hours | ~$20-40 |
| Phase 3 | MI300X droplet | 8-24 hours | ~$20-50 |
| Phase 4 | MI300X droplet | 4-8 hours | ~$10-20 |
| Phase 5 | Local 7800 XT | 1-2 hours | $0 |
| Phase 6 | Local 7800 XT | 4-8 hours | $0 |

Total estimated: ~$80-160 in MI300X compute, 1-2 weeks of research time.

Success Criteria

  1. Quality: Pruned model matches or exceeds current Qwen 2B + spokes on all 7 faithfulness metrics + 7/7 stress test
  2. Speed: >200 tok/s inference on RX 7800 XT (current: 95 tok/s)
  3. Size: <1.5GB VRAM for inference (current: ~3GB)
  4. Identity: Standalone GGUF with no external model dependency — this is mnemonic's model

Risks

  1. Hybrid architecture complexity — If the chosen model is DeltaNet/MoE, pruning tool adaptation could take weeks
  2. Quality cliff — The encoding task might not have enough signal to guide structural pruning decisions, leading to arbitrary cuts
  3. Diminishing returns — If the pruned 1.5B doesn't beat the existing full 2B, the entire effort is wasted. Phase 2 has an explicit go/no-go gate for this.
  4. MI300X cost — Iterative pruning experiments could exceed budget. Set hard $150 cap.

Relationship to Other Work

  • EXP-26 (v7 data training) should complete first — establishes the data quality baseline
  • EXP-27 (Qwen 3.5 4B) explores a larger base model, may inform Phase 0 architecture selection
  • EXP-25 (faithfulness probe, #381) — can Qwen 2B spokes learn to encode diverse inputs? Resolved by EXP-25/26; provides the evaluation framework for Project Bespoke
  • Felix-LM design paper (~/Projects/felixlm/docs/felix_lm_design.tex) — this is the path to realizing the post-and-spoke vision

Metadata

Labels

priority:high (Important, fix soon)
research (ML research experiments)
training (Model training, data, and evaluation)
