Vision
Mnemonic's "brain" is currently 99% someone else's model (Qwen 3.5 2B) with 1% of our adapters on top. Project Bespoke extracts a purpose-built model that exists because of mnemonic — smaller, faster, and genuinely ours.
The core insight comes from the Lottery Ticket Hypothesis (Frankle & Carbin, 2019): dense pretrained models contain sparse subnetworks that, trained in isolation, match full-model performance. Sheared LLaMA (Xia et al., ICLR 2024) applied this at scale: structured pruning with learned masks reduced LLaMA-2-7B to 1.3B at roughly 3% of the from-scratch training cost, outperforming open models of comparable size.
Target: Take a ~7-9B dense pretrained model → extract a 1.5-2B structured subnetwork optimized for mnemonic's tasks → export as a standalone GGUF → deploy as the daemon's native model.
Why This Matters
| Property | Current (Qwen 2B + spokes) | Bespoke 1.5B |
|---|---|---|
| Identity | "Qwen with adapters" | "Mnemonic's model" |
| Total params | 2B + 25M adapters | ~1.5B standalone |
| Inference VRAM | ~3GB | ~1-1.5GB |
| Encoding latency | ~20s | ~5-8s (est.) |
| Daemon + training | Mutually exclusive (VRAM) | Can coexist |
| Architecture | Frozen base + hooks | Clean single model |
| Spoke overhead | Per-layer injection at inference | Baked in (or minimal) |
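The VRAM row is the easiest to sanity-check. A back-of-the-envelope sketch, assuming Q4_K_M quantization at roughly 4.85 bits per weight plus small KV-cache and runtime allowances; the constants are assumptions, not measurements:

```python
# Rough VRAM estimate for the bespoke target (illustrative, not measured).
# Assumes ~Q4_K_M quantization (~4.85 bits/weight) plus assumed KV-cache and
# runtime overhead allowances.

def quantized_vram_gb(params_billions: float,
                      bits_per_weight: float = 4.85,
                      kv_cache_gb: float = 0.3,
                      overhead_gb: float = 0.1) -> float:
    """Approximate resident memory for a quantized model of the given size."""
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + kv_cache_gb + overhead_gb

print(f"bespoke 1.5B at ~Q4: ~{quantized_vram_gb(1.5):.2f} GB")  # falls inside the ~1-1.5GB row above
```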
Architecture Decision: Dense vs Hybrid Base
Critical finding: Qwen 3.5 9B uses a hybrid architecture (Gated DeltaNet + Gated Attention + sparse MoE). Published structured pruning methods (Sheared LLaMA, SliceGPT, LLM-Pruner) assume dense attention and dense FFN blocks; adapting them to DeltaNet/MoE would be novel research.
Decision needed (Phase 0): evaluate candidate base models against this constraint.
Research References
Phases
Phase 0: Feasibility & Architecture Selection (EXP-28)
Goal: Pick the base model and validate that structured pruning works on it.
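One way to make the "structured pruning works on it" check concrete is a quick prunability probe: zero out the lowest-magnitude FFN channels in a single block of a candidate checkpoint and watch how perplexity degrades. The model id, module paths, and 50% ratio below are placeholders assuming a dense Qwen-style candidate; a hybrid DeltaNet/MoE candidate would need different module names.

```python
# Minimal prunability probe (sketch): zero the lowest-magnitude FFN channels of one
# transformer block and measure the perplexity hit on a tiny sample.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B"  # stand-in candidate, not the Phase 0 decision
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

sample = "Structured pruning removes whole channels rather than individual weights."
print("baseline ppl:", perplexity(sample))

# Zero the 50% lowest-norm intermediate channels in block 0's MLP.
mlp = model.model.layers[0].mlp
norms = mlp.down_proj.weight.abs().sum(dim=0)          # importance per FFN channel
drop = norms.argsort()[: norms.numel() // 2]
with torch.no_grad():
    mlp.down_proj.weight[:, drop] = 0.0
    mlp.up_proj.weight[drop, :] = 0.0
    mlp.gate_proj.weight[drop, :] = 0.0
print("pruned ppl:", perplexity(sample))
```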
Phase 1: Full Fine-Tune Baseline (MI300X)
Goal: Establish quality ceiling and collect importance metrics.
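One possible way to collect importance metrics during this phase is to accumulate mean activation magnitudes per FFN channel with forward hooks; head and layer scores would be gathered analogously. This is a sketch of that bookkeeping, not the project's actual instrumentation, and activation magnitude is only one of several reasonable importance proxies.

```python
# Sketch: accumulate per-channel activation magnitudes during the fine-tune pass.
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM

# Stand-in for the Phase 1 fine-tuned checkpoint.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")

channel_importance = defaultdict(lambda: None)

def make_hook(name):
    def hook(module, inputs, output):
        # up_proj output: [batch, seq, intermediate] -> one score per channel
        score = output.detach().abs().mean(dim=(0, 1))
        prev = channel_importance[name]
        channel_importance[name] = score if prev is None else prev + score
    return hook

handles = []
for i, layer in enumerate(model.model.layers):
    handles.append(layer.mlp.up_proj.register_forward_hook(make_hook(f"layer{i}.mlp")))

# ... run fine-tuning / evaluation forward passes here ...

for h in handles:
    h.remove()
torch.save(dict(channel_importance), "ffn_channel_importance.pt")
```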
Phase 2: Structured Pruning (MI300X)
Goal: Find the minimal architecture that maintains encoding quality.
Following the Sheared LLaMA methodology: learned masks over layers, attention heads, hidden dimensions, and FFN intermediate dimensions select the target architecture.
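At the heart of that methodology is a differentiable, roughly binary gate per prunable unit, typically a hard-concrete (L0) mask. A minimal sketch of such a mask module follows; the Lagrangian terms that push the masks toward an exact target architecture are omitted, and the unit count is only an example.

```python
# Minimal hard-concrete mask (sketch of the L0-regularization gate used in
# Sheared-LLaMA-style pruning; target-architecture constraints are omitted).
import torch
import torch.nn as nn

class HardConcreteMask(nn.Module):
    """Differentiable, approximately binary gate per prunable unit (head, channel, layer)."""
    def __init__(self, num_units, beta=0.83, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(num_units))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        s = s * (self.zeta - self.gamma) + self.gamma      # stretch
        return s.clamp(0.0, 1.0)                           # rectify into [0, 1]

    def expected_active(self):
        # Differentiable estimate of how many units stay on; drives the sparsity loss.
        return torch.sigmoid(
            self.log_alpha - self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        ).sum()

# Usage: multiply FFN activations by mask() during mask training, then keep only the
# units whose gate survives (e.g. mask() > 0.5) when materializing the pruned model.
ffn_mask = HardConcreteMask(num_units=8960)   # e.g. the FFN intermediate size of a 1.5B-class model
```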
Phase 3: Continued Pretraining (MI300X)
Goal: Recover quality lost during pruning.
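The Sheared LLaMA recovery recipe pairs continued pretraining with dynamic batch loading: domains whose loss sits furthest above a reference get sampled more. A simplified sketch of that reweighting rule, with placeholder domain names and reference losses:

```python
# Simplified dynamic batch loading (sketch). Domain names, reference losses, and the
# step size are placeholders, not project data.
import torch

domains = ["code", "chat", "encoding_tasks", "general_web"]
reference_loss = {"code": 1.9, "chat": 2.3, "encoding_tasks": 1.6, "general_web": 2.5}
weights = torch.full((len(domains),), 1.0 / len(domains))   # start with uniform sampling

def update_weights(weights: torch.Tensor, current_loss: dict, alpha: float = 1.0) -> torch.Tensor:
    """Upweight domains whose current loss exceeds its reference."""
    gaps = torch.tensor([max(current_loss[d] - reference_loss[d], 0.0) for d in domains])
    new = weights * torch.exp(alpha * gaps)
    return new / new.sum()

# After each evaluation step:
weights = update_weights(weights, {"code": 2.4, "chat": 2.4, "encoding_tasks": 2.0, "general_web": 2.6})
print({d: round(float(w), 3) for d, w in zip(domains, weights)})
```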
Phase 4: Lottery Ticket Validation
Goal: Test whether initialization matters (the core LTH claim).
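Concretely, the check is to train the pruned architecture twice under an identical recipe, once keeping the inherited weights and once from a fresh random init, then compare encoding quality. The helpers in this sketch (build_pruned_model, train, evaluate_encoding) are hypothetical, not existing project code.

```python
# Lottery-ticket check (sketch): same architecture, same training recipe, two inits.
# build_pruned_model, train, and evaluate_encoding are hypothetical helpers.
import copy
import torch

pruned = build_pruned_model()                      # architecture + weights from Phases 2-3
rewound = copy.deepcopy(pruned)

def reinit(module):
    if hasattr(module, "reset_parameters"):
        module.reset_parameters()

rewound.apply(reinit)                              # same architecture, fresh random init

results = {}
for name, m in [("inherited", pruned), ("reinit", rewound)]:
    torch.manual_seed(0)                           # identical data order and schedule
    train(m, steps=2000)
    results[name] = evaluate_encoding(m)

print(results)   # if "inherited" clearly wins, the extracted weights matter (the LTH-style claim)
```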
Phase 5: Export & Local Deployment
Goal: Standalone bespoke model running in the daemon.
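For the daemon side, one plausible loading path is llama-cpp-python against the exported file; the model filename and sampling settings below are placeholders.

```python
# Sketch of the daemon loading the exported standalone GGUF via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/mnemonic-bespoke-1.5b-q4_k_m.gguf",  # hypothetical export name
    n_ctx=4096,
    n_gpu_layers=-1,          # offload all layers; ~1-1.5 GB expected at Q4
)

out = llm.create_completion("Encode the following note:\n...", max_tokens=256, temperature=0.2)
print(out["choices"][0]["text"])
```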
Phase 6: Felix-LM Integration
Goal: Add task-specific spokes to the bespoke post.
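If the spokes stay as LoRA adapters, attaching one to the bespoke post could look roughly like this PEFT sketch; the export name and target module list are assumptions that depend on the final pruned architecture.

```python
# Sketch: attach a task-specific "spoke" to the bespoke post as a LoRA adapter.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mnemonic-bespoke-1.5b")  # hypothetical HF-format export
spoke_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # adjust to the pruned model's module names
    task_type="CAUSAL_LM",
)
spoke = get_peft_model(base, spoke_cfg)
spoke.print_trainable_parameters()          # should report roughly a 1% trainable fraction
```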
Hardware Plan
Total estimate: ~$80-160 in MI300X compute and 1-2 weeks of research time.
Success Criteria
Identity: Standalone GGUF with no external model dependency — this is mnemonic's model
Risks
Hybrid architecture complexity — If the chosen model is DeltaNet/MoE, adapting the pruning tooling could take weeks
Quality cliff — The encoding task might not have enough signal to guide structural pruning decisions, leading to arbitrary cuts
Diminishing returns — If the pruned 1.5B doesn't beat the existing full 2B, the entire effort is wasted. Phase 2 has an explicit go/no-go gate for this.
MI300X cost — Iterative pruning experiments could exceed budget. Set a hard $150 cap.
Relationship to Other Work
EXP-26 (v7 data training) should complete first — it establishes the data quality baseline
EXP-27 (Qwen 3.5 4B) explores a larger base model and may inform Phase 0 architecture selection
Felix-LM design doc (~/Projects/felixlm/docs/felix_lm_design.tex) — this is the path to realizing the post-and-spoke vision