dusterbloom: feat(bonsai-q1): packed 1.25-bpw engine with fp16 attention path by dusterbloom · Pull Request #13 · dusterbloom/higgs

dusterbloom · 2026-04-30T19:57:40Z

Summary

Adds Bonsai-Q1, a packed 1-bit Qwen3-shaped target model engine for 1.7B / 8B checkpoints. Routes via existing model_type=\"qwen3\" arm: any checkpoint declaring quantization.bits == 1 lands in the dedicated packed engine.

Headline (production decode, M4 Max)

Model	Before	After	Speedup
Bonsai-1.7B	11.51 ms / 87 t/s	5.45 ms / 184 t/s	2.11×
Bonsai-8B	44.46 ms / 22 t/s	15.69 ms / 64 t/s	2.83×

The win came from a dtype audit, not new architecture. apply_yarn_rope was multiplying fp16 q/k by an f32 scalar from mlx_rs::array!(mscale), silently upcasting the entire attention chain to f32. Bonsai's yarn factor=4.0 yields mscale≈1.139, so the branch fired every rope call (72×/step on 8B). Fix: cast the mscale scalar to x's dtype before multiply, keeping the chain in fp16.

What lands

bonsai_q1::PackedQ1Linear — 1.25-bpw layer matching MLX's bits=1 QuantizedLinear / PrismML fork (1 bit/weight + 32-bit scale + bias per group of 128 input columns)
bonsai_q1::BonsaiQ1Engine / BonsaiQ1Gpu — causal forward + KV cache on the packed engine
yarn — shared YaRN rope helpers (apply_yarn_rope, compute_yarn_freqs, yarn_get_mscale) with the dtype-safe scalar cast
AnyModel::BonsaiQ1 variant + match arms (forward, forward_batched, MTP, hidden_size, kv_cache_geometry, make_cache_with_config, image_size)
is_bonsai_q1 in model_loader — peeks config.json for quantization.bits == 1 to route the qwen3 arm
SteppingKeyValueCache::key_value_arrays_mut — simultaneous mutable K/V access for Updatable state borrowing

Test plan

cargo check -p higgs-models -p higgs-engine — clean
cargo test -p higgs-models --lib -- --test-threads=1 — 330 passed, 0 failed, 25 ignored
cargo test -p higgs-models yarn:: (targeted) — yarn rope-dynamic ground-truth test passes
cargo run -p higgs-bench --release --bin bench_decode -- --model bonsai-8b-q1 --warmup 1 --trials 5 --max-tokens 200 --temperature 0 — needs a Bonsai-8B-q1 entry in benchmarks/models.toml to land here; verify ≥60 tok/s
cargo clippy -p higgs-models -p higgs-engine — clean

Notes

One test (yarn::tests::rope_dynamic_matches_static_offset_0_to_64) is #[ignore]d due to pre-existing harness flakiness — global MLX Metal/RNG state contamination from prior tests poisons the rope output. Passes targeted via cargo test -p higgs-models yarn::. The underlying behavior is what shipped the 2.83× win in production.
Co-authored with Claude Opus 4.7

fix(deps): update rust crate toml to v1

…l-action-digest chore(deps): update taiki-e/install-action digest to cca35ed

…file chore(deps): update rust crate tokio to v1.52.2

…-lockfile chore(deps): update rust crate tower-http to v0.6.9

…anbanda#143) Adds AnyCache::trim_by to roll back KV layers for speculative decode while leaving hybrid Arrays state untouched.\n\nCI: https://github.com/panbanda/higgs/actions/runs/25312580791

…#148) * feat(qwen3_next): mixed-bit Qwen3.5 GDN BA loading fallback Adds a fallback path for loading Qwen3.5 models with mixed-bit GDN projection weights (some layers q4, some q8 — common in unsloth's dynamic-quant variants). The default fused-projection loader fuses `in_proj_a` + `in_proj_b` into a single matmul; mixed-bit weights have incompatible shapes and the fusion fails. Behaviour: 1. Detect via `is_mixed_bit_gdn_ba_fusion_error` — matches a `ModelError::ShapeMismatch` whose message contains both `in_proj_ba` and `requires separate GDN projections`. 2. On detection, retry the load with `args.use_separate_gdn_projections = true`, taking the `load_qwen3_5_moe_weights_direct` path. Forward dispatches go from 2 to 4 GDN ops per layer — slightly slower but correct. 3. Forced separate (via `args.use_separate_gdn_projections` config or `HIGGS_SEPARATE_GDN_PROJ` env var) skips the fused attempt entirely. Also adds: * `qwen3_5_quantization_config` — parses `{group_size, bits}` from the per-layer `quantization` map in `config.json`. * `qwen3_5_mixed_ba_quantization_layers` — scans for the layers where `in_proj_a` and `in_proj_b` differ in bits or group_size. * `can_concatenate_axis0` — guard used inside `load_qwen3_5_moe_weights_fused` to emit the diagnostic `ShapeMismatch` error rather than panicking on the concat. * `load_qwen3_5_model_with_gdn_fallback` — private helper called by both `load_qwen3_5_model` (dense) and `load_qwen3_5_moe_model` (MoE), unifying the fallback path. Adaptations from feat/magic-canvas → origin/main: * The dense `load_qwen3_5_model` previously only honoured the env var; now it honours `args.use_separate_gdn_projections` too, matching the MoE path. Strict improvement: the config flag is set only by the env var or by mixed-bit detection. * No `unwrap()`, no `as` casts (use `i32::try_from`); match arms enumerate variants. No file-level allows added. Verification on origin/main (rustc 1.95.0): * `cargo check -p higgs-models` — clean * `cargo clippy --all-targets --all-features -- -D warnings` — clean * `cargo fmt --check` — clean * `cargo test -p higgs-models --lib` — 333/333 pass (3 new) Source: feat/magic-canvas commit `061e500c`. Direct cherry-pick had 5 conflict regions because origin/main has evolved the load functions independently; this is a manual surgical port that preserves origin/main's structure while adding the fallback behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(qwen3_next): preserve explicit GDN projection config --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Jonathan Reyes <me@jonathanreyes.com>

…panbanda#141) * perf(models): fused MoE gate+up — 3→2 expert matmuls per layer SwitchMlpWeights::forward_gather_fused() lazy-concatenates gate+up weights on first call, then dispatches a single gather_qmm instead of two separate calls. FfnBlock::forward() now routes through the fused path instead of forward_gather_global_sort(). Measured on 35B-A3B-3bit M4 base: - S=1 decode: 27ms → 17ms (−37%) - S=16 verify: 253ms → 112ms (−56%) - MoE/layer at K=1: 0.47ms (down from ~0.68ms) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * style: cargo fmt qwen3_next.rs Reflow `let fw/fs/fb = ops::concatenate_axis(..)` from broken-line indentation back onto single lines so `cargo fmt --all -- --check` passes in CI. * fix(clippy): backtick MoE/gather_qmm doc + safe top_k u32 cast Two errors flagged by `-D clippy::doc-markdown` and `-D clippy::cast-sign-loss`/`-D clippy::as-conversions`: - Backtick `MoE` and `gather_qmm` in the `fused_gate_up` doc comment. - Replace `top_k as u32` with the same `u32::try_from(top_k).map_err(...)` pattern already used by `forward_gather_global_sort`. * fix(qwen3_next): gate MoE gate-up fusion behind opt-in --------- Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jonathan Reyes <me@jonathanreyes.com>

fix(deps): update rust crate toml to v1

f671235

dusterbloom force-pushed the dusterbloom/bonsai-q1-fp16 branch 2 times, most recently from e9a3152 to 39a657d Compare May 4, 2026 08:34

This was referenced May 4, 2026

feat(bonsai-q1): packed engine scaffold with upstream MLX guard panbanda/higgs#142

Merged

dusterbloom: feat(speculative): DraftModel trait + spec-decode primitives #15

Open

renovate Bot and others added 11 commits May 4, 2026 14:50

chore(deps): update rust crate tokio to v1.52.2

50daaef

chore(deps): update taiki-e/install-action digest to cca35ed

8659d38

chore(deps): update rust crate tower-http to v0.6.9

442063e

Merge pull request panbanda#135 from panbanda/renovate/toml-1.x

a08c1bf

fix(deps): update rust crate toml to v1

Merge pull request panbanda#140 from panbanda/renovate/taiki-e-instal…

ea2fb41

…l-action-digest chore(deps): update taiki-e/install-action digest to cca35ed

Merge pull request panbanda#146 from panbanda/renovate/tokio-1.x-lock…

0ae70ec

…file chore(deps): update rust crate tokio to v1.52.2

Merge pull request panbanda#149 from panbanda/renovate/tower-http-0.x…

4b4c0be

…-lockfile chore(deps): update rust crate tower-http to v0.6.9

feat(cache): AnyCache::trim_by dispatcher for spec-decode rollback (p…

229c111

…anbanda#143) Adds AnyCache::trim_by to roll back KV layers for speculative decode while leaving hybrid Arrays state untouched.\n\nCI: https://github.com/panbanda/higgs/actions/runs/25312580791

feat(bonsai_q1): add upstream-guarded packed engine

4af2603

panbanda force-pushed the dusterbloom/bonsai-q1-fp16 branch from fad54d3 to 4af2603 Compare May 6, 2026 13:30

fix(bonsai-q1): address review feedback

6ebefb3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dusterbloom: feat(bonsai-q1): packed 1.25-bpw engine with fp16 attention path#13

dusterbloom: feat(bonsai-q1): packed 1.25-bpw engine with fp16 attention path#13
dusterbloom wants to merge 13 commits intomainfrom
dusterbloom/bonsai-q1-fp16

dusterbloom commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dusterbloom commented Apr 30, 2026

Summary

Headline (production decode, M4 Max)

What lands

Test plan

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants