dusterbloom: feat(qwen3_next): mixed-bit Qwen3.5 GDN BA loading fallback #19
Open
dusterbloom wants to merge 11 commits into main from
Conversation
fix(deps): update rust crate toml to v1
…l-action-digest chore(deps): update taiki-e/install-action digest to cca35ed
…file chore(deps): update rust crate tokio to v1.52.2
…-lockfile chore(deps): update rust crate tower-http to v0.6.9
…anbanda#143) Adds AnyCache::trim_by to roll back KV layers for speculative decode while leaving hybrid Arrays state untouched.

CI: https://github.com/panbanda/higgs/actions/runs/25312580791
Adds a fallback path for loading Qwen3.5 models with mixed-bit GDN
projection weights (some layers q4, some q8 — common in unsloth's
dynamic-quant variants). The default fused-projection loader fuses
`in_proj_a` + `in_proj_b` into a single matmul; mixed-bit weights
have incompatible shapes and the fusion fails.
Behaviour:
1. Detect via `is_mixed_bit_gdn_ba_fusion_error` — matches a
`ModelError::ShapeMismatch` whose message contains both
`in_proj_ba` and `requires separate GDN projections`.
2. On detection, retry the load with
`args.use_separate_gdn_projections = true`, taking the
`load_qwen3_5_moe_weights_direct` path. Forward dispatches go from
2 to 4 GDN ops per layer — slightly slower but correct.
3. Forced separate (via `args.use_separate_gdn_projections` config or
`HIGGS_SEPARATE_GDN_PROJ` env var) skips the fused attempt
entirely.
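The detect-then-retry flow above can be sketched as follows. `ModelError`, `LoadArgs`, and the two loader stubs are simplified stand-ins for the real higgs-models types; only the control flow and the error predicate are illustrated, not the actual weight loading.

```rust
#[derive(Debug)]
enum ModelError {
    ShapeMismatch(String),
    Other(String),
}

/// Matches only the diagnostic error emitted by the fused loader,
/// not arbitrary shape mismatches from elsewhere in the load.
fn is_mixed_bit_gdn_ba_fusion_error(err: &ModelError) -> bool {
    match err {
        ModelError::ShapeMismatch(msg) => {
            msg.contains("in_proj_ba") && msg.contains("requires separate GDN projections")
        }
        ModelError::Other(_) => false,
    }
}

struct LoadArgs {
    use_separate_gdn_projections: bool,
}

/// Stub for the fused path: with mixed-bit weights it fails with the
/// diagnostic ShapeMismatch rather than panicking on the concat.
fn load_fused(_args: &LoadArgs) -> Result<(), ModelError> {
    Err(ModelError::ShapeMismatch(
        "in_proj_ba has mixed-bit weights; requires separate GDN projections".to_string(),
    ))
}

/// Stub for the separate (4-dispatch) path, which always succeeds here.
fn load_separate(_args: &LoadArgs) -> Result<(), ModelError> {
    Ok(())
}

/// Unified fallback: forced separate skips the fused attempt entirely;
/// otherwise try fused and retry separate only on the specific error.
fn load_with_gdn_fallback(mut args: LoadArgs) -> Result<(), ModelError> {
    if args.use_separate_gdn_projections {
        return load_separate(&args);
    }
    match load_fused(&args) {
        Ok(v) => Ok(v),
        Err(e) if is_mixed_bit_gdn_ba_fusion_error(&e) => {
            args.use_separate_gdn_projections = true;
            load_separate(&args)
        }
        Err(e) => Err(e),
    }
}
```

Keeping the predicate narrow (both substrings must match) is what prevents the retry from masking unrelated shape errors.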
Also adds:
* `qwen3_5_quantization_config` — parses `{group_size, bits}` from
the per-layer `quantization` map in `config.json`.
* `qwen3_5_mixed_ba_quantization_layers` — scans for the layers
where `in_proj_a` and `in_proj_b` differ in bits or group_size.
* `can_concatenate_axis0` — guard used inside
`load_qwen3_5_moe_weights_fused` to emit the diagnostic
`ShapeMismatch` error rather than panicking on the concat.
* `load_qwen3_5_model_with_gdn_fallback` — private helper called by
both `load_qwen3_5_model` (dense) and `load_qwen3_5_moe_model`
(MoE), unifying the fallback path.
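A minimal sketch of two of the helpers above, assuming a simplified `QuantConfig` struct and a plain layer map in place of the real `config.json` parsing:

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for the per-layer `{group_size, bits}` entry
/// parsed by qwen3_5_quantization_config.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct QuantConfig {
    group_size: u32,
    bits: u32,
}

/// Layers where in_proj_a and in_proj_b disagree in bits or group_size
/// (the scan performed by qwen3_5_mixed_ba_quantization_layers).
fn mixed_ba_layers(
    a: &BTreeMap<usize, QuantConfig>,
    b: &BTreeMap<usize, QuantConfig>,
) -> Vec<usize> {
    a.iter()
        .filter_map(|(layer, qa)| match b.get(layer) {
            Some(qb) if qa != qb => Some(*layer),
            _ => None,
        })
        .collect()
}

/// Axis-0 concatenation is valid only when ranks match and every
/// trailing dimension agrees; the fused loader checks this before
/// the concat so it can emit a diagnostic error instead of panicking.
fn can_concatenate_axis0(shape_a: &[usize], shape_b: &[usize]) -> bool {
    shape_a.len() == shape_b.len()
        && !shape_a.is_empty()
        && shape_a[1..] == shape_b[1..]
}
```

For example, a q4 `in_proj_a` and q8 `in_proj_b` at the same layer produce packed tensors whose trailing dimensions differ, so `can_concatenate_axis0` rejects the fusion and the diagnostic path fires.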
Adaptations from feat/magic-canvas → origin/main:
* The dense `load_qwen3_5_model` previously only honoured the env
var; now it honours `args.use_separate_gdn_projections` too,
matching the MoE path. Strict improvement: the config flag is set
only by the env var or by mixed-bit detection.
* No `unwrap()`, no `as` casts (use `i32::try_from`); match arms
enumerate variants. No file-level allows added.
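As a small illustration of the no-`as`-casts rule (the helper name and error type here are hypothetical, not from the crate):

```rust
/// Convert a dimension with i32::try_from, surfacing overflow as an
/// error instead of silently truncating the way an `as` cast would.
fn dim_to_i32(dim: usize) -> Result<i32, String> {
    i32::try_from(dim).map_err(|_| format!("dimension {dim} does not fit in i32"))
}
```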
Verification on origin/main (rustc 1.95.0):
* `cargo check -p higgs-models` — clean
* `cargo clippy --all-targets --all-features -- -D warnings` — clean
* `cargo fmt --check` — clean
* `cargo test -p higgs-models --lib` — 333/333 pass (3 new)
Source: feat/magic-canvas commit `061e500c`. Direct cherry-pick had 5
conflict regions because origin/main has evolved the load functions
independently; this is a manual surgical port that preserves
origin/main's structure while adding the fallback behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed 358ac62 to 45ece14 (Compare)
Force-pushed 45ece14 to dd9cf7a (Compare)
Summary

Adds a load-time fallback for Qwen3.5 models with mixed-bit GDN BA projections (some layers q4, some q8 — common in unsloth dynamic-quant variants). The default fused loader concatenates `in_proj_a` + `in_proj_b` into a single matmul; mixed-bit weights have incompatible shapes and fusion fails. This PR detects the specific failure and retries with separate (4-dispatch) GDN projections.

This is PR-1c of the magic-canvas split — a deferred follow-up from PR-1's audit (commit `061e500c`).

Behaviour

What's added

* `is_mixed_bit_gdn_ba_fusion_error(err: &ModelError) -> bool` — pattern-matches the diagnostic shape-mismatch error.
* `qwen3_5_quantization_config` — parses `{group_size, bits}` from the per-layer `quantization` map.
* `qwen3_5_mixed_ba_quantization_layers` — scans for layers where `in_proj_a` and `in_proj_b` differ in bits or group_size.
* `can_concatenate_axis0` — guard used inside `load_qwen3_5_moe_weights_fused` to emit the diagnostic error rather than panic on the concat.
* `load_qwen3_5_model_with_gdn_fallback` — private helper called by both dense and MoE load paths, unifies the fallback.

Adaptations from feat/magic-canvas

* `load_qwen3_5_model` previously only honoured the env var; now it honours `args.use_separate_gdn_projections` too, matching the MoE path. Strict improvement.
* `061e500c` had 5 conflict regions; this is a manual surgical port that preserves origin/main's structure.

Hygiene

* No `unwrap()` on `Result`/`Option`.
* No `as` casts in shape-arithmetic paths (use `i32::try_from`).

Test plan

* `cargo check -p higgs-models` — clean
* `cargo clippy --all-targets --all-features -- -D warnings` — clean (rustc 1.95.0, matches CI)
* `cargo fmt --check` — clean
* `cargo test -p higgs-models --lib` — 333/333 pass (3 new)
* Load a mixed-bit model (`Brooooooklyn/Qwen3.5-27B-unsloth-mlx`) and confirm the warn-then-retry path triggers, the model loads successfully, and generation produces coherent tokens.

🤖 Generated with Claude Code