mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) #22101

ReinforcedKnowledge wants to merge 7 commits into ggml-org:master
Conversation
Conformer encoder with Shaw relative position encoding, QFormer projector, log-mel spectrogram with frame stacking. Encoder uses GLU gating, folded batch norm, and SSM depthwise conv. QFormer compresses encoder output via windowed cross-attention (window=15, queries=3) into the LLM embedding space. Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank, dynamic range compression, 2x frame stacking (80->160 mel). GGUF converter handles batch norm folding at export time, fused K/V split, and Conv1d weight reshaping. Tested against HF transformers reference: token-for-token match on 30s/60s audio clips with greedy decoding.
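The 2x frame stacking mentioned above (80 -> 160 mel dims) can be sketched as follows. This is an illustrative sketch only; the function and variable names are mine, not the PR's:

```python
def stack_frames(mel_frames, factor=2):
    """Concatenate groups of `factor` consecutive mel frames into one frame.

    mel_frames: list of per-frame mel vectors (here, 80 bins each).
    Returns frames of length 80 * factor; any trailing remainder is dropped.
    """
    n = len(mel_frames) // factor * factor
    return [
        [v for frame in mel_frames[i:i + factor] for v in frame]
        for i in range(0, n, factor)
    ]
```

With `factor=2`, three 80-bin frames yield one 160-dim stacked frame, halving the time resolution the encoder sees.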
Force-pushed from f4c14e1 to 7b313dc
```cpp
ggml_tensor * Q = build_mm(layer.q_w, normed);
ggml_tensor * K = build_mm(layer.k_w, normed);
ggml_tensor * V = build_mm(layer.v_w, normed);

Q = ggml_reshape_4d(ctx0, Q, d_head, n_head, context_size, num_blocks);
K = ggml_reshape_4d(ctx0, K, d_head, n_head, context_size, num_blocks);
V = ggml_reshape_4d(ctx0, V, d_head, n_head, context_size, num_blocks);

ggml_tensor * Q_perm = ggml_permute(ctx0, Q, 0, 2, 1, 3);
ggml_tensor * K_perm = ggml_cont(ctx0, ggml_permute(ctx0, K, 0, 2, 1, 3));

ggml_tensor * kq = ggml_mul_mat(ctx0, K_perm, Q_perm);

// Shaw RPE: pos_emb ne[2]=1 broadcasts against Q ne[2]=num_blocks in mul_mat
ggml_tensor * pos_emb = ggml_get_rows(ctx0, layer.attn_rel_pos_emb, attn_dists);
pos_emb = ggml_reshape_3d(ctx0, pos_emb, d_head, context_size, context_size);
pos_emb = ggml_reshape_4d(ctx0, pos_emb, d_head, context_size, 1, context_size);

ggml_tensor * Q_shaw = ggml_permute(ctx0, Q, 0, 1, 3, 2);
ggml_tensor * pos_attn = ggml_mul_mat(ctx0, pos_emb, Q_shaw);
pos_attn = ggml_cont(ctx0, ggml_permute(ctx0, pos_attn, 0, 2, 3, 1));

ggml_tensor * scores = ggml_add(ctx0, kq, pos_attn);
ggml_tensor * attn_weights = ggml_soft_max_ext(ctx0, scores, attn_mask,
                                               kq_scale, 0.0f);

ggml_tensor * V_perm = ggml_cont(ctx0, ggml_permute(ctx0, V, 1, 2, 0, 3));
ggml_tensor * attn_out = ggml_mul_mat(ctx0, V_perm, attn_weights);

attn_out = ggml_permute(ctx0, attn_out, 0, 2, 1, 3);
attn_out = ggml_cont_2d(ctx0, attn_out, n_embd, padded_len);
```
Is it possible to use build_attn here? I suspect the only thing missing from build_attn was the mask, right?
Unfortunately no :/ Shaw RPE injects pos_attn = mul_mat(pos_emb, Q) between the KQ product and softmax, and build_attn goes directly from mul_mat(k, q) to soft_max_ext with no hook for that. Flash attention path also fuses the whole thing into one op. But the QFormer attention already uses build_attn.
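For intuition, the injection point can be sketched in plain Python (names, shapes, and the helper are mine, not the PR's): the Shaw position term `q_i · r_{i,j}` must be added to the raw QK scores before scaling and softmax, which is exactly the step `build_attn` has no hook for.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def shaw_attention_weights(q, k, rel, scale):
    """q, k: n head vectors of dim d; rel[i][j]: embedding for distance j - i.

    Shaw RPE adds q_i . rel[i][j] to the dot-product score for each (i, j)
    pair, then the combined scores are scaled and softmaxed together.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    n = len(q)
    scores = [
        [(dot(q[i], k[j]) + dot(q[i], rel[i][j])) * scale for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in scores]
```

With `rel` all zeros this reduces to standard scaled dot-product attention, which is why the QFormer (which has no relative position term) can still use `build_attn`.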
@ReinforcedKnowledge Thanks for tackling this support! I'd been slowly working through Granite 3.3 Speech support, but had stalled out badly. I'll pull this down and give it a shot on both the new 4.0-based model and the older 3.2 and 3.3 models.

🤦 Nope, I'm wrong here! The 3.x speech models used the conditional adapter while the 3.x vision models did not. It appears that this swapped for 4.0 (I didn't realize speech had dropped the conditional adapter for the HF release).
Confirmed that this is working nicely for 4.0 with the embedded multilingual sample from the repo:

```shell
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-4.0-1b-speech/ --outtype bf16
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-4.0-1b-speech/ --outtype bf16 --mmproj
./build-rel/bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-4.0-1b-speech/granite-4.0-1B-speech-BF16.gguf --mmproj ~/models/ibm-granite/granite-4.0-1b-speech/mmproj-granite-4.0-1b-speech-BF16.gguf --audio ~/models/ibm-granite/granite-4.0-1b-speech/multilingual_sample.wav -p "can you transcribe the speech into a written format?" --jinja --temp 0
```
This is also working nicely for 3.3-2b! Note that for that model, you do need the adapter (though interestingly it does seem to transcribe the English without the adapter before apparently translating the French to English).

Convert:

```shell
python convert_hf_to_gguf.py ~/models/granite-speech-3.3-2b/ --outtype bf16
python convert_hf_to_gguf.py ~/models/granite-speech-3.3-2b/ --outtype bf16 --mmproj
python convert_lora_to_gguf.py ~/models/granite-speech-3.3-2b/ --outtype bf16
```

Run with adapter:

```shell
./build-rel/bin/llama-mtmd-cli -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --lora ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16-LoRA.gguf --audio ~/models/ibm-granite/granite-4.0-1b-speech/multilingual_sample.wav -p "can you transcribe the speech into a written format?" --jinja --temp 0
```

Run without adapter:

```shell
./build-rel/bin/llama-mtmd-cli -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --audio ~/models/ibm-granite/granite-4.0-1b-speech/multilingual_sample.wav -p "can you transcribe the speech into a written format?" --jinja --temp 0
```
@gabe-l-hart If I understand correctly, the model contains specific adapters for audio / vision input, and the adapter is only activated during prompt processing of the corresponding modality input, right? IIRC there was also a discussion about having a built-in LoRA adapter (because currently adapters are loaded as separate files, which is not very convenient in terms of UX). I don't remember exactly where the discussion was, but it may be worth re-visiting.
gabe-l-hart left a comment:
Thank you SO much for putting this together! It's been on my TODO list for a very long time and just hasn't made it to the top.
I've got a number of nitty questions about things that should maybe be hparams instead of being hard-coded as well as a few structural questions for @ngxson about any future plans for further model-specific modularity in the codebase. The only concrete change request (besides the naming conventions from @ngxson) is that you update the base GraniteModel in convert_hf_to_gguf.py rather than introducing a special text model for Granite Speech.
```cpp
    } break;
case PROJECTOR_TYPE_GRANITE_SPEECH:
    {
        hparams.audio_chunk_len = 0;
```
@ngxson I've been curious about these hard-coded values. These seem like properties of the model instance and not the model architecture and thus something that would make sense as hparam values in the GGUF for the specific model. Is there something I'm missing that explicitly links the projector architecture to these specific values? I know that the upstream transformers models hard-code them, but I would imagine it might make sense to proactively put them in the GGUF so that if in the future the architecture is reused with different values, we don't need a code-change and/or reconverted GGUFs to support it. The fields are already there in the internal clip_hparams (the ones being set here), so I think it would just be a matter of defining the string constants for conversion and then adding these as the default values in the convert_hf_to_gguf.py stack.
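A minimal sketch of what's being proposed, reading the value from GGUF metadata with today's hard-coded value as the fallback so existing GGUFs still load. The key name `clip.audio.chunk_len` and the helper are hypothetical, chosen only for illustration:

```python
# Hard-coded defaults matching what clip.cpp sets today; used only when the
# GGUF does not carry the key. The key string here is made up for this sketch.
HARDCODED_DEFAULTS = {
    "clip.audio.chunk_len": 0,
}

def load_hparam(gguf_kv, key):
    """gguf_kv: dict of key/value metadata parsed from the GGUF header."""
    return gguf_kv.get(key, HARDCODED_DEFAULTS[key])
```

That way a future model reusing the projector architecture with different values needs only a reconverted GGUF, not a code change.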
```cpp
const int n_layer_orig = hparams.n_layer;
if (model.proj_type == PROJECTOR_TYPE_GEMMA3NV
        || model.proj_type == PROJECTOR_TYPE_GRANITE_SPEECH) {
    hparams.n_layer = 0; // these models do not use the generic layer structure
```
I was confused reading this comment (which I know you didn't add). It would be helpful to flesh out the comment to indicate that hparams.n_layer will be re-set below so this is just a workaround to skip the generic layer processing.
Also, it seems like this workaround is a bit of a hack. Would it make more sense to have bool skip_standard_layers = false; and then check it directly below before performing the default layer functionality? Or better yet, put a conditional around the default layer logic that makes it explicit that these architectures are skipping it?
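To make the suggestion concrete, here is an illustrative sketch of the explicit-flag shape (written in Python for brevity; the real code is C++, and all names here are hypothetical):

```python
def load_layers(proj_type, n_layer, load_layer):
    """Skip the generic per-layer loop explicitly instead of zeroing n_layer.

    proj_type: projector architecture name; load_layer: per-layer loader.
    """
    uses_generic_layers = proj_type not in ("GEMMA3NV", "GRANITE_SPEECH")
    if uses_generic_layers:
        return [load_layer(il) for il in range(n_layer)]
    # these architectures load their model-specific tensors elsewhere
    return []
```

The mutate-then-restore dance (`n_layer = 0` followed by restoring `n_layer_orig`) disappears, and the intent is visible at the loop site.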
Also agree that it's a bit of a hack; it doesn't sit comfortably with me, and if I can get rid of it with minimal changes to the code that would be great. Currently, someone reading the code won't understand why we do `const int n_layer_orig = hparams.n_layer;` until (and unless) they reach the Granite-specific code further down.
I think the best approach is, as you suggested, to wrap the loop in a conditional and avoid the mutation and save/restore entirely. On it!
Just pushed, let me know if it's good 😄
@ngxson owns this area, so his opinion counts for a lot more than mine, but IMO, this clean-up is worth the slight scope-creep on the PR rather than perpetuating a pattern that is hard to maintain.
```cpp
        }
        set_input_f32("pos_emb", pos_emb);
    } break;
case PROJECTOR_TYPE_GRANITE_SPEECH:
```
Question for @ngxson: Is there any plan to break up clip.cpp so that this kind of model-specific code can live in a <model-name>.cpp file? Right now, it looks like the arch-specific files are only for graph building, but it seems like it could go a lot further to encode this sort of logic as well (this is probably a much bigger conversation that bleeds into the model-modularity conversation in the core as well).
```cpp
    mtmd_audio_cache cache;
};

struct mtmd_audio_preprocessor_granite_speech : mtmd_audio_preprocessor {
```
Similar question to @ngxson about the modularity plans. This also seems ripe for isolation.
Right, that's the goal of these modular models. I was clearly a bit confused thinking that 4.0 speech had kept the adapter separate like 3.3 did. I know that 4.0 vision did keep them separate. The ultimate goal is a single running model with modality-specific adapters that toggle on/off automatically, allowing a single model to serve all modalities without sacrificing text quality for text-only use. Now that we've got this working for the 3.3 model, I'll use that as a testbed for my modality-conditional-adapter branch.
That would be #13693
Right! Thanks for the reminder. I'll look back over that and make sure I haven't duplicated anything |
Overview
Adds support for ibm-granite/granite-4.0-1b-speech.
granite-speech.cpp)

Tested with greedy decoding on 30s/60s/120s/180s/360s clips: token-for-token match against HF transformers (following the script on the model card) for 30s and 60s. Running the longer clips through HF was too heavy for my machine, but at 120s/180s there is noticeable degradation, and at 360s the output loops completely.
Test command:
```shell
ffmpeg -i input.wav -t 30 -ar 16000 -ac 1 test.wav
python convert_hf_to_gguf.py models/granite-4.0-1b-speech --outtype f16
python convert_hf_to_gguf.py models/granite-4.0-1b-speech --outtype f16 --mmproj
./build/bin/llama-mtmd-cli -m models/granite-4.0-1b-speech/granite-4.0-1B-speech-F16.gguf --mmproj models/granite-4.0-1b-speech/mmproj-granite-4.0-1b-speech-F16.gguf --audio test.wav -p "can you transcribe the speech into a written format?" --jinja --temp 0 -c 4096
```

Also tested the UI:
Uploading an audio file and using the prompt above produces the same transcription as the CLI.
Notes:

`--jinja` is required, and the prompt "can you transcribe the speech into a written format?" is taken from the model card.

Requirements
EDIT: Added the comment on testing the chat UI.