
Modality conditional adapters #22184

Open
gabe-l-hart wants to merge 11 commits into ggml-org:master from gabe-l-hart:ModalityConditionalAdapters

Conversation

@gabe-l-hart
Collaborator

@gabe-l-hart gabe-l-hart commented Apr 20, 2026

Overview

This PR introduces a new mechanism for automatically toggling a LoRA adapter based on the presence of one or more modalities tied to that adapter. This is a required feature for serving modular models such as ibm-granite/granite-speech-3.3-2b and ibm-granite/granite-4.0-3b-vision, where the base LLM is preserved and modality support is added through the adapter. Without this, a modular model must be booted in either text mode or modality mode. With this change, the model can be booted once and used in either mode based on the presence of the modality in the request.
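To make the intended behavior concrete, here is a minimal sketch of the toggling decision. This is purely illustrative Python (the PR itself is C++, and the function and field names below are made up): an adapter tied to one or more modalities is applied only when at least one of those modalities appears in the request, and toggling it off zeroes the effective scale without discarding the user-set value.

```python
# Hypothetical sketch, NOT the PR's actual implementation: decide which LoRA
# adapters are active for a request based on the modalities it contains.

def effective_scales(adapters, request_modalities):
    """Return the scale to apply per adapter for this request.

    adapters: list of dicts with a user-set "scale" and a "modalities" set;
    an empty set means the adapter is unconditional (always active).
    request_modalities: set of modality names present in the request.
    """
    scales = []
    for adapter in adapters:
        tied = adapter["modalities"]
        if not tied or tied & request_modalities:
            # Modality present (or unconditional): keep the user's scale.
            scales.append(adapter["scale"])
        else:
            # Modality absent: toggle off without overwriting "scale".
            scales.append(0.0)
    return scales
```

For example, with a speech adapter tied to `{"audio"}`, a text-only request yields an effective scale of 0.0, while a request containing audio keeps the user-configured scale.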

Related Work

The existing PR (#22101) by @ReinforcedKnowledge for Granite Speech adds support for ibm-granite/granite-speech-3.3-2b. I'm not aware of any other models that use this pattern and are already supported, so while this PR is still in review, there are no existing test models to verify the functionality with.

Testing

I have a temporary merge point between this branch and #22101 where I've tested the ability for granite-speech-3.3-2b to leverage its conditional adapter. With this combination (using the conversion steps in my comment here), I've tested the following scenarios:

Text Only Request

curl http://localhost:9696/chat/completions -d '{"model": "granite-speech-3.3-2B-BF16.gguf", "temperature": 0.0, "messages": [{"role": "user", "content": "Tell me a story about a developer and their dog"}]}' | jq -r ".choices[0].message.content"

ASR Request

# NOTE: Using inline python to get around curl request length limit
python -c 'import base64, requests, json
from pathlib import Path
audio = base64.b64encode(open(Path("~/models/ibm-granite/granite-4.0-1b-speech/multilingual_sample.wav").expanduser(), "rb").read()).decode("utf-8")
print(requests.post("http://localhost:9696/chat/completions", json={
    "model": "granite-speech-3.3-2B-BF16.gguf",
    "temperature": 0.0,
    "messages": [{
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": audio,
                    "format": "wav"
                }
            },
            {"type": "text", "text": "can you transcribe the speech into a written format?"}
        ]
    }]
}).json()["choices"][0]["message"]["content"])'

Run without adapter -> Good Text / Bad ASR

./bin/llama-server -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --port 9696

Text Response

Once upon a time, in the bustling city of San Francisco, lived a brilliant developer named Alex. Alex was known for their exceptional skills in coding and creating innovative software solutions. However, what many didn't know was Alex's equally endearing companion, a golden retriever named Max.

Max was not just any dog; he was Alex's constant coding companion. While other developers might take breaks to socialize or exercise, Alex would often bring Max to their workspace. Max would sit patiently at Alex's feet, his tail wagging in anticipation as his owner typed away at their keyboard.

One day, Alex was working on a complex project involving machine learning algorithms. The task was daunting, and the deadlines were looming. As Alex delved deeper into the code, they found themselves stuck in a rut, unable to make progress.

Max, sensing his owner's frustration, nudged his paw against Alex's hand. With a sigh, Alex decided to take a break and spent the afternoon playing fetch with Max in the nearby park. The fresh air and exercise did wonders for Alex's mind.

Returning to the office, Alex felt refreshed and reinvigorated. They approached the problem with a new perspective, and after a few more hours of focused work, they finally cracked the code. The solution was elegant, efficient, and had never occurred to Alex when they were focused on the problem.

Word of Alex's innovative solution spread throughout the tech community. The project was a success, and Alex's reputation as a developer soared. But more importantly, Max became a symbol of the creative spark that could emerge from even the most mundane activities, like playing fetch in the park.

From then on, Max was no longer just a loyal canine companion but a co-creator, a source of inspiration, and a reminder that sometimes, the best ideas come from taking a break and allowing our minds to wander. And so, the developer and their dog continued their partnership, crafting innovative software and sharing countless adventures in the heart of San Francisco.

ASR Response

Sure, I'd be happy to transcribe the speech into written format. Here's the transcription:

---

For Timothy was a spoiled cat, and he allowed no one to interfere. Everybody waited upon him, moving their chairs even, for he was the monarch of the hearth.

The next night, Timothy's sister called him when he was still awake. "Sister," he said, "if you don't sleep, I beg you, wait until the day that will soon appear to continue the tale of the pecker."

---

This transcription maintains the original rhythm and tone of the text, preserving the poetic language and the sense of formality in the dialogue.

Run with unconditional adapter -> Bad Text / Good ASR

./bin/llama-server -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --port 9696 --lora ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16-LoRA.gguf

Text Response

(empty newline)



ASR Response

for timothy was a spoiled cat and he allowed no one to interfere everybody waited upon him moving their chairs even for he was monarch of the hearth dinarzade la nuit suivante appela sa soeur quand il en fut temps si vous ne dormez pas ma soeur lui dit-elle je vous prie en attendant le jour qui paraîtra bientôt de continuer le compte du pêcheur

Run with conditional adapter -> Good Text / Good ASR

./bin/llama-server -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --port 9696 --lora ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16-LoRA.gguf --lora-modality 0:audio

Text Response

Once upon a time, in the bustling city of San Francisco, lived a brilliant developer named Alex. Alex was known for their exceptional skills in coding and creating innovative software solutions. However, what many didn't know was Alex's equally endearing companion, a golden retriever named Max.

Max was not just any dog; he was Alex's constant coding companion. While other developers might take breaks to socialize or exercise, Alex would often bring Max to their workspace. Max would sit patiently at Alex's feet, his tail wagging in anticipation as his owner typed away at their keyboard.

One day, Alex was working on a complex project involving machine learning algorithms. The task was daunting, and the deadlines were looming. As Alex delved deeper into the code, they found themselves stuck in a rut, unable to make progress.

Max, sensing his owner's frustration, nudged his paw against Alex's hand. With a sigh, Alex decided to take a break and spent the afternoon playing fetch with Max in the nearby park. The fresh air and exercise did wonders for Alex's mind.

Returning to the office, Alex felt refreshed and reinvigorated. They approached the problem with a new perspective, and after a few more hours of focused work, they finally cracked the code. The solution was elegant, efficient, and had never occurred to Alex when they were focused on the problem.

Word of Alex's innovative solution spread throughout the tech community. The project was a success, and Alex's reputation as a developer soared. But more importantly, Max became a symbol of the creative spark that could emerge from even the most mundane activities, like playing fetch in the park.

From then on, Max was no longer just a loyal canine companion but a co-creator, a source of inspiration, and a reminder that sometimes, the best ideas come from taking a break and allowing our minds to wander. And so, the developer and their dog continued their partnership, crafting innovative software and sharing countless adventures in the heart of San Francisco.

ASR Response

for timothy was a spoiled cat and he allowed no one to interfere everybody waited upon him moving their chairs even for he was monarch of the hearth dinarzade la nuit suivante appela sa soeur quand il en fut temps si vous ne dormez pas ma soeur lui dit-elle je vous prie en attendant le jour qui paraîtra bientôt de continuer le compte du pêcheur

Additional information

Requirements

AI Usage Disclosure

For this work, I used a combination of IBM Bob and Open Code with qwen3.5:122b running in Ollama. Bob was used primarily for planning while OC+qwen3.5 was used primarily for implementation.

I annotated each commit with AI-usage: [full, draft, none] (<agent>) based on how I used my assistants (full -> unaltered agent output, draft -> edited agent output, none -> no agent usage). This is a convention I've been using to track my ownership. Every commit, regardless of agent generation, was fully reviewed and (if needed) edited before committing. I have a small tool git-ai-stats to track the breakdown of commits by agent, usage type, and lines of code.

git-ai-stats output
╔══════════════════════════════════════════════════════════╗
║           GIT AI USAGE ANALYSIS                          ║
╚══════════════════════════════════════════════════════════╝

📊 COMMITS BY AGENT

--- Aggregate ---
Commits                        |      Count
---------------------------------------------
none                           |          7
IBM Bob, OpenCode + qwen3.5:122b |          6
---------------------------------------------
TOTAL                          |         13

📊 COMMITS BY USAGE TYPE

--- Aggregate ---
Commits                        |      Count
---------------------------------------------
none                           |          7
draft                          |          3
full                           |          3
---------------------------------------------
TOTAL                          |         13

📈 LINES OF CODE BY AGENT

--- Aggregate ---
Agent                     |    Commits |  Additions |  Deletions
------------------------------------------------------------
none                      |          7 |        808 |          6
IBM Bob, OpenCode + qwen3.5:122b |          6 |        277 |          2
------------------------------------------------------------
TOTAL                     |         13 |       1085 |          8

📈 LINES OF CODE BY USAGE TYPE

--- Aggregate ---
Usage Type           |    Commits |  Additions |  Deletions
-------------------------------------------------------
none                 |          7 |        808 |          6
draft                |          3 |         93 |          2
full                 |          3 |        184 |          0
-------------------------------------------------------
TOTAL                |         13 |       1085 |          8

In common, we store the modalities as strings which will be mapped to enums
once handled in mtmd.

Branch: ModalityConditionalAdapters
AI-usage: draft (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: full (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Since there can be invalid input from user requests, this will be the
default/invalid value for anything that can't be processed.

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…y chunks

Branch: ModalityConditionalAdapters
AI-usage: full (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: draft (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

This allows the lora to be toggled on/off without losing the value of scale
that may have been set by the user intentionally.

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…e of modality tokens

Branch: ModalityConditionalAdapters
AI-usage: draft (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: full (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart
Collaborator Author

Looks like some platform-specific code in the tests. Will fix shortly.

Branch: temporary/GraniteVisionModular
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@ggerganov
Member

Open Code with qwen3.5:122b running in Ollama

Why not use llama.cpp with OpenCode?

@gabe-l-hart
Collaborator Author

Why not use llama.cpp with OpenCode?

😁 I knew I was going to get in trouble for this! Truth be told, Ollama got baked into my fingers a long time ago before the multi-model serving ecosystem was working here. It hasn't broken yet. I am actively working on transitioning my scripting ecosystem over though (just ask @0cc4m. He's been on me about this since joining RH).

@gabe-l-hart
Collaborator Author

Looking over the discussion in #13693, it seems like what I have here is a subset of the proposed automatic-switching solution originally proposed by @CISC that was eventually decided against. It seems like this leaves two things to consider:

  1. Is this sort of adapter swapping better handled explicitly in the user requests?
    • That seemed to be the consensus of the earlier PR
    • I would argue that for multimodality specifically, users would find it burdensome to have to opt into the adapter IFF they are presenting the modality since this wouldn't be a requirement of most multimodal models and would require client-side code changes that are model-specific.
  2. Assuming we do want this feature, I should probably be using the llama_adapter_lora::gguf_kv and common_adapter_lora_info::task_name values that exist rather than adding the new mmlora_modality_types field.
    • This would require restricting a given adapter to a single modality or extending the task_name field to be a vector

@github-actions github-actions bot added testing Everything test related examples server labels Apr 20, 2026
@ngxson
Contributor

ngxson commented Apr 20, 2026

I'm a bit low on availability, so I'm just reading the discussion quickly and may have missed something. But here are my 2c:

I would argue that for multimodality specifically, users would find it burdensome to have to opt into the adapter IFF they are presenting the modality since this wouldn't be a requirement of most multimodal models and would require client-side code changes that are model-specific.

Not quite sure if I understand this correctly, but IMO we should offer a better UX by automatically loading the built-in lora (opt-in by default, as you mentioned).

The main problem is that most people are already familiar with using llama with a text model file plus an mmproj file. Imagine someone pretty new to llama.cpp wants to try your model: there is a good chance they will skip loading the lora (as they don't know what it is), get bad results, then assume the model is broken somehow.

Assuming we do want this feature, I should probably be using the llama_adapter_lora::gguf_kv and common_adapter_lora_info::task_name values that exist rather than adding the new mmlora_modality_types field.

Yes, it is better to reuse the existing task_name

This would require restricting a given adapter to a single modality or extending the task_name field to be a vector

Yes, I think it should be a std::set<std::string>. And even better, std::set<enum lora_task_type>, so that we can explicitly define which tasks we support in the code base (better documentation)

For the gguf field, task_name can be either a string (for backward compat) or an array of strings
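That string-or-array read can be sketched as follows. This is an illustrative Python helper with a hypothetical name (the codebase itself is C++, where the result would be a `std::set<std::string>`):

```python
# Hypothetical sketch: normalize a GGUF task_name metadata value that may be
# either a bare string (backward compat) or an array of strings.

def parse_task_names(value):
    """Return the set of task-name strings encoded in `value`."""
    if isinstance(value, str):
        # Old-style field: a single task name.
        return {value}
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        # New-style field: one adapter may serve several tasks/modalities.
        return set(value)
    raise ValueError(f"unsupported task_name value: {value!r}")
```

Existing single-task adapters keep working unchanged, while a multi-modality adapter can declare, e.g., both audio and vision tasks in one field.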


API-design-wise, I think these points could be some good additions to the lora support in llama.cpp. Ranging from easy to hard:

  1. Add enum lora_task_type to lock-in the types of adapter
  2. Add a new API to the core library: llama_adapter_lora_get_task_type that returns the enum; this hides the raw string from end-user
  3. Upon creating mtmd_context, it searches for lora adapter(s) with the given type(s) and store the pointer-to-adapter inside the context
  4. Add mtmd_pre_decode() call to setup the lora, and mtmd_post_decode() to clean it up
  5. Extend libllama to store adapters and main model inside the same GGUF, maybe lora tensors prefixed by lora.{task_name}.*, then having an API like llama_model_get_adapter_lora(enum lora_task_type) to retrieve it (returns nullptr if it doesn't exist)
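Points 3 and 4 above can be sketched as a toy lifecycle. This is a Python stand-in for the proposal only: `mtmd_pre_decode`/`mtmd_post_decode` and the field names here come from the suggestion above, not from any existing API.

```python
# Hypothetical sketch of the proposed mtmd adapter lifecycle (points 3-4).

class MtmdContext:
    """Toy stand-in for an mtmd context that resolves adapters by task type
    at creation time and toggles them around each decode call."""

    def __init__(self, adapters, task_types):
        # Point 3: search for adapters with the given type(s) and store
        # references to them inside the context.
        self.active = [a for a in adapters if a["task"] in task_types]

    def pre_decode(self):
        # Point 4: enable the matched adapters before decoding modality input.
        for adapter in self.active:
            adapter["enabled"] = True

    def post_decode(self):
        # Point 4: clean up, restoring the text-only state afterwards.
        for adapter in self.active:
            adapter["enabled"] = False
```

The appeal of this shape is that callers never touch adapter state directly: the context owns which adapters belong to which modality and guarantees they are disabled again after the decode.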

@gabe-l-hart
Collaborator Author

Thanks for the thoughts @ngxson! I'll get the basic refactoring to reuse task_name done soon and explore the use of an enum. My only push-back on that one is that some models may define their own task types with their own activation sequences (e.g. an adapter that activates a new builtin tool), so using an enum would imply that task types are a static attribute of the codebase rather than an attribute of the model. I'd need to think a little more on this, though. The existing aLoRA implementation would already support this kind of user-defined task adapter that only activates when needed.
