
Modality conditional adapters #22184

Open
gabe-l-hart wants to merge 11 commits into ggml-org:master from gabe-l-hart:ModalityConditionalAdapters

Conversation

@gabe-l-hart
Collaborator

@gabe-l-hart gabe-l-hart commented Apr 20, 2026

Overview

This PR introduces a new mechanism for automatically toggling a LoRA adapter based on the presence of one or more modalities tied to that adapter. This is a required feature for serving modular models such as ibm-granite/granite-speech-3.3-2b and ibm-granite/granite-4.0-3b-vision, where the base LLM is preserved and modality support is added through the adapter. Without this, a modular model must be booted in either text mode or modality mode. With this change, the model can be booted once and used in either mode based on the presence of the modality in the request.
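To make the intended behavior concrete, here is a minimal sketch of the toggling decision. This is purely illustrative Python (the PR itself is C++, and the function and field names below are made up): an adapter tied to one or more modalities is applied only when at least one of those modalities appears in the request, and toggling it off zeroes the effective scale without discarding the user-set value.

```python
# Hypothetical sketch, NOT the PR's actual implementation: decide which LoRA
# adapters are active for a request based on the modalities it contains.

def effective_scales(adapters, request_modalities):
    """Return the scale to apply per adapter for this request.

    adapters: list of dicts with a user-set "scale" and a "modalities" set;
    an empty set means the adapter is unconditional (always active).
    request_modalities: set of modality names present in the request.
    """
    scales = []
    for adapter in adapters:
        tied = adapter["modalities"]
        if not tied or tied & request_modalities:
            # Modality present (or unconditional): keep the user's scale.
            scales.append(adapter["scale"])
        else:
            # Modality absent: toggle off without overwriting "scale".
            scales.append(0.0)
    return scales
```

For example, with a speech adapter tied to `{"audio"}`, a text-only request yields an effective scale of 0.0, while a request containing audio keeps the user-configured scale.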

Related Work

The existing PR (#22101) by @ReinforcedKnowledge for Granite Speech adds support for ibm-granite/granite-speech-3.3-2b. I'm not aware of any other models that use this pattern and are already supported, so while this PR is still in review, there are no existing test models to verify the functionality with.

Testing

I have a temporary merge point between this branch and #22101 where I've tested the ability for granite-speech-3.3-2b to leverage its conditional adapter. With this combination (using the conversion steps in my comment here), I've tested the following scenarios:

Text Only Request

curl http://localhost:9696/chat/completions -d '{"model": "granite-speech-3.3-2B-BF16.gguf", "temperature": 0.0, "messages": [{"role": "user", "content": "Tell me a story about a developer and their dog"}]}' | jq -r ".choices[0].message.content"

ASR Request

# NOTE: Using inline python to get around curl request length limit
python -c 'import base64, requests, json
from pathlib import Path
audio = base64.b64encode(open(Path("~/models/ibm-granite/granite-4.0-1b-speech/multilingual_sample.wav").expanduser(), "rb").read()).decode("utf-8")
print(requests.post("http://localhost:9696/chat/completions", json={
    "model": "granite-speech-3.3-2B-BF16.gguf",
    "temperature": 0.0,
    "messages": [{
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": audio,
                    "format": "wav"
                }
            },
            {"type": "text", "text": "can you transcribe the speech into a written format?"}
        ]
    }]
}).json()["choices"][0]["message"]["content"])'

Run without adapter -> Good Text / Bad ASR

./bin/llama-server -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --port 9696

Text Response

Once upon a time, in the bustling city of San Francisco, lived a brilliant developer named Alex. Alex was known for their exceptional skills in coding and creating innovative software solutions. However, what many didn't know was Alex's equally endearing companion, a golden retriever named Max.

Max was not just any dog; he was Alex's constant coding companion. While other developers might take breaks to socialize or exercise, Alex would often bring Max to their workspace. Max would sit patiently at Alex's feet, his tail wagging in anticipation as his owner typed away at their keyboard.

One day, Alex was working on a complex project involving machine learning algorithms. The task was daunting, and the deadlines were looming. As Alex delved deeper into the code, they found themselves stuck in a rut, unable to make progress.

Max, sensing his owner's frustration, nudged his paw against Alex's hand. With a sigh, Alex decided to take a break and spent the afternoon playing fetch with Max in the nearby park. The fresh air and exercise did wonders for Alex's mind.

Returning to the office, Alex felt refreshed and reinvigorated. They approached the problem with a new perspective, and after a few more hours of focused work, they finally cracked the code. The solution was elegant, efficient, and had never occurred to Alex when they were focused on the problem.

Word of Alex's innovative solution spread throughout the tech community. The project was a success, and Alex's reputation as a developer soared. But more importantly, Max became a symbol of the creative spark that could emerge from even the most mundane activities, like playing fetch in the park.

From then on, Max was no longer just a loyal canine companion but a co-creator, a source of inspiration, and a reminder that sometimes, the best ideas come from taking a break and allowing our minds to wander. And so, the developer and their dog continued their partnership, crafting innovative software and sharing countless adventures in the heart of San Francisco.

ASR Response

Sure, I'd be happy to transcribe the speech into written format. Here's the transcription:

---

For Timothy was a spoiled cat, and he allowed no one to interfere. Everybody waited upon him, moving their chairs even, for he was the monarch of the hearth.

The next night, Timothy's sister called him when he was still awake. "Sister," he said, "if you don't sleep, I beg you, wait until the day that will soon appear to continue the tale of the pecker."

---

This transcription maintains the original rhythm and tone of the text, preserving the poetic language and the sense of formality in the dialogue.

Run with unconditional adapter -> Bad Text / Good ASR

./bin/llama-server -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --port 9696 --lora ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16-LoRA.gguf

Text Response

(empty newline)



ASR Response

for timothy was a spoiled cat and he allowed no one to interfere everybody waited upon him moving their chairs even for he was monarch of the hearth dinarzade la nuit suivante appela sa soeur quand il en fut temps si vous ne dormez pas ma soeur lui dit-elle je vous prie en attendant le jour qui paraîtra bientôt de continuer le compte du pêcheur

Run with conditional adapter -> Good Text / Good ASR

./bin/llama-server -m ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16.gguf --mmproj ~/models/granite-speech-3.3-2b/mmproj-granite-speech-3.3-2b-BF16.gguf --port 9696 --lora ~/models/granite-speech-3.3-2b/granite-speech-3.3-2B-BF16-LoRA.gguf --lora-modality 0:audio

Text Response

Once upon a time, in the bustling city of San Francisco, lived a brilliant developer named Alex. Alex was known for their exceptional skills in coding and creating innovative software solutions. However, what many didn't know was Alex's equally endearing companion, a golden retriever named Max.

Max was not just any dog; he was Alex's constant coding companion. While other developers might take breaks to socialize or exercise, Alex would often bring Max to their workspace. Max would sit patiently at Alex's feet, his tail wagging in anticipation as his owner typed away at their keyboard.

One day, Alex was working on a complex project involving machine learning algorithms. The task was daunting, and the deadlines were looming. As Alex delved deeper into the code, they found themselves stuck in a rut, unable to make progress.

Max, sensing his owner's frustration, nudged his paw against Alex's hand. With a sigh, Alex decided to take a break and spent the afternoon playing fetch with Max in the nearby park. The fresh air and exercise did wonders for Alex's mind.

Returning to the office, Alex felt refreshed and reinvigorated. They approached the problem with a new perspective, and after a few more hours of focused work, they finally cracked the code. The solution was elegant, efficient, and had never occurred to Alex when they were focused on the problem.

Word of Alex's innovative solution spread throughout the tech community. The project was a success, and Alex's reputation as a developer soared. But more importantly, Max became a symbol of the creative spark that could emerge from even the most mundane activities, like playing fetch in the park.

From then on, Max was no longer just a loyal canine companion but a co-creator, a source of inspiration, and a reminder that sometimes, the best ideas come from taking a break and allowing our minds to wander. And so, the developer and their dog continued their partnership, crafting innovative software and sharing countless adventures in the heart of San Francisco.

ASR Response

for timothy was a spoiled cat and he allowed no one to interfere everybody waited upon him moving their chairs even for he was monarch of the hearth dinarzade la nuit suivante appela sa soeur quand il en fut temps si vous ne dormez pas ma soeur lui dit-elle je vous prie en attendant le jour qui paraîtra bientôt de continuer le compte du pêcheur

Additional information

Requirements

AI Usage Disclosure

For this work, I used a combination of IBM Bob and Open Code with qwen3.5:122b running in Ollama. Bob was used primarily for planning while OC+qwen3.5 was used primarily for implementation.

I annotated each commit with AI-usage: [full, draft, none] (<agent>) based on how I used my assistants (full -> unaltered agent output, draft -> edited agent output, none -> no agent usage). This is a convention I've been using to track my ownership. Every commit, regardless of agent generation, was fully reviewed and (if needed) edited before committing. I have a small tool git-ai-stats to track the breakdown of commits by agent, usage type, and lines of code.

git-ai-stats output
╔══════════════════════════════════════════════════════════╗
║           GIT AI USAGE ANALYSIS                          ║
╚══════════════════════════════════════════════════════════╝

📊 COMMITS BY AGENT

--- Aggregate ---
Commits                        |      Count
---------------------------------------------
none                           |          7
IBM Bob, OpenCode + qwen3.5:122b |          6
---------------------------------------------
TOTAL                          |         13

📊 COMMITS BY USAGE TYPE

--- Aggregate ---
Commits                        |      Count
---------------------------------------------
none                           |          7
draft                          |          3
full                           |          3
---------------------------------------------
TOTAL                          |         13

📈 LINES OF CODE BY AGENT

--- Aggregate ---
Agent                     |    Commits |  Additions |  Deletions
------------------------------------------------------------
none                      |          7 |        808 |          6
IBM Bob, OpenCode + qwen3.5:122b |          6 |        277 |          2
------------------------------------------------------------
TOTAL                     |         13 |       1085 |          8

📈 LINES OF CODE BY USAGE TYPE

--- Aggregate ---
Usage Type           |    Commits |  Additions |  Deletions
-------------------------------------------------------
none                 |          7 |        808 |          6
draft                |          3 |         93 |          2
full                 |          3 |        184 |          0
-------------------------------------------------------
TOTAL                |         13 |       1085 |          8

In common, we store the modalities as strings which will be mapped to enums
once handled in mtmd.

Branch: ModalityConditionalAdapters
AI-usage: draft (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: full (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Since there can be invalid input from user requests, this will be the
default/invalid value for anything that can't be processed.

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…y chunks

Branch: ModalityConditionalAdapters
AI-usage: full (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: draft (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

This allows the lora to be toggled on/off without losing the value of scale
that may have been set by the user intentionally.

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

…e of modality tokens

Branch: ModalityConditionalAdapters
AI-usage: draft (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Branch: ModalityConditionalAdapters
AI-usage: full (Bob, OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart
Collaborator Author

Looks like some platform-specific code in the tests. Will fix shortly.

Branch: temporary/GraniteVisionModular
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@ggerganov
Member

Open Code with qwen3.5:122b running in Ollama

Why not use llama.cpp with OpenCode?

@gabe-l-hart
Collaborator Author

Why not use llama.cpp with OpenCode?

😁 I knew I was going to get in trouble for this! Truth be told, Ollama got baked into my fingers a long time ago before the multi-model serving ecosystem was working here. It hasn't broken yet. I am actively working on transitioning my scripting ecosystem over though (just ask @0cc4m. He's been on me about this since joining RH).

@gabe-l-hart
Collaborator Author

Looking over the discussion in #13693, it seems like what I have here is a subset of the proposed automatic-switching solution originally proposed by @CISC that was eventually decided against. It seems like this leaves two things to consider:

  1. Is this sort of adapter swapping better handled explicitly in the user requests?
    • That seemed to be the consensus of the earlier PR
    • I would argue that for multimodality specifically, users would find it burdensome to have to opt into the adapter IFF they are presenting the modality since this wouldn't be a requirement of most multimodal models and would require client-side code changes that are model-specific.
  2. Assuming we do want this feature, I should probably be using the llama_adapter_lora::gguf_kv and common_adapter_lora_info::task_name values that exist rather than adding the new mmlora_modality_types field.
    • This would require restricting a given adapter to a single modality or extending the task_name field to be a vector

@github-actions github-actions bot added testing Everything test related examples server labels Apr 20, 2026
@ngxson
Contributor

ngxson commented Apr 20, 2026

I'm a bit low on availability, so I'm just reading the discussion quickly and may have missed something. But here are my 2c:

I would argue that for multimodality specifically, users would find it burdensome to have to opt into the adapter IFF they are presenting the modality since this wouldn't be a requirement of most multimodal models and would require client-side code changes that are model-specific.

Not quite sure if I understand this correctly, but IMO we should offer a better UX by automatically loading the built-in lora (opt-in by default, as you mentioned).

The main problem is that most people are already familiar with using llama with a text model file plus an mmproj file. Imagine someone pretty new to llama.cpp wants to try your model: there is a good chance they will skip loading the lora (as they don't know what it is), get bad results, then assume the model is broken somehow.

Assuming we do want this feature, I should probably be using the llama_adapter_lora::gguf_kv and common_adapter_lora_info::task_name values that exist rather than adding the new mmlora_modality_types field.

Yes, it is better to reuse the existing task_name

This would require restricting a given adapter to a single modality or extending the task_name field to be a vector

Yes, I think it should be a std::set<std::string>. And even better, std::set<enum lora_task_type>, so that we can explicitly define which tasks we support in the code base (better documentation)

For the gguf field, task_name can be either a string (for backward compat) or an array of strings
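That string-or-array read can be sketched as follows. This is an illustrative Python helper with a hypothetical name (the codebase itself is C++, where the result would be a `std::set<std::string>`):

```python
# Hypothetical sketch: normalize a GGUF task_name metadata value that may be
# either a bare string (backward compat) or an array of strings.

def parse_task_names(value):
    """Return the set of task-name strings encoded in `value`."""
    if isinstance(value, str):
        # Old-style field: a single task name.
        return {value}
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        # New-style field: one adapter may serve several tasks/modalities.
        return set(value)
    raise ValueError(f"unsupported task_name value: {value!r}")
```

Existing single-task adapters keep working unchanged, while a multi-modality adapter can declare, e.g., both audio and vision tasks in one field.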


API-design-wise, I think these points could be some good additions to the lora support in llama.cpp. Ranging from easy to hard:

  1. Add enum lora_task_type to lock-in the types of adapter
  2. Add a new API to the core library: llama_adapter_lora_get_task_type that returns the enum; this hides the raw string from end-user
  3. Upon creating mtmd_context, it searches for lora adapter(s) with the given type(s) and store the pointer-to-adapter inside the context
  4. Add mtmd_pre_decode() call to setup the lora, and mtmd_post_decode() to clean it up
  5. Extend libllama to store adapters and main model inside the same GGUF, maybe lora tensors prefixed by lora.{task_name}.*, then having an API like llama_model_get_adapter_lora(enum lora_task_type) to retrieve it (returns nullptr if it doesn't exist)
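Points 3 and 4 above can be sketched as a toy lifecycle. This is a Python stand-in for the proposal only: `mtmd_pre_decode`/`mtmd_post_decode` and the field names here come from the suggestion above, not from any existing API.

```python
# Hypothetical sketch of the proposed mtmd adapter lifecycle (points 3-4).

class MtmdContext:
    """Toy stand-in for an mtmd context that resolves adapters by task type
    at creation time and toggles them around each decode call."""

    def __init__(self, adapters, task_types):
        # Point 3: search for adapters with the given type(s) and store
        # references to them inside the context.
        self.active = [a for a in adapters if a["task"] in task_types]

    def pre_decode(self):
        # Point 4: enable the matched adapters before decoding modality input.
        for adapter in self.active:
            adapter["enabled"] = True

    def post_decode(self):
        # Point 4: clean up, restoring the text-only state afterwards.
        for adapter in self.active:
            adapter["enabled"] = False
```

The appeal of this shape is that callers never touch adapter state directly: the context owns which adapters belong to which modality and guarantees they are disabled again after the decode.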

@gabe-l-hart
Collaborator Author

Thanks for the thoughts @ngxson! I'll get the basic refactoring to reuse task_name done soon and explore the use of an enum. My only push-back on that one is that some models may define their own task types with their own activation sequences (e.g. an adapter that activates a new builtin tool), so using an enum would imply that task types are a static attribute of the codebase rather than an attribute of the model. I'd need to think a little more on this, though. The existing aLoRA implementation would already support this kind of user-defined task adapter that only activates when needed.
