Modality conditional adapters#22184
Conversation
In common, we store the modalities as strings, which are mapped to enums once handled in mtmd. Branch: ModalityConditionalAdapters AI-usage: draft (Bob, OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Since there can be invalid input from user requests, this will be the default/invalid value for anything that can't be processed.
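The string-to-enum mapping with an invalid default described in the commit notes above could be sketched as follows. This is a hypothetical illustration, not the PR's actual code; the names Modality and parse_modality are assumptions:

```python
# Hypothetical sketch (names are illustrative, not llama.cpp's actual
# identifiers): modalities arrive as strings from user requests and are
# mapped to enums, with an explicit invalid value as the default for
# anything that can't be processed.
from enum import Enum, auto

class Modality(Enum):
    INVALID = auto()  # default for unrecognized/unprocessable input
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()

_MODALITY_FROM_STRING = {
    "text": Modality.TEXT,
    "image": Modality.IMAGE,
    "audio": Modality.AUDIO,
}

def parse_modality(name: str) -> Modality:
    # Unknown strings fall back to Modality.INVALID rather than raising,
    # since user requests can contain arbitrary input.
    return _MODALITY_FROM_STRING.get(name.strip().lower(), Modality.INVALID)

assert parse_modality("audio") is Modality.AUDIO
assert parse_modality("bogus") is Modality.INVALID
```

Falling back to an explicit INVALID member keeps request handling total: bad input is representable rather than an exception path.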
…y chunks
This allows the LoRA to be toggled on/off without losing the scale value that may have been set intentionally by the user.
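The scale-preserving toggle described in the commit above could look something like this. A minimal sketch under assumed names (LoraAdapterState is not the PR's actual type); the idea is to keep the user-configured scale separate from the scale actually applied:

```python
# Hypothetical sketch: toggling a LoRA on/off without losing the
# user-set scale, by storing the configured scale separately from
# the scale applied during inference.
from dataclasses import dataclass

@dataclass
class LoraAdapterState:
    scale: float = 1.0    # user-configured scale, preserved across toggles
    enabled: bool = True  # whether the adapter is currently applied

    @property
    def applied_scale(self) -> float:
        # The scale actually used during inference: 0.0 while disabled.
        return self.scale if self.enabled else 0.0

adapter = LoraAdapterState(scale=0.7)
adapter.enabled = False  # toggle off for a text-only request
assert adapter.applied_scale == 0.0
adapter.enabled = True   # toggle back on when modality tokens appear
assert adapter.applied_scale == 0.7  # user scale preserved
```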
…e of modality tokens
Looks like some platform-specific code in the tests. Will fix shortly.
Why not use llama.cpp with OpenCode?

😁 I knew I was going to get in trouble for this! Truth be told, Ollama got baked into my fingers a long time ago, before the multi-model serving ecosystem was working here. It hasn't broken yet. I am actively working on transitioning my scripting ecosystem over, though (just ask
Looking over the discussion in #13693, it seems like what I have here is a subset of the automatic-switching solution originally proposed by @CISC that was eventually decided against. That leaves two things to consider:
I'm a bit low on availability, so I only read the discussion quickly; I may have missed something. But here are my 2c:
Not quite sure if I understand this correctly, but IMO we should offer a better UX by automatically loading the built-in LoRA (opt-in by default, as you mentioned). The main problem is that most people are already familiar with using llama with a text model file plus an mmproj file. Imagine someone pretty new to llama.cpp who wants to try your model: there is a good chance they will skip loading the LoRA (as they don't know what it is), get bad results, and then assume the model is somehow broken.
Yes, it is better to reuse the existing task_name
Yes, I think it should be a
For the gguf field, API-design-wise, I think these points could be some good additions to the LoRA support in llama.cpp. Ranging from easy to hard:
Thanks for the thoughts @ngxson! I'll get the basic refactoring to reuse
Overview
This PR introduces a new mechanism for automatically toggling LoRA adapters based on the presence of one or more modalities tied to the adapter. This is a required feature for serving modular models such as ibm-granite/granite-speech-3.3-2b and ibm-granite/granite-4.0-3b-vision, where the base LLM is preserved and modality support is added through the adapter. Without this, a modular model must be booted in either text mode or modality mode. With this change, the model can be booted once and used in either mode based on the presence of the modality.
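The core idea above could be sketched end to end as follows. All names here (ConditionalAdapter, effective_scale) are assumptions for illustration, not the PR's actual identifiers: an adapter is tagged with the modalities it serves, and it is applied to a request only when at least one of those modalities is present in the request's inputs:

```python
# Hypothetical sketch of modality-conditional adapter activation:
# an adapter declares its trigger modalities and is applied only when
# the request contains at least one of them.
from enum import Enum, auto

class Modality(Enum):
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()

class ConditionalAdapter:
    def __init__(self, name: str, modalities: set[Modality], scale: float = 1.0):
        self.name = name
        self.modalities = modalities  # modalities that activate this adapter
        self.scale = scale            # user-configured scale when active

    def effective_scale(self, request_modalities: set[Modality]) -> float:
        # Apply the adapter only if any trigger modality is present;
        # otherwise the base LLM runs unmodified.
        return self.scale if self.modalities & request_modalities else 0.0

speech_lora = ConditionalAdapter("granite-speech", {Modality.AUDIO})

# Text-only request: adapter disabled, base LLM behavior preserved.
assert speech_lora.effective_scale({Modality.TEXT}) == 0.0
# ASR request: audio chunks present, adapter applied.
assert speech_lora.effective_scale({Modality.TEXT, Modality.AUDIO}) == 1.0
```

This is what makes the "boot once, serve both modes" behavior possible: the same loaded adapter contributes either its full scale or nothing, decided per request.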
Related Work
The existing PR (#22101) by @ReinforcedKnowledge for Granite Speech adds support for ibm-granite/granite-speech-3.3-2b. I'm not aware of any other models that use this pattern and are already supported, so while that PR is still in review, there are no existing test models to verify the functionality with.
Testing
I have a temporary merge point between this branch and #22101 where I've tested the ability for granite-speech-3.3-2b to leverage its conditional adapter. With this combination (using the conversion steps in my comment here), I've tested the following scenarios:

Text Only Request
ASR Request
Run without adapter -> Good Text / Bad ASR
Text Response
ASR Response
Run with unconditional adapter -> Bad Text / Good ASR
Text Response
(empty newline)
ASR Response
Run with conditional adapter -> Good Text / Good ASR
Text Response
ASR Response
Additional information
Requirements
AI Usage Disclosure
For this work, I used a combination of IBM Bob and Open Code with qwen3.5:122b running in Ollama. Bob was used primarily for planning while OC+qwen3.5 was used primarily for implementation. I annotated each commit with AI-usage: [full, draft, none] (<agent>) based on how I used my assistants (full -> unaltered agent output, draft -> edited agent output, none -> no agent usage). This is a convention I've been using to track my ownership. Every commit, regardless of agent generation, was fully reviewed and (if needed) edited before committing. I have a small tool, git-ai-stats, to track the breakdown of commits by agent, usage type, and lines of code.

git-ai-stats output