add glora#3098
Conversation
…_adapter_config signature (for linting reasons)
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for reviving the GLoRA implementation to PEFT. There is still a lot missing (docs, examples, more tests) but let's focus on the integration for now and work on the rest later.
Unfortunately, this PR seems to be based on the state that PEFT was in when the PR was first suggested. We made several refactors since then, which require different patterns to implement. The good news is that the final result should be much simpler with less code required. I marked the corresponding parts, please check. Maybe this is even something that a coding agent can do if asked to update the implementation and pointed to the most recent PEFT code for reference.
|
|
||
|
|
||
| # Refactored GLoraLinear for PEFT compatibility | ||
| class GLoraLinear(GLoraLayer, nn.Linear): |
There was a problem hiding this comment.
You probably based this PR on the very old original PR. Since then, we had several refactors in PEFT, some of which affect how we designed the adapter layers. Could you please check the latest implementation in tuners/lora/layer or PRs of other recent PEFT additions (e.g. #2851)? Most notably, we now pass the base_layer (i.e. the original layer) and wrap it inside the PEFT layer. To get the results of the base layer, we can then call self.base_layer(x) in the forward call (no need for F.linear(x, self.weight)).
| m.bias.requires_grad = True | ||
|
|
||
|
|
||
| class GLoraModel(BaseTuner): |
There was a problem hiding this comment.
Similar argument as for GLoraLayer: We have substantially refactored this part of PEFT. The good news is that it should greatly simplify the overall implementation: You only need to define _create_and_replace and _create_new_module, the remaining methods should all be fine when inherited from the parent class. Moreover, we need these class attributes:
prefix: str = "glora_"tuner_layer_cls = GloraLayertarget_module_mapping = TRANSFORMERS_MODELS_TO_GLORA_TARGET_MODULES_MAPPING
| self.glora_Cu: nn.ParameterDict = nn.ParameterDict() | ||
| self.glora_D: nn.ParameterDict = nn.ParameterDict() | ||
| self.glora_E: nn.ParameterDict = nn.ParameterDict() | ||
| self.eval_config: dict[str, dict[str, object]] = {} |
There was a problem hiding this comment.
Why do we need this? IIUC, this is supposed to be an object that unifies the settings for each parameter, as we have distinct options like "vector" or "constant". I haven't checked all the details, but ideally we would just define during the layer initialization something like:
if config_A_B == "lora":
self.glora_Ad[adapter_name] = nn.Linear(...)
elif config_A_B == "vector":
self.glora_Ad[adapter_name] = ...The arguments config_A_B etc. should be passed to the __init__ and update_layer methods, directly coming from the GloraConfig.
This change may require some custom nn.Modules but that would be fine. I would really like to avoid the whole prepare_path call during forward and frontload the whole resolution to the initialization. This make the forward/merge/unmerge call simpler to understand and should also be slightly more performant.
There was a problem hiding this comment.
This comment is still relevant.
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
|
Note that due to another PEFT method being merged, there is now a merge conflict, but it should be straightforward to resolve. Once you're finished, please ping me for another review. |
- Updated GloraLayer to inherit from BaseTunerLayer - Enhanced adapter management with new methods and improved type checks. - Refined initialization to ensure compatibility with nn.Linear layers only. - Adjusted merge logic to handle weight and bias more robustly. - Updated tests to skip unsupported quantized layers.
|
@BenjaminBossan |
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for the latest updates. I found that there is still some work left to make GLoRA consistent with other PEFT methods. Please check my comments and don't hesitate to ask if something is unclear.
Also note that in the meantime, we have updated the contribution guidelines for adding new PEFT methods: https://github.com/huggingface/peft/blob/main/docs/source/developer_guides/contributing.md#add-a-new-peft-fine-tuning-method. You already tick many boxes, but please refer to this when it comes to what's still missing.
| self.out_features: int | ||
| base = self.get_base_layer() | ||
| # Use exact type check: bitsandbytes.Linear4bit subclasses nn.Linear but is not compatible with GLORA math. | ||
| if type(base) is not nn.Linear: |
There was a problem hiding this comment.
Let's remove this. The check for the correct type should happen in _create_new_module.
| "config_D_E": peft_config.config_D_E, | ||
| } | ||
| new_module = GloraLinear(target, **kwargs_glora) | ||
| new_module.add_adapter( |
There was a problem hiding this comment.
Again, don't call add_adapter here, new_module = GloraLinear(target, **kwargs_glora) should be enough, as GloraLinear should call self.update_layer.
Also note that, as mentioned earlier, we did a small refactor here, so the peft_config should be passed to GloraLinear's __init__.
| _skip_if_merging_not_supported(test_name, config_cls, config_kwargs_1) | ||
| _skip_tests_with_multiple_adapters_with_target_parameters(config_cls, config_kwargs_2) | ||
|
|
||
| config_1 = config_cls(**config_kwargs_1) |
There was a problem hiding this comment.
This shouldn't be necessary. Why do the results differ?
There was a problem hiding this comment.
The issue is that GLoRA's merge is multiplicative, not purely additive, during multi-active forward, each adapter computes its delta against the original frozen W0 thus the formula :
result = W0@x + (W0*A1 + B1)@x + (W0*A2 + B2)@x
But sequential merging mutates W0 in place between adapters:
- after adapter1:
W0' = W0 + W0*A1 + B1 - after adapter2:
W0'' = W0' + W0'*A2 + B2
The second merge uses the already-modified W0', meaning W0'*A2 expands to (W0 + W0*A1 + B1)*A2 instead of the original W0*A2
LoRA doesn't have this issue because its merge is purely additive (W0 + delta), so i chose to skip the test for GLoRA for now.
There was a problem hiding this comment.
So you're saying that if I have a GLoRA model with two active adapters, it behaves differently when merged vs unmerged? I think this should be treated like a bug. IIUC, it should be possible to implement the merge so that it gives the same result by not merging sequentially but combining the adapters before merging.
That said, if a user wants to merge adapter 1 first and then adapter 2 in a separate call, that would indeed require extra logic, as the results of naively merging the 2nd adapter would not be correct. I see two solutions here:
- First unmerge adapter 1, then merge 1 and 2 together.
- When merging, check if there is already a merged adapter and raise an error, telling users they need to merge all adapters in one step.
WDYT?
There was a problem hiding this comment.
chose to unmerge adapters before merging, didn't want to introduce an edge case for glora
you can review the changes related to this change in 43ceaee
BenjaminBossan
left a comment
There was a problem hiding this comment.
Thanks for the updates. I still found a couple of issues, but the integration is close to being ready. Please take a look.
| self._active_adapter = adapter_name | ||
| self.update_layer(adapter_name, config.r, config=config) | ||
|
|
||
| def _mixed_batch_forward( |
There was a problem hiding this comment.
If you want GLoRA to support mixed adapter batches, you have to also implement _enable_peft_forward_hook on GloraModel, see e.g. here. Also, please ensure that they're also being tested. Please check TestMixedAdapterBatches in test_custom_models.py. It's also fine not to support it, in which case you can just remove _mixed_batch_forward and the adapter_names = kwargs.pop("adapter_names", None) part in forward.
There was a problem hiding this comment.
did implement this in 66c57fa
IMO this would look better if we can turn it into a mixin, maybe even open an issue and try to integrate mixed adapter batching into the rest of the peft methods. But i digress, let me know what you feel about this and feel free to resolve this thread since this is already done
There was a problem hiding this comment.
I agree that would be a nicer solution, though I think not quite trivial to implement. But that's for a separate PR. I'll resolve the discussion with the next review.
| _skip_if_merging_not_supported(test_name, config_cls, config_kwargs_1) | ||
| _skip_tests_with_multiple_adapters_with_target_parameters(config_cls, config_kwargs_2) | ||
|
|
||
| config_1 = config_cls(**config_kwargs_1) |
There was a problem hiding this comment.
So you're saying that if I have a GLoRA model with two active adapters, it behaves differently when merged vs unmerged? I think this should be treated like a bug. IIUC, it should be possible to implement the merge so that it gives the same result by not merging sequentially but combining the adapters before merging.
That said, if a user wants to merge adapter 1 first and then adapter 2 in a separate call, that would indeed require extra logic, as the results of naively merging the 2nd adapter would not be correct. I see two solutions here:
- First unmerge adapter 1, then merge 1 and 2 together.
- When merging, check if there is already a merged adapter and raise an error, telling users they need to merge all adapters in one step.
WDYT?
follow up on #780 and #2568
this pr adds GLoRA to the library
i also made a minor colab notebook to test this remotely in which you can find here