
fix: migrate from IBM alora to PEFT 0.18.1 native aLoRA #422

Draft
planetf1 wants to merge 1 commit into generative-computing:main from planetf1:fix/issue-385-peft-migration

Conversation


planetf1 commented Feb 6, 2026

Fix: Migrate m train to PEFT 0.18.1 Native aLoRA

Description

Migrates the m train command from IBM's deprecated alora==0.2.0 package to PEFT 0.18.1+ native aLoRA support. This removes an external dependency and uses the officially supported PEFT API.

Key Changes:

  • Removed alora==0.2.0 dependency
  • Updated to peft>=0.18.1
  • Replaced IBM-specific imports with PEFT native API (LoraConfig, get_peft_model)
  • Updated to use alora_invocation_tokens parameter (list of token IDs) instead of invocation_string
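The key change above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the base model name, rank, target modules, and invocation string are all placeholder assumptions; only the LoraConfig/get_peft_model API and the alora_invocation_tokens parameter come from the description.

```python
# Sketch of the PEFT >= 0.18.1 native aLoRA setup that replaces the IBM alora package.
# Model name, r, target_modules, and the invocation string are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "ibm-granite/granite-3.3-8b-instruct"  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# PEFT's native aLoRA activates on a sequence of token IDs rather than a raw
# string, so the former invocation_string must be tokenized first.
invocation_string = "<|start_of_role|>assistant<|end_of_role|>"
invocation_tokens = tokenizer.encode(invocation_string, add_special_tokens=False)

config = LoraConfig(
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    alora_invocation_tokens=invocation_tokens,  # replaces IBM's invocation_string
)
model = get_peft_model(model, config)
```

Because the adapter config now stores token IDs instead of a string, the same tokenizer must be used at training and inference time for the invocation sequence to match.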

Special Note:

I pinned peft to 0.18.1 rather than 0.18.0 (a minor update), since 0.18.0 has issues with swapping adapters and loading parameters that looked as if they could affect the operations mellea performs.

Hugging Face tests were run on CUDA and pass, except for FAILED test/backends/test_huggingface.py::test_error_during_generate_with_lock, which appears to be a backend bug unrelated to this change.

Todos:

  • Extend test to do inference with mellea backend
  • Fix up alora 101 sample for further verification

Implementation Checklist

Protocol Compliance

  • Maintains backward compatibility - existing adapters work unchanged
  • Only affects training workflow, inference unchanged

Integration

  • Updated cli/alora/train.py with PEFT native API
  • Updated docs/alora.md documentation

Testing

  • Unit tests added to test/cli/test_alora_train.py (4 tests, all passing)
  • Integration tests added to test/cli/test_alora_train_integration.py (2 tests, verified on CUDA)


github-actions bot commented Feb 6, 2026

The PR description has been updated. Please fill out the template for your PR to be reviewed.

planetf1 force-pushed the fix/issue-385-peft-migration branch from 89c4710 to c2fa5c8 on February 6, 2026 at 13:06
mergify bot commented Feb 6, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:


planetf1 commented Feb 6, 2026

Example logs from a run with CUDA (integration + unit tests):

============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /u/jonesn/.conda/envs/mellea/bin/python3
cachedir: .pytest_cache
rootdir: /proj/dmfexp/eiger/users/jonesn/mellea
configfile: pyproject.toml
plugins: nbmake-1.5.5, asyncio-1.3.0, Faker-40.1.2, timeout-2.4.0, langsmith-0.6.6, anyio-4.12.1, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
timeout: 900.0s
timeout method: signal
timeout func_only: False
collecting ... collected 6 items

test/cli/test_alora_train.py::test_alora_config_creation PASSED          [ 16%]
test/cli/test_alora_train.py::test_lora_config_creation PASSED           [ 33%]
test/cli/test_alora_train.py::test_invocation_prompt_tokenization PASSED [ 50%]
test/cli/test_alora_train.py::test_imports_work PASSED                   [ 66%]
test/cli/test_alora_train_integration.py::test_alora_training_integration PASSED [ 83%]
test/cli/test_alora_train_integration.py::test_lora_training_integration PASSED [100%]

=============================== warnings summary ===============================
test/cli/test_alora_train.py::test_alora_config_creation
test/cli/test_alora_train.py::test_lora_config_creation
test/cli/test_alora_train.py::test_invocation_prompt_tokenization
test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /u/jonesn/.conda/envs/mellea/lib/python3.12/site-packages/trl/trainer/sft_config.py:257: DeprecationWarning: `max_seq_length` is deprecated and will be removed in version 0.20.0. Use `max_length` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /u/jonesn/.conda/envs/mellea/lib/python3.12/site-packages/trl/trainer/utils.py:103: DeprecationWarning: This class is deprecated and will be removed in version 0.20.0. To train on completion only, please use the parameter `completion_only_loss` of `SFTConfig` instead.
    warnings.warn(

test/cli/test_alora_train_integration.py::test_alora_training_integration
test/cli/test_alora_train_integration.py::test_lora_training_integration
  /u/jonesn/.conda/envs/mellea/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:678: DeprecationWarning: Failed to apply the formatting function due to the following error: string index out of range. This may be because the function is designed for batched input. Please update it to process one example at a time (i.e., accept and return a single example). For now, we will attempt to apply the function in batched mode, but note that batched formatting is deprecated and will be removed in version 0.21.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html


jakelorocco left a comment


mostly lgtm; a few nits and looks like the tests are failing


planetf1 commented Feb 6, 2026

The new alora tests do fail in CI with:

FAILED test/cli/test_alora_train_integration.py::test_alora_training_integration - NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
FAILED test/cli/test_alora_train_integration.py::test_lora_training_integration - NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
= 2 failed, 239 passed, 109 skipped, 1 xpassed, 25 warnings in 562.56s (0:09:22) =

Investigating

planetf1 force-pushed the fix/issue-385-peft-migration branch 2 times, most recently from 36f4b55 to c741486, on February 6, 2026 at 14:07
Migrated m train command from deprecated IBM alora package to PEFT 0.18+ native aLoRA support.
- Updated dependencies: removed alora==0.2.0, added peft>=0.18.1
- Replaced IBM imports with PEFT native API (LoraConfig, get_peft_model)
- Changed invocation format: invocation_string → alora_invocation_tokens (list of token IDs)
- Added comprehensive test suite: 4 unit tests + 2 integration tests with full adapter verification
- Tests validate config format, weight integrity, adapter loading, and inference with/without activation
planetf1 force-pushed the fix/issue-385-peft-migration branch from c741486 to 3e2a34e on February 6, 2026 at 14:09

planetf1 commented Feb 6, 2026

The CI test failure is caused by the runner not having a GPU; training needs to fall back to CPU when no GPU is available.
I think that's now fixed.

We hit the same issue running locally on Mac ARM (MPS) with the current PyTorch version, which is why we originally skipped the aLoRA tests on Mac. However, that means a Mac user cannot use aLoRA at all, so I'm looking at whether we can detect MPS/back-level PyTorch and fall back to CPU-only with a warning rather than failing.


planetf1 commented Feb 6, 2026

It's quite hard to get initialization in the train CLI working such that, after detecting MPS or a back-level PyTorch, it uses CPU only.

After a few attempts I think an alternative is worthwhile: fail as we do now, but also add a --device=cpu option that forces CPU-only. It could be used on Mac (or any system) when we want to avoid GPU auto-detection.
