
reduce ram&vram usage for vlm calib stage #1488

Merged

WeiweiZhang1 merged 13 commits into main from reduce_vram/ram_usage_for_vlm_in_calib_stage on Mar 9, 2026

Conversation

@WeiweiZhang1 (Contributor) commented Mar 3, 2026

Description

Qwen3-VL-8B-Instruct example:

before: [screenshot]

after: [screenshot]

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #1214

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Copilot AI review requested due to automatic review settings March 3, 2026 06:40
Copilot AI (Contributor) left a comment

Pull request overview

This PR aims to reduce RAM/VRAM usage during the MLLM/VLM calibration stage by limiting dataset item caching and by enabling earlier forward-stop behavior during input caching.

Changes:

  • Add an optional bounded (LRU) runtime cache for MLLM dataset samples, configurable via AR_MLLM_DATASET_CACHE_SIZE, and clear it after calibration.
  • Refactor MLLM dataset instantiation to pass optional cache_size only when supported.
  • Improve early-stop caching logic in the base compressor by inferring last_cache_name for MLLMs and stopping forward once the last target is reached.
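The bounded runtime cache described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual implementation: only the env var name AR_MLLM_DATASET_CACHE_SIZE comes from the PR; the class name, default size, and method names are assumptions.

```python
import os
from collections import OrderedDict

class BoundedSampleCache:
    """Illustrative LRU cache for MLLM dataset samples (hypothetical sketch).

    A size of 0 disables caching entirely, so preprocessed samples are
    never pinned in RAM between calibration steps.
    """

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        if self.max_size <= 0:
            return  # caching disabled
        self._store[key] = value
        self._store.move_to_end(key)
        while len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used

    def clear(self):
        """Drop all cached samples, e.g. after calibration finishes."""
        self._store.clear()

# Cache size read from the env var named in the PR; the default of 256
# is an assumption for this sketch.
cache_size = int(os.environ.get("AR_MLLM_DATASET_CACHE_SIZE", "256"))
```

Clearing the cache after calibration (as the PR does) matters because the calibration samples are otherwise kept alive for the remainder of quantization.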

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Files changed:

  • auto_round/compressors/mllm/dataset.py: Adds bounded LRU caching for dataset __getitem__ and wiring for cache_size via env var.
  • auto_round/compressors/mllm/compressor.py: Reduces VRAM during calib by passing use_cache=False and clears dataset runtime cache post-calib.
  • auto_round/compressors/base.py: Infers the last cache target for MLLMs and uses early-stop to reduce runtime/memory during caching.
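The early-stop behavior in base.py can be illustrated with a common PyTorch pattern: register a forward pre-hook on the last module whose inputs need caching, record the inputs, and raise a sentinel exception so the rest of the forward pass (and its activations) never runs. This is a minimal sketch under that assumption; the class and hook names are hypothetical, not the PR's actual code.

```python
import torch

class StopForward(Exception):
    """Sentinel raised to abort the forward pass once inputs are cached."""

class InputCacher:
    """Cache inputs to `last_cache_name` and stop the forward pass there."""

    def __init__(self, model: torch.nn.Module, last_cache_name: str):
        self.cached = {}
        self.handles = []
        for name, module in model.named_modules():
            if name == last_cache_name:
                self.handles.append(
                    module.register_forward_pre_hook(self._make_hook(name))
                )

    def _make_hook(self, name):
        def hook(module, args):
            self.cached[name] = args  # keep the positional inputs for calib
            raise StopForward  # skip everything after this module
        return hook

    def run(self, model, *inputs):
        try:
            with torch.no_grad():
                model(*inputs)
        except StopForward:
            pass  # expected: we stopped early on purpose
        for h in self.handles:
            h.remove()
```

Stopping at the last cache target means the layers after it never allocate activations, which is where the VRAM saving during input caching comes from.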

@wenhuach21 (Contributor) commented:

Also, please help test an MoE model, such as qwen35-35B; the patching code may introduce some issues.

@WeiweiZhang1 WeiweiZhang1 removed the WIP label Mar 6, 2026
@WeiweiZhang1 WeiweiZhang1 requested review from n1ck-guo and yiliu30 March 6, 2026 07:49
@WeiweiZhang1 WeiweiZhang1 merged commit be38713 into main Mar 9, 2026
29 checks passed
@WeiweiZhang1 WeiweiZhang1 deleted the reduce_vram/ram_usage_for_vlm_in_calib_stage branch March 9, 2026 07:35
  lr (float): The learning rate (default is 0.005).
  minmax_lr (float): The learning rate for min-max tuning (default is None).
- low_gpu_mem_usage (bool): Whether to use low GPU memory (default is False).
+ low_gpu_mem_usage (bool): Whether to use low GPU memory (default is True).
A contributor commented:

why set it to True?

A contributor commented:

@yiliu30 please review the PR carefully.

@WeiweiZhang1 (Contributor, Author) commented Mar 9, 2026:

Because your last modification reset this default to True, this is a doc fix. [screenshot]


5 participants