
release CUDA memory in WeightConverter and avoid meaningless print #1498

Open
xin3he wants to merge 5 commits into main from xinhe/3-5

Conversation

xin3he (Contributor) commented Mar 5, 2026

Description

Release CUDA memory in WeightConverter and avoid a meaningless print.

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #1497

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

@xin3he xin3he requested review from XuehaoSun, Copilot, lvliang-intel and yiliu30 and removed request for Copilot March 5, 2026 02:44
Copilot AI review requested due to automatic review settings March 5, 2026 03:11

Copilot AI left a comment


Pull request overview

This PR addresses a VRAM memory leak during weight conversion in the WeightConverter (issue #1497), where CUDA memory was accumulating across layers because intermediate tensors and original layer data were not being freed. It also refines memory monitoring by moving per-module logging into the unfuse function and adjusting a decorator message.

Changes:

  • Added del dq_weight and layer.to("meta") in all four convert_layer implementations (FP8, MXFP4, MXFP8, NVFP4) to release intermediate CUDA tensors and original layer data after dequantization.
  • Moved memory monitoring from the @dump_mem_usage decorator on _handle_moe_modules into _unfuse_experts_weights_inplace for per-module VRAM tracking.
  • Changed the capitalization of the dump_mem_usage message for _apply_custom_replacements.
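The memory-release pattern described in the first bullet can be sketched roughly as follows. This is a hypothetical, simplified `convert_layer` (the placeholder dequantization and the `nn.Linear` rebuild are assumptions, not the actual WeightConverter logic); the relevant part is the cleanup at the end: deleting the intermediate `dq_weight` tensor and moving the original layer to the `meta` device so its (possibly CUDA) storage is released rather than accumulating across layers.

```python
import torch
import torch.nn as nn


def convert_layer(layer: nn.Linear) -> nn.Linear:
    """Hypothetical sketch of a convert_layer that frees source memory.

    dq_weight stands in for the dequantized intermediate tensor;
    the real WeightConverter handlers (FP8, MXFP4, MXFP8, NVFP4) differ.
    """
    # Placeholder "dequantization": clone the weight tensor.
    dq_weight = layer.weight.detach().clone()

    # Build the converted layer from the dequantized tensor.
    new_layer = nn.Linear(
        layer.in_features, layer.out_features, bias=layer.bias is not None
    )
    with torch.no_grad():
        new_layer.weight.copy_(dq_weight)
        if layer.bias is not None:
            new_layer.bias.copy_(layer.bias)

    # Release the intermediate tensor and move the original layer's
    # parameters to the meta device so their backing memory is freed.
    del dq_weight
    layer.to("meta")
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return new_layer
```

On a CPU-only machine the `empty_cache()` branch is skipped, but `layer.to("meta")` still drops the original parameter storage, which is what prevents the per-layer accumulation reported in #1497.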

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File descriptions:

  • auto_round/utils/weight_handler.py: Added del dq_weight and layer.to("meta") in all four handler convert_layer methods to free CUDA memory after weight dequantization.
  • auto_round/modeling/fused_moe/replace_modules.py: Removed the @dump_mem_usage decorator from _handle_moe_modules; changed capitalization of the _apply_custom_replacements decorator message.
  • auto_round/modeling/fused_moe/moe_experts_interface.py: Added memory_monitor.update() and log_summary() calls before and after the unfusing logic in _unfuse_experts_weights_inplace.
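The per-module VRAM tracking mentioned for moe_experts_interface.py could look roughly like the sketch below. The names `update()` and `log_summary()` come from the diff summary; this class body is an assumption for illustration, not the project's actual memory_monitor implementation.

```python
import torch


class MemoryMonitor:
    """Hypothetical sketch of a VRAM monitor; the real memory_monitor
    object in auto_round may have a different API and implementation."""

    def __init__(self) -> None:
        self.peak_bytes = 0

    def update(self) -> int:
        # Sample currently allocated CUDA memory (0 on CPU-only machines)
        # and track the peak across samples.
        used = torch.cuda.memory_allocated() if torch.cuda.is_available() else 0
        self.peak_bytes = max(self.peak_bytes, used)
        return used

    def log_summary(self) -> str:
        return f"peak VRAM: {self.peak_bytes / 1024**2:.2f} MiB"
```

Calling `update()` before and after unfusing each expert module, then `log_summary()` at the end, gives the per-module granularity that the removed @dump_mem_usage decorator (which only measured the whole _handle_moe_modules call) could not.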


