Conversation
Pull request overview
This PR addresses a VRAM leak during weight conversion in the `WeightConverter` (issue #1497), where CUDA memory accumulated across layers because intermediate tensors and the original layers' data were never freed. It also refines memory monitoring by moving per-module logging into the unfuse function and adjusting a decorator message.
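For context, the symptom is easiest to see by sampling allocated CUDA memory between conversions. A minimal sketch (the `report_vram` helper and the commented loop are illustrative, not part of the PR):

```python
import torch

def report_vram(tag: str) -> None:
    """Print currently allocated CUDA memory, to spot per-layer growth."""
    if torch.cuda.is_available():
        mib = torch.cuda.memory_allocated() / 2**20
        print(f"[vram] {tag}: {mib:.1f} MiB allocated")

# Hypothetical conversion loop: monotonic growth here means earlier
# layers' intermediate tensors are still alive.
# for i, layer in enumerate(layers):
#     converter.convert_layer(layer)
#     report_vram(f"after layer {i}")
```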
Changes:
- Added `del dq_weight` and `layer.to("meta")` in all four `convert_layer` implementations (FP8, MXFP4, MXFP8, NVFP4) to release intermediate CUDA tensors and original layer data after dequantization (see the sketch after this list).
- Moved memory monitoring from the `@dump_mem_usage` decorator on `_handle_moe_modules` into `_unfuse_experts_weights_inplace` for per-module VRAM tracking.
- Changed the capitalization of the `dump_mem_usage` message for `_apply_custom_replacements`.
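A minimal sketch of the release pattern, assuming a plain `nn.Linear` and a stand-in dequantization (the real handlers dequantize FP8/MXFP4/MXFP8/NVFP4 formats):

```python
import torch

def convert_layer(layer: torch.nn.Linear) -> torch.nn.Linear:
    # Stand-in for the real dequantization; dq_weight is the large
    # intermediate CUDA tensor that used to outlive the conversion.
    dq_weight = layer.weight.detach().to(torch.float32)

    new_layer = torch.nn.Linear(layer.in_features, layer.out_features, bias=False)
    new_layer.weight = torch.nn.Parameter(dq_weight.clone())

    # The fix: drop the intermediate tensor and move the original layer's
    # parameters to the meta device so their CUDA storage can be freed
    # per layer instead of accumulating across the whole model.
    del dq_weight
    layer.to("meta")
    return new_layer
```

Moving a module `.to("meta")` replaces its parameters with shapeless meta tensors, so the underlying CUDA storage can be reclaimed by the allocator once no other references remain.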
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `auto_round/utils/weight_handler.py` | Added `del dq_weight` and `layer.to("meta")` in all four handler `convert_layer` methods to free CUDA memory after weight dequantization. |
| `auto_round/modeling/fused_moe/replace_modules.py` | Removed the `@dump_mem_usage` decorator from `_handle_moe_modules`; changed capitalization of the `_apply_custom_replacements` decorator message. |
| `auto_round/modeling/fused_moe/moe_experts_interface.py` | Added `memory_monitor.update()` and `log_summary()` calls before/after the unfusing logic in `_unfuse_experts_weights_inplace` (sketched below). |
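The per-module tracking can be sketched as follows; `MemoryMonitor` here is a hypothetical stand-in for the project's monitor, keeping only the `update()`/`log_summary()` surface the review references:

```python
import torch

class MemoryMonitor:
    """Hypothetical monitor: tracks peak CUDA allocation across update() calls."""

    def __init__(self) -> None:
        self.peak_bytes = 0

    def update(self) -> None:
        if torch.cuda.is_available():
            self.peak_bytes = max(self.peak_bytes, torch.cuda.memory_allocated())

    def log_summary(self) -> None:
        print(f"[mem] peak allocated: {self.peak_bytes / 2**20:.1f} MiB")

def _unfuse_experts_weights_inplace(moe_module, memory_monitor: MemoryMonitor) -> None:
    memory_monitor.update()        # snapshot VRAM before unfusing this module
    # ... split fused expert weights into per-expert tensors in place ...
    memory_monitor.update()        # snapshot VRAM after unfusing
    memory_monitor.log_summary()   # one summary line per MoE module
```

Logging per module rather than once at the decorator level makes it possible to attribute VRAM growth to a specific MoE module.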
Description
Release CUDA memory in the `WeightConverter` and avoid a meaningless print.
Type of Change
Related Issues
Fixes or relates to #1497
Checklist Before Submitting