Support MLA nvfp4 quant for Deepseek for max perf #582
base: main
Conversation
examples/deepseek/ptq.py
Outdated
| mtq_cfg["quant_cfg"][layer+"_quantizer"] = {"enable": False} | ||
|
|
||
| # Disable BMM quantizers | ||
| mtq_cfg["quant_cfg"]["*attn.kv_bmm_quantizer*"] = {"enable": False} |
why do we need to disable BMM quantizers?
We only need changes for generating "v2.1" for now, no need for "v2.2".
Force-pushed from 1386863 to f60a5b7
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main     #582   +/-   ##
=======================================
  Coverage   74.18%   74.18%
=======================================
  Files         192      192
  Lines       19236    19236
=======================================
  Hits        14271    14271
  Misses       4965     4965
```
Signed-off-by: binghanc <176802681+binghanc@users.noreply.github.com>
Force-pushed from 62585b5 to 99b1b9f
📝 Walkthrough
This PR extends MLA quantization support in the PTQ pipeline by introducing a new "nvfp4" quantization option. The changes add an `mla_quant` parameter to the `ptq` function and a matching `--mla_quant` CLI argument in examples/deepseek/ptq.py.
Actionable comments posted: 0
🧹 Nitpick comments (2)
examples/deepseek/ptq.py (2)
314-315: Consider raising `ValueError` instead of `assert` for user input validation. Using `assert` for input validation can be bypassed with `-O` (optimized mode) and produces a less user-friendly error message. A `ValueError` with a descriptive message would be more appropriate for CLI argument validation.
Suggested fix:
```diff
  allowed_mla_quant = [None, "per_tensor_fp8", "nvfp4"]
- assert mla_quant in allowed_mla_quant, f"mla_quant must be {allowed_mla_quant}"
+ if mla_quant not in allowed_mla_quant:
+     raise ValueError(f"mla_quant must be one of {allowed_mla_quant}, got {mla_quant!r}")
```
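To illustrate the concern, a minimal standalone demo (not from the PR) of why `assert` is weak for input validation:

```python
# Standalone demo: under `python -O`, assert statements are stripped
# entirely, so the bad value below would pass through unchecked.
def validate(mla_quant):
    allowed = [None, "per_tensor_fp8", "nvfp4"]
    assert mla_quant in allowed, f"mla_quant must be {allowed}"

try:
    validate("fp16")           # raises AssertionError under normal runs
except AssertionError as e:
    print("caught:", e)        # under `python -O`, nothing is raised here
```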
430-435: Consider using `choices` for CLI-level validation. Using argparse's `choices` parameter would provide immediate feedback to users and generate better `--help` output.
Suggested improvement:
```diff
  parser.add_argument(
      "--mla_quant",
      type=str,
      default=None,
+     choices=["per_tensor_fp8", "nvfp4"],
      help="MLA quantization type: None (disable), per_tensor_fp8, nvfp4",
  )
```
Note: `choices` doesn't directly support `None` for omitted arguments, but leaving it out of choices while keeping `default=None` achieves the desired behavior: when the argument isn't provided, it's `None`; when provided, it must be one of the choices.
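For reference, a self-contained sketch of that argparse behavior (standalone script, not the PR's code):

```python
import argparse

parser = argparse.ArgumentParser()
# `None` stays the default when the flag is omitted; when the flag is
# given, argparse enforces membership in `choices` before any code runs.
parser.add_argument(
    "--mla_quant",
    type=str,
    default=None,
    choices=["per_tensor_fp8", "nvfp4"],
    help="MLA quantization type: None (disable), per_tensor_fp8, nvfp4",
)

print(parser.parse_args([]).mla_quant)                        # None
print(parser.parse_args(["--mla_quant", "nvfp4"]).mla_quant)  # nvfp4
# parser.parse_args(["--mla_quant", "fp16"])  # exits with a usage error
```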
📜 Review details
📒 Files selected for processing (1)
examples/deepseek/ptq.py
🧰 Additional context used
🧬 Code graph analysis (1)
examples/deepseek/ptq.py (1)
modelopt/torch/quantization/qtensor/nvfp4_tensor.py (1)
quantize (lines 143-232)
🔇 Additional comments (4)
examples/deepseek/ptq.py (4)
276-283: LGTM! The new `mla_quant` parameter is properly added with a sensible default of `None` for backward compatibility.
341-347: LGTM! The formatting improvements with blank lines enhance code readability.
442-442: LGTM! The new `mla_quant` argument is correctly propagated to the `ptq` function.
322-339: The code explicitly disables `wkv_b` by design, not by oversight. The implementation intentionally excludes `wkv_b` from NVFP4 quantization. The separate `mla_nvfp4_linear_layers` list contains only four layers (`wq_a`, `wkv_a`, `wq_b`, `wo`), and the comment confirms "wq_a, wkv_a, wq_b, wo use NVFP4 quantization"; `wkv_b` is not included. The codebase contains no FP8 configuration for `wkv_b` in any code path. While the PR description naming may suggest otherwise, the actual implementation shows this exclusion is deliberate.
Likely an incorrect or invalid review comment.
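Read as a hedged sketch of what this comment describes (the pattern key and placeholder config are inferred, not copied from the PR):

```python
# Sketch inferred from the review comment, not the PR's exact code.
mtq_cfg = {"quant_cfg": {}}  # placeholder; the real cfg comes from modelopt

# The four MLA projections that receive NVFP4 quantization:
mla_nvfp4_linear_layers = ["wq_a", "wkv_a", "wq_b", "wo"]

# wkv_b is deliberately excluded; a hypothetical pattern key keeps its
# quantizers disabled (key format mirrors the diff earlier in the thread):
mtq_cfg["quant_cfg"]["*wkv_b*"] = {"enable": False}
```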
What does this PR do?
Type of change: new feature
Overview: adds an "nvfp4" MLA quantization option for DeepSeek to support newer checkpoints at maximum performance.
Usage
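The PR leaves this section blank; a hypothetical invocation based on the new flag (all other arguments depend on the script's existing CLI):

```bash
# Only --mla_quant is introduced by this PR; remaining arguments
# (model path, calibration settings, etc.) follow the script's existing CLI.
python examples/deepseek/ptq.py --mla_quant nvfp4
```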
Summary by CodeRabbit
Release Notes
New feature: an `--mla_quant` option in the DeepSeek PTQ example adds an "nvfp4" quantization mode for MLA layers.