Conversation

@binghanc binghanc commented Nov 20, 2025

What does this PR do?

Type of change: new feature

Overview: add support for quantizing newer DeepSeek-R1 checkpoints (MLA NVFP4 quantization)

Usage

torchrun --nproc-per-node=8 ptq.py --mla_quant nvfp4_wq_a_wkv_a_wq_b_wo_fp8_wkv_b --batch_size 4 --model_path $DS_CKPT --config DeepSeek-V3/inference/configs/config_671B.json --quant_cfg NVFP4_DEFAULT_CFG --output_path $AMAX_PATH

Summary by CodeRabbit

Release Notes

  • New Features
    • Added NVFP4 quantization option for MLA model quantization workflow.
    • Expanded quantization configuration choices to include "nvfp4" alongside the existing per_tensor_fp8 option.
    • Introduced new CLI parameter to specify MLA quantization type during post-training quantization.



copy-pr-bot bot commented Nov 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mtq_cfg["quant_cfg"][layer+"_quantizer"] = {"enable": False}

# Disable BMM quantizers
mtq_cfg["quant_cfg"]["*attn.kv_bmm_quantizer*"] = {"enable": False}
Collaborator

why do we need to disable BMM quantizers?

@binghanc binghanc (Author) commented Jan 9, 2026

In my view, we don't quantize wkv_b or the BMM weights in the Turbo checkpoint. Also, the BMM weights are decomposed from wkv_b in TRT-LLM.
If I have missed something, please correct me. @kaiyux

Member

We only need changes for generating "v2.1" for now, no need for "v2.2".

@binghanc binghanc force-pushed the binghanc/dpskr1_nvfp4_v3 branch from 1386863 to f60a5b7 on January 9, 2026 05:32
@binghanc binghanc marked this pull request as ready for review on January 9, 2026 05:33
@binghanc binghanc requested a review from a team as a code owner on January 9, 2026 05:33
@binghanc binghanc requested a review from meenchen on January 9, 2026 05:33
@binghanc binghanc changed the title from "support for newer checkpoints" to "Feat: Support quantization for DeepSeek-R1-0528-NVFP4-Turbo" on Jan 9, 2026
@binghanc binghanc changed the title from "Feat: Support quantization for DeepSeek-R1-0528-NVFP4-Turbo" to "Feat: Support quantization for new DeepSeek-R1 checkpoints" on Jan 9, 2026

codecov bot commented Jan 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.18%. Comparing base (945ee02) to head (a3a2712).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #582   +/-   ##
=======================================
  Coverage   74.18%   74.18%           
=======================================
  Files         192      192           
  Lines       19236    19236           
=======================================
  Hits        14271    14271           
  Misses       4965     4965           


@binghanc binghanc changed the title from "Feat: Support quantization for new DeepSeek-R1 checkpoints" to "Support MLA nvfp4 quant for Deepseek for max perf" on Jan 10, 2026
Signed-off-by: binghanc <176802681+binghanc@users.noreply.github.com>
@binghanc binghanc force-pushed the binghanc/dpskr1_nvfp4_v3 branch from 62585b5 to 99b1b9f on January 13, 2026 01:55

coderabbitai bot commented Jan 13, 2026

Review skipped: automatic incremental reviews are disabled on this repository. A single review can be triggered with the @coderabbitai review command.

📝 Walkthrough

This PR extends MLA quantization support in the PTQ pipeline by introducing a new "nvfp4" quantization option. The changes add an mla_quant parameter to the quantization flow, validate it against allowed values (None, "per_tensor_fp8", "nvfp4"), assign NVFP4 quantizers to specific linear layers when enabled, and expose the option via CLI argument.

Changes

Cohort / File(s): MLA Quantization Configuration (examples/deepseek/ptq.py)
Summary: Added support for new "nvfp4" quantization type with specific layer assignments (wq_a, wkv_a, wq_b, wo) and BMM quantizer disabling. Expanded validation to enforce allowed values: None, "per_tensor_fp8", "nvfp4". Updated ptq function signature to accept mla_quant: str | None = None parameter and added CLI argument --mla_quant for user configuration.
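
A minimal sketch of the flow summarized above, assuming the structure described in the walkthrough; the helper name configure_mla_quant and the exact quantizer-key patterns are illustrative and not copied from examples/deepseek/ptq.py:

def configure_mla_quant(mtq_cfg: dict, mla_quant: str | None = None) -> dict:
    """Adjust a quantization config dict for the requested MLA quantization type."""
    allowed_mla_quant = [None, "per_tensor_fp8", "nvfp4"]
    if mla_quant not in allowed_mla_quant:
        raise ValueError(f"mla_quant must be one of {allowed_mla_quant}, got {mla_quant!r}")

    if mla_quant == "nvfp4":
        # wq_a, wkv_a, wq_b and wo are assigned NVFP4 quantizers; wkv_b stays unquantized.
        for layer in ["wq_a", "wkv_a", "wq_b", "wo"]:
            # In the real script the settings would come from the NVFP4 default config;
            # a bare enable flag stands in for them here as an illustrative placeholder.
            mtq_cfg["quant_cfg"][layer + "_quantizer"] = {"enable": True}
        # BMM quantizers are disabled because the BMM weights are decomposed from
        # wkv_b in TRT-LLM and are not quantized in the Turbo checkpoint.
        mtq_cfg["quant_cfg"]["*attn.kv_bmm_quantizer*"] = {"enable": False}

    return mtq_cfg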

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks (3 passed)

  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: the title clearly describes the main change, adding MLA nvfp4 quantization support for DeepSeek to achieve maximum performance, which matches the core objective of the PR.
  • Docstring Coverage ✅ Passed: docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.




@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
examples/deepseek/ptq.py (2)

314-315: Consider raising ValueError instead of assert for user input validation.

Using assert for input validation can be bypassed with -O (optimized mode) and produces a less user-friendly error message. A ValueError with a descriptive message would be more appropriate for CLI argument validation.

Suggested fix
     allowed_mla_quant = [None, "per_tensor_fp8", "nvfp4"]
-    assert mla_quant in allowed_mla_quant, f"mla_quant must be {allowed_mla_quant}"
+    if mla_quant not in allowed_mla_quant:
+        raise ValueError(f"mla_quant must be one of {allowed_mla_quant}, got {mla_quant!r}")

430-435: Consider using choices for CLI-level validation.

Using argparse's choices parameter would provide immediate feedback to users and generate better --help output.

Suggested improvement
     parser.add_argument(
         "--mla_quant",
         type=str,
         default=None,
+        choices=["per_tensor_fp8", "nvfp4"],
         help="MLA quantization type: None (disable), per_tensor_fp8, nvfp4",
     )

Note: choices doesn't directly support None for omitted arguments, but leaving None out of choices while keeping default=None achieves the desired behavior: when the argument isn't provided, the value is None; when it is provided, it must be one of the choices.
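
A small standalone illustration of that behavior; this is a hedged sketch of an argparse setup, not the actual parser in ptq.py:

import argparse

# Standalone sketch: the parser below only demonstrates the choices / default=None
# interaction described in the note above.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--mla_quant",
    type=str,
    default=None,
    choices=["per_tensor_fp8", "nvfp4"],
    help="MLA quantization type: omit to disable, or pass per_tensor_fp8 / nvfp4",
)

print(parser.parse_args([]).mla_quant)                        # None (argument omitted)
print(parser.parse_args(["--mla_quant", "nvfp4"]).mla_quant)  # nvfp4
# parser.parse_args(["--mla_quant", "bogus"]) exits with an error listing the valid choices.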

📜 Review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ae96b5 and 99b1b9f.

📒 Files selected for processing (1)
  • examples/deepseek/ptq.py
🧰 Additional context used
🧬 Code graph analysis (1)
examples/deepseek/ptq.py (1)
modelopt/torch/quantization/qtensor/nvfp4_tensor.py (1)
  • quantize (143-232)
🔇 Additional comments (4)
examples/deepseek/ptq.py (4)

276-283: LGTM!

The new mla_quant parameter is properly added with a sensible default of None for backward compatibility.


341-347: LGTM!

The formatting improvements with blank lines enhance code readability.


442-442: LGTM!

The new mla_quant argument is correctly propagated to the ptq function.


322-339: The code explicitly disables wkv_b by design, not by oversight.

The implementation intentionally excludes wkv_b from NVFP4 quantization. The separate mla_nvfp4_linear_layers list contains only four layers (wq_a, wkv_a, wq_b, wo), and the comment confirms "wq_a, wkv_a, wq_b, wo use NVFP4 quantization"—wkv_b is not included. The codebase contains no FP8 configuration for wkv_b in any code path. While the PR description naming may suggest otherwise, the actual implementation shows this exclusion is deliberate.

Likely an incorrect or invalid review comment.
