
Fix FP8 quantizer for Transformers v4#1504

Open
yiliu30 wants to merge 6 commits into main from hpu-v4

Conversation

@yiliu30 (Contributor) commented Mar 6, 2026

Description

Please briefly describe your main changes and the motivation behind them.

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: yiliu30 <yi4.liu@intel.com>
Copilot AI review requested due to automatic review settings March 6, 2026 03:12

Copilot AI left a comment


Pull request overview

This PR aims to restore/fix FP8 fine-grained integration compatibility with Transformers v4 by switching the HPU finegrained-fp8 monkeypatch to a version-specific patch module, and by introducing a dedicated v4 patch implementation.

Changes:

  • Update HPU patching to select a Transformers v4 vs v5+ compatible finegrained_fp8 replacement module at runtime.
  • Add a new finegrained_fp8_patch_v4.py module providing FP8Linear replacement logic for Transformers v4.
  • Add an example script for running FP8 static quantization and saving output.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.

  • examples/quant_model.py: Adds an example CLI for quantizing/saving a model with the AutoRound FP8 static scheme.
  • auto_round/modeling/hpu_patch.py: Selects the correct finegrained_fp8 patch module depending on the Transformers major version.
  • auto_round/modeling/finegrained_fp8_patch_v4.py: Introduces a Transformers v4-specific FP8Linear replacement implementation.

Comment on lines +64 to +71
            scale_in_features = (in_features + block_size[1] - 1) // block_size[1]
            self.weight_scale_inv = nn.Parameter(
                torch.empty(scale_out_features, scale_in_features, dtype=torch.float32, device=device)
            )
        else:
            self.register_parameter("weight_scale_inv", None)

        self.block_size = block_size

Copilot AI Mar 6, 2026


FP8Linear.__init__ indexes block_size[0]/block_size[1] even though block_size is optional and defaults to None. This will raise at runtime for per-tensor FP8 (or any call site that passes block_size=None). Handle the None case (e.g., create a scalar scale/inv-scale like the v5 patch) before computing tiled scale shapes.
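The guard the reviewer asks for can be sketched as follows. This is a hypothetical, dependency-free helper (weight_scale_shape is not in the diff); the names mirror FP8Linear.__init__, and the scalar per-tensor fallback is an assumption modeled on the described v5 patch behavior.

```python
def weight_scale_shape(out_features, in_features, block_size=None):
    """Shape of weight_scale_inv: tiled when block_size is set, scalar otherwise."""
    if block_size is None:
        # Per-tensor FP8: a single scalar scale, no tiling to compute.
        return ()
    # Ceiling division per block dimension, as in the original snippet.
    scale_out = (out_features + block_size[0] - 1) // block_size[0]
    scale_in = (in_features + block_size[1] - 1) // block_size[1]
    return (scale_out, scale_in)
```

__init__ would then branch on the returned shape instead of indexing block_size unconditionally.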

Comment on lines +142 to +147
        quantization_config=quantization_config,
    )

    if not has_been_replaced:
        logger.warning(
            "You are loading your model using fp8 but no linear modules were found in your model."

Copilot AI Mar 6, 2026


replace_with_fp8_linear unconditionally accesses model._tp_plan, but many Transformers models (especially v4) won’t have this attribute, which will raise AttributeError and prevent loading. Use getattr(model, "_tp_plan", None) (or avoid passing tp_plan at all since it isn’t used) to keep this compatible.

Comment on lines +131 to +141
):
    """Helper function to replace model layers with FP8 versions."""
    modules_to_not_convert = ["lm_head"] if modules_to_not_convert is None else modules_to_not_convert

    if quantization_config.modules_to_not_convert is not None:
        modules_to_not_convert.extend(quantization_config.modules_to_not_convert)
    modules_to_not_convert = list(set(modules_to_not_convert))
    model, has_been_replaced = _replace_with_fp8_linear(
        model,
        tp_plan=model._tp_plan,
        modules_to_not_convert=modules_to_not_convert,

Copilot AI Mar 6, 2026


quantization_config is optional in the signature but is treated as required (e.g., quantization_config.modules_to_not_convert). If this helper is called with quantization_config=None it will crash. Either make the argument required (no default) or add an explicit guard/error message early.
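An explicit early guard could look like this; the helper name and error wording are illustrative, not from the diff:

```python
def require_quantization_config(quantization_config):
    """Fail fast with a clear message instead of a later AttributeError."""
    if quantization_config is None:
        raise ValueError(
            "replace_with_fp8_linear requires a quantization_config; got None. "
            "Pass an FP8 quantization config explicitly."
        )
    return quantization_config
```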

Comment on lines +44 to +47
output_dir = args.output_dir
if output_dir is None:
    output_dir = "/mnt/disk5/hf_models/" + model_base_name + "-" + scheme + "-fp8-kv-2-test"
print(f"Output dir: {output_dir}")

Copilot AI Mar 6, 2026


Defaulting output_dir to an absolute path under /mnt/... makes the example fail on most machines. Use a relative default (e.g., ./output/<model>-<scheme>) or require --output_dir.
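A portable default along the lines the reviewer suggests; default_output_dir is a hypothetical helper, and the ./output layout is an assumption rather than a project convention:

```python
import os

def default_output_dir(model_name, scheme, output_dir=None):
    """Use the caller's directory if given, else ./output/<model>-<scheme>."""
    if output_dir is not None:
        return output_dir
    # Works for both local paths and HF model ids like "org/model".
    model_base_name = os.path.basename(model_name.rstrip("/"))
    return os.path.join("output", f"{model_base_name}-{scheme}")
```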

Comment on lines +75 to +78
        if bias:
            self.bias = nn.Parameter(torch.empty(self.out_features))
        else:
            self.register_parameter("bias", None)

Copilot AI Mar 6, 2026


FP8Linear stores activation_scheme but never creates an activation_scale parameter when activation_scheme == "static" (the v5 patch does). If Transformers v4 expects activation_scale to exist for static FP8 (e.g., during state_dict load), this will break loading/saving. Add the missing parameter (or confirm v4 does not use it and remove the flag).

Comment on lines +79 to +82


def _replace_with_fp8_linear(
    model,

Copilot AI Mar 6, 2026


When bias=True, the bias parameter is created without device/dtype, so it will default to CPU and can end up on a different device than weight. Create the bias on the same device (and preferably dtype) as the weight to avoid device mismatch errors during forward passes.
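A dependency-free illustration of the device/dtype point: the bias should inherit the weight's placement. Tensors are modeled as simple records; in the real code this corresponds to nn.Parameter(torch.empty(out_features, device=weight.device, dtype=dtype)):

```python
from collections import namedtuple

# Toy tensor record standing in for a torch.Tensor in this sketch.
Tensor = namedtuple("Tensor", "shape device dtype")

def make_bias_like(weight, out_features):
    # Create the bias on the same device and dtype as the weight,
    # avoiding CPU/accelerator mismatches in the forward pass.
    return Tensor((out_features,), weight.device, weight.dtype)
```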

Comment on lines +32 to +42
logger = logging.get_logger(__name__)


logger = logging.get_logger(__name__)


_FP8_DTYPE = torch.float8_e4m3fn
_FP8_MIN = torch.finfo(_FP8_DTYPE).min
_FP8_MAX = torch.finfo(_FP8_DTYPE).max



Copilot AI Mar 6, 2026


There are two identical logger = logging.get_logger(__name__) assignments; the second is redundant and the extra blank lines make the module harder to scan. Remove the duplicate and tidy whitespace.

Comment on lines +1 to +13
# model_name = "/dataset/meta-llama/Meta-Llama-3-8B/"
# model_name = "/data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4"
# model_name = "/models/Qwen3-8B-FP8/"
# model_name = "/data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4"
# model_name = "Qwen/Qwen2.5-0.5B-Instruct"
# model_name="/models/Qwen3-235B-A22B/"
model_name = "/mnt/disk5/unsloth/DeepSeek-R1-BF16"
model_name = "/models/Qwen3-8B-FP8/"
# model_name = "/mnt/disk8/Qwen/Qwen3-8B-FP8"
# model_name = "/mnt/disk5/Qwen3-30B-A3B-FP8"
# model_name = "/models/DeepSeek-V2-Lite-Chat/"
# model_name = "/mnt/disk8/deepseek-ai/DeepSeek-V2-Lite-Chat"
model_name = "/mnt/disk8/Qwen/Qwen3-30B-A3B"

Copilot AI Mar 6, 2026


This example script hardcodes multiple absolute, machine-specific model_name paths and overwrites model_name several times. This makes the example non-portable and confusing; prefer a single sensible default (e.g., a public HF model id) or require --model and remove the local-path assignments/comments.

Suggested change
# model_name = "/dataset/meta-llama/Meta-Llama-3-8B/"
# model_name = "/data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4"
# model_name = "/models/Qwen3-8B-FP8/"
# model_name = "/data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4"
# model_name = "Qwen/Qwen2.5-0.5B-Instruct"
# model_name="/models/Qwen3-235B-A22B/"
model_name = "/mnt/disk5/unsloth/DeepSeek-R1-BF16"
model_name = "/models/Qwen3-8B-FP8/"
# model_name = "/mnt/disk8/Qwen/Qwen3-8B-FP8"
# model_name = "/mnt/disk5/Qwen3-30B-A3B-FP8"
# model_name = "/models/DeepSeek-V2-Lite-Chat/"
# model_name = "/mnt/disk8/deepseek-ai/DeepSeek-V2-Lite-Chat"
model_name = "/mnt/disk8/Qwen/Qwen3-30B-A3B"
# Default model name; can be overridden via the --model CLI argument.
# Use a public Hugging Face model id to keep this example portable.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

Comment on lines +17 to +26
def fix_everything(seed):
    import random

    import numpy as np
    import torch

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)

Copilot AI Mar 6, 2026


fix_everything is defined but never called. Either wire it into main/CLI (e.g., add --seed and call it) or remove it to avoid dead code in the example.
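Wiring it in could look like the sketch below. Only the stdlib RNG is seeded here to keep the example dependency-free (the real fix_everything also seeds numpy and torch); the --seed flag and its default are assumptions, not part of the diff:

```python
import argparse
import random

def fix_everything(seed):
    # Real example would also call np.random.seed and torch.manual_seed.
    random.seed(seed)

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=42)
    return parser.parse_args(argv)
```

main() would then call fix_everything(args.seed) before loading the model.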

Comment on lines +19 to +25
    from auto_round.utils import is_transformers_version_greater_or_equal_5

    if is_transformers_version_greater_or_equal_5():
        patch_file_name = "auto_round.modeling.finegrained_fp8_patch"
    else:
        patch_file_name = "auto_round.modeling.finegrained_fp8_patch_v4"


Copilot AI Mar 6, 2026


patch_finegrained_fp8 now selects between two patch modules, but the log message below still implies a single hard-coded module name. Consider logging patch_file_name (or otherwise reflecting which patch was applied) so debugging reflects the actual branch taken.
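A sketch of the logging tweak; the module paths come from the diff, while the function shape and the log wording are illustrative:

```python
import logging

logger = logging.getLogger(__name__)

def select_patch_module(is_v5):
    # Mirror the version branch and log the module actually chosen,
    # so debug output reflects the branch taken at runtime.
    if is_v5:
        patch_file_name = "auto_round.modeling.finegrained_fp8_patch"
    else:
        patch_file_name = "auto_round.modeling.finegrained_fp8_patch_v4"
    logger.info("Applying finegrained FP8 patch from %s", patch_file_name)
    return patch_file_name
```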

@chensuyue chensuyue added this to the 0.10.3 milestone Mar 7, 2026