
Fix FP8 quantizer for Transformers v4#1504

Open
yiliu30 wants to merge 6 commits into main from hpu-v4

Conversation

@yiliu30 (Contributor) commented Mar 6, 2026

Description

Please briefly describe your main changes and the motivation behind them.

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: yiliu30 <yi4.liu@intel.com>
Copilot AI review requested due to automatic review settings March 6, 2026 03:12

Copilot AI left a comment


Pull request overview

This PR aims to restore/fix FP8 fine-grained integration compatibility with Transformers v4 by switching the HPU finegrained-fp8 monkeypatch to a version-specific patch module, and by introducing a dedicated v4 patch implementation.

Changes:

  • Update HPU patching to select a Transformers v4 vs v5+ compatible finegrained_fp8 replacement module at runtime.
  • Add a new finegrained_fp8_patch_v4.py module providing FP8Linear replacement logic for Transformers v4.
  • Add an example script for running FP8 static quantization and saving output.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.

  • examples/quant_model.py: Adds an example CLI for quantizing/saving a model with the AutoRound FP8 static scheme.
  • auto_round/modeling/hpu_patch.py: Selects the correct finegrained_fp8 patch module depending on the Transformers major version.
  • auto_round/modeling/finegrained_fp8_patch_v4.py: Introduces a Transformers v4-specific FP8Linear replacement implementation.

Comment on lines +64 to +71
            scale_in_features = (in_features + block_size[1] - 1) // block_size[1]
            self.weight_scale_inv = nn.Parameter(
                torch.empty(scale_out_features, scale_in_features, dtype=torch.float32, device=device)
            )
        else:
            self.register_parameter("weight_scale_inv", None)

        self.block_size = block_size

Copilot AI Mar 6, 2026


FP8Linear.__init__ indexes block_size[0]/block_size[1] even though block_size is optional and defaults to None. This will raise at runtime for per-tensor FP8 (or any call site that passes block_size=None). Handle the None case (e.g., create a scalar scale/inv-scale like the v5 patch) before computing tiled scale shapes.
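The guard the reviewer asks for can be sketched as follows. This is a hypothetical, dependency-free helper (weight_scale_shape is not in the diff); the names mirror FP8Linear.__init__, and the scalar per-tensor fallback is an assumption modeled on the described v5 patch behavior.

```python
def weight_scale_shape(out_features, in_features, block_size=None):
    """Shape of weight_scale_inv: tiled when block_size is set, scalar otherwise."""
    if block_size is None:
        # Per-tensor FP8: a single scalar scale, no tiling to compute.
        return ()
    # Ceiling division per block dimension, as in the original snippet.
    scale_out = (out_features + block_size[0] - 1) // block_size[0]
    scale_in = (in_features + block_size[1] - 1) // block_size[1]
    return (scale_out, scale_in)
```

__init__ would then branch on the returned shape instead of indexing block_size unconditionally.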

Comment on lines +142 to +147
        quantization_config=quantization_config,
    )

    if not has_been_replaced:
        logger.warning(
            "You are loading your model using fp8 but no linear modules were found in your model."

Copilot AI Mar 6, 2026


replace_with_fp8_linear unconditionally accesses model._tp_plan, but many Transformers models (especially v4) won’t have this attribute, which will raise AttributeError and prevent loading. Use getattr(model, "_tp_plan", None) (or avoid passing tp_plan at all since it isn’t used) to keep this compatible.

Comment on lines +131 to +141
):
    """Helper function to replace model layers with FP8 versions."""
    modules_to_not_convert = ["lm_head"] if modules_to_not_convert is None else modules_to_not_convert

    if quantization_config.modules_to_not_convert is not None:
        modules_to_not_convert.extend(quantization_config.modules_to_not_convert)
    modules_to_not_convert = list(set(modules_to_not_convert))
    model, has_been_replaced = _replace_with_fp8_linear(
        model,
        tp_plan=model._tp_plan,
        modules_to_not_convert=modules_to_not_convert,

Copilot AI Mar 6, 2026


quantization_config is optional in the signature but is treated as required (e.g., quantization_config.modules_to_not_convert). If this helper is called with quantization_config=None it will crash. Either make the argument required (no default) or add an explicit guard/error message early.
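An explicit early guard could look like this; the helper name and error wording are illustrative, not from the diff:

```python
def require_quantization_config(quantization_config):
    """Fail fast with a clear message instead of a later AttributeError."""
    if quantization_config is None:
        raise ValueError(
            "replace_with_fp8_linear requires a quantization_config; got None. "
            "Pass an FP8 quantization config explicitly."
        )
    return quantization_config
```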

Comment on lines +44 to +47
output_dir = args.output_dir
if output_dir is None:
    output_dir = "/mnt/disk5/hf_models/" + model_base_name + "-" + scheme + "-fp8-kv-2-test"
print(f"Output dir: {output_dir}")

Copilot AI Mar 6, 2026


Defaulting output_dir to an absolute path under /mnt/... makes the example fail on most machines. Use a relative default (e.g., ./output/<model>-<scheme>) or require --output_dir.
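A portable default along the lines the reviewer suggests; default_output_dir is a hypothetical helper, and the ./output layout is an assumption rather than a project convention:

```python
import os

def default_output_dir(model_name, scheme, output_dir=None):
    """Use the caller's directory if given, else ./output/<model>-<scheme>."""
    if output_dir is not None:
        return output_dir
    # Works for both local paths and HF model ids like "org/model".
    model_base_name = os.path.basename(model_name.rstrip("/"))
    return os.path.join("output", f"{model_base_name}-{scheme}")
```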

Comment on lines +75 to +78
        if bias:
            self.bias = nn.Parameter(torch.empty(self.out_features))
        else:
            self.register_parameter("bias", None)

Copilot AI Mar 6, 2026


FP8Linear stores activation_scheme but never creates an activation_scale parameter when activation_scheme == "static" (the v5 patch does). If Transformers v4 expects activation_scale to exist for static FP8 (e.g., during state_dict load), this will break loading/saving. Add the missing parameter (or confirm v4 does not use it and remove the flag).

Comment on lines +79 to +82


def _replace_with_fp8_linear(
    model,

Copilot AI Mar 6, 2026


When bias=True, the bias parameter is created without device/dtype, so it will default to CPU and can end up on a different device than weight. Create the bias on the same device (and preferably dtype) as the weight to avoid device mismatch errors during forward passes.
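A dependency-free illustration of the device/dtype point: the bias should inherit the weight's placement. Tensors are modeled as simple records; in the real code this corresponds to nn.Parameter(torch.empty(out_features, device=weight.device, dtype=dtype)):

```python
from collections import namedtuple

# Toy tensor record standing in for a torch.Tensor in this sketch.
Tensor = namedtuple("Tensor", "shape device dtype")

def make_bias_like(weight, out_features):
    # Create the bias on the same device and dtype as the weight,
    # avoiding CPU/accelerator mismatches in the forward pass.
    return Tensor((out_features,), weight.device, weight.dtype)
```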

Comment on lines +32 to +42
logger = logging.get_logger(__name__)


logger = logging.get_logger(__name__)


_FP8_DTYPE = torch.float8_e4m3fn
_FP8_MIN = torch.finfo(_FP8_DTYPE).min
_FP8_MAX = torch.finfo(_FP8_DTYPE).max



Copilot AI Mar 6, 2026


There are two identical logger = logging.get_logger(__name__) assignments; the second is redundant and the extra blank lines make the module harder to scan. Remove the duplicate and tidy whitespace.

Comment on lines +1 to +13
# model_name = "/dataset/meta-llama/Meta-Llama-3-8B/"
# model_name = "/data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4"
# model_name = "/models/Qwen3-8B-FP8/"
# model_name = "/data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4"
# model_name = "Qwen/Qwen2.5-0.5B-Instruct"
# model_name="/models/Qwen3-235B-A22B/"
model_name = "/mnt/disk5/unsloth/DeepSeek-R1-BF16"
model_name = "/models/Qwen3-8B-FP8/"
# model_name = "/mnt/disk8/Qwen/Qwen3-8B-FP8"
# model_name = "/mnt/disk5/Qwen3-30B-A3B-FP8"
# model_name = "/models/DeepSeek-V2-Lite-Chat/"
# model_name = "/mnt/disk8/deepseek-ai/DeepSeek-V2-Lite-Chat"
model_name = "/mnt/disk8/Qwen/Qwen3-30B-A3B"

Copilot AI Mar 6, 2026


This example script hardcodes multiple absolute, machine-specific model_name paths and overwrites model_name several times. This makes the example non-portable and confusing; prefer a single sensible default (e.g., a public HF model id) or require --model and remove the local-path assignments/comments.

Suggested change
# model_name = "/dataset/meta-llama/Meta-Llama-3-8B/"
# model_name = "/data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4"
# model_name = "/models/Qwen3-8B-FP8/"
# model_name = "/data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4"
# model_name = "Qwen/Qwen2.5-0.5B-Instruct"
# model_name="/models/Qwen3-235B-A22B/"
model_name = "/mnt/disk5/unsloth/DeepSeek-R1-BF16"
model_name = "/models/Qwen3-8B-FP8/"
# model_name = "/mnt/disk8/Qwen/Qwen3-8B-FP8"
# model_name = "/mnt/disk5/Qwen3-30B-A3B-FP8"
# model_name = "/models/DeepSeek-V2-Lite-Chat/"
# model_name = "/mnt/disk8/deepseek-ai/DeepSeek-V2-Lite-Chat"
model_name = "/mnt/disk8/Qwen/Qwen3-30B-A3B"
# Default model name; can be overridden via the --model CLI argument.
# Use a public Hugging Face model id to keep this example portable.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

Comment on lines +17 to +26
def fix_everything(seed):
    import random

    import numpy as np
    import torch

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)

Copilot AI Mar 6, 2026


fix_everything is defined but never called. Either wire it into main/CLI (e.g., add --seed and call it) or remove it to avoid dead code in the example.
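Wiring it in could look like the sketch below. Only the stdlib RNG is seeded here to keep the example dependency-free (the real fix_everything also seeds numpy and torch); the --seed flag and its default are assumptions, not part of the diff:

```python
import argparse
import random

def fix_everything(seed):
    # Real example would also call np.random.seed and torch.manual_seed.
    random.seed(seed)

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=42)
    return parser.parse_args(argv)
```

main() would then call fix_everything(args.seed) before loading the model.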

Comment on lines +19 to +25
    from auto_round.utils import is_transformers_version_greater_or_equal_5

    if is_transformers_version_greater_or_equal_5():
        patch_file_name = "auto_round.modeling.finegrained_fp8_patch"
    else:
        patch_file_name = "auto_round.modeling.finegrained_fp8_patch_v4"


Copilot AI Mar 6, 2026


patch_finegrained_fp8 now selects between two patch modules, but the log message below still implies a single hard-coded module name. Consider logging patch_file_name (or otherwise reflecting which patch was applied) so debugging reflects the actual branch taken.
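A sketch of the logging tweak; the module paths come from the diff, while the function shape and the log wording are illustrative:

```python
import logging

logger = logging.getLogger(__name__)

def select_patch_module(is_v5):
    # Mirror the version branch and log the module actually chosen,
    # so debug output reflects the branch taken at runtime.
    if is_v5:
        patch_file_name = "auto_round.modeling.finegrained_fp8_patch"
    else:
        patch_file_name = "auto_round.modeling.finegrained_fp8_patch_v4"
    logger.info("Applying finegrained FP8 patch from %s", patch_file_name)
    return patch_file_name
```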

@chensuyue chensuyue added this to the 0.10.3 milestone Mar 7, 2026