Conversation
Signed-off-by: yiliu30 <yi4.liu@intel.com>
for more information, see https://pre-commit.ci
Pull request overview
This PR restores FP8 fine-grained quantization compatibility with Transformers v4 by switching the HPU finegrained-fp8 monkeypatch to a version-specific patch module and introducing a dedicated v4 patch implementation.
Changes:
- Update HPU patching to select a Transformers v4 vs v5+ compatible `finegrained_fp8` replacement module at runtime.
- Add a new `finegrained_fp8_patch_v4.py` module providing FP8Linear replacement logic for Transformers v4.
- Add an example script for running FP8 static quantization and saving output.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| examples/quant_model.py | Adds an example CLI for quantizing/saving a model with AutoRound FP8 static scheme. |
| auto_round/modeling/hpu_patch.py | Selects the correct finegrained_fp8 patch module depending on Transformers major version. |
| auto_round/modeling/finegrained_fp8_patch_v4.py | Introduces a Transformers v4-specific FP8Linear replacement implementation. |
```python
        scale_in_features = (in_features + block_size[1] - 1) // block_size[1]
        self.weight_scale_inv = nn.Parameter(
            torch.empty(scale_out_features, scale_in_features, dtype=torch.float32, device=device)
        )
    else:
        self.register_parameter("weight_scale_inv", None)

    self.block_size = block_size
```
FP8Linear.__init__ indexes block_size[0]/block_size[1] even though block_size is optional and defaults to None. This will raise at runtime for per-tensor FP8 (or any call site that passes block_size=None). Handle the None case (e.g., create a scalar scale/inv-scale like the v5 patch) before computing tiled scale shapes.
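A minimal sketch of the suggested None handling. The class and attribute names mirror the excerpt above, but the scalar per-tensor scale shape is an assumption modeled on the v5 patch:

```python
import torch
import torch.nn as nn

class FP8Linear(nn.Module):
    """Sketch: tolerate block_size=None (per-tensor FP8) before tiling scales."""

    def __init__(self, in_features, out_features, block_size=None, device=None):
        super().__init__()
        self.block_size = block_size
        if block_size is None:
            # Per-tensor FP8: a single scalar inverse scale (assumed shape).
            self.weight_scale_inv = nn.Parameter(torch.empty(1, dtype=torch.float32, device=device))
        else:
            # Tiled (fine-grained) FP8: one scale per (block_size[0] x block_size[1]) tile.
            scale_out_features = (out_features + block_size[0] - 1) // block_size[0]
            scale_in_features = (in_features + block_size[1] - 1) // block_size[1]
            self.weight_scale_inv = nn.Parameter(
                torch.empty(scale_out_features, scale_in_features, dtype=torch.float32, device=device)
            )

per_tensor = FP8Linear(64, 32)                  # no longer raises with block_size=None
tiled = FP8Linear(64, 32, block_size=(16, 16))  # tiled scales as before
```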
```python
            quantization_config=quantization_config,
        )

    if not has_been_replaced:
        logger.warning(
            "You are loading your model using fp8 but no linear modules were found in your model."
```
replace_with_fp8_linear unconditionally accesses model._tp_plan, but many Transformers models (especially v4) won’t have this attribute, which will raise AttributeError and prevent loading. Use getattr(model, "_tp_plan", None) (or avoid passing tp_plan at all since it isn’t used) to keep this compatible.
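A minimal sketch of the suggested `getattr` fallback; the model classes here are stand-ins, not real Transformers models:

```python
class V5Model:
    # Newer Transformers models may carry a tensor-parallel plan.
    _tp_plan = {"layers": "colwise"}

class V4Model:
    # Many Transformers v4 models do not define _tp_plan at all.
    pass

def get_tp_plan(model):
    # getattr with a default never raises AttributeError, so loading keeps working.
    return getattr(model, "_tp_plan", None)

assert get_tp_plan(V5Model()) == {"layers": "colwise"}
assert get_tp_plan(V4Model()) is None  # direct model._tp_plan access would raise here
```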
```python
):
    """Helper function to replace model layers with FP8 versions."""
    modules_to_not_convert = ["lm_head"] if modules_to_not_convert is None else modules_to_not_convert

    if quantization_config.modules_to_not_convert is not None:
        modules_to_not_convert.extend(quantization_config.modules_to_not_convert)
    modules_to_not_convert = list(set(modules_to_not_convert))
    model, has_been_replaced = _replace_with_fp8_linear(
        model,
        tp_plan=model._tp_plan,
        modules_to_not_convert=modules_to_not_convert,
```
quantization_config is optional in the signature but is treated as required (e.g., quantization_config.modules_to_not_convert). If this helper is called with quantization_config=None it will crash. Either make the argument required (no default) or add an explicit guard/error message early.
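A hypothetical guard mirroring the skip-list merging in the excerpt above; `QuantConfig` and `merge_skip_modules` are illustrative stand-ins, not names from the PR:

```python
class QuantConfig:
    """Stand-in for the real quantization_config object."""
    def __init__(self, modules_to_not_convert=None):
        self.modules_to_not_convert = modules_to_not_convert

def merge_skip_modules(quantization_config, modules_to_not_convert=None):
    if quantization_config is None:
        # Fail fast with a clear message instead of an AttributeError later.
        raise ValueError("a quantization_config is required for FP8 replacement")
    skip = ["lm_head"] if modules_to_not_convert is None else list(modules_to_not_convert)
    if quantization_config.modules_to_not_convert is not None:
        skip.extend(quantization_config.modules_to_not_convert)
    return sorted(set(skip))

assert merge_skip_modules(QuantConfig(["embed_tokens"])) == ["embed_tokens", "lm_head"]
```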
```python
    output_dir = args.output_dir
    if output_dir is None:
        output_dir = "/mnt/disk5/hf_models/" + model_base_name + "-" + scheme + "-fp8-kv-2-test"
    print(f"Output dir: {output_dir}")
```
Defaulting output_dir to an absolute path under /mnt/... makes the example fail on most machines. Use a relative default (e.g., ./output/<model>-<scheme>) or require --output_dir.
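One way to implement the suggestion, as a hypothetical helper (`resolve_output_dir` is not a name from the PR):

```python
import os

def resolve_output_dir(model_base_name, scheme, output_dir=None):
    # Default to a relative ./output path rather than a machine-specific
    # absolute /mnt/... path, so the example runs anywhere.
    if output_dir is None:
        output_dir = os.path.join("output", f"{model_base_name}-{scheme}")
    return output_dir
```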
```python
        if bias:
            self.bias = nn.Parameter(torch.empty(self.out_features))
        else:
            self.register_parameter("bias", None)
```
FP8Linear stores activation_scheme but never creates an activation_scale parameter when activation_scheme == "static" (the v5 patch does). If Transformers v4 expects activation_scale to exist for static FP8 (e.g., during state_dict load), this will break loading/saving. Add the missing parameter (or confirm v4 does not use it and remove the flag).
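A sketch of registering the missing static-activation scale. The attribute name `activation_scale` mirrors the v5 patch; its shape and initial value here are assumptions:

```python
import torch
import torch.nn as nn

class FP8LinearSketch(nn.Module):
    """Sketch: create activation_scale only for the static scheme."""

    def __init__(self, out_features, activation_scheme="dynamic", device=None):
        super().__init__()
        self.activation_scheme = activation_scheme
        if activation_scheme == "static":
            # A scalar input scale, so state_dict save/load round-trips.
            self.activation_scale = nn.Parameter(torch.ones(1, dtype=torch.float32, device=device))
        else:
            # Dynamic quantization computes scales on the fly; no parameter needed.
            self.register_parameter("activation_scale", None)
```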
```python
def _replace_with_fp8_linear(
    model,
```
When bias=True, the bias parameter is created without device/dtype, so it will default to CPU and can end up on a different device than weight. Create the bias on the same device (and preferably dtype) as the weight to avoid device mismatch errors during forward passes.
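A minimal sketch of the suggested fix, reusing the weight's device and dtype when creating the bias (`LinearSketch` is an illustrative class, not the PR's):

```python
import torch
import torch.nn as nn

class LinearSketch(nn.Module):
    def __init__(self, in_features, out_features, bias=True, device=None, dtype=None):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features, device=device, dtype=dtype))
        if bias:
            # Match the weight's device and dtype to avoid mismatch errors in forward.
            self.bias = nn.Parameter(
                torch.empty(out_features, device=self.weight.device, dtype=self.weight.dtype)
            )
        else:
            self.register_parameter("bias", None)

m = LinearSketch(4, 2, dtype=torch.bfloat16)
assert m.bias.dtype == m.weight.dtype  # bias no longer defaults to float32 on CPU
```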
```python
logger = logging.get_logger(__name__)


logger = logging.get_logger(__name__)


_FP8_DTYPE = torch.float8_e4m3fn
_FP8_MIN = torch.finfo(_FP8_DTYPE).min
_FP8_MAX = torch.finfo(_FP8_DTYPE).max
```
There are two identical logger = logging.get_logger(__name__) assignments; the second is redundant and the extra blank lines make the module harder to scan. Remove the duplicate and tidy whitespace.
```python
# model_name = "/dataset/meta-llama/Meta-Llama-3-8B/"
# model_name = "/data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4"
# model_name = "/models/Qwen3-8B-FP8/"
# model_name = "/data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4"
# model_name = "Qwen/Qwen2.5-0.5B-Instruct"
# model_name="/models/Qwen3-235B-A22B/"
model_name = "/mnt/disk5/unsloth/DeepSeek-R1-BF16"
model_name = "/models/Qwen3-8B-FP8/"
# model_name = "/mnt/disk8/Qwen/Qwen3-8B-FP8"
# model_name = "/mnt/disk5/Qwen3-30B-A3B-FP8"
# model_name = "/models/DeepSeek-V2-Lite-Chat/"
# model_name = "/mnt/disk8/deepseek-ai/DeepSeek-V2-Lite-Chat"
model_name = "/mnt/disk8/Qwen/Qwen3-30B-A3B"
```
This example script hardcodes multiple absolute, machine-specific model_name paths and overwrites model_name several times. This makes the example non-portable and confusing; prefer a single sensible default (e.g., a public HF model id) or require --model and remove the local-path assignments/comments.
Suggested change:

```diff
-# model_name = "/dataset/meta-llama/Meta-Llama-3-8B/"
-# model_name = "/data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4"
-# model_name = "/models/Qwen3-8B-FP8/"
-# model_name = "/data5/yliu7/HF_HOME/DeepSeek-R1-bf16-layer4"
-# model_name = "Qwen/Qwen2.5-0.5B-Instruct"
-# model_name="/models/Qwen3-235B-A22B/"
-model_name = "/mnt/disk5/unsloth/DeepSeek-R1-BF16"
-model_name = "/models/Qwen3-8B-FP8/"
-# model_name = "/mnt/disk8/Qwen/Qwen3-8B-FP8"
-# model_name = "/mnt/disk5/Qwen3-30B-A3B-FP8"
-# model_name = "/models/DeepSeek-V2-Lite-Chat/"
-# model_name = "/mnt/disk8/deepseek-ai/DeepSeek-V2-Lite-Chat"
-model_name = "/mnt/disk8/Qwen/Qwen3-30B-A3B"
+# Default model name; can be overridden via the --model CLI argument.
+# Use a public Hugging Face model id to keep this example portable.
+model_name = "Qwen/Qwen2.5-0.5B-Instruct"
```
```python
def fix_everything(seed):
    import random

    import numpy as np
    import torch

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)
```
fix_everything is defined but never called. Either wire it into main/CLI (e.g., add --seed and call it) or remove it to avoid dead code in the example.
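One way to wire it in, sketched with a `--seed` flag (the flag is a suggestion, not part of the PR; the numpy/torch seeding lines are omitted here for brevity):

```python
import argparse
import random

def fix_everything(seed):
    # Seeding helper from the example; numpy/torch calls elided in this sketch.
    random.seed(seed)

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=42)
args = parser.parse_args([])  # empty argv keeps the sketch self-contained

fix_everything(args.seed)
first = random.random()
fix_everything(args.seed)
assert random.random() == first  # the same seed reproduces the same draw
```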
```python
from auto_round.utils import is_transformers_version_greater_or_equal_5

if is_transformers_version_greater_or_equal_5():
    patch_file_name = "auto_round.modeling.finegrained_fp8_patch"
else:
    patch_file_name = "auto_round.modeling.finegrained_fp8_patch_v4"
```
patch_finegrained_fp8 now selects between two patch modules, but the log message below still implies a single hard-coded module name. Consider logging patch_file_name (or otherwise reflecting which patch was applied) so debugging reflects the actual branch taken.
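A sketch of logging the branch actually taken; `pick_patch_module` is a hypothetical refactor of the selection logic, not a function in the PR:

```python
import logging

logger = logging.getLogger(__name__)

def pick_patch_module(is_v5):
    # Select the version-specific patch and log its real name, instead of a
    # hard-coded module name that may not match the branch taken.
    patch_file_name = (
        "auto_round.modeling.finegrained_fp8_patch"
        if is_v5
        else "auto_round.modeling.finegrained_fp8_patch_v4"
    )
    logger.info("Applying finegrained FP8 patch from %s", patch_file_name)
    return patch_file_name
```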