Add FP8 support for the ONNX backend #4072

Open
andrey-churkin wants to merge 2 commits into openvinotoolkit:develop from andrey-churkin:ac/fp8_onnx

Contributor

@andrey-churkin andrey-churkin commented May 15, 2026

Changes

  • Add support for nncf.CompressWeightsMode.FP8_E4M3 mode in the nncf.compress_weights() method for the ONNX backend.
  • Add support for quantization using nncf.QuantizationMode.FP8_E4M3 and nncf.QuantizationMode.FP8_E5M2 modes in the nncf.quantize() method for the ONNX backend.
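
For context on the two formats named above, here is a minimal sketch of how their bit patterns decode. It assumes the ONNX float8 conventions (E4M3FN: exponent bias 7, no infinities, a single NaN mantissa pattern; E5M2: exponent bias 15, IEEE-style infinities and NaNs). The function names are hypothetical and are not part of NNCF or ONNX.

```python
import math


def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int, finite_only: bool) -> float:
    """Decode one 8-bit pattern laid out as sign | exponent | mantissa."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp_mask = (1 << exp_bits) - 1
    man_mask = (1 << man_bits) - 1
    exp = (byte >> man_bits) & exp_mask
    man = byte & man_mask
    if finite_only:
        # E4M3FN style: only exponent and mantissa all ones encode NaN, no infinities.
        if exp == exp_mask and man == man_mask:
            return math.nan
    elif exp == exp_mask:
        # E5M2 style: IEEE-like, mantissa 0 is infinity, anything else is NaN.
        return sign * math.inf if man == 0 else math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e4m3fn(byte: int) -> float:
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, finite_only=True)


def decode_e5m2(byte: int) -> float:
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, finite_only=False)


def max_finite(decode) -> float:
    """Largest finite value over all 256 bit patterns."""
    return max(v for v in map(decode, range(256)) if math.isfinite(v))
```

The trade-off this illustrates: E4M3FN spends more bits on the mantissa and tops out at 448, while E5M2 spends more on the exponent and reaches 57344 with coarser steps.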

Reason for changes

Add support for FP8 quantization and weight compression in the ONNX backend.

Related tickets

Tests

TBD

Weight compression - success

@andrey-churkin andrey-churkin requested a review from a team as a code owner May 15, 2026 08:24
@github-actions github-actions bot added the NNCF ONNX label May 15, 2026
Collaborator

@daniil-lyakhov daniil-lyakhov left a comment

No major comments. Please add some tests.

get_weight_quantization_axis(node, target_point.port_id) if target_point.is_weight_target_point() else 1
)
onnx_parameters = convert_fc_params_to_onnx_params(parameters, axis)
nncf_input_node_next_nodes = ONNXMinMaxAlgoBackend._get_input_edges_mapping(nncf_graph)

This is a potential point for optimization in the future. Maybe we could add a comment to highlight it.

Comment on lines +367 to +371
if weight_dtype == onnx.TensorProto.FLOAT8E4M3FN:
    np_dtype = helper.tensor_dtype_to_np_dtype(weight_dtype)
    vals = onnx.numpy_helper.saturate_cast(np.asarray(quantized_weights), np_dtype).flatten()
else:
    vals = quantized_weights

Two similar code blocks, maybe worth a private method?

Labels

NNCF ONNX Pull requests that updates NNCF ONNX

2 participants