
[DRAFT][quantization] Full quantization of LLama compatible models #436

Draft
stamalakhov wants to merge 1 commit into Samsung:main from stamalakhov:full_quantization_br

Conversation

@stamalakhov
Contributor

@stamalakhov stamalakhov commented Jan 13, 2026

This draft tries to produce fully quantized Circle layers for the Llama model.

TODO:

  • tests/cleanup

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

@stamalakhov stamalakhov self-assigned this Jan 13, 2026
@stamalakhov stamalakhov force-pushed the full_quantization_br branch 9 times, most recently from 0253cb9 to 5201525 Compare January 20, 2026 11:37
```python
gptq[name] = GPTQ(subset[name])
gptq[name].quantizer.configure(
    bits=8, perchannel=True, sym=False, mse=False
    bits=4, perchannel=True, sym=False, mse=False
```
Contributor

FYI, you can give the option for this with this PR.
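A possible shape for that option (a sketch only: `GPTQ` and `_Quantizer` below are simplified stand-ins for the classes in this branch, and `build_gptq` is a hypothetical helper, not the PR's actual API): instead of hard-coding `bits=8` or `bits=4` at the call site, the bit-width is threaded through as a parameter.

```python
class _Quantizer:
    """Stand-in for the quantizer configured in the snippet above."""

    def configure(self, bits, perchannel, sym, mse):
        self.bits = bits
        self.perchannel = perchannel
        self.sym = sym
        self.mse = mse


class GPTQ:
    """Stand-in for the real GPTQ layer wrapper."""

    def __init__(self, layer):
        self.layer = layer
        self.quantizer = _Quantizer()


def build_gptq(subset, bits=4, perchannel=True, sym=False, mse=False):
    """Wrap every layer in `subset`, taking the bit-width as an option."""
    gptq = {}
    for name, layer in subset.items():
        gptq[name] = GPTQ(layer)
        gptq[name].quantizer.configure(
            bits=bits, perchannel=perchannel, sym=sym, mse=mse
        )
    return gptq
```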

Contributor Author

> FYI, you can give the option for this with this PR.

@mhs4670go
Thank you. I'll rebase after #441 is merged.

Contributor

@mhs4670go mhs4670go Jan 21, 2026

Could you explain the reasons for the observer-related changes?

  1. Deleting some attributes and registering them as buffers.
  2. Changing ObserverBase's parent from ABC to torch.nn.Module (and MinMaxObserver's as well).

Contributor Author

@stamalakhov stamalakhov Jan 21, 2026

> Could you give me some explanations for the reason of changes related with observers?
>
> 1. Deleting some attributes and register them as buffer.

Ah, it turned out that model.to("cuda") and model.to("cpu") did not transfer the scales and zero_points to the GPU/CPU because they were not registered as buffers or parameters; that is why they are now registered as buffers. Deleting the existing attributes first is needed because otherwise torch refuses to register an already-existing attribute as a buffer.

> 2. Change ObserverBase's parent from ABC to torch.nn.Module

This enables buffer registration and the correct automatic transfer of scales/zero_points between CPU and GPU. The same approach is used in gptq/quant.py, presumably for the same reason. Please see:

```python
class Quantizer(nn.Module):
```

This draft is just a quick-and-dirty PoC.
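A minimal sketch of the buffer point above (the observer name and its state are taken from this discussion, not from the actual diff): once the observer inherits from nn.Module and scale/zero_point are registered buffers, Module.to() moves and converts them with the rest of the model, and register_buffer refuses to shadow a pre-existing plain attribute, which is why the old attributes have to be deleted first.

```python
import torch
import torch.nn as nn


class MinMaxObserver(nn.Module):  # parent is nn.Module instead of ABC
    def __init__(self):
        super().__init__()
        # Registered buffers (unlike plain attributes) travel with .to()/.cuda().
        # If self.scale already existed as a plain attribute, it would have to be
        # deleted first: register_buffer raises KeyError on a name clash.
        self.register_buffer("scale", torch.ones(1))
        self.register_buffer("zero_point", torch.zeros(1, dtype=torch.int64))


obs = MinMaxObserver()
# With CUDA available, obs.to("cuda") would move both buffers; here we just
# show that .to() touches buffers at all by converting the float one.
obs.to(torch.float64)
```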

Contributor

Ah, thanks for the clarification. I'll reconsider those and apply the changes soon.

Contributor Author

> Ah, thanks for the clarification. I'll reconsider those and apply the changes soon.

OK, thank you very much.

```python
    continue
if (
    dq.target
    != torch.ops.circle_custom.dequantize_mx_to_float.default
```
Contributor

It seems that plain quantize_mx and dequantize_mx would be simpler. Is there some consideration behind exposing the dtypes in the name?

Contributor Author

There is no fake_quantize for MX types (just circle_custom::quantize_mx). So quantize_float_to_mx is an attempt (maybe a failed one) to distinguish it from quantize_mx. If circle_custom::quantize_mx ever becomes circle_custom::fakequantize_mx, then the usual quantize/dequantize naming scheme applies.

Contributor Author

@mhs4670go
It can be renamed to any other (more appropriate) name.

Contributor

@mhs4670go mhs4670go Jan 21, 2026

Ah, I see. How about going with quantize_mx_decomposed and dequantize_mx_decomposed? This aligns with torch.ops.quantized_decomposed.

Contributor Author

> Ah, I see. How about go with quantize_mx_decomposed, and dequantize_mx_decomposed? This aligns with torch.ops.quantized_decomposed.

@mhs4670go
Ok. Got it. Thank you.

@stamalakhov stamalakhov force-pushed the full_quantization_br branch 7 times, most recently from 60fcd6a to f7bb4d9 Compare January 27, 2026 13:51
@stamalakhov stamalakhov force-pushed the full_quantization_br branch 6 times, most recently from 06581cb to 542db37 Compare January 30, 2026 06:07
@stamalakhov stamalakhov changed the title from [DRAFT][NO_MERGE][quantization] Full quantization to [DRAFT][quantization] Full quantization Jan 30, 2026
@stamalakhov stamalakhov force-pushed the full_quantization_br branch 2 times, most recently from 628786e to 234d069 Compare February 5, 2026 06:56
```python
try:
    k_total, v_total = past_key_value.update(k_rot, v)
if len(sig.parameters) == 2:
    k_total, v_total = past_key_value.update(k_rot, v)
```
Contributor

@stamalakhov Just out of curiosity, can this be exported? Or is it just a fallback for the float model?

Contributor Author

Yep, it works. There are a lot of model outputs (which are just the k, v outputs) for use_cache == True.

Contributor Author

The if len(sig.parameters) == 2 check is there to make the test pass.

Contributor

@mhs4670go mhs4670go Feb 5, 2026

When is len(sig.parameters) 2? I'm asking because I'm going to trim the cache logic.

Contributor Author

I've added if len(sig.parameters) == 2 to make quantization.wrapq.wrappers.llama.test_quant_attn.TestQuantLlamaAttention pass: MockCache.update takes just two inputs, while exporting the whole Llama requires 3 inputs.
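The dispatch being described could look roughly like this (a sketch: cache_update is a hypothetical helper, and this MockCache only mirrors the two-input test double; note that on a bound method inspect.signature does not count self):

```python
import inspect


class MockCache:
    """Two-input test double, as in the attention test."""

    def update(self, key, value):
        return key, value


def cache_update(cache, k_rot, v, layer_idx=0):
    sig = inspect.signature(cache.update)  # bound method: `self` not counted
    if len(sig.parameters) == 2:           # MockCache-style: (key, value)
        return cache.update(k_rot, v)
    return cache.update(k_rot, v, layer_idx)  # full cache with layer_idx
```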

> I'm gonna trim the cache logic.

@mhs4670go
You mean remove it completely?

Contributor

Ah, I got it. I think you can just modify MockCache to take a third parameter, layer_idx, which will make the code simpler.

> You mean remove it completely?

No. I just checked whether it had redundant logic, but it seems there isn't much :)
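The suggested change could be sketched like this (a hypothetical MockCache; the real test double lives in the PR's test code): giving update() a layer_idx parameter lets the wrapper call every cache the same way, removing the signature check.

```python
class MockCache:
    """Test double whose update() takes the third layer_idx parameter."""

    def __init__(self):
        self.key_cache = {}
        self.value_cache = {}

    def update(self, key, value, layer_idx):
        # Store per-layer state and return the totals, like a real cache.
        self.key_cache[layer_idx] = key
        self.value_cache[layer_idx] = value
        return key, value
```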

Contributor Author

> Ah, I got it. I think you can just modify MockCache to have third parameter - layer_idx. Which will makes the codes simpler.

@mhs4670go
Understood. I'll update MockCache. Thank you.

@stamalakhov stamalakhov force-pushed the full_quantization_br branch 6 times, most recently from fb0bd17 to 19cc002 Compare February 9, 2026 10:46
```python
# Case A: HuggingFace-style transformers: model.model.layers
lm = getattr(root, "model", None)

embeddings = (
```
Contributor Author

@mhs4670go
Maybe it would be better to introduce something like QuantLlamaModel and wrap all the internal structure (embed, lm_head, norm) inside it, leaving this general code as a fallback? Like this:

```python
try:
    wrap(the_whole_model)
except Exception:
    # no specific wrapper for this class, so use the general logic
    ...
```

Contributor

@mhs4670go mhs4670go Feb 10, 2026

It would be good to have QuantLlamaModel for convenience when we evaluate the model or something like that. Just as a note, even though the whole model is quantized, only the decoder layers would need to be exported because of the runtime requirements. Nothing has been fixed yet, though.

Contributor Author

OK. Got it. Thank you.

@stamalakhov stamalakhov force-pushed the full_quantization_br branch 4 times, most recently from 83f9b1e to 93aa4c8 Compare February 13, 2026 06:04
@stamalakhov stamalakhov force-pushed the full_quantization_br branch 4 times, most recently from 797e714 to 07cf6fd Compare February 22, 2026 09:38
@stamalakhov stamalakhov force-pushed the full_quantization_br branch 2 times, most recently from be606ae to 91b6916 Compare February 27, 2026 12:28
@stamalakhov stamalakhov force-pushed the full_quantization_br branch from 91b6916 to cde8f76 Compare March 5, 2026 08:22
@stamalakhov
Contributor Author

stamalakhov commented Mar 5, 2026

@mhs4670go
I've run the advanced MSE modes from this draft on unsloth/Llama-3.2-3B-Instruct:

seq_len == 2048:

| Config ID | PPL | arc_easy (%) | arc_challenge (%) | winogrande (%) | openbookqa (%) |
|---|---:|---:|---:|---:|---:|
| FP32 | 11.05 | 75 | 44 | 69 | 29 |
| GPTQ_MSE_w4A16_main_branch_mse | 12.92 | 72 | 41 | 67 | 28 |
| GPTQ_MSE_w4A16_smse | 12.12 | 73 | 41 | 67 | 30 |
| GPTQ_MSE_w4A16_smse_for_gptq | 12.11 | 72 | 41 | 67 | 27 |
Tables from the logs:

GPTQ_MSE_w4A16_main_branch_mse:

Quantized RESULTS ARE:
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.4078|±  |0.0144|
|             |       |none  |     0|acc_norm|↑  |0.4386|±  |0.0145|
|arc_easy     |      1|none  |     0|acc     |↑  |0.7226|±  |0.0092|
|             |       |none  |     0|acc_norm|↑  |0.6738|±  |0.0096|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2840|±  |0.0202|
|             |       |none  |     0|acc_norm|↑  |0.3920|±  |0.0219|
|winogrande   |      1|none  |     0|acc     |↑  |0.6661|±  |0.0133|

GPTQ_MSE_w4A16_smse:

Quantized RESULTS ARE:
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.4096|±  |0.0144|
|             |       |none  |     0|acc_norm|↑  |0.4352|±  |0.0145|
|arc_easy     |      1|none  |     0|acc     |↑  |0.7323|±  |0.0091|
|             |       |none  |     0|acc_norm|↑  |0.6696|±  |0.0097|
|openbookqa   |      1|none  |     0|acc     |↑  |0.3020|±  |0.0206|
|             |       |none  |     0|acc_norm|↑  |0.3760|±  |0.0217|
|winogrande   |      1|none  |     0|acc     |↑  |0.6748|±  |0.0132|

GPTQ_MSE_w4A16_smse_for_gptq:

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.4061|±  |0.0144|
|             |       |none  |     0|acc_norm|↑  |0.4309|±  |0.0145|
|arc_easy     |      1|none  |     0|acc     |↑  |0.7247|±  |0.0092|
|             |       |none  |     0|acc_norm|↑  |0.6692|±  |0.0097|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2720|±  |0.0199|
|             |       |none  |     0|acc_norm|↑  |0.3740|±  |0.0217|
|winogrande   |      1|none  |     0|acc     |↑  |0.6654|±  |0.0133|

GPTQ_MSE_w4A16_smse_for_gptq_with_4096_calib_seqlen:

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.4078|±  |0.0144|
|             |       |none  |     0|acc_norm|↑  |0.4369|±  |0.0145|
|arc_easy     |      1|none  |     0|acc     |↑  |0.7256|±  |0.0092|
|             |       |none  |     0|acc_norm|↑  |0.6696|±  |0.0097|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2680|±  |0.0198|
|             |       |none  |     0|acc_norm|↑  |0.3780|±  |0.0217|
|winogrande   |      1|none  |     0|acc     |↑  |0.6677|±  |0.0132|

GPTQ_MSE_w4A16_smse_for_gptq_with_4096_calib_seqlen_and_256_qcalib_samples:

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.4104|±  |0.0144|
|             |       |none  |     0|acc_norm|↑  |0.4241|±  |0.0144|
|arc_easy     |      1|none  |     0|acc     |↑  |0.7201|±  |0.0092|
|             |       |none  |     0|acc_norm|↑  |0.6545|±  |0.0098|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2800|±  |0.0201|
|             |       |none  |     0|acc_norm|↑  |0.3660|±  |0.0216|
|winogrande   |      1|none  |     0|acc     |↑  |0.6567|±  |0.0133|

Do we need something like this (e.g. smse) in the main branch?


Please note that smse_for_gptq produces a better PPL but fails to improve accuracy; it seems to be overfitting, which can be avoided by using more calibration data (e.g. a longer calibration seqlen or more calibration samples, as in the runs above).

This draft tries to get a fully quantized model.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
@stamalakhov stamalakhov force-pushed the full_quantization_br branch from cde8f76 to c6d7b32 Compare March 10, 2026 12:39