
[quantization] Quantization of Llama#492

Merged
mhs4670go merged 1 commit into Samsung:main from stamalakhov:quant_full_model_PR
Feb 20, 2026
Conversation

@stamalakhov
Contributor

@stamalakhov stamalakhov commented Feb 13, 2026

This PR quantizes the full Llama model and converts it to circle format.

Log of `python tico/quantization/wrapq/examples/quantize_full_qmodel_with_gptq.py --model Maykeye/TinyLLama-v0 --save_circles_to_folder "." --max_seq_len 2048`

Namespace(model='Maykeye/TinyLLama-v0', device='cuda', dtype='float32', seed=42, trust_remote_code=False, hf_token=None, no_tqdm=False, no_GPTQ=False, no_PTQ=False, save_circle_to_folder='.', cache_dir='/mnt/storage/transformers_cache', nsamples_for_qcalibration=128, linear_weight_bits=4, gptq_mse=False, max_seq_len=2048, embedding_weight_bits=8, lm_head_weight_bits=4, eval_tasks=None)
=== Config ===
Model            : Maykeye/TinyLLama-v0
Device           : cuda
DType            : float32

Loading FP model …
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.

Calculating original perplexities …
Token indices sequence length is longer than the specified maximum sequence length for this model (324381 > 2048). Running this sequence through the model will result in indexing errors
PPL:  99%|███████████████▌ | 158/159 [00:04<00:00, 37.60it/s]

┌── Wikitext-2 test perplexity ─────────────
│ FP32 :  7584.31
└───────────────────────────────────────────
Applying GPTQ …
Quantizing layers: 100%|███████████████| 8/8 [00:08<00:00,  1.07s/layer]
Wrapping layers with PTQWrapper …
Calibrating PTQ observers …
100%|███████████████| 128/128 [00:31<00:00,  4.12it/s]

Calculating perplexities …
PPL:  99%|███████████████▌ | 158/159 [00:35<00:00,  4.47it/s]

┌── Wikitext-2 test perplexity ─────────────
│ int16 :  7410.80
└───────────────────────────────────────────
saving input embedding to /mnt/storage/slow_repos/TICO/embedding.q.circle
saving lm_head to /mnt/storage/slow_repos/TICO/lm_head.q.circle
saving layers
saving model layer_0 to /mnt/storage/slow_repos/TICO/decoder_layer_0.q.circle
saving model layer_1 to /mnt/storage/slow_repos/TICO/decoder_layer_1.q.circle
saving model layer_2 to /mnt/storage/slow_repos/TICO/decoder_layer_2.q.circle
saving model layer_3 to /mnt/storage/slow_repos/TICO/decoder_layer_3.q.circle
saving model layer_4 to /mnt/storage/slow_repos/TICO/decoder_layer_4.q.circle
saving model layer_5 to /mnt/storage/slow_repos/TICO/decoder_layer_5.q.circle
saving model layer_6 to /mnt/storage/slow_repos/TICO/decoder_layer_6.q.circle
saving model layer_7 to /mnt/storage/slow_repos/TICO/decoder_layer_7.q.circle
saving model.model to /mnt/storage/slow_repos/TICO/model.model.q.circle
saving the whole model to /mnt/storage/slow_repos/TICO/model.q.circle

Draft: #436
TICO-DCO-1.0-Signed-off-by: s.malakhov s.malakhov@partner.samsung.com

@stamalakhov stamalakhov self-assigned this Feb 13, 2026
@stamalakhov stamalakhov marked this pull request as draft February 13, 2026 10:38
@stamalakhov
Contributor Author

@mhs4670go
Should I provide tests for it and/or split it into smaller PRs?

@stamalakhov stamalakhov marked this pull request as ready for review February 13, 2026 11:43
@stamalakhov stamalakhov marked this pull request as draft February 13, 2026 13:54
@stamalakhov stamalakhov force-pushed the quant_full_model_PR branch 2 times, most recently from 2dc8751 to 5f315b3 Compare February 18, 2026 11:23
@mhs4670go
Contributor

Should I provide tests for it and/or split it into smaller PRs?

No, you don't have to. The example script can be a test by itself.

@stamalakhov stamalakhov changed the title [quantization][draft] Quantization of Llama [quantization] Quantization of Llama Feb 19, 2026
@stamalakhov stamalakhov marked this pull request as ready for review February 19, 2026 13:43
@stamalakhov
Contributor Author

@mhs4670go
Not sure about the CI failure in check-style.

@mhs4670go
Contributor

mhs4670go commented Feb 20, 2026

Not sure about the CI failure in check-style.

This is a problem with ufmt: it uses a feature that is deprecated in Python 3.14. Let's try ufmt==2.9.1. I'll post a PR for this.

@mhs4670go mhs4670go mentioned this pull request Feb 20, 2026
@mhs4670go
Contributor

@stamalakhov Could you rebase the PR?

@stamalakhov
Contributor Author

@stamalakhov Could you rebase the PR?

@mhs4670go
Rebased, but still no luck 😢

@mhs4670go
Contributor

mhs4670go commented Feb 20, 2026

Then, please set the Python version to 3.12.

check-style:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'  # HERE

      - name: "Run configure"
        run: |
          ./ccex configure format

      - name: "Run linters"
        run: |
          ./ccex format --no-apply-patches
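
For reference, the suggested pin would be this one-value change in the same step (a sketch against the workflow shown above, not the merged fix):

```yaml
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'  # pinned instead of '3.x', which resolves to 3.14
```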

@stamalakhov
Contributor Author

Then, please set the Python version to 3.12.

@mhs4670go
Thank you very much. The problem finally showed up. I think I should revert the Python version change afterwards.

This PR quantizes the full `Llama` model and converts it to circle format.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
Comment on lines +188 to +190
# To avoid introducing attention_mask as a graph input, use the preset attention mask
L = hidden_states.size(1)
attention_mask = self._slice_causal(L, hidden_states.device)
Contributor


Could you elaborate on this change? Is it necessary?

Contributor Author


Right now, without an explicit attention mask, LlamaModel sets its own default causal mask, and it has wrong dimensions.
I'll try to set it explicitly during conversion.

Contributor Author


@mhs4670go
This change is not necessary. I'll remove it. Thank you!

Contributor


Without an explicit attention mask, LlamaModel sets its own default causal mask, and it has wrong dimensions.

Ah, I understood.

Contributor Author


@mhs4670go
If attention_mask is set explicitly, like this:

attention_mask = torch.ones(1, q_m.config.max_position_embeddings, dtype=torch.bool)
cm = tico.convert(q_m, (calib_inputs[0], attention_mask), strict=False)
cm.save(save_path)

`_prepare_4d_causal_attention_mask_with_cache_position` in transformers will turn it into floats, so we would need to quantize it inside QuantLLamaDecoderLayer.

Contributor Author


@mhs4670go
Sorry, right now we do need this change 😢 .
attention_mask, if set from outside, goes through some non-trivial transforms inside transformers, each of them unquantized.
Setting attention_mask to None during conversion behaves the same way.
So resetting it explicitly in quant_decoder_layer.py breaks any dependency on the unquantized attention_mask and produces a correct model without any floats.
Please correct me if I'm wrong.
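
For context, the `_slice_causal(L, device)` call from the diff above could be sketched roughly like this. Only the call signature comes from the PR; the class name `CausalMaskSlicer`, the `max_seq_len` default, and the additive `-inf` mask form are assumptions for illustration, not the actual implementation:

```python
import torch


class CausalMaskSlicer:
    """Hypothetical sketch: prebuild a causal mask once, then slice it per
    forward pass, so the exported graph never takes attention_mask as an
    input (and transformers never rebuilds it in float)."""

    def __init__(self, max_seq_len: int = 2048):
        # Additive causal mask: 0 on/below the diagonal, -inf strictly above,
        # precomputed once at the maximum sequence length.
        full = torch.full((max_seq_len, max_seq_len), float("-inf"))
        self.causal = torch.triu(full, diagonal=1)

    def _slice_causal(self, L: int, device: torch.device) -> torch.Tensor:
        # Top-left L x L corner, expanded to (1, 1, L, L) as the decoder
        # layer's attention expects (batch and head broadcast dims).
        return self.causal[:L, :L].to(device)[None, None, :, :]
```

Because the mask is a constant of the wrapped layer rather than a model input, it gets baked into the converted graph, which matches the stated goal of avoiding the unquantized mask transforms inside transformers.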

Contributor

@mhs4670go mhs4670go left a comment


LGTM

@mhs4670go mhs4670go merged commit 4ad84c7 into Samsung:main Feb 20, 2026
7 checks passed
@stamalakhov stamalakhov deleted the quant_full_model_PR branch February 20, 2026 08:22