[quantization] Quantization of Llama #492
Conversation
@mhs4670go |
Force-pushed from 2dc8751 to 5f315b3 (Compare)
No, you don't have to. The example script can be a test by itself.
Force-pushed from 5f315b3 to 4c918e6 (Compare)
@mhs4670go |
Force-pushed from 4c918e6 to be327c8 (Compare)
This is a problem with UFMT. It uses a feature that is deprecated in Python 3.14. Let's try using ufmt==2.9.1. I'll post a PR for this.
@stamalakhov Could you rebase the PR?
Force-pushed from be327c8 to 62ab91c (Compare)
@mhs4670go |
Then, please set the Python version to 3.12:

```yaml
check-style:
  runs-on: ubuntu-22.04
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: '3.x' # HERE
    - name: "Run configure"
      run: |
        ./ccex configure format
    - name: "Run linters"
      run: |
        ./ccex format --no-apply-patches
```
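With the version pinned as requested, the setup step might look like this (a sketch assuming only the `python-version` line changes):

```yaml
    - uses: actions/setup-python@v5
      with:
        python-version: '3.12'  # pinned instead of '3.x' to avoid the Python 3.14 deprecation hit by UFMT
```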
@mhs4670go |
This PR quantizes the full `LLama` model and converts it to circle format.
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
Force-pushed from 52ef612 to dbea659 (Compare)
```python
# to prevent introduction of attention_mask as a parameter let's use preset attention_mask
L = hidden_states.size(1)
attention_mask = self._slice_causal(L, hidden_states.device)
```
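For context, the diff doesn't show `_slice_causal` itself; a helper with that name might be a sketch along these lines (hypothetical implementation, assuming it builds a lower-triangular causal mask for a sequence of length `L`):

```python
import torch

def slice_causal(seq_len: int, device: torch.device) -> torch.Tensor:
    """Build a boolean causal mask: position i may attend to positions <= i.

    This is a hedged sketch, not the actual _slice_causal from the PR;
    the real helper may slice a preset full-size mask instead.
    """
    return torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=device)
    )
```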
Could you elaborate this change? Is it necessary?
Right now, yes. Without an explicit attention mask, LlamaModel builds its own default causal mask, and that mask has the wrong dimensions.
I'll try to set it explicitly during conversion.
@mhs4670go
This change is not necessary. I'll remove it. Thank you!
> Without an explicit attention mask, LlamaModel builds its own default causal mask, and that mask has the wrong dimensions.
Ah, I understood.
@mhs4670go
If attention_mask is set explicitly like this:

```python
attention_mask = torch.ones(1, q_m.config.max_position_embeddings, dtype=torch.bool)
cm = tico.convert(q_m, (calib_inputs[0], attention_mask), strict=False)
cm.save(save_path)
```

then `_prepare_4d_causal_attention_mask_with_cache_position` in transformers turns it into floats, so we would need to quantize it inside QuantLLamaDecoderLayer.
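The float mask in question is the additive form attention expects: kept positions contribute 0.0 and masked positions the dtype minimum. A minimal sketch of that conversion (this is the general idea, not the transformers implementation itself):

```python
import torch

def to_additive_float_mask(bool_mask: torch.Tensor,
                           dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Turn a boolean keep-mask into an additive float mask.

    True (attend) -> 0.0; False (masked) -> finfo(dtype).min, so adding the
    mask to attention scores drives masked logits toward -inf before softmax.
    Sketch only; _prepare_4d_causal_attention_mask_with_cache_position also
    expands the mask to 4D and handles cache positions.
    """
    return torch.where(
        bool_mask,
        torch.zeros((), dtype=dtype),
        torch.full((), torch.finfo(dtype).min, dtype=dtype),
    )
```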
@mhs4670go
Sorry, right now we do need this change 😢.
If attention_mask is set externally, it undergoes some non-trivial transforms inside transformers, each of which is unquantized.
Setting attention_mask to None during conversion has the same effect.
So rebuilding it explicitly in quant_decoder_layer.py breaks any dependency on the unquantized attention_mask and produces a correct model without any floats.
Please correct me if I'm wrong.
This PR quantizes the full `LLama` model and converts it to circle format.

Log of `python tico/quantization/wrapq/examples/quantize_full_qmodel_with_gptq.py --model Maykeye/TinyLLama-v0 --save_circles_to_folder "." --max_seq_len 2048`

Draft: #436

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>