
[quantization] Use cuda-accelerated forward#539

Closed
stamalakhov wants to merge 1 commit into Samsung:main from stamalakhov:speed_up_attn

Conversation

@stamalakhov
Contributor

@stamalakhov stamalakhov commented Mar 6, 2026

This PR speeds up quantized model evaluation by using a `forward` method accelerated for CUDA.

Running quantization

python tico/quantization/wrapq/examples/quantize_full_qmodel_with_gptq.py --model unsloth/Llama-3.2-1B-Instruct --max_seq_len 256 --linear_weight_bits "4" --gptq_mse --eval_tasks="winogrande,arc_easy,arc_challenge,openbookqa" 

With the optimization enabled, the run took 28:09, with the following quantization results:
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.3259|±  |0.0137|
|             |       |none  |     0|acc_norm|↑  |0.3507|±  |0.0139|
|arc_easy     |      1|none  |     0|acc     |↑  |0.6414|±  |0.0098|
|             |       |none  |     0|acc_norm|↑  |0.6107|±  |0.0100|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2340|±  |0.0190|
|             |       |none  |     0|acc_norm|↑  |0.3460|±  |0.0213|
|winogrande   |      1|none  |     0|acc     |↑  |0.5588|±  |0.0140|

With the optimization disabled, it took 2:20:44, with these results:

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.3285|±  |0.0137|
|             |       |none  |     0|acc_norm|↑  |0.3473|±  |0.0139|
|arc_easy     |      1|none  |     0|acc     |↑  |0.6431|±  |0.0098|
|             |       |none  |     0|acc_norm|↑  |0.6090|±  |0.0100|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2320|±  |0.0189|
|             |       |none  |     0|acc_norm|↑  |0.3480|±  |0.0213|
|winogrande   |      1|none  |     0|acc     |↑  |0.5446|±  |0.0140|

The results differ slightly, likely due to numerical instability.

TICO-DCO-1.0-Signed-off-by: s.malakhov s.malakhov@partner.samsung.com

@stamalakhov stamalakhov self-assigned this Mar 6, 2026
@stamalakhov stamalakhov marked this pull request as ready for review March 6, 2026 11:53
@stamalakhov stamalakhov requested a review from mhs4670go March 6, 2026 11:53
Comment on lines +204 to +209
self.obs_q_x1,
self.obs_q_x2,
self.obs_q_cat,
self.obs_q_cos,
self.obs_q_sin,
self.obs_q_rot,
Contributor


Suggested change:

```diff
-self.obs_q_x1,
-self.obs_q_x2,
-self.obs_q_cat,
-self.obs_q_cos,
-self.obs_q_sin,
-self.obs_q_rot,
+self.obs_k_x1,
+self.obs_k_x2,
+self.obs_k_cat,
+self.obs_k_cos,
+self.obs_k_sin,
+self.obs_k_rot,
```

@mhs4670go
Copy link
Contributor

I think it is safer to treat this as a separate evaluation-only attention implementation, rather than as a CUDA-specific function inside the existing prefill path.

The current fast path is not simply a faster implementation of the same execution graph. It actually changes the structure of the computation. In addition, observers are no longer applied in exactly the same way.

In the current implementation, attention is computed head-by-head and fake quant / observers are applied to each head’s intermediate tensors individually (e.g., logits, attention weights, and outputs). In the accelerated path in this PR, however, observers are applied to the aggregated tensors instead.

This means the quantization statistics and fake quantization behavior can differ depending on:

  • how activations are grouped
  • where observers are applied
  • whether calibration was performed on the same execution path
  • the presence of outlier heads

Because of this, the fast path should be considered an approximate evaluation path, not a strict acceleration of the original quantized execution.
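The grouping effect above can be illustrated with a toy min/max observer. This is an illustrative sketch, not the TICO observer API: it only shows why per-head and aggregated observation record different calibration ranges when one head contains an outlier.

```python
# Toy sketch (not TICO code): why per-head vs aggregated observation
# yields different quantization ranges.

def minmax_range(values):
    """Return the (min, max) calibration range a simple observer would record."""
    return (min(values), max(values))

# Two attention heads; head 1 contains an outlier activation.
head0 = [0.1, 0.2, 0.3]
head1 = [0.1, 0.2, 8.0]   # outlier head

# Head-by-head path: each head gets its own observer and its own range.
per_head = [minmax_range(head0), minmax_range(head1)]

# Fast path: one observer over the aggregated (concatenated) tensor.
aggregated = minmax_range(head0 + head1)

assert per_head[0] == (0.1, 0.3)   # head 0 keeps a tight range
assert aggregated == (0.1, 8.0)    # the outlier head widens the shared range
```

With per-head observers, head 0's quantization grid spans only [0.1, 0.3]; under the aggregated observer its grid must span [0.1, 8.0], so head 0 loses precision. The same fake-quantized model can therefore produce different statistics on the two paths.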

Proposed direction

Therefore, instead of hiding this behind a runtime flag inside prefill, I think it is cleaner to make it an explicit variant, for example:

  • prefill (for current main branch implementation)
  • prefill_reference (for this PR, introduced via a QuantLlamaAttentionPrefillReference wrapper)

(or another clearly named evaluation-oriented variant).

This makes the contract clearer:

  • prefill remains the hardware-faithful / unrolled implementation
  • prefill_reference provides a faster evaluation-oriented implementation
  • benchmark results can explicitly indicate which path was used
  • we avoid mixing two different execution semantics inside the same wrapper
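A minimal sketch of what this explicit split could look like. The class names follow the review suggestion (QuantLlamaAttentionPrefillReference is proposed above); the dispatch helper and its behavior are hypothetical, not the actual TICO codebase:

```python
# Hypothetical sketch of the proposed variant split; method bodies are
# placeholders, not the real attention implementations.

class QuantLlamaAttentionPrefill:
    """Hardware-faithful path: head-by-head computation, per-head observers."""
    def forward(self, x):
        return f"prefill({x})"

class QuantLlamaAttentionPrefillReference:
    """Evaluation-oriented path: aggregated computation, faster on CUDA."""
    def forward(self, x):
        return f"prefill_reference({x})"

def make_attention(variant):
    # Explicit variant selection instead of a runtime flag hidden in prefill,
    # so benchmark logs can record exactly which path was used.
    variants = {
        "prefill": QuantLlamaAttentionPrefill,
        "prefill_reference": QuantLlamaAttentionPrefillReference,
    }
    return variants[variant]()
```

Selecting the variant by name keeps the two execution semantics in separate classes and makes the chosen path visible at the call site.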

With this separation, we can use the two variants for different purposes:

  • Use prefill_reference during development and experimentation:
    • fast ablation comparisons
    • pruning multiple quantization configurations
    • checking calibration effects
    • monitoring regression trends
    • quick pre-screening before running the final evaluation

Since PTQ experiments often require running many configurations, the ~5× speed improvement is very helpful here.

  • Use the original prefill implementation when reporting final results:
    • final benchmark numbers
    • strict baseline comparisons
    • CI accuracy checks
    • results that should reflect the exact accelerator-oriented execution

In other words, the typical flow would be:

  1. Run most experiments using prefill_reference for fast iteration.
  2. Select promising configurations.
  3. Re-run evaluation with the prefill implementation to obtain the final reported numbers.

This keeps experimentation fast while ensuring that the final reported results reflect the reference execution path.
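The three-step flow could be driven by a small harness like the following. Everything here is a stand-in: evaluate(), the config names, the accuracies, and the 0.62 screening threshold are invented for illustration and are not a tico API or real results:

```python
# Hypothetical driver for the screen-then-finalize flow; evaluate() and all
# numbers are placeholders, not real benchmark results.

def evaluate(config, variant):
    # Stand-in accuracies; pretend the fast path deviates slightly (cf. the
    # small table differences reported in this PR).
    base = {"cfg_a": 0.64, "cfg_b": 0.61}[config]
    return base + (0.002 if variant == "prefill_reference" else 0.0)

configs = ["cfg_a", "cfg_b"]

# 1. Fast screening with prefill_reference.
screened = {c: evaluate(c, "prefill_reference") for c in configs}

# 2. Keep promising configurations (hypothetical threshold).
finalists = [c for c, acc in screened.items() if acc > 0.62]

# 3. Final reported numbers with the faithful prefill path.
final = {c: evaluate(c, "prefill") for c in finalists}
```

Only the finalists pay the ~5x slower faithful evaluation, which is where the speedup matters when sweeping many PTQ configurations.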

@stamalakhov
Copy link
Contributor Author

> In other words, the typical flow would be:
>
>   1. Run most experiments using prefill_reference for fast iteration.
>   2. Select promising configurations.
>   3. Re-run evaluation with the prefill implementation to obtain the final reported numbers.
>
> This keeps experimentation fast while ensuring that the final reported results reflect the reference execution path.

@mhs4670go
Understood. Thank you for your review!

@stamalakhov stamalakhov deleted the speed_up_attn branch March 10, 2026 04:57