
[quantization] Use cuda-accelerated forward#539

Closed
stamalakhov wants to merge 1 commit into Samsung:main from stamalakhov:speed_up_attn

Conversation

@stamalakhov
Contributor

@stamalakhov stamalakhov commented Mar 6, 2026

This PR speeds up quantized model evaluation by using a `forward` method accelerated for CUDA.

Running quantization

python tico/quantization/wrapq/examples/quantize_full_qmodel_with_gptq.py --model unsloth/Llama-3.2-1B-Instruct --max_seq_len 256 --linear_weight_bits "4" --gptq_mse --eval_tasks="winogrande,arc_easy,arc_challenge,openbookqa" 

With the optimization enabled, the run took 28:09, with the following quantization results:
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.3259|±  |0.0137|
|             |       |none  |     0|acc_norm|↑  |0.3507|±  |0.0139|
|arc_easy     |      1|none  |     0|acc     |↑  |0.6414|±  |0.0098|
|             |       |none  |     0|acc_norm|↑  |0.6107|±  |0.0100|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2340|±  |0.0190|
|             |       |none  |     0|acc_norm|↑  |0.3460|±  |0.0213|
|winogrande   |      1|none  |     0|acc     |↑  |0.5588|±  |0.0140|

With the optimization disabled, it took 2:20:44, with these results:

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.3285|±  |0.0137|
|             |       |none  |     0|acc_norm|↑  |0.3473|±  |0.0139|
|arc_easy     |      1|none  |     0|acc     |↑  |0.6431|±  |0.0098|
|             |       |none  |     0|acc_norm|↑  |0.6090|±  |0.0100|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2320|±  |0.0189|
|             |       |none  |     0|acc_norm|↑  |0.3480|±  |0.0213|
|winogrande   |      1|none  |     0|acc     |↑  |0.5446|±  |0.0140|

The results differ slightly, likely due to numerical instability.

TICO-DCO-1.0-Signed-off-by: s.malakhov s.malakhov@partner.samsung.com

@stamalakhov stamalakhov self-assigned this Mar 6, 2026
@stamalakhov stamalakhov marked this pull request as ready for review March 6, 2026 11:53
@stamalakhov stamalakhov requested a review from mhs4670go March 6, 2026 11:53
Comment on lines +204 to +209
self.obs_q_x1,
self.obs_q_x2,
self.obs_q_cat,
self.obs_q_cos,
self.obs_q_sin,
self.obs_q_rot,
Contributor


Suggested change:

```diff
-self.obs_q_x1,
-self.obs_q_x2,
-self.obs_q_cat,
-self.obs_q_cos,
-self.obs_q_sin,
-self.obs_q_rot,
+self.obs_k_x1,
+self.obs_k_x2,
+self.obs_k_cat,
+self.obs_k_cos,
+self.obs_k_sin,
+self.obs_k_rot,
```

@mhs4670go
Copy link
Contributor

I think it is safer to treat this as a separate evaluation-only attention implementation, rather than as a CUDA-specific function inside the existing prefill path.

The current fast path is not simply a faster implementation of the same execution graph. It actually changes the structure of the computation. In addition, observers are no longer applied in exactly the same way.

In the current implementation, attention is computed head-by-head and fake quant / observers are applied to each head’s intermediate tensors individually (e.g., logits, attention weights, and outputs). In the accelerated path in this PR, however, observers are applied to the aggregated tensors instead.

This means the quantization statistics and fake quantization behavior can differ depending on:

  • how activations are grouped
  • where observers are applied
  • whether calibration was performed on the same execution path
  • the presence of outlier heads

Because of this, the fast path should be considered an approximate evaluation path, not a strict acceleration of the original quantized execution.
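The grouping effect above can be illustrated with a toy min/max observer. This is an illustrative sketch, not the TICO observer API: it only shows why per-head and aggregated observation record different calibration ranges when one head contains an outlier.

```python
# Toy sketch (not TICO code): why per-head vs aggregated observation
# yields different quantization ranges.

def minmax_range(values):
    """Return the (min, max) calibration range a simple observer would record."""
    return (min(values), max(values))

# Two attention heads; head 1 contains an outlier activation.
head0 = [0.1, 0.2, 0.3]
head1 = [0.1, 0.2, 8.0]   # outlier head

# Head-by-head path: each head gets its own observer and its own range.
per_head = [minmax_range(head0), minmax_range(head1)]

# Fast path: one observer over the aggregated (concatenated) tensor.
aggregated = minmax_range(head0 + head1)

assert per_head[0] == (0.1, 0.3)   # head 0 keeps a tight range
assert aggregated == (0.1, 8.0)    # the outlier head widens the shared range
```

With per-head observers, head 0's quantization grid spans only [0.1, 0.3]; under the aggregated observer its grid must span [0.1, 8.0], so head 0 loses precision. The same fake-quantized model can therefore produce different statistics on the two paths.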

Proposed direction

Therefore, instead of hiding this behind a runtime flag inside prefill, I think it is cleaner to make it an explicit variant, for example:

  • prefill (for current main branch implementation)
  • prefill_reference (for this PR, introduced via a QuantLlamaAttentionPrefillReference wrapper)

(or another clearly named evaluation-oriented variant).

This makes the contract clearer:

  • prefill remains the hardware-faithful / unrolled implementation
  • prefill_reference provides a faster evaluation-oriented implementation
  • benchmark results can explicitly indicate which path was used
  • we avoid mixing two different execution semantics inside the same wrapper
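A minimal sketch of what this explicit split could look like. The class names follow the review suggestion (QuantLlamaAttentionPrefillReference is proposed above); the dispatch helper and its behavior are hypothetical, not the actual TICO codebase:

```python
# Hypothetical sketch of the proposed variant split; method bodies are
# placeholders, not the real attention implementations.

class QuantLlamaAttentionPrefill:
    """Hardware-faithful path: head-by-head computation, per-head observers."""
    def forward(self, x):
        return f"prefill({x})"

class QuantLlamaAttentionPrefillReference:
    """Evaluation-oriented path: aggregated computation, faster on CUDA."""
    def forward(self, x):
        return f"prefill_reference({x})"

def make_attention(variant):
    # Explicit variant selection instead of a runtime flag hidden in prefill,
    # so benchmark logs can record exactly which path was used.
    variants = {
        "prefill": QuantLlamaAttentionPrefill,
        "prefill_reference": QuantLlamaAttentionPrefillReference,
    }
    return variants[variant]()
```

Selecting the variant by name keeps the two execution semantics in separate classes and makes the chosen path visible at the call site.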

With this separation, we can use the two variants for different purposes:

  • Use prefill_reference during development and experimentation:
    • fast ablation comparisons
    • pruning multiple quantization configurations
    • checking calibration effects
    • monitoring regression trends
    • quick pre-screening before running the final evaluation

Since PTQ experiments often require running many configurations, the ~5× speed improvement is very helpful here.

  • Use the original prefill implementation when reporting final results:
    • final benchmark numbers
    • strict baseline comparisons
    • CI accuracy checks
    • results that should reflect the exact accelerator-oriented execution

In other words, the typical flow would be:

  1. Run most experiments using prefill_reference for fast iteration.
  2. Select promising configurations.
  3. Re-run evaluation with the prefill implementation to obtain the final reported numbers.

This keeps experimentation fast while ensuring that the final reported results reflect the reference execution path.
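The three-step flow could be driven by a small harness like the following. Everything here is a stand-in: evaluate(), the config names, the accuracies, and the 0.62 screening threshold are invented for illustration and are not a tico API or real results:

```python
# Hypothetical driver for the screen-then-finalize flow; evaluate() and all
# numbers are placeholders, not real benchmark results.

def evaluate(config, variant):
    # Stand-in accuracies; pretend the fast path deviates slightly (cf. the
    # small table differences reported in this PR).
    base = {"cfg_a": 0.64, "cfg_b": 0.61}[config]
    return base + (0.002 if variant == "prefill_reference" else 0.0)

configs = ["cfg_a", "cfg_b"]

# 1. Fast screening with prefill_reference.
screened = {c: evaluate(c, "prefill_reference") for c in configs}

# 2. Keep promising configurations (hypothetical threshold).
finalists = [c for c, acc in screened.items() if acc > 0.62]

# 3. Final reported numbers with the faithful prefill path.
final = {c: evaluate(c, "prefill") for c in finalists}
```

Only the finalists pay the ~5x slower faithful evaluation, which is where the speedup matters when sweeping many PTQ configurations.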

@stamalakhov
Copy link
Contributor Author

> In other words, the typical flow would be:
>
>   1. Run most experiments using prefill_reference for fast iteration.
>   2. Select promising configurations.
>   3. Re-run evaluation with the prefill implementation to obtain the final reported numbers.
>
> This keeps experimentation fast while ensuring that the final reported results reflect the reference execution path.

@mhs4670go
Understood. Thank you for your review!

@stamalakhov stamalakhov deleted the speed_up_attn branch March 10, 2026 04:57