[quantization] Use cuda-accelerated forward #539
stamalakhov wants to merge 1 commit into Samsung:main

Conversation
This PR speeds up quantized model evaluation using a `forward` method accelerated for CUDA.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
Existing lines:

```python
self.obs_q_x1,
self.obs_q_x2,
self.obs_q_cat,
self.obs_q_cos,
self.obs_q_sin,
self.obs_q_rot,
```

Suggested change:

```suggestion
self.obs_q_x1,
self.obs_q_x2,
self.obs_q_cat,
self.obs_q_cos,
self.obs_q_sin,
self.obs_q_rot,
self.obs_k_x1,
self.obs_k_x2,
self.obs_k_cat,
self.obs_k_cos,
self.obs_k_sin,
self.obs_k_rot,
```
I think it is safer to treat this as a separate evaluation-only attention implementation, rather than as a CUDA fast path inside the existing method. The current fast path is not simply a faster implementation of the same execution graph: it actually changes the structure of the computation, and observers are no longer applied in exactly the same way. In the current implementation, attention is computed head-by-head, and fake quant / observers are applied to each head's intermediate tensors individually (e.g., logits, attention weights, and outputs). In the accelerated path in this PR, however, observers are applied to the aggregated tensors instead. This means the quantization statistics and fake-quantization behavior can differ from the reference implementation.
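The effect on observer statistics can be illustrated with a toy min/max observer (a minimal sketch with made-up data; `observe` and `scale` are hypothetical helpers, not TICO's actual observer API):

```python
# Toy min/max observer: tracks a running value range used to derive a quant scale.
def observe(values, state):
    state["min"] = min(state["min"], min(values))
    state["max"] = max(state["max"], max(values))
    return state

def scale(state, num_bits=8):
    # Asymmetric affine scale over the observed range.
    return (state["max"] - state["min"]) / (2 ** num_bits - 1)

def fresh():
    return {"min": float("inf"), "max": float("-inf")}

# Three attention heads with very different value ranges (made-up data).
heads = [
    [-0.12, 0.05, 0.10],   # head 0: narrow range
    [-1.30, 0.40, 0.90],   # head 1: medium range
    [-9.50, 2.00, 8.70],   # head 2: wide range
]

# Reference path (head-by-head): one observer per head -> per-head scales.
per_head_scales = [scale(observe(h, fresh())) for h in heads]

# Fast path (aggregated): one observer over the concatenated values -> shared scale.
agg = fresh()
for h in heads:
    observe(h, agg)
shared_scale = scale(agg)

# The shared scale is dominated by the widest head, so the narrow heads are
# quantized much more coarsely than in the head-by-head reference path.
```

Here `shared_scale` collapses to the widest head's scale, which is why the aggregated-observer path can produce different quantization statistics than the per-head reference path.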
Because of this, the fast path should be considered an approximate evaluation path, not a strict acceleration of the original quantized execution.

Proposed direction

Therefore, instead of hiding this behind a runtime flag inside the existing method, I suggest exposing it as a separate method (or another clearly named evaluation-oriented variant). This makes the contract clearer.
With this separation, we can use the two variants for different purposes:
Since PTQ experiments often require running many configurations, the ~5× speed improvement is very helpful here.
In other words, the typical flow would be: sweep configurations with the fast evaluation path, then re-run the selected configuration with the reference path. This keeps experimentation fast while ensuring that the final reported results reflect the reference execution path.
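The proposed separation could look roughly like this (a hypothetical sketch; the class and method names are illustrative, not the project's actual API):

```python
class QuantAttention:
    """Sketch of exposing the two execution paths as distinct entry points
    instead of a runtime flag on one method (names are hypothetical)."""

    def forward(self, x):
        """Reference path: head-by-head computation, observers applied to each
        head's intermediates. Slower, but defines the quantized semantics."""
        return self._attend(x, per_head=True)

    def forward_eval_fast(self, x):
        """Evaluation-only path: batched CUDA-friendly computation, observers
        applied to aggregated tensors. Approximate, ~5x faster."""
        return self._attend(x, per_head=False)

    def _attend(self, x, per_head):
        # Stub standing in for the actual attention computation; it only
        # records which path was taken.
        return ("per-head" if per_head else "aggregated", x)
```

With this shape, PTQ sweeps call the fast variant, and only the final reported numbers go through the reference `forward`, so the contract of each method is visible at the call site.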
@mhs4670go
This PR speeds up quantized model evaluation using a `forward` method accelerated for CUDA. Running quantization with the optimization enabled took `28:09`, while with the optimization disabled it took `2:20:44`. The quantization results of the two runs are slightly different, which seems to be due to numerical instabilities.

TICO-DCO-1.0-Signed-off-by: s.malakhov s.malakhov@partner.samsung.com