To enable Llama3.2 1B (see #8), we had to upgrade transformers from v4.34.1 to v4.45.2.
The new version of transformers refactored the KV cache into a more efficient implementation. Adopting it would have required refactoring forward_early(...) and forward_remainder(...) in self_speculation/llama_model_utils.py, so we opted to keep using the less efficient legacy KV cache instead.
To ensure an apples-to-apples comparison, in 62debc0 we changed autoregressive decoding to use the legacy cache as well.
Ideally, we should update forward_early(...) and forward_remainder(...) to use the new, more efficient KV cache implementation in transformers.
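For anyone picking this up, here is a rough, dependency-free sketch of how the two cache layouts differ. It is only an illustration: `ToyDynamicCache` is a hypothetical stand-in written for this issue (nested lists replace tensors), not the transformers class and not the code in llama_model_utils.py. The method names mirror `DynamicCache.from_legacy_cache(...)` and `to_legacy_cache()`, which transformers provides for converting between the layouts. The legacy layout is assumed to be a tuple with one (keys, values) pair per layer.

```python
from typing import List, Tuple

# Legacy layout: one (keys, values) pair per layer, passed around as a tuple.
LegacyCache = Tuple[Tuple[list, list], ...]


class ToyDynamicCache:
    """Hypothetical, simplified analogue of a cache object (lists, not tensors)."""

    def __init__(self) -> None:
        self.key_cache: List[list] = []
        self.value_cache: List[list] = []

    def update(self, keys: list, values: list, layer_idx: int) -> Tuple[list, list]:
        """Append new per-token entries for one layer and return the full history."""
        if layer_idx > len(self.key_cache):
            raise IndexError("layers must be filled in order")
        if layer_idx == len(self.key_cache):
            # First time this layer is seen: copy so the caller's lists stay untouched.
            self.key_cache.append(list(keys))
            self.value_cache.append(list(values))
        else:
            self.key_cache[layer_idx].extend(keys)
            self.value_cache[layer_idx].extend(values)
        return self.key_cache[layer_idx], self.value_cache[layer_idx]

    def get_seq_length(self, layer_idx: int = 0) -> int:
        """Number of cached tokens for a layer (0 if the layer is not cached yet)."""
        if layer_idx >= len(self.key_cache):
            return 0
        return len(self.key_cache[layer_idx])

    @classmethod
    def from_legacy_cache(cls, past: LegacyCache) -> "ToyDynamicCache":
        """Build a cache object from the legacy tuple-of-tuples layout."""
        cache = cls()
        for layer_idx, (keys, values) in enumerate(past):
            cache.update(keys, values, layer_idx)
        return cache

    def to_legacy_cache(self) -> LegacyCache:
        """Export back to the legacy tuple-of-tuples layout."""
        return tuple(
            (list(k), list(v)) for k, v in zip(self.key_cache, self.value_cache)
        )


# Two layers, two cached tokens each, in the legacy layout.
legacy = (
    (["k0_t0", "k0_t1"], ["v0_t0", "v0_t1"]),
    (["k1_t0", "k1_t1"], ["v1_t0", "v1_t1"]),
)

cache = ToyDynamicCache.from_legacy_cache(legacy)
# Decoding one more token appends to every layer in place.
cache.update(["k0_t2"], ["v0_t2"], layer_idx=0)
cache.update(["k1_t2"], ["v1_t2"], layer_idx=1)
round_tripped = cache.to_legacy_cache()
```

If the real refactor follows a similar shape, forward_early(...) and forward_remainder(...) would take a cache object and call its update method per layer rather than indexing into and rebuilding a tuple, but that is an assumption about how the port would look, not a description of the current code.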