server : disable similarity slot selection with --cache-idle-slots and --parallel 1 #22083
Open
kiwixz wants to merge 1 commit into ggml-org:master from
Member
I'm not sure I understand the "cache trashing" that you observe. Can you demonstrate with some sample
Author
It happens when you have two interleaved conversations going that match by more than 10% (easy with huge system prompts).
To reproduce with curl:
function prompt() {
curl -H "content-type: application/json" http://localhost:8080/chat/completions \
-d '{"messages": [{"role": "user","content": "'"$*"'"}]}'
}
some_text="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
prompt "conversation 1\n$some_text"
prompt "conversation 2\n$some_text" # this erases the checkpoint of the previous prompt
prompt "conversation 1\n$some_text\ntest" # this erases the checkpoint of the previous prompt, causing full prompt reprocessing
EDIT: tested with
Author
Rebased now that #21741 is merged.
Based on #21741 to avoid conflicts.
Overview
As long as an input prompt matches enough of the previous one (by default 10% of the input prompt size, --slot-prompt-similarity), the server will always pick the same slot.
This leads to massive cache thrashing, similar to what is reported in #19977.
This was actually a preexisting issue, but #20993 made it worse by unloading idle slots.
So even if you had 2 active slots at one point, the server will "forget" about the idle one as far as the similarity slot selection is concerned.
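To make the failure mode concrete, here is a minimal sketch of how a similarity-based slot pick can behave. It is a simplified model and not the actual server code: `Slot`, `common_prefix_len` and `pick_slot_by_similarity` are hypothetical names, and I assume similarity is the shared prefix length divided by the input prompt length, with a threshold of 0 disabling the selection.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of a server slot: only the tokens of the
// last prompt it processed (the real slot state holds much more).
struct Slot {
    std::vector<int> cached_tokens;
};

// Length of the common prefix of two token sequences.
std::size_t common_prefix_len(const std::vector<int>& a, const std::vector<int>& b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        ++n;
    }
    return n;
}

// Returns the index of the slot whose cached prompt shares the largest
// fraction of the incoming prompt, if that fraction exceeds `threshold`.
// Returns -1 when no slot qualifies or when the selection is disabled.
int pick_slot_by_similarity(const std::vector<Slot>& slots,
                            const std::vector<int>& prompt,
                            float threshold) {
    // Assumption: a threshold of 0 turns similarity selection off entirely.
    if (threshold <= 0.0f || prompt.empty()) {
        return -1;
    }
    int best = -1;
    float best_sim = 0.0f;
    for (std::size_t i = 0; i < slots.size(); ++i) {
        const float sim = static_cast<float>(common_prefix_len(slots[i].cached_tokens, prompt))
                        / static_cast<float>(prompt.size());
        if (sim > threshold && sim > best_sim) {
            best = static_cast<int>(i);
            best_sim = sim;
        }
    }
    return best;
}
```

In this model, two conversations that share a long system prompt both score far above a 0.1 threshold against the same slot, so the second conversation is routed to the slot holding the first one and overwrites its state, even when another slot is empty.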
#20993 gave me a new idea to fix the whole issue: we can just rely on the prompt cache to do the heavy lifting!
This PR disables prompt similarity in two cases:
- --cache-idle-slots is enabled (default): the similarity slot selection just won't work as designed, so let's avoid future confusion
- --parallel 1: the similarity slot selection does nothing for us
Note: there is an important side effect of using the LRU cache instead of slot similarity: the cache is always updated.
This is what helps the most with one slot. I think this behavior is more aligned with user expectations and other default parameters.
Users can always tune caching with --cache-ram, as advertised when launching the server.
There is an edge case I have no idea how to fix, but I hope it can be improved in another PR if necessary: the idle slots are saved after the initialization of the current task.
This means the current task cannot benefit from the cache of the idle slot even though it's about to be saved anyway.
Additional information
You can get the same effect before this PR by passing --slot-prompt-similarity 0.
This PR changes the defaults so that users don't have to do this to get the best caching.
Other things I tried:
--clear-idle) #20993
Requirements