
server : disable similarity slot selection with --cache-idle-slots and --parallel 1 (#22083)

Open
kiwixz wants to merge 1 commit into ggml-org:master from kiwixz:slot_selection

Conversation


@kiwixz kiwixz commented Apr 18, 2026

Based on #21741 to avoid conflicts.

Overview

As long as an input prompt matches enough of the previous one (by default 10% of the input prompt size, --slot-prompt-similarity), the server will always pick the same slot.
This leads to massive cache thrashing, similar to what is reported in #19977.
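The selection behavior can be pictured with a small sketch (a simplified model in Python, not the actual llama.cpp C++ code; the slot layout and `pick_slot` helper are illustrative, and the 0.10 threshold mirrors the `--slot-prompt-similarity` default):

```python
def longest_common_prefix(a, b):
    """Length of the shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_slot(slots, prompt, threshold=0.10):
    """Simplified model of similarity-based slot selection.

    Any slot whose cached prompt shares more than `threshold` of the
    new prompt wins, so interleaved conversations with a large common
    system prompt keep landing on the same slot and evict each
    other's cache.
    """
    best, best_sim = None, threshold
    for slot in slots:
        sim = longest_common_prefix(slot["cache"], prompt) / max(len(prompt), 1)
        if sim > best_sim:
            best, best_sim = slot, sim
    return best
```

In this model, two conversations that share a long system prompt both exceed the threshold against the same slot, which is exactly the thrashing pattern described above.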

This was actually a preexisting issue, but #20993 made it worse by unloading idle slots.
So even if you had two active slots at one point, the server will "forget" about the idle one as far as similarity slot selection is concerned.

#20993 gave me a new idea to fix the whole issue: we can just rely on the prompt cache to do the heavy lifting!

This PR disables prompt similarity in two cases:

  • when --cache-idle-slots is enabled (the default), similarity slot selection just won't work as designed, so let's avoid future confusion
  • with --parallel 1, similarity slot selection does nothing for us
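The resulting policy can be sketched as follows (a sketch under assumptions: the real server is C++, and the parameter names and slot fields here are illustrative, not the actual code):

```python
def select_slot(slots, prompt, n_parallel, cache_idle_slots, threshold=0.10):
    """Sketch of the slot-selection policy after this PR.

    With --cache-idle-slots (the default) or --parallel 1,
    similarity-based selection is skipped entirely and the
    least-recently-used slot is taken, letting the prompt cache do
    the heavy lifting.
    """
    if not cache_idle_slots and n_parallel > 1:
        # legacy path: prefer the slot with the most similar cached prompt
        best, best_sim = None, threshold
        for slot in slots:
            n = 0
            for a, b in zip(slot["cache"], prompt):
                if a != b:
                    break
                n += 1
            sim = n / max(len(prompt), 1)
            if sim > best_sim:
                best, best_sim = slot, sim
        if best is not None:
            return best
    # new default: fall back to the least-recently-used slot
    return min(slots, key=lambda s: s["last_used"])
```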

Note: there is an important side effect of using the LRU cache instead of slot similarity: the cache is always updated.
This is what helps the most with one slot. I think this behavior is more aligned with user expectations and other default parameters.
Users can always tune caching with --cache-ram, as advertised when launching the server.

There is an edge case I have no idea how to fix, but I hope it can be improved in another PR if necessary: the idle slots are saved after the current task is initialized.
This means the current task cannot benefit from the cache of an idle slot even though that slot is about to be saved anyway.

Additional information

You can get the same effect before this PR by passing --slot-prompt-similarity 0.
This PR changes the defaults so that users don't have to do this to get the best caching.

Other things I tried:

Requirements

@ggerganov (Member)

I'm not sure I understand the "cache thrashing" that you observe. Can you demonstrate with some sample curl commands?

@kiwixz (Author)

kiwixz commented Apr 18, 2026

It happens when you have two interleaved conversations going that match by more than 10% (easy with huge system prompts).

To reproduce with curl:

function prompt() {
    # send a single-turn chat completion to the local server
    curl -H "content-type: application/json" http://localhost:8080/chat/completions \
        -d '{"messages": [{"role": "user", "content": "'"$*"'"}]}'
}

some_text="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

prompt "conversation 1\n$some_text"
prompt "conversation 2\n$some_text"  # this erases the checkpoint of the previous prompt
prompt "conversation 1\n$some_text\ntest"  # this erases the checkpoint of the previous prompt, full prompt reprocessing

EDIT: tested with -np 1; it may need more text at the start of the prompt for -np > 1, or a lower threshold, e.g. -sps 0.001.

@kiwixz (Author)

kiwixz commented Apr 20, 2026

Rebased now that #21741 is merged.


Labels

examples · python · server
