
server : disable similarity slot selection with --cache-idle-slots and --parallel 1 (#22083)

Open
kiwixz wants to merge 1 commit into ggml-org:master from kiwixz:slot_selection

Conversation


@kiwixz kiwixz commented Apr 18, 2026

Based on #21741 to avoid conflicts.

Overview

As long as an input prompt matches enough of the previous one (by default 10% of the input prompt size, --slot-prompt-similarity), the server will always pick the same slot.
This leads to massive cache thrashing, similar to what is reported in #19977.
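The selection behavior can be pictured with a small sketch (a simplified model in Python, not the actual llama.cpp C++ code; the slot layout and `pick_slot` helper are illustrative, and the 0.10 threshold mirrors the `--slot-prompt-similarity` default):

```python
def longest_common_prefix(a, b):
    """Length of the shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_slot(slots, prompt, threshold=0.10):
    """Simplified model of similarity-based slot selection.

    Any slot whose cached prompt shares more than `threshold` of the
    new prompt wins, so interleaved conversations with a large common
    system prompt keep landing on the same slot and evict each
    other's cache.
    """
    best, best_sim = None, threshold
    for slot in slots:
        sim = longest_common_prefix(slot["cache"], prompt) / max(len(prompt), 1)
        if sim > best_sim:
            best, best_sim = slot, sim
    return best
```

In this model, two conversations that share a long system prompt both exceed the threshold against the same slot, which is exactly the thrashing pattern described above.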

This was actually a preexisting issue, but #20993 made it worse by unloading idle slots.
So even if you had two active slots at one point, the server will "forget" about the idle one as far as similarity slot selection is concerned.

#20993 gave me a new idea to fix the whole issue: we can just rely on the prompt cache to do the heavy lifting!

This PR disables prompt similarity in two cases:

  • when --cache-idle-slots is enabled (the default), similarity slot selection just won't work as designed, so let's avoid future confusion
  • with --parallel 1, similarity slot selection does nothing for us
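The resulting policy can be sketched as follows (a sketch under assumptions: the real server is C++, and the parameter names and slot fields here are illustrative, not the actual code):

```python
def select_slot(slots, prompt, n_parallel, cache_idle_slots, threshold=0.10):
    """Sketch of the slot-selection policy after this PR.

    With --cache-idle-slots (the default) or --parallel 1,
    similarity-based selection is skipped entirely and the
    least-recently-used slot is taken, letting the prompt cache do
    the heavy lifting.
    """
    if not cache_idle_slots and n_parallel > 1:
        # legacy path: prefer the slot with the most similar cached prompt
        best, best_sim = None, threshold
        for slot in slots:
            n = 0
            for a, b in zip(slot["cache"], prompt):
                if a != b:
                    break
                n += 1
            sim = n / max(len(prompt), 1)
            if sim > best_sim:
                best, best_sim = slot, sim
        if best is not None:
            return best
    # new default: fall back to the least-recently-used slot
    return min(slots, key=lambda s: s["last_used"])
```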

Note: there is an important side effect of using the LRU cache instead of slot similarity: the cache is always updated.
This is what helps the most with one slot. I think this behavior is more aligned with user expectations and other default parameters.
Users can always tune caching with --cache-ram, as advertised when launching the server.

There is an edge case I have no idea how to fix, but I hope it can be improved in another PR if necessary: the idle slots are saved after the current task is initialized.
This means the current task cannot benefit from the cache of an idle slot even though that slot is about to be saved anyway.

Additional information

You can get the same effect before this PR by passing --slot-prompt-similarity 0.
This PR changes the defaults so that users don't have to do this to get the best caching.

Other things I tried:

Requirements

@ggerganov (Member)

I'm not sure I understand the "cache thrashing" that you observe. Can you demonstrate with some sample curl commands?

@kiwixz (Author)

kiwixz commented Apr 18, 2026

It happens when you have two interleaved conversations going that match by more than 10% (easy with huge system prompts).

To reproduce with curl:

function prompt() {
    # send a single-turn chat completion to the local server
    curl -H "content-type: application/json" http://localhost:8080/chat/completions \
        -d '{"messages": [{"role": "user", "content": "'"$*"'"}]}'
}

some_text="Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

prompt "conversation 1\n$some_text"
prompt "conversation 2\n$some_text"  # this erases the checkpoint of the previous prompt
prompt "conversation 1\n$some_text\ntest"  # this erases the checkpoint of the previous prompt, full prompt reprocessing

EDIT: tested with -np 1; it may need more text at the start of the prompt for -np > 1, or a lower threshold, e.g. -sps 0.001.

@kiwixz (Author)

kiwixz commented Apr 20, 2026

Rebased now that #21741 is merged.


Labels

examples · python · server
