
server : do not cap slot context to training context (#22140)#22145

Open
jinweihan-ai wants to merge 1 commit into ggml-org:master from jinweihan-ai:server-no-cap-slot-ctx

Conversation


jinweihan-ai commented Apr 20, 2026

Summary

Fixes #22140.

server_context silently capped each slot's n_ctx to the model's training context. As a result, any user who extended the context via RoPE scaling (YaRN), the whole point of models like Qwen3, effectively had their --ctx-size ignored once the slot was created, even though the KV cache had already been sized for the full n_ctx_seq.

This PR drops the cap and keeps only the warning. llama_context itself already logs "n_ctx_seq (...) > n_ctx_train (...) -- possible training context overflow", so users still see the safety signal.

Before

llama_context: n_ctx_seq     = 4096
llama_kv_cache: size =    2.50 MiB (  4096 cells, ...)
srv    load_model: the slot context (4096) exceeds the training context of the model (2048) - capping
slot   load_model: id  0 | task -1 | new slot, n_ctx = 2048   ← halved

After

llama_context: n_ctx_seq     = 4096
llama_kv_cache: size =    2.50 MiB (  4096 cells, ...)
srv    load_model: the slot context (4096) exceeds the training context of the model (2048) - generation quality may degrade beyond the training context unless RoPE scaling is configured
slot   load_model: id  0 | task -1 | new slot, n_ctx = 4096  ← as requested

/props now reports default_generation_settings.n_ctx = 4096 (previously 2048).

Test plan

  • Reproduced the bug on master with stories260K.gguf (n_ctx_train = 2048) and -c 4096.
  • Verified the patched build preserves the user-requested n_ctx in both the slot init log and the /props endpoint.
  • /completion still returns correctly after the change (20 tokens, stop=true, coherent output).

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes. This PR was produced in an AI-assisted workflow: an agent helped surface the candidate issue, drafted the patch, and wrote this description. The fix and reproduction were reviewed and verified locally before submitting (bug reproduced on master, fix built cleanly, slot n_ctx and /completion checked), with human review and validation in the loop.

The per-slot cap overrides the user-requested context size even when
it was explicitly extended via RoPE scaling (YaRN), which is the whole
point of YaRN-aware models such as Qwen3. The KV cache is already
allocated for the full n_ctx_seq, so capping slot.n_ctx only throws
away addressable cells that the user paid memory for.

llama_context already warns about "possible training context overflow"
when n_ctx_seq > n_ctx_train, so dropping the server-side cap keeps
the safety signal without silently ignoring --ctx-size.

Closes ggml-org#22140
@jinweihan-ai jinweihan-ai requested a review from a team as a code owner April 20, 2026 06:30

ggml-gh-bot commented Apr 20, 2026

Hi @jinweihan-ai, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.


Successfully merging this pull request may close these issues:

  • Eval bug: context length incorrectly capped in server for yarn extendable context models (#22140)
