
Upgrade llama.cpp from b8854 to b8887#94

Open
bernardladenthin wants to merge 16 commits into master from claude/update-openvino-b8887-zijoE

Conversation

@bernardladenthin
Owner

Breaking API changes addressed in server.hpp:

  • common_chat_msg_diff_to_json_oaicompat removed from common/chat.h; define equivalent server_chat_msg_diff_to_json_oaicompat locally
  • params_base.reasoning_budget → params_base.sampling.reasoning_budget_tokens

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B

claude added 16 commits April 22, 2026 20:54
Breaking API changes addressed in server.hpp:
- common_chat_msg_diff_to_json_oaicompat removed from common/chat.h;
  define equivalent server_chat_msg_diff_to_json_oaicompat locally
- params_base.reasoning_budget → params_base.sampling.reasoning_budget_tokens

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
Instead of a local copy, compile tools/server/server-chat.cpp into jllama
and include tools/server/server-chat.h directly — the same pattern used
for tools/mtmd.

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
…nflicts

server-chat.h transitively pulls in server-common.h which redefines macros
and types already present in utils.hpp (SLT_* macros, json_value, etc.).

Instead of including the header, forward-declare server_chat_msg_diff_to_json_oaicompat
after the json using-declaration in server.hpp. The upstream server-chat.cpp
is still compiled directly into both jllama and jllama_test (with tools/server
in their include paths), so the linker resolves the symbol from upstream code.

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
Importing server-chat.cpp causes a linker error: convert_transcriptions_to_chatcmpl
(unused by jllama) calls get_media_marker() which is defined in server-common.cpp.
Compiling server-common.cpp would cascade further.

The local static definition in server.hpp is self-contained and avoids the
entire server-common.h dependency chain.

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
…/cpp

utils.hpp: add #include "server-common.h" which provides JSON_ASSERT,
json, raw_buffer, json_value<T>, server_grammar_trigger, server_tokens,
error_type, SRV_* macros, and 14 identical utility functions.  All
identical static definitions are deleted; the SLT_* macros are redefined
after the include to keep our simpler (slot).id_task access; the
tokenize_input_prompts / tokens_to_str overloads with different
signatures are retained.

server.hpp: remove now-redundant `using json` and `enum error_type`
(both provided by server-common.h); update process_chunk call to the
upstream signature (idx + pos + n_tokens_out) and find_chunk call to
size_t index.

CMakeLists.txt: compile server-common.cpp into both jllama and
jllama_test targets; add tools/server to their include paths.

net change: -668 lines (utils.hpp 1367 → 699)

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
utils.hpp: restore base64_chars/is_base64/base64_decode — they are
static-internal in server-common.cpp and not visible from outside that
translation unit, so the local copies are still required.

server.hpp: remove local static format_error_response — it is now
declared in server-common.h (non-static) and defined in
server-common.cpp; the local static definition created an ambiguous
overload. The implementations are identical.

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
Upstream oaicompat_completion_params_parse (b8887) removed the n != 1
check; the value is now forwarded as-is. Rename the test and flip the
expectation from EXPECT_THROW to EXPECT_NO_THROW.

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
…_oaicompat, format_response_rerank

Remove the local definitions of these three functions from utils.hpp now that
server-common.h/.cpp provide them. Update all call sites:
- tokenize_input_prompts: add nullptr mctx arg, change return to vector<server_tokens>,
  use .get_tokens() where llama_tokens is needed
- format_embeddings_response_oaicompat: add model_name param extracted from request
- format_response_rerank: add model_name param extracted from request

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
…f versions

server-common.h declares tokens_to_str(ctx, const llama_tokens&) and
tokens_to_str(vocab, const llama_tokens&); the two call sites in jllama.cpp
were the only users of the local template overloads (Iter begin, Iter end).
Update both call sites to pass the vector directly and remove the
now-redundant templates from utils.hpp.

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
….hpp

base64_chars/is_base64/base64_decode are static in server-common.cpp (internal
linkage) so cannot be called from other TUs even though they're in the same .so.
Mark the block with BEGIN/END fences and explain exactly why removal requires
upstream to change static to inline in a header.

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
…ver-chat.cpp

server-chat.cpp only depends on server-common.h symbols (json_value,
string_format, get_media_marker), all of which are already provided by the
server-common.cpp added to the build earlier. Adding server-chat.cpp to both
jllama and jllama_test targets unblocks the include of server-chat.h, so the
30-line local copy in server.hpp can be deleted.

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
server_chat_params is the superset struct from server-common.h; upstream
oaicompat_chat_params_parse (in server-common.cpp) is already linked into
the build. Adopt them directly:
- utils.hpp: drop the ~260-line local struct + function copy
- server.hpp: change field type, convert aggregate init to named-field
  assignments (tmpls now a shared_ptr, no .get())
- test_utils.cpp: rename struct, tighten EXPECT_THROW to std::exception
  (upstream throws std::invalid_argument for some validation errors that
  the local copy threw as std::runtime_error)

New behaviour inherited from upstream: media_marker + handle_media() for
image handling, reasoning_budget pass-through, chat_parser propagation,
content_parts support in prefill_assistant, grammar_type=tool_calls,
chat_template_kwargs from CLI merged with body.

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
common_chat_templates_ptr is a unique_ptr, so it cannot be copy-assigned.
Remove the separate chat_templates field and assign oai_parser_opt.tmpls
directly during load_model(), eliminating the broken copy in init().

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
Two LOG_INF call sites still used ctx_server->chat_templates which was
removed in the previous commit. Update them to ctx_server->oai_parser_opt.tmpls.

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
enable_thinking was always true because reasoning_budget_tokens defaults to
-1 and the != 0 check fires. Mirror the upstream server-context.cpp logic:
enable only when enable_reasoning != 0 AND the chat template actually
supports thinking. This prevents the "prefill incompatible with thinking"
error for models like CodeLlama that have no thinking support in their template.

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B
Replace manual JSON iteration with a call to upstream's parse_lora_request(json)
(from server-common.cpp) which returns a map<int,float>. The local overload still
owns the lora_base copy and ID bounds check; the JSON extraction is no longer duplicated.

https://claude.ai/code/session_01CzAdkHfyKwKGu7GzkoYK2B