This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Java bindings for llama.cpp, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
Current llama.cpp pinned version: **b8854**
Current CUDA version: **13.2**
To change the CUDA version, update the following three places:
- `.github/build_cuda_linux.sh` — line 10: `sudo dnf install -y cuda-toolkit-13-2`
- `.github/build_cuda_linux.sh` — line 12: `-DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.2/bin/nvcc`
- `pom.xml` — the `<classifier>` tag in the `cuda` jar execution: `cuda13-linux-x86-64`
Also update the header comment in build_cuda_linux.sh and the job name in .github/workflows/release.yaml for clarity.
Available CUDA versions for RHEL8/Manylinux_2_28 can be browsed at:
https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/
Note: Each CUDA version supports only certain GCC versions. If the dockcross container uses a newer GCC than CUDA supports, the build will fail with an "unsupported GNU version" error. Check NVIDIA's compatibility table before downgrading CUDA.
Example: To upgrade from 13.2 to a hypothetical 13.3:
```shell
# Edit .github/build_cuda_linux.sh:
#   line 10: cuda-toolkit-13-2 -> cuda-toolkit-13-3
#   line 12: /usr/local/cuda-13.2/bin/nvcc -> /usr/local/cuda-13.3/bin/nvcc
# Edit pom.xml classifier: cuda13-linux-x86-64 (major version only; no change needed for minor bumps)
# Edit CLAUDE.md: Current CUDA version: **13.2** -> **13.3**
git add .github/build_cuda_linux.sh pom.xml CLAUDE.md
git commit -m "Upgrade CUDA from 13.2 to 13.3"
```

To change the llama.cpp version, update the following three files:
- `CMakeLists.txt` — the `GIT_TAG` line for llama.cpp: `GIT_TAG b8831`
- `README.md` — the badge and link line with the version number
- `CLAUDE.md` — the "Current llama.cpp pinned version" line
Example: To upgrade from b8808 to b8831:
```shell
# Edit CMakeLists.txt: change GIT_TAG b8808 to b8831
# Edit README.md: change b8808 to b8831 (in both badge and link)
# Edit CLAUDE.md: change b8808 to b8831
git add CMakeLists.txt README.md CLAUDE.md
git commit -m "Upgrade llama.cpp from b8808 to b8831"
git push -u origin <your-branch>
```

Note: Always test the build with `cmake -B build && cmake --build build --config Release` after version changes to catch compatibility issues early.
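After a bump it is easy for one of the three pinned files to drift from the others. A minimal consistency check, sketched in shell (it assumes the first `bNNNN` occurrence in each file is the pin, which holds for the patterns above but is not guaranteed):

```shell
#!/usr/bin/env bash
# Sketch: verify CMakeLists.txt, README.md and CLAUDE.md agree on the llama.cpp tag.
# Assumption: the first bNNNN occurrence in each file is the pinned version.
extract_tag() {
  grep -oE 'b[0-9]+' "$1" | head -n1
}

check_pins() {
  cmake_tag=$(extract_tag CMakeLists.txt)
  readme_tag=$(extract_tag README.md)
  claude_tag=$(extract_tag CLAUDE.md)
  if [ "$cmake_tag" = "$readme_tag" ] && [ "$cmake_tag" = "$claude_tag" ]; then
    echo "pins consistent: $cmake_tag"
  else
    echo "pin mismatch: CMakeLists=$cmake_tag README=$readme_tag CLAUDE=$claude_tag" >&2
    return 1
  fi
}
```

Run `check_pins` from the repository root right before committing a version bump.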
Use the GitHub compare URL to diff any two llama.cpp builds:
`https://github.com/ggml-org/llama.cpp/compare/b<FROM>...b<TO>`
Example — what changed between b6721 and b6732:
https://github.com/ggml-org/llama.cpp/compare/b6721...b6732
The GitHub HTML page may time out for large ranges; fall back to the API:
`https://api.github.com/repos/ggml-org/llama.cpp/compare/b<FROM>...b<TO>`
For individual file content at a specific build:
`https://raw.githubusercontent.com/ggerganov/llama.cpp/b<VERSION>/common/chat.h`
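These URL patterns can be captured in small shell helpers (a sketch; the host names are taken verbatim from the patterns above, including the older `ggerganov` owner in the raw URL):

```shell
#!/usr/bin/env bash
# Sketch: build the compare / API / raw-file URLs from build tags.
compare_url()     { printf 'https://github.com/ggml-org/llama.cpp/compare/%s...%s\n' "$1" "$2"; }
compare_api_url() { printf 'https://api.github.com/repos/ggml-org/llama.cpp/compare/%s...%s\n' "$1" "$2"; }
raw_file_url()    { printf 'https://raw.githubusercontent.com/ggerganov/llama.cpp/%s/%s\n' "$1" "$2"; }

compare_url b6721 b6732
raw_file_url b8831 common/chat.h
```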
The three project C++ files (jllama.cpp, server.hpp, utils.hpp) pull in the following
llama.cpp headers. Any of these can introduce breaking changes on upgrade.
Include dependency graph:

```
jllama.cpp / server.hpp / utils.hpp
│
├── arg.h ──────────────────────────► common.h ─┐
├── common.h ──────────────────────────────────►├── ggml-opt.h ──► ggml.h
├── chat.h ─────────────► common.h, peg-parser.h └── ggml-backend.h ──► ggml-alloc.h
├── speculative.h ──────► llama.h, common.h
├── sampling.h ─────────► llama.h, common.h
├── download.h ─────────► (stdlib only, no deps)
├── log.h ──────────────► ggml.h
├── llama.h ────────────────────────────────────► ggml.h, ggml-cpu.h, ggml-backend.h, ggml-opt.h
│   └── llama-cpp.h ──► llama.h
├── json-schema-to-grammar.h
├── base64.hpp
├── mtmd.h
└── mtmd-helper.h
```
Priority-ordered review list for upgrade diffs (highest break risk first)
The top 8 rows cover all known API-level breaking changes from b5022 → b8831.
For future upgrades, provide diffs for at least these 8 files rather than the full patch.
Also review the project CMakeLists.txt for build-system-level breaks (e.g. renamed link targets, new required headers) — those are not visible in header file diffs alone.
| File | What to watch for |
|---|---|
| `common/common.h` | `common_params`/`common_params_speculative` struct fields, `model_alias` container type, `common_init_result` shape, `build_info` symbol (removed in b8831 — now `llama_build_info()` from build-info.h) |
| `common/chat.h` | `common_chat_parser_params` (was `common_chat_syntax`), `to_json_oaicompat`, `common_chat_msg_diff_to_json_oaicompat`, `set_tool_call_ids` |
| `common/speculative.h` | `common_speculative_init`, `common_speculative_draft`, `common_speculative_accept` signatures, struct names |
| `tools/mtmd/mtmd.h` | `mtmd_context_params` fields, `image_marker`/`media_marker` API, deprecated symbols (was common/mtmd.h before ~b8190) |
| `include/llama-cpp.h` | `common_init_result_ptr` type, access pattern changes (`.get()` vs `->method()`) |
| `common/arg.h` | `n_parallel` sentinel value, what moved to download.h across versions |
| `include/llama.h` | Core `llama_` function signatures, token types, `llama_model_ptr`, renamed structs |
| `common/download.h` | `common_remote_params` struct, `headers` field format (string vs key-value pair) |
| `common/common.cpp` | Implementation of any inline API used directly |
| `common/speculative.cpp` | Speculative decoding implementation details |
| `common/chat.cpp` | Chat parsing implementation |
| `common/sampling.h` | Sampler API, `common_sampler_*` functions |
| `common/log.h` | Log macro signatures |
| `tools/mtmd/mtmd-helper.h` | Multimodal helper functions |
| `common/json-schema-to-grammar.h` | Grammar API |
| `ggml/include/ggml.h` | `ggml_type` enum values (e.g. `GGML_TYPE_F16`), tensor primitives |
| `ggml/include/ggml-backend.h` | Backend/device abstraction types |
| `ggml/include/ggml-opt.h` | Optimizer params pulled in via common.h |
Safe to skip (have never caused a break; not used directly by project code):
common/sampling.h, common/log.h, tools/mtmd/mtmd-helper.h, common/json-schema-to-grammar.h,
ggml/include/ggml.h, ggml/include/ggml-backend.h, ggml/include/ggml-opt.h,
ggml-alloc.h, ggml-cpu.h, peg-parser.h, base64.hpp
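To fetch these priority diffs locally instead of through the web UI, a sketch like the following can generate the commands (it assumes `bash` and `curl`; it only prints the commands, so nothing touches the network until you pipe the output to `bash`):

```shell
#!/usr/bin/env bash
# Sketch: print commands that diff each high-risk header between two builds.
# Pipe the output to `bash` to actually fetch and diff (network required).
print_diff_cmds() {
  local from=$1 to=$2 f
  for f in common/common.h common/chat.h common/speculative.h tools/mtmd/mtmd.h \
           include/llama-cpp.h common/arg.h include/llama.h common/download.h; do
    printf 'diff -u <(curl -fsL %s) <(curl -fsL %s)  # %s\n' \
      "https://raw.githubusercontent.com/ggerganov/llama.cpp/$from/$f" \
      "https://raw.githubusercontent.com/ggerganov/llama.cpp/$to/$f" \
      "$f"
  done
}

print_diff_cmds b8808 b8831
```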
Known breaking changes by version range (b5022 → b8854):
| Version | File | Change |
|---|---|---|
| ~b7217–b7433 | `common/common.h`, `include/llama-cpp.h` | `common_init_result` became `common_init_result_ptr`; access changed to `->model()` / `->context()` / `->free_context()` |
| ~b7433 | `common/arg.h` | `n_parallel` default changed to sentinel `-1` (auto); the Java bindings must resolve it to `1` before model load |
| ~b7217–b7783 | `common/arg.h` → `common/download.h` | `common_remote_get_content` and `common_remote_params` split into new download.h; `headers` changed from `vector<string>` to `vector<pair>` |
| ~b7783 | `common/common.h` | `build_info` string moved into common.h; the local definition must be removed |
| ~b7783–b7858 | `common/chat.h` | `common_chat_syntax` renamed to `common_chat_parser_params`; `to_json_oaicompat<json>()` template removed (no template arg); `ensure_tool_call_ids_set()` → `set_tool_call_ids()` |
| ~b7858–b7864 | `common/speculative.h` | Full redesign: `common_speculative_init(ctx_tgt, ctx_dft)` → `common_speculative_init(params_speculative, ctx)`; `common_speculative_gen_draft` → `common_speculative_draft`; new `common_speculative_accept()`; `common_speculative_params` struct replaced by `common_params_speculative`; draft model loaded via `llama_model_load_from_file` into `llama_model_ptr` |
| ~b7858–b7864 | `common/common.h` | `params_speculative`: `.model.path`/`.hf_repo` replaced by `.has_dft()`/`.mparams_dft`; new `.model_dft` and `.cparams_dft` fields; `speculative.type` enum added (`COMMON_SPECULATIVE_TYPE_NONE`) |
| ~b7858–b7864 | `server.hpp` (internal) | `slot_action.slot_id` → `slot_action.id_slot`; `llama_init_dft` removed from `server_context`; `model_dft` changed from `llama_model*` to `llama_model_ptr`; `slot.ctx_tgt`/`ctx_dft` removed |
| ~b7864 | `common/mtmd.h` | `mtmd_init_params.verbosity` field removed |
| ~b7904–b8190 | `common/common.h` | `params_base.model_alias` changed from `std::string` to a container; use `*model_alias.begin()` instead of a direct string cast |
| ~b8778–b8808 | `tools/mtmd/mtmd.h` | `MTMD_DEFAULT_IMAGE_MARKER` macro removed; `mtmd_image_tokens_get_nx`/`ny` deprecated; new `mtmd_decoder_pos` struct + `mtmd_image_tokens_get_decoder_pos()`; `mtmd_context_params_default()` now sets `image_marker = nullptr` (throws "custom image_marker is not supported anymore" if non-null); the upstream server adds a randomized `get_media_marker()` in server-common.h — our server.hpp is unaffected since it does not include that header and uses `mtmd_default_marker()` consistently |
| ~b8808–b8831 | project CMakeLists.txt | CMake target `common` renamed to `llama-common`; update `target_link_libraries` for `jllama` and `jllama_test` |
| ~b8808–b8831 | `common/common.h` → new `common/build-info.h` | `build_info` `std::string` removed; replaced by `llama_build_info()` (`const char*`) in new build-info.h; add `#include "build-info.h"` in server.hpp and utils.hpp; call sites: `std::string(llama_build_info())` in server.hpp (6×), `llama_build_info()` in jllama.cpp (1×) and utils.hpp (1×) |
| ~b8808–b8831 | `ggml/src/ggml.c` | New `ggml_graph_next_uid()` calls `_InterlockedIncrement64` via `<intrin.h>` on x86; the intrinsic is unavailable on 32-bit MSVC; fix: src/main/cpp/compat/ggml_x86_compat.c provides a `__cdecl _InterlockedIncrement64` via `InterlockedIncrement64` (CMPXCHG8B), added to `ggml-base` via `target_sources` guarded by `MSVC AND CMAKE_SIZEOF_VOID_P EQUAL 4` |
| ~b8838–b8841 | `src/llama-model.h` | Attention bias fields renamed: `bq`→`wq_b`, `bk`→`wk_b`, `bv`→`wv_b`, `bo`→`wo_b`, `bqkv`→`wqkv_b`; internal to llama.cpp, no impact on this project |
| ~b8841–b8854 | `common/common.h` | `common_params::clear_idle` renamed to `cache_idle_slots`; new `common_context_seq_rm_type` enum + `common_context_can_seq_rm()` replacing `common_speculative_is_compat()`; `get_model_endpoint()` → `common_get_model_endpoint()` |
| ~b8841–b8854 | `tools/mtmd/mtmd.h` + mtmd-helper.h | `mtmd_decoder_pos` gains a `z` field; `mtmd_image_tokens_get_decoder_pos()` + `mtmd_helper_image_get_decoder_pos()` gain a new `pos_0` parameter |
| ~b8841–b8854 | project utils.hpp / server.hpp | `server_tokens::get_text_tokens()` split: `get_tokens()` returns the raw `const llama_tokens &`; new `get_text_tokens()` returns a filtered copy (removes `LLAMA_TOKEN_NULL` mtmd placeholders); save/load and context-shift call sites updated to `get_tokens()` |
```shell
mvn compile                                  # Compile Java and generate JNI headers
mvn test                                     # Run all tests (requires native library and model files)
mvn package                                  # Build JAR
mvn test -Dtest=LlamaModelTest#testGenerate  # Run a single test method
```

You must run `mvn compile` first to generate the JNI headers, then:
```shell
# CPU only
cmake -B build
cmake --build build --config Release

# CUDA (Linux)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Metal (macOS)
cmake -B build -DLLAMA_METAL=ON
cmake --build build --config Release

# Optional: enable model downloading via URL
cmake -B build -DLLAMA_CURL=ON
```

Built libraries are placed in `src/main/resources/de/kherud/llama/{OS}/{ARCH}/`.
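A quick post-build sanity check, sketched in shell. The `Linux`/`x86_64` folder names and the `libjllama` file name are assumptions for this example; the real names come from `OSInfo` and the platform:

```shell
#!/usr/bin/env bash
# Sketch: confirm a freshly built native library landed in the resources tree.
# "Linux", "x86_64" and the libjllama name are assumptions for this example.
expected_lib_dir() {
  printf 'src/main/resources/de/kherud/llama/%s/%s\n' "$1" "$2"
}

check_lib() {
  dir=$(expected_lib_dir "$1" "$2")
  ls "$dir"/*jllama* >/dev/null 2>&1 \
    && echo "native library present in $dir" \
    || { echo "no native library under $dir" >&2; return 1; }
}
```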
```shell
clang-format -i src/main/cpp/*.cpp src/main/cpp/*.hpp  # Format C++ code
```

Java layer (`src/main/java/de/kherud/llama/`):
- `LlamaModel` — Main API class (`AutoCloseable`). Wraps the native context for inference, embeddings, re-ranking, and tokenization.
- `ModelParameters` / `InferenceParameters` — Builder-pattern parameter classes that serialize to JSON (extend `JsonParameters`) for passing to native code.
- `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
- `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
- `OSInfo` — Detects OS and architecture for library resolution.
Native layer (`src/main/cpp/`):
- `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp.
- `server.hpp` — Inference server logic (adapted from llama.cpp's server).
- `utils.hpp` — Helper utilities.
- `json_helpers.hpp` — Pure JSON transformation helpers (no JNI, no llama state). Independently unit-testable.
- `jni_helpers.hpp` — JNI bridge helpers (handle management + server orchestration). Includes `json_helpers.hpp`.
- Uses `nlohmann/json` for JSON deserialization of parameters.
The project C++ helpers follow a strict semantic split:
`json_helpers.hpp` — Pure data transforms.
- Input: `nlohmann::json`, `server_task_result_ptr`, plain C++ types.
- Output: `json`, `std::vector`, `std::optional`, plain C++ types.
- Zero JNI calls (`JNIEnv*` never appears).
- Zero llama state (`llama_context*`, `llama_vocab*`, `server_context*` never appear).
- Functions are named without the `_impl` suffix — they are the canonical implementation.
- Testable with JSON literals and fake result objects; no JVM and no loaded model required.
- Requires `server.hpp` to be included by the translation unit first (TU convention — `server.hpp` has no include guard).

Functions: `get_result_error_message`, `results_to_json`, `rerank_results_to_json`, `build_embeddings_response_json`, `extract_first_embedding_row`, `parse_encoding_format`, `extract_embedding_prompt`, `is_infill_request`, `parse_slot_prompt_similarity`, `parse_positive_int_config`.
`jni_helpers.hpp` — JNI bridge helpers, split into two layers:
Layer A (no server.hpp required): handle management.
- `jllama_context` struct — owns `server_context*` and the background worker thread.
- `get_server_context_impl` — reads the Java `ctx` handle, throws on null.
- `get_jllama_context_impl` — like the above but returns the wrapper (delete path only).
- `require_single_task_id_impl` — validates that exactly one task ID was created.
- `require_json_field_impl` — throws `"<field> is required"` if the key is absent.
- `jint_array_to_tokens_impl` — reads a Java `int[]` into `std::vector<int32_t>`.
Layer B (requires server.hpp in the TU before jni_helpers.hpp): server orchestration. Includes `json_helpers.hpp` so all bridge helpers can call transforms directly.
- `json_to_jstring_impl` — serialises any `json` value to a JNI string.
- `build_completion_tasks_impl` — tokenises the prompt and populates a `server_task` vector.
- `recv_slot_task_result_impl` — receives one slot result, throws on error.
- `collect_task_results_impl` — receives all results for a task-id set, throws on error.
- `results_to_jstring_impl` — delegates to `results_to_json` then `json_to_jstring_impl`.
- `check_infill_support_impl` — validates that FIM prefix/suffix/middle tokens are present.
- `append_task` — constructs and appends a `server_task` of a given type.
- `embedding_to_jfloat_array_impl` — converts `std::vector<float>` to a Java `jfloatArray`; throws OOM on allocation failure.
- `tokens_to_jint_array_impl` — converts `std::vector<int32_t>` to a Java `jintArray`; throws OOM on allocation failure.
Functions with the `_impl` suffix have a thin module-level wrapper in `jllama.cpp`; functions without the suffix (in `json_helpers.hpp`) are called directly.
Include order rule:

```cpp
// In jllama.cpp and any TU that uses Layer B helpers:
#include "server.hpp"      // must come first — no include guard
#include "jni_helpers.hpp" // includes json_helpers.hpp internally
```
Adding a new pure transform (e.g. a new JSON field parser):
- Add it to `json_helpers.hpp`. No JNI, no llama types.
- Add tests to `src/test/cpp/test_json_helpers.cpp`.
Adding a new JNI bridge helper:
- Add it to `jni_helpers.hpp` in the appropriate layer.
- If it needs `server.hpp` types, put it in Layer B (after the `json_helpers.hpp` include).
- Add tests to `src/test/cpp/test_jni_helpers.cpp`.
Java parameters are serialized to JSON strings and passed to native code, which deserializes them using nlohmann/json. This avoids complex JNI field mapping for the many llama.cpp parameters.
`LlamaLoader` tries, in order:
- System property `de.kherud.llama.lib.path`
- `java.library.path`
- Extraction from JAR resources at `de/kherud/llama/{os}/{arch}/`
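The lookup order can be illustrated with a shell sketch. This only mirrors the documented order; it is not the actual Java implementation, and all directory arguments are placeholders:

```shell
#!/usr/bin/env bash
# Sketch of LlamaLoader's lookup order (illustration only, not the Java code):
#   1. explicit path property  2. java.library.path entries  3. bundled resources
resolve_native_lib() {
  lib_path_prop=$1      # value of -Dde.kherud.llama.lib.path, may be empty
  java_library_path=$2  # colon-separated, like java.library.path
  resources_dir=$3      # extracted de/kherud/llama/{os}/{arch}/ directory
  name=$4               # library file name, e.g. libjllama.so (assumed)
  old_ifs=$IFS; IFS=':' # split the path list on colons
  for d in $lib_path_prop $java_library_path $resources_dir; do
    if [ -n "$d" ] && [ -f "$d/$name" ]; then
      IFS=$old_ifs
      echo "$d/$name"
      return 0
    fi
  done
  IFS=$old_ifs
  echo "native library $name not found" >&2
  return 1
}
```

The first existing file wins, so an explicit property always beats the bundled copy.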
Docker-based cross-compilation scripts are in .github/dockcross/ for ARM/Android targets. CI workflows use these for non-x86 Linux builds.
The Java tests require a model file. The CI downloads models from HuggingFace:
- LlamaModel tests: CodeLlama-7B-GGUF (`codellama-7b.Q2_K.gguf`)
- RerankingModel tests: Jina-Reranker model
Set the model path via system property or environment variable (see test files for exact property names).
Test files are in `src/test/java/de/kherud/llama/` and `src/test/java/examples/`.
No JVM or model file required. Built as `jllama_test` via CMake when `BUILD_TESTING=ON`.
| File | What it tests |
|---|---|
| `test_json_helpers.cpp` | All functions in json_helpers.hpp — pure JSON transforms, using fake result objects |
| `test_jni_helpers.cpp` | All functions in jni_helpers.hpp — mock JNIEnv, pre-seeded server_response queue |
| `test_server.cpp` | Selected server.hpp internals (result types, error formatting, routing helpers) |
| `test_utils.cpp` | Utilities from utils.hpp |
Run C++ tests:

```shell
cmake -B build -DBUILD_TESTING=ON
cmake --build build --config Release
ctest --test-dir build --output-on-failure
```

- Java 11+ required.
- Native memory allocated by llama.cpp is not GC-managed — always use `LlamaModel` in try-with-resources or call `close()` explicitly.
- The `server.hpp` file is adapted from llama.cpp upstream — minimize modifications to ease future upgrades.
- Platform-specific native libraries must be pre-built and placed under `src/main/resources/` before packaging for distribution.