Hi, I tried to get fully deterministic CPU inference, using --temp 0 with --threads 1.
It gave deterministic output with small text prompts, but was nondeterministic when passing --image to llama-cli.
I found a workaround: setting OMP_NUM_THREADS=1.
That was weird: why wouldn't --threads 1 already achieve that, given that llama-cpp propagates it into an openblas_set_num_threads(ctx->n_threads) call?
I dug into OpenBLAS and found a bug: when OpenBLAS is built with USE_OPENMP, openblas_set_num_threads() is ineffective, and OpenMP's default number of threads is used anyway.
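I reported that as an OpenBLAS issue: OpenMathLib/OpenBLAS#5806 (openblas_set_num_threads() is silently overridden when built with USE_OPENMP)
I also made an OpenBLAS PR to fix it: OpenMathLib/OpenBLAS#5808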
With those, llama-cpp should hopefully be deterministic when run on 1 thread on the CPU.
So this ticket mainly tracks whether those get merged, and ideally it should be closed once they are. I still have some open questions regarding CPU determinism, though:
- Is this enough? How deterministic does llama.cpp expect to be on CPUs?
- There is code that seems to invoke BLAS for prompts without --image as well, if the matrices are large enough. So text-only prompts may have been nondeterministic too?
- Can we have a multi-threaded implementation that is fully deterministic (e.g. one that does parallel maps with deterministic reductions), so that deterministic runs aren't so slow?
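- It would be great to have a check that --temp 0 --threads 1 is deterministic, so that the issue I found would have been caught.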
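
On the question of a deterministic multi-threaded implementation, here is a minimal Python sketch of the general idea (a parallel map followed by a reduction in a fixed order). It is not llama.cpp or OpenBLAS code, and all function names are made up for illustration. It assumes that the nondeterminism comes from floating-point addition not being associative, so that a reduction whose grouping depends on the thread count can change the result. The sketch uses an explicit loop rather than the built-in sum(), because recent Python versions use compensated summation for floats, which would hide the effect.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential_sum(values):
    """Plain left-to-right float accumulation."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def deterministic_parallel_sum(values, n_workers, chunk_size=2):
    """Chunk boundaries depend only on chunk_size, never on n_workers."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map returns results in input order, whichever thread finishes first.
        partials = list(pool.map(sequential_sum, chunks))
    # Combine the partial sums in a fixed (index) order.
    return sequential_sum(partials)


def thread_dependent_sum(values, n_workers):
    """Counter-example: the chunk size, and so the grouping, follows n_workers."""
    chunk_size = -(-len(values) // n_workers)  # ceiling division
    return deterministic_parallel_sum(values, n_workers, chunk_size)


values = [1e16, 1.0, -1e16, 1.0]

# Same bits for every worker count, because the grouping never changes.
fixed = [deterministic_parallel_sum(values, n) for n in (1, 2, 8)]

# The grouping changes with the worker count, and so does the result.
varying = [thread_dependent_sum(values, n) for n in (1, 2)]

print(fixed, varying)
```

With these inputs the fixed-chunk version returns 0.0 for every worker count, while the thread-dependent version returns 1.0 with one worker and 0.0 with two. Note that "deterministic" here only means reproducible: the fixed-chunk result still differs from the purely sequential sum (1.0).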
Environment:
- NixOS Linux 25.11
- llama-cpp 8983
Invocation example:
llama-cli \
  --single-turn --no-display-prompt --log-verbosity 0 \
  --jinja --temp 0 --threads 1 --n-gpu-layers 0 \
  --model ./gemma-4-E2B-it-Q4_0.gguf \
  --mmproj ./mmproj-gemma-4-E2B-it-F16.gguf \
  --image myimage.png \
  -p 'Describe the image'
Pinned model URLs: gemma-4-E2B-it-Q4_0.gguf, mmproj-gemma-4-E2B-it-F16.gguf
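
For anyone who wants to reproduce this, the following is a rough sketch of how the invocation above could be wrapped in a determinism check: run the same command several times, hash stdout, and compare the digests. The helper names (output_digests, is_deterministic) and the LLAMA_CLI_CMD constant are made up for this sketch; the flags and file names are copied from the invocation above, and the paths are assumed to exist locally. The extra_env parameter makes it possible to compare runs with and without the OMP_NUM_THREADS=1 workaround.

```python
import hashlib
import os
import subprocess
import sys

# Command line copied from the invocation example above (paths are assumptions).
LLAMA_CLI_CMD = [
    "llama-cli",
    "--single-turn", "--no-display-prompt", "--log-verbosity", "0",
    "--jinja", "--temp", "0", "--threads", "1", "--n-gpu-layers", "0",
    "--model", "./gemma-4-E2B-it-Q4_0.gguf",
    "--mmproj", "./mmproj-gemma-4-E2B-it-F16.gguf",
    "--image", "myimage.png",
    "-p", "Describe the image",
]


def output_digests(cmd, runs=3, extra_env=None):
    """Run cmd `runs` times and return the SHA-256 digest of each run's stdout."""
    env = {**os.environ, **(extra_env or {})}
    digests = []
    for _ in range(runs):
        result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
        digests.append(hashlib.sha256(result.stdout).hexdigest())
    return digests


def is_deterministic(cmd, runs=3, extra_env=None):
    """True if every run produced byte-identical stdout."""
    return len(set(output_digests(cmd, runs, extra_env))) == 1


# Self-check with a trivially deterministic command (the Python interpreter itself).
print(is_deterministic([sys.executable, "-c", "print('hello')"]))

# Intended use (not executed here, since it needs llama-cli and the model files):
#   is_deterministic(LLAMA_CLI_CMD)
#   is_deterministic(LLAMA_CLI_CMD, extra_env={"OMP_NUM_THREADS": "1"})
```

A check along these lines only shows run-to-run reproducibility on one machine and one build; it says nothing about results matching across different CPUs or BLAS versions.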