Nondeterministic CPU inference with `--image` despite `--threads 1` due to OpenMP in OpenBLAS

Hi, I tried to get fully deterministic CPU inference, using `--temp 0` with `--threads 1`.

It gave deterministic output with small text prompts, but was nondeterministic when passing `--image` to `llama-cli`.

I found a workaround: setting `OMP_NUM_THREADS=1`.

That was weird: why wouldn't `--threads 1` already achieve that, given that llama.cpp propagates it into an [`openblas_set_num_threads(ctx->n_threads)`](https://github.com/ggml-org/llama.cpp/blob/80afa33aadcc4f71212b17e5e52904491c76b63e/ggml/src/ggml-blas/ggml-blas.cpp#L118-L126) call?

I dug into OpenBLAS and found a bug: `openblas_set_num_threads()` is ineffective, so OpenBLAS uses OpenMP's default number of threads anyway.

I reported that as an OpenBLAS issue:

* https://github.com/OpenMathLib/OpenBLAS/issues/5806

I also made an OpenBLAS PR to fix it:

* https://github.com/OpenMathLib/OpenBLAS/pull/5808

With those, `llama-cpp` should hopefully be deterministic when run on 1 thread on the CPU.

So this ticket mainly tracks whether those get merged, and ideally it should be closed once they are. I still have some open questions about CPU determinism, though:

* Is this enough? How deterministic does llama.cpp expect CPU inference to be?
* There is [code that seems to invoke](https://github.com/ggml-org/llama.cpp/blob/1ec7ba0c14f33f17e980daeeda5f35b225d41994/ggml/src/ggml-blas/ggml-blas.cpp#L406) BLAS for non-`--image` runs too, if the matrices are large enough. So text-only prompts may have been nondeterministic as well?
* Can we have a multi-threaded implementation that is fully deterministic (e.g. one that does parallel maps with deterministic reductions), so that deterministic runs aren't so slow?
* It would be great to have:
  * [ ] Some docs that describe what's already deterministic (CPU?) and what isn't (GPU?).
  * [ ] Some tests that check whether `--temp 0 --threads 1` is deterministic, so that this issue I found would have been caught.
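To make the test idea concrete, here is a minimal sketch of what such a check could look like. It is not an existing llama.cpp test; `check_deterministic` is a hypothetical helper name. It runs a command several times and compares checksums of stdout:

```shell
#!/bin/sh
# Hypothetical helper (not part of llama.cpp): check_deterministic N CMD [ARGS...]
# Runs CMD N times and succeeds only if every run produced byte-identical stdout.
check_deterministic() {
  runs=$1
  shift
  # Checksum of the first run is the reference for all later runs.
  first=$("$@" | cksum)
  i=1
  while [ "$i" -lt "$runs" ]; do
    current=$("$@" | cksum)
    if [ "$current" != "$first" ]; then
      echo "nondeterministic: run $((i + 1)) differs from run 1" >&2
      return 1
    fi
    i=$((i + 1))
  done
  echo "deterministic across $runs runs"
}
```

It could be called as `check_deterministic 5 llama-cli ...` with the flags from the invocation example in the environment section. A passing run only shows that no difference was observed in that many runs; a failing run, however, proves nondeterminism.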

---

Environment:

* NixOS Linux 25.11
* `llama-cpp` `8983`

Invocation example:

```sh
llama-cli \
  --single-turn --no-display-prompt --log-verbosity 0 \
  --jinja --temp 0 --threads 1 --n-gpu-layers 0 \
  --model ./gemma-4-E2B-it-Q4_0.gguf \
  --mmproj ./mmproj-gemma-4-E2B-it-F16.gguf \
  --image myimage.png \
  -p 'Describe the image'
```

Pinned model URLs: [`gemma-4-E2B-it-Q4_0.gguf`](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/90f9618340396838ee7ff5b0ba2da27da62953d3/gemma-4-E2B-it-Q4_0.gguf), [`mmproj-gemma-4-E2B-it-F16.gguf`](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/90f9618340396838ee7ff5b0ba2da27da62953d3/mmproj-F16.gguf)
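For reference, the `OMP_NUM_THREADS=1` workaround mentioned above can be applied per command rather than exported for the whole shell session. A small sketch, where `run_single_threaded` is a hypothetical wrapper name:

```shell
#!/bin/sh
# Hypothetical wrapper: run any command with OpenMP limited to one thread.
# The variable is set only for the child process, not for the calling shell.
run_single_threaded() {
  env OMP_NUM_THREADS=1 "$@"
}
```

With this, the invocation above becomes `run_single_threaded llama-cli ...` with the same flags.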
