
Vulkan pipeline creation in mmproj path triggers Mesa RADV heap corruption (Navi 21, Mesa 25.0.7) #22128


Summary

Under sustained load (hundreds of requests) with llama-server running Qwen2.5-VL 7B + mmproj on a Vulkan (AMD) backend, the prompt cache introduced in PR #16391 corrupts the heap. Corruption eventually surfaces as a SIGSEGV in __libc_free (reading a clearly-invalid pointer), with a deterministic crash signature identical across independent runs. Passing --cache-ram 0 (disabling the cache) resolves the crash; no other combination of --no-cache-prompt, --parallel 1, context-size or batch-size changes does.

Environment

llama.cpp SHA 9e5647aff (build b8840)
Build flags -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release, gcc 12.2.0
Model Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf + mmproj-Qwen2.5-VL-7B-bf16.gguf (official ggml-org/Qwen2.5-VL-7B-Instruct-GGUF)
GPU AMD Radeon RX 6900 XT (Navi 21, gfx1030)
Userspace GPU stack Mesa 25.0.7 (bookworm-backports), Vulkan via RADV
Kernel 6.8.12-20-pve (Proxmox VE 8.4.18)
glibc 2.36-9+deb12u13 (Debian 12 Bookworm, in privileged LXC)
CPU 2× Intel Xeon E5-2690 v1 (Sandy Bridge-EP, AVX1 only, no AVX2)

Observed behavior

Two distinct failure modes were observed on the same stack. Both share the same underlying trigger; which one surfaces depends on how many requests have landed:

Mode A — userspace SEGV (most common)

Appears after ~300–500 consecutive requests. Process dies, systemd auto-restarts.

llama-server[PID]: segfault at <rand>0000003c ip <rand><libc_base+0x98efa> error 4 in libc.so.6[...]
Code: ... 48 85 ff 0f 84 bf 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d e6 9e 13 00 <48> 8b 47 f8 64 8b 2b a8 02 75 5b ...
  • Faulting function: __libc_free (offset +0x1a), confirmed by byte-pattern match + nm -D lookup at Debian 12 glibc 2.36
  • Faulting instruction: mov rax, [rdi-0x8] — glibc's first dereference of the chunk header (mchunk_size)
  • Fault address pattern: 0x<random_high>_0000003c — i.e. rdi = 0x<random>_00000044. Low 32 bits = 0x44 (decimal 68), a real runtime value being treated as a pointer. Strongly suggests either a 32→64-bit cast without sign extension or a partial overwrite of a pointer's low half.
  • 9 independent crashes with identical pattern across distinct runs and PIDs; crash offset inside libc is byte-identical every time.
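The arithmetic behind this inference can be sketched as follows. This is an illustrative model, not llama.cpp code: the original pointer value and the interpretation of 68 as a length/count are assumptions.

```python
# Illustrative sketch of how a 4-byte partial overwrite of a 64-bit heap
# pointer reproduces the observed fault-address pattern.
FAULT_LOW = 0x3C          # low 32 bits of the faulting address from dmesg
CHUNK_HDR_OFFSET = 8      # __libc_free reads mchunk_size at [rdi - 0x8]

# rdi (the pointer passed to free) = fault address + 8
rdi_low = FAULT_LOW + CHUNK_HDR_OFFSET
assert rdi_low == 0x44    # decimal 68: a plausible runtime value, not a pointer

def corrupt_low_half(ptr: int, value: int) -> int:
    """Overwrite only the low 32 bits of a 64-bit pointer, keeping the high half."""
    return (ptr & 0xFFFFFFFF_00000000) | (value & 0xFFFFFFFF)

original = 0x00007F3A_1C2D4010            # hypothetical valid heap pointer
clobbered = corrupt_low_half(original, 68)
assert clobbered & 0xFFFFFFFF == 0x44     # matches the 0x<random_high>_00000044 pattern
# free(clobbered) then dereferences clobbered - 8 == 0x..._0000003C -> SIGSEGV
```

A 32-to-64-bit widening without sign extension would yield the same low half but zeroed high bits; the randomized high bits observed here are what point toward a partial overwrite rather than a bad cast.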

Mode B — hard host crash (rarer, but has happened)

Under heavy sustained load against the same binary/config, the host went fully unreachable (ping + SSH dead). pstore ERST triggered a kernel emergency write but we lost the oops itself (ERST buffer overwrote parts 1 & 2; only later boot messages survived). Machine required a hard reboot to recover.

We don't yet have a kernel trace for this, so this issue focuses on the userspace bug only. We will file the amdgpu / kernel angle separately once we can capture a proper kdump in a controlled environment.

Minimal reproducer

Config that crashes:

llama-server \
  -m Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf \
  --mmproj mmproj-Qwen2.5-VL-7B-bf16.gguf \
  -ngl 99 -c 32768 -ctk q8_0 -ctv q8_0 -fit off \
  --batch-size 2048 --ubatch-size 2048 \
  --parallel 1 --threads 8 --no-cache-prompt \
  --host 0.0.0.0 --port 8080

Note: --no-cache-prompt does not disable the RAM prompt cache. See companion issue #22127 about the stale log message.

Load: ~600 text-only classification requests cycling through real bitsavers vintage datasheets (heavy UTF-8 multi-byte: Ω ± µ × ≤ ≥ ° Δ η θ). Without the workaround, the crash lands deterministically between requests 300–500. A synthetic ASCII workload does not reproduce it on this hardware; real OCR'd datasheet text is needed to trigger it.

Full reproducer script + load corpus available on request.
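The shape of the load is roughly the following. A minimal sketch, not the actual script: the corpus path, prompt wording, and truncation limit are placeholders; the endpoint is llama-server's OpenAI-compatible chat API.

```python
# Minimal load-generator sketch approximating the failing workload.
import itertools
import json
import pathlib
import urllib.request

SERVER = "http://127.0.0.1:8080/v1/chat/completions"  # llama-server OpenAI-compatible endpoint

def build_body(text: str) -> bytes:
    """One text-only classify request. ensure_ascii=False sends raw UTF-8
    multi-byte sequences, matching the failing workload."""
    return json.dumps({
        "messages": [{"role": "user",
                      "content": "Classify this datasheet excerpt:\n" + text[:4000]}],
        "max_tokens": 32,
    }, ensure_ascii=False).encode("utf-8")

def classify(text: str) -> str:
    req = urllib.request.Request(SERVER, data=build_body(text),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Cycle ~600 sequential requests through UTF-8-heavy OCR'd text;
    # the crash landed around request 300-500.
    corpus = [p.read_text(encoding="utf-8")
              for p in pathlib.Path("corpus").glob("*.txt")]
    for text in itertools.islice(itertools.cycle(corpus), 600):
        classify(text)
```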

Ruled-out hypotheses (with evidence)

  1. std::regex stack overflow in the tokenizer (the #17636 / #21919 class of crashes): ruled out because the GGUF has tokenizer.ggml.pre = "qwen2", which is handled by the custom unicode_regex_split_custom_qwen2 (PR #21257, commit 0d049d6), not std::regex.
  2. Concurrency / race in slots: ruled out. Runs with --parallel 1 still crash.
  3. Prompt cache feature flag: ruled out. Runs with --no-cache-prompt still crash.
  4. --mmproj text re-tokenization (mtmd_tokenize path): ruled out. For text-only chat requests the code does not enter mtmd_tokenize; and a synthetic mixed ASCII+UTF-8 workload with --mmproj + --parallel 1 + --no-cache-prompt stays stable (AddressSanitizer build, 1170+ requests, 0 corruption).
  5. CPU IFUNC dispatching AVX2 string ops on a non-AVX2 CPU: ruled out. The faulting offset is inside __libc_free (heap management), not a string function. IFUNC isn't in play here.

Root cause direction (suspected, not confirmed)

The crash disappears cleanly when the new prompt cache (introduced in #16391) is disabled via --cache-ram 0:

  • With --cache-ram 8192 (default): process RSS reaches ~11.9 GB under load, crash within ~300–500 requests.
  • With --cache-ram 0: process RSS stays at ~1 GB, 600/600 requests clean, no crashes, no SEGVs in the kernel log.

Corruption is therefore almost certainly in the LRU/eviction path of the prompt cache (srv prompt_save / srv update / srv load as logged), under the specific pressure of mmproj-loaded sessions receiving a diverse UTF-8-heavy corpus. We have not yet instrumented the cache internals to point at an exact line, but AddressSanitizer is available on our end if maintainers want us to run a guided repro.

Workaround (confirmed)

Add --cache-ram 0 to llama-server invocation. This disables the PR #16391 prompt cache entirely. Verified stable over 600 consecutive real-workload requests.

Related issues

  • #22127: companion issue about the stale --no-cache-prompt log message
  • #16391: PR that introduced the RAM prompt cache (the suspected culprit)

What we can contribute

We have:

  • A reproducer corpus (200 real bitsavers PDFs, bucketed by UTF-8 codepoint diversity)
  • An AddressSanitizer-built llama-server ready to run on the affected hardware
  • Capacity to iterate test cases on request (within reasonable windows on a non-production test box)
  • Willingness to help bisect or validate proposed fixes

Happy to guide a maintainer through repro or capture anything specific (coredump with MALLOC_CHECK_=3, ASan log, perf trace on the server — ask).

Environment extra (for completeness)

  • GLIBC_TUNABLES=glibc.malloc.check=3:glibc.malloc.perturb=0x42 was active during post-fix test runs as a shield; it did not produce any malloc_check aborts once --cache-ram 0 was set.
  • amdgpu.runpm=0 amdgpu.aspm=0 pcie_aspm=off amdgpu.gartsize=8192 on kernel cmdline (legacy Polaris params, kept for safety; Navi 21 doesn't strictly need them).
  • LXC is privileged, with GPU passthrough via cgroup2 + /dev/dri/card* + /dev/dri/renderD* bind mounts. Same binary + same model + same reproducer produces crash or stability purely as a function of the --cache-ram setting — container vs. bare-metal was not a factor in isolating this bug.

Reported by Claude and Richard Murray — colossus-ia.org
Investigation done on Colossus-1 (Dual Xeon E5-2690 v1 + RX 6900 XT, Debian 12, Proxmox VE 8.4)
Diagnostic collaboration: Claude clone-colossus (project lead), Claude clone-retrodoc (pipeline client), Claude clone-philippe (supervisor)


Suggested labels for triage: bug, area server, area mtmd.
