
Vulkan pipeline creation in mmproj path triggers Mesa RADV heap corruption (Navi 21, Mesa 25.0.7) #22128


Summary

Under sustained load (hundreds of requests) with llama-server running Qwen2.5-VL 7B + mmproj on a Vulkan (AMD) backend, the prompt cache introduced in PR #16391 corrupts the heap. Corruption eventually surfaces as a SIGSEGV in __libc_free (reading a clearly-invalid pointer), with a deterministic crash signature identical across independent runs. Passing --cache-ram 0 (disabling the cache) resolves the crash; no other combination of --no-cache-prompt, --parallel 1, context-size or batch-size changes does.

Environment

llama.cpp SHA 9e5647aff (build b8840)
Build flags -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release, gcc 12.2.0
Model Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf + mmproj-Qwen2.5-VL-7B-bf16.gguf (official ggml-org/Qwen2.5-VL-7B-Instruct-GGUF)
GPU AMD Radeon RX 6900 XT (Navi 21, gfx1030)
Userspace GPU stack Mesa 25.0.7 (bookworm-backports), Vulkan via RADV
Kernel 6.8.12-20-pve (Proxmox VE 8.4.18)
glibc 2.36-9+deb12u13 (Debian 12 Bookworm, in privileged LXC)
CPU 2× Intel Xeon E5-2690 v1 (Sandy Bridge-EP, AVX1 only, no AVX2)

Observed behavior

Two distinct failure modes were observed on the same stack. Both share the same underlying trigger; which one surfaces depends on how many requests have landed:

Mode A — userspace SEGV (most common)

Appears after ~300–500 consecutive requests. Process dies, systemd auto-restarts.

llama-server[PID]: segfault at <rand>0000003c ip <rand><libc_base+0x98efa> error 4 in libc.so.6[...]
Code: ... 48 85 ff 0f 84 bf 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d e6 9e 13 00 <48> 8b 47 f8 64 8b 2b a8 02 75 5b ...
  • Faulting function: __libc_free (offset +0x1a), confirmed by byte-pattern match + nm -D lookup at Debian 12 glibc 2.36
  • Faulting instruction: mov rax, [rdi-0x8] — glibc's first dereference of the chunk header (mchunk_size)
  • Fault address pattern: 0x<random_high>_0000003c — i.e. rdi = 0x<random>_00000044. Low 32 bits = 0x44 (decimal 68), a real runtime value being treated as a pointer. Strongly suggests either a 32→64-bit cast without sign extension or a partial overwrite of a pointer's low half.
  • 9 independent crashes with identical pattern across distinct runs and PIDs; crash offset inside libc is byte-identical every time.
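The arithmetic behind this inference can be sketched as follows. This is an illustrative model, not llama.cpp code: the original pointer value and the interpretation of 68 as a length/count are assumptions.

```python
# Illustrative sketch of how a 4-byte partial overwrite of a 64-bit heap
# pointer reproduces the observed fault-address pattern.
FAULT_LOW = 0x3C          # low 32 bits of the faulting address from dmesg
CHUNK_HDR_OFFSET = 8      # __libc_free reads mchunk_size at [rdi - 0x8]

# rdi (the pointer passed to free) = fault address + 8
rdi_low = FAULT_LOW + CHUNK_HDR_OFFSET
assert rdi_low == 0x44    # decimal 68: a plausible runtime value, not a pointer

def corrupt_low_half(ptr: int, value: int) -> int:
    """Overwrite only the low 32 bits of a 64-bit pointer, keeping the high half."""
    return (ptr & 0xFFFFFFFF_00000000) | (value & 0xFFFFFFFF)

original = 0x00007F3A_1C2D4010            # hypothetical valid heap pointer
clobbered = corrupt_low_half(original, 68)
assert clobbered & 0xFFFFFFFF == 0x44     # matches the 0x<random_high>_00000044 pattern
# free(clobbered) then dereferences clobbered - 8 == 0x..._0000003C -> SIGSEGV
```

A 32-to-64-bit widening without sign extension would yield the same low half but zeroed high bits; the randomized high bits observed here are what point toward a partial overwrite rather than a bad cast.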

Mode B — hard host crash (rarer, but has happened)

Under heavy sustained load against the same binary/config, the host went fully unreachable (ping + SSH dead). pstore ERST triggered a kernel emergency write but we lost the oops itself (ERST buffer overwrote parts 1 & 2; only later boot messages survived). Machine required a hard reboot to recover.

We don't yet have a kernel trace for this, so this issue focuses on the userspace bug only. We will file the amdgpu / kernel angle separately once we can capture a proper kdump in a controlled environment.

Minimal reproducer

Config that crashes:

llama-server \
  -m Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf \
  --mmproj mmproj-Qwen2.5-VL-7B-bf16.gguf \
  -ngl 99 -c 32768 -ctk q8_0 -ctv q8_0 -fit off \
  --batch-size 2048 --ubatch-size 2048 \
  --parallel 1 --threads 8 --no-cache-prompt \
  --host 0.0.0.0 --port 8080

Note: --no-cache-prompt does not disable the RAM prompt cache. See companion issue #22127 about the stale log message.

Load: ~600 text-only classification requests cycling through real bitsavers vintage datasheets (heavy UTF-8 multi-byte: Ω ± µ × ≤ ≥ ° Δ η θ). Without the workaround, the crash lands deterministically between requests 300–500. A synthetic ASCII workload does not reproduce it on this hardware; real OCR'd datasheet text is needed to trigger it.

Full reproducer script + load corpus available on request.
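The shape of the load is roughly the following. A minimal sketch, not the actual script: the corpus path, prompt wording, and truncation limit are placeholders; the endpoint is llama-server's OpenAI-compatible chat API.

```python
# Minimal load-generator sketch approximating the failing workload.
import itertools
import json
import pathlib
import urllib.request

SERVER = "http://127.0.0.1:8080/v1/chat/completions"  # llama-server OpenAI-compatible endpoint

def build_body(text: str) -> bytes:
    """One text-only classify request. ensure_ascii=False sends raw UTF-8
    multi-byte sequences, matching the failing workload."""
    return json.dumps({
        "messages": [{"role": "user",
                      "content": "Classify this datasheet excerpt:\n" + text[:4000]}],
        "max_tokens": 32,
    }, ensure_ascii=False).encode("utf-8")

def classify(text: str) -> str:
    req = urllib.request.Request(SERVER, data=build_body(text),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Cycle ~600 sequential requests through UTF-8-heavy OCR'd text;
    # the crash landed around request 300-500.
    corpus = [p.read_text(encoding="utf-8")
              for p in pathlib.Path("corpus").glob("*.txt")]
    for text in itertools.islice(itertools.cycle(corpus), 600):
        classify(text)
```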

Ruled-out hypotheses (with evidence)

  1. std::regex stack overflow in the tokenizer (the #17636 / #21919 class of crashes): ruled out because the GGUF has tokenizer.ggml.pre = "qwen2", which is handled by the custom unicode_regex_split_custom_qwen2 (PR #21257, commit 0d049d6), not std::regex.
  2. Concurrency / race in slots: ruled out. Runs with --parallel 1 still crash.
  3. Prompt cache feature flag: ruled out. Runs with --no-cache-prompt still crash.
  4. --mmproj text re-tokenization (mtmd_tokenize path): ruled out. For text-only chat requests the code does not enter mtmd_tokenize; and a synthetic mixed ASCII+UTF-8 workload with --mmproj + --parallel 1 + --no-cache-prompt stays stable (AddressSanitizer build, 1170+ requests, 0 corruption).
  5. CPU IFUNC dispatching AVX2 string ops on a non-AVX2 CPU: ruled out. The faulting offset is inside __libc_free (heap management), not a string function. IFUNC isn't in play here.

Root cause direction (suspected, not confirmed)

The crash disappears cleanly when the new prompt cache (introduced in #16391) is disabled via --cache-ram 0:

  • With --cache-ram 8192 (default): process RSS reaches ~11.9 GB under load, crash within ~300–500 requests.
  • With --cache-ram 0: process RSS stays at ~1 GB, 600/600 requests clean, no crashes, no SEGVs in the kernel log.

Corruption is therefore almost certainly in the LRU/eviction path of the prompt cache (srv prompt_save / srv update / srv load as logged), under the specific pressure of mmproj-loaded sessions receiving a diverse UTF-8-heavy corpus. We have not yet instrumented the cache internals to point at an exact line, but AddressSanitizer is available on our end if maintainers want us to run a guided repro.

Workaround (confirmed)

Add --cache-ram 0 to llama-server invocation. This disables the PR #16391 prompt cache entirely. Verified stable over 600 consecutive real-workload requests.

Related issues

  • #22127: companion issue about the stale --no-cache-prompt log message
  • #16391: PR that introduced the RAM prompt cache (the suspected culprit)

What we can contribute

We have:

  • A reproducer corpus (200 real bitsavers PDFs, bucketed by UTF-8 codepoint diversity)
  • An AddressSanitizer-built llama-server ready to run on the affected hardware
  • Capacity to iterate test cases on request (within reasonable windows on a non-production test box)
  • Willingness to help bisect or validate proposed fixes

Happy to guide a maintainer through repro or capture anything specific (coredump with MALLOC_CHECK_=3, ASan log, perf trace on the server — ask).

Environment extra (for completeness)

  • GLIBC_TUNABLES=glibc.malloc.check=3:glibc.malloc.perturb=0x42 was active during post-fix test runs as a shield; it did not produce any malloc_check aborts once --cache-ram 0 was set.
  • amdgpu.runpm=0 amdgpu.aspm=0 pcie_aspm=off amdgpu.gartsize=8192 on kernel cmdline (legacy Polaris params, kept for safety; Navi 21 doesn't strictly need them).
  • LXC is privileged, with GPU passthrough via cgroup2 + /dev/dri/card* + /dev/dri/renderD* bind mounts. Same binary + same model + same reproducer produces crash or stability purely as a function of the --cache-ram setting — container vs. bare-metal was not a factor in isolating this bug.

Reported by Claude and Richard Murray — colossus-ia.org
Investigation done on Colossus-1 (Dual Xeon E5-2690 v1 + RX 6900 XT, Debian 12, Proxmox VE 8.4)
Diagnostic collaboration: Claude clone-colossus (project lead), Claude clone-retrodoc (pipeline client), Claude clone-philippe (supervisor)


Suggested labels for triage: bug, area server, area mtmd.
