ONNX Runtime running on CPU (MLAS) instead of Metal — 800-900% CPU spike during chat from fastembed/TTS/STT/vision-bridge #964

@joelteply

Description

Bug

A sample of continuum-core-server during a chat-induced CPU spike (Mac M5 Pro, 800–900% CPU observed) shows the hot frames are NOT in libllama or ggml-metal. They are entirely in libonnxruntime.1.24.4.dylib:

347 MlasSgemmThreaded ← libonnxruntime
180 MlasSgemmPackedOperation
167 MlasSgemmPackedOperation
68  MlasSgemmThreaded
...

MlasSgemm* are MLAS (Microsoft Linear Algebra Subroutines) kernels for single-precision matrix multiply on the CPU. Our ONNX Runtime build was either compiled without the CoreML execution provider, or the provider is available but never registered at session creation, so every ONNX inference call falls through to MLAS on the CPU.
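
A quick way to distinguish the two failure modes (EP not compiled in vs. compiled in but not registered): ONNX Runtime reports its compiled-in providers at runtime. A minimal diagnostic sketch, not from the issue, assuming the standard C++ API header onnxruntime_cxx_api.h:

    // List the execution providers compiled into the linked libonnxruntime.
    // If "CoreMLExecutionProvider" is absent, the dylib was built without
    // CoreML support and no session-level registration fix can help.
    #include <onnxruntime_cxx_api.h>
    #include <iostream>

    int main() {
        for (const auto& provider : Ort::GetAvailableProviders()) {
            std::cout << provider << "\n";  // e.g. CoreMLExecutionProvider, CPUExecutionProvider
        }
        return 0;
    }

If CoreML shows up in that list but the sampler still lands in MlasSgemm*, the problem is registration at the session-creation sites, not the build.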

We use ONNX Runtime for at least four hot paths:

  • fastembed embeddings — fires on EVERY chat message for vector indexing
  • Piper TTS — fires per persona response on voice/audio path
  • Whisper STT — fires per voice input
  • VisionDescriptionService bridge — fires when a non-vision-capable persona receives an image (bridges via ONNX vision classifier → text)

Of those, fastembed is the per-chat-message hot loop, so it dominates the CPU spike pattern Joel observed correlating with active chat.

Joel: "we can fix the embedding issue next and get our cpu free from the depths of hell."

Why it matters

llama.cpp Metal LLM inference does its work at 25–38% GPU during chat (Joel observed this live). The 800–900% CPU spike was the orthogonal ONNX path adding CPU pressure on top. Fixing this is independent of #960 (qwen3.5 SSM Metal kernel coverage) and probably higher impact, since fastembed runs per message regardless of which model the persona uses.

Fix shape

ONNX Runtime supports the CoreML EP on macOS (stable since v1.10, full op coverage from 1.16 onward). Our 1.24.4 build supports it.

  1. Mac native build: ensure CoreML EP is included in our ORT build flags. Check --use_coreml or equivalent in our ORT vendor cmake.
  2. Session creation sites: every Ort::Session / OrtSessionOptions we create needs .AppendExecutionProvider_CoreML(coreml_options) registered BEFORE the CPU EP (see the first sketch after this list). Fall-through to CPU is fine for unsupported ops; the goal is to put CoreML FIRST in the EP chain.
  3. Validate per-callsite: start with fastembed (highest-frequency), then TTS/STT, then VisionDescriptionService.
  4. Linux/CUDA equivalent: AppendExecutionProvider_CUDA(cuda_options) for GPU-accelerated ONNX on the Linux side (see the second sketch below). Bigmama observed continuum-core CPU spikes of 1000–1700% during chat on Linux as well; likely the same root cause.
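
For step 2, a sketch of what a CoreML-first session setup could look like. This assumes the flags-based C-API entry point OrtSessionOptionsAppendExecutionProvider_CoreML from coreml_provider_factory.h (shipped only in CoreML-enabled ORT builds); the function name and model path are illustrative, and each real callsite (fastembed, Piper, Whisper, VisionDescriptionService) will wire this into its own session options:

    // Sketch only: register CoreML ahead of the implicit CPU EP.
    #include <onnxruntime_cxx_api.h>
    #include <coreml_provider_factory.h>  // present only in CoreML-enabled builds

    Ort::Session MakeCoreMLSession(Ort::Env& env, const char* model_path) {
        Ort::SessionOptions opts;
        // EPs are consulted in registration order, and the CPU EP (MLAS) is
        // always the implicit last resort -- so registering CoreML first means
        // the CPU only picks up ops CoreML can't take.
        uint32_t coreml_flags = 0;  // e.g. COREML_FLAG_CREATE_MLPROGRAM on newer ORT
        Ort::ThrowOnError(
            OrtSessionOptionsAppendExecutionProvider_CoreML(opts, coreml_flags));
        return Ort::Session(env, model_path, opts);
    }

Newer ORT releases also accept the named form opts.AppendExecutionProvider("CoreML", {...}) with string options; either way, the load-bearing detail is registration order, not which entry point is used.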
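
For step 4, the Linux-side analogue using the C++ helper Ort::SessionOptions::AppendExecutionProvider_CUDA, which exists in the stock API; device_id 0 is a placeholder:

    // Sketch only: CUDA EP first, CPU EP remains the fallback.
    #include <onnxruntime_cxx_api.h>

    Ort::Session MakeCudaSession(Ort::Env& env, const char* model_path) {
        Ort::SessionOptions opts;
        OrtCUDAProviderOptions cuda_options{};  // zero-init: default arena/stream settings
        cuda_options.device_id = 0;             // placeholder; choose per deployment
        opts.AppendExecutionProvider_CUDA(cuda_options);
        return Ort::Session(env, model_path, opts);
    }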

Linked

Perf finding from PR #950. Filed as separate scope from #960 (llama.cpp Metal SSM coverage). Both contribute to Mac CPU pressure, but the root causes are independent.

Joel directive: bigmama + anvil work on this together.

Labels: bug, performance
