Bug
Sample of continuum-core-server during chat-induced CPU spike (Mac M5 Pro, 800–900% CPU observed) shows the hot frames are NOT in libllama or ggml-metal. They are entirely in libonnxruntime.1.24.4.dylib:
347 MlasSgemmThreaded ← libonnxruntime
180 MlasSgemmPackedOperation
167 MlasSgemmPackedOperation
68 MlasSgemmThreaded
...
MlasSgemm* = Microsoft Linear Algebra Subroutines for single-precision matrix multiply, CPU kernels. Our ONNX Runtime build was compiled without the CoreML execution provider (or it's available but not registered at session creation), so every ONNX inference call falls through to MLAS CPU.
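A quick way to distinguish the two failure modes (EP compiled out vs. present-but-unregistered) is to look for CoreML EP symbols in the vendored dylib. This is a sketch; the dylib path is the one from the sample above and must be adjusted to wherever our vendor build lands it:

```shell
# If this prints nothing, the CoreML EP was compiled out of the build;
# if it prints symbols, the EP exists but is not being registered at
# session creation. (Adjust the path to our actual vendored dylib.)
nm -gU /path/to/libonnxruntime.1.24.4.dylib | grep -i coreml
```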
We use ONNX Runtime for at least four hot paths:
- fastembed embeddings — fires on EVERY chat message for vector indexing
- Piper TTS — fires per persona response on voice/audio path
- Whisper STT — fires per voice input
- VisionDescriptionService bridge — fires when a non-vision-capable persona receives an image (bridges via ONNX vision classifier → text)
Of those, fastembed is the per-chat-message hot loop, so it dominates the CPU spike pattern Joel observed correlating with active chat.
Joel: "we can fix the embedding issue next and get our cpu free from the depths of hell."
Why it matters
Llama.cpp Metal LLM inference is doing its work at 25–38% GPU during chat (Joel observed live). The 800–900% CPU spike was the orthogonal ONNX path adding CPU pressure on top. Fixing this is independent of #960 (qwen3.5 SSM Metal kernel coverage) and probably bigger impact since fastembed runs per message regardless of which model the persona uses.
Fix shape
ONNX Runtime supports the CoreML EP on macOS (stable since v1.10, full op coverage by 1.16+), so our 1.24.4 version is well past the cutoff; the gap is in our build flags or our session setup.
- Mac native build: ensure the CoreML EP is included in our ORT build flags. Check --use_coreml (or equivalent) in our ORT vendor cmake.
- Session creation sites: every Ort::Session / OrtSessionOptions we create needs .AppendExecutionProvider_CoreML(coreml_options) registered BEFORE the CPU EP. Fall-through to CPU is fine for unsupported ops; the goal is to put CoreML FIRST in the EP chain.
- Validate per-callsite: start with fastembed (highest-frequency), then TTS/STT, then VisionDescriptionService.
- Linux/CUDA equivalent: AppendExecutionProvider_CUDA(cuda_options) for GPU-accelerated ONNX on the Linux side. Bigmama observed continuum-core CPU spikes of 1000–1700% during chat on Linux too — likely the same root cause.
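A minimal sketch of the session-creation change, covering both platforms. Assumptions to verify against our vendored 1.24.4 headers: the string-keyed SessionOptions::AppendExecutionProvider overload is the current upstream way to register CoreML, and the ModelFormat / MLComputeUnits option keys are taken from upstream docs, not from our tree:

```cpp
#include <onnxruntime_cxx_api.h>

// Register a GPU-capable EP ahead of the default CPU EP, so MLAS only
// runs ops the accelerated provider can't handle.
Ort::Session MakeSession(const Ort::Env& env, const char* model_path) {
  Ort::SessionOptions so;
#if defined(__APPLE__)
  // CoreML first; unsupported ops still fall through to CPU.
  so.AppendExecutionProvider("CoreML", {
      {"ModelFormat", "MLProgram"},
      {"MLComputeUnits", "ALL"}});
#elif defined(USE_CUDA)
  // Linux equivalent: CUDA EP ahead of CPU.
  OrtCUDAProviderOptions cuda_options{};
  so.AppendExecutionProvider_CUDA(cuda_options);
#endif
  // CPU EP is always appended last implicitly; no explicit call needed.
  return Ort::Session(env, model_path, so);
}
```

The same helper (or equivalent options plumbing) would need to reach all four callsites: fastembed, Piper TTS, Whisper STT, and the VisionDescriptionService bridge.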
Linked
PR #950 perf finding. Filed as separate scope from #960 (llama.cpp Metal SSM coverage). Both contribute to Mac CPU pressure but root causes are independent.
Joel directive: bigmama + anvil work on this together.