ONNX Runtime running on CPU (MLAS) instead of Metal — 800-900% CPU spike during chat from fastembed/TTS/STT/vision-bridge #964

@joelteply

Description

Bug

A sample of continuum-core-server during a chat-induced CPU spike (Mac M5 Pro, 800–900% CPU observed) shows the hot frames are NOT in libllama or ggml-metal. They are entirely in libonnxruntime.1.24.4.dylib:

347 MlasSgemmThreaded ← libonnxruntime
180 MlasSgemmPackedOperation
167 MlasSgemmPackedOperation
68  MlasSgemmThreaded
...

MlasSgemm* are MLAS (Microsoft Linear Algebra Subroutines) kernels for single-precision matrix multiply on the CPU. Our ONNX Runtime build was either compiled without the CoreML execution provider, or the provider is available but never registered at session creation, so every ONNX inference call falls through to MLAS on the CPU.
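
A quick way to distinguish the two failure modes (EP not compiled in vs. compiled in but not registered): ONNX Runtime reports its compiled-in providers at runtime. A minimal diagnostic sketch, not from the issue, assuming the standard C++ API header onnxruntime_cxx_api.h:

    // List the execution providers compiled into the linked libonnxruntime.
    // If "CoreMLExecutionProvider" is absent, the dylib was built without
    // CoreML support and no session-level registration fix can help.
    #include <onnxruntime_cxx_api.h>
    #include <iostream>

    int main() {
        for (const auto& provider : Ort::GetAvailableProviders()) {
            std::cout << provider << "\n";  // e.g. CoreMLExecutionProvider, CPUExecutionProvider
        }
        return 0;
    }

If CoreML shows up in that list but the sampler still lands in MlasSgemm*, the problem is registration at the session-creation sites, not the build.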

We use ONNX Runtime for at least four hot paths:

  • fastembed embeddings — fires on EVERY chat message for vector indexing
  • Piper TTS — fires per persona response on voice/audio path
  • Whisper STT — fires per voice input
  • VisionDescriptionService bridge — fires when a non-vision-capable persona receives an image (bridges via ONNX vision classifier → text)

Of those, fastembed is the per-chat-message hot loop, so it dominates the CPU spike pattern Joel observed correlating with active chat.

Joel: "we can fix the embedding issue next and get our cpu free from the depths of hell."

Why it matters

llama.cpp Metal LLM inference does its work at 25–38% GPU during chat (Joel observed this live). The 800–900% CPU spike was the orthogonal ONNX path adding CPU pressure on top. Fixing this is independent of #960 (qwen3.5 SSM Metal kernel coverage) and probably higher impact, since fastembed runs per message regardless of which model the persona uses.

Fix shape

ONNX Runtime supports the CoreML EP on macOS (stable since v1.10, full op coverage from 1.16 onward). Our 1.24.4 build supports it.

  1. Mac native build: ensure CoreML EP is included in our ORT build flags. Check --use_coreml or equivalent in our ORT vendor cmake.
  2. Session creation sites: every Ort::Session / OrtSessionOptions we create needs .AppendExecutionProvider_CoreML(coreml_options) registered BEFORE the CPU EP (see the first sketch after this list). Fall-through to CPU is fine for unsupported ops; the goal is to put CoreML FIRST in the EP chain.
  3. Validate per-callsite: start with fastembed (highest-frequency), then TTS/STT, then VisionDescriptionService.
  4. Linux/CUDA equivalent: AppendExecutionProvider_CUDA(cuda_options) for GPU-accelerated ONNX on the Linux side (see the second sketch below). Bigmama observed continuum-core CPU spikes of 1000–1700% during chat on Linux as well; likely the same root cause.
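
For step 2, a sketch of what a CoreML-first session setup could look like. This assumes the flags-based C-API entry point OrtSessionOptionsAppendExecutionProvider_CoreML from coreml_provider_factory.h (shipped only in CoreML-enabled ORT builds); the function name and model path are illustrative, and each real callsite (fastembed, Piper, Whisper, VisionDescriptionService) will wire this into its own session options:

    // Sketch only: register CoreML ahead of the implicit CPU EP.
    #include <onnxruntime_cxx_api.h>
    #include <coreml_provider_factory.h>  // present only in CoreML-enabled builds

    Ort::Session MakeCoreMLSession(Ort::Env& env, const char* model_path) {
        Ort::SessionOptions opts;
        // EPs are consulted in registration order, and the CPU EP (MLAS) is
        // always the implicit last resort -- so registering CoreML first means
        // the CPU only picks up ops CoreML can't take.
        uint32_t coreml_flags = 0;  // e.g. COREML_FLAG_CREATE_MLPROGRAM on newer ORT
        Ort::ThrowOnError(
            OrtSessionOptionsAppendExecutionProvider_CoreML(opts, coreml_flags));
        return Ort::Session(env, model_path, opts);
    }

Newer ORT releases also accept the named form opts.AppendExecutionProvider("CoreML", {...}) with string options; either way, the load-bearing detail is registration order, not which entry point is used.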
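
For step 4, the Linux-side analogue using the C++ helper Ort::SessionOptions::AppendExecutionProvider_CUDA, which exists in the stock API; device_id 0 is a placeholder:

    // Sketch only: CUDA EP first, CPU EP remains the fallback.
    #include <onnxruntime_cxx_api.h>

    Ort::Session MakeCudaSession(Ort::Env& env, const char* model_path) {
        Ort::SessionOptions opts;
        OrtCUDAProviderOptions cuda_options{};  // zero-init: default arena/stream settings
        cuda_options.device_id = 0;             // placeholder; choose per deployment
        opts.AppendExecutionProvider_CUDA(cuda_options);
        return Ort::Session(env, model_path, opts);
    }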

Linked

Perf finding from PR #950. Filed as separate scope from #960 (llama.cpp Metal SSM coverage). Both contribute to Mac CPU pressure, but the root causes are independent.

Joel directive: bigmama + anvil work on this together.

Labels: bug, performance
