
feat(server): add /v1/embeddings route via mlx_embeddings #1265

Open

andreinknv wants to merge 1 commit into ml-explore:main from andreinknv:feat/embeddings-route


@andreinknv

Add /v1/embeddings route to mlx_lm.server

Summary

Adds an optional POST /v1/embeddings route to mlx_lm.server,
backed by the existing mlx_embeddings package. This lets a single
mlx_lm.server process serve both OpenAI-compatible chat and
embeddings, a pattern OpenAI-style clients (LangChain, LlamaIndex,
codegraph, etc.) already expect.

When the --embedding-model flag is omitted, server behavior is
unchanged. There is no chat-throughput regression because the chat
code path is not modified.
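
For illustration, this is the client-side pattern the route is meant to serve. A minimal call with the official OpenAI Python client might look like the following; the base URL, API key, and model name are placeholders rather than values taken from this PR:

    # Illustrative only: calling the new route with the official OpenAI Python
    # client (openai>=1.0). base_url, api_key and the model name are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.embeddings.create(
        model="nomic-modernbert-embed-base-4bit",
        input=["def hello(): pass", "how do I pool a BERT embedding?"],
    )

    print(len(resp.data), len(resp.data[0].embedding))  # one vector per input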

Motivation

Today, anyone wanting both chat and embeddings on Apple Silicon via
mlx-lm has to run two processes (one mlx_lm.server for chat plus a
separate embedding server such as llama.cpp or oMLX) or put a wrapper
with a custom scheduler in front of mlx-lm. In our measurements, both
options cost roughly 2× in chat throughput versus raw mlx_lm.server
because of the extra batching layer.

This PR closes the gap natively: one process, one endpoint to
configure, chat code path untouched.

Diff size

162 lines added, 0 deletions, all in mlx_lm/server.py. Most of
the additions are docstrings and validation; the actual route handler
is ~50 lines.

What's added

  1. Class-level slots on APIHandler: embedding_model,
    embedding_tokenizer, embedding_model_id, _embed_model_path,
    plus an _embed_lock and an _embed_load_lock. The lock is required:
    MLX inference is not thread-safe under the default stream
    (see ml-explore/mlx#3078).
  2. do_POST dispatch update: /v1/embeddings short-circuits
    to a new _handle_embeddings method before the existing
    chat-shaped body parsing (embeddings have a different request
    shape).
  3. _handle_embeddings method (sketched after this list): lazy-loads
    the embedding model on first request via mlx_embeddings.load, runs
    inference under the lock, and returns the OpenAI-compatible response shape:
    {
      "object": "list",
      "data": [{"object": "embedding", "embedding": [...], "index": 0}],
      "model": "...",
      "usage": {"prompt_tokens": N, "total_tokens": N}
    }
  4. --embedding-model CLI flag (optional). When set, the model
    path is registered for lazy load. When omitted, the slot stays
    None and /v1/embeddings returns 404 "No embedding model loaded".
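
For reviewers who want the shape of the handler without opening the diff, here is a rough, abridged sketch of what _handle_embeddings does. It is not the literal patch: the _error and _embed_one helpers are illustrative stand-ins, token counting is omitted, and the (model, tokenizer) return shape of mlx_embeddings.load is assumed.

    # Abridged sketch of the handler described in item 3 above; not the literal diff.
    def _handle_embeddings(self, body: dict):
        if self._embed_model_path is None:
            return self._error(404, "No embedding model loaded")

        texts = body.get("input")
        if isinstance(texts, str):
            texts = [texts]
        if not isinstance(texts, list) or not texts:
            return self._error(400, "input must be a non-empty string or list of strings")

        # Lazy load on first request; the path was registered at startup (item 4).
        with self._embed_load_lock:
            if self.embedding_model is None:
                from mlx_embeddings import load
                self.embedding_model, self.embedding_tokenizer = load(self._embed_model_path)

        # Serialise inference: MLX is not thread-safe under the default stream.
        data = []
        with self._embed_lock:
            for i, text in enumerate(texts):
                vec = self._embed_one(text)  # pooled vector; see the fallback chain below
                data.append({"object": "embedding", "embedding": vec, "index": i})

        return {
            "object": "list",
            "data": data,
            "model": self.embedding_model_id,
            "usage": {"prompt_tokens": 0, "total_tokens": 0},  # counts omitted in this sketch
        }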

Design choices

  • Lazy load rather than load-at-startup. Servers that never see
    an embedding request pay no cost (no GPU memory contention with
    the chat model, no startup latency). The first embed request is
    ~3s slower; subsequent calls are warm.
  • Pooled-vector fallback chain (see the sketch after this list):
    text_embeds → pooler_output → mean-pooled last_hidden_state.
    Covers bi-encoder MLX models, BERT-family [CLS] heads, and the
    universal mean-pool fallback.
  • Lock around inference rather than BatchGenerator-style
    continuous batching. Continuous batching is the right scaling
    strategy, but it doubles the patch size and would touch the chat
    scheduler. This PR keeps the patch tiny and ships lock-serialised
    per-text inference first; batching can be a follow-up PR.
  • No new optional dependency added to pyproject.toml:
    mlx_embeddings is a runtime-only requirement of the embedding
    route. When the package is missing, the route returns a clear
    500 with an install hint; chat keeps working. Operators who
    use only chat are unaffected.
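
A minimal sketch of that fallback chain, assuming Hugging Face-style output attributes on the model's forward result; the helper name and exact tensor shapes are illustrative, not part of the diff:

    # Illustrative pooled-vector fallback chain; attribute names come from the
    # bullet above, the helper name and shapes are assumptions.
    import mlx.core as mx

    def pooled_vector(output, attention_mask: mx.array) -> mx.array:
        # 1. Bi-encoder models that expose a ready-made text embedding.
        if getattr(output, "text_embeds", None) is not None:
            return output.text_embeds[0]

        # 2. BERT-family models with a [CLS]-head pooler.
        if getattr(output, "pooler_output", None) is not None:
            return output.pooler_output[0]

        # 3. Universal fallback: mean-pool last_hidden_state over non-padding tokens.
        hidden = output.last_hidden_state[0]                       # (seq_len, dim)
        mask = mx.expand_dims(attention_mask[0], axis=-1).astype(hidden.dtype)
        return (hidden * mask).sum(axis=0) / mask.sum()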

Performance (measured on 36 GB M-series Apple Silicon)

Single concurrent embed (no batching contention):

  • llama.cpp + nomic-embed Q4 GGUF, warm: ~650 req/s
  • This patch + nomic-modernbert-embed-base 4bit, lock-serialised: ~190 req/s

The patch's per-request throughput is lower because the lock
serialises requests; that's the safety contract. For typical
codegraph-style indexing workloads (hundreds of embeds in bursts),
~190 req/s already finishes a 3000-symbol pass in ~16 seconds.
Continuous batching could approach llama.cpp's throughput; left as
a follow-up to keep this PR minimal.

Chat throughput (Qwen2.5-Coder-3B-Instruct-4bit, max_tokens=25,
parallel=16, N=20): 160 tok/s with or without the embed model
loaded — chat path is untouched.

Tests

I have a suite of 11 tests in my codegraph fork that exercise the
factory routing, construction, and short-circuit paths for the
in-process embedding client that mirrors this PR's contract. Happy
to port them to mlx-lm's test layout if the maintainers prefer
landing them with this PR.

Validation

  • Both --model (chat-only) and --model + --embedding-model
    configurations start cleanly.
  • POST /v1/chat/completions works identically before and after
    this patch (chat code path unchanged).
  • POST /v1/embeddings returns 404 when no --embedding-model was
    passed.
  • POST /v1/embeddings with a valid input returns 768-dim vectors
    matching mlx_embeddings.generate output (a test sketch follows
    this list).
  • Empty input: [] rejected with 400.
  • Non-dict body rejected with 400.
  • Concurrent requests serialise via the lock without crashing.
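
If it helps review, a couple of these checks could be written roughly as follows. This is a sketch against a locally running server: the base URL is a placeholder, the post_json helper is local to the test, and the 768 dimension matches the embedding model mentioned above.

    # Sketch of two validation checks as tests against a running server.
    import json
    import urllib.error
    import urllib.request

    BASE = "http://localhost:8080"  # placeholder

    def post_json(path, payload):
        req = urllib.request.Request(
            BASE + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.status, json.load(resp)
        except urllib.error.HTTPError as e:
            return e.code, {}

    def test_empty_input_rejected():
        status, _ = post_json("/v1/embeddings", {"input": []})
        assert status == 400

    def test_valid_input_returns_768_dim_vectors():
        status, body = post_json("/v1/embeddings", {"input": ["hello world"]})
        assert status == 200
        assert len(body["data"][0]["embedding"]) == 768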

Backwards compatibility

100% backwards compatible. Operators who omit --embedding-model
see no behavior change. Operators who add it gain a new route; no
existing routes change.

Future work (out of scope for this PR)

  • Continuous batching for /v1/embeddings to recover the throughput
    llama.cpp gets via batched matmul.
  • /v1/rerank route (nomic + bge-rerank models).
  • Listing the embedding model in /v1/models response (currently
    only the chat model is enumerated).

License

Apache 2.0 — same as mlx-lm.


Happy to iterate on the diff. The patched file produces identical
chat behavior to upstream and adds the embedding route as an opt-in
capability.

Adds an optional POST /v1/embeddings route to mlx_lm.server backed
by the mlx_embeddings package. Enables a single mlx_lm.server
process to serve both OpenAI-compatible chat AND embeddings.

When --embedding-model is omitted, server behavior is unchanged.
Chat code path is not modified, so chat throughput is identical
before/after. Embedding model is lazy-loaded on first request.

Inference is serialised under a class-level lock because MLX
inference is not thread-safe under the default stream
(see ml-explore/mlx#3078).

162 lines added, 0 deletions, all in mlx_lm/server.py.
