feat(server): add /v1/embeddings route via mlx_embeddings (#1265)
Open · andreinknv wants to merge 1 commit into ml-explore:main
Adds an optional `POST /v1/embeddings` route to `mlx_lm.server` backed by the `mlx_embeddings` package. Enables a single `mlx_lm.server` process to serve both OpenAI-compatible chat and embeddings. When `--embedding-model` is omitted, server behavior is unchanged. The chat code path is not modified, so chat throughput is identical before/after. The embedding model is lazy-loaded on first request. Inference is serialised under a class-level lock because MLX inference is not thread-safe under the default stream (see ml-explore/mlx#3078). 162 lines added, 0 deletions, all in `mlx_lm/server.py`.
# Add `/v1/embeddings` route to `mlx_lm.server`

## Summary
Adds an optional `POST /v1/embeddings` route to `mlx_lm.server`, backed by the existing `mlx_embeddings` package. This enables a single `mlx_lm.server` process to serve both OpenAI-compatible chat and embeddings — a pattern OpenAI-compatible clients (LangChain, LlamaIndex, codegraph, etc.) already expect.
When the `--embedding-model` flag is omitted, server behavior is unchanged. There is no chat-throughput regression because the chat code path is not modified. A sketch of a client call against the new route follows.
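For illustration only, a request against the new route might look like this. This is a hypothetical client sketch: the `localhost:8080` address assumes the server's default port, and the response fields follow the schema documented under "What's added" below.

```python
# Hypothetical client call against the new route; not part of the diff.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/embeddings",  # assumed default host/port
    data=json.dumps({"input": ["def add(a, b): return a + b"]}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# Response shape mirrors OpenAI's embeddings API.
print(body["model"], len(body["data"][0]["embedding"]))
```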
## Motivation
Today, anyone wanting both chat and embeddings on Apple Silicon via mlx-lm has to run two processes (one `mlx_lm.server` for chat plus a separate embedding server like llama.cpp or oMLX), or use a wrapper that adds a custom scheduler in front of mlx-lm — both of which we measured at roughly a 2× chat-throughput cost versus raw `mlx_lm.server`, due to the extra batching layer.
This PR closes the gap natively: one process, one endpoint to configure, chat code path untouched.
## Diff size
162 lines added, 0 deletions, all in `mlx_lm/server.py`. Most of the additions are docstrings and validation; the actual route handler is ~50 lines.
## What's added
- `APIHandler` — new `embedding_model`, `embedding_tokenizer`, `embedding_model_id`, and `_embed_model_path` attributes, plus an `_embed_lock` and `_embed_load_lock`. The lock is required: MLX inference is not thread-safe under the default stream (see ml-explore/mlx#3078).
- `do_POST` dispatch update — `/v1/embeddings` short-circuits to a new `_handle_embeddings` method before the existing chat-shaped body parsing (embeddings have a different request shape).
- `_handle_embeddings` method — lazy-loads the embedding model on first request via `mlx_embeddings.load`, runs inference under the lock, and returns the OpenAI-compatible response shape:

  ```
  {
    "object": "list",
    "data": [{"object": "embedding", "embedding": [...], "index": 0}],
    "model": "...",
    "usage": {"prompt_tokens": N, "total_tokens": N}
  }
  ```

- `--embedding-model` CLI flag — optional. When set, the model path is registered for lazy load. When omitted, the slot stays `None` and `/v1/embeddings` returns `404 - No embedding model loaded`. A sketch of the handler pattern follows this list.
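A minimal sketch of the pattern these bullets describe. The names mirror the PR text, but the body is illustrative rather than the actual diff, and return-style responses stand in for the socket writes a real `BaseHTTPRequestHandler` would perform:

```python
# Illustrative only; assumes mlx_embeddings.load returns (model, tokenizer).
import threading

class APIHandler:
    _embed_lock = threading.Lock()       # serialises MLX inference (mlx#3078)
    _embed_load_lock = threading.Lock()  # guards the one-time model load
    embedding_model = None
    embedding_tokenizer = None
    _embed_model_path = None             # set from --embedding-model, else None

    def _ensure_embedding_model(self):
        # Double-checked locking: pay the load cost once, on first request.
        if type(self).embedding_model is None:
            with self._embed_load_lock:
                if type(self).embedding_model is None:
                    from mlx_embeddings import load  # runtime-only dependency
                    model, tokenizer = load(self._embed_model_path)
                    type(self).embedding_model = model
                    type(self).embedding_tokenizer = tokenizer

    def _handle_embeddings(self, body: dict):
        if self._embed_model_path is None:
            return 404, {"error": "No embedding model loaded"}
        if not body.get("input"):
            return 400, {"error": "input must be a non-empty list"}
        self._ensure_embedding_model()
        with self._embed_lock:
            ...  # run per-text inference, build the OpenAI-shaped response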
```

## Design choices

- Lazy load — operators who never send an embedding request pay no cost (no GPU memory contention with the chat model, no startup latency). The first embed request is ~3s slower; subsequent calls are warm.
- Pooling fallback order: `text_embeds` → `pooler_output` → mean-pooled `last_hidden_state`. Covers bi-encoder MLX models, BERT-family `[CLS]` heads, and the universal mean-pool fallback.
- No `BatchGenerator`-style continuous batching. Continuous batching is the right scaling strategy, but it doubles the patch size and would touch the chat scheduler. This PR keeps the patch tiny and ships lock + per-text inference first; batching can be a follow-up PR.
- No new dependency in `pyproject.toml` — `mlx_embeddings` is a runtime-only requirement of the embedding route. When the package is missing, the route returns a clear 500 with an install hint; chat keeps working. Operators who use only chat are unaffected. A sketch of the pooling fallback follows this list.
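A minimal sketch of the pooling fallback chain, assuming the model output exposes the attributes named above (attribute availability varies by model family, and a production mean pool would also weight by the attention mask):

```python
import mlx.core as mx

def pool(output) -> mx.array:
    # 1. Bi-encoder models that return sentence embeddings directly.
    text_embeds = getattr(output, "text_embeds", None)
    if text_embeds is not None:
        return text_embeds
    # 2. BERT-family models with a [CLS] pooler head.
    pooler_output = getattr(output, "pooler_output", None)
    if pooler_output is not None:
        return pooler_output
    # 3. Universal fallback: mean-pool hidden states over the sequence axis.
    return mx.mean(output.last_hidden_state, axis=1)
```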
## Performance (measured on M-series Apple Silicon, 36 GB)
Single concurrent embed (no batching contention): the patch's per-request throughput is lower because the lock serialises requests; that's the safety contract. For typical codegraph-style indexing workloads (hundreds of embeds in bursts), ~190 req/s already finishes a 3000-symbol pass in ~16 seconds. Continuous batching could approach llama.cpp's throughput; it's left as a follow-up to keep this PR minimal.
Chat throughput (Qwen2.5-Coder-3B-Instruct-4bit, max_tokens=25, parallel=16, N=20): 160 tok/s with or without the embed model loaded — chat path is untouched. A hypothetical reproduction harness follows.
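Numbers like these can be sanity-checked with a small concurrent client. The harness below is a hypothetical reproduction script, not part of the PR (same host/port assumption as earlier):

```python
# Hypothetical throughput harness: fires N embedding requests from a
# thread pool and reports requests/second.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/embeddings"  # assumed default host/port

def one(text: str) -> None:
    req = urllib.request.Request(
        URL,
        data=json.dumps({"input": [text]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()

N = 200
start = time.time()
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(one, (f"symbol {i}" for i in range(N))))
print(f"{N / (time.time() - start):.1f} req/s")
```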
## Tests
I have a suite of 11 tests in my codegraph fork that exercise the factory routing, construction, and short-circuit paths for the in-process embedding client that mirrors this PR's contract. Happy to port them to mlx-lm's test layout if the maintainers prefer landing them with this PR.
## Validation
- `--model` (chat-only) and `--model + --embedding-model` configurations start cleanly.
- `POST /v1/chat/completions` works identically before and after this patch (chat code path unchanged).
- `POST /v1/embeddings` returns 404 when no `--embedding-model` was passed.
- `POST /v1/embeddings` with a valid input returns 768-dim vectors matching `mlx_embeddings.generate` output.
- `input: []` is rejected with 400. (Rough test-shaped versions of these checks follow.)
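Ported to mlx-lm's test layout, the last three checks might look roughly like this. The sketch is hypothetical: it assumes appropriately configured server fixtures on the default port, and the 768 dimensionality matches the model used in the validation above:

```python
# Illustrative checks only; the real suite lives in the author's fork.
# The 404 case assumes a chat-only server; the others assume a server
# started with --embedding-model.
import json
import urllib.error
import urllib.request

import pytest

URL = "http://localhost:8080/v1/embeddings"

def post(payload: dict):
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

def test_embeddings_404_without_model():
    with pytest.raises(urllib.error.HTTPError) as exc:
        post({"input": ["x"]})
    assert exc.value.code == 404

def test_empty_input_rejected():
    with pytest.raises(urllib.error.HTTPError) as exc:
        post({"input": []})
    assert exc.value.code == 400

def test_valid_input_returns_vectors():
    body = json.load(post({"input": ["hello"]}))
    assert len(body["data"][0]["embedding"]) == 768  # model-dependent
```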
## Backwards compatibility
100% backwards compatible. Operators who omit `--embedding-model` see no behavior change. Operators who add it gain a new route; no existing routes change.
## Future work (out of scope for this PR)

- Continuous batching for `/v1/embeddings` to recover the throughput llama.cpp gets via batched matmul.
- A `/v1/rerank` route (nomic + bge-rerank models).
- Listing the embedding model in the `/v1/models` response (currently only the chat model is enumerated).
## License
Apache 2.0 — same as mlx-lm.
Happy to iterate on the diff. The patched file produces identical chat behavior to upstream and adds the embedding route as an opt-in capability.