
feat(server): add /v1/embeddings route via mlx_embeddings #1265

Open

andreinknv wants to merge 1 commit into ml-explore:main from andreinknv:feat/embeddings-route


@andreinknv

Add /v1/embeddings route to mlx_lm.server

Summary

Adds an optional POST /v1/embeddings route to mlx_lm.server,
backed by the existing mlx_embeddings package. This lets a single
mlx_lm.server process serve both OpenAI-compatible chat and
embeddings, a pattern OpenAI-style clients (LangChain, LlamaIndex,
codegraph, etc.) already expect.

When the --embedding-model flag is omitted, server behavior is
unchanged. There is no chat-throughput regression because the chat
code path is not modified.
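
For illustration, this is the client-side pattern the route is meant to serve. A minimal call with the official OpenAI Python client might look like the following; the base URL, API key, and model name are placeholders rather than values taken from this PR:

    # Illustrative only: calling the new route with the official OpenAI Python
    # client (openai>=1.0). base_url, api_key and the model name are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.embeddings.create(
        model="nomic-modernbert-embed-base-4bit",
        input=["def hello(): pass", "how do I pool a BERT embedding?"],
    )

    print(len(resp.data), len(resp.data[0].embedding))  # one vector per input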

Motivation

Today, anyone wanting both chat and embeddings on Apple Silicon via
mlx-lm has to run two processes (one mlx_lm.server for chat plus a
separate embedding server such as llama.cpp or oMLX) or put a wrapper
with a custom scheduler in front of mlx-lm. In our measurements, both
options cost roughly 2× in chat throughput versus raw mlx_lm.server
because of the extra batching layer.

This PR closes the gap natively: one process, one endpoint to
configure, chat code path untouched.

Diff size

162 lines added, 0 deletions, all in mlx_lm/server.py. Most of
the additions are docstrings and validation; the actual route handler
is ~50 lines.

What's added

  1. Class-level slots on APIHandler: embedding_model,
    embedding_tokenizer, embedding_model_id, _embed_model_path,
    plus an _embed_lock and an _embed_load_lock. The lock is required:
    MLX inference is not thread-safe under the default stream
    (see ml-explore/mlx#3078).
  2. do_POST dispatch update: /v1/embeddings short-circuits
    to a new _handle_embeddings method before the existing
    chat-shaped body parsing (embeddings have a different request
    shape).
  3. _handle_embeddings method (sketched after this list): lazy-loads
    the embedding model on first request via mlx_embeddings.load, runs
    inference under the lock, and returns the OpenAI-compatible response shape:
    {
      "object": "list",
      "data": [{"object": "embedding", "embedding": [...], "index": 0}],
      "model": "...",
      "usage": {"prompt_tokens": N, "total_tokens": N}
    }
  4. --embedding-model CLI flag (optional). When set, the model
    path is registered for lazy load. When omitted, the slot stays
    None and /v1/embeddings returns 404 "No embedding model loaded".
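
For reviewers who want the shape of the handler without opening the diff, here is a rough, abridged sketch of what _handle_embeddings does. It is not the literal patch: the _error and _embed_one helpers are illustrative stand-ins, token counting is omitted, and the (model, tokenizer) return shape of mlx_embeddings.load is assumed.

    # Abridged sketch of the handler described in item 3 above; not the literal diff.
    def _handle_embeddings(self, body: dict):
        if self._embed_model_path is None:
            return self._error(404, "No embedding model loaded")

        texts = body.get("input")
        if isinstance(texts, str):
            texts = [texts]
        if not isinstance(texts, list) or not texts:
            return self._error(400, "input must be a non-empty string or list of strings")

        # Lazy load on first request; the path was registered at startup (item 4).
        with self._embed_load_lock:
            if self.embedding_model is None:
                from mlx_embeddings import load
                self.embedding_model, self.embedding_tokenizer = load(self._embed_model_path)

        # Serialise inference: MLX is not thread-safe under the default stream.
        data = []
        with self._embed_lock:
            for i, text in enumerate(texts):
                vec = self._embed_one(text)  # pooled vector; see the fallback chain below
                data.append({"object": "embedding", "embedding": vec, "index": i})

        return {
            "object": "list",
            "data": data,
            "model": self.embedding_model_id,
            "usage": {"prompt_tokens": 0, "total_tokens": 0},  # counts omitted in this sketch
        }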

Design choices

  • Lazy load rather than load-at-startup. Servers that never see
    an embedding request pay no cost (no GPU memory contention with
    the chat model, no startup latency). The first embed request is
    ~3s slower; subsequent calls are warm.
  • Pooled-vector fallback chain (see the sketch after this list):
    text_embeds → pooler_output → mean-pooled last_hidden_state.
    Covers bi-encoder MLX models, BERT-family [CLS] heads, and the
    universal mean-pool fallback.
  • Lock around inference rather than BatchGenerator-style
    continuous batching. Continuous batching is the right scaling
    strategy, but it doubles the patch size and would touch the chat
    scheduler. This PR keeps the patch tiny and ships lock-serialised
    per-text inference first; batching can be a follow-up PR.
  • No new optional dependency added to pyproject.toml:
    mlx_embeddings is a runtime-only requirement of the embedding
    route. When the package is missing, the route returns a clear
    500 with an install hint; chat keeps working. Operators who
    use only chat are unaffected.
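
A minimal sketch of that fallback chain, assuming Hugging Face-style output attributes on the model's forward result; the helper name and exact tensor shapes are illustrative, not part of the diff:

    # Illustrative pooled-vector fallback chain; attribute names come from the
    # bullet above, the helper name and shapes are assumptions.
    import mlx.core as mx

    def pooled_vector(output, attention_mask: mx.array) -> mx.array:
        # 1. Bi-encoder models that expose a ready-made text embedding.
        if getattr(output, "text_embeds", None) is not None:
            return output.text_embeds[0]

        # 2. BERT-family models with a [CLS]-head pooler.
        if getattr(output, "pooler_output", None) is not None:
            return output.pooler_output[0]

        # 3. Universal fallback: mean-pool last_hidden_state over non-padding tokens.
        hidden = output.last_hidden_state[0]                       # (seq_len, dim)
        mask = mx.expand_dims(attention_mask[0], axis=-1).astype(hidden.dtype)
        return (hidden * mask).sum(axis=0) / mask.sum()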

Performance (measured on 36 GB M-series Apple Silicon)

Single concurrent embed (no batching contention):

  • llama.cpp + nomic-embed Q4 GGUF, warm: ~650 req/s
  • This patch + nomic-modernbert-embed-base 4bit, lock-serialised: ~190 req/s

The patch's per-request throughput is lower because the lock
serialises requests; that's the safety contract. For typical
codegraph-style indexing workloads (hundreds of embeds in bursts),
~190 req/s already finishes a 3000-symbol pass in ~16 seconds.
Continuous batching could approach llama.cpp's throughput; left as
a follow-up to keep this PR minimal.

Chat throughput (Qwen2.5-Coder-3B-Instruct-4bit, max_tokens=25,
parallel=16, N=20): 160 tok/s with or without the embed model
loaded — chat path is untouched.

Tests

I have a suite of 11 tests in my codegraph fork that exercise the
factory routing, construction, and short-circuit paths for the
in-process embedding client that mirrors this PR's contract. Happy
to port them to mlx-lm's test layout if the maintainers prefer
landing them with this PR.

Validation

  • Both --model (chat-only) and --model + --embedding-model
    configurations start cleanly.
  • POST /v1/chat/completions works identically before and after
    this patch (chat code path unchanged).
  • POST /v1/embeddings returns 404 when no --embedding-model was
    passed.
  • POST /v1/embeddings with a valid input returns 768-dim vectors
    matching mlx_embeddings.generate output (a test sketch follows
    this list).
  • Empty input: [] rejected with 400.
  • Non-dict body rejected with 400.
  • Concurrent requests serialise via the lock without crashing.
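
If it helps review, a couple of these checks could be written roughly as follows. This is a sketch against a locally running server: the base URL is a placeholder, the post_json helper is local to the test, and the 768 dimension matches the embedding model mentioned above.

    # Sketch of two validation checks as tests against a running server.
    import json
    import urllib.error
    import urllib.request

    BASE = "http://localhost:8080"  # placeholder

    def post_json(path, payload):
        req = urllib.request.Request(
            BASE + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.status, json.load(resp)
        except urllib.error.HTTPError as e:
            return e.code, {}

    def test_empty_input_rejected():
        status, _ = post_json("/v1/embeddings", {"input": []})
        assert status == 400

    def test_valid_input_returns_768_dim_vectors():
        status, body = post_json("/v1/embeddings", {"input": ["hello world"]})
        assert status == 200
        assert len(body["data"][0]["embedding"]) == 768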

Backwards compatibility

100% backwards compatible. Operators who omit --embedding-model
see no behavior change. Operators who add it gain a new route; no
existing routes change.

Future work (out of scope for this PR)

  • Continuous batching for /v1/embeddings to recover the throughput
    llama.cpp gets via batched matmul.
  • /v1/rerank route (nomic + bge-rerank models).
  • Listing the embedding model in /v1/models response (currently
    only the chat model is enumerated).

License

Apache 2.0 — same as mlx-lm.


Happy to iterate on the diff. The patched file produces identical
chat behavior to upstream and adds the embedding route as an opt-in
capability.

Adds an optional POST /v1/embeddings route to mlx_lm.server backed
by the mlx_embeddings package. Enables a single mlx_lm.server
process to serve both OpenAI-compatible chat AND embeddings.

When --embedding-model is omitted, server behavior is unchanged.
Chat code path is not modified, so chat throughput is identical
before/after. Embedding model is lazy-loaded on first request.

Inference is serialised under a class-level lock because MLX
inference is not thread-safe under the default stream
(see ml-explore/mlx#3078).

162 lines added, 0 deletions, all in mlx_lm/server.py.
