[Docs] Remove the mention of the gateway endpoint #3514

peterschmidt85 · peterschmidt85 · commit 17cbde04bfc6 · 2026-01-29T20:04:56.000+01:00
diff --git a/docs/blog/posts/dstack-sky.md b/docs/blog/posts/dstack-sky.md
@@ -121,15 +121,14 @@ model: mixtral
 ```
 </div>
 
-If it has a `model` mapping, the model will be accessible
-at `https://gateway.<project name>.sky.dstack.ai` via the OpenAI compatible interface.
+The service endpoint will be accessible at `https://<run name>.<project name>.sky.dstack.ai` via the OpenAI-compatible interface.
 
 ```python
 from openai import OpenAI
 
 
 client = OpenAI(
-  base_url="https://gateway.<project name>.sky.dstack.ai",
+  base_url="https://<run name>.<project name>.sky.dstack.ai/v1",
   api_key="<dstack token>"
 )
 
diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md
@@ -68,7 +68,7 @@ Model meta-llama/Meta-Llama-3.1-8B-Instruct is published at:
 
 `dstack apply` automatically provisions instances and runs the service.
 
-If a [gateway](gateways.md) is not configured, the service’s endpoint will be accessible at
+If you do not have a [gateway](gateways.md) created, the service endpoint will be accessible at
 `<dstack server URL>/proxy/services/<project name>/<run name>/`.
 
 <div class="termy">
@@ -90,37 +90,50 @@ $ curl http://localhost:3000/proxy/services/main/llama31/v1/chat/completions \
 
 </div>
 
-If the service defines the [`model`](#model) property, the model can be accessed with
-the global OpenAI-compatible endpoint at `<dstack server URL>/proxy/models/<project name>/`,
-or via `dstack` UI.
+<!-- If [authorization](#authorization) is not disabled, the service endpoint requires the `Authorization` header with `Bearer <dstack token>`. -->
 
-If [authorization](#authorization) is not disabled, the service endpoint requires the `Authorization` header with
-`Bearer <dstack token>`.
+## Configuration options
 
-??? info "Gateway"
-    Running services for development purposes doesn’t require setting up a [gateway](gateways.md).
+<!-- !!! info "No commands"
+    If `commands` are not specified, `dstack` runs `image`’s entrypoint (or fails if none is set). -->
 
-    However, you'll need a gateway in the following cases:
+### Gateway
 
-    * To use auto-scaling or rate limits
-    * To enable a support custom router, e.g. such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
-    * To enable HTTPS for the endpoint and map it to your domain
-    * If your service requires WebSockets
-    * If your service cannot work with a [path prefix](#path-prefix)
+Here are cases where a service may need a gateway:
 
-    <!-- Note, if you're using [dstack Sky](https://sky.dstack.ai),
-    a gateway is already pre-configured for you. -->
+* To use [auto-scaling](#replicas-and-scaling) or [rate limits](#rate-limits)
+* To enable support for a custom router, such as the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#)
+* To enable HTTPS for the endpoint and map it to your domain
+* If your service requires WebSockets
+* If your service cannot work with a [path prefix](#path-prefix)
 
-    If a [gateway](gateways.md) is configured, the service endpoint will be accessible at
-    `https://<run name>.<gateway domain>/`.
+<!-- Note, if you're using [dstack Sky](https://sky.dstack.ai),
+a gateway is already pre-configured for you. -->
 
-    If the service defines the `model` property, the model will be available via the global OpenAI-compatible endpoint 
-    at `https://gateway.<gateway domain>/`.
+If you want `dstack` to explicitly validate that a gateway is used, you can set the [`gateway`](../reference/dstack.yml/service.md#gateway) property in the service configuration to `true`. In this case, `dstack` will raise an error during `dstack apply` if a default gateway is not created.
 
-## Configuration options
+You can also set the `gateway` property to the name of a specific gateway, if required.
+
+If you have a [gateway](gateways.md) created, the service endpoint will be accessible at `https://<run name>.<gateway domain>/`:
+
+<div class="termy">
+
+```shell
+$ curl https://llama31.example.com/v1/chat/completions \
+    -H 'Content-Type: application/json' \
+    -H 'Authorization: Bearer &lt;dstack token&gt;' \
+    -d '{
+        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+        "messages": [
+            {
+                "role": "user",
+                "content": "Compose a poem that explains the concept of recursion in programming."
+            }
+        ]
+    }'
+```
 
-!!! info "No commands"
-    If `commands` are not specified, `dstack` runs `image`’s entrypoint (or fails if none is set).
+</div>
 
 ### Replicas and scaling
 
@@ -215,12 +228,6 @@ Setting the minimum number of replicas to `0` allows the service to scale down t
 ??? info "Disaggregated serving"
     Native support for disaggregated prefill and decode, allowing both worker types to run within a single service, is coming soon.
 
-### Model
-
-If the service is running a chat model with an OpenAI-compatible interface,
-set the [`model`](#model) property to make the model accessible via `dstack`'s 
-global OpenAI-compatible endpoint, and also accessible via `dstack`'s UI.
-
 ### Authorization
 
 By default, the service enables authorization, meaning the service endpoint requires a `dstack` user token.
@@ -359,7 +366,7 @@ set [`strip_prefix`](../reference/dstack.yml/service.md#strip_prefix) to `false`
 If your app cannot be configured to work with a path prefix, you can host it
 on a dedicated domain name by setting up a [gateway](gateways.md).
 
-### Rate limits { #rate-limits }
+### Rate limits
 
 If you have a [gateway](gateways.md), you can configure rate limits for your service
 using the [`rate_limits`](../reference/dstack.yml/service.md#rate_limits) property.
@@ -408,6 +415,11 @@ Limits apply to the whole service (all replicas) and per client (by IP). Clients
 
     </div>
 
+### Model
+
+If the service runs a model with an OpenAI-compatible interface, you can set the [`model`](#model) property to make the model accessible through `dstack`'s chat UI on the `Models` page. 
+In this case, `dstack` will use the service's `/v1/chat/completions` endpoint.
+
 ### Resources
 
 If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a 
diff --git a/examples/inference/nim/README.md b/examples/inference/nim/README.md
@@ -78,13 +78,12 @@ Provisioning...
 ```
 </div>
 
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
 
 <div class="termy">
 
 ```shell
-$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
+$ curl http://127.0.0.1:3000/proxy/services/main/serve-distill-deepseek/v1/chat/completions \
     -X POST \
     -H 'Authorization: Bearer &lt;dstack token&gt;' \
     -H 'Content-Type: application/json' \
@@ -106,8 +105,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
 
 </div>
 
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://serve-distill-deepseek.<gateway domain>/`.
 
 ## Source code
 
diff --git a/examples/inference/sglang/README.md b/examples/inference/sglang/README.md
@@ -12,7 +12,7 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B
 
     ```yaml
     type: service
-    name: deepseek-r1-nvidia
+    name: deepseek-r1
 
     image: lmsysorg/sglang:latest
     env:
@@ -38,7 +38,7 @@ Here's an example of a service that deploys DeepSeek-R1-Distill-Llama 8B and 70B
 
     ```yaml
     type: service
-    name: deepseek-r1-amd
+    name: deepseek-r1
 
     image: lmsysorg/sglang:v0.4.1.post4-rocm620
     env:
@@ -69,20 +69,19 @@ $ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml
  #  BACKEND  REGION     RESOURCES                         SPOT  PRICE
  1  runpod   EU-RO-1   24xCPU, 283GB, 1xMI300X (192GB)    no    $2.49
 
-Submit the run deepseek-r1-amd? [y/n]: y
+Submit the run deepseek-r1? [y/n]: y
 
 Provisioning...
 ---> 100%
 ```
 </div>
 
-Once the service is up, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
 
 <div class="termy">
 
 ```shell
-curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
+$ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
     -X POST \
     -H 'Authorization: Bearer &lt;dstack token&gt;' \
     -H 'Content-Type: application/json' \
@@ -107,7 +106,7 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
 !!! info "SGLang Model Gateway"
     If you'd like to use a custom routing policy, e.g. by leveraging the [SGLang Model Gateway](https://docs.sglang.ai/advanced_features/router.html#), create a gateway with `router` set to `sglang`. Check out [gateways](https://dstack.ai/docs/concepts/gateways#router) for more details.
 
-> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling or HTTPs, rate-limits, etc), the OpenAI-compatible endpoint is available at `https://gateway.<gateway domain>/`.
+> If a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured (e.g. to enable auto-scaling, HTTPS, or rate limits), the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
 
 ## Source code
 
diff --git a/examples/inference/tgi/README.md b/examples/inference/tgi/README.md
@@ -82,13 +82,12 @@ Provisioning...
 ```
 </div>
 
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
 
 <div class="termy">
 
 ```shell
-$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
+$ curl http://127.0.0.1:3000/proxy/services/main/llama4-scout/v1/chat/completions \
     -X POST \
     -H 'Authorization: Bearer &lt;dstack token&gt;' \
     -H 'Content-Type: application/json' \
@@ -110,8 +109,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
 
 </div>
 
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama4-scout.<gateway domain>/`.
 
 ## Source code
 
diff --git a/examples/inference/trtllm/README.md b/examples/inference/trtllm/README.md
@@ -330,13 +330,12 @@ Provisioning...
 
 ## Access the endpoint
 
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
 
 <div class="termy">
 
 ```shell
-$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
+$ curl http://127.0.0.1:3000/proxy/services/main/serve-distill/v1/chat/completions \
     -X POST \
     -H 'Authorization: Bearer &lt;dstack token&gt;' \
     -H 'Content-Type: application/json' \
@@ -359,8 +358,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
 
 </div>
 
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://serve-distill.<gateway domain>/`.
 
 ## Source code
 
diff --git a/examples/inference/vllm/README.md b/examples/inference/vllm/README.md
@@ -78,13 +78,12 @@ Provisioning...
 ```
 </div>
 
-If no gateway is created, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
 
 <div class="termy">
 
 ```shell
-$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
+$ curl http://127.0.0.1:3000/proxy/services/main/llama31/v1/chat/completions \
     -X POST \
     -H 'Authorization: Bearer &lt;dstack token&gt;' \
     -H 'Content-Type: application/json' \
@@ -106,8 +105,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
 
 </div>
 
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama31.<gateway domain>/`.
 
 ## Source code
 
diff --git a/examples/llms/deepseek/README.md b/examples/llms/deepseek/README.md
@@ -179,7 +179,7 @@ Both SGLang and vLLM also support `Deepseek-V2-Lite`.
 
     ```yaml
     type: service
-    name: deepseek-r1-nvidia
+    name: deepseek-r1
 
     image: lmsysorg/sglang:latest
     env:
@@ -203,7 +203,7 @@ Both SGLang and vLLM also support `Deepseek-V2-Lite`.
 
     ```yaml
     type: service
-    name: deepseek-r1-nvidia
+    name: deepseek-r1
 
     image: vllm/vllm-openai:latest
     env:
@@ -255,20 +255,19 @@ $ dstack apply -f examples/llms/deepseek/sglang/amd/.dstack.yml
  #  BACKEND  REGION     RESOURCES                         SPOT  PRICE
  1  runpod   EU-RO-1   24xCPU, 283GB, 1xMI300X (192GB)    no    $2.49
 
-Submit the run deepseek-r1-amd? [y/n]: y
+Submit the run deepseek-r1? [y/n]: y
 
 Provisioning...
 ---> 100%
 ```
 </div>
 
-Once the service is up, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
 
 <div class="termy">
 
 ```shell
-curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
+$ curl http://127.0.0.1:3000/proxy/services/main/deepseek-r1/v1/chat/completions \
     -X POST \
     -H 'Authorization: Bearer &lt;dstack token&gt;' \
     -H 'Content-Type: application/json' \
@@ -290,8 +289,7 @@ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
 ```
 </div>
 
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://deepseek-r1.<gateway domain>/`.
 
 ## Fine-tuning
 
diff --git a/examples/llms/llama31/README.md b/examples/llms/llama31/README.md
@@ -179,13 +179,12 @@ Provisioning...
 
 </div>
 
-Once the service is up, the model will be available via the OpenAI-compatible endpoint
-at `<dstack server URL>/proxy/models/<project name>/`.
+If no gateway is created, the service endpoint will be available at `<dstack server URL>/proxy/services/<project name>/<run name>/`.
 
 <div class="termy">
 
 ```shell
-$ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
+$ curl http://127.0.0.1:3000/proxy/services/main/llama31/v1/chat/completions \
     -X POST \
     -H 'Authorization: Bearer &lt;dstack token&gt;' \
     -H 'Content-Type: application/json' \
@@ -207,8 +206,7 @@ $ curl http://127.0.0.1:3000/proxy/models/main/chat/completions \
 
 </div>
 
-When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the OpenAI-compatible endpoint
-is available at `https://gateway.<gateway domain>/`.
+When a [gateway](https://dstack.ai/docs/concepts/gateways/) is configured, the service endpoint will be available at `https://llama31.<gateway domain>/`.
 
 [//]: # (TODO: How to prompting and tool calling)