From 46557affb39e341d1e29ebf796317305dcbb2566 Mon Sep 17 00:00:00 2001
From: Liuhaai
Date: Tue, 5 May 2026 22:16:24 -0700
Subject: [PATCH] =?UTF-8?q?config:=20raise=20vlm=5Fapi=5Fconcurrency=20def?=
 =?UTF-8?q?ault=201=20=E2=86=92=2016?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The API-layer semaphore in routers/inference.py:_get_vlm_semaphore was
introduced to protect local GPU backends from concurrent generation. It
defaulted to 1, which made sense before backend-owned locking landed in
PR #62.

Now that each backend owns its own _lock (BaseBackend._lock = Lock() for
local, RemoteHTTPBackend._lock = nullcontext() for remote), the API
semaphore is no longer the serialization point:

- Local backends still serialize on their per-backend lock; a higher
  semaphore value just lets requests wait at the lock instead of at the
  HTTP handler. Observable behavior is identical.
- Remote backends use nullcontext, so the semaphore value directly
  controls how many HTTPS requests run in parallel against the upstream
  provider (e.g. DashScope).

In prod (multi-camera cortex deployment with TRIO_REMOTE_VLM_URL set),
default=1 caused VLM avg latency of ~12.7s because cortex sent up to 10
concurrent describe calls but trio-core gated them back to 1. Operators
had to set TRIO_VLM_API_CONCURRENCY=16 explicitly to unblock parallelism.

Raising the default to 16 makes the common remote-backend case work out
of the box. Operators can still lower it via TRIO_VLM_API_CONCURRENCY if
a provider rate-limits aggressively.

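Illustration only: a minimal asyncio sketch of the two-layer gating
described above (API semaphore outside, per-backend lock inside). None
of these names come from trio-core; no_lock, PeakCounter, handle,
peak_concurrency and simulate are hypothetical, and the sketch assumes
an asyncio.Lock for the local case, since the actual Lock type is not
shown in this patch.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def no_lock():
    """Hypothetical stand-in for a remote backend's no-op lock."""
    yield


class PeakCounter:
    """Tracks how many requests are inside the backend call at once."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    def enter(self):
        self.active += 1
        self.peak = max(self.peak, self.active)

    def exit(self):
        self.active -= 1


async def handle(semaphore, make_lock, counter):
    # Outer gate: the API-layer semaphore (vlm_api_concurrency).
    async with semaphore:
        # Inner gate: the backend's own lock (real lock or no-op).
        async with make_lock():
            counter.enter()
            try:
                await asyncio.sleep(0.01)  # pretend to run generation
            finally:
                counter.exit()


async def peak_concurrency(api_limit, requests, local):
    # Create primitives inside the running loop for older Python versions.
    semaphore = asyncio.Semaphore(api_limit)
    backend_lock = asyncio.Lock()
    make_lock = (lambda: backend_lock) if local else no_lock
    counter = PeakCounter()
    await asyncio.gather(
        *(handle(semaphore, make_lock, counter) for _ in range(requests))
    )
    return counter.peak


def simulate(api_limit, requests, local):
    """Return the peak number of simultaneous backend calls."""
    return asyncio.run(peak_concurrency(api_limit, requests, local))


if __name__ == "__main__":
    print("remote, limit=1 :", simulate(1, 10, local=False))
    print("remote, limit=16:", simulate(16, 10, local=False))
    print("local,  limit=16:", simulate(16, 10, local=True))
```

In this sketch, ten simulated requests against the no-op lock reach a
peak of 1 with a limit of 1 and a peak of 10 with a limit of 16, while
the locked (local) case stays at 1 regardless of the limit.
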
Co-Authored-By: Claude Opus 4.7
---
 src/trio_core/config.py | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/trio_core/config.py b/src/trio_core/config.py
index 687929e..73e2cf8 100644
--- a/src/trio_core/config.py
+++ b/src/trio_core/config.py
@@ -144,12 +144,14 @@ def from_env_file(
 
     # API-layer concurrency
     vlm_api_concurrency: int = Field(
-        default=1,
+        default=16,
         ge=1,
         description="Max concurrent VLM requests at the FastAPI handler. "
-        "Default 1 protects local GPU backends from contention. "
-        "Raise to 8-16 when remote_vlm_url is set, since the remote service "
-        "handles its own concurrency and the local lock is bypassed.",
+        "Local backends still serialize generation via their own "
+        "BaseBackend._lock, so a higher value here is safe — extra requests "
+        "just wait at the lock. Remote backends use nullcontext(), so this "
+        "value caps the actual number of parallel HTTPS calls. Lower it "
+        "if a remote provider rate-limits aggressively.",
     )
 
     # Cache (Phase 2)