From 46557affb39e341d1e29ebf796317305dcbb2566 Mon Sep 17 00:00:00 2001
From: Liuhaai <haixiang@iotex.io>
Date: Tue, 5 May 2026 22:16:24 -0700
Subject: [PATCH] =?UTF-8?q?config:=20raise=20vlm=5Fapi=5Fconcurrency=20def?=
 =?UTF-8?q?ault=201=20=E2=86=92=2016?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The API-layer semaphore in routers/inference.py:_get_vlm_semaphore was
introduced to protect local GPU backends from concurrent generation. It
defaulted to 1, which made sense before backend-owned locking landed in
PR #62.

Now that each backend owns its own _lock (BaseBackend._lock = Lock() for
local, RemoteHTTPBackend._lock = nullcontext() for remote), the API
semaphore is no longer the serialization point:

  - Local backends still serialize on their per-backend lock, so a higher
    semaphore value just lets requests wait at the lock instead of at the
    HTTP handler. Observable behavior is identical.
  - Remote backends use nullcontext, so the semaphore value directly
    controls how many HTTPS requests run in parallel against the
    upstream provider (e.g. DashScope).
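
The two cases above can be reproduced with a minimal sketch. It assumes
the handler uses an asyncio.Semaphore and hands generation off to a
worker thread; the Backend class, generate(), and run() are hypothetical
stand-ins, and only the _lock choice (Lock() vs nullcontext()) is taken
from this commit.

```python
import asyncio
import threading
import time
from contextlib import nullcontext


class Backend:
    """Hypothetical stand-in; only the `_lock` attribute mirrors the commit."""

    def __init__(self, lock):
        self._lock = lock                 # Lock() for local, nullcontext() for remote
        self._guard = threading.Lock()    # protects the counters below
        self._active = 0
        self.peak = 0                     # highest number of overlapping calls seen

    def generate(self):
        with self._lock:
            with self._guard:
                self._active += 1
                self.peak = max(self.peak, self._active)
            time.sleep(0.05)              # simulated generation / HTTPS round trip
            with self._guard:
                self._active -= 1


async def run(backend, api_concurrency, n_requests=8):
    """Send n_requests through an API-layer semaphore; return peak overlap."""
    sem = asyncio.Semaphore(api_concurrency)

    async def handle():
        async with sem:
            await asyncio.to_thread(backend.generate)

    await asyncio.gather(*(handle() for _ in range(n_requests)))
    return backend.peak


if __name__ == "__main__":
    print("local peak:", asyncio.run(run(Backend(threading.Lock()), 4)))
    print("remote peak:", asyncio.run(run(Backend(nullcontext()), 4)))
```

With the lock in place the peak stays at 1 regardless of the semaphore
value; with nullcontext() the peak is bounded by the semaphore value.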

In production (multi-camera cortex deployment with TRIO_REMOTE_VLM_URL
set), default=1 caused an average VLM latency of ~12.7s: cortex sent up
to 10 concurrent describe calls, but trio-core admitted only one at a
time. Operators had to set TRIO_VLM_API_CONCURRENCY=16 explicitly to
unblock parallelism.
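
The queuing effect can be illustrated with a back-of-the-envelope model.
It assumes all requests arrive at once and each upstream call takes the
same time; mean_latency() is a hypothetical helper, and the 2.0s service
time is an illustrative value, not a measurement from this deployment.

```python
def mean_latency(n_requests, concurrency, service_time):
    """Mean completion time when n_requests arrive together and at most
    `concurrency` of them run in parallel (idealized batch model)."""
    finish_times = [
        (i // concurrency + 1) * service_time  # request i finishes with its batch
        for i in range(n_requests)
    ]
    return sum(finish_times) / n_requests


if __name__ == "__main__":
    # Illustrative service time of 2.0s per call (assumption).
    print(mean_latency(10, 1, 2.0))    # serialized: 11.0
    print(mean_latency(10, 16, 2.0))   # fully parallel: 2.0
```

Under this model, serializing 10 simultaneous calls multiplies the mean
latency by 5.5 compared with running them all in parallel.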

Raising the default to 16 makes the common remote-backend case work out
of the box. Operators can still lower it via TRIO_VLM_API_CONCURRENCY if
a provider rate-limits aggressively.
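
As a rough sketch of the override semantics described above (not the
actual trio_core settings loader, which is a Field with ge=1), the
hypothetical resolve_vlm_api_concurrency() below falls back to 16 and
rejects values below 1. Only the TRIO_VLM_API_CONCURRENCY name, the
default, and the lower bound come from this commit.

```python
import os

DEFAULT_VLM_API_CONCURRENCY = 16  # new default introduced by this commit


def resolve_vlm_api_concurrency(environ=None):
    """Return the effective API-layer concurrency (hypothetical helper)."""
    env = os.environ if environ is None else environ
    raw = env.get("TRIO_VLM_API_CONCURRENCY")
    if raw is None:
        return DEFAULT_VLM_API_CONCURRENCY
    value = int(raw)
    if value < 1:  # mirrors the ge=1 constraint on the field
        raise ValueError("TRIO_VLM_API_CONCURRENCY must be >= 1")
    return value


if __name__ == "__main__":
    print(resolve_vlm_api_concurrency({}))
    print(resolve_vlm_api_concurrency({"TRIO_VLM_API_CONCURRENCY": "4"}))
```

An operator facing aggressive rate limits would export a lower value,
for example TRIO_VLM_API_CONCURRENCY=4, before starting the service.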

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 src/trio_core/config.py | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/trio_core/config.py b/src/trio_core/config.py
index 687929e..73e2cf8 100644
--- a/src/trio_core/config.py
+++ b/src/trio_core/config.py
@@ -144,12 +144,14 @@ def from_env_file(
 
     # API-layer concurrency
     vlm_api_concurrency: int = Field(
-        default=1,
+        default=16,
         ge=1,
         description="Max concurrent VLM requests at the FastAPI handler. "
-        "Default 1 protects local GPU backends from contention. "
-        "Raise to 8-16 when remote_vlm_url is set, since the remote service "
-        "handles its own concurrency and the local lock is bypassed.",
+        "Local backends still serialize generation via their own "
+        "BaseBackend._lock, so a higher value here is safe — extra requests "
+        "just wait at the lock. Remote backends use nullcontext(), so this "
+        "value caps the actual number of parallel HTTPS calls. Lower it "
+        "if a remote provider rate-limits aggressively.",
     )
 
     # Cache (Phase 2)