Summary
A modest keep-warm loop (one /v1/chat/completions request every 15s, max_tokens=1) against a local slot wedged hal0-api: subsequent /api/slots, /api/slots/{name}, and /v1/* requests kept timing out until I ran systemctl restart hal0-api.
Reproduction
In one shell on the LXC:
while true; do
curl -sS -X POST http://127.0.0.1:8001/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"qwen3-coder-reap-25b-a3b-q5km","messages":[{"role":"user","content":"hi"}],"max_tokens":1}' \
-o /dev/null --max-time 30
sleep 15
done
In another shell, after ~5 minutes:
curl http://127.0.0.1:8080/api/slots --max-time 20
# → curl: (28) Operation timed out
After systemctl restart hal0-api, normal service resumes.
Expected
A keep-warm loop at a few requests per minute (four, in this case) should never wedge the API. Either the per-upstream client pool should have bounded queueing with timeouts and a circuit breaker, or hal0-api should expose pool saturation in /api/health so operators can detect and react.
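To make the circuit-breaker half of that expectation concrete, here is a minimal sketch in Python. It assumes nothing about hal0-api's internals: the class name CircuitBreaker and the parameters failure_threshold and reset_after are hypothetical, and the clock is injectable only so the behaviour can be checked without sleeping.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, then
    allows a trial request once a cool-down period has passed."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value, tune per upstream
        self.reset_after = reset_after              # cool-down in seconds (assumed)
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self) -> bool:
        """Return True if a request may be sent to the upstream right now."""
        if self._opened_at is None:
            return True
        # After the cool-down the breaker is half-open and lets a trial through.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        """A successful call closes the breaker and clears the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure (for example a timeout); open once the threshold is hit."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()

    def state(self) -> str:
        """Human-readable state, suitable for a health endpoint."""
        if self._opened_at is None:
            return "closed"
        return "half-open" if self.allow() else "open"
```

With something like this in front of the upstream client, a request handler would check allow() first and fail fast with an error response while the breaker is open, instead of joining a queue behind calls that are already stuck.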
Hypothesis
The omnirouter's httpx client to lemond either:
- has an unbounded queue and never times out individual upstream calls
- shares a pool across /v1 and /api routes so saturation on one starves the other
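The following is a toy simulation of that hypothesis, not hal0-api code. It models the shared pool as an asyncio.Semaphore, a hung /v1 call as a coroutine that takes a slot and never returns, and /api/slots as a handler that needs a slot from the same pool. All names (hung_upstream_call, api_slots, probe) and the pool sizes are assumptions for illustration.

```python
import asyncio


async def hung_upstream_call(pool: asyncio.Semaphore) -> None:
    """Simulated /v1 call: takes a pool slot and never returns (no timeout)."""
    async with pool:
        await asyncio.Event().wait()  # the event is never set


async def api_slots(pool: asyncio.Semaphore) -> str:
    """Simulated /api/slots handler that needs a slot from the same pool."""
    async with pool:
        return "ok"


async def probe(pool_size: int, hung_calls: int, deadline: float = 0.2) -> str:
    """Start `hung_calls` stuck requests, then try /api/slots with a deadline."""
    pool = asyncio.Semaphore(pool_size)
    stuck = [asyncio.create_task(hung_upstream_call(pool)) for _ in range(hung_calls)]
    await asyncio.sleep(0)  # let the stuck calls grab their slots
    try:
        return await asyncio.wait_for(api_slots(pool), timeout=deadline)
    except asyncio.TimeoutError:
        return "timed out"
    finally:
        # Clean up the simulated stuck calls so the event loop can exit.
        for task in stuck:
            task.cancel()
        await asyncio.gather(*stuck, return_exceptions=True)


if __name__ == "__main__":
    print(asyncio.run(probe(pool_size=2, hung_calls=1)))  # one free slot left
    print(asyncio.run(probe(pool_size=2, hung_calls=2)))  # pool exhausted
```

If the real cause matches this model, the symptom would be the one observed: once enough upstream calls are stuck, unrelated routes that share the pool stop answering even though they do very little work themselves.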
Workaround
Don't run high-frequency keep-warm loops. Run hal0 slot load from a systemd timer at a 4-minute cadence instead. Tradeoff: lemond's own eviction (see #B4) still kicks in between timer firings.
Suggested fix area
- Per-upstream httpx client with per-request timeout (e.g. 5s for /v1/models probe, 60s for /v1/chat/completions).
- Bounded pool with explicit overflow handling instead of silent queue growth.
- /api/health should report upstream pool state.
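A rough sketch of how those three points could fit together, written against plain asyncio so it runs without hal0-api or httpx. Everything named here (BoundedUpstream, UpstreamBusy, max_in_flight, max_waiting, the shape of the health() dict) is hypothetical. In httpx itself the corresponding knobs would be httpx.Timeout, which includes a separate pool-acquisition timeout, and httpx.Limits(max_connections=...); how the omnirouter currently sets them is not known here.

```python
import asyncio


class UpstreamBusy(Exception):
    """Raised instead of queueing when the pool and its wait queue are full."""


class BoundedUpstream:
    """Hypothetical wrapper: bounded concurrency, bounded waiting, per-call timeout."""

    def __init__(self, max_in_flight: int = 4, max_waiting: int = 8) -> None:
        self.max_in_flight = max_in_flight  # assumed pool size
        self.max_waiting = max_waiting      # assumed queue bound
        self._slots = asyncio.Semaphore(max_in_flight)
        self.in_flight = 0
        self.waiting = 0
        self.rejected = 0
        self.timed_out = 0

    async def call(self, make_request, timeout: float):
        """Run `make_request()` (a coroutine function) under the pool limits."""
        # Explicit overflow handling: refuse rather than grow the queue silently.
        if self._slots.locked() and self.waiting >= self.max_waiting:
            self.rejected += 1
            raise UpstreamBusy("upstream pool saturated")
        self.waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self.waiting -= 1
        self.in_flight += 1
        try:
            # Per-request timeout, e.g. 5s for a probe, 60s for a completion.
            return await asyncio.wait_for(make_request(), timeout=timeout)
        except asyncio.TimeoutError:
            self.timed_out += 1
            raise
        finally:
            self.in_flight -= 1
            self._slots.release()

    def health(self) -> dict:
        """Pool state in a form a health endpoint could return."""
        return {
            "in_flight": self.in_flight,
            "waiting": self.waiting,
            "rejected": self.rejected,
            "timed_out": self.timed_out,
            "saturated": self._slots.locked(),
        }
```

A caller would wrap each upstream request, for example `await upstream.call(lambda: client.get(url), timeout=5.0)` with a hypothetical `client`, and map UpstreamBusy to an immediate error response. Because every in-flight call is bounded by its timeout, a hung upstream can hold a slot only for that long, and the counters make saturation visible before operators have to restart the service.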
Environment