What problem does this solve?
There is no built-in way to measure how a deployed model performs under load. Users have to write their own scripts with curl or httpx, manually compute percentiles, and re-specify the sample input they already provided during deploy.
Proposed solution
New command:
inference-engine benchmark sentiment:v1 \
--host http://localhost:8000 \
--n 200 \
--concurrency 10 \
--sample-input "great movie"
A sample input is required. It is taken from --sample-input; if the flag is absent and deployment.json exists, sample_input is read from that file. If neither is available, the command exits with a clear message. --host defaults to http://localhost:8000. --n defaults to 100. --concurrency defaults to 1.
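The fallback order for the sample input could look roughly like the sketch below. The function name resolve_sample_input, the default file location (current directory), and the assumption that sample_input is a top-level key in deployment.json are all illustrative and not taken from the existing codebase.

```python
import json
from pathlib import Path


def resolve_sample_input(cli_value, deployment_path="deployment.json"):
    """Return the sample input from the CLI flag, else from deployment.json.

    Hypothetical helper: the file location and the top-level "sample_input"
    key are assumptions about how deploy stores its metadata.
    """
    # 1. An explicit --sample-input always wins.
    if cli_value is not None:
        return cli_value

    # 2. Otherwise fall back to the value recorded at deploy time.
    path = Path(deployment_path)
    if path.exists():
        data = json.loads(path.read_text())
        if "sample_input" in data:
            return data["sample_input"]

    # 3. Neither source is available: exit with a clear message.
    raise SystemExit(
        "error: no sample input found; pass --sample-input "
        f"or add sample_input to {deployment_path}"
    )
```

Raising SystemExit with a string prints the message to stderr and exits with status 1, which keeps the failure path short.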
Uses httpx.AsyncClient with a semaphore for concurrency control. Computes statistics with the standard-library statistics module. No new dependencies.
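A minimal sketch of the semaphore-bounded request loop, under a few assumptions: the names run_requests and benchmark_http are hypothetical, the request is a JSON POST, and the caller supplies the full URL and payload because the issue does not specify the server's endpoint or request shape. The core loop takes any async callable, so it can be exercised without a running server.

```python
import asyncio
import time


async def run_requests(send, n, concurrency):
    """Await `send()` n times with at most `concurrency` calls in flight.

    Returns (latencies_ms, errors, wall_seconds). Latencies are recorded
    in completion order and only for calls that did not raise.
    """
    sem = asyncio.Semaphore(concurrency)
    latencies = []
    errors = 0

    async def one():
        nonlocal errors
        async with sem:  # blocks while `concurrency` calls are already running
            start = time.perf_counter()
            try:
                await send()
            except Exception:
                errors += 1  # count the failure, keep the run going
                return
            latencies.append((time.perf_counter() - start) * 1000.0)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one() for _ in range(n)))
    return latencies, errors, time.perf_counter() - wall_start


async def benchmark_http(url, payload, n=100, concurrency=1):
    """Assumed wiring to httpx: POST `payload` as JSON to `url`."""
    import httpx  # imported here so run_requests stays dependency-free

    async with httpx.AsyncClient() as client:

        async def send():
            resp = await client.post(url, json=payload)
            resp.raise_for_status()  # treat 4xx/5xx responses as errors

        return await run_requests(send, n, concurrency)
```

Sharing one AsyncClient across all requests reuses connections, so the measured latency mostly reflects the server rather than connection setup.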
Output:
Model: sentiment:v1
Host: http://localhost:8000
Requests: 200
Concurrency: 10
──────────────────────────────────
p50: 8ms
p95: 23ms
p99: 41ms
Throughput: 87 req/s
Errors: 0 (0.0%)
Cold start: 142ms (first request excluded from percentiles)
Errors are counted but do not abort the run. If the error rate exceeds 10%, a warning is printed.
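The reported numbers could be derived along these lines. This is a sketch, not a spec: the function name summarize is hypothetical, the first recorded latency is assumed to be the cold-start request, throughput is assumed to mean total requests divided by wall-clock time, and the "inclusive" interpolation method is one reasonable choice among several.

```python
import statistics


def summarize(latencies_ms, errors, wall_seconds, warn_threshold=0.10):
    """Summarize one benchmark run.

    `latencies_ms` holds successful request latencies with the first
    request at index 0. Needs at least three entries, because
    statistics.quantiles requires two or more data points.
    """
    total = len(latencies_ms) + errors
    cold_start = latencies_ms[0]
    steady = latencies_ms[1:]  # first request excluded from percentiles

    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(steady, n=100, method="inclusive")
    error_rate = errors / total

    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "throughput": total / wall_seconds,  # assumed definition: req/s overall
        "errors": errors,
        "error_rate": error_rate,
        "cold_start": cold_start,
        "warn": error_rate > warn_threshold,  # strictly greater than 10%
    }
```

Returning a plain dict keeps formatting separate from computation, so the same result could later feed a machine-readable output mode if one is wanted.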
Alternatives considered
External tools like wrk or locust require separate installation and don't know the model's request shape or sample_input.
Area
CLI (deploy / fix)