Add inference-engine benchmark command for latency and throughput measurement #33

@AK11105

What problem does this solve?

There is no built-in way to measure how a deployed model performs under load. Users have to write their own scripts with curl or httpx, manually compute percentiles, and re-specify the sample input they already provided during deploy.

Proposed solution

New command:

inference-engine benchmark sentiment:v1 \
  --host http://localhost:8000 \
  --n 200 \
  --concurrency 10 \
  --sample-input "great movie"

--sample-input is optional when deployment.json exists: if the flag is absent, sample_input is read from that file. If neither is available, the command exits with a clear message. --host defaults to http://localhost:8000, --n to 100, and --concurrency to 1.
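The fallback order for the sample input could look roughly like the sketch below. It assumes deployment.json is a JSON object with a top-level sample_input key; the function name resolve_sample_input and the exact error wording are hypothetical, not part of the existing CLI.

```python
import json
from pathlib import Path
from typing import Any, Optional


def resolve_sample_input(cli_value: Optional[str], deployment_path: Path) -> Any:
    """Prefer the --sample-input flag, then fall back to deployment.json."""
    if cli_value is not None:
        return cli_value
    if deployment_path.exists():
        # Assumption: the file holds a JSON object with a "sample_input" key.
        data = json.loads(deployment_path.read_text())
        sample = data.get("sample_input")
        if sample is not None:
            return sample
    # Neither source is available: stop with an actionable message.
    raise SystemExit(
        "No sample input found: pass --sample-input or make sure "
        "deployment.json contains sample_input."
    )
```

Raising SystemExit with a string prints the message to stderr and exits with a non-zero status, which fits the "exits with a clear message" requirement without extra plumbing.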

Requests are sent with httpx.AsyncClient, with a semaphore capping how many are in flight at once. Percentiles are computed with the standard-library statistics module. No new dependencies.
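A minimal sketch of the semaphore approach, under some assumptions: the names run_load and benchmark are hypothetical, the request is assumed to be a JSON POST, and the endpoint URL and payload shape are left to the caller because the issue does not specify them. httpx is imported lazily so the concurrency logic can be exercised without a server.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Optional, Tuple

# One entry per request: (latency in seconds, error text or None).
Result = Tuple[float, Optional[str]]


async def run_load(
    send: Callable[[], Awaitable[None]], n: int, concurrency: int
) -> List[Result]:
    """Issue n requests with at most `concurrency` in flight at once."""
    sem = asyncio.Semaphore(concurrency)

    async def one() -> Result:
        async with sem:  # blocks while `concurrency` requests are running
            start = time.perf_counter()
            try:
                await send()
                return time.perf_counter() - start, None
            except Exception as exc:  # record the failure, keep the run going
                return time.perf_counter() - start, repr(exc)

    return list(await asyncio.gather(*(one() for _ in range(n))))


async def benchmark(
    url: str, payload: dict, n: int = 100, concurrency: int = 1
) -> List[Result]:
    """Assumed request shape: POST `payload` as JSON to `url`."""
    import httpx  # lazy import: only needed for real HTTP traffic

    async with httpx.AsyncClient(timeout=30.0) as client:

        async def send() -> None:
            resp = await client.post(url, json=payload)
            resp.raise_for_status()  # treat 4xx/5xx responses as errors

        return await run_load(send, n, concurrency)
```

Timing starts after the semaphore is acquired, so the recorded latency covers the request itself rather than time spent queueing for a slot.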

Output:

Model:        sentiment:v1
Host:         http://localhost:8000
Requests:     200
Concurrency:  10
──────────────────────────────────
p50:          8ms
p95:          23ms
p99:          41ms
Throughput:   87 req/s
Errors:       0 (0.0%)
Cold start:   142ms (first request excluded from percentiles)
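The figures in that report could be derived along these lines. This is a sketch, not the final implementation: summarize is a hypothetical helper, latencies are assumed to be in milliseconds in send order, throughput is assumed to count every request over wall-clock time, and the "inclusive" quantile method is one possible choice.

```python
import statistics
from typing import Dict, List


def summarize(latencies_ms: List[float], wall_seconds: float) -> Dict[str, float]:
    """Summarize a run; the first latency is reported as the cold start."""
    cold_start, warm = latencies_ms[0], latencies_ms[1:]
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    # It needs at least two warm samples on older Python versions.
    cuts = statistics.quantiles(warm, n=100, method="inclusive")
    return {
        "cold_start_ms": cold_start,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        # Assumption: throughput counts all requests, including the first.
        "throughput_rps": len(latencies_ms) / wall_seconds,
    }
```

With small --n values the p99 is dominated by one or two samples, so the command may want to note that in its output or enforce a minimum request count.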

Errors are counted but do not abort the run. If error rate > 10%, a warning is printed.
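The error line and the 10% warning could be produced as sketched here. The helper names and the warning text are hypothetical; only the threshold and the "Errors:" line format come from the proposal above.

```python
from typing import Optional

ERROR_RATE_WARN = 0.10  # proposal: warn when the error rate is above 10%


def format_errors(errors: int, total: int) -> str:
    """Render the 'Errors:' line in the same layout as the sample output."""
    rate = errors / total if total else 0.0
    return f"Errors:       {errors} ({rate:.1%})"


def error_warning(errors: int, total: int) -> Optional[str]:
    """Return a warning string if the error rate exceeds the threshold."""
    rate = errors / total if total else 0.0
    if rate > ERROR_RATE_WARN:
        return f"Warning: error rate {rate:.1%} exceeds {ERROR_RATE_WARN:.0%}"
    return None
```

The comparison is strict, so a run with exactly 10% errors prints no warning, matching the "> 10%" wording.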

Alternatives considered

External tools like wrk or locust. These require separate installation and don't know the model's request shape or sample_input.

Area

CLI (deploy / fix)

Labels

enhancement: New feature or request
