Serving#2
Conversation
There was a problem hiding this comment.
Pull request overview
Adds an HTTP inference server and KV-cache generation path to support serving and benchmarking MiniGPT inference.
Changes:
- Introduces a minimal HTTP server (
/health,/stats,/generate) with a request queue and bounded concurrency. - Adds KV-cache generation implementation + CLI tooling (
generate:kv,demo:kv) and related docs. - Switches the framework dependency to a local
file:./frameworksetup and updates install/build instructions.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| src/server.js | New HTTP inference server with queued/bounded-concurrency request handling |
| src/generation_kv.js | New KV-cache-capable model/generation implementation + benchmark formatting |
| src/generate_kv.js | New CLI for KV-cache generation benchmarking |
| src/demo_kv.js | New CLI demo comparing baseline vs KV-cache modes |
| src/generate.js | Improves tokenizer path validation (exists/dir checks) |
| src/generate_gpu.js | Improves tokenizer path validation (exists/dir checks) |
| package.json | Adds serve/kv scripts; switches @mni-ml/framework to local file dependency |
| package-lock.json | Updates lockfile for local framework dependency |
| README.md | Documents KV-cache + HTTP server usage and setup |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const { model, tokenizer, config, quantization } = loadModel(modelPath, tokenizerPath, { weightMode }); | ||
| console.log( | ||
| `Model ready: ${config.nLayer} layers, ${config.nHead} heads, ${config.nEmbd}-dim, ` + | ||
| `quantized_params=${quantization.quantizedParamCount}` |
There was a problem hiding this comment.
loadModel() from generation_kv.js currently returns { model, tokenizer, config } and ignores extra arguments, so quantization will be undefined here and the subsequent log/access (quantization.quantizedParamCount, etc.) will throw on startup. Either extend loadModel to support the weightMode option and return quantization, or update the server to match the existing return shape and remove the quantization usage.
| const { model, tokenizer, config, quantization } = loadModel(modelPath, tokenizerPath, { weightMode }); | |
| console.log( | |
| `Model ready: ${config.nLayer} layers, ${config.nHead} heads, ${config.nEmbd}-dim, ` + | |
| `quantized_params=${quantization.quantizedParamCount}` | |
| const { model, tokenizer, config } = loadModel(modelPath, tokenizerPath, { weightMode }); | |
| console.log( | |
| `Model ready: ${config.nLayer} layers, ${config.nHead} heads, ${config.nEmbd}-dim` |
| id: totalRequests + queue.length + active + 1, | ||
| body, |
There was a problem hiding this comment.
Request IDs can repeat because totalRequests only increments after inference completes; after the queue drains, the next request can reuse an earlier ID (e.g. totalRequests=1, active=0, queue=0 => id=2). Use a dedicated monotonic counter (e.g. nextRequestId++) so requestId is unique and stable regardless of queue timing.
| const maxConcurrent = Number(process.env.MAX_CONCURRENT || 1); | ||
|
|
There was a problem hiding this comment.
MAX_CONCURRENT can parse to 0, NaN, or a negative number (e.g. env var set but not numeric), which will cause /generate requests to queue forever because active < maxConcurrent never becomes true. Validate this at startup (must be an integer >= 1) and fail fast with a clear error.
| const maxConcurrent = Number(process.env.MAX_CONCURRENT || 1); | |
| const rawMaxConcurrent = process.env.MAX_CONCURRENT; | |
| const maxConcurrent = | |
| rawMaxConcurrent === undefined || rawMaxConcurrent === '' | |
| ? 1 | |
| : Number(rawMaxConcurrent); | |
| if (!Number.isInteger(maxConcurrent) || maxConcurrent < 1) { | |
| console.error( | |
| `Invalid MAX_CONCURRENT=${JSON.stringify(rawMaxConcurrent)}. ` + | |
| 'MAX_CONCURRENT must be an integer >= 1.' | |
| ); | |
| process.exit(1); | |
| } |
| const queue = []; | ||
| let active = 0; |
There was a problem hiding this comment.
The request queue is unbounded and uses Array.shift(), which becomes O(n) per dequeue and can grow without limit under load (risking high memory usage / OOM). Consider adding a max queue depth with backpressure (e.g. reject with 429/503) and/or switching to a ring-buffer/linked-list style queue to avoid shift() copying costs.
| | `src/generate_kv.js` | **KV-cache** sampling with the local `framework` branch | | ||
| | `src/demo_kv.js` | baseline vs `kv-fp32` vs `kv-int8` benchmark/demo | | ||
| | `src/server.js` | Minimal HTTP inference server (`POST /generate`) | | ||
| | `src/benchmark_http.js` | HTTP load generator (p50/p95 latency, TTFT, throughput) | |
There was a problem hiding this comment.
The layout table references src/benchmark_http.js, but that file is not present in the repository. Update the docs to match what’s actually shipped, or add the missing file so readers can follow the instructions.
| | `src/benchmark_http.js` | HTTP load generator (p50/p95 latency, TTFT, throughput) | |
| req.on('data', (chunk) => { | ||
| body += chunk; | ||
| if (body.length > 2 * 1024 * 1024) { | ||
| reject(new Error('request body too large')); | ||
| } | ||
| }); | ||
| req.on('end', () => { | ||
| if (!body) { | ||
| resolve({}); | ||
| return; | ||
| } | ||
| try { | ||
| resolve(JSON.parse(body)); | ||
| } catch { | ||
| reject(new Error('invalid json body')); | ||
| } | ||
| }); | ||
| req.on('error', reject); |
There was a problem hiding this comment.
When the body exceeds the size limit, the promise is rejected but the request stream keeps flowing and continues appending to body, which can still consume memory and may trigger multiple rejections. After hitting the limit, stop reading (e.g. req.destroy() / remove listeners / pause) and ensure the rejection happens only once.
| req.on('data', (chunk) => { | |
| body += chunk; | |
| if (body.length > 2 * 1024 * 1024) { | |
| reject(new Error('request body too large')); | |
| } | |
| }); | |
| req.on('end', () => { | |
| if (!body) { | |
| resolve({}); | |
| return; | |
| } | |
| try { | |
| resolve(JSON.parse(body)); | |
| } catch { | |
| reject(new Error('invalid json body')); | |
| } | |
| }); | |
| req.on('error', reject); | |
| let settled = false; | |
| const cleanup = () => { | |
| req.off('data', onData); | |
| req.off('end', onEnd); | |
| req.off('error', onError); | |
| }; | |
| const resolveOnce = (value) => { | |
| if (settled) return; | |
| settled = true; | |
| cleanup(); | |
| resolve(value); | |
| }; | |
| const rejectOnce = (err) => { | |
| if (settled) return; | |
| settled = true; | |
| cleanup(); | |
| reject(err); | |
| }; | |
| const onData = (chunk) => { | |
| if (settled) return; | |
| body += chunk; | |
| if (body.length > 2 * 1024 * 1024) { | |
| req.pause(); | |
| rejectOnce(new Error('request body too large')); | |
| req.destroy(); | |
| } | |
| }; | |
| const onEnd = () => { | |
| if (settled) return; | |
| if (!body) { | |
| resolveOnce({}); | |
| return; | |
| } | |
| try { | |
| resolveOnce(JSON.parse(body)); | |
| } catch { | |
| rejectOnce(new Error('invalid json body')); | |
| } | |
| }; | |
| const onError = (err) => { | |
| rejectOnce(err); | |
| }; | |
| req.on('data', onData); | |
| req.on('end', onEnd); | |
| req.on('error', onError); |
| "generate:kv": "node src/generate_kv.js", | ||
| "demo:kv": "node src/demo_kv.js", | ||
| "serve": "node src/server.js", | ||
| "bench:http": "node src/benchmark_http.js", |
There was a problem hiding this comment.
bench:http points to src/benchmark_http.js, but that file is not present in the repository (so npm run bench:http will fail). Either add src/benchmark_http.js in this PR or remove/adjust the script entry.
| "bench:http": "node src/benchmark_http.js", |
| }, | ||
| "dependencies": { | ||
| "@mni-ml/framework": "^0.3.4" | ||
| "@mni-ml/framework": "file:./framework" |
There was a problem hiding this comment.
The dependency is switched to "@mni-ml/framework": "file:./framework", but there is no framework/ directory in the repository tree. As-is, npm install will fail for fresh clones. Either include/commit the framework/ checkout (or add it as a submodule with setup docs) or keep using the published package and make the local checkout optional (e.g. via npm link / overrides).
| "@mni-ml/framework": "file:./framework" | |
| "@mni-ml/framework": "latest" |
| Build the local `framework` checkout first if you want the KV-cache demo path: | ||
|
|
||
| ```bash | ||
| git -C framework checkout kv-cache-quantization | ||
| npm --prefix framework install | ||
| npm --prefix framework run build:native | ||
| npm --prefix framework run build | ||
| ``` |
There was a problem hiding this comment.
These instructions assume a framework/ directory exists (git -C framework ..., and later file:./framework), but the repo currently has no framework/ checkout. Either add the framework/ source (or submodule) as part of this PR, or adjust the dependency/docs so a separate clone/link step is explicit and installs work from a clean checkout.
| const scaled = new Array(vocabSize); | ||
| for (let i = 0; i < vocabSize; i++) { | ||
| scaled[i] = data[offset + i] / temperature; | ||
| } | ||
| const maxLogit = Math.max(...scaled); | ||
| const exps = scaled.map((value) => Math.exp(value - maxLogit)); | ||
| const sumExps = exps.reduce((sum, value) => sum + value, 0); | ||
| const probs = exps.map((value) => value / sumExps); | ||
|
|
||
| const r = Math.random(); | ||
| let cumulative = 0; | ||
| for (let i = 0; i < probs.length; i++) { | ||
| cumulative += probs[i]; | ||
| if (r < cumulative) { | ||
| return i; | ||
| } | ||
| } | ||
| return probs.length - 1; |
There was a problem hiding this comment.
Math.max(...scaled) will spread vocabSize arguments onto the stack and allocates multiple intermediate arrays (scaled, exps, probs). With typical BPE vocab sizes (tens of thousands), this is a major hot-path cost and can hit argument-count limits. Compute maxLogit and the softmax normalization in simple loops (and avoid extra arrays) to reduce allocations and improve throughput.
| const scaled = new Array(vocabSize); | |
| for (let i = 0; i < vocabSize; i++) { | |
| scaled[i] = data[offset + i] / temperature; | |
| } | |
| const maxLogit = Math.max(...scaled); | |
| const exps = scaled.map((value) => Math.exp(value - maxLogit)); | |
| const sumExps = exps.reduce((sum, value) => sum + value, 0); | |
| const probs = exps.map((value) => value / sumExps); | |
| const r = Math.random(); | |
| let cumulative = 0; | |
| for (let i = 0; i < probs.length; i++) { | |
| cumulative += probs[i]; | |
| if (r < cumulative) { | |
| return i; | |
| } | |
| } | |
| return probs.length - 1; | |
| let maxLogit = -Infinity; | |
| for (let i = 0; i < vocabSize; i++) { | |
| const scaledValue = data[offset + i] / temperature; | |
| if (scaledValue > maxLogit) { | |
| maxLogit = scaledValue; | |
| } | |
| } | |
| let sumExps = 0; | |
| for (let i = 0; i < vocabSize; i++) { | |
| const scaledValue = data[offset + i] / temperature; | |
| sumExps += Math.exp(scaledValue - maxLogit); | |
| } | |
| const r = Math.random() * sumExps; | |
| let cumulative = 0; | |
| for (let i = 0; i < vocabSize; i++) { | |
| const scaledValue = data[offset + i] / temperature; | |
| cumulative += Math.exp(scaledValue - maxLogit); | |
| if (r < cumulative) { | |
| return i; | |
| } | |
| } | |
| return vocabSize - 1; |
Key changes:
Metrics: