Skip to content

Serving#2

Open
r-chong wants to merge 3 commits into
mni-ml:mainfrom
r-chong:inference-v1
Open

Serving#2
r-chong wants to merge 3 commits into
mni-ml:mainfrom
r-chong:inference-v1

Conversation

@r-chong
Copy link
Copy Markdown

@r-chong r-chong commented Apr 25, 2026

Key changes:

  • HTTP serving
  • request queue + bounded concurrency

Metrics:

latency p50:   4909.05 ms
latency p95:   17222.44 ms
latency avg:   5763.12 ms
TTFT p50:      46.38 ms
TTFT p95:      87.92 ms
decode tok/s:  82.03
throughput:    1.17 req/s
output tok/s:  74.84
wall time:     31640.40 ms

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an HTTP inference server and KV-cache generation path to support serving and benchmarking MiniGPT inference.

Changes:

  • Introduces a minimal HTTP server (/health, /stats, /generate) with a request queue and bounded concurrency.
  • Adds KV-cache generation implementation + CLI tooling (generate:kv, demo:kv) and related docs.
  • Switches the framework dependency to a local file:./framework setup and updates install/build instructions.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
src/server.js New HTTP inference server with queued/bounded-concurrency request handling
src/generation_kv.js New KV-cache-capable model/generation implementation + benchmark formatting
src/generate_kv.js New CLI for KV-cache generation benchmarking
src/demo_kv.js New CLI demo comparing baseline vs KV-cache modes
src/generate.js Improves tokenizer path validation (exists/dir checks)
src/generate_gpu.js Improves tokenizer path validation (exists/dir checks)
package.json Adds serve/kv scripts; switches @mni-ml/framework to local file dependency
package-lock.json Updates lockfile for local framework dependency
README.md Documents KV-cache + HTTP server usage and setup

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/server.js
Comment on lines +31 to +34
const { model, tokenizer, config, quantization } = loadModel(modelPath, tokenizerPath, { weightMode });
console.log(
`Model ready: ${config.nLayer} layers, ${config.nHead} heads, ${config.nEmbd}-dim, ` +
`quantized_params=${quantization.quantizedParamCount}`
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loadModel() from generation_kv.js currently returns { model, tokenizer, config } and ignores extra arguments, so quantization will be undefined here and the subsequent log/access (quantization.quantizedParamCount, etc.) will throw on startup. Either extend loadModel to support the weightMode option and return quantization, or update the server to match the existing return shape and remove the quantization usage.

Suggested change
const { model, tokenizer, config, quantization } = loadModel(modelPath, tokenizerPath, { weightMode });
console.log(
`Model ready: ${config.nLayer} layers, ${config.nHead} heads, ${config.nEmbd}-dim, ` +
`quantized_params=${quantization.quantizedParamCount}`
const { model, tokenizer, config } = loadModel(modelPath, tokenizerPath, { weightMode });
console.log(
`Model ready: ${config.nLayer} layers, ${config.nHead} heads, ${config.nEmbd}-dim`

Copilot uses AI. Check for mistakes.
Comment thread src/server.js
Comment on lines +158 to +159
id: totalRequests + queue.length + active + 1,
body,
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Request IDs can repeat because totalRequests only increments after inference completes; after the queue drains, the next request can reuse an earlier ID (e.g. totalRequests=1, active=0, queue=0 => id=2). Use a dedicated monotonic counter (e.g. nextRequestId++) so requestId is unique and stable regardless of queue timing.

Copilot uses AI. Check for mistakes.
Comment thread src/server.js
Comment on lines +19 to +20
const maxConcurrent = Number(process.env.MAX_CONCURRENT || 1);

Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAX_CONCURRENT can parse to 0, NaN, or a negative number (e.g. env var set but not numeric), which will cause /generate requests to queue forever because active < maxConcurrent never becomes true. Validate this at startup (must be an integer >= 1) and fail fast with a clear error.

Suggested change
const maxConcurrent = Number(process.env.MAX_CONCURRENT || 1);
const rawMaxConcurrent = process.env.MAX_CONCURRENT;
const maxConcurrent =
rawMaxConcurrent === undefined || rawMaxConcurrent === ''
? 1
: Number(rawMaxConcurrent);
if (!Number.isInteger(maxConcurrent) || maxConcurrent < 1) {
console.error(
`Invalid MAX_CONCURRENT=${JSON.stringify(rawMaxConcurrent)}. ` +
'MAX_CONCURRENT must be an integer >= 1.'
);
process.exit(1);
}

Copilot uses AI. Check for mistakes.
Comment thread src/server.js
Comment on lines +37 to +38
const queue = [];
let active = 0;
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The request queue is unbounded and uses Array.shift(), which becomes O(n) per dequeue and can grow without limit under load (risking high memory usage / OOM). Consider adding a max queue depth with backpressure (e.g. reject with 429/503) and/or switching to a ring-buffer/linked-list style queue to avoid shift() copying costs.

Copilot uses AI. Check for mistakes.
Comment thread README.md
| `src/generate_kv.js` | **KV-cache** sampling with the local `framework` branch |
| `src/demo_kv.js` | baseline vs `kv-fp32` vs `kv-int8` benchmark/demo |
| `src/server.js` | Minimal HTTP inference server (`POST /generate`) |
| `src/benchmark_http.js` | HTTP load generator (p50/p95 latency, TTFT, throughput) |
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The layout table references src/benchmark_http.js, but that file is not present in the repository. Update the docs to match what’s actually shipped, or add the missing file so readers can follow the instructions.

Suggested change
| `src/benchmark_http.js` | HTTP load generator (p50/p95 latency, TTFT, throughput) |

Copilot uses AI. Check for mistakes.
Comment thread src/server.js
Comment on lines +50 to +67
req.on('data', (chunk) => {
body += chunk;
if (body.length > 2 * 1024 * 1024) {
reject(new Error('request body too large'));
}
});
req.on('end', () => {
if (!body) {
resolve({});
return;
}
try {
resolve(JSON.parse(body));
} catch {
reject(new Error('invalid json body'));
}
});
req.on('error', reject);
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the body exceeds the size limit, the promise is rejected but the request stream keeps flowing and continues appending to body, which can still consume memory and may trigger multiple rejections. After hitting the limit, stop reading (e.g. req.destroy() / remove listeners / pause) and ensure the rejection happens only once.

Suggested change
req.on('data', (chunk) => {
body += chunk;
if (body.length > 2 * 1024 * 1024) {
reject(new Error('request body too large'));
}
});
req.on('end', () => {
if (!body) {
resolve({});
return;
}
try {
resolve(JSON.parse(body));
} catch {
reject(new Error('invalid json body'));
}
});
req.on('error', reject);
let settled = false;
const cleanup = () => {
req.off('data', onData);
req.off('end', onEnd);
req.off('error', onError);
};
const resolveOnce = (value) => {
if (settled) return;
settled = true;
cleanup();
resolve(value);
};
const rejectOnce = (err) => {
if (settled) return;
settled = true;
cleanup();
reject(err);
};
const onData = (chunk) => {
if (settled) return;
body += chunk;
if (body.length > 2 * 1024 * 1024) {
req.pause();
rejectOnce(new Error('request body too large'));
req.destroy();
}
};
const onEnd = () => {
if (settled) return;
if (!body) {
resolveOnce({});
return;
}
try {
resolveOnce(JSON.parse(body));
} catch {
rejectOnce(new Error('invalid json body'));
}
};
const onError = (err) => {
rejectOnce(err);
};
req.on('data', onData);
req.on('end', onEnd);
req.on('error', onError);

Copilot uses AI. Check for mistakes.
Comment thread package.json
"generate:kv": "node src/generate_kv.js",
"demo:kv": "node src/demo_kv.js",
"serve": "node src/server.js",
"bench:http": "node src/benchmark_http.js",
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bench:http points to src/benchmark_http.js, but that file is not present in the repository (so npm run bench:http will fail). Either add src/benchmark_http.js in this PR or remove/adjust the script entry.

Suggested change
"bench:http": "node src/benchmark_http.js",

Copilot uses AI. Check for mistakes.
Comment thread package.json
},
"dependencies": {
"@mni-ml/framework": "^0.3.4"
"@mni-ml/framework": "file:./framework"
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dependency is switched to "@mni-ml/framework": "file:./framework", but there is no framework/ directory in the repository tree. As-is, npm install will fail for fresh clones. Either include/commit the framework/ checkout (or add it as a submodule with setup docs) or keep using the published package and make the local checkout optional (e.g. via npm link / overrides).

Suggested change
"@mni-ml/framework": "file:./framework"
"@mni-ml/framework": "latest"

Copilot uses AI. Check for mistakes.
Comment thread README.md
Comment on lines +35 to +42
Build the local `framework` checkout first if you want the KV-cache demo path:

```bash
git -C framework checkout kv-cache-quantization
npm --prefix framework install
npm --prefix framework run build:native
npm --prefix framework run build
```
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These instructions assume a framework/ directory exists (git -C framework ..., and later file:./framework), but the repo currently has no framework/ checkout. Either add the framework/ source (or submodule) as part of this PR, or adjust the dependency/docs so a separate clone/link step is explicit and installs work from a clean checkout.

Copilot uses AI. Check for mistakes.
Comment thread src/generation_kv.js
Comment on lines +62 to +79
const scaled = new Array(vocabSize);
for (let i = 0; i < vocabSize; i++) {
scaled[i] = data[offset + i] / temperature;
}
const maxLogit = Math.max(...scaled);
const exps = scaled.map((value) => Math.exp(value - maxLogit));
const sumExps = exps.reduce((sum, value) => sum + value, 0);
const probs = exps.map((value) => value / sumExps);

const r = Math.random();
let cumulative = 0;
for (let i = 0; i < probs.length; i++) {
cumulative += probs[i];
if (r < cumulative) {
return i;
}
}
return probs.length - 1;
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Math.max(...scaled) will spread vocabSize arguments onto the stack and allocates multiple intermediate arrays (scaled, exps, probs). With typical BPE vocab sizes (tens of thousands), this is a major hot-path cost and can hit argument-count limits. Compute maxLogit and the softmax normalization in simple loops (and avoid extra arrays) to reduce allocations and improve throughput.

Suggested change
const scaled = new Array(vocabSize);
for (let i = 0; i < vocabSize; i++) {
scaled[i] = data[offset + i] / temperature;
}
const maxLogit = Math.max(...scaled);
const exps = scaled.map((value) => Math.exp(value - maxLogit));
const sumExps = exps.reduce((sum, value) => sum + value, 0);
const probs = exps.map((value) => value / sumExps);
const r = Math.random();
let cumulative = 0;
for (let i = 0; i < probs.length; i++) {
cumulative += probs[i];
if (r < cumulative) {
return i;
}
}
return probs.length - 1;
let maxLogit = -Infinity;
for (let i = 0; i < vocabSize; i++) {
const scaledValue = data[offset + i] / temperature;
if (scaledValue > maxLogit) {
maxLogit = scaledValue;
}
}
let sumExps = 0;
for (let i = 0; i < vocabSize; i++) {
const scaledValue = data[offset + i] / temperature;
sumExps += Math.exp(scaledValue - maxLogit);
}
const r = Math.random() * sumExps;
let cumulative = 0;
for (let i = 0; i < vocabSize; i++) {
const scaledValue = data[offset + i] / temperature;
cumulative += Math.exp(scaledValue - maxLogit);
if (r < cumulative) {
return i;
}
}
return vocabSize - 1;

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants