Serving by r-chong · Pull Request #2 · mni-ml/transformer

r-chong · 2026-04-25T00:39:43Z

Key changes:

HTTP serving
request queue + bounded concurrency

Metrics:

latency p50:   4909.05 ms
latency p95:   17222.44 ms
latency avg:   5763.12 ms
TTFT p50:      46.38 ms
TTFT p95:      87.92 ms
decode tok/s:  82.03
throughput:    1.17 req/s
output tok/s:  74.84
wall time:     31640.40 ms

Copilot

Pull request overview

Adds an HTTP inference server and KV-cache generation path to support serving and benchmarking MiniGPT inference.

Changes:

Introduces a minimal HTTP server (/health, /stats, /generate) with a request queue and bounded concurrency.
Adds KV-cache generation implementation + CLI tooling (generate:kv, demo:kv) and related docs.
Switches the framework dependency to a local file:./framework setup and updates install/build instructions.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 11 comments.

Show a summary per file

File	Description
src/server.js	New HTTP inference server with queued/bounded-concurrency request handling
src/generation_kv.js	New KV-cache-capable model/generation implementation + benchmark formatting
src/generate_kv.js	New CLI for KV-cache generation benchmarking
src/demo_kv.js	New CLI demo comparing baseline vs KV-cache modes
src/generate.js	Improves tokenizer path validation (exists/dir checks)
src/generate_gpu.js	Improves tokenizer path validation (exists/dir checks)
package.json	Adds serve/kv scripts; switches `@mni-ml/framework` to local file dependency
package-lock.json	Updates lockfile for local `framework` dependency
README.md	Documents KV-cache + HTTP server usage and setup

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-25T00:45:09Z

+const { model, tokenizer, config, quantization } = loadModel(modelPath, tokenizerPath, { weightMode });
+console.log(
+  `Model ready: ${config.nLayer} layers, ${config.nHead} heads, ${config.nEmbd}-dim, ` +
+    `quantized_params=${quantization.quantizedParamCount}`


loadModel() from generation_kv.js currently returns { model, tokenizer, config } and ignores extra arguments, so quantization will be undefined here and the subsequent log/access (quantization.quantizedParamCount, etc.) will throw on startup. Either extend loadModel to support the weightMode option and return quantization, or update the server to match the existing return shape and remove the quantization usage.

Suggested change

const { model, tokenizer, config, quantization } = loadModel(modelPath, tokenizerPath, { weightMode });

console.log(

`Model ready: ${config.nLayer} layers, ${config.nHead} heads, ${config.nEmbd}-dim, ` +

`quantized_params=${quantization.quantizedParamCount}`

const { model, tokenizer, config } = loadModel(modelPath, tokenizerPath, { weightMode });

console.log(

`Model ready: ${config.nLayer} layers, ${config.nHead} heads, ${config.nEmbd}-dim`

Copilot · 2026-04-25T00:45:09Z

+        id: totalRequests + queue.length + active + 1,
+        body,


Request IDs can repeat because totalRequests only increments after inference completes; after the queue drains, the next request can reuse an earlier ID (e.g. totalRequests=1, active=0, queue=0 => id=2). Use a dedicated monotonic counter (e.g. nextRequestId++) so requestId is unique and stable regardless of queue timing.

Copilot · 2026-04-25T00:45:10Z

+const maxConcurrent = Number(process.env.MAX_CONCURRENT || 1);
+


MAX_CONCURRENT can parse to 0, NaN, or a negative number (e.g. env var set but not numeric), which will cause /generate requests to queue forever because active < maxConcurrent never becomes true. Validate this at startup (must be an integer >= 1) and fail fast with a clear error.

Suggested change

const maxConcurrent = Number(process.env.MAX_CONCURRENT || 1);

const rawMaxConcurrent = process.env.MAX_CONCURRENT;

const maxConcurrent =

rawMaxConcurrent === undefined || rawMaxConcurrent === ''

? 1

: Number(rawMaxConcurrent);

if (!Number.isInteger(maxConcurrent) || maxConcurrent < 1) {

console.error(

`Invalid MAX_CONCURRENT=${JSON.stringify(rawMaxConcurrent)}. ` +

'MAX_CONCURRENT must be an integer >= 1.'

);

process.exit(1);

}

Copilot · 2026-04-25T00:45:10Z

+const queue = [];
+let active = 0;


The request queue is unbounded and uses Array.shift(), which becomes O(n) per dequeue and can grow without limit under load (risking high memory usage / OOM). Consider adding a max queue depth with backpressure (e.g. reject with 429/503) and/or switching to a ring-buffer/linked-list style queue to avoid shift() copying costs.

Copilot · 2026-04-25T00:45:10Z

+| `src/generate_kv.js` | **KV-cache** sampling with the local `framework` branch |
+| `src/demo_kv.js` | baseline vs `kv-fp32` vs `kv-int8` benchmark/demo |
+| `src/server.js` | Minimal HTTP inference server (`POST /generate`) |
+| `src/benchmark_http.js` | HTTP load generator (p50/p95 latency, TTFT, throughput) |


The layout table references src/benchmark_http.js, but that file is not present in the repository. Update the docs to match what’s actually shipped, or add the missing file so readers can follow the instructions.

Suggested change

| `src/benchmark_http.js` | HTTP load generator (p50/p95 latency, TTFT, throughput) |

Copilot · 2026-04-25T00:45:10Z

+    req.on('data', (chunk) => {
+      body += chunk;
+      if (body.length > 2 * 1024 * 1024) {
+        reject(new Error('request body too large'));
+      }
+    });
+    req.on('end', () => {
+      if (!body) {
+        resolve({});
+        return;
+      }
+      try {
+        resolve(JSON.parse(body));
+      } catch {
+        reject(new Error('invalid json body'));
+      }
+    });
+    req.on('error', reject);


When the body exceeds the size limit, the promise is rejected but the request stream keeps flowing and continues appending to body, which can still consume memory and may trigger multiple rejections. After hitting the limit, stop reading (e.g. req.destroy() / remove listeners / pause) and ensure the rejection happens only once.

Suggested change

req.on('data', (chunk) => {

body += chunk;

if (body.length > 2 * 1024 * 1024) {

reject(new Error('request body too large'));

}

});

req.on('end', () => {

if (!body) {

resolve({});

return;

}

try {

resolve(JSON.parse(body));

} catch {

reject(new Error('invalid json body'));

}

});

req.on('error', reject);

let settled = false;

const cleanup = () => {

req.off('data', onData);

req.off('end', onEnd);

req.off('error', onError);

};

const resolveOnce = (value) => {

if (settled) return;

settled = true;

cleanup();

resolve(value);

};

const rejectOnce = (err) => {

if (settled) return;

settled = true;

cleanup();

reject(err);

};

const onData = (chunk) => {

if (settled) return;

body += chunk;

if (body.length > 2 * 1024 * 1024) {

req.pause();

rejectOnce(new Error('request body too large'));

req.destroy();

}

};

const onEnd = () => {

if (settled) return;

if (!body) {

resolveOnce({});

return;

}

try {

resolveOnce(JSON.parse(body));

} catch {

rejectOnce(new Error('invalid json body'));

}

};

const onError = (err) => {

rejectOnce(err);

};

req.on('data', onData);

req.on('end', onEnd);

req.on('error', onError);

Copilot · 2026-04-25T00:45:11Z

+    "generate:kv": "node src/generate_kv.js",
+    "demo:kv": "node src/demo_kv.js",
+    "serve": "node src/server.js",
+    "bench:http": "node src/benchmark_http.js",


bench:http points to src/benchmark_http.js, but that file is not present in the repository (so npm run bench:http will fail). Either add src/benchmark_http.js in this PR or remove/adjust the script entry.

Suggested change

"bench:http": "node src/benchmark_http.js",

Copilot · 2026-04-25T00:45:11Z

  },
  "dependencies": {
-    "@mni-ml/framework": "^0.3.4"
+    "@mni-ml/framework": "file:./framework"


The dependency is switched to "@mni-ml/framework": "file:./framework", but there is no framework/ directory in the repository tree. As-is, npm install will fail for fresh clones. Either include/commit the framework/ checkout (or add it as a submodule with setup docs) or keep using the published package and make the local checkout optional (e.g. via npm link / overrides).

Suggested change

"@mni-ml/framework": "file:./framework"

"@mni-ml/framework": "latest"

Copilot · 2026-04-25T00:45:11Z

+Build the local `framework` checkout first if you want the KV-cache demo path:
+
+```bash
+git -C framework checkout kv-cache-quantization
+npm --prefix framework install
+npm --prefix framework run build:native
+npm --prefix framework run build
+```


These instructions assume a framework/ directory exists (git -C framework ..., and later file:./framework), but the repo currently has no framework/ checkout. Either add the framework/ source (or submodule) as part of this PR, or adjust the dependency/docs so a separate clone/link step is explicit and installs work from a clean checkout.

Copilot · 2026-04-25T00:45:11Z

+  const scaled = new Array(vocabSize);
+  for (let i = 0; i < vocabSize; i++) {
+    scaled[i] = data[offset + i] / temperature;
+  }
+  const maxLogit = Math.max(...scaled);
+  const exps = scaled.map((value) => Math.exp(value - maxLogit));
+  const sumExps = exps.reduce((sum, value) => sum + value, 0);
+  const probs = exps.map((value) => value / sumExps);
+
+  const r = Math.random();
+  let cumulative = 0;
+  for (let i = 0; i < probs.length; i++) {
+    cumulative += probs[i];
+    if (r < cumulative) {
+      return i;
+    }
+  }
+  return probs.length - 1;


Math.max(...scaled) will spread vocabSize arguments onto the stack and allocates multiple intermediate arrays (scaled, exps, probs). With typical BPE vocab sizes (tens of thousands), this is a major hot-path cost and can hit argument-count limits. Compute maxLogit and the softmax normalization in simple loops (and avoid extra arrays) to reduce allocations and improve throughput.

Suggested change

const scaled = new Array(vocabSize);

for (let i = 0; i < vocabSize; i++) {

scaled[i] = data[offset + i] / temperature;

}

const maxLogit = Math.max(...scaled);

const exps = scaled.map((value) => Math.exp(value - maxLogit));

const sumExps = exps.reduce((sum, value) => sum + value, 0);

const probs = exps.map((value) => value / sumExps);

const r = Math.random();

let cumulative = 0;

for (let i = 0; i < probs.length; i++) {

cumulative += probs[i];

if (r < cumulative) {

return i;

}

}

return probs.length - 1;

let maxLogit = -Infinity;

for (let i = 0; i < vocabSize; i++) {

const scaledValue = data[offset + i] / temperature;

if (scaledValue > maxLogit) {

maxLogit = scaledValue;

}

}

let sumExps = 0;

for (let i = 0; i < vocabSize; i++) {

const scaledValue = data[offset + i] / temperature;

sumExps += Math.exp(scaledValue - maxLogit);

}

const r = Math.random() * sumExps;

let cumulative = 0;

for (let i = 0; i < vocabSize; i++) {

const scaledValue = data[offset + i] / temperature;

cumulative += Math.exp(scaledValue - maxLogit);

if (r < cumulative) {

return i;

}

}

return vocabSize - 1;

r-chong added 3 commits April 19, 2026 21:43

feat: add kv cache demo

e5d3058

fix: validate tokenizer paths

902539d

add http serving

3821efc

r-chong requested a review from Copilot April 25, 2026 00:40

Copilot started reviewing on behalf of r-chong April 25, 2026 00:40 View session

Copilot AI reviewed Apr 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Serving#2

Serving#2
r-chong wants to merge 3 commits into
mni-ml:mainfrom
r-chong:inference-v1

r-chong commented Apr 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 25, 2026

Uh oh!

Copilot AI Apr 25, 2026

Uh oh!

Copilot AI Apr 25, 2026

Uh oh!

Copilot AI Apr 25, 2026

Uh oh!

Copilot AI Apr 25, 2026

Uh oh!

Copilot AI Apr 25, 2026

Uh oh!

Copilot AI Apr 25, 2026

Uh oh!

Copilot AI Apr 25, 2026

Uh oh!

Copilot AI Apr 25, 2026

Uh oh!

Copilot AI Apr 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		const maxConcurrent = Number(process.env.MAX_CONCURRENT \|\| 1);

-const maxConcurrent = Number(process.env.MAX_CONCURRENT || 1);
+const rawMaxConcurrent = process.env.MAX_CONCURRENT;
+const maxConcurrent =
+  rawMaxConcurrent === undefined || rawMaxConcurrent === ''
+    ? 1
+    : Number(rawMaxConcurrent);
+if (!Number.isInteger(maxConcurrent) || maxConcurrent < 1) {
+  console.error(
+    `Invalid MAX_CONCURRENT=${JSON.stringify(rawMaxConcurrent)}. ` +
+      'MAX_CONCURRENT must be an integer >= 1.'
+  );
+  process.exit(1);
+}

-    req.on('data', (chunk) => {
-      body += chunk;
-      if (body.length > 2 * 1024 * 1024) {
-        reject(new Error('request body too large'));
-      }
-    });
-    req.on('end', () => {
-      if (!body) {
-        resolve({});
-        return;
-      }
-      try {
-        resolve(JSON.parse(body));
-      } catch {
-        reject(new Error('invalid json body'));
-      }
-    });
-    req.on('error', reject);
+    let settled = false;
+    const cleanup = () => {
+      req.off('data', onData);
+      req.off('end', onEnd);
+      req.off('error', onError);
+    };
+    const resolveOnce = (value) => {
+      if (settled) return;
+      settled = true;
+      cleanup();
+      resolve(value);
+    };
+    const rejectOnce = (err) => {
+      if (settled) return;
+      settled = true;
+      cleanup();
+      reject(err);
+    };
+    const onData = (chunk) => {
+      if (settled) return;
+      body += chunk;
+      if (body.length > 2 * 1024 * 1024) {
+        req.pause();
+        rejectOnce(new Error('request body too large'));
+        req.destroy();
+      }
+    };
+    const onEnd = () => {
+      if (settled) return;
+      if (!body) {
+        resolveOnce({});
+        return;
+      }
+      try {
+        resolveOnce(JSON.parse(body));
+      } catch {
+        rejectOnce(new Error('invalid json body'));
+      }
+    };
+    const onError = (err) => {
+      rejectOnce(err);
+    };
+    req.on('data', onData);
+    req.on('end', onEnd);
+    req.on('error', onError);

	"@mni-ml/framework": "file:./framework"
	"@mni-ml/framework": "latest"

Conversation

r-chong commented Apr 25, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants