
feat: international lab support with fuzzy matching + LLM auto-identification#3

Open
Chocksy wants to merge 1 commit into zmeyer44:main from Chocksy:pr/romanian-lab-support

Conversation


@Chocksy Chocksy commented Apr 7, 2026

Summary

  • Fuzzy normalizer: Levenshtein distance matching, token-based matching, canonical code consolidation — handles non-English lab reports (tested with Romanian labs, generalizes to any language)
  • LLM auto-identification: Unmatched analytes are sent to Gemini Flash which identifies them with LOINC codes, units, and reference ranges, auto-creates metric definitions
  • Configurable AI provider: AI_PROVIDER=gateway|openrouter env var with shared module
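
The fuzzy matching named above can be sketched roughly as follows. This is an illustrative sketch only: the names `levenshtein` and `fuzzyMatch` are assumptions, not the PR's actual API, and the real implementation lives in packages/ingestion/src/normalizer.ts.

```typescript
// Classic single-row dynamic-programming Levenshtein distance.
function levenshtein(a: string, b: string): number {
  const m = a.length, n = b.length;
  const dp: number[] = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    let prev = dp[0]; // dp[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= n; j++) {
      const tmp = dp[j]; // dp[i-1][j], saved before overwrite
      dp[j] = Math.min(
        dp[j] + 1,     // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      prev = tmp;
    }
  }
  return dp[n];
}

// Match an extracted analyte name against known aliases within an
// edit-distance budget, returning the closest alias or null.
function fuzzyMatch(
  analyte: string,
  aliases: string[],
  maxDistance = 2,
): string | null {
  const needle = analyte.toLowerCase().trim();
  let best: { alias: string; dist: number } | null = null;
  for (const alias of aliases) {
    const dist = levenshtein(needle, alias.toLowerCase());
    if (dist <= maxDistance && (!best || dist < best.dist)) {
      best = { alias, dist };
    }
  }
  return best ? best.alias : null;
}
```

With this shape, a Romanian label such as "Glucoza" lands on the alias "glucose" at edit distance 2 without any hard-coded translation.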

Key Changes

  • packages/ingestion/src/normalizer.ts — Fuzzy matching engine with Levenshtein distance, canonical code map, demographic-aware ranges
  • packages/ingestion/src/normalizer.test.ts — 30 unit tests
  • packages/ingestion/src/normalizer.integration.test.ts — 137 integration tests with real Romanian lab data
  • services/ingestion-worker/src/steps/auto-identify.ts — LLM-powered biomarker identification pipeline
  • services/ingestion-worker/src/lib/ai-provider.ts — Shared AI provider (gateway or OpenRouter)
  • packages/ai/src/prompts/extract-labs.ts — Enhanced prompt for international formats
  • packages/database/src/seed/data/romanian-lab-supplements.ts — 200+ Romanian aliases
  • packages/database/src/seed/data/metric-definitions.ts — Extended metric catalog

Test plan

  • Upload a non-English lab PDF and verify extraction + normalization
  • Run pnpm vitest run in packages/ingestion — expect 30+137 tests passing
  • Verify auto-identify creates new metric_definitions for unknown analytes
  • Test with AI_PROVIDER=openrouter and AI_PROVIDER=gateway


vercel Bot commented Apr 7, 2026

@Chocksy is attempting to deploy a commit to the Zach's Projects Team on Vercel.

A member of the Team first needs to authorize it.


greptile-apps Bot commented Apr 7, 2026

Greptile Summary

This PR adds international lab report support (initially targeting Romanian formats) via a fuzzy matching engine in the normalizer and a new LLM-powered autoIdentify step that creates metric_definitions on the fly for unknown analytes. There are two P1 defects in auto-identify.ts that must be fixed before merge.

  • SQL injection: unsanitized LLM-extracted analyte names are interpolated into a raw db.execute() query; single quotes in any analyte name enable injection via a crafted PDF.
  • Null-to-zero data corruption: when extraction.value is null and a unit conversion applies, null ?? 0 silently stores 0 as a numeric observation instead of preserving null.

Confidence Score: 3/5

Not safe to merge — two P1 defects affect every processed document

SQL injection via unsanitized LLM text in a raw db.execute() call and null-to-zero data corruption during unit conversion are both live defects on the primary processing path, not speculative risks. Both need fixing before this ships.

services/ingestion-worker/src/steps/auto-identify.ts (SQL injection + id validation), packages/ingestion/src/normalizer.ts (null coercion during unit conversion)

Important Files Changed

Filename Overview
services/ingestion-worker/src/steps/auto-identify.ts New LLM auto-identify pipeline: SQL injection via unsanitized analyte text, unvalidated LLM-generated DB id, provider inconsistency
packages/ingestion/src/normalizer.ts Fuzzy metric matching + unit conversion; null value coerces to 0 during conversion (P1)
services/ingestion-worker/src/parsers/lab-pdf.ts Lab PDF parser with scanned-PDF OCR path; contains dead pdfjs page-rendering loop that runs but does nothing
services/ingestion-worker/src/lib/ai-provider.ts Shared AI provider abstraction for gateway/openrouter; clean implementation
services/ingestion-worker/src/steps/normalize.ts Normalization step fetching metrics, conversions, demographics from DB; no issues
services/ingestion-worker/src/workflow.ts Orchestration workflow: integrates autoIdentify between normalize and materialize steps
packages/ai/src/prompts/extract-labs.ts Enhanced extraction prompt with Romanian translation examples and clearer output rules
packages/database/src/seed/data/romanian-lab-supplements.ts 200+ Romanian lab aliases and supplemental metric definitions
packages/ingestion/src/normalizer.integration.test.ts 137 integration tests covering Romanian lab data normalization
packages/ingestion/src/normalizer.test.ts 30 unit tests for matchMetric, convertUnit, resolveReferenceRange
packages/ingestion/src/types.ts Minor: re-exports normalizer types; no logic changes
packages/database/src/seed/data/metric-definitions.ts Extended metric catalog with additional biomarkers
services/ingestion-worker/src/steps/classify.ts Uses updated getModel/getModelId from ai-provider; no functional changes

Sequence Diagram

sequenceDiagram
    participant W as workflow.ts
    participant N as normalize.ts
    participant NR as normalizer.ts
    participant AI as auto-identify.ts
    participant LLM as LLM (Gemini Flash)
    participant DB as Database

    W->>N: normalize(extractions)
    N->>DB: fetch metricDefs, unitConversions, demographics
    N->>NR: normalizeExtractions()
    NR-->>N: {normalized[], flagged[]}
    N-->>W: NormalizeOutput

    W->>AI: autoIdentify(normResult, metricDefs, ...)
    AI->>AI: filter flagged[reason=unmatched_metric]
    alt unmatched.length > 0
        AI->>LLM: POST /chat/completions (analyte list)
        LLM-->>AI: IdentifiedBiomarker[]
        loop per identified biomarker
            AI->>DB: INSERT metric_definitions (id from LLM)
            Note over AI,DB: ⚠️ id not validated; raw SQL for alias update
        end
        AI->>NR: normalizeExtractions(resolved)
        NR-->>AI: reNormResult
    end
    AI-->>W: finalNormalization

    W->>W: materialize(finalNormalization)

Reviews (1): Last reviewed commit: "feat: international lab support with fuz..."

Comment on lines +195 to +197
await db.execute(
`UPDATE metric_definitions SET aliases = aliases::jsonb || '["${flagged.extraction.analyte.replace(/"/g, '\\"')}"]'::jsonb WHERE id = '${existing.id}'`,
);

P1 SQL injection via unsanitized analyte text

The .replace(/"/g, '\\"') only escapes double quotes. A single quote in an LLM-extracted analyte name (e.g., "O'Brien Factor") breaks out of the SQL string literals '["..."]' and '${existing.id}', enabling SQL injection from any crafted PDF document. Use JSON.stringify for the alias array and a parameterized approach:

await db.execute(
  sql`UPDATE metric_definitions
      SET aliases = aliases::jsonb || ${JSON.stringify([flagged.extraction.analyte])}::jsonb
      WHERE id = ${existing.id}`
);

Comment thread: packages/ingestion/src/normalizer.ts (Outdated)
Comment on lines 307 to 315
const converted = convertUnit(
extraction.value ?? 0,
extraction.unit,
metric.unit,
unitConversions,
metric.id,
);
if (converted !== null) {
finalValue = converted;

P1 Null value silently coerced to 0 during unit conversion

When extraction.value is null (qualitative result like "< 0.050") and a unit conversion exists, extraction.value ?? 0 passes 0 to convertUnit. If a conversion rule matches, 0 * multiplier + offset is stored as finalValue — overwriting null with a fabricated numeric zero. For CRP (mg/dL → mg/L, multiplier 10), a null result would be stored as 0 mg/L.

Guard the conversion path with a null check:

Suggested change
- const converted = convertUnit(
-   extraction.value ?? 0,
-   extraction.unit,
-   metric.unit,
-   unitConversions,
-   metric.id,
- );
- if (converted !== null) {
-   finalValue = converted;
+ const converted = extraction.value !== null
+   ? convertUnit(
+       extraction.value,
+       extraction.unit,
+       metric.unit,
+       unitConversions,
+       metric.id,
+     )
+   : null;

Comment on lines +147 to +149
.values({
id: match.id,
name: match.standardName,

P1 LLM-returned id used as DB primary key without format validation

The match.id string comes directly from the LLM response and is inserted as the metric_definitions primary key with no sanitization. The prompt requests a snake_case identifier (lowercase with underscores), but this isn't enforced. An adversarial PDF could coerce the model into returning an id that overwrites an existing metric (e.g., "glucose") or contains special characters. Add a guard before the insert:

if (!/^[a-z][a-z0-9_]{0,63}$/.test(match.id)) {
  console.warn(`[auto-identify] Skipping unsafe id: ${match.id}`);
  remainingFlagged.push(flagged);
  continue;
}

Comment on lines +87 to +111
const pdfjs = await import("pdfjs-dist/legacy/build/pdf.mjs");
// Resolve the worker file from node_modules (not relative to this source file)
const { createRequire } = await import("module");
const req = createRequire(import.meta.url);
pdfjs.GlobalWorkerOptions.workerSrc = req.resolve(
"pdfjs-dist/legacy/build/pdf.worker.mjs",
);
const doc = await pdfjs.getDocument({
data: new Uint8Array(pdfBuffer),
useWorkerFetch: false,
isEvalSupported: false,
useSystemFonts: true,
}).promise;
const pageImages: string[] = [];

for (let i = 1; i <= Math.min(doc.numPages, 10); i++) {
const page = await doc.getPage(i);
const viewport = page.getViewport({ scale: 2.0 }); // 2x for readability

// Create canvas-like rendering using node-canvas or sharp
// pdfjs-dist needs a canvas - use the OffscreenCanvas or render to PNG via sharp
// Simplest approach: send the PDF directly as base64 to Gemini (it supports PDF input)
// Actually, Gemini Flash supports PDF files directly via OpenRouter
break; // We'll send the whole PDF as a file
}

P2 Dead pdfjs page-rendering code

pdfjs.getDocument() is called, the first page is fetched (allocating memory for font/image decoding), and then break exits immediately without rendering anything. The pageImages array is always empty — the code that follows sends the raw pdfBase64 directly. The entire pdfjs block is dead weight that runs on every scanned PDF. Consider removing it:

// Remove lines 87-111 entirely; the PDF is sent as base64 directly below
const pdfBase64 = pdfBuffer.toString('base64');

Comment on lines +87 to +95
const response = await fetch(`${getOpenRouterBaseUrl()}/chat/completions`, {
method: "POST",
headers: getOpenRouterHeaders(),
body: JSON.stringify({
model: IDENTIFY_MODEL,
messages: [{ role: "user", content: prompt }],
temperature: 0,
}),
});

P2 auto-identify bypasses AI_PROVIDER=gateway routing

This step always calls getOpenRouterBaseUrl() (defaulting to https://openrouter.ai/api/v1) regardless of the AI_PROVIDER env var. The classify and parse steps use getModel() which correctly routes through the Vercel AI gateway when AI_PROVIDER=gateway, but autoIdentify hard-codes the OpenRouter path. The PR description claims gateway support for both providers. Consider refactoring to use generateText({ model: getModel(IDENTIFY_MODEL), ... }) so provider selection is consistent.

feat: international lab support with fuzzy matching + LLM auto-identification

Enhanced the ingestion pipeline to handle non-English lab reports (tested
extensively with Romanian labs) and automatically identify unknown biomarkers.

Normalizer improvements:
- Fuzzy string matching (Levenshtein distance) for analyte name resolution
- Canonical code mapping to consolidate duplicate metric codes
- Token-based matching for multi-word analyte names
- Demographic-aware reference ranges (age/sex)
- 30 unit tests + 137 integration tests with real Romanian lab data
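
The demographic-aware range resolution listed above could look roughly like this. A minimal sketch under assumed types: the interfaces and the specificity heuristic are illustrative, and the actual resolveReferenceRange exercised in normalizer.test.ts may differ.

```typescript
interface ReferenceRange {
  low: number;
  high: number;
  sex?: "male" | "female"; // undefined = applies to all sexes
  minAgeYears?: number;
  maxAgeYears?: number;
}

interface Demographics {
  sex: "male" | "female";
  ageYears: number;
}

// Pick the most demographic-specific range that matches the patient,
// falling back to an unrestricted range; null when nothing applies.
function resolveReferenceRange(
  ranges: ReferenceRange[],
  demo: Demographics,
): ReferenceRange | null {
  const applicable = ranges.filter(
    (r) =>
      (r.sex === undefined || r.sex === demo.sex) &&
      (r.minAgeYears === undefined || demo.ageYears >= r.minAgeYears) &&
      (r.maxAgeYears === undefined || demo.ageYears <= r.maxAgeYears),
  );
  if (applicable.length === 0) return null;
  // More constraints = more specific; prefer the most specific match.
  const specificity = (r: ReferenceRange) =>
    Number(r.sex !== undefined) +
    Number(r.minAgeYears !== undefined) +
    Number(r.maxAgeYears !== undefined);
  return [...applicable].sort((a, b) => specificity(b) - specificity(a))[0];
}
```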

LLM auto-identification pipeline:
- New auto-identify step: sends unmatched analytes to Gemini Flash
- Auto-creates metric_definitions with LOINC codes and reference ranges
- Adds aliases to existing metrics for future matching
- Re-normalizes resolved extractions automatically

AI provider abstraction:
- Configurable provider via AI_PROVIDER env var (gateway or openrouter)
- Shared module eliminates duplicated provider setup across files
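
The provider switch can be sketched as below. Names here (resolveProvider, getBaseUrl, AI_GATEWAY_URL) are assumptions for illustration, not the actual exports of ai-provider.ts; only the AI_PROVIDER=gateway|openrouter contract comes from the PR.

```typescript
type Provider = "gateway" | "openrouter";

// Read and validate the provider from an env map; the "gateway" default
// mirrors the PR description. Taking env as a parameter keeps this testable.
function resolveProvider(env: Record<string, string | undefined>): Provider {
  const p = env.AI_PROVIDER ?? "gateway";
  if (p !== "gateway" && p !== "openrouter") {
    throw new Error(`Unsupported AI_PROVIDER: ${p}`);
  }
  return p;
}

function getBaseUrl(env: Record<string, string | undefined>): string {
  return resolveProvider(env) === "openrouter"
    ? "https://openrouter.ai/api/v1"
    : env.AI_GATEWAY_URL ?? ""; // AI_GATEWAY_URL is a hypothetical variable
}
```

Routing every step through one such module is what the P2 comment below asks for: auto-identify currently hard-codes the OpenRouter path instead of consulting the shared provider.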

Other improvements:
- Enhanced extract-labs prompt for international lab formats
- Scanned PDF detection with OCR fallback via vision model
- Extended metric definitions seed data for common biomarkers
- Romanian lab supplement seed data (200+ aliases)
@Chocksy Chocksy force-pushed the pr/romanian-lab-support branch from efb1c75 to c3da940 on April 7, 2026 at 12:43
