
feat: international lab support with fuzzy matching + LLM auto-identification#3

Open
Chocksy wants to merge 1 commit into zmeyer44:main from Chocksy:pr/romanian-lab-support

Conversation


@Chocksy Chocksy commented Apr 7, 2026

Summary

  • Fuzzy normalizer: Levenshtein distance matching, token-based matching, canonical code consolidation — handles non-English lab reports (tested with Romanian labs, generalizes to any language)
  • LLM auto-identification: Unmatched analytes are sent to Gemini Flash which identifies them with LOINC codes, units, and reference ranges, auto-creates metric definitions
  • Configurable AI provider: AI_PROVIDER=gateway|openrouter env var with shared module
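
The fuzzy matching named above can be sketched roughly as follows. This is an illustrative sketch only: the names `levenshtein` and `fuzzyMatch` are assumptions, not the PR's actual API, and the real implementation lives in packages/ingestion/src/normalizer.ts.

```typescript
// Classic single-row dynamic-programming Levenshtein distance.
function levenshtein(a: string, b: string): number {
  const m = a.length, n = b.length;
  const dp: number[] = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    let prev = dp[0]; // dp[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= n; j++) {
      const tmp = dp[j]; // dp[i-1][j], saved before overwrite
      dp[j] = Math.min(
        dp[j] + 1,     // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      prev = tmp;
    }
  }
  return dp[n];
}

// Match an extracted analyte name against known aliases within an
// edit-distance budget, returning the closest alias or null.
function fuzzyMatch(
  analyte: string,
  aliases: string[],
  maxDistance = 2,
): string | null {
  const needle = analyte.toLowerCase().trim();
  let best: { alias: string; dist: number } | null = null;
  for (const alias of aliases) {
    const dist = levenshtein(needle, alias.toLowerCase());
    if (dist <= maxDistance && (!best || dist < best.dist)) {
      best = { alias, dist };
    }
  }
  return best ? best.alias : null;
}
```

With this shape, a Romanian label such as "Glucoza" lands on the alias "glucose" at edit distance 2 without any hard-coded translation.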

Key Changes

  • packages/ingestion/src/normalizer.ts — Fuzzy matching engine with Levenshtein distance, canonical code map, demographic-aware ranges
  • packages/ingestion/src/normalizer.test.ts — 30 unit tests
  • packages/ingestion/src/normalizer.integration.test.ts — 137 integration tests with real Romanian lab data
  • services/ingestion-worker/src/steps/auto-identify.ts — LLM-powered biomarker identification pipeline
  • services/ingestion-worker/src/lib/ai-provider.ts — Shared AI provider (gateway or OpenRouter)
  • packages/ai/src/prompts/extract-labs.ts — Enhanced prompt for international formats
  • packages/database/src/seed/data/romanian-lab-supplements.ts — 200+ Romanian aliases
  • packages/database/src/seed/data/metric-definitions.ts — Extended metric catalog

Test plan

  • Upload a non-English lab PDF and verify extraction + normalization
  • Run pnpm vitest run in packages/ingestion — expect 30+137 tests passing
  • Verify auto-identify creates new metric_definitions for unknown analytes
  • Test with AI_PROVIDER=openrouter and AI_PROVIDER=gateway


vercel Bot commented Apr 7, 2026

@Chocksy is attempting to deploy a commit to the Zach's Projects Team on Vercel.

A member of the Team first needs to authorize it.


greptile-apps Bot commented Apr 7, 2026

Greptile Summary

This PR adds international lab report support (initially targeting Romanian formats) via a fuzzy matching engine in the normalizer and a new LLM-powered autoIdentify step that creates metric_definitions on the fly for unknown analytes. There are two P1 defects in auto-identify.ts that must be fixed before merge.

  • SQL injection: unsanitized LLM-extracted analyte names are interpolated into a raw db.execute() query; single quotes in any analyte name enable injection via a crafted PDF.
  • Null-to-zero data corruption: when extraction.value is null and a unit conversion applies, null ?? 0 silently stores 0 as a numeric observation instead of preserving null.

Confidence Score: 3/5

Not safe to merge — two P1 defects affect every processed document

SQL injection via unsanitized LLM text in a raw db.execute() call and null-to-zero data corruption during unit conversion are both live defects on the primary processing path, not speculative risks. Both need fixing before this ships.

services/ingestion-worker/src/steps/auto-identify.ts (SQL injection + id validation), packages/ingestion/src/normalizer.ts (null coercion during unit conversion)

Important Files Changed

Filename Overview
services/ingestion-worker/src/steps/auto-identify.ts New LLM auto-identify pipeline: SQL injection via unsanitized analyte text, unvalidated LLM-generated DB id, provider inconsistency
packages/ingestion/src/normalizer.ts Fuzzy metric matching + unit conversion; null value coerces to 0 during conversion (P1)
services/ingestion-worker/src/parsers/lab-pdf.ts Lab PDF parser with scanned-PDF OCR path; contains dead pdfjs page-rendering loop that runs but does nothing
services/ingestion-worker/src/lib/ai-provider.ts Shared AI provider abstraction for gateway/openrouter; clean implementation
services/ingestion-worker/src/steps/normalize.ts Normalization step fetching metrics, conversions, demographics from DB; no issues
services/ingestion-worker/src/workflow.ts Orchestration workflow: integrates autoIdentify between normalize and materialize steps
packages/ai/src/prompts/extract-labs.ts Enhanced extraction prompt with Romanian translation examples and clearer output rules
packages/database/src/seed/data/romanian-lab-supplements.ts 200+ Romanian lab aliases and supplemental metric definitions
packages/ingestion/src/normalizer.integration.test.ts 137 integration tests covering Romanian lab data normalization
packages/ingestion/src/normalizer.test.ts 30 unit tests for matchMetric, convertUnit, resolveReferenceRange
packages/ingestion/src/types.ts Minor: re-exports normalizer types; no logic changes
packages/database/src/seed/data/metric-definitions.ts Extended metric catalog with additional biomarkers
services/ingestion-worker/src/steps/classify.ts Uses updated getModel/getModelId from ai-provider; no functional changes

Sequence Diagram

sequenceDiagram
    participant W as workflow.ts
    participant N as normalize.ts
    participant NR as normalizer.ts
    participant AI as auto-identify.ts
    participant LLM as LLM (Gemini Flash)
    participant DB as Database

    W->>N: normalize(extractions)
    N->>DB: fetch metricDefs, unitConversions, demographics
    N->>NR: normalizeExtractions()
    NR-->>N: {normalized[], flagged[]}
    N-->>W: NormalizeOutput

    W->>AI: autoIdentify(normResult, metricDefs, ...)
    AI->>AI: filter flagged[reason=unmatched_metric]
    alt unmatched.length > 0
        AI->>LLM: POST /chat/completions (analyte list)
        LLM-->>AI: IdentifiedBiomarker[]
        loop per identified biomarker
            AI->>DB: INSERT metric_definitions (id from LLM)
            Note over AI,DB: ⚠️ id not validated; raw SQL for alias update
        end
        AI->>NR: normalizeExtractions(resolved)
        NR-->>AI: reNormResult
    end
    AI-->>W: finalNormalization

    W->>W: materialize(finalNormalization)

Reviews (1): Last reviewed commit: "feat: international lab support with fuz..."

Comment on lines +195 to +197
await db.execute(
`UPDATE metric_definitions SET aliases = aliases::jsonb || '["${flagged.extraction.analyte.replace(/"/g, '\\"')}"]'::jsonb WHERE id = '${existing.id}'`,
);

P1 SQL injection via unsanitized analyte text

The .replace(/"/g, '\\"') only escapes double quotes. A single quote in an LLM-extracted analyte name (e.g., "O'Brien Factor") breaks out of the SQL string literals '["..."]' and '${existing.id}', enabling SQL injection from any crafted PDF document. Use JSON.stringify for the alias array and a parameterized approach:

await db.execute(
  sql`UPDATE metric_definitions
      SET aliases = aliases::jsonb || ${JSON.stringify([flagged.extraction.analyte])}::jsonb
      WHERE id = ${existing.id}`
);

Comment thread: packages/ingestion/src/normalizer.ts (Outdated)
Comment on lines 307 to 315
const converted = convertUnit(
extraction.value ?? 0,
extraction.unit,
metric.unit,
unitConversions,
metric.id,
);
if (converted !== null) {
finalValue = converted;

P1 Null value silently coerced to 0 during unit conversion

When extraction.value is null (qualitative result like "< 0.050") and a unit conversion exists, extraction.value ?? 0 passes 0 to convertUnit. If a conversion rule matches, 0 * multiplier + offset is stored as finalValue — overwriting null with a fabricated numeric zero. For CRP (mg/dL → mg/L, multiplier 10), a null result would be stored as 0 mg/L.

Guard the conversion path with a null check:

Suggested change
- const converted = convertUnit(
-   extraction.value ?? 0,
-   extraction.unit,
-   metric.unit,
-   unitConversions,
-   metric.id,
- );
- if (converted !== null) {
-   finalValue = converted;
+ const converted = extraction.value !== null
+   ? convertUnit(
+       extraction.value,
+       extraction.unit,
+       metric.unit,
+       unitConversions,
+       metric.id,
+     )
+   : null;

Comment on lines +147 to +149
.values({
id: match.id,
name: match.standardName,

P1 LLM-returned id used as DB primary key without format validation

The match.id string comes directly from the LLM response and is inserted as the metric_definitions primary key with no sanitization. The prompt requests a snake_case identifier (lowercase with underscores), but this isn't enforced. An adversarial PDF could coerce the model into returning an id that overwrites an existing metric (e.g., "glucose") or contains special characters. Add a guard before the insert:

if (!/^[a-z][a-z0-9_]{0,63}$/.test(match.id)) {
  console.warn(`[auto-identify] Skipping unsafe id: ${match.id}`);
  remainingFlagged.push(flagged);
  continue;
}

Comment on lines +87 to +111
const pdfjs = await import("pdfjs-dist/legacy/build/pdf.mjs");
// Resolve the worker file from node_modules (not relative to this source file)
const { createRequire } = await import("module");
const req = createRequire(import.meta.url);
pdfjs.GlobalWorkerOptions.workerSrc = req.resolve(
"pdfjs-dist/legacy/build/pdf.worker.mjs",
);
const doc = await pdfjs.getDocument({
data: new Uint8Array(pdfBuffer),
useWorkerFetch: false,
isEvalSupported: false,
useSystemFonts: true,
}).promise;
const pageImages: string[] = [];

for (let i = 1; i <= Math.min(doc.numPages, 10); i++) {
const page = await doc.getPage(i);
const viewport = page.getViewport({ scale: 2.0 }); // 2x for readability

// Create canvas-like rendering using node-canvas or sharp
// pdfjs-dist needs a canvas - use the OffscreenCanvas or render to PNG via sharp
// Simplest approach: send the PDF directly as base64 to Gemini (it supports PDF input)
// Actually, Gemini Flash supports PDF files directly via OpenRouter
break; // We'll send the whole PDF as a file
}

P2 Dead pdfjs page-rendering code

pdfjs.getDocument() is called, the first page is fetched (allocating memory for font/image decoding), and then break exits immediately without rendering anything. The pageImages array is always empty — the code that follows sends the raw pdfBase64 directly. The entire pdfjs block is dead weight that runs on every scanned PDF. Consider removing it:

// Remove lines 87-111 entirely; the PDF is sent as base64 directly below
const pdfBase64 = pdfBuffer.toString('base64');

Comment on lines +87 to +95
const response = await fetch(`${getOpenRouterBaseUrl()}/chat/completions`, {
method: "POST",
headers: getOpenRouterHeaders(),
body: JSON.stringify({
model: IDENTIFY_MODEL,
messages: [{ role: "user", content: prompt }],
temperature: 0,
}),
});

P2 auto-identify bypasses AI_PROVIDER=gateway routing

This step always calls getOpenRouterBaseUrl() (defaulting to https://openrouter.ai/api/v1) regardless of the AI_PROVIDER env var. The classify and parse steps use getModel() which correctly routes through the Vercel AI gateway when AI_PROVIDER=gateway, but autoIdentify hard-codes the OpenRouter path. The PR description claims gateway support for both providers. Consider refactoring to use generateText({ model: getModel(IDENTIFY_MODEL), ... }) so provider selection is consistent.

feat: international lab support with fuzzy matching + LLM auto-identification

Enhanced the ingestion pipeline to handle non-English lab reports (tested
extensively with Romanian labs) and automatically identify unknown biomarkers.

Normalizer improvements:
- Fuzzy string matching (Levenshtein distance) for analyte name resolution
- Canonical code mapping to consolidate duplicate metric codes
- Token-based matching for multi-word analyte names
- Demographic-aware reference ranges (age/sex)
- 30 unit tests + 137 integration tests with real Romanian lab data
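
The demographic-aware range resolution listed above could look roughly like this. A minimal sketch under assumed types: the interfaces and the specificity heuristic are illustrative, and the actual resolveReferenceRange exercised in normalizer.test.ts may differ.

```typescript
interface ReferenceRange {
  low: number;
  high: number;
  sex?: "male" | "female"; // undefined = applies to all sexes
  minAgeYears?: number;
  maxAgeYears?: number;
}

interface Demographics {
  sex: "male" | "female";
  ageYears: number;
}

// Pick the most demographic-specific range that matches the patient,
// falling back to an unrestricted range; null when nothing applies.
function resolveReferenceRange(
  ranges: ReferenceRange[],
  demo: Demographics,
): ReferenceRange | null {
  const applicable = ranges.filter(
    (r) =>
      (r.sex === undefined || r.sex === demo.sex) &&
      (r.minAgeYears === undefined || demo.ageYears >= r.minAgeYears) &&
      (r.maxAgeYears === undefined || demo.ageYears <= r.maxAgeYears),
  );
  if (applicable.length === 0) return null;
  // More constraints = more specific; prefer the most specific match.
  const specificity = (r: ReferenceRange) =>
    Number(r.sex !== undefined) +
    Number(r.minAgeYears !== undefined) +
    Number(r.maxAgeYears !== undefined);
  return [...applicable].sort((a, b) => specificity(b) - specificity(a))[0];
}
```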

LLM auto-identification pipeline:
- New auto-identify step: sends unmatched analytes to Gemini Flash
- Auto-creates metric_definitions with LOINC codes and reference ranges
- Adds aliases to existing metrics for future matching
- Re-normalizes resolved extractions automatically

AI provider abstraction:
- Configurable provider via AI_PROVIDER env var (gateway or openrouter)
- Shared module eliminates duplicated provider setup across files
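
The provider switch can be sketched as below. Names here (resolveProvider, getBaseUrl, AI_GATEWAY_URL) are assumptions for illustration, not the actual exports of ai-provider.ts; only the AI_PROVIDER=gateway|openrouter contract comes from the PR.

```typescript
type Provider = "gateway" | "openrouter";

// Read and validate the provider from an env map; the "gateway" default
// mirrors the PR description. Taking env as a parameter keeps this testable.
function resolveProvider(env: Record<string, string | undefined>): Provider {
  const p = env.AI_PROVIDER ?? "gateway";
  if (p !== "gateway" && p !== "openrouter") {
    throw new Error(`Unsupported AI_PROVIDER: ${p}`);
  }
  return p;
}

function getBaseUrl(env: Record<string, string | undefined>): string {
  return resolveProvider(env) === "openrouter"
    ? "https://openrouter.ai/api/v1"
    : env.AI_GATEWAY_URL ?? ""; // AI_GATEWAY_URL is a hypothetical variable
}
```

Routing every step through one such module is what the P2 comment below asks for: auto-identify currently hard-codes the OpenRouter path instead of consulting the shared provider.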

Other improvements:
- Enhanced extract-labs prompt for international lab formats
- Scanned PDF detection with OCR fallback via vision model
- Extended metric definitions seed data for common biomarkers
- Romanian lab supplement seed data (200+ aliases)
@Chocksy Chocksy force-pushed the pr/romanian-lab-support branch from efb1c75 to c3da940 on April 7, 2026 at 12:43
