feat: international lab support with fuzzy matching + LLM auto-identification#3
Chocksy wants to merge 1 commit into zmeyer44:main
Conversation
@Chocksy is attempting to deploy a commit to the Zach's Projects Team on Vercel. A member of the Team first needs to authorize it.
Greptile Summary

This PR adds international lab report support (initially targeting Romanian formats) via a fuzzy matching engine in the normalizer and a new LLM-powered auto-identification step that creates metric definitions for unmatched analytes.
Confidence Score: 3/5

Not safe to merge — two P1 defects affect every processed document: SQL injection via unsanitized LLM text in a raw db.execute() call, and null-to-zero data corruption during unit conversion. Both are present defects on the primary processing path, not speculative, and need fixing before this ships.

Important Files Changed: services/ingestion-worker/src/steps/auto-identify.ts (SQL injection + id validation), packages/ingestion/src/normalizer.ts (null coercion during unit conversion)
Sequence Diagram

sequenceDiagram
participant W as workflow.ts
participant N as normalize.ts
participant NR as normalizer.ts
participant AI as auto-identify.ts
participant LLM as LLM (Gemini Flash)
participant DB as Database
W->>N: normalize(extractions)
N->>DB: fetch metricDefs, unitConversions, demographics
N->>NR: normalizeExtractions()
NR-->>N: {normalized[], flagged[]}
N-->>W: NormalizeOutput
W->>AI: autoIdentify(normResult, metricDefs, ...)
AI->>AI: filter flagged[reason=unmatched_metric]
alt unmatched.length > 0
AI->>LLM: POST /chat/completions (analyte list)
LLM-->>AI: IdentifiedBiomarker[]
loop per identified biomarker
AI->>DB: INSERT metric_definitions (id from LLM)
Note over AI,DB: ⚠️ id not validated; raw SQL for alias update
end
AI->>NR: normalizeExtractions(resolved)
NR-->>AI: reNormResult
end
AI-->>W: finalNormalization
W->>W: materialize(finalNormalization)
Reviews (1): Last reviewed commit: "feat: international lab support with fuz..."
await db.execute(
  `UPDATE metric_definitions SET aliases = aliases::jsonb || '["${flagged.extraction.analyte.replace(/"/g, '\\"')}"]'::jsonb WHERE id = '${existing.id}'`,
);
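As a concrete illustration (hypothetical payload, plain string handling only), the double-quote-only escaping leaves single quotes untouched, so attacker-controlled text survives into the SQL statement:

```typescript
// Hypothetical adversarial analyte text (e.g. extracted from a crafted PDF).
const analyte = "O'Brien\"]'::jsonb; DROP TABLE metric_definitions; --";

// The original sanitization: escapes double quotes only.
const escaped = analyte.replace(/"/g, '\\"');

// Single quotes survive, so interpolating into a single-quoted SQL literal
// lets the payload terminate the string and append new statements.
const unsafe = `UPDATE metric_definitions SET aliases = aliases::jsonb || '["${escaped}"]'::jsonb WHERE id = 'abc'`;

console.log(unsafe.includes("DROP TABLE metric_definitions")); // payload lands verbatim in the SQL text
```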
SQL injection via unsanitized analyte text
The .replace(/"/g, '\\"') only escapes double quotes. A single quote in an LLM-extracted analyte name (e.g., "O'Brien Factor") breaks out of the SQL string literals '["..."]' and '${existing.id}', enabling SQL injection from any crafted PDF document. Use JSON.stringify for the alias array and a parameterized approach:
await db.execute(
  sql`UPDATE metric_definitions
    SET aliases = aliases::jsonb || ${JSON.stringify([flagged.extraction.analyte])}::jsonb
    WHERE id = ${existing.id}`,
);

const converted = convertUnit(
  extraction.value ?? 0,
  extraction.unit,
  metric.unit,
  unitConversions,
  metric.id,
);
if (converted !== null) {
  finalValue = converted;
Null value silently coerced to 0 during unit conversion
When extraction.value is null (qualitative result like "< 0.050") and a unit conversion exists, extraction.value ?? 0 passes 0 to convertUnit. If a conversion rule matches, 0 * multiplier + offset is stored as finalValue — overwriting null with a fabricated numeric zero. For CRP (mg/dL → mg/L, multiplier 10), a null result would be stored as 0 mg/L.
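A minimal repro of the coercion, using a stand-in convertUnit that applies the multiplier/offset rule the way the review describes (the real function takes units and a conversion table):

```typescript
// Stand-in for the real convertUnit: applies multiplier and offset
// (the CRP mg/dL → mg/L rule the review cites uses multiplier 10, offset 0).
function convertUnit(value: number, multiplier: number, offset = 0): number {
  return value * multiplier + offset;
}

const extractionValue: number | null = null; // qualitative result like "< 0.050"

// The bug: `?? 0` fabricates a numeric zero before conversion runs.
const buggy = convertUnit(extractionValue ?? 0, 10);

// The fix: guard the conversion path so null stays null.
const fixed = extractionValue !== null ? convertUnit(extractionValue, 10) : null;

console.log(buggy, fixed); // 0 null
```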
Guard the conversion path with a null check:
const converted = extraction.value !== null
  ? convertUnit(
      extraction.value,
      extraction.unit,
      metric.unit,
      unitConversions,
      metric.id,
    )
  : null;
if (converted !== null) {
  finalValue = converted;
.values({
  id: match.id,
  name: match.standardName,
LLM-returned id used as DB primary key without format validation

The match.id string comes directly from the LLM response and is inserted as the metric_definitions primary key with no sanitization. The prompt requests a snake_case identifier (lowercase with underscores), but this isn't enforced. An adversarial PDF could coerce the model into returning an id that overwrites an existing metric (e.g., "glucose") or contains special characters. Add a guard before insert:
if (!/^[a-z][a-z0-9_]{0,63}$/.test(match.id)) {
  console.warn(`[auto-identify] Skipping unsafe id: ${match.id}`);
  remainingFlagged.push(flagged);
  continue;
}

const pdfjs = await import("pdfjs-dist/legacy/build/pdf.mjs");
// Resolve the worker file from node_modules (not relative to this source file)
const { createRequire } = await import("module");
const req = createRequire(import.meta.url);
pdfjs.GlobalWorkerOptions.workerSrc = req.resolve(
  "pdfjs-dist/legacy/build/pdf.worker.mjs",
);
const doc = await pdfjs.getDocument({
  data: new Uint8Array(pdfBuffer),
  useWorkerFetch: false,
  isEvalSupported: false,
  useSystemFonts: true,
}).promise;
const pageImages: string[] = [];

for (let i = 1; i <= Math.min(doc.numPages, 10); i++) {
  const page = await doc.getPage(i);
  const viewport = page.getViewport({ scale: 2.0 }); // 2x for readability

  // Create canvas-like rendering using node-canvas or sharp
  // pdfjs-dist needs a canvas - use the OffscreenCanvas or render to PNG via sharp
  // Simplest approach: send the PDF directly as base64 to Gemini (it supports PDF input)
  // Actually, Gemini Flash supports PDF files directly via OpenRouter
  break; // We'll send the whole PDF as a file
}
Dead pdfjs page-rendering code
pdfjs.getDocument() is called, the first page is fetched (allocating memory for font/image decoding), and then break exits immediately without rendering anything. The pageImages array is always empty — the code that follows sends the raw pdfBase64 directly. The entire pdfjs block is dead weight that runs on every scanned PDF. Consider removing it:
// Remove lines 87-111 entirely; the PDF is sent as base64 directly below
const pdfBase64 = pdfBuffer.toString('base64');

const response = await fetch(`${getOpenRouterBaseUrl()}/chat/completions`, {
  method: "POST",
  headers: getOpenRouterHeaders(),
  body: JSON.stringify({
    model: IDENTIFY_MODEL,
    messages: [{ role: "user", content: prompt }],
    temperature: 0,
  }),
});
auto-identify bypasses AI_PROVIDER=gateway routing
This step always calls getOpenRouterBaseUrl() (defaulting to https://openrouter.ai/api/v1) regardless of the AI_PROVIDER env var. The classify and parse steps use getModel() which correctly routes through the Vercel AI gateway when AI_PROVIDER=gateway, but autoIdentify hard-codes the OpenRouter path. The PR description claims gateway support for both providers. Consider refactoring to use generateText({ model: getModel(IDENTIFY_MODEL), ... }) so provider selection is consistent.
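The routing the review asks for can be sketched as a single provider-selection helper. This is a hypothetical sketch, not the repo's actual ai-provider module; the gateway URL is an assumption, while the OpenRouter URL is the default the review mentions:

```typescript
// Hypothetical AI_PROVIDER-based routing. Every LLM-calling step (classify,
// parse, auto-identify) should resolve its endpoint this way instead of
// hard-coding one provider.
function getBaseUrl(provider: string | undefined): string {
  return provider === "gateway"
    ? "https://ai-gateway.vercel.sh/v1" // assumed Vercel AI gateway endpoint
    : "https://openrouter.ai/api/v1";   // OpenRouter default, per the review
}

console.log(getBaseUrl(process.env.AI_PROVIDER));
```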
feat: international lab support with fuzzy matching + LLM auto-identification

Enhanced the ingestion pipeline to handle non-English lab reports (tested extensively with Romanian labs) and automatically identify unknown biomarkers.

Normalizer improvements:
- Fuzzy string matching (Levenshtein distance) for analyte name resolution
- Canonical code mapping to consolidate duplicate metric codes
- Token-based matching for multi-word analyte names
- Demographic-aware reference ranges (age/sex)
- 30 unit tests + 137 integration tests with real Romanian lab data

LLM auto-identification pipeline:
- New auto-identify step: sends unmatched analytes to Gemini Flash
- Auto-creates metric_definitions with LOINC codes and reference ranges
- Adds aliases to existing metrics for future matching
- Re-normalizes resolved extractions automatically

AI provider abstraction:
- Configurable provider via AI_PROVIDER env var (gateway or openrouter)
- Shared module eliminates duplicated provider setup across files

Other improvements:
- Enhanced extract-labs prompt for international lab formats
- Scanned PDF detection with OCR fallback via vision model
- Extended metric definitions seed data for common biomarkers
- Romanian lab supplement seed data (200+ aliases)
Force-pushed efb1c75 to c3da940
Summary
- AI_PROVIDER=gateway|openrouter env var with shared module

Key Changes
- packages/ingestion/src/normalizer.ts — Fuzzy matching engine with Levenshtein distance, canonical code map, demographic-aware ranges
- packages/ingestion/src/normalizer.test.ts — 30 unit tests
- packages/ingestion/src/normalizer.integration.test.ts — 137 integration tests with real Romanian lab data
- services/ingestion-worker/src/steps/auto-identify.ts — LLM-powered biomarker identification pipeline
- services/ingestion-worker/src/lib/ai-provider.ts — Shared AI provider (gateway or OpenRouter)
- packages/ai/src/prompts/extract-labs.ts — Enhanced prompt for international formats
- packages/database/src/seed/data/romanian-lab-supplements.ts — 200+ Romanian aliases
- packages/database/src/seed/data/metric-definitions.ts — Extended metric catalog
Test plan
- pnpm vitest run in packages/ingestion — expect 30+137 tests passing
- Run with both AI_PROVIDER=openrouter and AI_PROVIDER=gateway
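The fuzzy matching engine listed under Key Changes relies on Levenshtein distance for analyte name resolution. A generic dynamic-programming sketch of that distance (not the package's actual implementation):

```typescript
// Classic single-row dynamic-programming Levenshtein distance: the minimum
// number of insertions, deletions, and substitutions to turn `a` into `b`.
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // dp[i-1][j-1] from the previous row
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1,                                // deletion
        dp[j - 1] + 1,                            // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1),   // substitution (or match)
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// A Romanian analyte spelling is one edit away from the English metric name.
console.log(levenshtein("hemoglobina", "hemoglobin")); // 1
```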