feat: add data_extractor and query_extraction tools (DWS Data Extraction API)#27
jdrhyne wants to merge 12 commits into
Expose the build core (package-internal) so focused tools like data_extractor can compose instructions -> API call -> response routing without performBuildCall's required outputFilePath, enabling inline responses. Behavior-preserving; performBuildCall still consumes both. 🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o
Data Extraction is a separate DWS API (POST /extraction/parse with its own pdf_live_ key), not a json-content output of the Processor /build endpoint. data_extractor will not reuse the Build instruction machinery, so the processInstructions/makeApiBuildCall exports are unnecessary.
…rse)
DWS Data Extraction is a separate API with its own pdf_live_ key, not a json-content output of Processor /build. Rework KTD1 (second DwsApiClient), add modes/formats + cost transparency, separate NUTRIENT_EXTRACTION_API_KEY, and ground all wiring in main's client.ts/index.ts.
Schemas for the Data Extraction API (/extraction/parse): mode (text/structure/understand/agentic), output format (spatial/markdown), includeWords, language, pages, outputPath; plus query filters (pages/region/minConfidence/elementTypes/limit). Cross-field rules enforced in handlers (Schema.shape requires a plain ZodObject).
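The cross-field rules that the handlers enforce can be sketched as below. This is a minimal illustration, not the PR's code: the type names, the `checkCrossFieldRules` helper, and the error strings are hypothetical, and only the two rules named in the test commit (spatial requires outputPath; text mode rejects spatial) are modeled.

```typescript
// Hypothetical argument shape; field names follow the commit message,
// but the PR's actual schema types are not shown in this conversation.
type ExtractMode = "text" | "structure" | "understand" | "agentic";
type ExtractFormat = "spatial" | "markdown";

interface ExtractArgs {
  mode: ExtractMode;
  format: ExtractFormat;
  outputPath?: string;
}

// Returns a list of human-readable problems; an empty list means the
// combination of fields is acceptable.
function checkCrossFieldRules(args: ExtractArgs): string[] {
  const problems: string[] = [];
  // Spatial payloads are written to a file, so a target path is mandatory.
  if (args.format === "spatial" && !args.outputPath) {
    problems.push("spatial output requires outputPath");
  }
  // Text mode cannot produce spatial output.
  if (args.mode === "text" && args.format === "spatial") {
    problems.push("text mode does not support spatial output");
  }
  return problems;
}
```

Keeping these checks in a plain function rather than in a schema refinement matches the note above that Schema.shape needs a plain ZodObject.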
Separate key from the Processor NUTRIENT_DWS_API_KEY; the Data Extraction client is constructed from it during tool registration (U5).
performExtractCall: POST /extraction/parse (multipart file + instructions), validates outputPath before the call, routes spatial->file+content-free summary and markdown->inline, parses streamed JSON, clear error when key unset. performQueryCall: reads a saved spatial file and returns elements filtered by page/region/minConfidence/elementTypes, capped by limit. Drop unsupported 'pages' request param from the extractor schema; language nests under options.
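The filtering step of performQueryCall could look roughly like the sketch below. The element and bounding-box field names, the `queryElements` helper, and the choice to treat `region` as full containment (rather than overlap) are all assumptions made for illustration; the real response shape is not confirmed in this PR.

```typescript
// Assumed shapes; the real API's field names may differ.
interface BBox { x: number; y: number; width: number; height: number; }
interface SpatialElement { type: string; page: number; confidence: number; bbox: BBox; }

interface QueryFilters {
  pages?: number[];
  region?: BBox;
  minConfidence?: number;
  elementTypes?: string[];
  limit?: number;
}

// True when `inner` lies entirely within `outer` (containment is an assumption).
function contains(outer: BBox, inner: BBox): boolean {
  return (
    inner.x >= outer.x &&
    inner.y >= outer.y &&
    inner.x + inner.width <= outer.x + outer.width &&
    inner.y + inner.height <= outer.y + outer.height
  );
}

// Applies every filter that is set, then caps the result by `limit`.
function queryElements(elements: SpatialElement[], f: QueryFilters): SpatialElement[] {
  const matched = elements.filter(
    (e) =>
      (!f.pages || f.pages.includes(e.page)) &&
      (!f.region || contains(f.region, e.bbox)) &&
      (f.minConfidence === undefined || e.confidence >= f.minConfidence) &&
      (!f.elementTypes || f.elementTypes.includes(e.type)),
  );
  return f.limit === undefined ? matched : matched.slice(0, f.limit);
}
```

An unset filter matches everything, so a call with only `limit` returns the first elements in file order.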
… extraction on document_processor
Builds a Data Extraction DwsApiClient from NUTRIENT_EXTRACTION_API_KEY (undefined when unset; data_extractor then returns a clear setup error). Threads it through createMcpServer/addToolsToServer. Updates document_processor description to point extraction users to data_extractor, and the tool-list test for the two new tools.
12 tests against the documented response shape (mocked client): markdown inline, spatial->file with a content-free summary (asserts no PII leaks inline while the file retains it), spatial-requires-outputPath, text-mode rejects spatial, sandbox containment of absolute paths, missing-key error, and query filters (minConfidence/type/page/region/limit/malformed).
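One way to build a content-free summary is to report only counts and never copy element text. The sketch below assumes a hypothetical element shape with `type`, `page`, and `text` fields and a hypothetical `summarizeElements` helper; the summary fields the PR actually returns are not shown here.

```typescript
// Assumed element shape; `text` stands in for whatever content the API returns.
interface ExtractedElement { type: string; page: number; text?: string; }

interface ExtractionSummary {
  elementCount: number;
  pageCount: number;
  countsByType: Record<string, number>;
}

// Builds a summary from structural metadata only, so no document content
// (and therefore no PII) reaches the inline response.
function summarizeElements(elements: ExtractedElement[]): ExtractionSummary {
  const countsByType: Record<string, number> = {};
  const pages = new Set<number>();
  for (const el of elements) {
    countsByType[el.type] = (countsByType[el.type] ?? 0) + 1;
    pages.add(el.page);
  }
  return { elementCount: elements.length, pageCount: pages.size, countsByType };
}
```

A test in the spirit of the one described above would serialize the summary and assert that a known string from the document does not appear in it.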
…ion key
Add both tools to the Available Tools table, a Data Extraction section (modes, cost per page, spatial-vs-markdown, file+query workflow, transcript caveat), NUTRIENT_EXTRACTION_API_KEY to the env table and .env.example, and point the document_processor capability row at the dedicated tool.
examples/invoice-extraction-workflow.md demonstrates data_extractor (spatial to file) -> query_extraction (low-confidence + region slices) -> act via ai_redactor/document_signer, keeping the full payload out of context.
…e guard, leaner write
- Honor outputPath for markdown output (large docs would overflow context inline)
- Reject 2xx responses lacking output.elements before writing, so a non-extraction body can't overwrite the target file
- Write the raw response body for spatial output instead of re-stringifying (drops a copy of large payloads, preserves all API fields)
- Extract writeToResolvedPath helper (de-dupes mkdir-p), drop redundant includeWords coalesce and the resolvedOutputPath casts
- Add tests for markdown-to-file and the non-spatial-body guard
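The guard on output.elements, combined with writing the raw body, might be sketched as follows. The helper names `hasSpatialElements` and `prepareSpatialWrite` and the error messages are hypothetical; the only facts taken from the commit are that a body without output.elements is rejected before any write and that the raw text is written rather than a re-stringified copy.

```typescript
// Type guard: true only when the parsed body carries an `output.elements` array.
function hasSpatialElements(body: unknown): body is { output: { elements: unknown[] } } {
  if (typeof body !== "object" || body === null) return false;
  const output = (body as { output?: unknown }).output;
  if (typeof output !== "object" || output === null) return false;
  return Array.isArray((output as { elements?: unknown }).elements);
}

// Validates the raw response text and returns it unchanged for writing.
// Throws before any file is touched if the body is not a spatial extraction.
function prepareSpatialWrite(rawBody: string): string {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    throw new Error("response body is not valid JSON");
  }
  if (!hasSpatialElements(parsed)) {
    throw new Error("response lacks output.elements; refusing to write");
  }
  // Return the original text so every API field is preserved byte for byte.
  return rawBody;
}
```

Validating first and writing second means an error page or unrelated 2xx body never replaces a previously saved extraction file.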
U0 — live
Summary
Adds the data extraction workflow primitive to the MCP server, targeting the standalone DWS Data Extraction API (POST /extraction/parse): a separate product with its own key, not a json-content output of the Processor /build endpoint.
- data_extractor: POST /extraction/parse with a mode (text/structure/understand/agentic) and output format. Spatial output (typed elements with bounding boxes, confidence, and reading order) is written to outputPath and summarized inline; markdown is returned inline (or to a file when outputPath is given).
- query_extraction: reads a saved spatial-extraction file and returns filtered element slices inline (by pages, region/bbox, minConfidence, elementTypes), so an agent pulls only the coordinates/values it needs into context.
- Example workflow (examples/invoice-extraction-workflow.md).
Architecture
Reuses the existing DwsApiClient abstraction: data_extractor runs on a second client instance authenticated with a new NUTRIENT_EXTRACTION_API_KEY (separate from the Processor NUTRIENT_DWS_API_KEY). No new HTTP plumbing. document_processor's description no longer advertises standalone extraction and points to data_extractor.
Design notes
- Markdown output honors outputPath to avoid overflowing the conversation.
- A 2xx response lacking output.elements is rejected before writing, so a non-extraction body can't overwrite the target file.
Testing
pnpm test: 120 passing (62 skipped). New tests/extract.test.ts (14 cases) covers routing (spatial→file / markdown→inline / markdown→file), the no-PII-in-summary guarantee, sandbox containment, the non-spatial-body guard, the missing-key error, and every query_extraction filter. tests/environment.test.ts + tests/mcp-tools.test.ts updated. tsc, eslint, prettier, and pnpm build all clean.
Follow-up
A live /extraction/parse call with a pdf_live_ key should confirm the response shape and capture a fixture before release. Tracked as U0 in docs/plans/2026-06-07-001-feat-dws-data-extraction-workflow-plan.md.
Plan: docs/plans/2026-06-07-001-feat-dws-data-extraction-workflow-plan.md