feat: add data_extractor and query_extraction tools (DWS Data Extraction API) by jdrhyne · Pull Request #27 · PSPDFKit/nutrient-dws-mcp-server

jdrhyne · 2026-06-07T21:06:00Z

Summary

Adds the data extraction workflow primitive to the MCP server, targeting the standalone DWS Data Extraction API (POST /extraction/parse) — a separate product with its own key, not a json-content output of the Processor /build endpoint.

- data_extractor: POST /extraction/parse with a mode (text/structure/understand/agentic) and an output format. Spatial output (typed elements with bounding boxes, confidence, and reading order) is written to outputPath and summarized inline; markdown is returned inline (or written to a file when outputPath is given).
- query_extraction: reads a saved spatial-extraction file and returns filtered element slices inline (by pages, region/bbox, minConfidence, elementTypes), so an agent pulls only the coordinates/values it needs into context.
- A worked extract → query → act example (examples/invoice-extraction-workflow.md).

Architecture

Reuses the existing DwsApiClient abstraction — data_extractor runs on a second client instance authenticated with a new NUTRIENT_EXTRACTION_API_KEY (separate from the Processor NUTRIENT_DWS_API_KEY). No new HTTP plumbing. document_processor's description no longer advertises standalone extraction and points to data_extractor.
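As a rough illustration of the two-client arrangement, the sketch below builds one client per key and leaves the extraction client undefined when its key is unset. `ApiClient`, `buildClients`, and `requireExtractionClient` are hypothetical stand-ins, not the repo's `DwsApiClient` API, and the shared base URL is an assumption taken from the curl commands in the U0 comment.

```typescript
// Hypothetical minimal client shape; the real DwsApiClient is not reproduced here.
interface ApiClient {
  baseUrl: string;
  apiKey: string;
}

interface Clients {
  processor: ApiClient | undefined;
  extraction: ApiClient | undefined;
}

// Assumption: both APIs share this host (the U0 curl commands use it for /extraction/parse).
const BASE_URL = 'https://api.nutrient.io';

function buildClients(env: Record<string, string | undefined>): Clients {
  const processorKey = env.NUTRIENT_DWS_API_KEY;
  const extractionKey = env.NUTRIENT_EXTRACTION_API_KEY;
  return {
    // How the server treats a missing Processor key is outside this sketch.
    processor: processorKey ? { baseUrl: BASE_URL, apiKey: processorKey } : undefined,
    // Undefined when unset, so data_extractor can report a setup error instead of calling the API.
    extraction: extractionKey ? { baseUrl: BASE_URL, apiKey: extractionKey } : undefined,
  };
}

function requireExtractionClient(clients: Clients): ApiClient {
  if (!clients.extraction) {
    throw new Error(
      'NUTRIENT_EXTRACTION_API_KEY is not set; data_extractor needs its own Data Extraction key.',
    );
  }
  return clients.extraction;
}
```

Keeping the missing-key check in one helper means the tool handler fails before any file is read or uploaded.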

Design notes

- Context safety: large spatial output goes to a file; the inline summary is content-free (counts, low-confidence flags, and page geometry only, with no document text). Markdown also honors outputPath to avoid overflowing the conversation.
- Sandbox: all reads/writes go through the existing sandbox resolvers; absolute paths are re-rooted/contained as in the other tools.
- Robustness: a 2xx response lacking output.elements is rejected before writing, so a non-extraction body can't overwrite the target file.
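The context-safety and robustness notes can be sketched as two small functions: a guard that rejects bodies without output.elements, and a summary that reports only counts. This is a minimal sketch, not the code in src/dws/extract.ts; the `text` field, the function names, and the 0.8 low-confidence threshold are assumptions, and page geometry is omitted because its field names are not given in this PR.

```typescript
interface SpatialElement {
  type: string;
  confidence: number;
  bounds: { x: number; y: number; width: number; height: number };
  page: { pageIndex: number };
  text?: string; // hypothetical content field; never copied into the summary
}

interface SpatialSummary {
  totalElements: number;
  countsByType: Record<string, number>;
  countsByPage: Record<number, number>;
  lowConfidenceCount: number;
}

// Guard: true only when the parsed body carries an output.elements array.
function hasSpatialElements(body: unknown): body is { output: { elements: SpatialElement[] } } {
  if (typeof body !== 'object' || body === null) return false;
  const output = (body as { output?: unknown }).output;
  if (typeof output !== 'object' || output === null) return false;
  return Array.isArray((output as { elements?: unknown }).elements);
}

// Content-free summary: counts only, no element text or values.
function summarizeSpatialSketch(elements: SpatialElement[], lowConfidenceBelow = 0.8): SpatialSummary {
  const summary: SpatialSummary = {
    totalElements: elements.length,
    countsByType: {},
    countsByPage: {},
    lowConfidenceCount: 0,
  };
  for (const el of elements) {
    summary.countsByType[el.type] = (summary.countsByType[el.type] ?? 0) + 1;
    summary.countsByPage[el.page.pageIndex] = (summary.countsByPage[el.page.pageIndex] ?? 0) + 1;
    if (el.confidence < lowConfidenceBelow) summary.lowConfidenceCount += 1;
  }
  return summary;
}
```

Running the guard before the write is what keeps an error-shaped 2xx body from replacing a previously saved extraction file.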

Testing

- pnpm test: 120 passing (62 skipped). New tests/extract.test.ts (14 cases) covers routing (spatial→file / markdown→inline / markdown→file), the no-PII-in-summary guarantee, sandbox containment, the non-spatial-body guard, the missing-key error, and every query_extraction filter. tests/environment.test.ts and tests/mcp-tools.test.ts are updated.
- tsc, eslint, prettier, and pnpm build are all clean.
- The 2 pre-existing key-gated live integration tests (build/sign) are unchanged by this PR.

Follow-up

Live verification (deferred): the handlers are built against Nutrient's published Data Extraction schema (endpoint, instructions shape, element fields). One live /extraction/parse call with a pdf_live_ key should confirm the response shape and capture a fixture before release. Tracked as U0 in docs/plans/2026-06-07-001-feat-dws-data-extraction-workflow-plan.md.

Plan: docs/plans/2026-06-07-001-feat-dws-data-extraction-workflow-plan.md

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o

Expose the build core (package-internal) so focused tools like data_extractor can compose instructions -> API call -> response routing without performBuildCall's required outputFilePath, enabling inline responses. Behavior-preserving; performBuildCall still consumes both. 🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o

Data Extraction is a separate DWS API (POST /extraction/parse with its own pdf_live_ key), not a json-content output of the Processor /build endpoint. data_extractor will not reuse the Build instruction machinery, so the processInstructions/makeApiBuildCall exports are unnecessary.

…rse) DWS Data Extraction is a separate API with its own pdf_live_ key, not a json-content output of Processor /build. Rework KTD1 (second DwsApiClient), add modes/formats + cost transparency, separate NUTRIENT_EXTRACTION_API_KEY, and ground all wiring in main's client.ts/index.ts. 🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o

Schemas for the Data Extraction API (/extraction/parse): mode (text/ structure/understand/agentic), output format (spatial/markdown), includeWords, language, pages, outputPath; plus query filters (pages/region/minConfidence/ elementTypes/limit). Cross-field rules enforced in handlers (Schema.shape requires a plain ZodObject). 🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o
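Because the cross-field rules live in the handler rather than the schema, they can be pictured as a plain validation function that runs before the API call. The two rules below (spatial needs outputPath; text mode cannot produce spatial output) are taken from the test descriptions in this PR; the function name, argument shape, and error wording are hypothetical.

```typescript
type Mode = 'text' | 'structure' | 'understand' | 'agentic';
type Format = 'spatial' | 'markdown';

interface ExtractArgs {
  mode: Mode;
  format: Format;
  outputPath?: string;
}

// Returns an error message for an invalid combination, or null when the arguments are acceptable.
function validateExtractArgs(args: ExtractArgs): string | null {
  if (args.format === 'spatial' && args.mode === 'text') {
    return 'mode "text" does not support spatial output; use structure, understand, or agentic.';
  }
  if (args.format === 'spatial' && !args.outputPath) {
    return 'spatial output requires outputPath so the full payload is written to a file.';
  }
  return null;
}
```

Keeping these checks out of the schema leaves it a plain object schema, which matters here because, in Zod 3 at least, adding a refinement wraps the object in an effects type that no longer exposes `.shape`.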

Separate key from the Processor NUTRIENT_DWS_API_KEY; the Data Extraction client is constructed from it during tool registration (U5). 🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o

performExtractCall: POST /extraction/parse (multipart file + instructions), validates outputPath before the call, routes spatial->file+content-free summary and markdown->inline, parses streamed JSON, clear error when key unset. performQueryCall: reads a saved spatial file and returns elements filtered by page/region/minConfidence/elementTypes, capped by limit. Drop unsupported 'pages' request param from the extractor schema; language nests under options. 🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o
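The filtering step of performQueryCall can be sketched as a pure function over the saved elements. This is an illustration under assumptions, not the handler's code: `queryElements` is a hypothetical name, region matching is assumed to mean "bounds overlap the region" (the PR does not say whether overlap or full containment is used), and `pages` is assumed to hold pageIndex values.

```typescript
interface Bounds { x: number; y: number; width: number; height: number }

interface SpatialElement {
  type: string;
  confidence: number;
  bounds: Bounds;
  page: { pageIndex: number };
}

interface QueryFilters {
  pages?: number[];
  region?: Bounds;
  minConfidence?: number;
  elementTypes?: string[];
  limit?: number;
}

// Assumption: a region matches any element whose bounds overlap it.
function overlaps(a: Bounds, b: Bounds): boolean {
  return a.x < b.x + b.width && b.x < a.x + a.width && a.y < b.y + b.height && b.y < a.y + a.height;
}

function queryElements(elements: SpatialElement[], f: QueryFilters): SpatialElement[] {
  const matched = elements.filter((el) => {
    if (f.pages !== undefined && !f.pages.includes(el.page.pageIndex)) return false;
    if (f.region !== undefined && !overlaps(el.bounds, f.region)) return false;
    if (f.minConfidence !== undefined && el.confidence < f.minConfidence) return false;
    if (f.elementTypes !== undefined && !f.elementTypes.includes(el.type)) return false;
    return true;
  });
  // The cap is applied after filtering so the limit counts matching elements only.
  return f.limit === undefined ? matched : matched.slice(0, f.limit);
}
```

An agent that only needs the low-risk fields in an invoice header could, for example, combine `pages: [0]`, a header `region`, and `minConfidence` in a single call and receive just those elements.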

… extraction on document_processor Builds a Data Extraction DwsApiClient from NUTRIENT_EXTRACTION_API_KEY (undefined when unset; data_extractor then returns a clear setup error). Threads it through createMcpServer/addToolsToServer. Updates document_processor description to point extraction users to data_extractor, and the tool-list test for the two new tools. 🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o

12 tests against the documented response shape (mocked client): markdown inline, spatial->file with a content-free summary (asserts no PII leaks inline while the file retains it), spatial-requires-outputPath, text-mode rejects spatial, sandbox containment of absolute paths, missing-key error, and query filters (minConfidence/type/page/region/limit/malformed). 🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o

…ion key Add both tools to the Available Tools table, a Data Extraction section (modes, cost per page, spatial-vs-markdown, file+query workflow, transcript caveat), NUTRIENT_EXTRACTION_API_KEY to the env table and .env.example, and point the document_processor capability row at the dedicated tool. 🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o

examples/invoice-extraction-workflow.md demonstrates data_extractor (spatial to file) -> query_extraction (low-confidence + region slices) -> act via ai_redactor/document_signer, keeping the full payload out of context. 🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o

…e guard, leaner write
- Honor outputPath for markdown output (large docs would overflow context inline)
- Reject 2xx responses lacking output.elements before writing, so a non-extraction body can't overwrite the target file
- Write the raw response body for spatial output instead of re-stringifying (drops a copy of large payloads, preserves all API fields)
- Extract writeToResolvedPath helper (de-dupes mkdir-p), drop redundant includeWords coalesce and the resolvedOutputPath casts
- Add tests for markdown-to-file and the non-spatial-body guard

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o

jdrhyne · 2026-06-07T21:27:02Z

U0 — live `/extraction/parse` verification (do before merge)

The handlers are built against Nutrient's published Data Extraction schema, but no live call has run yet (no key was available during development). One real request confirms the response shape and gives us a recorded fixture. Costs ~2–3 credits total.

Steps

# 1. Set the Data Extraction key (SEPARATE from NUTRIENT_DWS_API_KEY; starts with pdf_live_)
export NUTRIENT_EXTRACTION_API_KEY=pdf_live_...   # or add to .env (gitignored)

# 2a. Spatial (structure mode, ~1.5 credits) — confirms the element schema
curl -sS -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer $NUTRIENT_EXTRACTION_API_KEY" \
  -F "file=@tests/assets/example.pdf" \
  -F 'instructions={"mode":"structure","output":{"format":"spatial"}}' \
  | tee tests/fixtures/extraction-spatial-sample.json | jq '.output.elements[0], .metrics'

# 2b. Markdown (text mode, ~1 credit)
curl -sS -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer $NUTRIENT_EXTRACTION_API_KEY" \
  -F "file=@tests/assets/example.pdf" \
  -F 'instructions={"mode":"text","output":{"format":"markdown"}}' \
  | jq '.output | keys'
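For comparison with the curl commands above, here is a small TypeScript sketch of how the `instructions` form field could be assembled. `buildInstructions` and `ExtractOptions` are hypothetical names; placing `includeWords` under `output` and `language` under `options` follows the placement this PR describes from the published docs, which is exactly what the live call is meant to confirm.

```typescript
type Mode = 'text' | 'structure' | 'understand' | 'agentic';
type Format = 'spatial' | 'markdown';

interface ExtractOptions {
  mode: Mode;
  format: Format;
  includeWords?: boolean;
  language?: string;
}

interface Instructions {
  mode: Mode;
  output: { format: Format; includeWords?: boolean };
  options?: { language: string };
}

// Builds the JSON sent as the multipart "instructions" field; optional keys are omitted when unset.
function buildInstructions(o: ExtractOptions): Instructions {
  const instructions: Instructions = { mode: o.mode, output: { format: o.format } };
  if (o.includeWords !== undefined) instructions.output.includeWords = o.includeWords;
  if (o.language !== undefined) instructions.options = { language: o.language };
  return instructions;
}

// Same payload as step 2a: {"mode":"structure","output":{"format":"spatial"}}
const step2a = JSON.stringify(buildInstructions({ mode: 'structure', format: 'spatial' }));
```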

Acceptance criteria — the code reads exactly these paths (`src/dws/extract.ts`)

Spatial: output.elements is an array; each element has type, confidence, bounds.{x,y,width,height}, page.pageIndex. metrics.pagesProcessed is present.
- summarizeSpatial uses element.type / element.confidence / element.page.pageIndex / metrics.pagesProcessed
- query_extraction filters on element.bounds / confidence / page.pageIndex / type
Markdown: response has output.markdown (string).

If the field names match → no code change needed; just drop the captured response into tests/fixtures/ and (optionally) swap the inline fixture in tests/extract.test.ts for it, then this is release-ready.

If anything differs → adjust the SpatialElement / ExtractionResponse types and the filter/summary field access in src/dws/extract.ts (and the instructions shape if options.language / output.includeWords placement differs from the docs).
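One way to make the comparison mechanical is a small checker run against the captured fixture. The sketch below reports every acceptance-criteria path that is missing or mistyped; `checkSpatialResponse` is a hypothetical helper, and the expected primitive types (string for type, number for confidence, bounds, and pageIndex) are assumptions, since the criteria above name the fields but not their types.

```typescript
function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v);
}

// Returns a list of problems; an empty list means the body matches the paths the code reads.
function checkSpatialResponse(body: unknown): string[] {
  if (!isRecord(body)) return ['response is not an object'];
  const problems: string[] = [];

  const output = body['output'];
  const elements = isRecord(output) ? output['elements'] : undefined;
  if (!Array.isArray(elements)) {
    problems.push('output.elements is not an array');
  } else {
    elements.forEach((el: unknown, i: number) => {
      if (!isRecord(el)) {
        problems.push(`elements[${i}] is not an object`);
        return;
      }
      if (typeof el['type'] !== 'string') problems.push(`elements[${i}].type is not a string`);
      if (typeof el['confidence'] !== 'number') problems.push(`elements[${i}].confidence is not a number`);
      const bounds = el['bounds'];
      for (const key of ['x', 'y', 'width', 'height']) {
        if (!isRecord(bounds) || typeof bounds[key] !== 'number') {
          problems.push(`elements[${i}].bounds.${key} is not a number`);
        }
      }
      const page = el['page'];
      if (!isRecord(page) || typeof page['pageIndex'] !== 'number') {
        problems.push(`elements[${i}].page.pageIndex is not a number`);
      }
    });
  }

  const metrics = body['metrics'];
  if (!isRecord(metrics) || metrics['pagesProcessed'] === undefined) {
    problems.push('metrics.pagesProcessed is missing');
  }
  return problems;
}
```

Feeding it the parsed contents of tests/fixtures/extraction-spatial-sample.json would give a concrete list of fields to adjust in the SpatialElement / ExtractionResponse types if anything differs.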

Tracked as U0 in docs/plans/2026-06-07-001-feat-dws-data-extraction-workflow-plan.md.

jdrhyne added 12 commits June 7, 2026 11:27

docs(plan): data_extractor + query_extraction workflow plan

a638997

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o

feat(env): add NUTRIENT_EXTRACTION_API_KEY for the Data Extraction API

ee1d04a

Separate key from the Processor NUTRIENT_DWS_API_KEY; the Data Extraction client is constructed from it during tool registration (U5). 🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o

jdrhyne wants to merge 12 commits into main from feat/dws-data-extraction-workflow
