feat: add data_extractor and query_extraction tools (DWS Data Extraction API) #27

Open
jdrhyne wants to merge 12 commits into main from feat/dws-data-extraction-workflow



jdrhyne commented Jun 7, 2026

Summary

Adds the data extraction workflow primitive to the MCP server, targeting the standalone DWS Data Extraction API (POST /extraction/parse) — a separate product with its own key, not a json-content output of the Processor /build endpoint.

  • data_extractor: POST /extraction/parse with a mode (text/structure/understand/agentic) and output format. Spatial output (typed elements with bounding boxes + confidence + reading order) is written to outputPath and summarized inline; markdown is returned inline (or to a file when outputPath is given).
  • query_extraction — reads a saved spatial-extraction file and returns filtered element slices inline (by pages, region/bbox, minConfidence, elementTypes), so an agent pulls only the coordinates/values it needs into context.
  • A worked extract → query → act example (examples/invoice-extraction-workflow.md).
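
To make the query_extraction filters concrete, here is a minimal sketch of how such a filter could work. It is not the code in src/dws/extract.ts. The element shape follows the fields named in this PR; the names `filterElements`, `QueryFilters`, and `intersects`, the sample element types, and the choice to treat `region` as a box-intersection test are all assumptions made for illustration.

```typescript
// Hypothetical element shape, based on the fields this PR names.
interface Bounds { x: number; y: number; width: number; height: number }
interface SpatialElement {
  type: string;
  confidence: number;
  bounds: Bounds;
  page: { pageIndex: number };
}

// Hypothetical filter options mirroring the query_extraction parameters.
interface QueryFilters {
  pages?: number[];        // page indexes to keep
  region?: Bounds;         // assumption: keep elements whose bounds intersect this box
  minConfidence?: number;  // assumption: keep elements at or above this value
  elementTypes?: string[];
  limit?: number;          // cap on the number of returned elements
}

// Axis-aligned rectangle overlap test.
function intersects(a: Bounds, b: Bounds): boolean {
  return (
    a.x < b.x + b.width && b.x < a.x + a.width &&
    a.y < b.y + b.height && b.y < a.y + a.height
  );
}

// Apply every filter that is set, then cap the result with limit.
function filterElements(elements: SpatialElement[], f: QueryFilters): SpatialElement[] {
  const kept = elements.filter((el) =>
    (f.pages === undefined || f.pages.includes(el.page.pageIndex)) &&
    (f.region === undefined || intersects(el.bounds, f.region)) &&
    (f.minConfidence === undefined || el.confidence >= f.minConfidence) &&
    (f.elementTypes === undefined || f.elementTypes.includes(el.type))
  );
  return f.limit === undefined ? kept : kept.slice(0, f.limit);
}
```

Filters compose with AND semantics here, so an agent can narrow by page first and then by region without a second pass over the file.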

Architecture

Reuses the existing DwsApiClient abstraction — data_extractor runs on a second client instance authenticated with a new NUTRIENT_EXTRACTION_API_KEY (separate from the Processor NUTRIENT_DWS_API_KEY). No new HTTP plumbing. document_processor's description no longer advertises standalone extraction and points to data_extractor.
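
The second-client arrangement might look roughly like the sketch below. `ApiClientSketch` is a hypothetical stand-in for the repo's DwsApiClient (whose real constructor is not shown in this PR), and `buildExtractionClient` and `requireExtractionClient` are illustrative names. Only the two environment variable names and the API host come from this PR.

```typescript
// Hypothetical stand-in for the repo's DwsApiClient; the real class may differ.
interface ApiClientSketch {
  apiKey: string;
  baseUrl: string;
}

// Host taken from the curl steps in this PR's U0 comment.
const BASE_URL = "https://api.nutrient.io";

type Env = Record<string, string | undefined>;

// Returns undefined when the extraction key is unset, so the server can
// still register its other tools.
function buildExtractionClient(env: Env): ApiClientSketch | undefined {
  const key = env["NUTRIENT_EXTRACTION_API_KEY"];
  return key ? { apiKey: key, baseUrl: BASE_URL } : undefined;
}

// Called by the tool handler: turns a missing client into a setup error.
function requireExtractionClient(client: ApiClientSketch | undefined): ApiClientSketch {
  if (!client) {
    throw new Error(
      "data_extractor needs NUTRIENT_EXTRACTION_API_KEY (separate from NUTRIENT_DWS_API_KEY)"
    );
  }
  return client;
}
```

Deferring the error to call time, rather than failing at startup, keeps the Processor tools usable for users who only hold a NUTRIENT_DWS_API_KEY.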

Design notes

  • Context safety: large spatial output goes to a file; the inline summary is content-free (counts, low-confidence flags, page geometry only — no document text). Markdown also honors outputPath to avoid overflowing the conversation.
  • Sandbox: all reads/writes go through the existing sandbox resolvers; absolute paths are re-rooted/contained as in the other tools.
  • Robustness: a 2xx response lacking output.elements is rejected before writing, so a non-extraction body can't overwrite the target file.
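
As an illustration of the context-safety and robustness notes, the sketch below rejects a body without output.elements and builds a summary that carries counts only. It is an assumption-laden approximation, not the PR's summarizeSpatial: the name `summarizeSpatialSketch`, the summary field names, the optional `text` field, and the 0.8 low-confidence threshold are hypothetical.

```typescript
// Hypothetical element shape; `text` stands in for any document content.
interface SpatialElement {
  type: string;
  confidence: number;
  page: { pageIndex: number };
  text?: string;
}

// Content-free summary: counts and flags only, never element text.
interface SpatialSummary {
  elementCount: number;
  countsByType: Record<string, number>;
  lowConfidenceCount: number;
  pagesProcessed: number | undefined;
}

function summarizeSpatialSketch(body: unknown, lowConfidenceBelow = 0.8): SpatialSummary {
  const parsed = body as
    | { output?: { elements?: unknown }; metrics?: { pagesProcessed?: number } }
    | null;
  const elements = parsed?.output?.elements;
  // Guard: run this before any file write so a non-extraction body
  // cannot overwrite the target file.
  if (!Array.isArray(elements)) {
    throw new Error("2xx response has no output.elements array; nothing written");
  }
  const countsByType: Record<string, number> = {};
  let lowConfidenceCount = 0;
  for (const el of elements as SpatialElement[]) {
    countsByType[el.type] = (countsByType[el.type] ?? 0) + 1;
    if (el.confidence < lowConfidenceBelow) lowConfidenceCount += 1;
  }
  return {
    elementCount: elements.length,
    countsByType,
    lowConfidenceCount,
    pagesProcessed: parsed?.metrics?.pagesProcessed,
  };
}
```

Because the summary is assembled from scratch rather than by deleting fields from the response, new text-bearing fields added by the API would not leak into the conversation by default.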

Testing

  • pnpm test: 120 passing (62 skipped). New tests/extract.test.ts (14 cases) covers routing (spatial→file / markdown→inline / markdown→file), the no-PII-in-summary guarantee, sandbox containment, the non-spatial-body guard, the missing-key error, and every query_extraction filter. tests/environment.test.ts + tests/mcp-tools.test.ts updated.
  • tsc, eslint, prettier, and pnpm build all clean.
  • The 2 pre-existing key-gated live integration tests (build/sign) are unchanged by this PR.

Follow-up

  • Live verification (deferred): the handlers are built against Nutrient's published Data Extraction schema (endpoint, instructions shape, element fields). One live /extraction/parse call with a pdf_live_ key should confirm the response shape and capture a fixture before release. Tracked as U0 in docs/plans/2026-06-07-001-feat-dws-data-extraction-workflow-plan.md.

Plan: docs/plans/2026-06-07-001-feat-dws-data-extraction-workflow-plan.md

jdrhyne added 12 commits June 7, 2026 11:27
Expose the build core (package-internal) so focused tools like
data_extractor can compose instructions -> API call -> response routing
without performBuildCall's required outputFilePath, enabling inline
responses. Behavior-preserving; performBuildCall still consumes both.

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o
Data Extraction is a separate DWS API (POST /extraction/parse with its own
pdf_live_ key), not a json-content output of the Processor /build endpoint.
data_extractor will not reuse the Build instruction machinery, so the
processInstructions/makeApiBuildCall exports are unnecessary.
…rse)

DWS Data Extraction is a separate API with its own pdf_live_ key, not a
json-content output of Processor /build. Rework KTD1 (second DwsApiClient),
add modes/formats + cost transparency, separate NUTRIENT_EXTRACTION_API_KEY,
and ground all wiring in main's client.ts/index.ts.

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o
Schemas for the Data Extraction API (/extraction/parse): mode (text/
structure/understand/agentic), output format (spatial/markdown), includeWords,
language, pages, outputPath; plus query filters (pages/region/minConfidence/
elementTypes/limit). Cross-field rules enforced in handlers (Schema.shape
requires a plain ZodObject).

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o
Separate key from the Processor NUTRIENT_DWS_API_KEY; the Data Extraction
client is constructed from it during tool registration (U5).

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o
performExtractCall: POST /extraction/parse (multipart file + instructions),
validates outputPath before the call, routes spatial->file+content-free summary
and markdown->inline, parses streamed JSON, clear error when key unset.
performQueryCall: reads a saved spatial file and returns elements filtered by
page/region/minConfidence/elementTypes, capped by limit. Drop unsupported
'pages' request param from the extractor schema; language nests under options.

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o
… extraction on document_processor

Builds a Data Extraction DwsApiClient from NUTRIENT_EXTRACTION_API_KEY (undefined
when unset; data_extractor then returns a clear setup error). Threads it through
createMcpServer/addToolsToServer. Updates document_processor description to point
extraction users to data_extractor, and the tool-list test for the two new tools.

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o
12 tests against the documented response shape (mocked client): markdown
inline, spatial->file with a content-free summary (asserts no PII leaks
inline while the file retains it), spatial-requires-outputPath, text-mode
rejects spatial, sandbox containment of absolute paths, missing-key error,
and query filters (minConfidence/type/page/region/limit/malformed).

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o
…ion key

Add both tools to the Available Tools table, a Data Extraction section (modes,
cost per page, spatial-vs-markdown, file+query workflow, transcript caveat),
NUTRIENT_EXTRACTION_API_KEY to the env table and .env.example, and point the
document_processor capability row at the dedicated tool.

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o
examples/invoice-extraction-workflow.md demonstrates data_extractor (spatial
to file) -> query_extraction (low-confidence + region slices) -> act via
ai_redactor/document_signer, keeping the full payload out of context.

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o
…e guard, leaner write

- Honor outputPath for markdown output (large docs would overflow context inline)
- Reject 2xx responses lacking output.elements before writing, so a non-extraction
  body can't overwrite the target file
- Write the raw response body for spatial output instead of re-stringifying
  (drops a copy of large payloads, preserves all API fields)
- Extract writeToResolvedPath helper (de-dupes mkdir-p), drop redundant
  includeWords coalesce and the resolvedOutputPath casts
- Add tests for markdown-to-file and the non-spatial-body guard

🔮 View transcript: https://nutrient-agentlogs.dev/s/duk4x9tr3rmlnxta7c4bwk6o

jdrhyne commented Jun 7, 2026

U0 — live /extraction/parse verification (do before merge)

The handlers are built against Nutrient's published Data Extraction schema, but no live call has run yet (no key was available during development). One real request confirms the response shape and gives us a recorded fixture. Costs ~2–3 credits total.

Steps

# 1. Set the Data Extraction key (SEPARATE from NUTRIENT_DWS_API_KEY; starts with pdf_live_)
export NUTRIENT_EXTRACTION_API_KEY=pdf_live_...   # or add to .env (gitignored)

# 2a. Spatial (structure mode, ~1.5 credits) — confirms the element schema
curl -sS -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer $NUTRIENT_EXTRACTION_API_KEY" \
  -F "file=@tests/assets/example.pdf" \
  -F 'instructions={"mode":"structure","output":{"format":"spatial"}}' \
  | tee tests/fixtures/extraction-spatial-sample.json | jq '.output.elements[0], .metrics'

# 2b. Markdown (text mode, ~1 credit)
curl -sS -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer $NUTRIENT_EXTRACTION_API_KEY" \
  -F "file=@tests/assets/example.pdf" \
  -F 'instructions={"mode":"text","output":{"format":"markdown"}}' \
  | jq '.output | keys'
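
For reference, the `instructions` field in the two requests above could be assembled by a helper like the following sketch. The cross-field rules (text mode rejects spatial, spatial requires outputPath) come from the commit notes in this PR; the helper name `buildInstructions`, the `ExtractRequest` type, and the placement of `includeWords` under `output` and `language` under `options` are assumptions that the live call is meant to confirm.

```typescript
type Mode = "text" | "structure" | "understand" | "agentic";
type Format = "spatial" | "markdown";

// Hypothetical tool-side request; outputPath stays local and is not sent to the API.
interface ExtractRequest {
  mode: Mode;
  format: Format;
  outputPath?: string;
  includeWords?: boolean;
  language?: string;
}

// Returns the JSON string for the multipart `instructions` field.
function buildInstructions(req: ExtractRequest): string {
  if (req.format === "spatial" && req.mode === "text") {
    throw new Error("text mode does not support spatial output");
  }
  if (req.format === "spatial" && !req.outputPath) {
    throw new Error("spatial output requires outputPath");
  }
  const output: { format: Format; includeWords?: boolean } = { format: req.format };
  if (req.includeWords !== undefined) output.includeWords = req.includeWords;
  const instructions: Record<string, unknown> = { mode: req.mode, output };
  // Assumed placement: language nests under options.
  if (req.language !== undefined) instructions["options"] = { language: req.language };
  return JSON.stringify(instructions);
}
```

Called with `{ mode: "structure", format: "spatial", outputPath: "out.json" }`, this produces the same JSON string as the `instructions` value in step 2a.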

Acceptance criteria — the code reads exactly these paths (src/dws/extract.ts)

  • Spatial: output.elements is an array; each element has type, confidence, bounds.{x,y,width,height}, page.pageIndex. metrics.pagesProcessed is present.
    • summarizeSpatial uses element.type / element.confidence / element.page.pageIndex / metrics.pagesProcessed
    • query_extraction filters on element.bounds / confidence / page.pageIndex / type
  • Markdown: response has output.markdown (string).
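
The spatial criteria above could be checked mechanically against the captured fixture with something like the sketch below. `checkSpatialShape` and `isRecord` are hypothetical helpers, not part of this PR; the field paths are the ones listed in the acceptance criteria.

```typescript
function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null;
}

// Returns a list of mismatches; an empty list means the fixture has the
// paths the acceptance criteria name.
function checkSpatialShape(body: unknown): string[] {
  if (!isRecord(body)) return ["response is not an object"];
  const problems: string[] = [];
  const output = body["output"];
  const elements = isRecord(output) ? output["elements"] : undefined;
  if (!Array.isArray(elements)) {
    problems.push("output.elements is not an array");
  } else {
    elements.forEach((el: unknown, i: number) => {
      if (!isRecord(el)) {
        problems.push(`elements[${i}] is not an object`);
        return;
      }
      if (typeof el["type"] !== "string") problems.push(`elements[${i}].type is not a string`);
      if (typeof el["confidence"] !== "number") problems.push(`elements[${i}].confidence is not a number`);
      const bounds = el["bounds"];
      for (const key of ["x", "y", "width", "height"]) {
        if (!isRecord(bounds) || typeof bounds[key] !== "number") {
          problems.push(`elements[${i}].bounds.${key} is not a number`);
        }
      }
      const page = el["page"];
      if (!isRecord(page) || typeof page["pageIndex"] !== "number") {
        problems.push(`elements[${i}].page.pageIndex is not a number`);
      }
    });
  }
  const metrics = body["metrics"];
  if (!isRecord(metrics) || typeof metrics["pagesProcessed"] !== "number") {
    problems.push("metrics.pagesProcessed is not a number");
  }
  return problems;
}
```

Running this over the parsed contents of tests/fixtures/extraction-spatial-sample.json would give a quick pass/fail signal, with each mismatch named by path.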

If the field names match → no code change needed; just drop the captured response into tests/fixtures/ and (optionally) swap the inline fixture in tests/extract.test.ts for it, then this is release-ready.

If anything differs → adjust the SpatialElement / ExtractionResponse types and the filter/summary field access in src/dws/extract.ts (and the instructions shape if options.language / output.includeWords placement differs from the docs).

Tracked as U0 in docs/plans/2026-06-07-001-feat-dws-data-extraction-workflow-plan.md.
