docs: deltalake — drop external-direct; add $viewdefinition-export #18

Open
spicyfalafel wants to merge 22 commits into main from docs/data-lakehouse-cleanup-and-viewdefinition-export

@spicyfalafel
Collaborator

Summary

Two related doc changes: one for the Data Lakehouse Topic Destination tutorial and one for the SQL on FHIR (SoF) section.

1. Trim the Data Lakehouse tutorial to two writeModes

docs/tutorials/subscriptions-tutorials/data-lakehouse-aidboxtopicdestination.md — drop ~270 lines of `external-direct` content (one of the three historical writeModes). The code path is still in the module, just no longer in customer-facing docs. Touches:

  • Overview: "Three write strategies" → "Two"; corresponding bullet dropped; mermaid flow simplified.
  • Choosing-between-modes table: 3-column → 2-column; "files in own bucket" bullet dropped.
  • Authentication table: external-direct row dropped; static-AWS-keys paragraph dropped.
  • Configuration parameters: external-direct tab dropped from the per-mode tabs.
  • Service-principal grants: external-direct tab dropped.
  • Removed entirely: "Alternative: external-direct configuration" section, "Static AWS keys" subsection, "How it works — external-direct mode" subsection, "external-direct mode (manual)" schema-evolution subsection.
  • Maintenance section: external-direct paragraph dropped; managed-modes content surfaced as the body.
  • Troubleshooting: two external-direct-specific entries dropped + EXTERNAL_ACCESS_DISABLED_ON_METASTORE merged with the EXTERNAL USE SCHEMA entry.

2. New page: `$viewdefinition-export` operation

`docs/modules/sql-on-fhir/operation-viewdefinition-export.md` (new). Documents the SQL-on-FHIR v2 `$viewdefinition-export` operation that sansara now ships first-party (HealthSamurai/sansara#7691, merged):

  • One-shot async export of a ViewDefinition's rows to a backend-provided sink.
  • Aidbox owns the FHIR-side wiring; backends are external modules dispatched by `kind`.
  • First backend: `topic-destination-deltalake` for `kind=data-lakehouse` (Databricks Delta).
  • Documents kick-off shape, spec-defined params + their MVP status, polling protocol, failure model.

Linked from:

  • `SUMMARY.md` — new nav entry under SQL on FHIR.
  • `docs/modules/sql-on-fhir/README.md` — new "Export a ViewDefinition's rows" section.
  • `data-lakehouse-aidboxtopicdestination.md` — new "Ad-hoc one-shot export" section + Related-docs link.

Test plan

  • gitbook preview renders both pages cleanly.
  • Anchors in the new file resolve (intra-page TOC).
  • No broken intra-doc links — old anchors like `#alternative-external-direct-configuration` are gone, references updated.
  • Customer reading the deltalake tutorial top-to-bottom still has a coherent flow without external-direct dangling references.

🤖 Generated with Claude Code

spicyfalafel and others added 22 commits May 22, 2026 10:34
Trims the customer-facing Data Lakehouse Topic Destination tutorial to
the two managed modes that we recommend (managed-zerobus + managed-sql);
the external-direct writeMode stays in the module's code but no longer
appears in the docs. ~270 lines removed (write-mode subsection,
comparison-table column, Configuration parameters tab, SP-grants tab,
Alternative-external-direct section, "How it works external-direct"
subsection, maintenance subsection, manual schema-evolution subsection,
two troubleshooting entries).

New page docs/modules/sql-on-fhir/operation-viewdefinition-export.md
documents the SQL-on-FHIR v2 $viewdefinition-export operation that
sansara now ships first-party; explains the kind-based backend
registry, lists currently registered backends, the kick-off request
shape, the spec-defined params (with their MVP status), the status
polling protocol, and the failure model. Linked from the SoF README
+ SUMMARY navigation.

The deltalake tutorial gains a new "Ad-hoc one-shot export" section
pointing at the $viewdefinition-export op page (since this module is
the first backend, kind=data-lakehouse) and a Related-docs link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…op page

- 'no backend for kind' is reported as an asynchronous job failure (the
  defmulti :default throws inside the future), not as a synchronous 400.
  Updated the Failure model section.
- Added an AWS-only callout to the operation page itself so users
  reading just that page aren't surprised at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xport flow

For the data-lakehouse backend the operation is literally the same
managed-initial-export! path the AidboxTopicDestination runs on its
first start — same staging Delta + MERGE INTO + drop-staging, just
exposed standalone with no continuous-streaming worker around it. Made
that explicit on both pages so readers don't have to discover the
relationship by reading source.

- operation-viewdefinition-export.md: new "Relationship to
  AidboxTopicDestination's initial export" section with a comparison
  table.
- data-lakehouse-aidboxtopicdestination.md ad-hoc section: short
  callout pointing at the existing "Initial export" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User feedback: the previous wording ("AWS only", "aws_temp_credentials")
was technical and easy to miss. Replaces it with explicit "AWS S3 only;
GCS and Azure ADLS Gen2 not supported" callouts, escalated to
'warning' style on the tutorial, repeated in the operation page's
Cloud support section, and mentioned in Prerequisites and on the
stagingTablePath parameter row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two flows are the same path — call that out at the top of the
Initial-export section, not just bottom-up from the ad-hoc section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mermaid + step-by-step explanation of the staging-Delta + MERGE INTO
flow moves from the topic-destination tutorial to the operation page —
the operation is the canonical entry point for that flow (and the
topic-destination's initial export reuses the same code path
internally). Topic tutorial keeps only an inline pointer; 'Large-scale
initial export' subsection stays in the tutorial since it's about a
destination-only parameter (initialExportParallelism on
AidboxTopicDestination, not on the operation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous wording was vague — readers had no way to pick N when their
deployment has multiple Aidbox pods. New wording spells out:

- N is the cluster-wide chunk count, not a per-pod thread count.
- Each pod auto-sizes its local worker pool to min(N, its cores).
- Workers coordinate cluster-wide via PG advisory locks; no leader,
  no external service.
- Effective concurrency is min(N, total cores across all pods); raising
  N higher than that wastes CPU on lock-claim spin.
- Practical sizing rule: N ≈ sum of cores across all pods, capped by
  max_connections / ~2 (each worker holds ~2 PG connections).

Suggested-values table now has rows for 1 pod / 2-4 pod HA / 4+ pod
deployments instead of the single-node-centric values it had before.
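
The two `min(...)` rules in the bullets above can be sketched as follows. This is only an illustration of the arithmetic described in this commit message; the function names are hypothetical and are not part of the module.

```python
def per_pod_workers(n_chunks: int, pod_cores: int) -> int:
    """Local worker-pool size for one pod: min(N, its cores)."""
    return min(n_chunks, pod_cores)


def effective_concurrency(n_chunks: int, cores_per_pod: list[int]) -> int:
    """Cluster-wide concurrency: min(N, total cores across all pods)."""
    return min(n_chunks, sum(cores_per_pod))


# Hypothetical cluster: four pods with 8 cores each, N = 64 chunks.
cluster = [8, 8, 8, 8]
print(per_pod_workers(64, 8))              # each pod runs at most 8 workers
print(effective_concurrency(64, cluster))  # only 32 chunks are worked on at once
```

With N = 64 on this cluster, the remaining chunks simply wait for a free worker, which is why raising N past the total core count buys no extra throughput.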

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reader asked for a formula instead of a vague rule. Replaces the
'N ≈ sum of cores' sentence with a code block showing the two
ceilings (core count and PG connection budget), what each variable
means, what the per-worker connection cost actually is (2), and a
worked example for a 4-pod × 8-core cluster.
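
A minimal sketch of the two-ceiling formula this commit describes, assuming the connection ceiling is `max_connections` divided by the per-worker cost of 2. The function name and the `max_connections` values are hypothetical; the actual code block in the docs may reserve connections for other traffic.

```python
CONNECTIONS_PER_WORKER = 2  # per-worker PG connection cost stated in the commit message


def suggested_parallelism(pods: int, cores_per_pod: int, max_connections: int) -> int:
    """Pick N as the lower of the core ceiling and the PG connection ceiling."""
    core_ceiling = pods * cores_per_pod
    connection_ceiling = max_connections // CONNECTIONS_PER_WORKER
    # Never suggest fewer than one chunk.
    return max(1, min(core_ceiling, connection_ceiling))


# 4 pods x 8 cores with an assumed max_connections of 100: cores are the limit.
print(suggested_parallelism(4, 8, 100))
# Same cluster with an assumed max_connections of 40: connections are the limit.
print(suggested_parallelism(4, 8, 40))
```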

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…page

Reviewer feedback: an op-page reader shouldn't need to know what an
AidboxTopicDestination is to understand the operation. The cross-link
remains in the other direction (the topic-destination tutorial says
'this same flow is exposed as $viewdefinition-export') and a brief
'Databricks-side setup is documented in the topic tutorial' pointer
stays inside How-it-works for the setup steps the reader genuinely
needs to find.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop 'destination kind is data-lakehouse-at-least-once' from
  Background (not needed there; the kind is in the configuration
  examples where it actually has to be typed).
- Merge the two writeMode subsections into one. They share the same
  target, init flow, schema-drift handling and maintenance — only
  the hot-path transport differs. One mermaid diagram with both arms
  replaces two near-identical diagrams + the standalone 'Choosing
  between the two modes' table (most rows of that table were identical
  copies anyway).
- Tighten the 'setting up initial-bulk staging' hint to one line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User asked whether millions of rows cause an HTTP timeout. They don't —
POST AidboxTopicDestination returns 201 after bootstrap (1-2 min worst
case on a cold warehouse), initial-export runs in a background future
sized only by the dataset. Made that explicit:

- New 'Timing & monitoring' subsection in Initial export.
- Phase table showing what runs sync inside POST vs in the future.
- The $status fields you read during init-export (status / rowsSent /
  error).
- Retry behaviour (3× exp backoff, idempotent MERGE on id).
- Note that continuous worker starts in parallel with init-export, not
  serialized after it.
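
The retry behaviour listed above could look roughly like this. It is a sketch under assumptions: "3×" is read as three attempts in total, the 1-second base delay is invented for illustration, and `run_with_retries` is a hypothetical helper, not the module's code. Retrying is safe only because the MERGE is idempotent on `id`, as the bullet states.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    step: Callable[[], T],
    attempts: int = 3,            # assumption: "3x" means three attempts in total
    base_delay_s: float = 1.0,    # assumption: the real base delay is not documented here
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `step`, retrying failures with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return step()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.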

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l detail)

Tutorial:
- Overview mermaid said 'Cloud storage S3 / GCS / ADLS' even though we
  now flag AWS-S3-only. Fixed to 'AWS S3 bucket'.
- 'The flow' enumeration listed both modes inline (steps 4-5), redundant
  with the Write modes subsection that follows. Collapsed to one step
  that points at Write modes.
- The two writeMode tabs in Configuration still had stale 'Recommended
  4-8 for ≥1M-row datasets; 16-32 for multi-Aidbox setups' on the
  initialExportParallelism row. Replaced with a pointer to the formula
  section.

Operation page:
- Three stacked info-hints at the top were noisy. Merged into one.
- 'Registered backends' line and Failure-model entry for 'no backend
  registered for kind' described the user-visible behaviour with an
  implementation-detail leak ('the defmulti's :default method raises a
  clear ex-info'). Reworded.
- Step 1 of How-it-works referenced 'sof.<view>' as jargon. Spelled out
  what that means ('the SQL-on-FHIR materialized view in Aidbox's
  PostgreSQL').

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…inations

Remnant of the external-direct writeMode that was removed earlier in
this PR. Reframed the use-cases to ones that work with managed-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace the "Status registry is in-process" MVP limitation with the
  shipped architecture: canonical state lives in the data-lakehouse
  backend's Postgres `viewdefinition_export_jobs` table (shared across
  all Aidbox pods); pod-local cache holds only the echo-only spec
  fields plus the `kind` needed to route the status defmulti. Polls
  arriving on a different pod than the kick-off return 404 — call out
  the hostname-stickiness assumption explicitly so cluster operators
  configure their LB accordingly.
- Document `initialExportParallelism` for `$viewdefinition-export`:
  the operation reuses the same chunked + per-pod-advisory-locked
  multi-pod path the AidboxTopicDestination initial export does, so
  a single export now scales across the whole cluster's cores.
  Sizing guidance cross-links to the tutorial's existing Large-scale
  initial export section instead of duplicating the formula.
- Add a "Multi-pod execution" section to the operation page (after
  "How it works", before "Cloud support") explaining the shipped
  architecture: canonical state in `tds.viewdefinition_export_jobs`,
  cross-pod fanout via `cache_replication_msgs` PG NOTIFY (same path
  as AidboxTopicDestination create/delete), chunk-claim via
  `pg_try_advisory_lock`, implicit crash recovery via session-level
  lock auto-release. Includes a sequence diagram showing two pods
  racing for chunks. Cross-links to the tutorial's existing sizing
  formula instead of duplicating it.
- Promote the 404-on-non-kick-off-pod note from a buried Limitations
  bullet into a dedicated "Troubleshooting" section with concrete LB
  guidance (cookie affinity, source-IP hash, honouring
  Content-Location). Limitations now links to it.
- In the tutorial's Ad-hoc one-shot export section, replace the dense
  parallelism+stickiness paragraph with a "Scaling and multi-pod
  execution" subsection naming the shipped mechanism explicitly and
  cross-linking the operation page's Troubleshooting.
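
The chunk-claim idea in this commit (pods racing for chunks through `pg_try_advisory_lock`) can be simulated in-process. This sketch is only an analogy: a non-blocking `threading.Lock` stands in for the PostgreSQL advisory lock, threads stand in for pods, and crash recovery through session-level lock release is not modelled.

```python
import threading


def run_pods(n_chunks: int, n_pods: int) -> dict[int, str]:
    """Let several simulated pods race for chunks; return chunk_id -> owning pod."""
    # One lock per chunk, playing the role of pg_try_advisory_lock(chunk_id).
    locks = [threading.Lock() for _ in range(n_chunks)]
    owner: dict[int, str] = {}

    def pod(name: str) -> None:
        for chunk_id, lock in enumerate(locks):
            # Non-blocking attempt: returns False at once if another pod holds it.
            if lock.acquire(blocking=False):
                # The lock is never released, so no other pod can claim the chunk.
                owner[chunk_id] = name

    threads = [threading.Thread(target=pod, args=(f"pod-{i}",)) for i in range(n_pods)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return owner


claims = run_pods(n_chunks=8, n_pods=2)
print(sorted(claims))  # every chunk id appears exactly once
```

No leader is needed: whichever pod wins the non-blocking attempt owns the chunk, and the losers move on to the next one.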
…gable backend API

The operation was re-architected away from a custom multi-pod implementation
(module-owned tds.viewdefinition_export_jobs table + pg_try_advisory_lock +
cache_replication_msgs NOTIFY fan-out + per-pod *export-meta cache with the
404-on-non-kick-off-pod bug) onto Aidbox's standard async-task engine — the
same db-scheduler-backed engine that powers $purge, box.sdc.fhir.workflow,
and box.operations. We were reinventing the wheel; async-api gives us
cross-pod execution, restart safety, and lease-based crash recovery for
free.

Job state now lives in shared db_scheduler.scheduled_tasks, so any pod can
answer any status poll — the LB-stickiness / cookie-affinity guidance is
gone with it.

Backend modules now plug in by implementing four mandatory defmultis on
their kind (plan-export, setup-export, export-chunk, finalize-export) and
one optional one (cancel-export). See aidbox-api
io.healthsamurai.topics.api for the contract.
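
To illustrate the plug-in contract described above, here is a small Python analogue of dispatching on `kind`. The real contract is a set of Clojure defmultis in `io.healthsamurai.topics.api`; the registry, function names, and handler signatures below are hypothetical and exist only to show the shape of four mandatory methods plus one optional one.

```python
from typing import Any, Callable, Dict

MANDATORY = ("plan-export", "setup-export", "export-chunk", "finalize-export")
OPTIONAL = ("cancel-export",)

_registry: Dict[str, Dict[str, Callable[..., Any]]] = {}


def register_backend(kind: str, handlers: Dict[str, Callable[..., Any]]) -> None:
    """Register a backend for `kind`; all four mandatory methods must be present."""
    missing = [m for m in MANDATORY if m not in handlers]
    if missing:
        raise ValueError(f"backend {kind!r} is missing: {', '.join(missing)}")
    _registry[kind] = dict(handlers)


def dispatch(kind: str, method: str, *args: Any) -> Any:
    """Call `method` on the backend registered for `kind`."""
    backend = _registry.get(kind)
    if backend is None:
        raise LookupError(f"no backend registered for kind {kind!r}")
    handler = backend.get(method)
    if handler is None:
        if method in OPTIONAL:
            return None  # optional method not implemented: nothing to do
        raise LookupError(f"unknown method {method!r}")
    return handler(*args)


# Hypothetical backend with placeholder handlers.
register_backend("data-lakehouse", {
    "plan-export": lambda n: list(range(n)),
    "setup-export": lambda: "staging ready",
    "export-chunk": lambda chunk: f"chunk {chunk} exported",
    "finalize-export": lambda: "merged",
})
print(dispatch("data-lakehouse", "plan-export", 3))
```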

- Operation page: rewrite Multi-pod execution section + sequence diagram;
  drop the 404-on-non-kick-off-pod Troubleshooting entry and the LB
  session-affinity guidance from Limitations.
- Tutorial: simplify the Ad-hoc one-shot export → Scaling and multi-pod
  subsection — drop the jobs-table / NOTIFY / hostname-stickiness prose
  and cross-link to the operation page for the orchestration details.

The continuous AidboxTopicDestination initial-sync still uses
pg_try_advisory_lock — that path is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Service-principal credentials now come exclusively from the
Aidbox-wide settings (env vars
BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID / _CLIENT_SECRET, or the
equivalent `module.databricks.data-lakehouse.*` setting keys).
Per-destination overrides on AidboxTopicDestination.parameter[]
and per-request fields on the `$viewdefinition-export` body are
no longer accepted — a resolved plaintext secret in the
destination's `resource` jsonb or in
`db_scheduler.scheduled_tasks.task_data` was readable by anyone
with PG access.

- Drop the `databricksClientId` / `databricksClientSecret` rows
  from the managed-zerobus and managed-sql parameter tables.
- Drop the two cred entries from each example destination body
  (managed-zerobus + managed-sql).
- Drop the "resolve client_secret from External Secrets" inline-
  vault example (no longer applicable — the secret is set on the
  Aidbox box, not on the resource).
- Drop the External Secrets bullet referencing
  `databricksClientSecret` from "Related documentation".
- Add a callout right under the managed-zerobus example noting
  the secret lives on the box itself, with the docker-run env
  snippet showing how to inject it.

The system-resources reference dump (`core-module-resources.md`)
is auto-regenerated from sansara's IG StructureDefinition —
follow-up regen will pick up the schema slice removal there.
Companion to the sansara-side cancel handler (PR #7699 commit
ffbfa71c04): document the DELETE route, its three response codes,
and what cancellation does and doesn't roll back. Tighten the
Limitations bullet: `cancelUrl` discovery in the kick-off
response body is still unwired, but cancellation itself works.
… (N=1 included)

The previous "How it works" section described two architectures side-
by-side: a single staging table for the default case, plus a per-chunk
variant when initialExportParallelism>1. That was a historical accident,
not what the code does. The shipped module ALWAYS uses per-chunk
staging — N=1 is just the degenerate case (one chunk, no UNION ALL
source). Realigning the prose + mermaid avoids future readers
inferring two code paths that don't actually exist, and explains why
we keep per-chunk even at N=1 (Delta-on-S3 has no atomic put-if-absent
on _delta_log/N.json without S3DynamoDBLogStore, so one-writer-per-
Delta is required for correctness — see Delta #1830 / #1410).

Also notes the HikariCP pool clamp introduced in the module: requesting
parallelism beyond the available pool slots is rejected up-front with
parallelism-exceeds-pool 400.
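
The "N=1 is the degenerate case" point can be sketched as SQL assembly. Everything here is an assumption made for illustration: the table names, the exact MERGE text, and how the available pool slots are counted are not taken from the module. Only the single-chunk versus UNION ALL distinction, the merge on `id`, and the `parallelism-exceeds-pool` error name come from the commit messages.

```python
from typing import List, Optional


def merge_source(staging_tables: List[str]) -> str:
    """One staging table per chunk; a single chunk needs no UNION ALL."""
    if not staging_tables:
        raise ValueError("at least one staging table is required")
    if len(staging_tables) == 1:
        return staging_tables[0]
    selects = " UNION ALL ".join(f"SELECT * FROM {t}" for t in staging_tables)
    return f"({selects})"


def merge_statement(target: str, staging_tables: List[str]) -> str:
    """Assumed MERGE shape keyed on id (hypothetical SQL text)."""
    return (
        f"MERGE INTO {target} AS t USING {merge_source(staging_tables)} AS s "
        "ON t.id = s.id "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def clamp_error(requested: int, pool_slots: int) -> Optional[str]:
    """Return the error code when parallelism exceeds the pool, else None."""
    return "parallelism-exceeds-pool" if requested > pool_slots else None


print(merge_statement("patients", ["patients_stg_0"]))
print(merge_statement("patients", ["patients_stg_0", "patients_stg_1"]))
```

Both calls go through the same code path; only the source expression differs.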

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>