` as one Delta commit.
+3. Once all chunks complete, the coordinator task invokes `finalize-export`: `MERGE INTO target USING (SELECT * FROM staging_0 UNION ALL …) ON t.id = s.id WHEN NOT MATCHED THEN INSERT *`.
+4. The module drops every staging table.
+
+On failure the per-chunk stagings are best-effort dropped via `cancel-export`. Chunks retry up to 2 times with a 30-second backoff; if they exhaust retries the export is reported as `failed`.
+
+{% hint style="info" %}
+The `MERGE` is idempotent on `id` — a retried export after a lost response inserts nothing instead of duplicating. Your ViewDefinition must select an `id` column.
+{% endhint %}
+
+The Databricks-side setup (catalog, schema, target table, staging schema, service principal, grants, warehouse) is documented in the [Data Lakehouse tutorial](../../tutorials/subscriptions-tutorials/data-lakehouse-aidboxtopicdestination.md) — the same setup is reused here.
+
+## Multi-pod execution
+
+Chunks run on Aidbox's standard async-task engine, so they distribute across every pod sharing the metastore. Status polling and cancellation answer from any pod — no kick-off-pod affinity, no client-side load-balancer pinning.
+
+A pod failure mid-chunk is recovered automatically: the chunk task's heartbeat lapses and another pod re-leases it.
+
+## Cloud support
+
+The Data Lakehouse backend's staging path currently supports **AWS S3 only** (`s3://...` / `s3a://...`). The Unity Catalog managed target table is unaffected — UC manages its own storage.
+
+{% hint style="info" %}
+Need **GCS** (`gs://...`) or **Azure ADLS Gen2** (`abfss://...`) staging? [Contact us](../../overview/contact-us.md) — they're not wired through the backend yet, but the Aidbox-side wiring is cloud-agnostic and we can prioritize.
+{% endhint %}
+
+## Limitations (current)
+
+- One `view` per request (spec allows `1..*`).
+- `patient` / `group` filters extracted but not yet applied to the SQL. `_since` is applied.
+- `cancelUrl` (the spec's pointer to a cancel endpoint exposed in the kick-off response) is not yet returned. Cancellation itself works — `DELETE` on the status URL is supported (see [Cancellation](#cancellation) above).
+- `estimatedTimeRemaining` is not computed.
diff --git a/docs/tutorials/subscriptions-tutorials/data-lakehouse-aidboxtopicdestination.md b/docs/tutorials/subscriptions-tutorials/data-lakehouse-aidboxtopicdestination.md
index 1a74182d1..9698c7d58 100644
--- a/docs/tutorials/subscriptions-tutorials/data-lakehouse-aidboxtopicdestination.md
+++ b/docs/tutorials/subscriptions-tutorials/data-lakehouse-aidboxtopicdestination.md
@@ -1,5 +1,5 @@
---
-description: Export FHIR resources to a Data Lakehouse — Databricks Unity Catalog managed tables or non-managed external Delta tables on S3 / GCS / Azure ADLS — using SQL-on-FHIR ViewDefinitions.
+description: Export FHIR resources to Databricks Unity Catalog managed Delta tables using SQL-on-FHIR ViewDefinitions.
---
# Data Lakehouse AidboxTopicDestination
@@ -8,11 +8,17 @@ description: Export FHIR resources to a Data Lakehouse — Databricks Unity Cata
This functionality is available starting from Aidbox version **2605**.
{% endhint %}
-This page sets up an `AidboxTopicDestination` that streams FHIR resource changes into Delta-Lake tables — Databricks-managed Unity Catalog tables, or external Delta tables on S3 / GCS / Azure ADLS that you own. Rows are flattened by a [ViewDefinition](../../modules/sql-on-fhir/defining-flat-views-with-view-definitions.md) so analytics consumers see columns, not nested FHIR JSON.
+{% hint style="warning" %}
+**Cloud support: AWS S3 only (today)** for the initial-export staging bucket (`s3://...` / `s3a://...`). The Unity Catalog managed target table is unaffected — UC manages its own storage.
+
+Need GCS or Azure ADLS Gen2? [Contact us](../../overview/contact-us.md) — they're not wired through yet and we can prioritise.
+{% endhint %}
+
+This page sets up an `AidboxTopicDestination` that streams FHIR resource changes into a Databricks Unity Catalog managed Delta table. Rows are flattened by a [ViewDefinition](../../modules/sql-on-fhir/defining-flat-views-with-view-definitions.md) so analytics consumers see columns, not nested FHIR JSON.
## Background
-"Data Lakehouse" is the generic name for the destination category — a hybrid of object-storage data lake and warehouse, implemented here on top of the Delta Lake table format. Concretely the module writes Delta-formatted tables that can live on plain cloud object storage you own, or in Databricks Unity Catalog managed storage; either way the destination kind is the same (`data-lakehouse-at-least-once`).
+"Data Lakehouse" is the generic name for the destination category — a hybrid of object-storage data lake and warehouse, implemented here on top of the Delta Lake table format. The module writes a Delta-formatted Unity Catalog managed table.
If you're already comfortable with Databricks, Unity Catalog, and Delta Lake, skip to [Overview](#overview).
@@ -56,129 +62,41 @@ Unity Catalog tables come in two flavours:
| Storage location | Databricks-managed cloud storage (path picked by Unity Catalog) | Your bucket — declared with `LOCATION 's3://...' / 'gs://...' / 'abfss://...'` at `CREATE TABLE` |
| Who owns the files | Unity Catalog — manages read, write, storage, and optimization | You — Unity Catalog manages metadata only |
| `DROP TABLE` | Deletes the data | Drops metadata only — files stay in your bucket |
-| Supported write paths from Aidbox | **Zerobus REST ingest** (Aidbox `managed-zerobus`), or **SQL warehouse INSERT** (Aidbox `managed-sql`) | **Direct Parquet + Delta commit** via STS-vended Unity Catalog creds (Aidbox `external-direct`) |
+| Supported write paths from Aidbox | **Zerobus REST ingest** (Aidbox `managed-zerobus`), or **SQL warehouse INSERT** (Aidbox `managed-sql`) | Not supported by this module — use a managed table |
| External STS credential vending| Not available for managed targets (`EXTERNAL USE SCHEMA` is only grantable on external schemas) | Allowed if the principal has `EXTERNAL USE SCHEMA` on the schema |
| Predictive Optimization | Enabled by default for accounts created on or after **2024-11-11**; runs `OPTIMIZE` / `VACUUM` / `ANALYZE` automatically. Billed under the **Jobs Serverless** SKU. | **Not supported** — Predictive Optimization runs only on managed tables |
| Liquid Clustering | Opt-in per table (automatic liquid clustering requires Predictive Optimization and is also opt-in) | Opt-in per table |
-The "Supported write paths" row drives the module's three `writeMode` values — see [Overview](#overview) for the resulting write paths.
+The "Supported write paths" row drives the module's `writeMode` values — see [Overview](#overview) for the resulting write paths.
## Overview
The module exports FHIR resources from Aidbox to a Delta Lake table in a flattened format using [ViewDefinitions](../../modules/sql-on-fhir/defining-flat-views-with-view-definitions.md) (SQL-on-FHIR).
-```mermaid
-graph LR
- Client[User / FHIR API client]:::blue2
- Aidbox[Aidbox]:::blue2
- PG[(Aidbox PostgreSQL)]:::neutral2
- Mod[Data Lakehouse module]:::yellow2
- DBX[Databricks workspace]:::green2
- FS[(Cloud storage
S3 / GCS / ADLS)]:::violet2
-
- Client -- FHIR POST / PUT / DELETE --> Aidbox
- Aidbox -- write resource +
enqueue topic event --> PG
- Mod -- poll batch --> PG
- Mod -- REST ingest
(managed-zerobus) --> DBX
- Mod -- SQL INSERT
(managed-sql) --> DBX
- Mod -- Delta write
(external-direct) --> FS
- DBX -- read / write Delta files --> FS
-```
+**Live (streaming) flow** — every resource change is enqueued in Aidbox's topics PG queue, batched by the sender, and pushed to Databricks via the chosen `writeMode`:
+
+
The flow:
1. A FHIR API client (a user, an integration, a backfill script) sends a `POST` / `PUT` / `DELETE` to Aidbox.
2. Aidbox persists the resource and enqueues a topic event for the destination in PostgreSQL.
3. The Data Lakehouse module polls the destination's batch from the same PostgreSQL queue.
-4. For `managed-zerobus` mode (default): the module POSTs each batch as a JSON array to Databricks' Zerobus REST ingest endpoint, which writes directly to the managed table. No SQL parsing / planning per write.
-5. For `managed-sql` mode: the module sends `INSERT` (and `ALTER` / `DESCRIBE` when needed) to the Databricks SQL warehouse; the warehouse writes the Delta files to storage.
-6. For `external-direct` mode: the module gets short-lived storage credentials from Unity Catalog and writes Delta files directly to your bucket.
+4. The batch is sent to Databricks via one of the two paths picked by `writeMode` — see [Write modes](#write-modes) below.
The module may also perform an initial export of pre-existing resources at first start — see [Initial export](#initial-export) for when this runs and how to skip it.
### Write modes
-The module supports three **write modes**, picked per-destination via the `writeMode` parameter (see the [Configuration](#configuration) section below for the full parameter list).
-
-### managed-zerobus mode (default)
-
-`writeMode=managed-zerobus` targets a **Databricks Unity Catalog managed table** via the [Zerobus REST ingest endpoint](https://docs.databricks.com/aws/en/ingestion/zerobus-ingest).
-
-```mermaid
-graph LR
- A(FHIR resource POST / PUT / DELETE):::blue2
- B(Aidbox Topics API):::blue2
- C(PostgreSQL queue):::neutral2
- D(ViewDefinition flatten):::yellow2
- Z(Zerobus REST ingest):::green2
- T0(Unity Catalog managed Delta table):::violet2
-
- A --> B --> C --> D --> Z --> T0
-```
-
-- Each batch is JSON-encoded as an array and POSTed to the Zerobus REST endpoint with an OAuth M2M bearer. No SQL parsing, no warehouse cold-start.
-- Initial bulk export uses a one-shot staging Delta table + `MERGE INTO` — same path as `managed-sql`.
-- Schema sync at sender bootstrap hits the SQL warehouse once (`INFORMATION_SCHEMA.COLUMNS` + optional `ALTER TABLE`); live writes don't.
-
-### managed-sql mode
-
-`writeMode=managed-sql` — same target as `managed-zerobus` (Unity Catalog **managed** table), but routes incoming batches through a Databricks SQL warehouse. Use this when Zerobus isn't available on your Databricks SKU.
-
-```mermaid
-graph LR
- A(FHIR resource POST / PUT / DELETE):::blue2
- B(Aidbox Topics API):::blue2
- C(PostgreSQL queue):::neutral2
- D(ViewDefinition flatten):::yellow2
- M(Databricks SQL warehouse):::green2
- T1(Unity Catalog managed Delta table):::violet2
-
- A --> B --> C --> D --> M --> T1
-```
-
-- Each batch becomes a single `INSERT INTO managed (cols) VALUES (...)` against the SQL warehouse.
-- Initial bulk export uses a one-shot staging Delta table + `MERGE INTO`. See [Initial export](#how-it-works-managed-modes).
-
-### external-direct mode
-
-`writeMode=external-direct` targets a **non-managed external Delta table** that you own.
-
-```mermaid
-graph LR
- A(FHIR resource POST / PUT / DELETE):::blue2
- B(Aidbox Topics API):::blue2
- C(PostgreSQL queue):::neutral2
- D(ViewDefinition flatten):::yellow2
- K(Direct Delta writer):::green2
- T2(External Delta table on S3 / GCS / ADLS):::violet2
-
- A --> B --> C --> D --> K --> T2
-```
-
-- The module writes Delta files straight to your bucket from the Aidbox process. No SQL warehouse, no Databricks compute on the write path.
-- Storage backends: AWS S3, Google Cloud Storage, Azure ADLS Gen2.
-- You own table maintenance — schedule `OPTIMIZE` and `VACUUM` yourself. See [Compaction and maintenance](#compaction-and-maintenance).
+The module supports two **write modes**, picked per-destination via the `writeMode` parameter. Both target the same Unity Catalog managed Delta table and share the same initial-bulk staging + `MERGE INTO` flow, the same auto-`ALTER` schema-drift handling, and the same Databricks-side Predictive Optimization. They differ only in **how live batches reach Databricks**:
-## Choosing between the three modes
+**Default to `managed-zerobus`.** Each batch is JSON-encoded and POSTed to the [Zerobus REST ingest endpoint](https://docs.databricks.com/aws/en/ingestion/zerobus-ingest) with an OAuth M2M bearer — no SQL parsing, no warm warehouse. The warehouse is hit once at sender bootstrap for `INFORMATION_SCHEMA.COLUMNS` + optional `ALTER TABLE`; live writes don't touch it.
-**Default to `managed-zerobus`.** Pick a different mode only when one of these applies:
-
-- **Zerobus isn't available on your Databricks SKU** → `managed-sql`. Same managed target, but every batch hits a warm SQL warehouse.
-- **You want the files in your own bucket and own table maintenance yourself** → `external-direct`. No Databricks compute on the write path.
-
-| | `managed-zerobus` (default) | `managed-sql` | `external-direct` |
-| ------------------------------ | ------------------------------------------------------------------------ | ------------------------------------------------------------------------ | ------------------------------------------------------------ |
-| Table type | Unity Catalog **managed** (Databricks owns the files) | Unity Catalog **managed** (Databricks owns the files) | **External** (the User's bucket owns the files) |
-| Hot-path transport | Zerobus REST ingest API | Databricks SQL warehouse (Statement Execution API) | Direct Delta commits via Hadoop FS |
-| Who runs maintenance | Databricks (Predictive Optimization handles `OPTIMIZE` / `VACUUM`) | Databricks (Predictive Optimization handles `OPTIMIZE` / `VACUUM`) | The User schedules `OPTIMIZE` / `VACUUM` |
-| Databricks compute cost surface| **No warm warehouse** — pay-per-row Zerobus + storage only | SQL warehouse must be running to accept INSERTs — Databricks bills uptime | No warehouse — no Databricks compute charge for write path |
-| Schema drift handling | Auto-`ALTER` on mismatch | Auto-`ALTER` on mismatch | User runs `ALTER TABLE` and recreates the destination |
-| Initial export path | Staging Delta on your bucket → `MERGE INTO` target | Staging Delta on your bucket → `MERGE INTO` target | Bulk written straight to the target in one Delta commit |
-| Storage backends | Databricks-managed storage | Databricks-managed storage | AWS S3, GCS, Azure ADLS Gen2 |
+**Use `managed-sql` only when Zerobus isn't available on your Databricks SKU.** Each batch becomes a single `INSERT INTO managed (cols) VALUES (...)` against a SQL warehouse — same target, same idempotent staging-MERGE init, just a warm warehouse on the hot path. The cost difference is the warehouse uptime billing.
## Authentication
-All three modes authenticate to Databricks via [**OAuth Machine-to-Machine (M2M)**](https://docs.databricks.com/aws/en/dev-tools/auth/oauth-m2m) with a service principal: the module exchanges `client_id` + `client_secret` at the workspace token endpoint for a ~1h bearer token, caches it, and re-issues a fresh one when fewer than 5 minutes remain.
+Both modes authenticate to Databricks via [**OAuth Machine-to-Machine (M2M)**](https://docs.databricks.com/aws/en/dev-tools/auth/oauth-m2m) with a service principal: the module exchanges `client_id` + `client_secret` at the workspace token endpoint for a ~1h bearer token, caches it, and re-issues a fresh one when fewer than 5 minutes remain.
The bearer is sent on every Databricks call. What differs between modes is which Databricks surfaces see it:
@@ -186,9 +104,8 @@ The bearer is sent on every Databricks call. What differs between modes is which
|-----------------------------|-----------------------------------------------|--------------------------------------------|----------------------------|--------------------------------------|
| `managed-zerobus` (default) | only during initial-export (staging vending) | bootstrap + initial-export only | Zerobus REST (every batch) | Zerobus ingest service, Databricks-side |
| `managed-sql` | only during initial-export (staging vending) | every batch (`INSERT` / `ALTER` / `DESCRIBE`) | — | SQL warehouse compute |
-| `external-direct` | every cred-refresh (~45 min) | none | — | sender process, with Unity-Catalog-vended STS |
-In `external-direct` you can also skip Databricks entirely and authenticate against the bucket with static AWS keys (`awsAccessKeyId` + `awsSecretAccessKey`) or the [AWS default provider chain](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials-chain.html). The service principal and the grants it needs are set up in the [Usage example](#usage-example-patient-data-export) below.
+The service principal and the grants it needs are set up in the [Usage example](#usage-example-patient-data-export) below.
## Installation
@@ -197,10 +114,10 @@ In `external-direct` you can also skip Databricks entirely and authenticate agai
- Aidbox **2605** or newer ([install guide](../../getting-started/run-aidbox-locally.md))
- A Databricks workspace (Free Edition works for evaluation, paid for production)
- The [Databricks CLI](https://docs.databricks.com/aws/en/dev-tools/cli/install) installed locally (`brew install databricks/tap/databricks` on macOS) — every Databricks-side operation in the tutorial uses it
-- AWS CLI (only for `managed-*` modes that do initial-export — for the staging bucket + IAM role)
-- A SQL warehouse (skip only for `external-direct`)
+- AWS CLI (for initial-export staging — bucket + IAM role)
+- A SQL warehouse
- For `managed-zerobus`: Zerobus enabled on your SKU (Databricks Free Edition supports it; for paid plans confirm with Databricks support)
-- For initial-export in the `managed-*` modes: an S3/GCS/ADLS bucket you control
+- For initial-export: an **S3 bucket** you control. GCS and Azure ADLS Gen2 are not supported for the staging path today (see the "Cloud support: AWS only" callout above).
The service principal that authenticates the module is created in step 3 of the usage example — you don't need it before you start.
@@ -209,7 +126,7 @@ The service principal that authenticates the module is created in step 3 of the
1. Download the Databricks module JAR file and place it next to your **docker-compose.yaml**:
```sh
- curl -O https://storage.googleapis.com/aidbox-modules/topic-destination-deltalake/topic-destination-deltalake-2605.0.jar
+ curl -O https://storage.googleapis.com/aidbox-modules/topic-destination-deltalake/databricks-module-2605.0.jar
```
2. Edit your **docker-compose.yaml** and add these lines to the Aidbox service:
@@ -217,11 +134,11 @@ The service principal that authenticates the module is created in step 3 of the
```yaml
aidbox:
volumes:
- - ./topic-destination-deltalake-2605.0.jar:/topic-destination-deltalake.jar
+ - ./databricks-module-2605.0.jar:/databricks-module.jar
# ... other volumes ...
environment:
- BOX_MODULE_LOAD: io.healthsamurai.topic-destination.data-lakehouse.core
- BOX_MODULE_JAR: "/topic-destination-deltalake.jar"
+ BOX_MODULE_LOAD: io.healthsamurai.databricks.core
+ BOX_MODULE_JAR: "/databricks-module.jar"
BOX_FHIR_SCHEMA_VALIDATION: "true"
# ... other environment variables ...
```
@@ -259,9 +176,9 @@ spec:
- -c
- |
apt-get -y update && apt-get -y install curl
- curl -L -o /modules/topic-destination-deltalake.jar \
- https://storage.googleapis.com/aidbox-modules/topic-destination-deltalake/topic-destination-deltalake-2605.0.jar
- chmod 644 /modules/topic-destination-deltalake.jar
+ curl -L -o /modules/databricks-module.jar \
+ https://storage.googleapis.com/aidbox-modules/topic-destination-deltalake/databricks-module-2605.0.jar
+ chmod 644 /modules/databricks-module.jar
volumeMounts:
- mountPath: /modules
name: modules-volume
@@ -270,9 +187,9 @@ spec:
image: healthsamurai/aidboxone:edge
env:
- name: BOX_MODULE_LOAD
- value: "io.healthsamurai.topic-destination.data-lakehouse.core"
+ value: "io.healthsamurai.databricks.core"
- name: BOX_MODULE_JAR
- value: "/modules/topic-destination-deltalake.jar"
+ value: "/modules/databricks-module.jar"
- name: BOX_FHIR_SCHEMA_VALIDATION
value: "true"
# ... other environment variables ...
@@ -303,12 +220,10 @@ All requests in this tutorial use `Content-Type: application/json`.
databricksWorkspaceUrl | string | https://<workspace>.cloud.databricks.com |
databricksWorkspaceId | string | Numeric workspace ID (e.g. 1234567890123456). Composes the Zerobus REST endpoint host |
databricksRegion | string | Workspace AWS region (e.g. us-east-1). Composes the Zerobus REST endpoint host |
-databricksClientId | string | Service principal client_id for OAuth M2M |
-databricksClientSecret | string | Service principal client_secret; supports vault refs |
tableName | string | Managed table full name: catalog.schema.table |
databricksWarehouseId | string | SQL warehouse ID — used at bootstrap for schema sync + (if initial-export runs) the final MERGE INTO. No warm-warehouse traffic during live writes. |
awsRegion | string | AWS region of the staging bucket |
-stagingTablePath | string | s3://bucket/path/ for the staging Delta table created during initial export. Required when skipInitialExport is not true |
+stagingTablePath | string | s3://bucket/path/ for the staging Delta table created during initial export (S3 only today). Required when skipInitialExport is not true |