Per-user Iceberg warehouse with bring-your-own S3 storage #5135

@mengw15

Feature Summary

A warehouse here is a top-level entity in the catalog hierarchy (Project → Warehouse → Namespace → Table) that owns a set of namespaces (results, runtime_stats, console_logs) and the storage configuration (S3 bucket + credentials) backing their tables. This follows the Lakekeeper warehouse concept.

Today Texera writes all execution outputs (results, runtime_stats, console_logs) into a single global Iceberg warehouse. All users share that one warehouse, and the platform absorbs the storage cost.

This issue proposes a per-user warehouse model: each user registers one or more warehouses, each backed by their own S3 bucket (Bring-Your-Own-S3). Storage cost follows the data owner; users get tenant-isolated namespaces and tables.

Background / Motivation

  • Billing. S3 cost should be attributed to the user who owns the data, not the platform.
  • Isolation. Per-tenant namespaces/tables, no shared blast radius.
  • Prior work. Builds on #4126 (Migrate to Catalog Service and MinIO for Execution Results), which introduced the REST Catalog Service (Lakekeeper) layer. This issue is the next step: make Lakekeeper multi-tenant.

Scope

Per-user warehouses are scoped to the Kubernetes deployment. Local / single-node Docker Compose deployments continue to work as today: PsqlCatalog remains supported and unchanged, and RestCatalog mode keeps its current single global Lakekeeper warehouse (no per-user split).

Proposed Solution or Design

Data model

User ─1:N→ Warehouse                (new)
User ─1:N→ ComputingUnit            (existing)
ComputingUnit ─1:N→ Execution       (existing)
Warehouse ─1:N→ Execution           (new association)
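
The relationships above can be sketched as plain records linked by foreign keys. This is only an illustration: the class and field names (`uid`, `wid`, `cuid`, `eid`, `owner_uid`) are assumptions, not the actual Texera schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class User:
    uid: int


@dataclass(frozen=True)
class Warehouse:
    wid: int
    owner_uid: int  # User 1:N Warehouse (new)


@dataclass(frozen=True)
class ComputingUnit:
    cuid: int
    owner_uid: int  # User 1:N ComputingUnit (existing)


@dataclass(frozen=True)
class Execution:
    eid: int
    cuid: int  # ComputingUnit 1:N Execution (existing)
    wid: int   # Warehouse 1:N Execution (new association)


def executions_in_warehouse(executions: List[Execution], wid: int) -> List[Execution]:
    """Return the executions whose outputs are stored in the given warehouse."""
    return [e for e in executions if e.wid == wid]
```

Each execution carries both a `cuid` and a `wid`, so the warehouse that stores its outputs is recorded independently of the computing unit that ran it.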

ER diagram:

Image

Catalog hierarchy

Texera already has two Catalog implementations:

Catalog (interface)
├── PsqlCatalog          — backed by PostgreSQL
└── RestCatalog          — backed by any Iceberg REST Catalog service (Lakekeeper is one implementation of this)

This design uses RestCatalog with Lakekeeper as the REST Catalog service to deliver per-user warehouses. Lakekeeper owns S3 credentials in its own encrypted DB (Postgres); Texera never persists raw S3 creds, only the Lakekeeper warehouse UUID and non-secret metadata.

Flow A — Registering a warehouse

  1. The user fills in the new Dashboard "Warehouse" tab with the S3 bucket, endpoint, region, and credentials.
  2. The backend posts the credentials directly to Lakekeeper to create the warehouse. The credentials never touch the Texera DB.
  3. Lakekeeper returns the warehouse UUID; Texera stores the reference plus non-secret metadata.
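
A minimal sketch of this flow, assuming two injected callables: `create_in_lakekeeper` stands in for the real Lakekeeper call (its API is not reproduced here) and `save_record` stands in for the Texera DB write. All type and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class S3Config:
    """What the user submits in the "Warehouse" tab, secrets included."""
    bucket: str
    endpoint: str
    region: str
    access_key_id: str
    secret_access_key: str


@dataclass(frozen=True)
class WarehouseRecord:
    """What Texera persists: the Lakekeeper reference plus non-secret metadata."""
    owner_uid: int
    lakekeeper_warehouse_id: str
    bucket: str
    endpoint: str
    region: str


def register_warehouse(
    owner_uid: int,
    cfg: S3Config,
    create_in_lakekeeper: Callable[[S3Config], str],
    save_record: Callable[[WarehouseRecord], None],
) -> WarehouseRecord:
    # Step 2: hand the full config, credentials included, to Lakekeeper.
    warehouse_id = create_in_lakekeeper(cfg)
    # Step 3: keep only the returned UUID and the non-secret fields.
    record = WarehouseRecord(owner_uid, warehouse_id, cfg.bucket, cfg.endpoint, cfg.region)
    save_record(record)
    return record
```

Because `WarehouseRecord` has no credential fields at all, the "never persist raw S3 creds" rule is enforced by the type rather than by convention.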

Sequence diagram:

Image

Flow B — Binding a warehouse to a CU

  1. When the user creates a computing unit (CU), they pick which warehouse it uses.
  2. At execution time, Texera instantiates a RestCatalog for that CU using the warehouse's Lakekeeper UUID — no global singleton on the hot path.
  3. Two-layer split at runtime:
    • Catalog path: RestCatalog talks to Lakekeeper for metadata operations (resolve table, create / commit snapshots, schema changes). Lakekeeper owns the warehouse → S3 path mapping.
    • Data path: the Iceberg client reads and writes Parquet files directly in the user's S3 bucket, using short-lived credentials vended by Lakekeeper per request. Lakekeeper does not proxy S3 traffic.
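
One way to picture "no global singleton" is a small registry that builds one catalog client per CU. The sketch below is an assumption-heavy illustration: the property names follow the PyIceberg REST catalog convention (`type`, `uri`, `warehouse`, `header.X-Iceberg-Access-Delegation`), the identifier Lakekeeper expects in `warehouse` may differ from the UUID, and `PerCuCatalogs` and `open_catalog` are hypothetical names. Texera's actual client may be configured differently.

```python
from typing import Any, Callable, Dict


def rest_catalog_properties(lakekeeper_uri: str, warehouse_ref: str) -> Dict[str, str]:
    """Properties for one warehouse-scoped REST catalog client."""
    return {
        "type": "rest",
        "uri": lakekeeper_uri,
        "warehouse": warehouse_ref,
        # Ask the catalog to vend short-lived storage credentials (data path).
        "header.X-Iceberg-Access-Delegation": "vended-credentials",
    }


class PerCuCatalogs:
    """Creates and caches one catalog per computing unit, not one per process."""

    def __init__(
        self,
        lakekeeper_uri: str,
        warehouse_by_cu: Dict[int, str],
        open_catalog: Callable[[Dict[str, str]], Any],
    ) -> None:
        self._uri = lakekeeper_uri
        self._warehouse_by_cu = warehouse_by_cu
        self._open_catalog = open_catalog
        self._cache: Dict[int, Any] = {}

    def catalog_for(self, cuid: int) -> Any:
        # Build the catalog on first use for this CU, then reuse it.
        if cuid not in self._cache:
            props = rest_catalog_properties(self._uri, self._warehouse_by_cu[cuid])
            self._cache[cuid] = self._open_catalog(props)
        return self._cache[cuid]
```

With PyIceberg, `open_catalog` could be something like `lambda p: load_catalog("texera", **p)`; it is injected here so the sketch stays independent of any particular client library.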

Files land in the user's S3 bucket under the warehouse's root prefix, organized by namespace (results / runtime_stats / console_logs) and per-execution table.
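
As a rough illustration of the namespace / per-execution-table organization, the helper below builds a table identifier for one execution. The `execution_<id>` naming scheme is purely hypothetical; this issue does not specify table names, and the physical object layout under the root prefix is decided by the catalog, not by this function.

```python
from typing import Tuple

# The three namespaces named in this issue.
NAMESPACES = ("results", "runtime_stats", "console_logs")


def table_identifier(namespace: str, execution_id: int) -> Tuple[str, str]:
    """Build a (namespace, table) identifier for one execution's output."""
    if namespace not in NAMESPACES:
        raise ValueError(f"unknown namespace: {namespace}")
    # Hypothetical naming scheme, used only for illustration.
    return (namespace, f"execution_{execution_id}")
```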

Sequence diagram (CU creation + RestCatalog instantiation):

Image

For the execution diagram, see #4126.

Open questions

  • Should a user own multiple warehouses, or exactly one? (The schema allows many.)
  • Shared CU: when User A runs a workflow on a CU owned by User B, whose warehouse stores the results? In other words, should User A be allowed to store results in User B's warehouse?
  • Warehouse deletion semantics: hard-delete the Lakekeeper catalog and leave S3 data orphaned in the user's bucket (Texera has no write access to user buckets), or soft-archive the catalog so existing executions stay readable until the user explicitly purges?
