Merged
30 changes: 28 additions & 2 deletions README.md
@@ -5,9 +5,14 @@
[![CI](https://github.com/SmooAI/clickhouse-kit/actions/workflows/rust.yml/badge.svg)](https://github.com/SmooAI/clickhouse-kit/actions/workflows/rust.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE)

**A safe-by-construction schema toolkit for ClickHouse — built for user-defined, multi-tenant schemas.**
**A safe-by-construction schema toolkit for ClickHouse — for user-defined, multi-tenant schemas, with a TypeScript→Rust bridge for the schemas you author by hand.**

When your customers' data shapes are defined at runtime, you end up turning untrusted input into SQL. `smooai-clickhouse-kit` owns that boundary so the happy path makes **SQL injection and unbounded tables impossible, not merely discouraged** — an allowlisted type system, identifier validation, DDL generation, forward-only migrations, additive evolution, and drift detection. Rows stay [Serde](https://serde.rs)-native (use the [`clickhouse`](https://crates.io/crates/clickhouse) crate's `#[derive(Row)]`), so the kit never reimplements row mapping.
The kit has two jobs:

1. **Runtime toolkit (user-defined / multi-tenant tables).** When your customers' data shapes are defined at runtime, you end up turning untrusted input into SQL. The kit owns that boundary so the happy path makes **SQL injection and unbounded tables impossible, not merely discouraged** — an allowlisted type system, identifier validation, DDL generation, `flexible_table`, forward-only migrations, and additive evolution.
2. **TS→Rust bridge (developer-authored tables).** When TypeScript owns a table's schema, `introspect` reads the live ClickHouse schema back into Rust, `codegen` emits the `#[derive(Row)]` struct, and `check_drift` asserts that the Rust view matches the live DB. No more hand-copied row structs drifting from the schema.

Either way, rows stay [Serde](https://serde.rs)-native (use the [`clickhouse`](https://crates.io/crates/clickhouse) crate's `#[derive(Row)]`) — the kit never reimplements row mapping.

```toml
[dependencies]
@@ -98,6 +103,27 @@ let drift = check_drift(&exec, &[table]).await?;

For growing a per-tenant table, `diff_columns` + `alter_add_columns_sql` emit a guarded, **additive-only** `ALTER TABLE … ADD COLUMN IF NOT EXISTS …` (identifiers quoted; types from your trusted spec, never from the live DB).

## TS→Rust bridge: generate Rust rows from a TS-authored table

When the schema lives in TypeScript, you don't hand-write (and re-sync) the Rust row struct — introspect the live table and generate it:

```rust
use clickhouse_kit::introspect_row_struct;

// Reads system.columns for `events` and emits the Rust source:
let src = introspect_row_struct(&exec, "events", "EventRow").await?;
// #[derive(Debug, Clone, clickhouse::Row, serde::Serialize, serde::Deserialize)]
// pub struct EventRow {
//     pub id: String,   // UUID
//     pub org: String,  // LowCardinality(String)
//     pub n: u64,
//     pub tags: Vec<String>,
//     pub attrs: std::collections::HashMap<String, String>,
// }
```

`ch_type_to_rust` / `rust_row_struct` are also exposed directly. Pair this with `check_drift` in CI to assert the generated Rust view stays ≡ the live (TS-owned) schema — so the Rust side can never silently diverge.

## Design

- **Safe by construction.** The type allowlist is unrepresentable-by-default; identifiers are validated + quoted; tables are bounded. The dangerous bits are impossible, not discouraged.
13 changes: 6 additions & 7 deletions ROADMAP.md
@@ -28,15 +28,14 @@
- **MIT, generic.** Frame every primitive as "multi-tenant ClickHouse," never coupled to one app.
- **Safe by construction.** Every runtime/user-facing primitive validates input; the happy path makes SQL injection and unbounded tables impossible, not merely discouraged.

## Next: Rust-canonical (the customer-shape work runs in Rust)
## Source-of-truth model: TypeScript for static, Rust for dynamic (split by population)

The flexible / multi-tenant surface (runtime construction, the safety layer, flatten/coerce, additive evolution) is consumed by **Rust** services (api-prime, audit-logs, the Ask-Your-Data data platform) — that's where untrusted customer schema input is turned into SQL, and _safe-by-construction only counts in the process holding the input_. So the canonical implementation moves to Rust:
ClickHouse has two schema populations with different natural owners. This mirrors `smooai-postgres-kit`'s TS-source reframe for the developer-authored set, while keeping the runtime engine canonical where TypeScript can't reach:

- **Canonical Rust crate** (`crates/clickhouse-kit`, crates.io, MIT) — rows are Serde-native (`#[derive(Row)]` via the `clickhouse` crate); the crate adds the allowlisted type system, identifier safety, DDL generation, runtime/`flexible` construction, additive evolution, migrations, and drift. The allowlist is _stronger_ than the TS version: disallowed types are **unrepresentable** (no enum variant), so untrusted JSON naming them fails to deserialize at the boundary.
- **No TS consumers → the Rust crate is the standalone product.** Published as **`smooai-clickhouse-kit`** on crates.io (imports as `clickhouse_kit`). There is nothing on the TS side that needs to ride this, so there is **no WASM/npm binding**; the Rust services consume the crate directly. The original TS package served as the reference spec and is retired/static-only.
- **The TS v0.1/v0.2 is the reference spec** the Rust port mirrors (the adversarial safety tests translate almost line-for-line).
- **Tradeoff recorded:** TS compile-time `$inferSelect` row inference can't come from a WASM/runtime core — static TS-authored tables move to codegen'd types. Non-issue for dynamic tables (shapes unknown at compile time).
- **Static, developer-authored tables** (observability, metrics, billing) → **TypeScript is the source of truth** (the `@smooai/clickhouse-kit` TS authoring DX). The Rust crate is the **TS→Rust bridge**: `introspect` reads the live ClickHouse → Rust, `codegen` (`rust_row_struct` / `ch_type_to_rust`) emits the `#[derive(Row)]` struct, and `check_drift` asserts the Rust view ≡ the live (TS-owned) schema — so the Rust side never hand-copies or silently diverges.
- **Dynamic, customer-defined / multi-tenant tables** (Ask-Your-Data, custom tables, audit) → **Rust is canonical**. Created at runtime from untrusted input; safe-by-construction only counts in the process holding the input. The allowlisted type system, identifier safety, DDL gen, `flexible_table`, forward-only migrations, and additive evolution live here. The allowlist is **unrepresentable-by-default** — disallowed types have no enum variant, so untrusted JSON naming them fails to deserialize at the boundary.
- **Crate:** `smooai-clickhouse-kit` on crates.io (imports as `clickhouse_kit`); rows stay Serde-native. **No WASM/npm binding** — the TS side authors static schemas in its own kit; the Rust side bridges + owns the dynamic engine.

Started: the Rust **safety core** (`crates/clickhouse-kit/src/safety.rs`) — `validate_identifier`/`quote_identifier`, the `ColumnTypeSpec` allowlist (+ `to_ch_type`/`is_datetime64`), bounds + reserved — plus runtime **table DDL generation** (`table.rs`: `to_create_table_sql` from an untrusted spec, with identifier/allowlist/bounds/dup guards). Verified **end-to-end against a real ClickHouse** via testcontainers (generate DDL → apply → introspect `system.columns` → insert/select round-trip); the ported adversarial unit suite (injection, disallowed types, bounds, dup columns) is green too. CI runs unit + the testcontainers integration.

**Full surface landed (built via a 4-way Rust fan-out, lead-integrated):** `flexible_table` (the hybrid), `flatten_record` + `coerce_to_table`, `diff_columns` + `alter_add_columns_sql` (additive-only), and the I/O layer — a driver-agnostic `ChExecutor` trait + `run_migrations` (forward-only) + `check_drift` — with a second testcontainers integration exercising migrate + drift against a real ClickHouse. **38 unit + 2 real-ClickHouse integration tests, clippy `-D warnings` clean.** Published to crates.io as **`smooai-clickhouse-kit`** (manual `publish-crate.yml`, `SMOOAI_CARGO_REGISTRY_TOKEN`). No WASM binding — there are no TS consumers; the Rust services consume the crate directly. Rows stay Serde-native.
**Full surface landed (built via a 4-way Rust fan-out, lead-integrated):** `flexible_table` (the hybrid), `flatten_record` + `coerce_to_table`, `diff_columns` + `alter_add_columns_sql` (additive-only), and the I/O layer — a driver-agnostic `ChExecutor` trait + `run_migrations` (forward-only) + `check_drift` — with a second testcontainers integration exercising migrate + drift against a real ClickHouse. Plus the **TS→Rust bridge** (`codegen` + `introspect`): `ch_type_to_rust` / `rust_row_struct` map a ClickHouse table to a `#[derive(Row)]` struct, and `introspect_row_struct(&exec, table, name)` does live-table → Rust source in one call (verified by a third testcontainers integration: create table → introspect → generated struct). **41 unit + 3 real-ClickHouse integration tests, clippy `-D warnings` clean.** Published to crates.io as **`smooai-clickhouse-kit`** (manual `publish-crate.yml`, `SMOOAI_CARGO_REGISTRY_TOKEN`). Rows stay Serde-native.
2 changes: 1 addition & 1 deletion crates/clickhouse-kit/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "smooai-clickhouse-kit"
version = "0.1.0"
version = "0.2.0"
edition = "2021"
rust-version = "1.75"
license = "MIT"
30 changes: 28 additions & 2 deletions crates/clickhouse-kit/README.md
@@ -5,9 +5,14 @@
[![CI](https://github.com/SmooAI/clickhouse-kit/actions/workflows/rust.yml/badge.svg)](https://github.com/SmooAI/clickhouse-kit/actions/workflows/rust.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE)

**A safe-by-construction schema toolkit for ClickHouse — built for user-defined, multi-tenant schemas.**
**A safe-by-construction schema toolkit for ClickHouse — for user-defined, multi-tenant schemas, with a TypeScript→Rust bridge for the schemas you author by hand.**

When your customers' data shapes are defined at runtime, you end up turning untrusted input into SQL. `smooai-clickhouse-kit` owns that boundary so the happy path makes **SQL injection and unbounded tables impossible, not merely discouraged** — an allowlisted type system, identifier validation, DDL generation, forward-only migrations, additive evolution, and drift detection. Rows stay [Serde](https://serde.rs)-native (use the [`clickhouse`](https://crates.io/crates/clickhouse) crate's `#[derive(Row)]`), so the kit never reimplements row mapping.
The kit has two jobs:

1. **Runtime toolkit (user-defined / multi-tenant tables).** When your customers' data shapes are defined at runtime, you end up turning untrusted input into SQL. The kit owns that boundary so the happy path makes **SQL injection and unbounded tables impossible, not merely discouraged** — an allowlisted type system, identifier validation, DDL generation, `flexible_table`, forward-only migrations, and additive evolution.
2. **TS→Rust bridge (developer-authored tables).** When TypeScript owns a table's schema, `introspect` reads the live ClickHouse schema back into Rust, `codegen` emits the `#[derive(Row)]` struct, and `check_drift` asserts that the Rust view matches the live DB. No more hand-copied row structs drifting from the schema.

Either way, rows stay [Serde](https://serde.rs)-native (use the [`clickhouse`](https://crates.io/crates/clickhouse) crate's `#[derive(Row)]`) — the kit never reimplements row mapping.

```toml
[dependencies]
@@ -98,6 +103,27 @@ let drift = check_drift(&exec, &[table]).await?;

For growing a per-tenant table, `diff_columns` + `alter_add_columns_sql` emit a guarded, **additive-only** `ALTER TABLE … ADD COLUMN IF NOT EXISTS …` (identifiers quoted; types from your trusted spec, never from the live DB).
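
As a rough illustration of the additive-only idea, the sketch below computes the columns missing from a live table and emits a single guarded `ALTER TABLE`. It is not the crate's implementation: `missing_columns`, `quote_ident`, and `add_columns_sql` are hypothetical names, the identifier rule (`[A-Za-z_][A-Za-z0-9_]*`) is an assumption, and the real `diff_columns` / `alter_add_columns_sql` signatures may differ.

```rust
// Hypothetical sketch of additive-only evolution. These helpers are NOT the
// crate's `diff_columns` / `alter_add_columns_sql`; names, signatures, and the
// identifier rule are assumptions made for illustration.

/// Columns present in the trusted `desired` spec but absent from the live table.
fn missing_columns<'a>(
    desired: &'a [(String, String)],
    live_names: &[String],
) -> Vec<&'a (String, String)> {
    desired
        .iter()
        .filter(|(name, _)| !live_names.iter().any(|l| l == name))
        .collect()
}

/// Validate an identifier against a conservative rule, then backtick-quote it.
fn quote_ident(name: &str) -> Result<String, String> {
    let mut chars = name.chars();
    let first_ok = matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_');
    if first_ok && chars.all(|c| c.is_ascii_alphanumeric() || c == '_') {
        Ok(format!("`{name}`"))
    } else {
        Err(format!("invalid identifier: {name:?}"))
    }
}

/// Build one `ALTER TABLE` that adds only the missing columns, or `None` when
/// the live table already has every desired column. Types come from the
/// trusted spec, never from the live DB.
fn add_columns_sql(
    table: &str,
    desired: &[(String, String)],
    live_names: &[String],
) -> Result<Option<String>, String> {
    let missing = missing_columns(desired, live_names);
    if missing.is_empty() {
        return Ok(None);
    }
    let mut clauses = Vec::with_capacity(missing.len());
    for (name, ty) in missing {
        clauses.push(format!("ADD COLUMN IF NOT EXISTS {} {ty}", quote_ident(name)?));
    }
    Ok(Some(format!(
        "ALTER TABLE {} {}",
        quote_ident(table)?,
        clauses.join(", ")
    )))
}
```

Because the statement only ever adds absent columns, running it against an already-evolved table yields `None`, and a hostile table or column name is rejected before any SQL string is returned.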

## TS→Rust bridge: generate Rust rows from a TS-authored table

When the schema lives in TypeScript, you don't hand-write (and re-sync) the Rust row struct — introspect the live table and generate it:

```rust
use clickhouse_kit::introspect_row_struct;

// Reads system.columns for `events` and emits the Rust source:
let src = introspect_row_struct(&exec, "events", "EventRow").await?;
// #[derive(Debug, Clone, clickhouse::Row, serde::Serialize, serde::Deserialize)]
// pub struct EventRow {
//     pub id: String,   // UUID
//     pub org: String,  // LowCardinality(String)
//     pub n: u64,
//     pub tags: Vec<String>,
//     pub attrs: std::collections::HashMap<String, String>,
// }
```

`ch_type_to_rust` / `rust_row_struct` are also exposed directly. Pair this with `check_drift` in CI to assert the generated Rust view stays ≡ the live (TS-owned) schema — so the Rust side can never silently diverge.
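
To make the CI check concrete, here is a minimal sketch of a column-level drift comparison, assuming both sides are available as `(column_name, clickhouse_type)` pairs. It is not the crate's `check_drift`: the `Drift` enum and `column_drift` function are hypothetical, and the real check may compare more than names and types.

```rust
use std::collections::BTreeMap;

// Hypothetical sketch of a column-level drift check. This is NOT the crate's
// `check_drift`; the `Drift` enum and `column_drift` are illustrative names.

#[derive(Debug, PartialEq)]
enum Drift {
    /// Expected by the Rust view but absent from the live table.
    Missing(String),
    /// Present in the live table but unknown to the Rust view.
    Unexpected(String),
    /// Present on both sides with different ClickHouse types.
    TypeChanged {
        column: String,
        expected: String,
        live: String,
    },
}

/// Compare `(column_name, clickhouse_type)` pairs from the expected view against
/// the live table. An empty result means no drift.
fn column_drift(expected: &[(&str, &str)], live: &[(&str, &str)]) -> Vec<Drift> {
    // BTreeMap keeps the report order deterministic (sorted by column name).
    let expected_map: BTreeMap<&str, &str> = expected.iter().copied().collect();
    let live_map: BTreeMap<&str, &str> = live.iter().copied().collect();

    let mut out = Vec::new();
    for (name, ty) in &expected_map {
        match live_map.get(name) {
            None => out.push(Drift::Missing(name.to_string())),
            Some(live_ty) if live_ty != ty => out.push(Drift::TypeChanged {
                column: name.to_string(),
                expected: ty.to_string(),
                live: live_ty.to_string(),
            }),
            Some(_) => {}
        }
    }
    for name in live_map.keys() {
        if !expected_map.contains_key(name) {
            out.push(Drift::Unexpected(name.to_string()));
        }
    }
    out
}
```

A CI job built on this shape would fail whenever the returned list is non-empty, which is what turns silent divergence into a visible build failure.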

## Design

- **Safe by construction.** The type allowlist is unrepresentable-by-default; identifiers are validated + quoted; tables are bounded. The dangerous bits are impossible, not discouraged.
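
One way to picture "unrepresentable-by-default" is a closed enum whose parser has no catch-all success path. The sketch below is an assumption-laden illustration: `AllowedType`, its variant set, and `ch_name` are hypothetical, and `FromStr` stands in for whatever deserialization boundary the crate actually uses.

```rust
use std::str::FromStr;

// Hypothetical sketch of an "unrepresentable-by-default" allowlist. The variant
// set is illustrative, and `FromStr` stands in for the crate's real boundary.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AllowedType {
    Str,
    U64,
    I64,
    F64,
    Bool,
}

impl FromStr for AllowedType {
    type Err = String;

    /// Only exact, known names parse. Anything else has no variant to land in.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "String" => Ok(Self::Str),
            "UInt64" => Ok(Self::U64),
            "Int64" => Ok(Self::I64),
            "Float64" => Ok(Self::F64),
            "Bool" => Ok(Self::Bool),
            other => Err(format!("type not allowlisted: {other:?}")),
        }
    }
}

impl AllowedType {
    /// The only type strings that can reach generated SQL are these literals.
    fn ch_name(self) -> &'static str {
        match self {
            Self::Str => "String",
            Self::U64 => "UInt64",
            Self::I64 => "Int64",
            Self::F64 => "Float64",
            Self::Bool => "Bool",
        }
    }
}
```

Since SQL is rendered from the enum rather than from the input string, untrusted text never flows into the statement: it either maps to a fixed literal or fails at the boundary.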
168 changes: 168 additions & 0 deletions crates/clickhouse-kit/src/codegen.rs
@@ -0,0 +1,168 @@
//! TS→Rust bridge codegen. TypeScript owns the (static) schema; this turns a
//! ClickHouse table's live/spec columns into a Rust **row struct** — `#[derive(Row,
//! Deserialize)]` — so the Rust services get faithful, drift-checked rows for the
//! TS-authored tables instead of hand-writing them (the class of bug that bit
//! api-prime's hand-copied structs). Pair with `introspect` + `check_drift`: the
//! generated rows are asserted ≡ the live ClickHouse in CI.
//!
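//! The mapping is a **scaffold**, not a wire-exact decoder: ClickHouse temporal
//! types map to `String` as a conservative placeholder, and a consumer may refine
//! those to `time`/`chrono` types behind the `clickhouse` crate's feature flags.
//! Caveat: RowBinary does not encode `DateTime*`, `Date*`, or `UUID` as strings,
//! so a `String` field for such a column may need a `toString(...)` cast in the
//! query or a refined field type before it deserializes cleanly.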

/// Strip a single-arg wrapper like `Nullable(...)` / `Array(...)`, returning the inner.
fn strip_wrapper<'a>(t: &'a str, name: &str) -> Option<&'a str> {
    let prefix = format!("{name}(");
    t.strip_prefix(&prefix)
        .and_then(|rest| rest.strip_suffix(')'))
}

/// Split a `Map(...)` inner on its top-level comma (respecting nested parens).
fn split_top_comma(inner: &str) -> Option<(&str, &str)> {
    let mut depth = 0usize;
    for (i, c) in inner.char_indices() {
        match c {
            '(' => depth += 1,
            ')' => depth = depth.saturating_sub(1),
            ',' if depth == 0 => return Some((inner[..i].trim(), inner[i + 1..].trim())),
            _ => {}
        }
    }
    None
}

/// Map a ClickHouse type string to the Rust type a `clickhouse`-crate row uses.
/// Wrappers recurse; unknown scalars fall back to `String` (safe over the wire).
pub fn ch_type_to_rust(ch_type: &str) -> String {
    let t = ch_type.trim();
    if let Some(inner) = strip_wrapper(t, "Nullable") {
        return format!("Option<{}>", ch_type_to_rust(inner));
    }
    if let Some(inner) = strip_wrapper(t, "LowCardinality") {
        return ch_type_to_rust(inner);
    }
    if let Some(inner) = strip_wrapper(t, "Array") {
        return format!("Vec<{}>", ch_type_to_rust(inner));
    }
    if let Some(inner) = strip_wrapper(t, "Map") {
        if let Some((k, v)) = split_top_comma(inner) {
            return format!(
                "std::collections::HashMap<{}, {}>",
                ch_type_to_rust(k),
                ch_type_to_rust(v)
            );
        }
    }
    // Scalar: match on the base type, ignoring any `(...)` parameters.
    let base = t.split('(').next().unwrap_or(t).trim();
    match base {
        "Bool" => "bool",
        "UInt8" => "u8",
        "UInt16" => "u16",
        "UInt32" => "u32",
        "UInt64" => "u64",
        "Int8" => "i8",
        "Int16" => "i16",
        "Int32" => "i32",
        "Int64" => "i64",
        "Float32" => "f32",
        "Float64" => "f64",
        // String, UUID, FixedString, Date*, DateTime*, IPv4/6, Enum*, JSON, and
        // anything unrecognized → String (the safe over-the-wire default).
        _ => "String",
    }
    .to_string()
}

/// Rust raw-ident escape for column names that collide with Rust keywords.
fn rust_field_ident(name: &str) -> String {
    const KEYWORDS: &[&str] = &[
        "as", "break", "const", "continue", "crate", "else", "enum", "extern", "false", "fn",
        "for", "if", "impl", "in", "let", "loop", "match", "mod", "move", "mut", "pub", "ref",
        "return", "self", "static", "struct", "super", "trait", "true", "type", "unsafe", "use",
        "where", "while", "async", "await", "dyn",
    ];
    if KEYWORDS.contains(&name) {
        format!("r#{name}")
    } else {
        name.to_string()
    }
}

/// Emit a Rust row struct for a table's columns — `(column_name, clickhouse_type)`
/// pairs. Derives the `clickhouse` crate's `Row` + serde, so it deserializes
/// straight from a query. The emitted source references `clickhouse::Row`
/// (a dev/consumer dependency); this function only produces the string.
pub fn rust_row_struct(struct_name: &str, columns: &[(String, String)]) -> String {
    let mut out = String::new();
    out.push_str(
        "#[derive(Debug, Clone, clickhouse::Row, serde::Serialize, serde::Deserialize)]\n",
    );
    out.push_str(&format!("pub struct {struct_name} {{\n"));
    for (name, ch_type) in columns {
        let field = rust_field_ident(name);
        // Preserve the exact column name for (de)serialization when the field was escaped.
        if field != *name {
            out.push_str(&format!(" #[serde(rename = \"{name}\")]\n"));
        }
        out.push_str(&format!(" pub {field}: {},\n", ch_type_to_rust(ch_type)));
    }
    out.push_str("}\n");
    out
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn maps_scalars() {
        assert_eq!(ch_type_to_rust("String"), "String");
        assert_eq!(ch_type_to_rust("UInt64"), "u64");
        assert_eq!(ch_type_to_rust("Int32"), "i32");
        assert_eq!(ch_type_to_rust("Float64"), "f64");
        assert_eq!(ch_type_to_rust("Bool"), "bool");
        assert_eq!(ch_type_to_rust("UUID"), "String");
        assert_eq!(ch_type_to_rust("DateTime64(3)"), "String");
    }

    #[test]
    fn maps_wrappers_and_containers() {
        assert_eq!(ch_type_to_rust("Nullable(String)"), "Option<String>");
        assert_eq!(ch_type_to_rust("LowCardinality(String)"), "String");
        assert_eq!(
            ch_type_to_rust("LowCardinality(Nullable(String))"),
            "Option<String>"
        );
        assert_eq!(ch_type_to_rust("Array(String)"), "Vec<String>");
        assert_eq!(ch_type_to_rust("Array(UInt32)"), "Vec<u32>");
        assert_eq!(
            ch_type_to_rust("Map(String, String)"),
            "std::collections::HashMap<String, String>"
        );
        assert_eq!(
            ch_type_to_rust("Map(String, Array(UInt8))"),
            "std::collections::HashMap<String, Vec<u8>>"
        );
    }

    #[test]
    fn emits_row_struct_with_keyword_escape() {
        let cols = vec![
            ("id".to_string(), "UUID".to_string()),
            ("count".to_string(), "UInt64".to_string()),
            ("type".to_string(), "LowCardinality(String)".to_string()),
            ("tags".to_string(), "Array(String)".to_string()),
        ];
        let src = rust_row_struct("EventRow", &cols);
        assert!(src.contains(
            "#[derive(Debug, Clone, clickhouse::Row, serde::Serialize, serde::Deserialize)]"
        ));
        assert!(src.contains("pub struct EventRow {"));
        assert!(src.contains("pub id: String,"));
        assert!(src.contains("pub count: u64,"));
        assert!(src.contains("#[serde(rename = \"type\")]"));
        assert!(src.contains("pub r#type: String,"));
        assert!(src.contains("pub tags: Vec<String>,"));
    }
}