Commit 3cebdd3

Move cross-database diffing to Tobiko Cloud features
1 parent a59ecfe commit 3cebdd3

3 files changed

+147
-120
lines changed

docs/cloud/features/xdb_diffing.md

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
1+
# Cross-database Table Diffing
2+
3+
Tobiko Cloud extends SQLMesh's [within-database table diff tool](../../guides/tablediff.md) to support comparison of tables or views across different database systems.
4+
5+
It provides a method of validating models that can be used along with [evaluating a model](../../guides/models.md#evaluating-a-model) and [testing a model with unit tests](../../guides/testing.md#testing-changes-to-models).
6+
7+
!!! tip "Learn more about table diffing"
8+
9+
Learn more about using the table diff tool in the SQLMesh [table diff guide](../../guides/tablediff.md).
10+
11+
## Diffing tables or views across gateways
12+
13+
SQLMesh executes a project's models with a single database system, specified as a [gateway](../../guides/connections.md) in the project configuration.
14+
15+
The within-database table diff tool linked above compares tables or environments within such a system. Sometimes, however, you might want to compare tables that reside in two different data systems.
16+
17+
For example, you might migrate your data transformations from an on-premises SQL engine to a cloud SQL engine while setting up your SQLMesh project. To demonstrate equivalence between the systems you could run the transformations in both and compare the new tables to the old tables.
18+
19+
The [within-database table diff](../../guides/tablediff.md) tool cannot make those comparisons, for two reasons:
20+
21+
1. It must join the two tables being diffed, but with two systems no single database engine can access both tables.
22+
2. It assumes that data values can be compared across tables without modification. However, the diff must account for differences in data types across the two SQL engines (e.g., whether timestamps should include time zone information).
23+
24+
SQLMesh's cross-database table diff tool is built for just this scenario. Its comparison algorithm efficiently diffs tables without moving them from one system to the other and automatically addresses differences in data types.
25+
26+
## Configuration and syntax
27+
28+
To diff tables across systems, first configure a [gateway](../../reference/configuration.md#gateway) for each database system in your SQLMesh configuration file.
29+
30+
This example configures `bigquery` and `snowflake` gateways:
31+
32+
```yaml linenums="1"
33+
gateways:
34+
bigquery:
35+
connection:
36+
type: bigquery
37+
[other connection parameters]
38+
39+
snowflake:
40+
connection:
41+
type: snowflake
42+
[other connection parameters]
43+
```
44+
45+
Then, specify each table's gateway in the `table_diff` command with this syntax: `[source_gateway]|[source table]:[target_gateway]|[target table]`.
46+
47+
For example, we could diff the `landing.table` table across `bigquery` and `snowflake` gateways like this:
48+
49+
```sh
50+
$ tcloud sqlmesh table_diff 'bigquery|landing.table:snowflake|landing.table'
51+
```
52+
53+
This syntax tells SQLMesh to use the cross-database diffing algorithm instead of the normal within-database diffing algorithm.
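To make the selector format concrete, here is a minimal Python sketch that splits a `[source_gateway]|[source table]:[target_gateway]|[target table]` string into its four parts. The `TableRef` type and `parse_xdb_selector` function are hypothetical names used only for illustration; they are not part of SQLMesh or Tobiko Cloud, and the real parser may behave differently.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    gateway: str
    table: str


def parse_xdb_selector(selector: str) -> tuple[TableRef, TableRef]:
    """Split 'src_gateway|src_table:tgt_gateway|tgt_table' into two refs.

    Hypothetical helper for illustration only; not a SQLMesh API.
    """
    # The colon separates the source side from the target side.
    source_part, sep, target_part = selector.partition(":")
    if not sep:
        raise ValueError(f"expected 'source:target', got {selector!r}")

    refs = []
    for part in (source_part, target_part):
        # The pipe separates the gateway name from the table name.
        gateway, pipe, table = part.partition("|")
        if not pipe or not gateway or not table:
            raise ValueError(f"expected 'gateway|table', got {part!r}")
        refs.append(TableRef(gateway, table))
    return refs[0], refs[1]


source, target = parse_xdb_selector("bigquery|landing.table:snowflake|landing.table")
print(source)  # TableRef(gateway='bigquery', table='landing.table')
print(target)  # TableRef(gateway='snowflake', table='landing.table')
```

A selector with no gateway prefix, such as `landing.table:landing.table`, raises `ValueError` in this sketch, which mirrors the idea that the gateway prefix is what distinguishes a cross-database diff from a within-database one.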
54+
55+
After adding gateways to the table names, use `table_diff` as described in the [SQLMesh table diff guide](../../guides/tablediff.md). The same options apply for specifying the join keys, decimal precision, etc. See `tcloud sqlmesh table_diff --help` for a [full list of options](../../reference/cli.md#table_diff).
56+
57+
!!! warning
58+
59+
Cross-database diff works for data objects (tables / views).
60+
61+
Diffing _models_ is not supported because we do not assume that both the source and target databases are managed by SQLMesh.
62+
63+
## Example output
64+
65+
A cross-database diff is broken up into two stages.
66+
67+
The first stage is a schema diff. This example shows that differences in column name case across the two tables are identified as schema differences:
68+
69+
```bash
70+
$ tcloud sqlmesh table_diff 'bigquery|sqlmesh_example.full_model:snowflake|sqlmesh_example.full_model' --on item_id --show-sample
71+
72+
Schema Diff Between 'BIGQUERY|SQLMESH_EXAMPLE.FULL_MODEL' and 'SNOWFLAKE|SQLMESH_EXAMPLE.FULL_MODEL':
73+
├── Added Columns:
74+
│ ├── ITEM_ID (DECIMAL(38, 0))
75+
│ └── NUM_ORDERS (DECIMAL(38, 0))
76+
└── Removed Columns:
77+
├── item_id (BIGINT)
78+
└── num_orders (BIGINT)
79+
Schema has differences; continue comparing rows? [y/n]:
80+
```
81+
82+
SQLMesh prompts you before comparing data values across table rows. The prompt provides an opportunity to discontinue the comparison if the schemas are vastly different (potentially indicating a mistake) or you need to exclude columns from the diff because you know they won't match.
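The schema diff above reports every column as both added and removed because the names differ only in case. The following Python sketch shows how a name-based comparison produces that result, and how it changes if names are compared case-insensitively. The `schema_diff` function and its `ignore_case` parameter are assumptions made for illustration, not the tool's actual implementation or options, and the sketch compares column names only (not types).

```python
def schema_diff(
    source: dict[str, str],
    target: dict[str, str],
    ignore_case: bool = False,
) -> dict[str, dict[str, str]]:
    """Report columns present on only one side, compared by name.

    Hypothetical helper for illustration only; not a SQLMesh API.
    """

    def norm(name: str) -> str:
        return name.lower() if ignore_case else name

    source_names = {norm(col) for col in source}
    target_names = {norm(col) for col in target}

    # "Added" columns exist in the target but not the source;
    # "removed" columns exist in the source but not the target.
    added = {c: t for c, t in target.items() if norm(c) not in source_names}
    removed = {c: t for c, t in source.items() if norm(c) not in target_names}
    return {"added": added, "removed": removed}


# Schemas taken from the example output above.
bigquery_schema = {"item_id": "BIGINT", "num_orders": "BIGINT"}
snowflake_schema = {"ITEM_ID": "DECIMAL(38, 0)", "NUM_ORDERS": "DECIMAL(38, 0)"}

diff = schema_diff(bigquery_schema, snowflake_schema)
print(diff["added"])    # both upper-case Snowflake columns
print(diff["removed"])  # both lower-case BigQuery columns

# With case-insensitive matching, no column is unique to either side.
print(schema_diff(bigquery_schema, snowflake_schema, ignore_case=True))
```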
83+
84+
The second stage of the diff is comparing data values across tables. Within each system, SQLMesh divides the data into chunks, evaluates each chunk, and compares the outputs across systems. If a difference is found, it performs a row-level diff on that chunk by reading a sample of mismatched rows from each system.
85+
86+
This example shows that two rows were present in both systems but had different values, one row was in BigQuery only, and one row was in Snowflake only:
87+
88+
```bash
89+
Dividing source dataset into 10 chunks (based on 10947709 total records)
90+
Checking chunks against target dataset
91+
Chunk 1 hash mismatch!
92+
Starting row-level comparison for the range (1 -> 3)
93+
Identifying individual record hashes that don't match
94+
Comparing
95+
96+
Row Counts:
97+
├── PARTIAL MATCH: 2 rows (66.67%)
98+
├── BIGQUERY ONLY: 1 rows (16.67%)
99+
└── SNOWFLAKE ONLY: 1 rows (16.67%)
100+
101+
COMMON ROWS column comparison stats:
102+
pct_match
103+
num_orders 0.0
104+
105+
106+
COMMON ROWS sample data differences:
107+
Column: num_orders
108+
┏━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
109+
┃ item_id ┃ BIGQUERY ┃ SNOWFLAKE ┃
110+
┡━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
111+
│ 1 │ 5 │ 7 │
112+
│ 2 │ 1 │ 2 │
113+
└─────────┴──────────┴───────────┘
114+
115+
BIGQUERY ONLY sample rows:
116+
item_id num_orders
117+
7 4
118+
119+
120+
SNOWFLAKE ONLY sample rows:
121+
item_id num_orders
122+
4 6
123+
```
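The two-step strategy described above (compare per-chunk hashes first, then inspect rows only in chunks that disagree) can be sketched in Python. This is a simplified in-memory illustration under stated assumptions, not the tool's actual algorithm: tables are modeled as dictionaries keyed by an integer join key, chunks are fixed-width key ranges, and `range_hash` and `diff_tables` are hypothetical names. Rows `1`, `2`, `4`, and `7` mirror the example output; rows `10` and `11` are made-up matching rows.

```python
import hashlib


def range_hash(rows: dict[int, tuple], lo: int, hi: int) -> str:
    """Hash all (key, values) pairs with lo <= key < hi, in key order."""
    digest = hashlib.sha256()
    for key in sorted(k for k in rows if lo <= k < hi):
        digest.update(repr((key, rows[key])).encode())
    return digest.hexdigest()


def diff_tables(
    source: dict[int, tuple], target: dict[int, tuple], chunk_size: int
) -> dict[str, list[int]]:
    """Compare chunk hashes, then diff rows only in mismatched chunks."""
    report: dict[str, list[int]] = {
        "mismatched": [],
        "source_only": [],
        "target_only": [],
    }
    keys = list(source) + list(target)
    if not keys:
        return report

    for lo in range(min(keys), max(keys) + 1, chunk_size):
        hi = lo + chunk_size
        # Each side computes its own hash; only the digests are compared.
        if range_hash(source, lo, hi) == range_hash(target, lo, hi):
            continue  # chunk matches, so no rows need to be read

        # Row-level comparison, restricted to the mismatched key range.
        for key in range(lo, hi):
            in_source, in_target = key in source, key in target
            if in_source and in_target:
                if source[key] != target[key]:
                    report["mismatched"].append(key)
            elif in_source:
                report["source_only"].append(key)
            elif in_target:
                report["target_only"].append(key)
    return report


# item_id -> (num_orders,)
bigquery_rows = {1: (5,), 2: (1,), 7: (4,), 10: (3,), 11: (8,)}
snowflake_rows = {1: (7,), 2: (2,), 4: (6,), 10: (3,), 11: (8,)}

report = diff_tables(bigquery_rows, snowflake_rows, chunk_size=5)
print(report)
# {'mismatched': [1, 2], 'source_only': [7], 'target_only': [4]}
```

The appeal of this design is that matching chunks cost one hash comparison each, so row data only needs to be read for the key ranges that actually differ.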
124+
125+
If no differences are found between chunks, the source and target datasets can be considered equal:
126+
127+
```bash
128+
Chunk 1 (1094771 rows) matches!
129+
Chunk 2 (1094771 rows) matches!
130+
...
131+
Chunk 10 (1094770 rows) matches!
132+
133+
All 10947709 records match between 'bigquery|sqlmesh_example.full_model' and 'snowflake|TEST.SQLMESH_EXAMPLE.FULL_MODEL'
134+
```
135+
136+
!!! info
137+
138+
Don't forget to specify the `--show-sample` option if you'd like to see a sample of the actual mismatched data!
139+
140+
Otherwise, only high-level statistics for the mismatched rows will be printed.
141+
142+
### Supported engines
143+
144+
Cross-database diffing is supported on all execution engines that [SQLMesh supports](../../integrations/overview.md#execution-engines).

docs/guides/tablediff.md

Lines changed: 2 additions & 120 deletions
@@ -214,7 +214,7 @@ The output matches, with the exception of the column labels in the `COMMON ROWS
214214

215215
!!! info "Tobiko Cloud Feature"
216216

217-
Cross-database table diffing is available in [Tobiko Cloud](./observer.md#installation).
217+
Cross-database table diffing is available in [Tobiko Cloud](../cloud/features/xdb_diffing.md).
218218

219219
SQLMesh executes a project's models with a single database system, specified as a [gateway](../guides/connections.md#overview) in the project configuration.
220220

@@ -229,122 +229,4 @@ The [within-database table diff](#diffing-models-across-environments) tool canno
229229

230230
SQLMesh's cross-database table diff tool is built for just this scenario. Its comparison algorithm efficiently diffs tables without moving them from one system to the other and automatically addresses differences in data types.
231231

232-
### Configuration and syntax
233-
234-
To diff tables across systems, first configure [Gateways](../reference/configuration#Gateways) for each database system in your SQLMesh configuration file.
235-
236-
This example configures `bigquery` and `snowflake` gateways:
237-
238-
```yaml linenums="1"
239-
gateways:
240-
bigquery:
241-
connection:
242-
type: bigquery
243-
[other connection parameters]
244-
245-
snowflake:
246-
connection:
247-
type: snowflake
248-
[other connection parameters]
249-
```
250-
251-
Then, specify each table's gateway in the `table_diff` command with this syntax: `[source_gateway]|[source table]:[target_gateway]|[target table]`.
252-
253-
For example, we could diff the `landing.table` table across `bigquery` and `snowflake` gateways like this:
254-
255-
```sh
256-
$ tcloud sqlmesh table_diff 'bigquery|landing.table:snowflake|lake.table'
257-
```
258-
259-
This syntax tells SQLMesh to use the cross-database diffing algorithm instead of the normal within-database diffing algorithm.
260-
261-
After adding gateways to the table names, use `table_diff` as described above - the same options apply for specifying the join keys, decimal precision, etc. See `tcloud sqlmesh table_diff --help` for a [full list of options](../reference/cli.md#table_diff).
262-
263-
!!! warning
264-
265-
Cross-database diff works for data objects (tables / views).
266-
267-
Diffing _models_ is not supported because we do not assume that both the source and target databases are managed by SQLMesh.
268-
269-
### Example output
270-
271-
A cross-database diff is broken up into two stages.
272-
273-
The first stage is a schema diff. This example shows that differences in column name case across the two tables are identified as schema differences:
274-
275-
```bash
276-
$ tcloud sqlmesh table_diff 'bigquery|sqlmesh_example.full_model:snowflake|sqlmesh_example.full_model' --on item_id --show-sample
277-
278-
Schema Diff Between 'BIGQUERY|SQLMESH_EXAMPLE.FULL_MODEL' and 'SNOWFLAKE|SQLMESH_EXAMPLE.FULL_MODEL':
279-
├── Added Columns:
280-
│ ├── ITEM_ID (DECIMAL(38, 0))
281-
│ └── NUM_ORDERS (DECIMAL(38, 0))
282-
└── Removed Columns:
283-
├── item_id (BIGINT)
284-
└── num_orders (BIGINT)
285-
Schema has differences; continue comparing rows? [y/n]:
286-
```
287-
288-
SQLMesh prompts you before comparing data values across table rows. The prompt provides an opportunity to discontinue the comparison if the schemas are vastly different (potentially indicating a mistake) or you need to exclude columns from the diff because you know they won't match.
289-
290-
The second stage of the diff is comparing data values across tables. Within each system, SQLMesh divides the data into chunks, evaluates each chunk, and compares the outputs across systems. If a difference is found, it performs a row-level diff on that chunk by reading a sample of mismatched rows from each system.
291-
292-
This example shows that 2 rows were present in each system but had different values, one row was in Bigquery only, and one row was in Snowflake only:
293-
294-
```bash
295-
Dividing source dataset into 10 chunks (based on 10947709 total records)
296-
Checking chunks against target dataset
297-
Chunk 1 hash mismatch!
298-
Starting row-level comparison for the range (1 -> 3)
299-
Identifying individual record hashes that don't match
300-
Comparing
301-
302-
Row Counts:
303-
├── PARTIAL MATCH: 2 rows (66.67%)
304-
├── BIGQUERY ONLY: 1 rows (16.67%)
305-
└── SNOWFLAKE ONLY: 1 rows (16.67%)
306-
307-
COMMON ROWS column comparison stats:
308-
pct_match
309-
num_orders 0.0
310-
311-
312-
COMMON ROWS sample data differences:
313-
Column: num_orders
314-
┏━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
315-
┃ item_id ┃ BIGQUERY ┃ SNOWFLAKE ┃
316-
┡━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
317-
│ 1 │ 5 │ 7 │
318-
│ 2 │ 1 │ 2 │
319-
└─────────┴──────────┴───────────┘
320-
321-
BIGQUERY ONLY sample rows:
322-
item_id num_orders
323-
7 4
324-
325-
326-
SNOWFLAKE ONLY sample rows:
327-
item_id num_orders
328-
4 6
329-
```
330-
331-
If there are no differences found between chunks, the source and target datasets can be considered equal:
332-
333-
```bash
334-
Chunk 1 (1094771 rows) matches!
335-
Chunk 2 (1094771 rows) matches!
336-
...
337-
Chunk 10 (1094770 rows) matches!
338-
339-
All 10947709 records match between 'bigquery|sqlmesh_example.full_model' and 'snowflake|TEST.SQLMESH_EXAMPLE.FULL_MODEL'
340-
```
341-
342-
!!! info
343-
344-
Don't forget to specify the `--show-sample` option if you'd like to see a sample of the actual mismatched data!
345-
346-
Otherwise, only high level statistics for the mismatched rows will be printed.
347-
348-
### Supported engines
349-
350-
Cross-database diffing is supported on all execution engines that [SQLMesh supports](../integrations/overview.md#execution-engines).
232+
Learn more about cross-database table diffing in our [Tobiko Cloud docs](../cloud/features/xdb_diffing.md).

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -126,6 +126,7 @@ nav:
126126
- Security:
127127
- "Security Overview": cloud/features/security/security.md
128128
- "Incident Reporting": cloud/features/incident_reporting.md
129+
- cloud/features/xdb_diffing.md
129130
# - Observability:
130131
# - cloud/features/observability/overview.md
131132
# - cloud/features/observability/model_freshness.md
