Add BigQuery as a reconcile source#2527
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2527 +/- ##
==========================================
+ Coverage 69.18% 69.41% +0.22%
==========================================
Files 105 106 +1
Lines 9503 9574 +71
Branches 1052 1056 +4
==========================================
+ Hits 6575 6646 +71
Misses 2731 2731
Partials 197 197 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
✅ 172/172 passed, 3 skipped, 48m48s total Running from acceptance #4945 |
Adds BigQuery as a Lakehouse Federation reconcile source (schema/row/data/all/aggregate), reusing the existing remote_query path like the other federation connectors. - New BigQueryDataSource: remote_query reads, backtick 3-part `project.dataset.table` names, INFORMATION_SCHEMA schema query with scale/precision canonicalization - Register BIGQUERY in ReconSourceType and source_adapter; install prompts + result display name - Row hashing for BigQuery: TO_HEX(SHA256()) (matches Databricks sha2) and scale-aware decimal FORMAT so cross-engine hashes match Databricks DECIMAL string output - Docs (supported sources + config tab incl. DBR 17.3+/serverless compute note) and unit tests incl. a type-coverage guardrail
d73cf77 to
13141ee
Compare
m-abulazm
left a comment
There was a problem hiding this comment.
added UC connection with name bigquery_sandbox for the e2e test
…ion tests - bigquery.py: reference tables and INFORMATION_SCHEMA two-part (dataset.table); the project is abstracted by the UC connection, matching the other federated connectors. list_schemas uses bare SCHEMATA via the connection's default project. - install: drop the BigQuery project prompt; catalog is empty for the bigquery dialect. - unit tests: update assertions from three-part to two-part naming. - integration: add bigquery e2e (report_type=schema) plus read_schema/list tests against the bigquery_sandbox UC connection.
main removed profiler_dashboard from LakebridgeConfiguration (#2512); update the BigQuery reconcile install test to match.
catalog="" is dropped by blueprint serde and reloads as None, breaking the required str field (e2e SerdeError). BigQuery has no separate catalog, so mirror the dataset into catalog (non-empty, round-trips); the connector ignores it (two-part naming).
|
@m-abulazm thanks for setting up the connection. All 4 BigQuery acceptance tests now fail on the same single cause — a connection grant, not PERMISSION_DENIED: User does not have USE CONNECTION on Connection 'bigquery_sandbox' Two things from your side:
Thanks |
|
Materialization dataset: undocumented write requirement + no config knob In production the BigQuery materialization target always defaults to the read dataset (
The doc line unblocks the PR; the config change can be a follow-up. |
…ataset Address review feedback: document that the source dataset (or a dedicated materialization dataset) must be writable by the connection's service account, since remote_query materializes results there. Also correct the catalog description for two-part naming — the project is taken from the UC connection; catalog mirrors the dataset.
|
Thanks @bishwajit-db for the review! (1) Docs — done here: added a "Writable dataset required" note (source or a dedicated materialization dataset must be writable by the connection's SA). Also fixed the naming text for two-part — project comes from the UC connection, schema is the dataset, catalog mirrors it. (2) Config — agreed, taking as a follow-up: add materialization_dataset to SourceConnectionConfig and thread it through create_adapter (also fixes list_schemas' empty materializationDataset). Separate PR so this can land on the doc fix. #2529 @m-abulazm thanks for setting up UC connection and grant the permission. added testing and verified. |
m-abulazm
left a comment
There was a problem hiding this comment.
Nice work, a few things before this is good to merge:
1. Use catalog as materializationDataset. Rather than threading a separate materialization_dataset arg through the connector and defaulting it to the source dataset, let's just use catalog for this. We already carry catalog around for every source, it's already non-empty in the config and reusing it avoids inventing a parallel concept.
2. Drop the comments. There's a lot of explanatory commentary here — the module docstring, the inline notes on the _SCHEMA_QUERY CASE, the read_data placeholder substitution, list_schemas, etc. Readable, well-written code doesn't need this. Please lift out anything that's genuinely surprising into the names/structure of the code itself and delete the rest.
3. Revert the type canonicalization; do it through DataType_transform_mapping. We already support this through DataType_transform_mapping and I don't want another magical spot that conditionally rewrites types (The special-cased bigquery_decimal_transform branch in _get_transform). Let's improve DataType_transform_mapping to handle these cases properly instead. Please revert this part here and bring the refactoring on a dedicated follow-up PR so it can be reviewed on its own.
Happy to pair on the DataType_transform_mapping extension if useful.
| # BigQuery's remote_query rejects `database`; a `query` push requires `materializationDataset` | ||
| # (a writable BigQuery dataset where results are materialized). Defaults to the dataset being | ||
| # read; set explicitly when the source dataset is read-only for the connection's service account. |
There was a problem hiding this comment.
| # BigQuery's remote_query rejects `database`; a `query` push requires `materializationDataset` | |
| # (a writable BigQuery dataset where results are materialized). Defaults to the dataset being | |
| # read; set explicitly when the source dataset is read-only for the connection's service account. |
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class BigQueryDataSource(DataSource): |
There was a problem hiding this comment.
lets use catalog as materializationDataset and document that clearly plus logs when using it
There was a problem hiding this comment.
Done — catalog is now used as the materializationDataset, documented in the class docstring and logged on each read.
| """BigQuery data source read through a Databricks Lakehouse Federation UC connection. | ||
|
|
||
| Data is fetched via the `remote_query` table-valued function (same path as Snowflake/Oracle/etc.), | ||
| so credentials live in the UC connection and no JDBC driver is required here. | ||
|
|
||
| Naming/quoting follows GoogleSQL: identifiers are backtick-quoted and tables are referenced | ||
| two-part as `dataset.table` (dataset == schema), the same way the other federated connectors | ||
| keep the top-level container out of the dotted name. The project is abstracted by the UC | ||
| connection (its default project scopes unqualified names), so the `catalog` argument is unused. | ||
|
|
||
| The `_SCHEMA_QUERY` `CASE` is the *Stage-1* type canonicalization for schema reconciliation: it | ||
| emits, for the handful of BigQuery types that sqlglot cannot bridge to Databricks on its own, the | ||
| Databricks-equivalent type string so the downstream `schema_compare` round-trip matches. Targets are | ||
| taken from the empirically-tested BigQuery -> Databricks type mapping (FE GCP + DBSQL 2026.10): | ||
| * BIGNUMERIC -> string (precision 76 exceeds Databricks DECIMAL max 38; STRING preserves it) | ||
| * NUMERIC -> decimal(38,9) (bare NUMERIC is fixed 38/9; sqlglot would emit DECIMAL(10,0)) | ||
| * TIME -> string (Databricks has no TIME type) | ||
| * JSON -> variant | ||
| * RANGE<T> -> struct<start <T>, end <T>> | ||
| All other types (INT64, FLOAT64, BOOL, STRING, BYTES, DATE, DATETIME, TIMESTAMP, NUMERIC(p,s), | ||
| GEOGRAPHY, ARRAY, STRUCT) are left raw because sqlglot translates them correctly. | ||
| INTERVAL is intentionally not mapped: it migrates to two Databricks columns, which the one-to-one | ||
| schema comparison cannot represent, so such columns surface as a visible mismatch. | ||
| """ |
There was a problem hiding this comment.
| """BigQuery data source read through a Databricks Lakehouse Federation UC connection. | |
| Data is fetched via the `remote_query` table-valued function (same path as Snowflake/Oracle/etc.), | |
| so credentials live in the UC connection and no JDBC driver is required here. | |
| Naming/quoting follows GoogleSQL: identifiers are backtick-quoted and tables are referenced | |
| two-part as `dataset.table` (dataset == schema), the same way the other federated connectors | |
| keep the top-level container out of the dotted name. The project is abstracted by the UC | |
| connection (its default project scopes unqualified names), so the `catalog` argument is unused. | |
| The `_SCHEMA_QUERY` `CASE` is the *Stage-1* type canonicalization for schema reconciliation: it | |
| emits, for the handful of BigQuery types that sqlglot cannot bridge to Databricks on its own, the | |
| Databricks-equivalent type string so the downstream `schema_compare` round-trip matches. Targets are | |
| taken from the empirically-tested BigQuery -> Databricks type mapping (FE GCP + DBSQL 2026.10): | |
| * BIGNUMERIC -> string (precision 76 exceeds Databricks DECIMAL max 38; STRING preserves it) | |
| * NUMERIC -> decimal(38,9) (bare NUMERIC is fixed 38/9; sqlglot would emit DECIMAL(10,0)) | |
| * TIME -> string (Databricks has no TIME type) | |
| * JSON -> variant | |
| * RANGE<T> -> struct<start <T>, end <T>> | |
| All other types (INT64, FLOAT64, BOOL, STRING, BYTES, DATE, DATETIME, TIMESTAMP, NUMERIC(p,s), | |
| GEOGRAPHY, ARRAY, STRUCT) are left raw because sqlglot translates them correctly. | |
| INTERVAL is intentionally not mapped: it migrates to two Databricks columns, which the one-to-one | |
| schema comparison cannot represent, so such columns surface as a visible mismatch. | |
| """ |
There was a problem hiding this comment.
please avoid comments. Readable well-written code does not need comments
| logger.warning(f"Could not parse datatype {datatype} for source {source_dialect}") | ||
|
|
||
| # BigQuery decimals need scale-aware padding to match Spark's CAST(DECIMAL AS STRING). | ||
| if source_dialect == "bigquery" and parsed == exp.DataType.Type.DECIMAL.value: |
There was a problem hiding this comment.
we support this through DataType_transform_mapping I dont want to introduce another magical place that does things conditionally.
we should improve DataType_transform_mapping to better support this. please revert this and add the new refactoring on a new dedicated PR
There was a problem hiding this comment.
Reverted the conditional branch. Tracking the DataType_transform_mapping refactor in #2537.
| # BigQuery CONCAT/|| and TRIM/COALESCE require STRING operands, so cast every column to STRING | ||
| # before concatenation (mirrors the tsql default). DATE/INT64/NUMERIC/STRING cast deterministically; | ||
| # TIMESTAMP/FLOAT64 string formatting parity with Databricks is a follow-up (explicit FORMAT needed). |
There was a problem hiding this comment.
| # BigQuery CONCAT/|| and TRIM/COALESCE require STRING operands, so cast every column to STRING | |
| # before concatenation (mirrors the tsql default). DATE/INT64/NUMERIC/STRING cast deterministically; | |
| # TIMESTAMP/FLOAT64 string formatting parity with Databricks is a follow-up (explicit FORMAT needed). | |
| # TODO: add timestamps and numbers handling |
There was a problem hiding this comment.
Applied — left the # TODO: add timestamps and numbers handling marker; full handling in #2537.
| # The BigQuery project is abstracted by the UC connection (its default project scopes | ||
| # unqualified names), so no project is prompted; the connector uses two-part dataset.table. | ||
| # catalog is set to the dataset below (it must be non-empty to round-trip through serde). |
There was a problem hiding this comment.
| # The BigQuery project is abstracted by the UC connection (its default project scopes | |
| # unqualified names), so no project is prompted; the connector uses two-part dataset.table. | |
| # catalog is set to the dataset below (it must be non-empty to round-trip through serde). |
| # BigQuery has no separate catalog; mirror the dataset so the value is non-empty (the | ||
| # connector ignores it). The project is abstracted by the UC connection. |
There was a problem hiding this comment.
| # BigQuery has no separate catalog; mirror the dataset so the value is non-empty (the | |
| # connector ignores it). The project is abstracted by the UC connection. |
… revert decimal transform - BigQuery connector uses catalog as the remote_query materializationDataset (logged), removing the constructor-only materialization_dataset arg. - Remove explanatory comments in the connector, install, and reconcile fixtures. - Revert the special-cased BigQuery decimal transform (bigquery_decimal_transform and the _get_transform branch); decimal/timestamp hashing parity will be handled via DataType_transform_mapping in a follow-up.
|
Thanks @m-abulazm — addressed all three (CI green):
PTAL. |
What
Adds BigQuery as a reconcile source, at parity with the other sources (
schema/row/data/all/aggregate). BigQuery was already supported by the transpiler and profiler; this closes the gap for reconcile.How it works
BigQuery uses the same Lakehouse Federation
remote_querypath as Snowflake/Oracle/etc. — no new dependencies.BigQueryDataSourceuses backtick-quoted 3-partproject.dataset.tablenames and reads metadata fromINFORMATION_SCHEMA.COLUMNS. sqlglot already ships a BigQuery dialect, so query generation and schema comparison are reused unchanged.Changes
reconcile/connectors/bigquery.py(BigQueryDataSource); registered inReconSourceTypeandsource_adapter.create_adapter; install prompt +recon_capturedisplay name.INFORMATION_SCHEMAquery canonicalizes the few BigQuery types sqlglot can't bridge to Databricks (BIGNUMERIC→string, bareNUMERIC→decimal(38,9),TIME→string,JSON→variant,RANGE<T>→struct<…>); everything else is left to sqlglot.TO_HEX(SHA256()), matching Databrickssha2(...,256)) and a scale-aware decimal transform (FORMAT('%.<scale>f', col)) so BigQuery's trailing-zero-stripped numeric strings match Spark's scale-paddedDECIMALstrings.Testing
make fmt/make lint(pylint 10.0/10.0) andmake test— green (1319 passed).schema,row, anddataall matched (0 mismatches / 0 missing).Notes / limitations
remote_query, which requires Databricks Runtime 17.3+ or serverless compute (the reconcile job's default cluster may run an older runtime — point it at a DBR 17.3+ cluster viajob_overrides.existing_cluster_id, or run serverless). This applies to all Lakehouse Federation reconcile sources. Documented in the config tab.INTERVALmaps to two Databricks columns, which the 1:1 schema comparison can't represent — surfaces as a visible mismatch (documented).