Add BigQuery as a reconcile source by take60 · Pull Request #2527 · databrickslabs/lakebridge

take60 · 2026-06-24T06:22:25Z

What

Adds BigQuery as a reconcile source, at parity with the other sources (schema / row / data / all / aggregate). BigQuery was already supported by the transpiler and profiler; this closes the gap for reconcile.

How it works

BigQuery uses the same Lakehouse Federation remote_query path as Snowflake/Oracle/etc. — no new dependencies. BigQueryDataSource uses backtick-quoted 3-part project.dataset.table names and reads metadata from INFORMATION_SCHEMA.COLUMNS. sqlglot already ships a BigQuery dialect, so query generation and schema comparison are reused unchanged.

Changes

Connector reconcile/connectors/bigquery.py (BigQueryDataSource); registered in ReconSourceType and source_adapter.create_adapter; install prompt + recon_capture display name.
Schema type handling: the connector's INFORMATION_SCHEMA query canonicalizes the few BigQuery types sqlglot can't bridge to Databricks (BIGNUMERIC→string, bare NUMERIC→decimal(38,9), TIME→string, JSON→variant, RANGE<T>→struct<…>); everything else is left to sqlglot.
Row hashing: adds a BigQuery hash algorithm (TO_HEX(SHA256()), matching Databricks sha2(...,256)) and a scale-aware decimal transform (FORMAT('%.<scale>f', col)) so BigQuery's trailing-zero-stripped numeric strings match Spark's scale-padded DECIMAL strings.
Docs + tests: supported-sources row, a BigQuery config tab (with a compute note), connector tests, a type-coverage guardrail test, and an adapter test.

Testing

make fmt / make lint (pylint 10.0/10.0) and make test — green (1319 passed).
End-to-end on a real workspace: a BigQuery source (UC Federation connection) reconciled against an identical Databricks copy via the deployed reconcile job — schema, row, and data all matched (0 mismatches / 0 missing).

Notes / limitations

Compute: BigQuery reads use remote_query, which requires Databricks Runtime 17.3+ or serverless compute (the reconcile job's default cluster may run an older runtime — point it at a DBR 17.3+ cluster via job_overrides.existing_cluster_id, or run serverless). This applies to all Lakehouse Federation reconcile sources. Documented in the config tab.
INTERVAL maps to two Databricks columns, which the 1:1 schema comparison can't represent — surfaces as a visible mismatch (documented).

codecov · 2026-06-24T06:26:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.41%. Comparing base (5eb38d1) to head (f425afd).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2527      +/-   ##
==========================================
+ Coverage   69.18%   69.41%   +0.22%     
==========================================
  Files         105      106       +1     
  Lines        9503     9574      +71     
  Branches     1052     1056       +4     
==========================================
+ Hits         6575     6646      +71     
  Misses       2731     2731              
  Partials      197      197

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2026-06-24T06:33:24Z

✅ 172/172 passed, 3 skipped, 48m48s total

_{Running from acceptance #4945}

Adds BigQuery as a Lakehouse Federation reconcile source (schema/row/data/all/aggregate), reusing the existing remote_query path like the other federation connectors. - New BigQueryDataSource: remote_query reads, backtick 3-part `project.dataset.table` names, INFORMATION_SCHEMA schema query with scale/precision canonicalization - Register BIGQUERY in ReconSourceType and source_adapter; install prompts + result display name - Row hashing for BigQuery: TO_HEX(SHA256()) (matches Databricks sha2) and scale-aware decimal FORMAT so cross-engine hashes match Databricks DECIMAL string output - Docs (supported sources + config tab incl. DBR 17.3+/serverless compute note) and unit tests incl. a type-coverage guardrail

m-abulazm

added UC connection with name bigquery_sandbox for the e2e test

…ion tests - bigquery.py: reference tables and INFORMATION_SCHEMA two-part (dataset.table); the project is abstracted by the UC connection, matching the other federated connectors. list_schemas uses bare SCHEMATA via the connection's default project. - install: drop the BigQuery project prompt; catalog is empty for the bigquery dialect. - unit tests: update assertions from three-part to two-part naming. - integration: add bigquery e2e (report_type=schema) plus read_schema/list tests against the bigquery_sandbox UC connection.

…igquery

main removed profiler_dashboard from LakebridgeConfiguration (#2512); update the BigQuery reconcile install test to match.

catalog="" is dropped by blueprint serde and reloads as None, breaking the required str field (e2e SerdeError). BigQuery has no separate catalog, so mirror the dataset into catalog (non-empty, round-trips); the connector ignores it (two-part naming).

take60 · 2026-06-24T22:47:58Z

@m-abulazm thanks for setting up the connection.

All 4 BigQuery acceptance tests now fail on the same single cause — a connection grant, not
code:

PERMISSION_DENIED: User does not have USE CONNECTION on Connection 'bigquery_sandbox'
(the deployed-job path shows it masked as bigquery_TEST_CATALOG, but it's the same
connection.)

Two things from your side:

Grant USE CONNECTION on the BigQuery connection to the acceptance test principal
Heads-up for the next step: once the grant is in, remote_query will try to materialize
results into the read dataset (public) — so please also make sure that dataset is writable by
the connection's service account, otherwise we'll hit that right after. (Not failing on this
yet — execution stops at the grant first.)

Thanks

bishwajit-db · 2026-06-25T08:18:12Z

Materialization dataset: undocumented write requirement + no config knob

In production the BigQuery materialization target always defaults to the read dataset (_mat_dataset falls back to schema; create_adapter never sets materialization_dataset). Two gaps follow from that:

Docs: the source dataset must be writable by the connection's service account (remote_query materializes results there), but only the DBR 17.3+ requirement is documented. Suggest a line in the BigQuery config notes, e.g.: "ensure the source dataset, or a dedicated materialization dataset, is writable by the connection's service account."
Config: materialization_dataset is a constructor-only arg with no path from config (SourceConnectionConfig has no field, create_adapter never passes it), so it's only settable in test code. Adding it to SourceConnectionConfig and plumbing it through create_adapter would let users keep the source dataset read-only. It also fixes list_schemas, which passes _mat_dataset("") and resolves to an empty materializationDataset when none is configured.

The doc line unblocks the PR; the config change can be a follow-up.

…ataset Address review feedback: document that the source dataset (or a dedicated materialization dataset) must be writable by the connection's service account, since remote_query materializes results there. Also correct the catalog description for two-part naming — the project is taken from the UC connection; catalog mirrors the dataset.

…igquery

take60 · 2026-06-25T13:15:31Z

Thanks @bishwajit-db for the review!

(1) Docs — done here: added a "Writable dataset required" note (source or a dedicated materialization dataset must be writable by the connection's SA). Also fixed the naming text for two-part — project comes from the UC connection, schema is the dataset, catalog mirrors it.

(2) Config — agreed, taking as a follow-up: add materialization_dataset to SourceConnectionConfig and thread it through create_adapter (also fixes list_schemas' empty materializationDataset). Separate PR so this can land on the doc fix. #2529

@m-abulazm thanks for setting up UC connection and grant the permission. added testing and verified.

m-abulazm

Nice work, a few things before this is good to merge:

1. Use catalog as materializationDataset. Rather than threading a separate materialization_dataset arg through the connector and defaulting it to the source dataset, let's just use catalog for this. We already carry catalog around for every source, it's already non-empty in the config and reusing it avoids inventing a parallel concept.

2. Drop the comments. There's a lot of explanatory commentary here — the module docstring, the inline notes on the _SCHEMA_QUERY CASE, the read_data placeholder substitution, list_schemas, etc. Readable, well-written code doesn't need this. Please lift out anything that's genuinely surprising into the names/structure of the code itself and delete the rest.

3. Revert the type canonicalization; do it through DataType_transform_mapping. We already support this through DataType_transform_mapping and I don't want another magical spot that conditionally rewrites types (The special-cased bigquery_decimal_transform branch in _get_transform). Let's improve DataType_transform_mapping to handle these cases properly instead. Please revert this part here and bring the refactoring on a dedicated follow-up PR so it can be reviewed on its own.

Happy to pair on the DataType_transform_mapping extension if useful.

m-abulazm · 2026-06-26T08:36:46Z

+        # BigQuery's remote_query rejects `database`; a `query` push requires `materializationDataset`
+        # (a writable BigQuery dataset where results are materialized). Defaults to the dataset being
+        # read; set explicitly when the source dataset is read-only for the connection's service account.


Suggested change

# BigQuery's remote_query rejects `database`; a `query` push requires `materializationDataset`

# (a writable BigQuery dataset where results are materialized). Defaults to the dataset being

# read; set explicitly when the source dataset is read-only for the connection's service account.

m-abulazm · 2026-06-26T08:41:22Z

+logger = logging.getLogger(__name__)
+
+
+class BigQueryDataSource(DataSource):


lets use catalog as materializationDataset and document that clearly plus logs when using it

Done — catalog is now used as the materializationDataset, documented in the class docstring and logged on each read.

m-abulazm · 2026-06-26T08:42:40Z

+    """BigQuery data source read through a Databricks Lakehouse Federation UC connection.
+
+    Data is fetched via the `remote_query` table-valued function (same path as Snowflake/Oracle/etc.),
+    so credentials live in the UC connection and no JDBC driver is required here.
+
+    Naming/quoting follows GoogleSQL: identifiers are backtick-quoted and tables are referenced
+    two-part as `dataset.table` (dataset == schema), the same way the other federated connectors
+    keep the top-level container out of the dotted name. The project is abstracted by the UC
+    connection (its default project scopes unqualified names), so the `catalog` argument is unused.
+
+    The `_SCHEMA_QUERY` `CASE` is the *Stage-1* type canonicalization for schema reconciliation: it
+    emits, for the handful of BigQuery types that sqlglot cannot bridge to Databricks on its own, the
+    Databricks-equivalent type string so the downstream `schema_compare` round-trip matches. Targets are
+    taken from the empirically-tested BigQuery -> Databricks type mapping (FE GCP + DBSQL 2026.10):
+      * BIGNUMERIC -> string  (precision 76 exceeds Databricks DECIMAL max 38; STRING preserves it)
+      * NUMERIC    -> decimal(38,9)  (bare NUMERIC is fixed 38/9; sqlglot would emit DECIMAL(10,0))
+      * TIME       -> string  (Databricks has no TIME type)
+      * JSON       -> variant
+      * RANGE<T>   -> struct<start <T>, end <T>>
+    All other types (INT64, FLOAT64, BOOL, STRING, BYTES, DATE, DATETIME, TIMESTAMP, NUMERIC(p,s),
+    GEOGRAPHY, ARRAY, STRUCT) are left raw because sqlglot translates them correctly.
+    INTERVAL is intentionally not mapped: it migrates to two Databricks columns, which the one-to-one
+    schema comparison cannot represent, so such columns surface as a visible mismatch.
+    """


Suggested change

"""BigQuery data source read through a Databricks Lakehouse Federation UC connection.

Data is fetched via the `remote_query` table-valued function (same path as Snowflake/Oracle/etc.),

so credentials live in the UC connection and no JDBC driver is required here.

Naming/quoting follows GoogleSQL: identifiers are backtick-quoted and tables are referenced

two-part as `dataset.table` (dataset == schema), the same way the other federated connectors

keep the top-level container out of the dotted name. The project is abstracted by the UC

connection (its default project scopes unqualified names), so the `catalog` argument is unused.

The `_SCHEMA_QUERY` `CASE` is the *Stage-1* type canonicalization for schema reconciliation: it

emits, for the handful of BigQuery types that sqlglot cannot bridge to Databricks on its own, the

Databricks-equivalent type string so the downstream `schema_compare` round-trip matches. Targets are

taken from the empirically-tested BigQuery -> Databricks type mapping (FE GCP + DBSQL 2026.10):

* BIGNUMERIC -> string (precision 76 exceeds Databricks DECIMAL max 38; STRING preserves it)

* NUMERIC -> decimal(38,9) (bare NUMERIC is fixed 38/9; sqlglot would emit DECIMAL(10,0))

* TIME -> string (Databricks has no TIME type)

* JSON -> variant

* RANGE<T> -> struct<start <T>, end <T>>

All other types (INT64, FLOAT64, BOOL, STRING, BYTES, DATE, DATETIME, TIMESTAMP, NUMERIC(p,s),

GEOGRAPHY, ARRAY, STRUCT) are left raw because sqlglot translates them correctly.

INTERVAL is intentionally not mapped: it migrates to two Databricks columns, which the one-to-one

schema comparison cannot represent, so such columns surface as a visible mismatch.

"""

please avoid comments. Readable well-written code does not need comments

m-abulazm · 2026-06-26T08:48:21Z

                logger.warning(f"Could not parse datatype {datatype} for source {source_dialect}")

+            # BigQuery decimals need scale-aware padding to match Spark's CAST(DECIMAL AS STRING).
+            if source_dialect == "bigquery" and parsed == exp.DataType.Type.DECIMAL.value:


we support this through DataType_transform_mapping I dont want to introduce another magical place that does things conditionally.

we should improve DataType_transform_mapping to better support this. please revert this and add the new refactoring on a new dedicated PR

Reverted the conditional branch. Tracking the DataType_transform_mapping refactor in #2537.

m-abulazm · 2026-06-26T08:49:33Z

+        # BigQuery CONCAT/|| and TRIM/COALESCE require STRING operands, so cast every column to STRING
+        # before concatenation (mirrors the tsql default). DATE/INT64/NUMERIC/STRING cast deterministically;
+        # TIMESTAMP/FLOAT64 string formatting parity with Databricks is a follow-up (explicit FORMAT needed).


Suggested change

# BigQuery CONCAT/|| and TRIM/COALESCE require STRING operands, so cast every column to STRING

# before concatenation (mirrors the tsql default). DATE/INT64/NUMERIC/STRING cast deterministically;

# TIMESTAMP/FLOAT64 string formatting parity with Databricks is a follow-up (explicit FORMAT needed).

# TODO: add timestamps and numbers handling

Applied — left the # TODO: add timestamps and numbers handling marker; full handling in #2537.

m-abulazm · 2026-06-26T08:49:53Z

+            # The BigQuery project is abstracted by the UC connection (its default project scopes
+            # unqualified names), so no project is prompted; the connector uses two-part dataset.table.
+            # catalog is set to the dataset below (it must be non-empty to round-trip through serde).


Suggested change

# The BigQuery project is abstracted by the UC connection (its default project scopes

# unqualified names), so no project is prompted; the connector uses two-part dataset.table.

# catalog is set to the dataset below (it must be non-empty to round-trip through serde).

m-abulazm · 2026-06-26T08:50:04Z

+            # BigQuery has no separate catalog; mirror the dataset so the value is non-empty (the
+            # connector ignores it). The project is abstracted by the UC connection.


Suggested change

# BigQuery has no separate catalog; mirror the dataset so the value is non-empty (the

# connector ignores it). The project is abstracted by the UC connection.

… revert decimal transform - BigQuery connector uses catalog as the remote_query materializationDataset (logged), removing the constructor-only materialization_dataset arg. - Remove explanatory comments in the connector, install, and reconcile fixtures. - Revert the special-cased BigQuery decimal transform (bigquery_decimal_transform and the _get_transform branch); decimal/timestamp hashing parity will be handled via DataType_transform_mapping in a follow-up.

…igquery

take60 · 2026-06-26T14:17:48Z

Thanks @m-abulazm — addressed all three (CI green):

catalog as materializationDataset — the connector now uses catalog for it (with a log line); dropped the materialization_dataset constructor arg. That made the separate config knob unnecessary, so I closed BigQuery reconcile: make materialization_dataset configurable via SourceConnectionConfig #2529.
Comments — removed the docstring prose and inline comments in the connector / install / fixtures.
Decimal transform — reverted bigquery_decimal_transform and the _get_transform branch; left a # TODO. Doing it properly via DataType_transform_mapping is tracked in BigQuery reconcile: handle decimal/timestamp hash parity via DataType_transform_mapping #2537 (happy to pair).

PTAL.

take60 force-pushed the feature/reconcile-bigquery branch from d73cf77 to 13141ee Compare June 24, 2026 06:50

take60 marked this pull request as ready for review June 24, 2026 07:26

take60 requested a review from a team as a code owner June 24, 2026 07:26

m-abulazm requested changes Jun 24, 2026

View reviewed changes

take60 added 4 commits June 24, 2026 22:02

Merge remote-tracking branch 'upstream/main' into feature/reconcile-b…

bbb15a7

…igquery

Drop profiler_dashboard from bigquery install test after merging main

3a40913

main removed profiler_dashboard from LakebridgeConfiguration (#2512); update the BigQuery reconcile install test to match.

take60 added 2 commits June 25, 2026 21:23

Merge remote-tracking branch 'upstream/main' into feature/reconcile-b…

ac18df1

…igquery

take60 mentioned this pull request Jun 25, 2026

BigQuery reconcile: make materialization_dataset configurable via SourceConnectionConfig #2529

Closed

take60 enabled auto-merge June 25, 2026 13:16

take60 requested a review from m-abulazm June 25, 2026 13:16

m-abulazm requested changes Jun 26, 2026

View reviewed changes

take60 mentioned this pull request Jun 26, 2026

BigQuery reconcile: handle decimal/timestamp hash parity via DataType_transform_mapping #2537

Open

Merge remote-tracking branch 'upstream/main' into feature/reconcile-b…

f425afd

…igquery

take60 requested a review from m-abulazm June 26, 2026 14:21

	# BigQuery's remote_query rejects `database`; a `query` push requires `materializationDataset`
	# (a writable BigQuery dataset where results are materialized). Defaults to the dataset being
	# read; set explicitly when the source dataset is read-only for the connection's service account.

		logger = logging.getLogger(__name__)


		class BigQueryDataSource(DataSource):

	# The BigQuery project is abstracted by the UC connection (its default project scopes
	# unqualified names), so no project is prompted; the connector uses two-part dataset.table.
	# catalog is set to the dataset below (it must be non-empty to round-trip through serde).

		# BigQuery has no separate catalog; mirror the dataset so the value is non-empty (the
		# connector ignores it). The project is abstracted by the UC connection.

Uh oh!

Conversation

take60 commented Jun 24, 2026

What

How it works

Changes

Testing

Notes / limitations

Uh oh!

codecov Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

m-abulazm left a comment

Choose a reason for hiding this comment

Uh oh!

take60 commented Jun 24, 2026

Uh oh!

bishwajit-db commented Jun 25, 2026

Uh oh!

take60 commented Jun 25, 2026

Uh oh!

m-abulazm left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

take60 commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Jun 24, 2026 •

edited

Loading

github-actions Bot commented Jun 24, 2026 •

edited

Loading