Skip to content

Add BigQuery as a reconcile source#2527

Open
take60 wants to merge 9 commits into
mainfrom
feature/reconcile-bigquery
Open

Add BigQuery as a reconcile source#2527
take60 wants to merge 9 commits into
mainfrom
feature/reconcile-bigquery

Conversation

@take60

@take60 take60 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

What

Adds BigQuery as a reconcile source, at parity with the other sources (schema / row / data / all / aggregate). BigQuery was already supported by the transpiler and profiler; this closes the gap for reconcile.

How it works

BigQuery uses the same Lakehouse Federation remote_query path as Snowflake/Oracle/etc. — no new dependencies. BigQueryDataSource uses backtick-quoted 3-part project.dataset.table names and reads metadata from INFORMATION_SCHEMA.COLUMNS. sqlglot already ships a BigQuery dialect, so query generation and schema comparison are reused unchanged.

Changes

  • Connector reconcile/connectors/bigquery.py (BigQueryDataSource); registered in ReconSourceType and source_adapter.create_adapter; install prompt + recon_capture display name.
  • Schema type handling: the connector's INFORMATION_SCHEMA query canonicalizes the few BigQuery types sqlglot can't bridge to Databricks (BIGNUMERICstring, bare NUMERICdecimal(38,9), TIMEstring, JSONvariant, RANGE<T>struct<…>); everything else is left to sqlglot.
  • Row hashing: adds a BigQuery hash algorithm (TO_HEX(SHA256()), matching Databricks sha2(...,256)) and a scale-aware decimal transform (FORMAT('%.<scale>f', col)) so BigQuery's trailing-zero-stripped numeric strings match Spark's scale-padded DECIMAL strings.
  • Docs + tests: supported-sources row, a BigQuery config tab (with a compute note), connector tests, a type-coverage guardrail test, and an adapter test.

Testing

  • make fmt / make lint (pylint 10.0/10.0) and make test — green (1319 passed).
  • End-to-end on a real workspace: a BigQuery source (UC Federation connection) reconciled against an identical Databricks copy via the deployed reconcile jobschema, row, and data all matched (0 mismatches / 0 missing).

Notes / limitations

  • Compute: BigQuery reads use remote_query, which requires Databricks Runtime 17.3+ or serverless compute (the reconcile job's default cluster may run an older runtime — point it at a DBR 17.3+ cluster via job_overrides.existing_cluster_id, or run serverless). This applies to all Lakehouse Federation reconcile sources. Documented in the config tab.
  • INTERVAL maps to two Databricks columns, which the 1:1 schema comparison can't represent — surfaces as a visible mismatch (documented).

@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.41%. Comparing base (5eb38d1) to head (f425afd).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2527      +/-   ##
==========================================
+ Coverage   69.18%   69.41%   +0.22%     
==========================================
  Files         105      106       +1     
  Lines        9503     9574      +71     
  Branches     1052     1056       +4     
==========================================
+ Hits         6575     6646      +71     
  Misses       2731     2731              
  Partials      197      197              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown

✅ 172/172 passed, 3 skipped, 48m48s total

Running from acceptance #4945

Adds BigQuery as a Lakehouse Federation reconcile source (schema/row/data/all/aggregate),
reusing the existing remote_query path like the other federation connectors.

- New BigQueryDataSource: remote_query reads, backtick 3-part `project.dataset.table` names,
  INFORMATION_SCHEMA schema query with scale/precision canonicalization
- Register BIGQUERY in ReconSourceType and source_adapter; install prompts + result display name
- Row hashing for BigQuery: TO_HEX(SHA256()) (matches Databricks sha2) and scale-aware decimal
  FORMAT so cross-engine hashes match Databricks DECIMAL string output
- Docs (supported sources + config tab incl. DBR 17.3+/serverless compute note) and unit tests
  incl. a type-coverage guardrail
@take60 take60 force-pushed the feature/reconcile-bigquery branch from d73cf77 to 13141ee Compare June 24, 2026 06:50
@take60 take60 marked this pull request as ready for review June 24, 2026 07:26
@take60 take60 requested a review from a team as a code owner June 24, 2026 07:26

@m-abulazm m-abulazm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added UC connection with name bigquery_sandbox for the e2e test

take60 added 4 commits June 24, 2026 22:02
…ion tests

- bigquery.py: reference tables and INFORMATION_SCHEMA two-part (dataset.table);
  the project is abstracted by the UC connection, matching the other federated
  connectors. list_schemas uses bare SCHEMATA via the connection's default project.
- install: drop the BigQuery project prompt; catalog is empty for the bigquery dialect.
- unit tests: update assertions from three-part to two-part naming.
- integration: add bigquery e2e (report_type=schema) plus read_schema/list tests
  against the bigquery_sandbox UC connection.
main removed profiler_dashboard from LakebridgeConfiguration (#2512); update the
BigQuery reconcile install test to match.
catalog="" is dropped by blueprint serde and reloads as None, breaking the required
str field (e2e SerdeError). BigQuery has no separate catalog, so mirror the dataset
into catalog (non-empty, round-trips); the connector ignores it (two-part naming).
@take60

take60 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

@m-abulazm thanks for setting up the connection.

All 4 BigQuery acceptance tests now fail on the same single cause — a connection grant, not
code:

PERMISSION_DENIED: User does not have USE CONNECTION on Connection 'bigquery_sandbox'
(the deployed-job path shows it masked as bigquery_TEST_CATALOG, but it's the same
connection.)

Two things from your side:

  1. Grant USE CONNECTION on the BigQuery connection to the acceptance test principal
  2. Heads-up for the next step: once the grant is in, remote_query will try to materialize
    results into the read dataset (public) — so please also make sure that dataset is writable by
    the connection's service account, otherwise we'll hit that right after. (Not failing on this
    yet — execution stops at the grant first.)

Thanks

@bishwajit-db

Copy link
Copy Markdown
Contributor

Materialization dataset: undocumented write requirement + no config knob

In production the BigQuery materialization target always defaults to the read dataset (_mat_dataset falls back to schema; create_adapter never sets materialization_dataset). Two gaps follow from that:

  1. Docs: the source dataset must be writable by the connection's service account (remote_query materializes results there), but only the DBR 17.3+ requirement is documented. Suggest a line in the BigQuery config notes, e.g.: "ensure the source dataset, or a dedicated materialization dataset, is writable by the connection's service account."

  2. Config: materialization_dataset is a constructor-only arg with no path from config (SourceConnectionConfig has no field, create_adapter never passes it), so it's only settable in test code. Adding it to SourceConnectionConfig and plumbing it through create_adapter would let users keep the source dataset read-only. It also fixes list_schemas, which passes _mat_dataset("") and resolves to an empty materializationDataset when none is configured.

The doc line unblocks the PR; the config change can be a follow-up.

take60 added 2 commits June 25, 2026 21:23
…ataset

Address review feedback: document that the source dataset (or a dedicated materialization
dataset) must be writable by the connection's service account, since remote_query
materializes results there. Also correct the catalog description for two-part naming —
the project is taken from the UC connection; catalog mirrors the dataset.
@take60

take60 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @bishwajit-db for the review!

(1) Docs — done here: added a "Writable dataset required" note (source or a dedicated materialization dataset must be writable by the connection's SA). Also fixed the naming text for two-part — project comes from the UC connection, schema is the dataset, catalog mirrors it.

(2) Config — agreed, taking as a follow-up: add materialization_dataset to SourceConnectionConfig and thread it through create_adapter (also fixes list_schemas' empty materializationDataset). Separate PR so this can land on the doc fix. #2529

@m-abulazm thanks for setting up UC connection and grant the permission. added testing and verified.

@take60 take60 enabled auto-merge June 25, 2026 13:16
@take60 take60 requested a review from m-abulazm June 25, 2026 13:16

@m-abulazm m-abulazm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, a few things before this is good to merge:

1. Use catalog as materializationDataset. Rather than threading a separate materialization_dataset arg through the connector and defaulting it to the source dataset, let's just use catalog for this. We already carry catalog around for every source, it's already non-empty in the config and reusing it avoids inventing a parallel concept.

2. Drop the comments. There's a lot of explanatory commentary here — the module docstring, the inline notes on the _SCHEMA_QUERY CASE, the read_data placeholder substitution, list_schemas, etc. Readable, well-written code doesn't need this. Please lift out anything that's genuinely surprising into the names/structure of the code itself and delete the rest.

3. Revert the type canonicalization; do it through DataType_transform_mapping. We already support this through DataType_transform_mapping and I don't want another magical spot that conditionally rewrites types (The special-cased bigquery_decimal_transform branch in _get_transform). Let's improve DataType_transform_mapping to handle these cases properly instead. Please revert this part here and bring the refactoring on a dedicated follow-up PR so it can be reviewed on its own.

Happy to pair on the DataType_transform_mapping extension if useful.

Comment on lines +74 to +76
# BigQuery's remote_query rejects `database`; a `query` push requires `materializationDataset`
# (a writable BigQuery dataset where results are materialized). Defaults to the dataset being
# read; set explicitly when the source dataset is read-only for the connection's service account.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# BigQuery's remote_query rejects `database`; a `query` push requires `materializationDataset`
# (a writable BigQuery dataset where results are materialized). Defaults to the dataset being
# read; set explicitly when the source dataset is read-only for the connection's service account.

logger = logging.getLogger(__name__)


class BigQueryDataSource(DataSource):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets use catalog as materializationDataset and document that clearly plus logs when using it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — catalog is now used as the materializationDataset, documented in the class docstring and logged on each read.

Comment on lines +20 to +43
"""BigQuery data source read through a Databricks Lakehouse Federation UC connection.

Data is fetched via the `remote_query` table-valued function (same path as Snowflake/Oracle/etc.),
so credentials live in the UC connection and no JDBC driver is required here.

Naming/quoting follows GoogleSQL: identifiers are backtick-quoted and tables are referenced
two-part as `dataset.table` (dataset == schema), the same way the other federated connectors
keep the top-level container out of the dotted name. The project is abstracted by the UC
connection (its default project scopes unqualified names), so the `catalog` argument is unused.

The `_SCHEMA_QUERY` `CASE` is the *Stage-1* type canonicalization for schema reconciliation: it
emits, for the handful of BigQuery types that sqlglot cannot bridge to Databricks on its own, the
Databricks-equivalent type string so the downstream `schema_compare` round-trip matches. Targets are
taken from the empirically-tested BigQuery -> Databricks type mapping (FE GCP + DBSQL 2026.10):
* BIGNUMERIC -> string (precision 76 exceeds Databricks DECIMAL max 38; STRING preserves it)
* NUMERIC -> decimal(38,9) (bare NUMERIC is fixed 38/9; sqlglot would emit DECIMAL(10,0))
* TIME -> string (Databricks has no TIME type)
* JSON -> variant
* RANGE<T> -> struct<start <T>, end <T>>
All other types (INT64, FLOAT64, BOOL, STRING, BYTES, DATE, DATETIME, TIMESTAMP, NUMERIC(p,s),
GEOGRAPHY, ARRAY, STRUCT) are left raw because sqlglot translates them correctly.
INTERVAL is intentionally not mapped: it migrates to two Databricks columns, which the one-to-one
schema comparison cannot represent, so such columns surface as a visible mismatch.
"""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""BigQuery data source read through a Databricks Lakehouse Federation UC connection.
Data is fetched via the `remote_query` table-valued function (same path as Snowflake/Oracle/etc.),
so credentials live in the UC connection and no JDBC driver is required here.
Naming/quoting follows GoogleSQL: identifiers are backtick-quoted and tables are referenced
two-part as `dataset.table` (dataset == schema), the same way the other federated connectors
keep the top-level container out of the dotted name. The project is abstracted by the UC
connection (its default project scopes unqualified names), so the `catalog` argument is unused.
The `_SCHEMA_QUERY` `CASE` is the *Stage-1* type canonicalization for schema reconciliation: it
emits, for the handful of BigQuery types that sqlglot cannot bridge to Databricks on its own, the
Databricks-equivalent type string so the downstream `schema_compare` round-trip matches. Targets are
taken from the empirically-tested BigQuery -> Databricks type mapping (FE GCP + DBSQL 2026.10):
* BIGNUMERIC -> string (precision 76 exceeds Databricks DECIMAL max 38; STRING preserves it)
* NUMERIC -> decimal(38,9) (bare NUMERIC is fixed 38/9; sqlglot would emit DECIMAL(10,0))
* TIME -> string (Databricks has no TIME type)
* JSON -> variant
* RANGE<T> -> struct<start <T>, end <T>>
All other types (INT64, FLOAT64, BOOL, STRING, BYTES, DATE, DATETIME, TIMESTAMP, NUMERIC(p,s),
GEOGRAPHY, ARRAY, STRUCT) are left raw because sqlglot translates them correctly.
INTERVAL is intentionally not mapped: it migrates to two Databricks columns, which the one-to-one
schema comparison cannot represent, so such columns surface as a visible mismatch.
"""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please avoid comments. Readable well-written code does not need comments

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

logger.warning(f"Could not parse datatype {datatype} for source {source_dialect}")

# BigQuery decimals need scale-aware padding to match Spark's CAST(DECIMAL AS STRING).
if source_dialect == "bigquery" and parsed == exp.DataType.Type.DECIMAL.value:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we support this through DataType_transform_mapping I dont want to introduce another magical place that does things conditionally.

we should improve DataType_transform_mapping to better support this. please revert this and add the new refactoring on a new dedicated PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the conditional branch. Tracking the DataType_transform_mapping refactor in #2537.

Comment on lines +283 to +285
# BigQuery CONCAT/|| and TRIM/COALESCE require STRING operands, so cast every column to STRING
# before concatenation (mirrors the tsql default). DATE/INT64/NUMERIC/STRING cast deterministically;
# TIMESTAMP/FLOAT64 string formatting parity with Databricks is a follow-up (explicit FORMAT needed).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# BigQuery CONCAT/|| and TRIM/COALESCE require STRING operands, so cast every column to STRING
# before concatenation (mirrors the tsql default). DATE/INT64/NUMERIC/STRING cast deterministically;
# TIMESTAMP/FLOAT64 string formatting parity with Databricks is a follow-up (explicit FORMAT needed).
# TODO: add timestamps and numbers handling

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied — left the # TODO: add timestamps and numbers handling marker; full handling in #2537.

Comment on lines +367 to +369
# The BigQuery project is abstracted by the UC connection (its default project scopes
# unqualified names), so no project is prompted; the connector uses two-part dataset.table.
# catalog is set to the dataset below (it must be non-empty to round-trip through serde).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# The BigQuery project is abstracted by the UC connection (its default project scopes
# unqualified names), so no project is prompted; the connector uses two-part dataset.table.
# catalog is set to the dataset below (it must be non-empty to round-trip through serde).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Comment on lines +383 to +384
# BigQuery has no separate catalog; mirror the dataset so the value is non-empty (the
# connector ignores it). The project is abstracted by the UC connection.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# BigQuery has no separate catalog; mirror the dataset so the value is non-empty (the
# connector ignores it). The project is abstracted by the UC connection.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

… revert decimal transform

- BigQuery connector uses catalog as the remote_query materializationDataset (logged),
  removing the constructor-only materialization_dataset arg.
- Remove explanatory comments in the connector, install, and reconcile fixtures.
- Revert the special-cased BigQuery decimal transform (bigquery_decimal_transform and the
  _get_transform branch); decimal/timestamp hashing parity will be handled via
  DataType_transform_mapping in a follow-up.
@take60

take60 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @m-abulazm — addressed all three (CI green):

  1. catalog as materializationDataset — the connector now uses catalog for it (with a log line); dropped the materialization_dataset constructor arg. That made the separate config knob unnecessary, so I closed BigQuery reconcile: make materialization_dataset configurable via SourceConnectionConfig #2529.
  2. Comments — removed the docstring prose and inline comments in the connector / install / fixtures.
  3. Decimal transform — reverted bigquery_decimal_transform and the _get_transform branch; left a # TODO. Doing it properly via DataType_transform_mapping is tracked in BigQuery reconcile: handle decimal/timestamp hash parity via DataType_transform_mapping #2537 (happy to pair).

PTAL.

@take60 take60 requested a review from m-abulazm June 26, 2026 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants