feat: --events-bq-query-file event source for bqaa-revalidate-extractors (#75 follow-up) (#151)
Merged
haiyuan-eng-google merged 3 commits on May 12, 2026
Conversation
…ors (GoogleCloudPlatform#75 follow-up)

The CLI now accepts a BigQuery event source in addition to --events-jsonl; the two are mutually exclusive and exactly one must be supplied. Promised follow-up to the local-only landing of PR GoogleCloudPlatform#150.

Contract: the SQL must produce exactly one column named ``event_json`` (STRING) per row, containing a JSON-encoded event dict — the same shape --events-jsonl consumes line-by-line. The CLI does NOT auto-shape bigquery.Row objects; the query writer controls projection (typically via ``TO_JSON_STRING(STRUCT(...))``). Explicit non-goal: row-shape auto-inference would couple the CLI to every caller's table schema; the single-column contract keeps the CLI predictable.

New flags:

* --events-bq-query-file PATH — mutually exclusive with --events-jsonl; argparse mutex group with required=True, so "both" and "neither" each return exit 2 via the existing _CliError boundary.
* --bq-project PROJECT_ID — OPTIONAL. When absent, the BigQuery client falls back to Application Default Credentials / environment for project inference. If both the flag and the inferred project are empty, the CLI exits 2 with "Set --bq-project explicitly" rather than confusing the operator with a downstream BigQuery API error.
* --bq-location LOC — defaults to "US"; ignored under --events-jsonl.
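The flag wiring above can be sketched with a plain argparse mutually exclusive group. This is a hypothetical illustration (the flag names come from this PR; the parser body itself is invented, not the project's actual code):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Sketch of the CLI surface described above. argparse's mutex group with
    # required=True rejects both "both flags" and "neither flag" with exit 2.
    parser = argparse.ArgumentParser(prog="bqaa-revalidate-extractors")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--events-jsonl", metavar="PATH")
    source.add_argument("--events-bq-query-file", metavar="PATH")
    parser.add_argument("--bq-project", metavar="PROJECT_ID", default=None)
    parser.add_argument("--bq-location", metavar="LOC", default="US")
    return parser
```

argparse exits with status 2 on usage errors by default, which is why "both" and "neither" land on the same exit code as the CLI's `_CliError` boundary.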
Error handling — every failure mode surfaces as a clean exit 2 with the file path and (for row-level errors) the 0-based row index named, so an operator can find the offender with ``LIMIT N OFFSET row_index``:

* BigQuery raises (auth, syntax, table-not-found, permission) → "BigQuery query failed: <Type>: <message>"
* Row missing the event_json column → "row N: missing required column 'event_json'"
* event_json value isn't a STRING (struct projected without TO_JSON_STRING) → "row N: 'event_json' must be STRING"
* event_json STRING isn't valid JSON → "row N: invalid JSON in 'event_json'"
* event_json decodes to a non-object (array, scalar) → "row N: 'event_json' decodes to <type>, expected a JSON object"
* Empty / whitespace-only .sql file → "is empty"
* Invalid UTF-8 in .sql file → "not valid UTF-8"

Implementation notes:

* Client construction is centralized behind ``_make_bq_client(project, location)`` so unit tests inject fakes via ``monkeypatch.setattr(cli_revalidate, "_make_bq_client", ...)`` rather than wiring through every call site. The real factory tries ``bigquery.Client(project=...)`` first, falls back to ``bigquery.Client()`` if project is None, and raises _CliError if client.project ends up falsy.
* ``_load_events_from_bq`` reads the .sql file (with the same UTF-8 / OSError wrapping pattern as _load_jsonl / _load_thresholds), constructs the client via the factory, runs ``client.query(sql).result()`` inside a single try/except that catches Exception (not BaseException, so KeyboardInterrupt / SystemExit still propagate), then iterates rows with row-indexed validation.
* ``_load_config`` dispatches on whichever event-source flag is set; argparse's mutex group enforces exactly one.

Tests: 20 -> 31 in tests/test_extractor_compilation_cli_revalidate.py.

* TestCliUsageErrors gains two cases for the mutex (both / neither event sources).
* New TestCliEventsBQ class (9 cases): happy path, ADC project inference, no-project-anywhere, query exceptions, every row-shape failure mode (missing column, non-string, invalid JSON, non-dict decode), empty .sql file.

Live BQ test in tests/test_extractor_compilation_cli_revalidate_bq_live.py, gated behind BQAA_RUN_LIVE_TESTS=1 + BQAA_RUN_LIVE_BQ_REVALIDATE_TESTS=1 + PROJECT_ID + DATASET_ID. It creates a temp table, inserts two event_json rows shaped like BKA decisions, runs the CLI, asserts the report is written with both events as compiled_unchanged + parity_matches, and deletes the table on the way out.

Docs: the CLI doc page describes the contract, error mapping, and project resolution, and adds a BigQuery usage example. docs/README.md index entry + CHANGELOG entry updated.
…e-column + fix stale example SQL

Addresses PR GoogleCloudPlatform#151 round-1 reviewer findings.

P1 - BigQuery client construction is now wrapped in its own try/except. Previously ``_make_bq_client(...)`` ran OUTSIDE the try that wraps query failures, so auth / ADC / invalid-credentials / network failures escaped as raw tracebacks despite the docs promising clean CLI errors. The construction wrap distinguishes the failure mode from query-time failures via a separate prefix:

* "BigQuery client construction failed: ..." for auth/ADC errors before ``client.query`` runs.
* "BigQuery query failed: ..." for syntax/permission/table-not-found errors at query time.

Our own ``_CliError`` from project-inference failure inside the factory passes through unchanged (no re-wrapping).

P2 - The "exactly one column named event_json" contract is now enforced. Previously a query returning ``event_json, extra_col`` was silently accepted because the row loop only reads ``row["event_json"]``. It is now validated against the first row's keys before iteration; a mismatch fails with exit 2 listing the offending column set so an operator can fix the SQL. The check uses ``sorted(rows[0].keys())`` (works for both ``bigquery.Row`` and dict fakes). Empty result sets are a no-op — nothing to validate, and the harness produces a zero-event report.

P3 - Stale example SQL fixed. The doc previously used ``WHERE event_timestamp BETWEEN @start_time AND @end_time``, but the CLI doesn't accept query parameters; copying the example as-is would fail at BigQuery. Replaced with literal ``TIMESTAMP('2026-05-01')`` / ``TIMESTAMP('2026-05-12')`` and explicit doc text: "The SQL file must be fully self-contained — the CLI does not accept query parameters, so substitute concrete literals before invoking."

Tests: 31 -> 33 in tests/test_extractor_compilation_cli_revalidate.py.
* test_bq_client_construction_failure_returns_two: monkey-patches _make_bq_client to raise RuntimeError; asserts exit 2 with the "BigQuery client construction failed" prefix + the underlying message.
* test_bq_extra_column_rejected: a row with an extra_col key fails with "exactly one column ... got ['event_json', 'extra_col']"; the report is NOT written.

All 32 CLI tests + 1 entry-point skip pass; pyink + isort clean on the touched files.
… zero-row wrong-schema case)

PR GoogleCloudPlatform#151 round-2 reviewer finding.

P2 - The previous "exactly one column" check only ran when rows is non-empty. A query like ``SELECT event_json, extra_col FROM t WHERE FALSE`` returns zero rows but still has the wrong schema; the CLI silently skipped validation and wrote a successful zero-event report — a misleading "all clear" signal when the SQL itself is broken.

Fix: derive column names via a new ``_query_result_column_names(job, rows)`` helper that prefers ``job.schema`` (a real ``bigquery.QueryJob`` populates this regardless of row count) and falls back to ``sorted(rows[0].keys())`` only when the job lacks a usable schema (test fakes that don't simulate it). It returns ``None`` on the degenerate "fake + zero rows + no schema" case, which the caller treats as "skip the contract check" — that preserves existing fake-based tests without weakening the real-BigQuery path (production always has schema). The fake test client now supports an optional ``schema=`` parameter (a list of objects exposing ``.name``) so tests can simulate both correct and incorrect schemas independently of the row payload.

Tests: 33 -> 35.

* test_bq_extra_column_rejected_on_empty_result_set: zero rows + ``schema=[event_json, extra_col]`` → exit 2 with the column set listed; report NOT written. The bug reproducer.
* test_bq_correct_schema_empty_result_set_succeeds: zero rows + ``schema=[event_json]`` → exit 0 with a zero-event report. Locks in the design choice that an empty-but-well-shaped result is a valid revalidation outcome, not an error.

Docs: the failure-mode table notes the schema-based check; contract section + test list updated.
Summary
The deliberate follow-up to PR #150:
`bqaa-revalidate-extractors` now accepts a BigQuery event source in addition to `--events-jsonl`. Mutually exclusive — exactly one must be supplied (argparse mutex group with `required=True`).

Contract
The SQL must produce exactly one column named `event_json` (STRING) per row, containing a JSON-encoded event dict — the same shape `--events-jsonl` consumes line-by-line. The CLI does not auto-shape `bigquery.Row` objects; the query writer controls projection.

Explicit non-goal: auto-row-shape inference would couple the CLI to every caller's table schema. The single-column contract keeps the CLI predictable; the query writer owns projection.
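For illustration, a conforming query file could contain SQL like the string below (the table and struct fields are hypothetical; `TO_JSON_STRING(STRUCT(...))` and the concrete-literals requirement come from this PR). A result row's `event_json` cell then decodes to the JSON object the harness consumes:

```python
import json

# Hypothetical .sql file contents. Fully self-contained: concrete TIMESTAMP
# literals (the CLI accepts no query parameters) and exactly one projected
# column, event_json.
EXAMPLE_SQL = """
SELECT TO_JSON_STRING(STRUCT(event_name, event_timestamp)) AS event_json
FROM `my-project.my_dataset.decision_events`
WHERE event_timestamp BETWEEN TIMESTAMP('2026-05-01') AND TIMESTAMP('2026-05-12')
"""

# One result row as the CLI would see it: a STRING cell decoding to a dict,
# the same shape --events-jsonl consumes line-by-line.
row = {"event_json": '{"event_name": "decision", "event_timestamp": "2026-05-02 00:00:00"}'}
event = json.loads(row["event_json"])
```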
New flags
* `--events-bq-query-file PATH` — mutually exclusive with `--events-jsonl`.
* `--bq-project PROJECT_ID` — optional; when absent the project is inferred from Application Default Credentials / environment. If neither yields a project, the CLI exits 2 with `Set --bq-project explicitly`.
* `--bq-location LOC` — defaults to `US`.

Error mapping (all exit 2, no traceback)
| Failure | Message |
| --- | --- |
| BigQuery raises (auth, syntax, table-not-found, permission) | `BigQuery query failed: <Type>: <message>` |
| Row missing the `event_json` column | `row N: missing required column 'event_json'` |
| `event_json` value isn't STRING | `row N: 'event_json' must be STRING; got <type>` |
| `event_json` STRING isn't valid JSON | `row N: invalid JSON in 'event_json': <msg>` |
| `event_json` decodes to non-object | `row N: 'event_json' decodes to <type>, expected a JSON object` |
| Empty / whitespace-only `.sql` file | `... is empty` |
| Invalid UTF-8 in `.sql` file | `... not valid UTF-8` |

Row indices are 0-based against the result set, so an operator can `LIMIT N OFFSET row_index` to find the offender.

Implementation
Client construction is centralized behind `_make_bq_client(project, location)` so unit tests inject in-memory fakes via `monkeypatch.setattr` rather than wiring through every call site. The factory tries `bigquery.Client(project=...)` first, falls back to `bigquery.Client()` if project is None, and raises `_CliError` if the resulting `client.project` is empty.

Test plan
* Unit tests (`tests/test_extractor_compilation_cli_revalidate.py`): 20 → 31 cases.
  * `TestCliUsageErrors` gains two mutex cases (both / neither event sources).
  * `TestCliEventsBQ` (9): happy path, ADC project inference, no-project-anywhere, query exceptions, every row-shape failure mode, empty SQL file.
* Live BQ test (`tests/test_extractor_compilation_cli_revalidate_bq_live.py`, 1 case) gated behind `BQAA_RUN_LIVE_TESTS=1` + `BQAA_RUN_LIVE_BQ_REVALIDATE_TESTS=1` + `PROJECT_ID` + `DATASET_ID`. Creates a temp table, inserts two `event_json` rows, runs the CLI, asserts both events came through as `compiled_unchanged` + parity matches, drops the table.
* `pyink` + `isort` clean.

Out of scope (deferred)
* `--events-row-limit` / streaming aggregation — the current iteration handles tens of thousands of events comfortably.

Up next after this lands: the end-to-end rollout doc stitching compile → publish → sync → wire → revalidate into one flow.