
feat: --events-bq-query-file event source for bqaa-revalidate-extractors (#75 follow-up) #151

Merged
haiyuan-eng-google merged 3 commits into GoogleCloudPlatform:main from caohy1988:feat/extractor-revalidate-cli-bq-query
May 12, 2026

Conversation

@caohy1988 (Contributor)

Summary

The follow-up promised in PR #150: bqaa-revalidate-extractors now accepts a BigQuery event source in addition to --events-jsonl. The two are mutually exclusive, and exactly one must be supplied (an argparse mutually exclusive group with required=True).

Contract

The SQL must produce exactly one column named event_json (STRING) per row, containing a JSON-encoded event dict, the same shape that --events-jsonl consumes line by line. The CLI does not auto-shape bigquery.Row objects; the query writer controls the projection:

```sql
SELECT TO_JSON_STRING(STRUCT(event_type, span_id, session_id, content)) AS event_json
FROM `proj.ds.agent_events`
WHERE event_timestamp BETWEEN TIMESTAMP('2026-05-01') AND TIMESTAMP('2026-05-12')
LIMIT 10000
```

(The SQL file must be fully self-contained; the CLI does not accept query parameters, so substitute concrete literals before invoking.)

Explicit non-goal: auto-row-shape inference would couple the CLI to every caller's table schema. The single-column contract keeps the CLI predictable; the query writer owns projection.

New flags

| Flag | Required | Description |
| --- | --- | --- |
| `--events-bq-query-file` | one of | Mutually exclusive with `--events-jsonl`; exactly one event source must be supplied. |
| `--bq-project` | no | Optional, with ADC fallback. If the flag is absent and ADC can't infer a project, exit 2 with "Set --bq-project explicitly". |
| `--bq-location` | no | Defaults to `US`; ignored under `--events-jsonl`. |
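The mutually exclusive event-source group can be sketched roughly as follows. Flag names and the required=True mutex come from this PR; the parser wiring itself is a simplified assumption, not the repository's actual code:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch only: the real CLI routes errors through its _CliError boundary.
    parser = argparse.ArgumentParser(prog="bqaa-revalidate-extractors")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--events-jsonl", metavar="PATH",
                        help="JSONL file with one event dict per line.")
    source.add_argument("--events-bq-query-file", metavar="PATH",
                        help="SQL file producing a single event_json STRING column.")
    parser.add_argument("--bq-project", default=None,
                        help="BigQuery project; falls back to ADC inference.")
    parser.add_argument("--bq-location", default="US",
                        help="BigQuery location; ignored under --events-jsonl.")
    return parser

args = build_parser().parse_args(["--events-bq-query-file", "events.sql"])
print(args.bq_location)  # US
```

With required=True, argparse itself returns exit status 2 for both the "neither source" and "both sources" cases.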

Error mapping (all exit 2, no traceback)

| Failure | Surface |
| --- | --- |
| BigQuery client construction fails (auth, ADC, credentials) | `BigQuery client construction failed: <Type>: <message>` |
| BigQuery raises at query time (syntax, table-not-found, permission) | `BigQuery query failed: <Type>: <message>` |
| Result schema has columns besides `event_json` | `exactly one column ... got ['event_json', 'extra_col']` |
| Row missing `event_json` column | `row N: missing required column 'event_json'` |
| `event_json` value isn't STRING | `row N: 'event_json' must be STRING; got <type>` |
| `event_json` STRING isn't valid JSON | `row N: invalid JSON in 'event_json': <msg>` |
| `event_json` decodes to non-object | `row N: 'event_json' decodes to <type>, expected a JSON object` |
| Empty / whitespace-only .sql file | `... is empty` |
| Invalid UTF-8 in .sql file | `not valid UTF-8` |

Row indices are 0-based against the result set, so an operator can LIMIT N OFFSET row_index to find the offender.

Implementation

Client construction is centralized behind _make_bq_client(project, location) so unit tests can inject in-memory fakes via monkeypatch.setattr rather than wiring a client through every call site. The factory tries bigquery.Client(project=...) first, falls back to bigquery.Client() when project is None, and raises _CliError if the resulting client.project is empty.

Test plan

  • CI suite (tests/test_extractor_compilation_cli_revalidate.py): 20 → 31 cases.
    • TestCliUsageErrors gains two mutex cases (both / neither event sources).
    • New TestCliEventsBQ (9): happy path, ADC project inference, no-project-anywhere, query exceptions, every row-shape failure mode, empty SQL file.
  • Live BQ suite (tests/test_extractor_compilation_cli_revalidate_bq_live.py, 1 case) gated behind BQAA_RUN_LIVE_TESTS=1 + BQAA_RUN_LIVE_BQ_REVALIDATE_TESTS=1 + PROJECT_ID + DATASET_ID. Creates a temp table, inserts two event_json rows, runs the CLI, asserts both events came through as compiled_unchanged + parity matches, drops the table.
  • 133 related extractor_compilation tests pass; pyink + isort clean.
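The in-memory fakes those tests inject in place of `_make_bq_client` can be sketched like this. Class names and shapes are illustrative; the real fakes live in the test file (and, per the round-3 commit, grew an optional `schema=` parameter):

```python
class FakeRow(dict):
    """Dict-based stand-in for bigquery.Row; supports row["event_json"]."""

class FakeQueryJob:
    def __init__(self, rows, schema=None):
        self._rows = rows
        self.schema = schema  # optional, mirroring the round-3 fake
    def result(self):
        return list(self._rows)

class FakeBQClient:
    """Injected via monkeypatch.setattr(module, "_make_bq_client", ...)."""
    project = "test-project"
    def __init__(self, rows, schema=None):
        self._job = FakeQueryJob(rows, schema=schema)
    def query(self, sql):
        return self._job

job = FakeBQClient([FakeRow(event_json='{"event_type": "tool_call"}')]).query("SELECT ...")
print([row["event_json"] for row in job.result()])
```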

Out of scope (deferred)

  • Pagination strategy for ultra-large corpora (--events-row-limit, streaming aggregation) — current iteration handles tens of thousands comfortably.
  • Scheduled execution, BQ persistence of reports, multiple bundle roots.
  • Auto-row-shape inference (explicit non-goal).

Up next after this lands: the end-to-end rollout doc stitching compile → publish → sync → wire → revalidate into one flow.

caohy1988 added 3 commits May 12, 2026 09:41
…ors (GoogleCloudPlatform#75 follow-up)

The CLI now accepts a BigQuery event source in addition to
--events-jsonl; the two are mutually exclusive and exactly one
must be supplied. Promised follow-up to the local-only landing
of PR GoogleCloudPlatform#150.

Contract: the SQL must produce exactly one column named
``event_json`` (STRING) per row, containing a JSON-encoded
event dict — same shape --events-jsonl consumes line-by-line.
The CLI does NOT auto-shape bigquery.Row objects; the query
writer controls projection (typically via
``TO_JSON_STRING(STRUCT(...))``). Explicit non-goal: row-shape
auto-inference would couple the CLI to every caller's table
schema; the single-column contract keeps the CLI predictable.

New flags:

* --events-bq-query-file PATH (mutex with --events-jsonl;
  argparse mutex group with required=True so "both" / "neither"
  both return exit 2 via the existing _CliError boundary).
* --bq-project PROJECT_ID — OPTIONAL. When absent the BigQuery
  client falls back to Application Default Credentials /
  environment for project inference. If both the flag and the
  inferred project are empty, the CLI exits 2 with
  "Set --bq-project explicitly" rather than confusing the
  operator with a downstream BigQuery API error.
* --bq-location LOC — defaults to "US"; ignored under
  --events-jsonl.

Error handling — every failure mode surfaces as a clean exit 2
with the file path and (for row-level errors) the 0-based row
index named, so an operator can find the offender with
``LIMIT N OFFSET row_index``:

* BigQuery raises (auth, syntax, table-not-found, permission)
  → "BigQuery query failed: <Type>: <message>"
* Row missing the event_json column
  → "row N: missing required column 'event_json'"
* event_json value isn't a STRING (struct projected without
  TO_JSON_STRING) → "row N: 'event_json' must be STRING"
* event_json STRING isn't valid JSON
  → "row N: invalid JSON in 'event_json'"
* event_json decodes to non-object (array, scalar)
  → "row N: 'event_json' decodes to <type>, expected a JSON
  object"
* Empty / whitespace-only .sql file → "is empty"
* Invalid UTF-8 in .sql file → "not valid UTF-8"

Implementation notes:

* Client construction is centralized behind
  ``_make_bq_client(project, location)`` so unit tests inject
  fakes via ``monkeypatch.setattr(cli_revalidate,
  "_make_bq_client", ...)`` rather than wiring through every
  call site. The real factory tries
  ``bigquery.Client(project=...)`` first; falls back to
  ``bigquery.Client()`` if project is None; raises _CliError if
  client.project ends up falsy.
* ``_load_events_from_bq`` reads the .sql file (with the same
  UTF-8 / OSError wrapping pattern as _load_jsonl /
  _load_thresholds), constructs the client via the factory,
  runs ``client.query(sql).result()`` inside a single
  try/except that catches Exception (not BaseException, so
  KeyboardInterrupt / SystemExit still propagate), then
  iterates rows with row-indexed validation.
* ``_load_config`` dispatches on whichever event-source flag
  is set; argparse's mutex group enforces exactly one.

Tests: 20 -> 31 in tests/test_extractor_compilation_cli_revalidate.py.

* TestCliUsageErrors gains two cases for the mutex (both /
  neither event sources).
* New TestCliEventsBQ class (9 cases): happy path, ADC
  project inference, no-project-anywhere, query exceptions,
  every row-shape failure mode (missing column, non-string,
  invalid JSON, non-dict decode), empty .sql file.

Live BQ test in tests/test_extractor_compilation_cli_revalidate_bq_live.py,
gated behind BQAA_RUN_LIVE_TESTS=1 +
BQAA_RUN_LIVE_BQ_REVALIDATE_TESTS=1 + PROJECT_ID + DATASET_ID.
Creates a temp table, inserts two event_json rows shaped like
BKA decisions, runs the CLI, asserts the report is written
with both events as compiled_unchanged + parity_matches,
deletes the table on the way out.

Docs: CLI doc page describes the contract, error mapping,
project resolution, and adds a BigQuery usage example.
docs/README.md index entry + CHANGELOG entry updated.
…e-column + fix stale example SQL

Addresses PR GoogleCloudPlatform#151 round-1 reviewer findings.

P1 - BigQuery client construction now wrapped in its own
try/except. Previously ``_make_bq_client(...)`` ran OUTSIDE
the try that wraps query failures, so auth / ADC / invalid-
credentials / network failures escaped as raw tracebacks
despite the docs promising clean CLI errors. The construction
wrap distinguishes the failure mode from query-time failures
via a separate prefix:

* "BigQuery client construction failed: ..." for auth/ADC
  errors before ``client.query`` runs.
* "BigQuery query failed: ..." for syntax/permission/table-
  not-found at query time.

Our own ``_CliError`` from project-inference failure inside
the factory passes through unchanged (no re-wrapping).

P2 - "Exactly one column named event_json" contract is now
enforced. Previously a query returning ``event_json,
extra_col`` was silently accepted because the row loop only
reads ``row["event_json"]``. Now validated against the first
row's keys before iteration; mismatch fails with exit 2
listing the offending column set so an operator can fix the
SQL.

Check is via ``sorted(rows[0].keys())`` (works for both
``bigquery.Row`` and dict fakes). Empty result sets are a
no-op — nothing to validate, harness produces a zero-event
report.

P3 - Stale example SQL fixed. The doc previously used
``WHERE event_timestamp BETWEEN @start_time AND @end_time``,
but the CLI doesn't accept query parameters; copying the
example as-is would fail at BigQuery. Replaced with literal
``TIMESTAMP('2026-05-01')`` / ``TIMESTAMP('2026-05-12')``
and explicit doc text: "The SQL file must be fully
self-contained — the CLI does not accept query parameters,
so substitute concrete literals before invoking."

Tests: 31 -> 33 in
tests/test_extractor_compilation_cli_revalidate.py.

* test_bq_client_construction_failure_returns_two:
  monkey-patches _make_bq_client to raise RuntimeError;
  asserts exit 2 with "BigQuery client construction failed"
  prefix + the underlying message.
* test_bq_extra_column_rejected: row with extra_col key
  fails with "exactly one column ... got ['event_json',
  'extra_col']"; report is NOT written.

All 32 CLI tests + 1 entry-point skip pass; pyink + isort
clean on the touched files.
… zero-row wrong-schema case)

PR GoogleCloudPlatform#151 round-2 reviewer finding.

P2 - The previous "exactly one column" check only ran when
rows is non-empty. A query like ``SELECT event_json,
extra_col FROM t WHERE FALSE`` returns zero rows but still
has the wrong schema; the CLI silently skipped validation
and wrote a successful zero-event report — a misleading
"all clear" signal when the SQL itself is broken.

Fix: derive column names via a new
``_query_result_column_names(job, rows)`` helper that
prefers ``job.schema`` (real ``bigquery.QueryJob`` populates
this regardless of row count) and falls back to
``sorted(rows[0].keys())`` only when the job lacks a usable
schema (test fakes that don't simulate it). Returns ``None``
on the degenerate "fake + zero rows + no schema" case,
which the caller treats as "skip the contract check" — that
preserves existing fake-based tests without weakening the
real-BigQuery path (production always has schema).

The fake test client now supports an optional ``schema=``
parameter (list of objects exposing ``.name``) so tests can
simulate both correct and incorrect schemas independently
of the row payload.

Tests: 33 -> 35.

* test_bq_extra_column_rejected_on_empty_result_set:
  zero rows + ``schema=[event_json, extra_col]`` → exit 2
  with the column-set listed; report NOT written. The
  bug reproducer.
* test_bq_correct_schema_empty_result_set_succeeds:
  zero rows + ``schema=[event_json]`` → exit 0 with a
  zero-event report. Locks the design choice that an
  empty-but-well-shaped result is a valid revalidation
  outcome, not an error.

Docs: failure-mode table notes the schema-based check;
contract section + test list updated.
haiyuan-eng-google merged commit c0d6eac into GoogleCloudPlatform:main May 12, 2026
9 checks passed