
feat: BigQuery-table bundle mirror for compiled extractors (#75 PR C2.c.3) #148

Merged
haiyuan-eng-google merged 6 commits into GoogleCloudPlatform:main from caohy1988:feat/extractor-bq-bundle-mirror
May 12, 2026

Conversation

@caohy1988
Contributor

Summary

PR C2.c.3 of issue #75 — BigQuery-table bundle mirror for compiled structured extractors. Closes the compiled-bundle distribution story for environments where the filesystem isn't shared (Cloud Run, Cloud Functions, ephemeral CI workers).

The mirror is a utility, not a runtime loader. Runtime path stays unchanged:

sync_bundles_from_bq  →  discover_bundles  →  from_bundles_root

Sync writes verified files to a local directory; C2.a's existing loader does the actual import. No "fetch-direct-from-BQ" loader — that would double the trust surface and diverge from the loader's audit fields.

Per-bundle flow

Publish: walk bundle_root, read each bundle's manifest.json, run load_bundle(child, expected_fingerprint=manifest.fingerprint) as a pre-publish gate (mirror only distributes bundles that would load at runtime), emit two BundleRows per bundle (manifest + module file), upsert via DELETE+INSERT.

Sync: fetch rows, group by fingerprint, validate every bundle_path (path-safety + strict file-set: exactly manifest.json + the manifest's module_filename), write files to dest_dir/<fingerprint>/, run load_bundle as a post-sync gate. Tampered or incomplete bundles fail at sync — partial directories are scrubbed.

Stable failure codes

Publish: bundle_root_missing, manifest_missing, manifest_unreadable, bundle_load_failed.

Sync: fingerprint_not_in_table, manifest_row_missing, manifest_row_unreadable, invalid_bundle_path (traversal / absolute / NUL / backslash — fail-closed before any write), unexpected_file, module_row_missing, duplicate_row, malformed_row, bundle_load_failed (post-sync).
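The `invalid_bundle_path` code implies a fail-closed check that runs before any byte is written. A minimal sketch of that kind of validator — the helper name and exact rule set here are assumptions for illustration, not the PR's implementation:

```python
def is_safe_bundle_path(bundle_path: str) -> bool:
    """Fail-closed: reject traversal, absolute, NUL, and backslash paths
    before any write happens."""
    if not bundle_path or "\x00" in bundle_path or "\\" in bundle_path:
        return False
    if bundle_path.startswith("/"):
        return False
    # Reject any segment that is empty, '.', or '..' (upward escape).
    if any(part in ("", ".", "..") for part in bundle_path.split("/")):
        return False
    return True

# Mirrors the test plan's rejection cases:
assert is_safe_bundle_path("manifest.json")
assert not is_safe_bundle_path("../escape.py")
assert not is_safe_bundle_path("/etc/passwd")
assert not is_safe_bundle_path("..\\windows-style-escape.py")
```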

Per-bundle problems land in the result's failures tuple; neither function raises. Store-side exceptions (BQ network, auth, table missing) propagate.

Test plan

  • tests/test_extractor_compilation_bq_bundle_mirror.py — 17 cases using an in-memory BundleStore:
    • TestRoundTrip (2): single + multi-bundle round-trip; load_bundle accepts the reconstruction.
    • TestAllowlist (3): publish allowlist; sync allowlist; sync allowlist with missing fingerprint → fingerprint_not_in_table.
    • TestPathSafety (3): ../escape.py / /etc/passwd / ..\windows-style-escape.py all rejected; no file written outside dest_dir.
    • TestMissingAndMalformedRows (5): missing manifest row, malformed manifest content, unexpected file, malformed row type, duplicate rows.
    • TestIdempotentRepublish (1): two consecutive publishes leave exactly two rows, not four.
    • TestPublishFailures (3): subdir without manifest; bundle that fails load_bundle pre-publish; missing bundle_root.
  • tests/test_extractor_compilation_bq_bundle_mirror_live.py — gated behind BQAA_RUN_LIVE_TESTS=1 + BQAA_RUN_LIVE_BQ_MIRROR_TESTS=1 + PROJECT_ID + DATASET_ID. Creates a temp table, round-trips publish+sync, deletes the table.
  • All 168 related extractor_compilation tests pass.
  • Public surface re-exported from bigquery_agent_analytics.extractor_compilation and the top-level package.
  • Docs: docs/extractor_compilation_bq_bundle_mirror.md; docs/README.md index entry; CHANGELOG.md updated.

Out of scope (deferred)

GCS signed-URL fetch for large bundles; caching / TTL of synced bundles; garbage collection of stale fingerprints; multi-region replication.

caohy1988 and others added 6 commits May 12, 2026 00:13
…udPlatform#75 PR C2.c.3)

Adds a publish/sync utility so compiled bundles can flow between
processes via a BigQuery table — useful for Cloud Run, Cloud
Functions, ephemeral CI workers, or any environment where the
filesystem isn't shared.

The mirror is a utility, not a runtime loader. The runtime path
stays unchanged:

    sync_bundles_from_bq -> discover_bundles -> from_bundles_root

Sync writes verified files to a local directory and lets C2.a's
existing loader do the actual import.

Public surface (src/bigquery_agent_analytics/extractor_compilation/
bq_bundle_mirror.py):

* publish_bundles_to_bq(bundle_root, store, ...) walks a local
  bundle root, validates each candidate via load_bundle as a
  pre-publish gate (the mirror only distributes bundles that
  would load at runtime), and pushes the constituent files as
  rows.
* sync_bundles_from_bq(store, dest_dir, ...) fetches rows for
  the requested fingerprints, writes files into
  dest_dir/<fingerprint>/, and calls load_bundle as a post-sync
  gate. Tampered or incomplete bundles fail at sync rather than
  at runtime; the partial directory is scrubbed.
* BundleStore Protocol the two functions consume; concrete
  BigQueryBundleStore wraps google.cloud.bigquery with
  DELETE+INSERT upsert by (bundle_fingerprint, bundle_path) so
  re-publishing replaces rather than accumulates rows.
* BUNDLE_MIRROR_TABLE_SCHEMA exported for callers who want to
  create the table themselves; BigQueryBundleStore.ensure_table()
  does it idempotently otherwise.
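A hedged sketch of what a Protocol-shaped store with upsert keyed by (bundle_fingerprint, bundle_path) could look like. The method names and row shape here are illustrative assumptions, not the module's actual API; a real BigQuery-backed store would implement the same replace-on-republish semantics via DELETE+INSERT:

```python
from typing import Protocol

class BundleStoreLike(Protocol):
    """Illustrative stand-in for the PR's BundleStore Protocol."""
    def publish_rows(self, rows: list[dict]) -> None: ...
    def fetch_rows(self, fingerprints: list[str]) -> list[dict]: ...

class InMemoryBundleStore:
    """Dict-backed store: upsert by (bundle_fingerprint, bundle_path)."""
    def __init__(self) -> None:
        self._rows: dict[tuple[str, str], dict] = {}

    def publish_rows(self, rows: list[dict]) -> None:
        # DELETE+INSERT semantics: re-publishing a key replaces the row.
        for row in rows:
            key = (row["bundle_fingerprint"], row["bundle_path"])
            self._rows[key] = row

    def fetch_rows(self, fingerprints: list[str]) -> list[dict]:
        return [r for (fp, _), r in self._rows.items() if fp in fingerprints]

store = InMemoryBundleStore()
row = {"bundle_fingerprint": "f" * 64, "bundle_path": "manifest.json",
       "content": "{}"}
store.publish_rows([row])
store.publish_rows([dict(row, content='{"v": 2}')])
# Idempotent republish: exactly one row per key, not two.
assert len(store.fetch_rows(["f" * 64])) == 1
```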

Strict bundle-shape: the table stores exactly two rows per
fingerprint (manifest.json + the manifest's module_filename).
Stable MirrorFailure codes cover every per-bundle problem:

* Publish: bundle_root_missing, manifest_missing,
  manifest_unreadable, bundle_load_failed.
* Sync: fingerprint_not_in_table, manifest_row_missing,
  manifest_row_unreadable, invalid_bundle_path (traversal /
  absolute / NUL / backslash; fail-closed before any write),
  unexpected_file, module_row_missing, duplicate_row,
  malformed_row, bundle_load_failed (post-sync).

Per-bundle problems land in the result's failures tuple;
neither function raises. Store-side exceptions (BQ network,
auth, table missing) DO propagate.

Tests:

* tests/test_extractor_compilation_bq_bundle_mirror.py — 17 cases
  using an in-memory BundleStore. Round-trip (single + multi),
  publish allowlist, sync allowlist, sync allowlist with missing
  fingerprint, path-safety (traversal / absolute / backslash —
  no file written outside dest_dir), missing manifest, malformed
  manifest, unexpected file, malformed row type, duplicate rows,
  idempotent republish, publish failure surfaces (subdir without
  manifest, bundle that would fail load_bundle, missing
  bundle_root).
* tests/test_extractor_compilation_bq_bundle_mirror_live.py —
  gated behind BQAA_RUN_LIVE_TESTS=1 +
  BQAA_RUN_LIVE_BQ_MIRROR_TESTS=1 + PROJECT_ID + DATASET_ID.
  Creates a temp table, round-trips publish+sync, asserts
  load_bundle accepts the reconstruction, deletes the table.

Docs: docs/extractor_compilation_bq_bundle_mirror.md describes
the contract; docs/README.md index entry + CHANGELOG entry added.

Out of scope: GCS signed-URL fetch, caching / TTL, garbage
collection, multi-region replication.
…-fingerprint guard

Addresses PR GoogleCloudPlatform#148 round-1 reviewer findings (4 total).

P1 - Strict manifest shape check before using module_filename.
Manifest.from_json is lenient; a row with
module_filename="subdir/extractor.py" used to slip past
_validate_bundle_path (a simple subdir is not a traversal) and
raise FileNotFoundError at the write step. New
_validate_manifest_shape rejects non-bare module_filename
(path separators, "..", ".", NUL) plus non-string / empty
fingerprint, function_name, event_types entries. Surfaces as
manifest_row_unreadable with an explicit detail.
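The bare-filename rule described above can be sketched as a small predicate. This is an assumption-level illustration of the check, not the actual `_validate_manifest_shape` code:

```python
def is_bare_filename(module_filename: object) -> bool:
    """Reject path separators, '.', '..', NUL, and non-string/empty values
    before module_filename is ever used to build a write path."""
    if not isinstance(module_filename, str) or not module_filename:
        return False
    if "/" in module_filename or "\\" in module_filename:
        return False
    if module_filename in (".", ".."):
        return False
    if "\x00" in module_filename:
        return False
    return True

assert is_bare_filename("extractor.py")
assert not is_bare_filename("subdir/extractor.py")  # used to hit the write step
assert not is_bare_filename("..")
assert not is_bare_filename("")
```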

P1 - Atomic sync. Sync used to remove dest_dir/<fp>/ before
writing the replacement; a bad mirror row therefore destroyed
the last-good runtime bundle. New flow: write to a side-by-side
.staging-<fp>-<uuid>/ directory, run load_bundle on the staged
copy, only then atomically replace the target via rmtree+move.
load_bundle failure or write exception scrubs the staging dir
and leaves any pre-existing dest_dir/<fp>/ intact.
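The staged-replace flow above can be sketched as follows, assuming a stand-in `validate` callable in place of `load_bundle`; names and structure are illustrative, not the module's code:

```python
import shutil
import tempfile
import uuid
from pathlib import Path

def staged_replace(dest_dir: Path, fingerprint: str,
                   files: dict[str, str], validate) -> None:
    """Write to a side-by-side staging dir, validate, then replace the
    target. Failure scrubs staging and leaves the old bundle intact."""
    target = dest_dir / fingerprint
    staging = dest_dir / f".staging-{fingerprint}-{uuid.uuid4().hex}"
    staging.mkdir(parents=True)
    try:
        for name, content in files.items():
            (staging / name).write_text(content)
        validate(staging)  # stands in for load_bundle on the staged copy
    except Exception:
        shutil.rmtree(staging)  # scrub; pre-existing target survives
        raise
    if target.exists():
        shutil.rmtree(target)  # rmtree+move: staged, not strictly atomic
    staging.rename(target)

def failing_validate(path: Path) -> None:
    raise ValueError("load_bundle failed on staged copy")

dest = Path(tempfile.mkdtemp())
staged_replace(dest, "a" * 64, {"manifest.json": "{}"}, lambda p: None)
try:
    staged_replace(dest, "a" * 64, {"manifest.json": "bad"}, failing_validate)
except ValueError:
    pass
# Last-good bundle intact, no orphaned staging dirs.
assert (dest / ("a" * 64) / "manifest.json").read_text() == "{}"
assert not list(dest.glob(".staging-*"))
```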

P2 - Publish-side duplicate fingerprint detection. Two subdirs
under bundle_root declaring the same manifest fingerprint used
to both emit (fp, manifest.json) + (fp, module) rows;
BigQueryBundleStore.publish_rows would DELETE the keys and
INSERT both copies, leaving real BQ with logical duplicates
that sync later rejects fail-closed. New two-pass walk in
publish_bundles_to_bq accumulates per-fingerprint candidates,
then refuses to publish any fingerprint claimed by more than
one subdir. Both subdirs get a duplicate_fingerprint failure
naming the participants; zero rows leak.
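The two-pass walk can be sketched like this — an illustrative reduction, assuming (subdir, fingerprint) pairs have already been collected from `bundle_root`:

```python
from collections import defaultdict

def partition_by_fingerprint(candidates: list[tuple[str, str]]):
    """Pass 1: accumulate subdirs per declared fingerprint.
    Pass 2: refuse any fingerprint claimed by more than one subdir."""
    by_fp: dict[str, list[str]] = defaultdict(list)
    for subdir, fingerprint in candidates:
        by_fp[fingerprint].append(subdir)
    publishable, failures = [], []
    for fingerprint, subdirs in by_fp.items():
        if len(subdirs) > 1:
            # duplicate_fingerprint failure naming the participants;
            # zero rows leak for this fingerprint.
            failures.append(
                ("duplicate_fingerprint", fingerprint, sorted(subdirs)))
        else:
            publishable.append((subdirs[0], fingerprint))
    return publishable, failures

pub, fail = partition_by_fingerprint(
    [("bundle_a", "f" * 64), ("bundle_b", "f" * 64), ("bundle_c", "e" * 64)])
assert pub == [("bundle_c", "e" * 64)]
assert fail == [("duplicate_fingerprint", "f" * 64, ["bundle_a", "bundle_b"])]
```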

P2 - publish_rows defensive input check. ValueError on duplicate
(bundle_fingerprint, bundle_path) input pairs *before* any
DELETE runs, so direct callers of the store can't bypass the
publisher's duplicate-fingerprint guard.
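A sketch of the defensive input check, raising before any store mutation would run; the function name is hypothetical:

```python
def check_no_duplicate_keys(rows: list[dict]) -> None:
    """Raise before any DELETE runs, so direct store callers cannot
    bypass the publisher's duplicate-fingerprint guard."""
    seen: set[tuple[str, str]] = set()
    for row in rows:
        key = (row["bundle_fingerprint"], row["bundle_path"])
        if key in seen:
            raise ValueError(f"duplicate input pair: {key}")
        seen.add(key)

# Two rows for one fingerprint (manifest + module) are fine...
check_no_duplicate_keys([
    {"bundle_fingerprint": "a" * 64, "bundle_path": "manifest.json"},
    {"bundle_fingerprint": "a" * 64, "bundle_path": "extractor.py"},
])
# ...but a repeated (fingerprint, path) pair is rejected.
duplicate_rejected = False
try:
    check_no_duplicate_keys([
        {"bundle_fingerprint": "a" * 64, "bundle_path": "manifest.json"},
        {"bundle_fingerprint": "a" * 64, "bundle_path": "manifest.json"},
    ])
except ValueError:
    duplicate_rejected = True
assert duplicate_rejected
```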

P3 - Soften the BigQueryBundleStore.publish_rows docstring.
The DELETE + insert_rows_json sequence is NOT a single atomic
transaction (insert_rows_json is streaming insert, not part of
the query transaction). The docstring now spells this out:
transient INSERT failure leaves rows missing for the affected
keys; recovery is to re-run publish (mirror is publish-side
idempotent). A staging-table-plus-MERGE flow would close the
gap; deliberately deferred (see module docstring).

Tests: 17 -> 21. New cases in TestPublishFailures
(duplicate-fingerprint reproducer) and a new TestRoundTwoFindings
class with three reproducers: manifest with path-separator
module_filename surfaces as manifest_row_unreadable instead of
FileNotFoundError; existing good local bundle preserved across a
corrupt re-sync (no orphaned .staging-* dirs); publish_rows
raises ValueError on duplicate input pairs without running
DELETE or INSERT. All 172 related extractor_compilation tests
pass.

Docs: docs/extractor_compilation_bq_bundle_mirror.md updated to
document the new failure codes, the atomic-replace flow, and
the non-atomicity caveat. CHANGELOG + docs/README.md summaries
refreshed.
…+ doc honesty

Addresses PR GoogleCloudPlatform#148 round-2 reviewer findings.

P1 - Bundle fingerprint shape check before path use. Sync used
the row's ``bundle_fingerprint`` directly as a directory name
(``dest_dir/<fingerprint>/`` + the staging-dir name).
``_check_row_shape`` only required a non-empty string, so a
tampered row with ``bundle_fingerprint="../escape"`` synced
successfully and wrote OUTSIDE ``dest_dir``. New
``_FINGERPRINT_PATTERN`` enforces strict 64-char lowercase hex
(sha256) and is applied in two places:

* ``_check_row_shape`` rejects rows with bad fingerprints
  before any grouping or path computation (``malformed_row``).
* ``_validate_manifest_shape`` rejects manifests whose
  fingerprint isn't well-formed; called at BOTH publish-side
  (new) and sync-side, so a tampered local manifest can never
  introduce a path-escape value into the table either.
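The strict fingerprint shape boils down to one pattern; a minimal sketch, assuming only what the commit message states (64-char lowercase hex, `fullmatch` anchoring):

```python
import re

# sha256 hex digest shape; fullmatch so a trailing newline cannot slip
# past re.match + '$' (which tolerates a final '\n').
_FINGERPRINT_PATTERN = re.compile(r"[0-9a-f]{64}")

def is_valid_fingerprint(value: object) -> bool:
    return (isinstance(value, str)
            and _FINGERPRINT_PATTERN.fullmatch(value) is not None)

assert is_valid_fingerprint("a" * 64)
assert not is_valid_fingerprint("../escape")      # path-escape attempt
assert not is_valid_fingerprint("A" * 64)         # uppercase hex rejected
assert not is_valid_fingerprint("a" * 64 + "\n")  # fullmatch catches this
```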

P2 - ``BigQueryBundleStore.__init__`` validates ``table_id``.
The constructor interpolates ``table_id`` directly into
backtick-quoted SQL; a malformed value (backtick, semicolon,
newline, ``--``) could break out. New ``_TABLE_ID_PATTERN``
requires the exact three-segment ASCII ``project.dataset.table``
form, enforced via ``fullmatch`` so a trailing newline can't
sneak past Python's lenient ``$`` anchor. ``fullmatch`` is
also used for the fingerprint pattern for the same reason.
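A sketch of that gate before interpolation into backtick-quoted SQL. The exact characters allowed per segment are an assumption (letters, digits, underscore, hyphen); the commit only specifies three ASCII segments and `fullmatch` anchoring:

```python
import re

# Hypothetical reconstruction of the three-segment project.dataset.table gate.
_TABLE_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")

def validate_table_id(table_id: object) -> str:
    if (not isinstance(table_id, str)
            or _TABLE_ID_PATTERN.fullmatch(table_id) is None):
        raise ValueError(f"malformed table_id: {table_id!r}")
    return table_id

assert validate_table_id("my-project.dataset.bundles") == "my-project.dataset.bundles"
for bad in ("a.b", "a.b.c.d", "a.b.`c`", "a.b;DROP", "a.b.c\n", "a.b.c --", "a..c"):
    try:
        validate_table_id(bad)
    except ValueError:
        continue
    raise AssertionError(f"accepted malformed table_id: {bad!r}")
```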

P3 - Sync docs no longer claim "atomically replaces". The
rmtree+move pair is not strictly atomic — a crash between the
two leaves the bundle absent on disk. Doc + code comment now
say "staged replace" / "validation before replacement" and
explicitly call out that the load-bundle-failure case (the
one the staged flow is designed to protect) is correctly
atomic in the failure direction.

Tests: 21 -> 24. Three new reproducers in TestRoundTwoFindings:
``bundle_fingerprint="../escape"`` rejected as ``malformed_row``
before any path is computed (no write outside ``dest_dir``);
tampered manifest ``fingerprint="../escape"`` rejected at the
publish-side shape check; ``BigQueryBundleStore.__init__``
raises ``ValueError`` on every malformed table_id form
(wrong dot count, backtick, semicolon, whitespace, ``--``,
trailing newline, empty segment, non-string). All 175 related
extractor_compilation tests pass.

Docs: per-event-flow + failure-code sections updated; the
``malformed_row`` description now calls out the fingerprint
shape check as the load-bearing piece preventing path
escape. CHANGELOG entry doesn't need a separate update
(round-2 changes are still within the same Unreleased
section).
PR GoogleCloudPlatform#148 round-3 P3 cleanup. The dedicated mirror doc and the
code comments correctly describe sync as a staged replace
(not strictly atomic in the success direction; atomic in the
load-bundle-failure direction). The docs/README.md index row
still said "atomically replacing the target" — an overclaim
that the rest of the documentation no longer makes. Brought the
summary in line: spelled out the staged-replace shape, the
crash-window caveat, and the load-bundle-failure atomicity
that the flow actually guarantees.

No code changes.
…ding

PR GoogleCloudPlatform#148 round-4 P3 cleanup — the last consistency gap on
sync atomicity wording. The dedicated mirror doc, the
docs/README.md summary, and the code comments now all
describe sync as a staged replace (not strictly atomic in the
success direction; atomic in the load-bundle-failure
direction). The CHANGELOG entry still said "before atomically
replacing the target" — brought it in line with the rest of
the documentation, spelling out the crash-window caveat and
the load-bundle-failure atomicity the flow actually
guarantees.

No code changes.
@haiyuan-eng-google haiyuan-eng-google merged commit b0824df into GoogleCloudPlatform:main May 12, 2026
9 checks passed