feat: BigQuery-table bundle mirror for compiled extractors (#75 PR C2.c.3)#148
Merged
haiyuan-eng-google merged 6 commits intoMay 12, 2026
Conversation
…udPlatform#75 PR C2.c.3) Adds a publish/sync utility so compiled bundles can flow between processes via a BigQuery table — useful for Cloud Run, Cloud Functions, ephemeral CI workers, or any environment where the filesystem isn't shared. The mirror is a utility, not a runtime loader. The runtime path stays unchanged: sync_bundles_from_bq -> discover_bundles -> from_bundles_root Sync writes verified files to a local directory and lets C2.a's existing loader do the actual import. Public surface (src/bigquery_agent_analytics/extractor_compilation/ bq_bundle_mirror.py): * publish_bundles_to_bq(bundle_root, store, ...) walks a local bundle root, validates each candidate via load_bundle as a pre-publish gate (the mirror only distributes bundles that would load at runtime), and pushes the constituent files as rows. * sync_bundles_from_bq(store, dest_dir, ...) fetches rows for the requested fingerprints, writes files into dest_dir/<fingerprint>/, and calls load_bundle as a post-sync gate. Tampered or incomplete bundles fail at sync rather than at runtime; the partial directory is scrubbed. * BundleStore Protocol the two functions consume; concrete BigQueryBundleStore wraps google.cloud.bigquery with DELETE+INSERT upsert by (bundle_fingerprint, bundle_path) so re-publishing replaces rather than accumulates rows. * BUNDLE_MIRROR_TABLE_SCHEMA exported for callers who want to create the table themselves; BigQueryBundleStore.ensure_table() does it idempotently otherwise. Strict bundle-shape: the table stores exactly two rows per fingerprint (manifest.json + the manifest's module_filename). Stable MirrorFailure codes cover every per-bundle problem: * Publish: bundle_root_missing, manifest_missing, manifest_unreadable, bundle_load_failed. * Sync: fingerprint_not_in_table, manifest_row_missing, manifest_row_unreadable, invalid_bundle_path (traversal / absolute / NUL / backslash; fail-closed before any write), unexpected_file, module_row_missing, duplicate_row, malformed_row, bundle_load_failed (post-sync). Per-bundle problems land in the result's failures tuple; neither function raises. Store-side exceptions (BQ network, auth, table missing) DO propagate. Tests: * tests/test_extractor_compilation_bq_bundle_mirror.py — 17 cases using an in-memory BundleStore. Round-trip (single + multi), publish allowlist, sync allowlist, sync allowlist with missing fingerprint, path-safety (traversal / absolute / backslash — no file written outside dest_dir), missing manifest, malformed manifest, unexpected file, malformed row type, duplicate rows, idempotent republish, publish failure surfaces (subdir without manifest, bundle that would fail load_bundle, missing bundle_root). * tests/test_extractor_compilation_bq_bundle_mirror_live.py — gated behind BQAA_RUN_LIVE_TESTS=1 + BQAA_RUN_LIVE_BQ_MIRROR_TESTS=1 + PROJECT_ID + DATASET_ID. Creates a temp table, round-trips publish+sync, asserts load_bundle accepts the reconstruction, deletes the table. Docs: docs/extractor_compilation_bq_bundle_mirror.md describes the contract; docs/README.md index entry + CHANGELOG entry added. Out of scope: GCS signed-URL fetch, caching / TTL, garbage collection, multi-region replication.
…-fingerprint guard Addresses PR GoogleCloudPlatform#148 round-1 reviewer findings (4 total). P1 - Strict manifest shape check before using module_filename. Manifest.from_json is lenient; a row with module_filename="subdir/extractor.py" used to slip past _validate_bundle_path (a simple subdir is not a traversal) and raise FileNotFoundError at the write step. New _validate_manifest_shape rejects non-bare module_filename (path separators, "..", ".", NUL) plus non-string / empty fingerprint, function_name, event_types entries. Surfaces as manifest_row_unreadable with an explicit detail. P1 - Atomic sync. Sync used to remove dest_dir/<fp>/ before writing the replacement; a bad mirror row therefore destroyed the last-good runtime bundle. New flow: write to a side-by-side .staging-<fp>-<uuid>/ directory, run load_bundle on the staged copy, only then atomically replace the target via rmtree+move. load_bundle failure or write exception scrubs the staging dir and leaves any pre-existing dest_dir/<fp>/ intact. P2 - Publish-side duplicate fingerprint detection. Two subdirs under bundle_root declaring the same manifest fingerprint used to both emit (fp, manifest.json) + (fp, module) rows; BigQueryBundleStore.publish_rows would DELETE the keys and INSERT both copies, leaving real BQ with logical duplicates that sync later rejects fail-closed. New two-pass walk in publish_bundles_to_bq accumulates per-fingerprint candidates, then refuses to publish any fingerprint claimed by more than one subdir. Both subdirs get a duplicate_fingerprint failure naming the participants; zero rows leak. P2 - publish_rows defensive input check. ValueError on duplicate (bundle_fingerprint, bundle_path) input pairs *before* any DELETE runs, so direct callers of the store can't bypass the publisher's duplicate-fingerprint guard. P3 - Soften the BigQueryBundleStore.publish_rows docstring. The DELETE + insert_rows_json sequence is NOT a single atomic transaction (insert_rows_json is streaming insert, not part of the query transaction). The docstring now spells this out: transient INSERT failure leaves rows missing for the affected keys; recovery is to re-run publish (mirror is publish-side idempotent). A staging-table-plus-MERGE flow would close the gap; deliberately deferred (see module docstring). Tests: 17 -> 21. New cases in TestPublishFailures (duplicate-fingerprint reproducer) and a new TestRoundTwoFindings class with three reproducers: manifest with path-separator module_filename surfaces as manifest_row_unreadable instead of FileNotFoundError; existing good local bundle preserved across a corrupt re-sync (no orphaned .staging-* dirs); publish_rows raises ValueError on duplicate input pairs without running DELETE or INSERT. All 172 related extractor_compilation tests pass. Docs: docs/extractor_compilation_bq_bundle_mirror.md updated to document the new failure codes, the atomic-replace flow, and the non-atomicity caveat. CHANGELOG + docs/README.md summaries refreshed.
…+ doc honesty Addresses PR GoogleCloudPlatform#148 round-2 reviewer findings. P1 - Bundle fingerprint shape check before path use. Sync used the row's ``bundle_fingerprint`` directly as a directory name (``dest_dir/<fingerprint>/`` + the staging-dir name). ``_check_row_shape`` only required a non-empty string, so a tampered row with ``bundle_fingerprint="../escape"`` synced successfully and wrote OUTSIDE ``dest_dir``. New ``_FINGERPRINT_PATTERN`` enforces strict 64-char lowercase hex (sha256) and is applied in two places: * ``_check_row_shape`` rejects rows with bad fingerprints before any grouping or path computation (``malformed_row``). * ``_validate_manifest_shape`` rejects manifests whose fingerprint isn't well-formed; called at BOTH publish-side (new) and sync-side, so a tampered local manifest can never introduce a path-escape value into the table either. P2 - ``BigQueryBundleStore.__init__`` validates ``table_id``. The constructor interpolates ``table_id`` directly into backtick-quoted SQL; a malformed value (backtick, semicolon, newline, ``--``) could break out. New ``_TABLE_ID_PATTERN`` requires exactly three ASCII-segment ``project.dataset.table`` form, enforced via ``fullmatch`` so a trailing newline can't sneak past Python's lenient ``$`` anchor. ``fullmatch`` is also used for the fingerprint pattern for the same reason. P3 - Sync docs no longer claim "atomically replaces". The rmtree+move pair is not strictly atomic — a crash between the two leaves the bundle absent on disk. Doc + code comment now say "staged replace" / "validation before replacement" and explicitly call out that the load-bundle-failure case (the one the staged flow is designed to protect) is correctly atomic in the failure direction. Tests: 21 -> 24. Three new reproducers in TestRoundTwoFindings: ``bundle_fingerprint="../escape"`` rejected as ``malformed_row`` before any path is computed (no write outside ``dest_dir``); tampered manifest ``fingerprint="../escape"`` rejected at the publish-side shape check; ``BigQueryBundleStore.__init__`` raises ``ValueError`` on every malformed table_id form (wrong dot count, backtick, semicolon, whitespace, ``--``, trailing newline, empty segment, non-string). All 175 related extractor_compilation tests pass. Docs: per-event-flow + failure-code sections updated; the ``malformed_row`` description now calls out the fingerprint shape check as the load-bearing piece preventing path escape. CHANGELOG entry doesn't need a separate update (round-2 changes are still within the same Unreleased section).
PR GoogleCloudPlatform#148 round-3 P3 cleanup. The dedicated mirror doc and the code comments correctly describe sync as a staged replace (not strictly atomic in the success direction; atomic in the load-bundle-failure direction). The docs/README.md index row still said "atomically replacing the target" — overclaim that the rest of the documentation no longer makes. Brought the summary in line: spelled out the staged-replace shape, the crash-window caveat, and the load-bundle-failure atomicity that the flow actually guarantees. No code changes.
…ding PR GoogleCloudPlatform#148 round-4 P3 cleanup — the last consistency gap on sync atomicity wording. The dedicated mirror doc, the docs/README.md summary, and the code comments now all describe sync as a staged replace (not strictly atomic in the success direction; atomic in the load-bundle-failure direction). The CHANGELOG entry still said "before atomically replacing the target" — brought it in line with the rest of the documentation, spelling out the crash-window caveat and the load-bundle-failure atomicity the flow actually guarantees. No code changes.
haiyuan-eng-google
approved these changes
May 12, 2026
haiyuan-eng-google
approved these changes
May 12, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PR C2.c.3 of issue #75 — BigQuery-table bundle mirror for compiled structured extractors. Closes the compiled-bundle distribution story for environments where the filesystem isn't shared (Cloud Run, Cloud Functions, ephemeral CI workers).
The mirror is a utility, not a runtime loader. Runtime path stays unchanged:
Sync writes verified files to a local directory; C2.a's existing loader does the actual import. No "fetch-direct-from-BQ" loader — that would double the trust surface and diverge from the loader's audit fields.
Per-bundle flow
Publish: walk
bundle_root, read each bundle'smanifest.json, runload_bundle(child, expected_fingerprint=manifest.fingerprint)as a pre-publish gate (mirror only distributes bundles that would load at runtime), emit twoBundleRows per bundle (manifest + module file), upsert via DELETE+INSERT.Sync: fetch rows, group by fingerprint, validate every
bundle_path(path-safety + strict file-set: exactlymanifest.json+ the manifest'smodule_filename), write files todest_dir/<fingerprint>/, runload_bundleas a post-sync gate. Tampered or incomplete bundles fail at sync — partial directories are scrubbed.Stable failure codes
Publish:
bundle_root_missing,manifest_missing,manifest_unreadable,bundle_load_failed.Sync:
fingerprint_not_in_table,manifest_row_missing,manifest_row_unreadable,invalid_bundle_path(traversal / absolute / NUL / backslash — fail-closed before any write),unexpected_file,module_row_missing,duplicate_row,malformed_row,bundle_load_failed(post-sync).Per-bundle problems land in the result's
failurestuple; neither function raises. Store-side exceptions (BQ network, auth, table missing) propagate.Test plan
tests/test_extractor_compilation_bq_bundle_mirror.py— 17 cases using an in-memoryBundleStore:TestRoundTrip(2): single + multi-bundle round-trip; load_bundle accepts the reconstruction.TestAllowlist(3): publish allowlist; sync allowlist; sync allowlist with missing fingerprint →fingerprint_not_in_table.TestPathSafety(3):../escape.py//etc/passwd/..\windows-style-escape.pyall rejected; no file written outsidedest_dir.TestMissingAndMalformedRows(5): missing manifest row, malformed manifest content, unexpected file, malformed row type, duplicate rows.TestIdempotentRepublish(1): two consecutive publishes leave exactly two rows, not four.TestPublishFailures(3): subdir without manifest; bundle that failsload_bundlepre-publish; missingbundle_root.tests/test_extractor_compilation_bq_bundle_mirror_live.py— gated behindBQAA_RUN_LIVE_TESTS=1+BQAA_RUN_LIVE_BQ_MIRROR_TESTS=1+PROJECT_ID+DATASET_ID. Creates a temp table, round-trips publish+sync, deletes the table.bigquery_agent_analytics.extractor_compilationand the top-level package.docs/extractor_compilation_bq_bundle_mirror.md;docs/README.mdindex entry;CHANGELOG.mdupdated.Out of scope (deferred)
GCS signed-URL fetch for large bundles; caching / TTL of synced bundles; garbage collection of stale fingerprints; multi-region replication.