
feat: BigQuery-table bundle mirror for compiled extractors (#75 PR C2.c.3) #148

Merged
haiyuan-eng-google merged 6 commits into GoogleCloudPlatform:main from caohy1988:feat/extractor-bq-bundle-mirror
May 12, 2026

Conversation

@caohy1988
Contributor

Summary

PR C2.c.3 of issue #75 — BigQuery-table bundle mirror for compiled structured extractors. Closes the compiled-bundle distribution story for environments where the filesystem isn't shared (Cloud Run, Cloud Functions, ephemeral CI workers).

The mirror is a utility, not a runtime loader. Runtime path stays unchanged:

sync_bundles_from_bq  →  discover_bundles  →  from_bundles_root

Sync writes verified files to a local directory; C2.a's existing loader does the actual import. No "fetch-direct-from-BQ" loader — that would double the trust surface and diverge from the loader's audit fields.

Per-bundle flow

Publish: walk bundle_root, read each bundle's manifest.json, run load_bundle(child, expected_fingerprint=manifest.fingerprint) as a pre-publish gate (mirror only distributes bundles that would load at runtime), emit two BundleRows per bundle (manifest + module file), upsert via DELETE+INSERT.

Sync: fetch rows, group by fingerprint, validate every bundle_path (path-safety + strict file-set: exactly manifest.json + the manifest's module_filename), write files to dest_dir/<fingerprint>/, run load_bundle as a post-sync gate. Tampered or incomplete bundles fail at sync — partial directories are scrubbed.

Stable failure codes

Publish: bundle_root_missing, manifest_missing, manifest_unreadable, bundle_load_failed.

Sync: fingerprint_not_in_table, manifest_row_missing, manifest_row_unreadable, invalid_bundle_path (traversal / absolute / NUL / backslash — fail-closed before any write), unexpected_file, module_row_missing, duplicate_row, malformed_row, bundle_load_failed (post-sync).
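The `invalid_bundle_path` code implies a fail-closed check that runs before any byte is written. A minimal sketch of that kind of validator — the helper name and exact rule set here are assumptions for illustration, not the PR's implementation:

```python
def is_safe_bundle_path(bundle_path: str) -> bool:
    """Fail-closed: reject traversal, absolute, NUL, and backslash paths
    before any write happens."""
    if not bundle_path or "\x00" in bundle_path or "\\" in bundle_path:
        return False
    if bundle_path.startswith("/"):
        return False
    # Reject any segment that is empty, '.', or '..' (upward escape).
    if any(part in ("", ".", "..") for part in bundle_path.split("/")):
        return False
    return True

# Mirrors the test plan's rejection cases:
assert is_safe_bundle_path("manifest.json")
assert not is_safe_bundle_path("../escape.py")
assert not is_safe_bundle_path("/etc/passwd")
assert not is_safe_bundle_path("..\\windows-style-escape.py")
```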

Per-bundle problems land in the result's failures tuple; neither function raises. Store-side exceptions (BQ network, auth, table missing) propagate.

Test plan

  • tests/test_extractor_compilation_bq_bundle_mirror.py — 17 cases using an in-memory BundleStore:
    • TestRoundTrip (2): single + multi-bundle round-trip; load_bundle accepts the reconstruction.
    • TestAllowlist (3): publish allowlist; sync allowlist; sync allowlist with missing fingerprint → fingerprint_not_in_table.
    • TestPathSafety (3): ../escape.py / /etc/passwd / ..\windows-style-escape.py all rejected; no file written outside dest_dir.
    • TestMissingAndMalformedRows (5): missing manifest row, malformed manifest content, unexpected file, malformed row type, duplicate rows.
    • TestIdempotentRepublish (1): two consecutive publishes leave exactly two rows, not four.
    • TestPublishFailures (3): subdir without manifest; bundle that fails load_bundle pre-publish; missing bundle_root.
  • tests/test_extractor_compilation_bq_bundle_mirror_live.py — gated behind BQAA_RUN_LIVE_TESTS=1 + BQAA_RUN_LIVE_BQ_MIRROR_TESTS=1 + PROJECT_ID + DATASET_ID. Creates a temp table, round-trips publish+sync, deletes the table.
  • All 168 related extractor_compilation tests pass.
  • Public surface re-exported from bigquery_agent_analytics.extractor_compilation and the top-level package.
  • Docs: docs/extractor_compilation_bq_bundle_mirror.md; docs/README.md index entry; CHANGELOG.md updated.

Out of scope (deferred)

GCS signed-URL fetch for large bundles; caching / TTL of synced bundles; garbage collection of stale fingerprints; multi-region replication.

caohy1988 and others added 6 commits May 12, 2026 00:13
…udPlatform#75 PR C2.c.3)

Adds a publish/sync utility so compiled bundles can flow between
processes via a BigQuery table — useful for Cloud Run, Cloud
Functions, ephemeral CI workers, or any environment where the
filesystem isn't shared.

The mirror is a utility, not a runtime loader. The runtime path
stays unchanged:

    sync_bundles_from_bq -> discover_bundles -> from_bundles_root

Sync writes verified files to a local directory and lets C2.a's
existing loader do the actual import.

Public surface (src/bigquery_agent_analytics/extractor_compilation/
bq_bundle_mirror.py):

* publish_bundles_to_bq(bundle_root, store, ...) walks a local
  bundle root, validates each candidate via load_bundle as a
  pre-publish gate (the mirror only distributes bundles that
  would load at runtime), and pushes the constituent files as
  rows.
* sync_bundles_from_bq(store, dest_dir, ...) fetches rows for
  the requested fingerprints, writes files into
  dest_dir/<fingerprint>/, and calls load_bundle as a post-sync
  gate. Tampered or incomplete bundles fail at sync rather than
  at runtime; the partial directory is scrubbed.
* BundleStore Protocol the two functions consume; concrete
  BigQueryBundleStore wraps google.cloud.bigquery with
  DELETE+INSERT upsert by (bundle_fingerprint, bundle_path) so
  re-publishing replaces rather than accumulates rows.
* BUNDLE_MIRROR_TABLE_SCHEMA exported for callers who want to
  create the table themselves; BigQueryBundleStore.ensure_table()
  does it idempotently otherwise.
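A hedged sketch of what a Protocol-shaped store with upsert keyed by (bundle_fingerprint, bundle_path) could look like. The method names and row shape here are illustrative assumptions, not the module's actual API; a real BigQuery-backed store would implement the same replace-on-republish semantics via DELETE+INSERT:

```python
from typing import Protocol

class BundleStoreLike(Protocol):
    """Illustrative stand-in for the PR's BundleStore Protocol."""
    def publish_rows(self, rows: list[dict]) -> None: ...
    def fetch_rows(self, fingerprints: list[str]) -> list[dict]: ...

class InMemoryBundleStore:
    """Dict-backed store: upsert by (bundle_fingerprint, bundle_path)."""
    def __init__(self) -> None:
        self._rows: dict[tuple[str, str], dict] = {}

    def publish_rows(self, rows: list[dict]) -> None:
        # DELETE+INSERT semantics: re-publishing a key replaces the row.
        for row in rows:
            key = (row["bundle_fingerprint"], row["bundle_path"])
            self._rows[key] = row

    def fetch_rows(self, fingerprints: list[str]) -> list[dict]:
        return [r for (fp, _), r in self._rows.items() if fp in fingerprints]

store = InMemoryBundleStore()
row = {"bundle_fingerprint": "f" * 64, "bundle_path": "manifest.json",
       "content": "{}"}
store.publish_rows([row])
store.publish_rows([dict(row, content='{"v": 2}')])
# Idempotent republish: exactly one row per key, not two.
assert len(store.fetch_rows(["f" * 64])) == 1
```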

Strict bundle-shape: the table stores exactly two rows per
fingerprint (manifest.json + the manifest's module_filename).
Stable MirrorFailure codes cover every per-bundle problem:

* Publish: bundle_root_missing, manifest_missing,
  manifest_unreadable, bundle_load_failed.
* Sync: fingerprint_not_in_table, manifest_row_missing,
  manifest_row_unreadable, invalid_bundle_path (traversal /
  absolute / NUL / backslash; fail-closed before any write),
  unexpected_file, module_row_missing, duplicate_row,
  malformed_row, bundle_load_failed (post-sync).

Per-bundle problems land in the result's failures tuple;
neither function raises. Store-side exceptions (BQ network,
auth, table missing) DO propagate.

Tests:

* tests/test_extractor_compilation_bq_bundle_mirror.py — 17 cases
  using an in-memory BundleStore. Round-trip (single + multi),
  publish allowlist, sync allowlist, sync allowlist with missing
  fingerprint, path-safety (traversal / absolute / backslash —
  no file written outside dest_dir), missing manifest, malformed
  manifest, unexpected file, malformed row type, duplicate rows,
  idempotent republish, publish failure surfaces (subdir without
  manifest, bundle that would fail load_bundle, missing
  bundle_root).
* tests/test_extractor_compilation_bq_bundle_mirror_live.py —
  gated behind BQAA_RUN_LIVE_TESTS=1 +
  BQAA_RUN_LIVE_BQ_MIRROR_TESTS=1 + PROJECT_ID + DATASET_ID.
  Creates a temp table, round-trips publish+sync, asserts
  load_bundle accepts the reconstruction, deletes the table.

Docs: docs/extractor_compilation_bq_bundle_mirror.md describes
the contract; docs/README.md index entry + CHANGELOG entry added.

Out of scope: GCS signed-URL fetch, caching / TTL, garbage
collection, multi-region replication.
…-fingerprint guard

Addresses PR GoogleCloudPlatform#148 round-1 reviewer findings (4 total).

P1 - Strict manifest shape check before using module_filename.
Manifest.from_json is lenient; a row with
module_filename="subdir/extractor.py" used to slip past
_validate_bundle_path (a simple subdir is not a traversal) and
raise FileNotFoundError at the write step. New
_validate_manifest_shape rejects non-bare module_filename
(path separators, "..", ".", NUL) plus non-string / empty
fingerprint, function_name, event_types entries. Surfaces as
manifest_row_unreadable with an explicit detail.
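The bare-filename rule described above can be sketched as a small predicate. This is an assumption-level illustration of the check, not the actual `_validate_manifest_shape` code:

```python
def is_bare_filename(module_filename: object) -> bool:
    """Reject path separators, '.', '..', NUL, and non-string/empty values
    before module_filename is ever used to build a write path."""
    if not isinstance(module_filename, str) or not module_filename:
        return False
    if "/" in module_filename or "\\" in module_filename:
        return False
    if module_filename in (".", ".."):
        return False
    if "\x00" in module_filename:
        return False
    return True

assert is_bare_filename("extractor.py")
assert not is_bare_filename("subdir/extractor.py")  # used to hit the write step
assert not is_bare_filename("..")
assert not is_bare_filename("")
```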

P1 - Atomic sync. Sync used to remove dest_dir/<fp>/ before
writing the replacement; a bad mirror row therefore destroyed
the last-good runtime bundle. New flow: write to a side-by-side
.staging-<fp>-<uuid>/ directory, run load_bundle on the staged
copy, only then atomically replace the target via rmtree+move.
load_bundle failure or write exception scrubs the staging dir
and leaves any pre-existing dest_dir/<fp>/ intact.
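The staged-replace flow above can be sketched as follows, assuming a stand-in `validate` callable in place of `load_bundle`; names and structure are illustrative, not the module's code:

```python
import shutil
import tempfile
import uuid
from pathlib import Path

def staged_replace(dest_dir: Path, fingerprint: str,
                   files: dict[str, str], validate) -> None:
    """Write to a side-by-side staging dir, validate, then replace the
    target. Failure scrubs staging and leaves the old bundle intact."""
    target = dest_dir / fingerprint
    staging = dest_dir / f".staging-{fingerprint}-{uuid.uuid4().hex}"
    staging.mkdir(parents=True)
    try:
        for name, content in files.items():
            (staging / name).write_text(content)
        validate(staging)  # stands in for load_bundle on the staged copy
    except Exception:
        shutil.rmtree(staging)  # scrub; pre-existing target survives
        raise
    if target.exists():
        shutil.rmtree(target)  # rmtree+move: staged, not strictly atomic
    staging.rename(target)

def failing_validate(path: Path) -> None:
    raise ValueError("load_bundle failed on staged copy")

dest = Path(tempfile.mkdtemp())
staged_replace(dest, "a" * 64, {"manifest.json": "{}"}, lambda p: None)
try:
    staged_replace(dest, "a" * 64, {"manifest.json": "bad"}, failing_validate)
except ValueError:
    pass
# Last-good bundle intact, no orphaned staging dirs.
assert (dest / ("a" * 64) / "manifest.json").read_text() == "{}"
assert not list(dest.glob(".staging-*"))
```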

P2 - Publish-side duplicate fingerprint detection. Two subdirs
under bundle_root declaring the same manifest fingerprint used
to both emit (fp, manifest.json) + (fp, module) rows;
BigQueryBundleStore.publish_rows would DELETE the keys and
INSERT both copies, leaving real BQ with logical duplicates
that sync later rejects fail-closed. New two-pass walk in
publish_bundles_to_bq accumulates per-fingerprint candidates,
then refuses to publish any fingerprint claimed by more than
one subdir. Both subdirs get a duplicate_fingerprint failure
naming the participants; zero rows leak.
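The two-pass walk can be sketched like this — an illustrative reduction, assuming (subdir, fingerprint) pairs have already been collected from `bundle_root`:

```python
from collections import defaultdict

def partition_by_fingerprint(candidates: list[tuple[str, str]]):
    """Pass 1: accumulate subdirs per declared fingerprint.
    Pass 2: refuse any fingerprint claimed by more than one subdir."""
    by_fp: dict[str, list[str]] = defaultdict(list)
    for subdir, fingerprint in candidates:
        by_fp[fingerprint].append(subdir)
    publishable, failures = [], []
    for fingerprint, subdirs in by_fp.items():
        if len(subdirs) > 1:
            # duplicate_fingerprint failure naming the participants;
            # zero rows leak for this fingerprint.
            failures.append(
                ("duplicate_fingerprint", fingerprint, sorted(subdirs)))
        else:
            publishable.append((subdirs[0], fingerprint))
    return publishable, failures

pub, fail = partition_by_fingerprint(
    [("bundle_a", "f" * 64), ("bundle_b", "f" * 64), ("bundle_c", "e" * 64)])
assert pub == [("bundle_c", "e" * 64)]
assert fail == [("duplicate_fingerprint", "f" * 64, ["bundle_a", "bundle_b"])]
```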

P2 - publish_rows defensive input check. ValueError on duplicate
(bundle_fingerprint, bundle_path) input pairs *before* any
DELETE runs, so direct callers of the store can't bypass the
publisher's duplicate-fingerprint guard.
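A sketch of the defensive input check, raising before any store mutation would run; the function name is hypothetical:

```python
def check_no_duplicate_keys(rows: list[dict]) -> None:
    """Raise before any DELETE runs, so direct store callers cannot
    bypass the publisher's duplicate-fingerprint guard."""
    seen: set[tuple[str, str]] = set()
    for row in rows:
        key = (row["bundle_fingerprint"], row["bundle_path"])
        if key in seen:
            raise ValueError(f"duplicate input pair: {key}")
        seen.add(key)

# Two rows for one fingerprint (manifest + module) are fine...
check_no_duplicate_keys([
    {"bundle_fingerprint": "a" * 64, "bundle_path": "manifest.json"},
    {"bundle_fingerprint": "a" * 64, "bundle_path": "extractor.py"},
])
# ...but a repeated (fingerprint, path) pair is rejected.
duplicate_rejected = False
try:
    check_no_duplicate_keys([
        {"bundle_fingerprint": "a" * 64, "bundle_path": "manifest.json"},
        {"bundle_fingerprint": "a" * 64, "bundle_path": "manifest.json"},
    ])
except ValueError:
    duplicate_rejected = True
assert duplicate_rejected
```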

P3 - Soften the BigQueryBundleStore.publish_rows docstring.
The DELETE + insert_rows_json sequence is NOT a single atomic
transaction (insert_rows_json is streaming insert, not part of
the query transaction). The docstring now spells this out:
transient INSERT failure leaves rows missing for the affected
keys; recovery is to re-run publish (mirror is publish-side
idempotent). A staging-table-plus-MERGE flow would close the
gap; deliberately deferred (see module docstring).

Tests: 17 -> 21. New cases in TestPublishFailures
(duplicate-fingerprint reproducer) and a new TestRoundTwoFindings
class with three reproducers: manifest with path-separator
module_filename surfaces as manifest_row_unreadable instead of
FileNotFoundError; existing good local bundle preserved across a
corrupt re-sync (no orphaned .staging-* dirs); publish_rows
raises ValueError on duplicate input pairs without running
DELETE or INSERT. All 172 related extractor_compilation tests
pass.

Docs: docs/extractor_compilation_bq_bundle_mirror.md updated to
document the new failure codes, the atomic-replace flow, and
the non-atomicity caveat. CHANGELOG + docs/README.md summaries
refreshed.
…+ doc honesty

Addresses PR GoogleCloudPlatform#148 round-2 reviewer findings.

P1 - Bundle fingerprint shape check before path use. Sync used
the row's ``bundle_fingerprint`` directly as a directory name
(``dest_dir/<fingerprint>/`` + the staging-dir name).
``_check_row_shape`` only required a non-empty string, so a
tampered row with ``bundle_fingerprint="../escape"`` synced
successfully and wrote OUTSIDE ``dest_dir``. New
``_FINGERPRINT_PATTERN`` enforces strict 64-char lowercase hex
(sha256) and is applied in two places:

* ``_check_row_shape`` rejects rows with bad fingerprints
  before any grouping or path computation (``malformed_row``).
* ``_validate_manifest_shape`` rejects manifests whose
  fingerprint isn't well-formed; called at BOTH publish-side
  (new) and sync-side, so a tampered local manifest can never
  introduce a path-escape value into the table either.
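The strict fingerprint shape boils down to one pattern; a minimal sketch, assuming only what the commit message states (64-char lowercase hex, `fullmatch` anchoring):

```python
import re

# sha256 hex digest shape; fullmatch so a trailing newline cannot slip
# past re.match + '$' (which tolerates a final '\n').
_FINGERPRINT_PATTERN = re.compile(r"[0-9a-f]{64}")

def is_valid_fingerprint(value: object) -> bool:
    return (isinstance(value, str)
            and _FINGERPRINT_PATTERN.fullmatch(value) is not None)

assert is_valid_fingerprint("a" * 64)
assert not is_valid_fingerprint("../escape")      # path-escape attempt
assert not is_valid_fingerprint("A" * 64)         # uppercase hex rejected
assert not is_valid_fingerprint("a" * 64 + "\n")  # fullmatch catches this
```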

P2 - ``BigQueryBundleStore.__init__`` validates ``table_id``.
The constructor interpolates ``table_id`` directly into
backtick-quoted SQL; a malformed value (backtick, semicolon,
newline, ``--``) could break out. New ``_TABLE_ID_PATTERN``
requires the exact three-segment ASCII ``project.dataset.table``
form, enforced via ``fullmatch`` so a trailing newline can't
sneak past Python's lenient ``$`` anchor. ``fullmatch`` is
also used for the fingerprint pattern for the same reason.
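A sketch of that gate before interpolation into backtick-quoted SQL. The exact characters allowed per segment are an assumption (letters, digits, underscore, hyphen); the commit only specifies three ASCII segments and `fullmatch` anchoring:

```python
import re

# Hypothetical reconstruction of the three-segment project.dataset.table gate.
_TABLE_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")

def validate_table_id(table_id: object) -> str:
    if (not isinstance(table_id, str)
            or _TABLE_ID_PATTERN.fullmatch(table_id) is None):
        raise ValueError(f"malformed table_id: {table_id!r}")
    return table_id

assert validate_table_id("my-project.dataset.bundles") == "my-project.dataset.bundles"
for bad in ("a.b", "a.b.c.d", "a.b.`c`", "a.b;DROP", "a.b.c\n", "a.b.c --", "a..c"):
    try:
        validate_table_id(bad)
    except ValueError:
        continue
    raise AssertionError(f"accepted malformed table_id: {bad!r}")
```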

P3 - Sync docs no longer claim "atomically replaces". The
rmtree+move pair is not strictly atomic — a crash between the
two leaves the bundle absent on disk. Doc + code comment now
say "staged replace" / "validation before replacement" and
explicitly call out that the load-bundle-failure case (the
one the staged flow is designed to protect) is correctly
atomic in the failure direction.

Tests: 21 -> 24. Three new reproducers in TestRoundTwoFindings:
``bundle_fingerprint="../escape"`` rejected as ``malformed_row``
before any path is computed (no write outside ``dest_dir``);
tampered manifest ``fingerprint="../escape"`` rejected at the
publish-side shape check; ``BigQueryBundleStore.__init__``
raises ``ValueError`` on every malformed table_id form
(wrong dot count, backtick, semicolon, whitespace, ``--``,
trailing newline, empty segment, non-string). All 175 related
extractor_compilation tests pass.

Docs: per-event-flow + failure-code sections updated; the
``malformed_row`` description now calls out the fingerprint
shape check as the load-bearing piece preventing path
escape. CHANGELOG entry doesn't need a separate update
(round-2 changes are still within the same Unreleased
section).
PR GoogleCloudPlatform#148 round-3 P3 cleanup. The dedicated mirror doc and the
code comments correctly describe sync as a staged replace
(not strictly atomic in the success direction; atomic in the
load-bundle-failure direction). The docs/README.md index row
still said "atomically replacing the target" — an overclaim
that the rest of the documentation no longer makes. Brought the
summary in line: spelled out the staged-replace shape, the
crash-window caveat, and the load-bundle-failure atomicity
that the flow actually guarantees.

No code changes.
…ding

PR GoogleCloudPlatform#148 round-4 P3 cleanup — the last consistency gap on
sync atomicity wording. The dedicated mirror doc, the
docs/README.md summary, and the code comments now all
describe sync as a staged replace (not strictly atomic in the
success direction; atomic in the load-bundle-failure
direction). The CHANGELOG entry still said "before atomically
replacing the target" — brought it in line with the rest of
the documentation, spelling out the crash-window caveat and
the load-bundle-failure atomicity the flow actually
guarantees.

No code changes.
@haiyuan-eng-google haiyuan-eng-google merged commit b0824df into GoogleCloudPlatform:main May 12, 2026
9 checks passed