Merged
68 changes: 68 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,74 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- **BigQuery-table bundle mirror** in
  `bigquery_agent_analytics.extractor_compilation.bq_bundle_mirror` and
  [`docs/extractor_compilation_bq_bundle_mirror.md`](docs/extractor_compilation_bq_bundle_mirror.md).
  Issue [#75](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/issues/75),
  PR C2.c.3: publishes compiled bundles to a BigQuery table and syncs
  them back into a local directory for C2.a's existing loader. The
  runtime path stays `sync_bundles_from_bq → discover_bundles →
  from_bundles_root`; the mirror is a utility, not a runtime loader.
  Public surface: `publish_bundles_to_bq(bundle_root, store,
  bundle_fingerprint_allowlist=None)` and `sync_bundles_from_bq(store,
  dest_dir, bundle_fingerprint_allowlist=None)`. Both call
  `load_bundle` as a gate: publish refuses bundles that wouldn't load
  at runtime; sync refuses bundles whose reconstruction the loader
  rejects, scrubbing any partial directory it wrote. Sync writes each
  fingerprint to a side-by-side **staging directory** and runs
  `load_bundle` on the staged copy before performing a **staged
  replace** of the target. The rmtree+move pair is not strictly
  atomic (a crash between the two leaves the bundle absent on disk,
  recoverable by re-sync), but the load-bundle-failure direction *is*
  atomic, so a bad mirror row never destroys a previously-good local
  bundle.
  Strict bundle-shape check: the table stores exactly two rows per
  fingerprint (`manifest.json` plus the manifest's `module_filename`).
  Failure codes are stable strings; per-bundle problems land in
  `failures` instead of raising, while store exceptions (BQ-side:
  network, auth, missing table) propagate.
  - `unexpected_file` rejects any row beyond the two expected per
    fingerprint.
  - `manifest_row_unreadable` surfaces when the manifest's own
    `module_filename` fails the sync-time shape check (bare filename
    only: no separators, no `..`, no NUL), instead of raising
    `FileNotFoundError` at the write step.
  - `invalid_bundle_path` rejects traversal / absolute / backslash /
    NUL paths before writing to disk.
  - `duplicate_row` rejects two rows sharing the same `(fingerprint,
    bundle_path)` (BigQuery has no unique constraint; the mirror
    enforces uniqueness at sync).
  - `duplicate_fingerprint` rejects publish-side cases where two
    subdirs of `bundle_root` claim the same manifest fingerprint;
    neither is published, so the table can't end up with logical
    duplicates.
  - `malformed_row` rejects rows with wrong field types.
  Idempotent republish via DELETE+INSERT in
  `BigQueryBundleStore.publish_rows`: re-publishing the same
  fingerprint replaces the prior rows rather than accumulating
  duplicates. The DELETE + `insert_rows_json` are NOT a single atomic
  transaction; a transient INSERT failure leaves rows missing until
  the caller re-runs publish (recoverable; documented in the class
  docstring). `publish_rows` also raises `ValueError` on duplicate
  `(fingerprint, bundle_path)` input pairs as defense in depth.
  `BundleStore` is a Protocol so tests can pass in-memory fakes;
  `BigQueryBundleStore` is the concrete implementation wrapping
  `google.cloud.bigquery`. `BUNDLE_MIRROR_TABLE_SCHEMA` is exported
  for callers who need to create the table themselves (or
  `BigQueryBundleStore.ensure_table()` can create it idempotently).
  Out of scope: GCS-backed signed-URL fetch, caching / TTL, garbage
  collection, multi-region replication.
- **Revalidation harness for compiled structured extractors**
in
`bigquery_agent_analytics.extractor_compilation.revalidation`
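The staged-replace behaviour the bundle-mirror entry describes can be sketched in isolation. This is a minimal sketch, not the SDK's implementation: `staged_replace`, its `files` mapping, and the `validate` callable are hypothetical stand-ins (the real sync runs `load_bundle` at the validation point).

```python
import os
import shutil
import tempfile


def staged_replace(dest_dir: str, fingerprint: str,
                   files: dict[str, bytes], validate) -> bool:
    """Write a synced bundle to a side-by-side staging directory,
    validate the staged copy, then replace the target.

    A validation failure never touches a previously-good local bundle;
    the rmtree+move pair itself is not atomic (a crash between the two
    leaves the bundle absent on disk, recoverable by re-sync).
    """
    target = os.path.join(dest_dir, fingerprint)
    staging = tempfile.mkdtemp(prefix=f".{fingerprint}.staging-", dir=dest_dir)
    try:
        for rel_path, payload in files.items():
            with open(os.path.join(staging, rel_path), "wb") as fh:
                fh.write(payload)
        if not validate(staging):      # the load_bundle gate on the staged copy
            return False               # target untouched: this direction is atomic
        if os.path.isdir(target):
            shutil.rmtree(target)      # crash window starts here...
        os.replace(staging, target)    # ...and ends here
        return True
    finally:
        if os.path.isdir(staging):
            shutil.rmtree(staging)     # scrub any partial staging directory
```

A failed `validate` leaves the existing bundle on disk untouched, which is the property the entry calls the "load-bundle-failure direction".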
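The DELETE+INSERT idempotency and the `BundleStore` Protocol lend themselves to the kind of in-memory fake the entry says tests can pass in. The Protocol shape here is an assumption: `publish_rows` is named in the changelog, but `fetch_rows` and the row dictionaries are hypothetical.

```python
from typing import Protocol


class BundleStore(Protocol):
    """Hypothetical minimal shape of the BundleStore Protocol."""
    def publish_rows(self, fingerprint: str, rows: list[dict]) -> None: ...
    def fetch_rows(self, fingerprint: str) -> list[dict]: ...


class InMemoryBundleStore:
    """In-memory fake mirroring publish_rows' DELETE+INSERT semantics:
    re-publishing a fingerprint replaces its prior rows instead of
    accumulating duplicates."""

    def __init__(self) -> None:
        self._rows: dict[str, list[dict]] = {}

    def publish_rows(self, fingerprint: str, rows: list[dict]) -> None:
        # Defense in depth: reject duplicate (fingerprint, bundle_path)
        # input pairs before touching the table.
        paths = [row["bundle_path"] for row in rows]
        if len(paths) != len(set(paths)):
            raise ValueError(f"duplicate bundle_path in rows for {fingerprint}")
        self._rows[fingerprint] = list(rows)  # DELETE prior rows + INSERT new

    def fetch_rows(self, fingerprint: str) -> list[dict]:
        return list(self._rows.get(fingerprint, []))
```

Unlike the real `BigQueryBundleStore`, the fake's replace is atomic; the changelog notes the BQ-side DELETE + `insert_rows_json` pair is not.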
1 change: 1 addition & 0 deletions docs/README.md
@@ -51,6 +51,7 @@ architecture, rationale, and implementation plans behind key SDK features.
| [extractor_compilation_runtime_fallback.md](extractor_compilation_runtime_fallback.md) | Runtime fallback wiring for compiled structured extractors (issue #75 PR C2.b): `run_with_fallback(...)` returning `FallbackOutcome` (`decision` is one of `compiled_unchanged` / `compiled_filtered` / `fallback_for_event`). Validates compiled output via #76; on per-element failures drops just the offending nodes / edges (with orphan cleanup) AND downgrades the event's span from `fully_handled` to `partially_handled` so the AI transcript still sees the source span. EVENT-scope, exception, wrong-type, and unpinpointable failures all trigger whole-event fallback. Does not validate fallback output; fallback exceptions propagate. Orchestrator call-site swap is C2.c. |
| [extractor_compilation_runtime_registry.md](extractor_compilation_runtime_registry.md) | Runtime extractor-registry adapter (issue #75 PR C2.c.1): `build_runtime_extractor_registry(...)` glues C2.a's `discover_bundles` + C2.b's `run_with_fallback` into one call, returning a `WrappedRegistry` with an `extractors` dict ready for `run_structured_extractors` plus `bundles_without_fallback` (compiled-only, skipped) and `fallbacks_without_bundle` (no usable compiled registry entry — "never built" *and* "rejected by discovery"; cross-reference `discovery.failures` for the reason). Compiled-only event_types are skipped and recorded (fail-closed); fallback-only event_types pass through unchanged. Non-callable fallbacks are rejected at build time with `TypeError` naming the event_type. The `on_outcome(event_type, outcome)` callback fires on every wrapped invocation (denominator metric); callback exceptions propagate. Out of scope: actual orchestrator call-site swap (C2.c.2), BQ mirror (C2.c.3), revalidation (C2.d). |
| [extractor_compilation_orchestrator_swap.md](extractor_compilation_orchestrator_swap.md) | Orchestrator call-site swap (issue #75 PR C2.c.2): `OntologyGraphManager.from_bundles_root(...)` classmethod that builds the runtime registry internally and constructs a manager whose `extractors` dict is the wrapped registry, so existing `run_structured_extractors` calls inside `extract_graph` pick up compiled-with-fallback behavior with no other code changes. Adds `manager.runtime_registry: WrappedRegistry | None` audit handle (non-None when bundle-wired). Mirrors `from_ontology_binding` arg shape; existing `__init__` and `from_ontology_binding` paths are unchanged. Compiled-only event_types without a matching fallback are NOT registered (fail-closed). Out of scope: BQ mirror (C2.c.3), revalidation (C2.d). |
| [extractor_compilation_bq_bundle_mirror.md](extractor_compilation_bq_bundle_mirror.md) | BigQuery-table bundle mirror (issue #75 PR C2.c.3): `publish_bundles_to_bq(bundle_root, store, ...)` + `sync_bundles_from_bq(store, dest_dir, ...)`. Mirror is a publish/sync utility, NOT a runtime loader; the runtime path stays `sync_bundles_from_bq → discover_bundles → from_bundles_root`. Both functions call `load_bundle` as a gate: publish refuses bundles that wouldn't load at runtime; sync writes to a side-by-side **staging directory** and `load_bundle`-validates the staged copy before performing a **staged replace** of the target (the rmtree+move pair is not strictly atomic; a crash between the two leaves the bundle absent on disk and is recoverable by re-sync, but the load-bundle-failure direction *is* atomic, so a bad mirror row never destroys a previously-good local bundle). Strict bundle-shape check (exactly `manifest.json` + the manifest's `module_filename`) plus a shape check on the manifest's `module_filename` (bare filename only: no separators, no `..`, no NUL; otherwise `manifest_row_unreadable`). Path-safety rejects traversal / absolute / backslash / NUL. `duplicate_fingerprint` rejects publish-side cases where two subdirs claim the same fingerprint (neither published). `duplicate_row` rejects two rows sharing the same `(fingerprint, bundle_path)` at sync. `malformed_row` shape check. Idempotent republish via DELETE+INSERT in `BigQueryBundleStore.publish_rows` (NOT a single atomic transaction; a transient INSERT failure is recoverable by re-running publish). `publish_rows` raises `ValueError` on duplicate input pairs as defense in depth. `BundleStore` Protocol for testability; `BigQueryBundleStore` is the concrete impl. Stable `MirrorFailure` codes; per-bundle problems accumulate, store exceptions propagate. Out of scope: GCS signed URLs, caching, garbage collection, multi-region. |
| [extractor_compilation_revalidation.md](extractor_compilation_revalidation.md) | Revalidation harness (issue #75 PR C2.d): `revalidate_compiled_extractors(events, compiled_extractors, reference_extractors, resolved_graph, ...)` drives `run_with_fallback` (with a no-op fallback) over a batch of events AND calls the reference extractor directly, aggregating outcomes into a `RevalidationReport` with **two orthogonal dimensions**: runtime decision (`compiled_unchanged` / `compiled_filtered` / `fallback_for_event`, plus `compiled_path_faults` split out so bundle bugs are distinguishable from ontology drift) and agreement against reference (`parity_match` / `parity_divergence` / `parity_not_checked`). Parity uses three comparators: `_compare_nodes` and `_compare_span_handling` from `measurement.py` plus `_compare_edges` in `revalidation.py` (same edge_id set with matching relationship_name / endpoints / property-set per shared edge; duplicate edge_ids on either side reported as a divergence rather than silently collapsed by dict keying). The parity dimension catches **schema-valid but semantically wrong** outputs the schema-only check would miss. **Every failure mode on the reference side becomes a parity divergence, never a batch abort**: exceptions, non-`StructuredExtractionResult` returns (including `None`), and comparator crashes all funnel into the divergence channel with a descriptive string. `check_thresholds(report, RevalidationThresholds(...))` evaluates policy gates; threshold rates are validated to `[0, 1]` at construction so a typo like `=5` (intended as 5%) fails loud. JSON-serializable for persistence; deterministic. Out of scope: scheduled orchestration, BQ persistence, CLI, sampling strategy. |
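The path-safety and `module_filename` shape checks the bundle-mirror row describes reduce to a few string predicates. A minimal sketch; `is_safe_bundle_path` and `is_bare_module_filename` are hypothetical helper names, not the SDK's API.

```python
def is_safe_bundle_path(bundle_path: str) -> bool:
    # invalid_bundle_path: reject traversal / absolute / backslash / NUL
    # paths before anything is written to disk.
    if not bundle_path or "\x00" in bundle_path or "\\" in bundle_path:
        return False
    if bundle_path.startswith("/"):
        return False
    return all(part not in ("", ".", "..") for part in bundle_path.split("/"))


def is_bare_module_filename(name: str) -> bool:
    # Shape check on the manifest's module_filename: bare filename only,
    # with no separators, no "..", and no NUL.
    return (bool(name) and "\x00" not in name and "/" not in name
            and "\\" not in name and name not in (".", ".."))
```

Running the filename check at sync time is what lets a bad `module_filename` surface as `manifest_row_unreadable` rather than a `FileNotFoundError` at the write step.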
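The revalidation row's note that threshold rates are validated to `[0, 1]` at construction can be illustrated with a frozen dataclass. `Thresholds` and its field names are hypothetical stand-ins for `RevalidationThresholds`, whose real fields the table does not enumerate.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Thresholds:
    """Sketch of construction-time validation: every rate must lie in
    [0, 1], so a typo like 5 (meant as 5%) fails loud."""
    max_fallback_rate: float = 0.0
    max_parity_divergence_rate: float = 0.0

    def __post_init__(self) -> None:
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name}={value!r} must be within [0, 1]")
```

Failing in `__post_init__` means the policy gate can never be evaluated against a nonsensical rate.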

## Deployment Surfaces