Google Cloud BigQuery storage backend for PathRAG.
This package provides three BigQuery-backed storage classes as an external plugin — no modifications to PathRAG source code required.
| Storage | Class | Description |
|---|---|---|
| KV | BigQueryKVStorage | Key-value storage with JSON serialization |
| Vector | BigQueryVectorDBStorage | Vector storage with cosine similarity search |
| Graph | BigQueryGraphStorage | Graph storage with BigQuery Property Graph + GQL support |
pip install "PathRAG @ git+https://github.com/ksmin23/PathRAG.git@v0.1.1"
pip install git+https://github.com/ksmin23/pathrag-bigquery.git@v0.1.0

import asyncio
import pathrag_bigquery
from PathRAG import PathRAG, QueryParam
# Register BigQuery storage classes with PathRAG
pathrag_bigquery.register()

async def main():
    rag = PathRAG(
        working_dir="./rag_storage",
        kv_storage="BigQueryKVStorage",
        vector_storage="BigQueryVectorDBStorage",
        graph_storage="BigQueryGraphStorage",
        addon_params={
            "bigquery_project_id": "my-project",
            "bigquery_dataset_id": "my_dataset",
        },
    )
    await rag.ainsert("Your document text here")
    result = await rag.aquery("Your question", param=QueryParam(mode="hybrid"))
    print(result)

asyncio.run(main())

BigQuery connection settings can be provided via addon_params or environment variables. Environment variables are used as a fallback when the corresponding addon_params key is not set.
| addon_params key | Environment Variable | Description |
|---|---|---|
| bigquery_project_id | BIGQUERY_PROJECT or GOOGLE_CLOUD_PROJECT | GCP project ID |
| bigquery_dataset_id | BIGQUERY_DATASET | BigQuery dataset ID |
| bigquery_graph_name | BIGQUERY_GRAPH_NAME | Property graph name (default: pathrag_knowledge_graph) |
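The fallback rule in the table above can be sketched as a small lookup. This is a minimal illustration, not the package's actual code: `resolve_setting`, `ENV_FALLBACKS`, and `DEFAULTS` are hypothetical names, and checking BIGQUERY_PROJECT before GOOGLE_CLOUD_PROJECT is an assumption based only on the order in which the table lists them.

```python
import os

# Hypothetical mapping that mirrors the table: each addon_params key lists the
# environment variables consulted, in order, when the key itself is not set.
# The BIGQUERY_PROJECT-before-GOOGLE_CLOUD_PROJECT order is an assumption.
ENV_FALLBACKS = {
    "bigquery_project_id": ("BIGQUERY_PROJECT", "GOOGLE_CLOUD_PROJECT"),
    "bigquery_dataset_id": ("BIGQUERY_DATASET",),
    "bigquery_graph_name": ("BIGQUERY_GRAPH_NAME",),
}

# Only the graph name has a documented default.
DEFAULTS = {"bigquery_graph_name": "pathrag_knowledge_graph"}


def resolve_setting(key, addon_params, environ=os.environ):
    """Return addon_params[key] if set, else the first set env var, else the default."""
    value = addon_params.get(key)
    if value:
        return value
    for name in ENV_FALLBACKS[key]:
        if environ.get(name):
            return environ[name]
    return DEFAULTS.get(key)
```

With this sketch, `resolve_setting("bigquery_project_id", {}, {"GOOGLE_CLOUD_PROJECT": "env-project"})` falls through to the environment, while an explicit addon_params value always wins.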
export GOOGLE_CLOUD_PROJECT=my-project
export BIGQUERY_DATASET=my_dataset

import pathrag_bigquery
from PathRAG import PathRAG

pathrag_bigquery.register()
rag = PathRAG(
    kv_storage="BigQueryKVStorage",
    vector_storage="BigQueryVectorDBStorage",
    graph_storage="BigQueryGraphStorage",
    ...
)

Authenticate with Application Default Credentials:
gcloud auth application-default login

Or with a service account key:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

The dataset is created automatically during initialization (CREATE SCHEMA IF NOT EXISTS). Tables and the property graph are also created automatically.
If you prefer to create the dataset manually:
export BIGQUERY_DATASET=pathrag
bq --location=US mk --dataset $GOOGLE_CLOUD_PROJECT:$BIGQUERY_DATASET

KV:
| Column | Type | Description |
|---|---|---|
| id | STRING | Primary key |
| data | STRING | JSON-serialized value |
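
As a rough illustration of the id/data layout, the sketch below turns a Python dict into rows shaped like this table and filters out keys that already exist, following the skip-existing-keys semantics noted in this README's design decisions. The helper names `to_kv_rows` and `filter_new_keys` are hypothetical and are not part of the package.

```python
import json


def to_kv_rows(data):
    """Convert {key: value} into rows with an id column and a JSON string data column."""
    return [
        {"id": key, "data": json.dumps(value, ensure_ascii=False)}
        for key, value in data.items()
    ]


def filter_new_keys(data, existing_ids):
    """Keep only entries whose key is not already stored; existing keys are skipped."""
    return {key: value for key, value in data.items() if key not in existing_ids}
```

Reading a row back is the inverse step: `json.loads(row["data"])` recovers the original value.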
Vector:
| Column | Type | Description |
|---|---|---|
| id | STRING | Primary key |
| embedding | ARRAY<FLOAT64> | Embedding vector |
| content | STRING | Original text content |
| meta fields | STRING | Dynamic metadata columns |
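
To make the ranking behind cosine similarity search concrete, here is a minimal pure-Python sketch of ordering rows by cosine distance (1 minus cosine similarity), which is what sorting by COSINE_DISTANCE() amounts to. It runs locally on in-memory rows and does not call BigQuery; `cosine_distance` and `top_k` are illustrative names, and raising on a zero vector is a choice made for this sketch.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_embedding, rows, k):
    """Return the k rows whose embedding is closest to the query (smallest distance first)."""
    return sorted(
        rows, key=lambda row: cosine_distance(query_embedding, row["embedding"])
    )[:k]
```

Because every row is scored against the query, this mirrors a brute-force scan rather than an indexed search.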
Nodes:
| Column | Type | Description |
|---|---|---|
| id | STRING | Node identifier |
| entity_type | STRING | Entity type label |
| description | STRING | Entity description |
| source_id | STRING | Source document reference |
Edges:
| Column | Type | Description |
|---|---|---|
| id | STRING | Source node ID |
| target_id | STRING | Target node ID |
| weight | FLOAT64 | Edge weight |
| description | STRING | Relationship description |
| keywords | STRING | Relationship keywords |
| source_id | STRING | Source document reference |
A BigQuery Property Graph is created on top of the node and edge tables, enabling GQL queries via GRAPH_TABLE() for path traversal (1-hop, 2-hop, 3-hop).
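
To show what 1-hop, 2-hop, and 3-hop traversal means over the edge table, the following pure-Python sketch enumerates simple paths of at most three hops between two nodes. It illustrates the semantics only: it is not the GQL the package runs, `find_paths` is a hypothetical name, and treating edges as directed from id to target_id is an assumption.

```python
def find_paths(edges, source, target, max_hops=3):
    """Enumerate simple paths from source to target using at most max_hops edges."""
    # Build an adjacency list from edge rows (id = source node, target_id = target node).
    adjacency = {}
    for edge in edges:
        adjacency.setdefault(edge["id"], []).append(edge["target_id"])

    paths = []

    def extend(path):
        node = path[-1]
        if node == target and len(path) > 1:
            paths.append(path)
            return
        if len(path) - 1 == max_hops:
            return  # hop budget exhausted
        for neighbor in adjacency.get(node, []):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                extend(path + [neighbor])

    extend([source])
    return sorted(paths, key=len)  # shortest paths first
```

A path that needs four or more hops is not returned, matching the 3-hop limit described above.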
pathrag-bigquery/
├── pyproject.toml
├── src/
│   └── pathrag_bigquery/
│       ├── __init__.py                    # register() and public exports
│       ├── client.py                      # BigQueryClientManager and helpers
│       ├── kv.py                          # BigQueryKVStorage
│       ├── vector.py                      # BigQueryVectorDBStorage
│       └── graph.py                       # BigQueryGraphStorage
├── examples/
│   ├── _config.py                         # Shared configuration loader
│   ├── .env.example                       # Environment variable template
│   ├── requirements.txt                   # Example dependencies
│   ├── basic_usage.py                     # Simple insert and query
│   ├── batch_insert_and_query.py          # Batch insert with context query
│   ├── env_var_config.py                  # Env-var-only configuration
│   └── knowledge_graph_exploration.py     # Direct graph access
└── tests/
    ├── _config.py                         # Shared test configuration
    ├── .env.example                       # Test environment template
    ├── Makefile                           # Test runner targets
    ├── README.md                          # Test documentation
    ├── test_kv_storage.py                 # KV storage CRUD tests
    ├── test_graph_storage.py              # Graph storage operation tests
    └── test_upsert_edge_preserves_node.py # Edge upsert node preservation tests

| Decision | Approach | Rationale |
|---|---|---|
| Sync vs Async | Synchronous BigQuery SDK wrapped with asyncio.to_thread | BigQuery Python SDK is synchronous; avoids blocking the event loop |
| Upsert (KV) | Query existing keys first, then MERGE INTO for new keys only | Matches PathRAG's JsonKVStorage semantics: existing keys are skipped |
| Upsert (Graph/Vector) | MERGE INTO ... WHEN MATCHED / NOT MATCHED | BigQuery lacks INSERT OR UPDATE; MERGE is the idiomatic alternative |
| Namespace Isolation | Table-name-based ({namespace}_kv, vdb_{namespace}, etc.) | Follows PathRAG's convention; no extra workspace column needed |
| Property Graph | BigQuery Property Graph (CREATE PROPERTY GRAPH) | Native graph support with GQL queries via GRAPH_TABLE() |
| GQL Syntax | GRAPH_TABLE(graph MATCH ... COLUMNS ...) | BigQuery GQL syntax (differs from Spanner's GRAPH ... MATCH ... RETURN) |
| Embedding Type | ARRAY<FLOAT64> | BigQuery's native vector type with COSINE_DISTANCE support |
| Vector Search | COSINE_DISTANCE() in ORDER BY | Simple and universal; VECTOR_SEARCH with IVF index can be added later |
| Client Reuse | Singleton BigQueryClientManager | Shares a single BigQuery client across all storage classes |
| Primary Key | PRIMARY KEY (id) NOT ENFORCED | BigQuery does not enforce primary keys; used as advisory hints |
| Path Finding | GQL multi-hop MATCH patterns | find_paths_between uses 1/2/3-hop GQL queries for PathRAG's path-based retrieval |
| PageRank | Normalized degree approximation | Full iterative PageRank is impractical for real-time BigQuery queries |
| Plugin Registration | register() adds classes to PathRAG's global _EXTERNAL_STORAGES registry | No need to modify PathRAG source; classes are resolved by string name at runtime |
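
The sync-versus-async decision in the table can be illustrated with a short sketch of wrapping a blocking call in asyncio.to_thread (Python 3.9+). The function `run_query_sync` is a hypothetical stand-in that sleeps instead of calling BigQuery, so the example runs without credentials; the package's own wrappers may look different.

```python
import asyncio
import time


def run_query_sync(sql):
    """Stand-in for a blocking SDK call (hypothetical; sleeps instead of using the network)."""
    time.sleep(0.05)
    return [{"sql": sql}]


async def run_query(sql):
    """Run the blocking call in a worker thread so the event loop stays responsive."""
    return await asyncio.to_thread(run_query_sync, sql)


async def run_many(statements):
    """Issue several wrapped calls concurrently and collect results in input order."""
    return await asyncio.gather(*(run_query(sql) for sql in statements))
```

Calling `asyncio.run(run_many(["SELECT 1", "SELECT 2"]))` runs both stand-in queries in separate threads while the event loop remains free for other coroutines.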
MIT