pathrag-bigquery

Google Cloud BigQuery storage backend for PathRAG.

This package provides three BigQuery-backed storage classes as an external plugin; no changes to the PathRAG source code are required.

| Storage | Class | Description |
|---|---|---|
| KV | `BigQueryKVStorage` | Key-value storage with JSON serialization |
| Vector | `BigQueryVectorDBStorage` | Vector storage with cosine similarity search |
| Graph | `BigQueryGraphStorage` | Graph storage with BigQuery Property Graph + GQL support |

Installation

pip install "PathRAG @ git+https://github.com/ksmin23/PathRAG.git@v0.1.1"
pip install git+https://github.com/ksmin23/pathrag-bigquery.git@v0.1.0

Quick Start

import asyncio
import pathrag_bigquery
from PathRAG import PathRAG, QueryParam

# Register BigQuery storage classes with PathRAG
pathrag_bigquery.register()

async def main():
    rag = PathRAG(
        working_dir="./rag_storage",
        kv_storage="BigQueryKVStorage",
        vector_storage="BigQueryVectorDBStorage",
        graph_storage="BigQueryGraphStorage",
        addon_params={
            "bigquery_project_id": "my-project",
            "bigquery_dataset_id": "my_dataset",  # dataset IDs allow letters, digits, and underscores only
        },
    )

    await rag.ainsert("Your document text here")
    result = await rag.aquery("Your question", param=QueryParam(mode="hybrid"))
    print(result)

asyncio.run(main())

Configuration

BigQuery connection settings can be provided via `addon_params` or environment variables. An environment variable is used as a fallback when the corresponding `addon_params` key is not set.

| `addon_params` key | Environment Variable | Description |
|---|---|---|
| `bigquery_project_id` | `BIGQUERY_PROJECT` or `GOOGLE_CLOUD_PROJECT` | GCP project ID |
| `bigquery_dataset_id` | `BIGQUERY_DATASET` | BigQuery dataset ID |
| `bigquery_graph_name` | `BIGQUERY_GRAPH_NAME` | Property graph name (default: `pathrag_knowledge_graph`) |
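
The fallback order described above can be sketched as follows. This is a minimal illustration, not the package's code: `resolve_setting` and the two lookup tables are hypothetical names, and checking `BIGQUERY_PROJECT` before `GOOGLE_CLOUD_PROJECT` is an assumption based only on the order in which the table lists them.

```python
import os
from typing import Mapping, Optional

# Hypothetical lookup tables mirroring the configuration table above.
# The order of the two project variables is an assumption.
_ENV_FALLBACKS = {
    "bigquery_project_id": ("BIGQUERY_PROJECT", "GOOGLE_CLOUD_PROJECT"),
    "bigquery_dataset_id": ("BIGQUERY_DATASET",),
    "bigquery_graph_name": ("BIGQUERY_GRAPH_NAME",),
}
_DEFAULTS = {"bigquery_graph_name": "pathrag_knowledge_graph"}


def resolve_setting(
    key: str,
    addon_params: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the addon_params value if set, else the first matching
    environment variable, else the documented default (or None)."""
    env = os.environ if environ is None else environ
    value = addon_params.get(key)
    if value:
        return value
    for name in _ENV_FALLBACKS.get(key, ()):
        if env.get(name):
            return env[name]
    return _DEFAULTS.get(key)
```

Passing `environ` explicitly keeps the helper easy to test; in normal use it would read `os.environ`.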

Using Environment Variables

export GOOGLE_CLOUD_PROJECT=my-project
export BIGQUERY_DATASET=my_dataset

import pathrag_bigquery
from PathRAG import PathRAG

pathrag_bigquery.register()

rag = PathRAG(
    kv_storage="BigQueryKVStorage",
    vector_storage="BigQueryVectorDBStorage",
    graph_storage="BigQueryGraphStorage",
    ...
)

Prerequisites

GCP Authentication

gcloud auth application-default login

Or with a service account key:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

BigQuery Dataset

The dataset is created automatically during initialization (`CREATE SCHEMA IF NOT EXISTS`). Tables and the property graph are also created automatically.

If you prefer to create the dataset manually:

export BIGQUERY_DATASET=pathrag

bq --location=US mk --dataset "${GOOGLE_CLOUD_PROJECT}:${BIGQUERY_DATASET}"

Table Schema

KV Storage (`{namespace}_kv`)

| Column | Type | Description |
|---|---|---|
| `id` | STRING | Primary key |
| `data` | STRING | JSON-serialized value |
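
To make the schema concrete, here is a rough sketch of the DDL that would produce a dataset and KV table of this shape. The helper `kv_table_ddl` is hypothetical, and the statements are inferred from this README (`CREATE SCHEMA IF NOT EXISTS`, the two columns above, and the `PRIMARY KEY (id) NOT ENFORCED` hint listed under Design Decisions); the statements the package actually issues may differ.

```python
from typing import List


def kv_table_ddl(project_id: str, dataset_id: str, namespace: str) -> List[str]:
    """Build illustrative DDL for the dataset and the `{namespace}_kv` table.

    Hypothetical helper: it only formats strings and does not call BigQuery.
    """
    dataset = f"`{project_id}.{dataset_id}`"
    table = f"`{project_id}.{dataset_id}.{namespace}_kv`"
    return [
        f"CREATE SCHEMA IF NOT EXISTS {dataset}",
        (
            f"CREATE TABLE IF NOT EXISTS {table} (\n"
            "  id STRING,\n"
            "  data STRING,\n"
            "  PRIMARY KEY (id) NOT ENFORCED\n"
            ")"
        ),
    ]


if __name__ == "__main__":
    for statement in kv_table_ddl("my-project", "my_dataset", "full_docs"):
        print(statement + ";")
```

The namespace `full_docs` is only an example value.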

Vector Storage (`vdb_{namespace}`)

| Column | Type | Description |
|---|---|---|
| `id` | STRING | Primary key |
| `embedding` | `ARRAY<FLOAT64>` | Embedding vector |
| `content` | STRING | Original text content |
| meta fields | STRING | Dynamic metadata columns |
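
Vector search over this table orders rows by cosine distance to the query embedding. The pure-Python sketch below mirrors that ranking locally so the ordering is easy to reason about; it is an illustration of the metric, not the SQL the package runs, and `cosine_distance` and `top_k` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(
    query: Sequence[float], rows: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Rank stored embeddings by ascending distance, like ORDER BY ... LIMIT k."""
    scored = [(row_id, cosine_distance(query, emb)) for row_id, emb in rows.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


if __name__ == "__main__":
    rows = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.0], rows, k=2))
```

A smaller distance means a closer match, so the most similar rows come first.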

Graph Storage (`{namespace}_nodes`, `{namespace}_edges`)

Nodes:

| Column | Type | Description |
|---|---|---|
| `id` | STRING | Node identifier |
| `entity_type` | STRING | Entity type label |
| `description` | STRING | Entity description |
| `source_id` | STRING | Source document reference |

Edges:

| Column | Type | Description |
|---|---|---|
| `id` | STRING | Source node ID |
| `target_id` | STRING | Target node ID |
| `weight` | FLOAT64 | Edge weight |
| `description` | STRING | Relationship description |
| `keywords` | STRING | Relationship keywords |
| `source_id` | STRING | Source document reference |

A BigQuery Property Graph is created on top of the node and edge tables, enabling GQL queries via `GRAPH_TABLE()` for path traversal (1-hop, 2-hop, 3-hop).
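
To show what a 1-hop, 2-hop, and 3-hop traversal returns, the sketch below enumerates paths over an in-memory adjacency map. It is a local stand-in for the GQL queries, not the package's implementation: `find_paths` is a hypothetical name, edges are assumed to be directed from `id` to `target_id`, and only simple paths (no repeated nodes) are returned, which may differ from the patterns the package matches.

```python
from typing import Dict, List, Set


def find_paths(
    adjacency: Dict[str, Set[str]], source: str, target: str, max_hops: int = 3
) -> List[List[str]]:
    """Return all simple directed paths from source to target with at most
    max_hops edges, shortest first."""
    paths: List[List[str]] = []

    def walk(path: List[str]) -> None:
        node = path[-1]
        if node == target and len(path) > 1:
            paths.append(path)
            return
        if len(path) - 1 == max_hops:  # hop budget used up
            return
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in path:  # keep paths simple
                walk(path + [nxt])

    walk([source])
    return sorted(paths, key=len)


if __name__ == "__main__":
    graph = {"A": {"B", "C"}, "B": {"D"}, "C": {"B"}, "D": set()}
    for path in find_paths(graph, "A", "D"):
        print(" -> ".join(path))
```

In the example graph, `A` reaches `D` in two hops through `B` and in three hops through `C` and `B`.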

Project Structure

pathrag-bigquery/
├── pyproject.toml
├── src/
│   └── pathrag_bigquery/
│       ├── __init__.py          # register() and public exports
│       ├── client.py            # BigQueryClientManager and helpers
│       ├── kv.py                # BigQueryKVStorage
│       ├── vector.py            # BigQueryVectorDBStorage
│       └── graph.py             # BigQueryGraphStorage
├── examples/
│   ├── _config.py               # Shared configuration loader
│   ├── .env.example             # Environment variable template
│   ├── requirements.txt         # Example dependencies
│   ├── basic_usage.py           # Simple insert and query
│   ├── batch_insert_and_query.py    # Batch insert with context query
│   ├── env_var_config.py        # Env-var-only configuration
│   └── knowledge_graph_exploration.py  # Direct graph access
└── tests/
    ├── _config.py               # Shared test configuration
    ├── .env.example             # Test environment template
    ├── Makefile                 # Test runner targets
    ├── README.md                # Test documentation
    ├── test_kv_storage.py       # KV storage CRUD tests
    ├── test_graph_storage.py    # Graph storage operation tests
    └── test_upsert_edge_preserves_node.py  # Edge upsert node preservation tests

Design Decisions

| Decision | Approach | Rationale |
|---|---|---|
| Sync vs Async | Synchronous BigQuery SDK wrapped with `asyncio.to_thread` | BigQuery Python SDK is synchronous; running calls in a worker thread avoids blocking the event loop |
| Upsert (KV) | Query existing keys first, then `MERGE INTO` for new keys only | Matches PathRAG's `JsonKVStorage` semantics: existing keys are skipped |
| Upsert (Graph/Vector) | `MERGE INTO ... WHEN MATCHED / NOT MATCHED` | BigQuery lacks `INSERT OR UPDATE`; MERGE is the idiomatic alternative |
| Namespace Isolation | Table-name-based (`{namespace}_kv`, `vdb_{namespace}`, etc.) | Follows PathRAG's convention; no extra `workspace` column needed |
| Property Graph | BigQuery Property Graph (`CREATE PROPERTY GRAPH`) | Native graph support with GQL queries via `GRAPH_TABLE()` |
| GQL Syntax | `GRAPH_TABLE(graph MATCH ... COLUMNS ...)` | BigQuery GQL syntax (differs from Spanner's `GRAPH ... MATCH ... RETURN`) |
| Embedding Type | `ARRAY<FLOAT64>` | BigQuery stores embeddings as arrays; `ARRAY<FLOAT64>` works with `COSINE_DISTANCE` |
| Vector Search | `COSINE_DISTANCE()` in `ORDER BY` | Simple and universal; `VECTOR_SEARCH` with IVF index can be added later |
| Client Reuse | Singleton `BigQueryClientManager` | Shares a single BigQuery client across all storage classes |
| Primary Key | `PRIMARY KEY (id) NOT ENFORCED` | BigQuery does not enforce primary keys; they serve as advisory hints |
| Path Finding | GQL multi-hop MATCH patterns | `find_paths_between` uses 1/2/3-hop GQL queries for PathRAG's path-based retrieval |
| PageRank | Normalized degree approximation | Full iterative PageRank is impractical for real-time BigQuery queries |
| Plugin Registration | `register()` adds classes to PathRAG's global `_EXTERNAL_STORAGES` registry | No need to modify PathRAG source; classes are resolved by string name at runtime |
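
The Sync vs Async row can be illustrated with a small sketch. Here `run_query_sync` is a hypothetical stand-in for a blocking client call (it only sleeps), so the example runs without GCP credentials; the package's own wrapper may look different. `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time
from typing import List


def run_query_sync(sql: str) -> str:
    """Hypothetical blocking call standing in for a synchronous SDK query."""
    time.sleep(0.05)  # simulate network latency
    return f"done: {sql}"


async def run_query(sql: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(run_query_sync, sql)


async def main() -> List[str]:
    # Both calls overlap because each one runs in its own thread.
    return await asyncio.gather(run_query("SELECT 1"), run_query("SELECT 2"))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

`asyncio.gather` returns results in the order the awaitables were passed, regardless of which thread finishes first.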

License

MIT
