pathrag-bigquery

Google Cloud BigQuery storage backend for PathRAG.

This package provides three BigQuery-backed storage classes as an external plugin — no modifications to PathRAG source code required.

| Storage | Class | Description |
| --- | --- | --- |
| KV | BigQueryKVStorage | Key-value storage with JSON serialization |
| Vector | BigQueryVectorDBStorage | Vector storage with cosine similarity search |
| Graph | BigQueryGraphStorage | Graph storage with BigQuery Property Graph + GQL support |

Installation

pip install "PathRAG @ git+https://github.com/ksmin23/PathRAG.git@v0.1.1"
pip install git+https://github.com/ksmin23/pathrag-bigquery.git@v0.1.0

Quick Start

import asyncio
import pathrag_bigquery
from PathRAG import PathRAG, QueryParam

# Register BigQuery storage classes with PathRAG
pathrag_bigquery.register()

async def main():
    rag = PathRAG(
        working_dir="./rag_storage",
        kv_storage="BigQueryKVStorage",
        vector_storage="BigQueryVectorDBStorage",
        graph_storage="BigQueryGraphStorage",
        addon_params={
            "bigquery_project_id": "my-project",
            # Dataset IDs may contain only letters, digits, and underscores
            "bigquery_dataset_id": "my_dataset",
        },
    )

    await rag.ainsert("Your document text here")
    result = await rag.aquery("Your question", param=QueryParam(mode="hybrid"))
    print(result)

asyncio.run(main())
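
The string names passed to PathRAG above resolve only because register() ran first. The following is a minimal sketch of that kind of name-based lookup, assuming a plain dictionary registry. The names STORAGE_REGISTRY, register_storage, and resolve_storage are hypothetical and do not describe PathRAG's actual internals.

```python
# Hypothetical stand-in for a name-based storage registry; not PathRAG's real code.
STORAGE_REGISTRY: dict[str, type] = {}


def register_storage(cls: type) -> type:
    """Make a storage class discoverable by its class name."""
    STORAGE_REGISTRY[cls.__name__] = cls
    return cls


def resolve_storage(name: str) -> type:
    """Look up a storage class by the string given to the constructor."""
    try:
        return STORAGE_REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown storage {name!r}; was register() called first?"
        ) from None


class BigQueryKVStorage:  # placeholder body, for illustration only
    pass


register_storage(BigQueryKVStorage)
```

With this shape, forgetting the registration step fails early with a clear error instead of a missing-attribute failure deep inside initialization.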

Configuration

BigQuery connection settings can be provided via addon_params or environment variables. When a key is missing from addon_params, the corresponding environment variable is used as a fallback.

| addon_params key | Environment Variable | Description |
| --- | --- | --- |
| bigquery_project_id | BIGQUERY_PROJECT or GOOGLE_CLOUD_PROJECT | GCP project ID |
| bigquery_dataset_id | BIGQUERY_DATASET | BigQuery dataset ID |
| bigquery_graph_name | BIGQUERY_GRAPH_NAME | Property graph name (default: pathrag_knowledge_graph) |
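
The fallback order can be sketched as a small resolver. This is an illustration under assumptions, not the package's code: the function names are hypothetical, and checking BIGQUERY_PROJECT before GOOGLE_CLOUD_PROJECT is an assumed precedence.

```python
import os
from typing import Mapping, Optional, Sequence

DEFAULT_GRAPH_NAME = "pathrag_knowledge_graph"  # default listed in the table above


def resolve_setting(
    addon_params: Mapping[str, str],
    key: str,
    env_vars: Sequence[str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return addon_params[key] if set, else the first non-empty env var, else default."""
    value = addon_params.get(key)
    if value:
        return value
    for name in env_vars:
        value = os.environ.get(name)
        if value:
            return value
    return default


def resolve_bigquery_config(addon_params: Mapping[str, str]) -> dict:
    """Hypothetical helper that collects the three connection settings."""
    return {
        # Assumption: BIGQUERY_PROJECT is consulted before GOOGLE_CLOUD_PROJECT.
        "project_id": resolve_setting(
            addon_params,
            "bigquery_project_id",
            ("BIGQUERY_PROJECT", "GOOGLE_CLOUD_PROJECT"),
        ),
        "dataset_id": resolve_setting(
            addon_params, "bigquery_dataset_id", ("BIGQUERY_DATASET",)
        ),
        "graph_name": resolve_setting(
            addon_params,
            "bigquery_graph_name",
            ("BIGQUERY_GRAPH_NAME",),
            default=DEFAULT_GRAPH_NAME,
        ),
    }
```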

Using Environment Variables

export GOOGLE_CLOUD_PROJECT=my-project
export BIGQUERY_DATASET=my_dataset

Then, in Python:

import pathrag_bigquery
from PathRAG import PathRAG

pathrag_bigquery.register()

# No addon_params needed: project and dataset are read from the environment
rag = PathRAG(
    kv_storage="BigQueryKVStorage",
    vector_storage="BigQueryVectorDBStorage",
    graph_storage="BigQueryGraphStorage",
    # ...
)

Prerequisites

GCP Authentication

gcloud auth application-default login

Or with a service account key:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

BigQuery Dataset

The dataset is created automatically during initialization (CREATE SCHEMA IF NOT EXISTS). Tables and the property graph are also created automatically.

If you prefer to create the dataset manually:

export BIGQUERY_DATASET=pathrag

bq --location=US mk --dataset $GOOGLE_CLOUD_PROJECT:$BIGQUERY_DATASET
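
For orientation, the automatic initialization described above could be assembled from statements like the ones below. This is a sketch under assumptions: the helper names are hypothetical, the identifier checks are a simplification chosen for this example, and the statements the package actually issues may differ. The column list and the NOT ENFORCED primary key follow the KV table described under Table Schema and Design Decisions.

```python
import re

# Simplification for this sketch: names limited to letters, digits, underscores.
_SIMPLE_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def _quote(identifier: str) -> str:
    """Backtick-quote one path component, rejecting embedded backticks."""
    if not identifier or "`" in identifier:
        raise ValueError(f"Invalid identifier: {identifier!r}")
    return f"`{identifier}`"


def create_dataset_sql(project_id: str, dataset_id: str) -> str:
    """DDL that creates the dataset only when it does not exist yet."""
    if not _SIMPLE_NAME.match(dataset_id):
        raise ValueError(f"Invalid dataset ID: {dataset_id!r}")
    return f"CREATE SCHEMA IF NOT EXISTS {_quote(project_id)}.{_quote(dataset_id)}"


def create_kv_table_sql(project_id: str, dataset_id: str, namespace: str) -> str:
    """DDL for the {namespace}_kv table with an advisory primary key."""
    if not _SIMPLE_NAME.match(dataset_id) or not _SIMPLE_NAME.match(namespace):
        raise ValueError("dataset_id and namespace must be simple names")
    table = f"{namespace}_kv"
    path = ".".join(_quote(part) for part in (project_id, dataset_id, table))
    return (
        f"CREATE TABLE IF NOT EXISTS {path} (\n"
        "  id STRING,\n"
        "  data STRING,\n"
        "  PRIMARY KEY (id) NOT ENFORCED\n"
        ")"
    )
```

Because both statements use IF NOT EXISTS, running them on every startup is safe whether or not the dataset was created manually beforehand.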

Table Schema

KV Storage ({namespace}_kv)

| Column | Type | Description |
| --- | --- | --- |
| id | STRING | Primary key |
| data | STRING | JSON-serialized value |
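
The Design Decisions section notes that KV upserts skip keys that already exist. A dictionary-backed sketch of that behavior, with values stored as JSON strings as in the data column, might look like this. The class InMemoryKV is a hypothetical stand-in for illustration and does not touch BigQuery.

```python
import json
from typing import Any


class InMemoryKV:
    """Dictionary-backed stand-in for a {namespace}_kv table (id -> JSON string)."""

    def __init__(self) -> None:
        self.rows: dict[str, str] = {}

    def upsert(self, data: dict[str, Any]) -> dict[str, Any]:
        """Insert only keys that are not stored yet; return what was inserted."""
        new_items = {k: v for k, v in data.items() if k not in self.rows}
        for key, value in new_items.items():
            self.rows[key] = json.dumps(value, ensure_ascii=False)
        return new_items

    def get_by_id(self, key: str) -> Any:
        """Return the deserialized value, or None when the key is absent."""
        raw = self.rows.get(key)
        return None if raw is None else json.loads(raw)
```

Under these semantics, a second upsert with the same key leaves the stored value untouched, which is why the first write wins.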

Vector Storage (vdb_{namespace})

| Column | Type | Description |
| --- | --- | --- |
| id | STRING | Primary key |
| embedding | ARRAY<FLOAT64> | Embedding vector |
| content | STRING | Original text content |
| (meta fields) | STRING | Dynamic metadata columns |
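
According to the Design Decisions section, search orders rows by COSINE_DISTANCE() against the query embedding. The pure-Python sketch below shows the ranking that such an ordering produces. It is an in-memory equivalent for illustration, not the SQL the package runs, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: Sequence[tuple[str, Sequence[float]]],
    k: int,
) -> list[str]:
    """Return the ids of the k rows closest to the query, nearest first."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [row_id for row_id, _ in ranked[:k]]
```

This is a full scan over every row, which matches the "simple and universal" trade-off: exact results, with cost growing linearly in the number of stored vectors.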

Graph Storage ({namespace}_nodes, {namespace}_edges)

Nodes:

| Column | Type | Description |
| --- | --- | --- |
| id | STRING | Node identifier |
| entity_type | STRING | Entity type label |
| description | STRING | Entity description |
| source_id | STRING | Source document reference |

Edges:

| Column | Type | Description |
| --- | --- | --- |
| id | STRING | Source node ID |
| target_id | STRING | Target node ID |
| weight | FLOAT64 | Edge weight |
| description | STRING | Relationship description |
| keywords | STRING | Relationship keywords |
| source_id | STRING | Source document reference |

A BigQuery Property Graph is created on top of the node and edge tables, enabling GQL queries via GRAPH_TABLE() for path traversal (1-hop, 2-hop, 3-hop).
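
To make the 1-hop, 2-hop, and 3-hop traversal concrete, here is an in-memory sketch of bounded path enumeration over (id, target_id) edge pairs. It illustrates the result such queries aim for rather than the GQL itself. The function name is hypothetical, and treating edges as directed from id to target_id is an assumption.

```python
from collections import defaultdict


def find_paths(edges, source, target, max_hops=3):
    """Return all simple paths from source to target that use 1..max_hops edges."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)  # assumption: edges are directed src -> dst

    paths = []

    def extend(path):
        node = path[-1]
        if node == target and len(path) > 1:
            paths.append(list(path))
            return
        if len(path) - 1 == max_hops:
            return  # hop budget used up
        for nxt in adjacency.get(node, []):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                path.append(nxt)
                extend(path)
                path.pop()

    extend([source])
    return sorted(paths, key=len)  # shortest paths first


edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("B", "D")]
paths = find_paths(edges, "A", "D")
```

Capping the hop count keeps the number of candidate paths bounded, which matters because path counts grow quickly with depth in dense graphs.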

Project Structure

pathrag-bigquery/
├── pyproject.toml
├── src/
│   └── pathrag_bigquery/
│       ├── __init__.py          # register() and public exports
│       ├── client.py            # BigQueryClientManager and helpers
│       ├── kv.py                # BigQueryKVStorage
│       ├── vector.py            # BigQueryVectorDBStorage
│       └── graph.py             # BigQueryGraphStorage
├── examples/
│   ├── _config.py               # Shared configuration loader
│   ├── .env.example             # Environment variable template
│   ├── requirements.txt         # Example dependencies
│   ├── basic_usage.py           # Simple insert and query
│   ├── batch_insert_and_query.py    # Batch insert with context query
│   ├── env_var_config.py        # Env-var-only configuration
│   └── knowledge_graph_exploration.py  # Direct graph access
└── tests/
    ├── _config.py               # Shared test configuration
    ├── .env.example             # Test environment template
    ├── Makefile                 # Test runner targets
    ├── README.md                # Test documentation
    ├── test_kv_storage.py       # KV storage CRUD tests
    ├── test_graph_storage.py    # Graph storage operation tests
    └── test_upsert_edge_preserves_node.py  # Edge upsert node preservation tests

Design Decisions

| Decision | Approach | Rationale |
| --- | --- | --- |
| Sync vs Async | Synchronous BigQuery SDK wrapped with asyncio.to_thread | BigQuery Python SDK is synchronous; avoids blocking the event loop |
| Upsert (KV) | Query existing keys first, then MERGE INTO for new keys only | Matches PathRAG's JsonKVStorage semantics: existing keys are skipped |
| Upsert (Graph/Vector) | MERGE INTO ... WHEN MATCHED / NOT MATCHED | BigQuery lacks INSERT OR UPDATE; MERGE is the idiomatic alternative |
| Namespace Isolation | Table-name-based ({namespace}_kv, vdb_{namespace}, etc.) | Follows PathRAG's convention; no extra workspace column needed |
| Property Graph | BigQuery Property Graph (CREATE PROPERTY GRAPH) | Native graph support with GQL queries via GRAPH_TABLE() |
| GQL Syntax | GRAPH_TABLE(graph MATCH ... COLUMNS ...) | BigQuery GQL syntax (differs from Spanner's GRAPH ... MATCH ... RETURN) |
| Embedding Type | ARRAY<FLOAT64> | BigQuery's standard representation for embeddings; accepted by COSINE_DISTANCE |
| Vector Search | COSINE_DISTANCE() in ORDER BY | Simple and universal; VECTOR_SEARCH with IVF index can be added later |
| Client Reuse | Singleton BigQueryClientManager | Shares a single BigQuery client across all storage classes |
| Primary Key | PRIMARY KEY (id) NOT ENFORCED | BigQuery does not enforce primary keys; used as advisory hints |
| Path Finding | GQL multi-hop MATCH patterns | find_paths_between uses 1/2/3-hop GQL queries for PathRAG's path-based retrieval |
| PageRank | Normalized degree approximation | Full iterative PageRank is impractical for real-time BigQuery queries |
| Plugin Registration | register() adds classes to PathRAG's global _EXTERNAL_STORAGES registry | No need to modify PathRAG source; classes are resolved by string name at runtime |
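
The first row of the table, wrapping a synchronous call with asyncio.to_thread, can be sketched as follows. The function run_query_blocking is a hypothetical placeholder that sleeps instead of contacting BigQuery; only asyncio.to_thread and asyncio.gather are real standard-library calls here.

```python
import asyncio
import time


def run_query_blocking(sql: str) -> str:
    """Placeholder for a synchronous SDK call; sleeps instead of contacting BigQuery."""
    time.sleep(0.05)
    return f"done: {sql}"


async def run_query(sql: str) -> str:
    # Hand the blocking call to a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(run_query_blocking, sql)


async def main() -> list[str]:
    # Both calls run in separate worker threads and overlap in time.
    return await asyncio.gather(run_query("SELECT 1"), run_query("SELECT 2"))


results = asyncio.run(main())
```

asyncio.gather preserves argument order in its result, so callers can match results to queries without extra bookkeeping.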

License

MIT
