From fb09ea35681a3dd7a6847cba012ed340df00b6b4 Mon Sep 17 00:00:00 2001 From: Cloud IX Team Date: Thu, 18 Jun 2026 18:01:59 -0700 Subject: [PATCH] feat: Add Bigtable Basics skill to registry PiperOrigin-RevId: 934643230 --- README.md | 1 + skills/cloud/bigtable-basics/SKILL.md | 115 +++++++++++ .../assets/row_key_schema.yaml | 14 ++ .../references/cli_data_access.md | 79 ++++++++ .../references/client_libraries.md | 48 +++++ .../bigtable-basics/references/dataplex.md | 27 +++ .../references/infrastructure_management.md | 123 ++++++++++++ .../references/schema_design.md | 187 +++++++++++++++++ .../bigtable-basics/references/sql_guide.md | 190 ++++++++++++++++++ 9 files changed, 784 insertions(+) create mode 100644 skills/cloud/bigtable-basics/SKILL.md create mode 100644 skills/cloud/bigtable-basics/assets/row_key_schema.yaml create mode 100644 skills/cloud/bigtable-basics/references/cli_data_access.md create mode 100644 skills/cloud/bigtable-basics/references/client_libraries.md create mode 100644 skills/cloud/bigtable-basics/references/dataplex.md create mode 100644 skills/cloud/bigtable-basics/references/infrastructure_management.md create mode 100644 skills/cloud/bigtable-basics/references/schema_design.md create mode 100644 skills/cloud/bigtable-basics/references/sql_guide.md diff --git a/README.md b/README.md index 0e9a3c9754..d9c56aeb61 100644 --- a/README.md +++ b/README.md @@ -25,6 +25,7 @@ repo to install. 
- [**Skill Registry API on Agent Platform**](./skills/cloud/agent-platform-skill-registry) - [**AlloyDB Basics**](./skills/cloud/alloydb-basics) - [**BigQuery Basics**](./skills/cloud/bigquery-basics) +- [**Bigtable Basics**](./skills/cloud/bigtable-basics) - [**Cloud Run Basics**](./skills/cloud/cloud-run-basics) - [**Cloud SQL Basics**](./skills/cloud/cloud-sql-basics) - [**Firebase Basics**](./skills/cloud/firebase-basics) diff --git a/skills/cloud/bigtable-basics/SKILL.md b/skills/cloud/bigtable-basics/SKILL.md new file mode 100644 index 0000000000..257e40c73a --- /dev/null +++ b/skills/cloud/bigtable-basics/SKILL.md @@ -0,0 +1,115 @@ +--- +name: bigtable-basics +description: >- + Assists in provisioning instances/tables, designing performant schemas, and querying data in Bigtable. Use when designing Bigtable row keys, configuring column families, writing SQL queries or client library code (Java, Go, Python) for Bigtable, or diagnosing performance/hotspotting issues. Also use when provisioning Bigtable clusters using gcloud or cbt CLIs. Don't use for generic Cloud SQL administration. +--- + +# Bigtable Basics + +This skill provides core workflows and guidance for administering and developing +with Google Bigtable. + +## Core Principles + +- **Control Plane vs. Data Plane:** + - Use **`gcloud`** for Control Plane operations: Manage Instances, + Clusters, App Profiles, Backups and IAM. Create Tables, Logical Views, + Materialized Views and Authorized Views. + - Use **`cbt`** for Data Plane operations: Update Tables, Column Families, + and reading/writing data. +- **Performance First:** Bigtable is a NoSQL database. Efficiency is tied to + Row Key design. Always warn about Full Table Scans. +- **Client Selection:** For production use cases, prefer **Java** or **Go** + for their superior performance and feature coverage compared to other + languages. 
+- **Observability:** When diagnosing performance or hotspotting, **always** + mention **Key Visualizer** (via Cloud Console) as the primary diagnostic + tool because it provides the most granular view of access patterns across + row keys. This should be followed by the hot-tablets tool and table stats + in gcloud CLI and `include-stats=full` option under `cbt read` to diagnose + slow queries. + +> [!IMPORTANT] **Safety Rule:** You MUST obtain explicit user confirmation before +> making non-emulator database changes. You MUST mention this safety requirement +> when providing commands or instructions that modify the database structure or +> data. + +## Quick Recipes + +### 1. Querying Data + +Use SQL for complex transforms or aggregations and key-value APIs for simpler +query patterns. *Note: Use exact match, prefix (`_key LIKE 'myprefix%'`), or +range predicates on `_key` to avoid expensive unbounded scans. Recommend +explicit row ranges (`_key BETWEEN 'start' AND 'end'`) as a more performant +alternative to prefix matches where possible.* + +If expensive scans (either unbounded or prefix or range queries scanning a large +range) are unavoidable due to multiple access patterns that can’t all be +accommodated in a single schema, consider one of these two options: + +- If the query will be used in user facing and/or latency sensitive + applications, use continuous materialized views with keys optimized for the + additional access patterns. +- If secondary access patterns are infrequent, batch patterns like ETL, ML + model training or analytical read-only tasks, use Bigtable Data Boost + instead. + +### 2. Manipulating Data + +Use key-value APIs for insert, update, increment and delete operations. SQL API +is read-only. + +### 3. Data Model Definition (DDL) + +SQL API doesn't support DDL operations. Table creation, deletion, updates should +be made using gcloud CLI. 
Logical Views and Continuous Materialized Views are +defined as SQL queries but they must be created using gcloud CLI. + +## Reference Guides + +- **CLI Operations**: + - [infrastructure_management.md](references/infrastructure_management.md): + Provisioning instances, clusters, and table schemas. + - [cli_data_access.md](references/cli_data_access.md): Reading and writing + data via the `cbt` CLI. +- **Design & Discovery**: + - [schema_design.md](references/schema_design.md): Best practices for row + keys and performance with tables and continuous materialized views. + - [dataplex.md](references/dataplex.md): Data catalog search for Bigtable + assets. +- **Querying & Code**: + - [sql_guide.md](references/sql_guide.md): Querying structured row keys + via SQL and CLI. + - [client_libraries.md](references/client_libraries.md): Patterns for + high-performance Go/Java/Python code. + +## Common Workflows + +### Schema Evolution (DevOps) + +1. **Prefer Terraform** for production schema changes to prevent accidental + data loss. +2. For manual `cbt` changes, first check the existing state by listing the table's column families and GC policies before proposing any modifications: + + ```bash + cbt ls {table} + ``` + + If modifications are needed, create the family or update the GC policy: + + ```bash + cbt createfamily {table} {family} + cbt setgcpolicy {table} {family} "maxversions=5 AND maxage=30d" + ``` + +3. Reference + [infrastructure_management.md](references/infrastructure_management.md) for + full syntax. 
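The safety rule and the manual `cbt` workflow above can be combined into a small guard that decides whether a command needs explicit user approval before it runs. This is a minimal sketch, not part of any Bigtable tooling: the helper name `needs_confirmation` is hypothetical, the set of mutating subcommands is an assumption taken only from the `cbt` commands shown in this skill (it is not exhaustive), and it assumes the subcommand is the first argument because project and instance come from `~/.cbtrc`.

```python
import os

# cbt subcommands used in this skill that change schema or data.
# Assumption: this list is illustrative, not an exhaustive inventory of cbt.
MUTATING_CBT_SUBCOMMANDS = {
    "createtable", "deletetable", "createfamily", "deletefamily",
    "setgcpolicy", "set", "deleterow",
}


def needs_confirmation(argv, env=None):
    """Return True when a cbt invocation should wait for explicit user approval."""
    env = os.environ if env is None else env
    # Only cbt invocations with a subcommand are considered here.
    if len(argv) < 2 or argv[0] != "cbt":
        return False
    # Changes against the local emulator are exempt from the safety rule.
    if env.get("BIGTABLE_EMULATOR_HOST"):
        return False
    return argv[1] in MUTATING_CBT_SUBCOMMANDS
```

Read-only commands such as `cbt ls {table}` pass through, while `cbt createfamily` or `cbt setgcpolicy` against a real instance would be held until the user confirms.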
+ +## External Resources + +* [Cloud Bigtable Documentation](https://cloud.google.com/bigtable/docs) +* [Bigtable SQL Reference](https://cloud.google.com/bigtable/docs/reference/sql) +* [cbt CLI Reference](https://cloud.google.com/bigtable/docs/cbt-reference) +* [gcloud bigtable Reference](https://cloud.google.com/sdk/gcloud/reference/bigtable) diff --git a/skills/cloud/bigtable-basics/assets/row_key_schema.yaml b/skills/cloud/bigtable-basics/assets/row_key_schema.yaml new file mode 100644 index 0000000000..95b7c411b1 --- /dev/null +++ b/skills/cloud/bigtable-basics/assets/row_key_schema.yaml @@ -0,0 +1,14 @@ +encoding: + delimitedBytes: + delimiter: '#' +fields: +- fieldName: field1 + type: + bytesType: + encoding: + raw: {} +- fieldName: field2 + type: + bytesType: + encoding: + raw: {} diff --git a/skills/cloud/bigtable-basics/references/cli_data_access.md b/skills/cloud/bigtable-basics/references/cli_data_access.md new file mode 100644 index 0000000000..eeaac0ac74 --- /dev/null +++ b/skills/cloud/bigtable-basics/references/cli_data_access.md @@ -0,0 +1,79 @@ +# Bigtable CLI Data Access + +This document provides patterns for reading and writing data in Bigtable using +the `cbt` CLI. This is primarily used for debugging and quick data validation. + +## Configuring cbt for Data Access + +```bash +echo project = ${BIGTABLE_PROJECT} > ~/.cbtrc +echo instance = ${BIGTABLE_INSTANCE} >> ~/.cbtrc +``` + +## Reading Data + +### Read Single Row (Lookup) + +Reads all columns and versions for a specific row. + +```bash +cbt lookup {table_name} {row_key} +``` + +*Note: `cbt lookup` is optimized for point reads and is significantly more +efficient than using `cbt read` with a count or filter for retrieving a single +known row.* + +### Read N Rows + +Reads the first `N` rows from the table. + +```bash +cbt read {table_name} count={n} +``` + +### Read Range + +Reads rows between `START_KEY` (inclusive) and `END_KEY` (exclusive). 
+ +```bash +cbt read {table_name} start={start_key} end={end_key} +``` + +### Read using SQL + +For complex queries and aggregations, use SQL via the `cbt sql` command. + +```bash +cbt sql "SELECT * FROM my_table WHERE _key = 'user#123'" +``` + +### Row Count (Estimate) + +Provides an estimate of the number of rows in the table. + +```bash +gcloud bigtable instances tables describe {table_id} --instance={instance_id} --view stats +``` + +**Note**: `cbt count {table_name}` performs a full table scan. + +## Writing Data + +### Write Cell (Set) + +Writes a value to a specific cell (row, family, and column). + +```bash +cbt set {table_name} {row_key} {family}:{column}={value} +``` + +*Example:* `cbt set my-table user123 profile:email=user@example.com` + +## Deleting Data + +### Delete Row + +```bash +cbt deleterow {table_name} {row_key} +``` diff --git a/skills/cloud/bigtable-basics/references/client_libraries.md b/skills/cloud/bigtable-basics/references/client_libraries.md new file mode 100644 index 0000000000..7c16cb94df --- /dev/null +++ b/skills/cloud/bigtable-basics/references/client_libraries.md @@ -0,0 +1,48 @@ +# Bigtable Client Library User Guide + +This document outlines critical technical details about the Bigtable data model +and client libraries. + +## Language Recommendations + +For production use cases requiring the **best performance and feature +coverage**, **Java** or **Go** are highly recommended. These libraries are +mature, highly optimized, and typically receive new features first. Python is +suitable for scripting and data science but may have lower throughput for +high-concurrency production workloads. 
+ +- [Go Example](https://docs.cloud.google.com/bigtable/docs/samples-go-hello) +- [Java Example](https://docs.cloud.google.com/bigtable/docs/samples-java-hello-world) +- [Python Example](https://docs.cloud.google.com/bigtable/docs/samples-python-hello) +- [Node Example](https://docs.cloud.google.com/bigtable/docs/samples-nodejs-hello) + +## Timestamp Precision & Granularity + +Bigtable stores timestamps as **64-bit integers** representing **microseconds** +since the Unix epoch. However, Bigtable’s internal garbage collection and +versioning operate at **millisecond granularity**. + +> [!IMPORTANT] **Implementation Rule:** When generating code to store data, +> calculate the timestamp in milliseconds and multiply by 1,000. +> +> * **Correct:** `timestamp_micros = time_ms() * 1000` +> * **Incorrect:** Using raw microsecond precision (e.g., `time_micros()`), as +> this can lead to unexpected behavior with cell versioning and TTL. + +## Replication & Atomic Operations + +Bigtable’s replication model impacts the availability of certain "atomicity" +features. These atomic operations are generally less efficient than standard +writes. + +* **The Conflict:** **ReadModifyWrite** (increments/appends) and + **CheckAndMutateRow** (conditional updates) require a single-point-of-truth + to maintain consistency. They also require a read before a write, making + them significantly slower and more resource-intensive than standard blind + writes. +* **The Constraint:** These operations **will not work** with multi-cluster + routing (App Profiles set to Multi-cluster). +* **Agent Action:** If a user’s code contains these methods, proactively warn + them that these operations are inefficient and that they must use a + **Single-cluster routing** App Profile or accept that these operations will + fail in a multi-cluster configuration. 
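The timestamp implementation rule earlier in this guide (compute milliseconds, then multiply by 1,000) can be sketched as follows. Python is used only for brevity; the function name `bigtable_timestamp_micros` and its `now_ns` parameter are hypothetical and not part of any Bigtable client library.

```python
import time


def bigtable_timestamp_micros(now_ns=None):
    """Return a cell timestamp in microseconds, truncated to millisecond granularity."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Drop sub-millisecond precision first ...
    millis = now_ns // 1_000_000
    # ... then express the result in microseconds, as Bigtable expects.
    return millis * 1000
```

The returned value is always a multiple of 1,000, which is the property the rule is meant to guarantee; the same arithmetic carries over directly to Java or Go.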
diff --git a/skills/cloud/bigtable-basics/references/dataplex.md b/skills/cloud/bigtable-basics/references/dataplex.md new file mode 100644 index 0000000000..2aa6407e4c --- /dev/null +++ b/skills/cloud/bigtable-basics/references/dataplex.md @@ -0,0 +1,27 @@ +# Dataplex Catalog Search for Bigtable + +This document provides patterns for searching Bigtable data assets in the +Dataplex Universal Catalog. + +## Searching Entries + +Searches for entries matching a query in a specific Google Cloud project and +location. + +```bash +curl -X POST \ + -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \ + -H "Content-Type: application/json" \ + "https://dataplex.googleapis.com/v1/projects/${BIGTABLE_PROJECT}/locations/{location}:searchEntries" \ + -d '{"query": "{search_term} system=Bigtable"}' +``` + +*Example:* Search for "customer list" in `us-east1`: + +```bash +curl -X POST \ + -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \ + -H "Content-Type: application/json" \ + "https://dataplex.googleapis.com/v1/projects/my-project/locations/us-east1:searchEntries" \ + -d '{"query": "customer list system=Bigtable"}' +``` diff --git a/skills/cloud/bigtable-basics/references/infrastructure_management.md b/skills/cloud/bigtable-basics/references/infrastructure_management.md new file mode 100644 index 0000000000..ed11fce65c --- /dev/null +++ b/skills/cloud/bigtable-basics/references/infrastructure_management.md @@ -0,0 +1,123 @@ +# Bigtable Infrastructure and Administration + +This document provides patterns for provisioning and managing Bigtable +resources. 
+ +## Table of Contents + +* [Tooling Split](#tooling-split) [L13-L17] +* [Control Plane (gcloud)](#control-plane-gcloud) [L19-L57] +* [Data Plane (cbt)](#data-plane-cbt) [L59-L76] +* [Observability and Performance](#observability-and-performance) [L78-L91] +* [Local Development (Emulator)](#local-development-emulator) [L93-L107] + +## Tooling Split + +- **`gcloud` (Control Plane):** Use for instances, clusters, app profiles, + backups, and IAM. +- **`cbt` (Data Plane):** Use for tables, column families, and data + manipulation. + +## Control Plane (gcloud) + +### Instance and Cluster Management + +```bash +# Create instance with a single cluster +gcloud bigtable instances create ${BIGTABLE_INSTANCE} \ + --project=${BIGTABLE_PROJECT} \ + --display-name="{display_name}" \ + --cluster-config=id=${BIGTABLE_CLUSTER},zone={zone},nodes={num_nodes} + +# Add a cluster to an existing instance +gcloud bigtable clusters create ${BIGTABLE_CLUSTER} \ + --instance=${BIGTABLE_INSTANCE} \ + --zone={zone} \ + --nodes={num_nodes} + +# Delete instance +gcloud bigtable instances delete ${BIGTABLE_INSTANCE} --project=${BIGTABLE_PROJECT} --quiet +``` + +### Table and Schema Management + +```bash +# Create table with a column family +gcloud bigtable instances tables create {table_name} \ + --instance=${BIGTABLE_INSTANCE} \ + --column-families={family_name} + +# Create table with multiple column families and GC policies +gcloud bigtable instances tables create {table_name} \ + --instance=${BIGTABLE_INSTANCE} \ + --column-families="family1:maxage=10d,family2:maxversions=5" +``` + +### Backup and Restore + +```bash +# Create a backup +gcloud bigtable backups create {backup_id} \ + --instance=${BIGTABLE_INSTANCE} \ + --cluster=${BIGTABLE_CLUSTER} \ + --table={table_id} \ + --retention-period=7d + +# Restore a table from backup +gcloud bigtable instances tables restore \ + --source=projects/{project_id_source}/instances/{instance_id_source}/clusters/${BIGTABLE_CLUSTER}/backups/{backup_id} \ + 
 --destination={new_table_id} \ + --destination-instance={instance_id_destination} \ + --project={project_id_destination} \ + --async +``` + +## Data Plane (cbt) + +### Table and Schema Operations + +```bash +# Create/Delete table +cbt createtable {table_name} +cbt deletetable {table_name} + +# List tables and families +cbt ls +cbt ls {table_name} + +# Create a column family, set its GC policy, or delete it +cbt createfamily {table_name} {family_name} +cbt setgcpolicy {table_name} {family_name} "maxversions=1" +cbt deletefamily {table_name} {family_name} +``` + +## Observability and Performance + +### Hotspotting Diagnosis + +When performance degrades or a "hotspot" is suspected: + +1. **Key Visualizer:** Direct the user to the Google Cloud Console. Key + Visualizer provides a heatmap of access patterns across row keys. +2. **List Hot Tablets (gcloud):** Identify specific tablets with high CPU + usage. + + ```bash + gcloud bigtable hot-tablets list ${BIGTABLE_CLUSTER} --instance=${BIGTABLE_INSTANCE} + ``` + +## Local Development (Emulator) + +Start the Bigtable emulator for testing: + +```bash +gcloud beta emulators bigtable start --host-port=localhost:8086 +``` + +To point `cbt` or client libraries to the emulator: + +```bash +export BIGTABLE_EMULATOR_HOST=localhost:8086 +``` + +**Note**: The Bigtable emulator doesn't support Bigtable GoogleSQL yet. diff --git a/skills/cloud/bigtable-basics/references/schema_design.md b/skills/cloud/bigtable-basics/references/schema_design.md new file mode 100644 index 0000000000..24b64497f7 --- /dev/null +++ b/skills/cloud/bigtable-basics/references/schema_design.md @@ -0,0 +1,187 @@ +# Bigtable Schema Design Guide for Agents + +This document provides guidelines for designing performant schemas. 
+ +## Table of Contents + +* [Key concepts](#key-concepts) [L15-L40] +* [Defining the Row Key Template](#defining-the-row-key-template) [L42-L63] +* [Structured Row Keys](#structured-row-keys) [L65-L106] +* [Row Key Design & Hotspotting](#row-key-design--hotspotting) [L108-L128] +* [Counters for real-time metrics](#counters-for-real-time-metrics) + [L130-L135] +* [Materialized views](#materialized-views) [L137-L178] +* [Performance Checklist (Agent Verification)](#performance-checklist-agent-verification) + [L180-L197] + +## Key concepts + +* **Row key:** Bigtable stores data lexicographically by row key. For best + performance queries should be designed to filter by row key in its entirety + or prefix. Point lookups by row key or reading ranges starting with a key + will be the most performant. Row keys can have multiple parts combined + using a delimiter, typically following a hierarchical format such as + `category#subcategory#productID` as in `apparel#shoes#0123`. + Bigtable doesn't support multi-row transactions but changes within a row are + transactional. When designing schemas put data that needs to be updated + transactionally within the same row. +* **Column Families:** Group data that is accessed together within a row. + Defined as part of the schema. Contents of a family can easily be deleted in + bulk with a single command for a given row key. +* **Column Qualifiers:** Defined at write time. Each row can have as many + unique qualifiers within the row size limits (256 MB) with no limit on + number of qualifiers per table. Qualifiers can be used in two ways: 1. as + attributes in a JSON document e.g. `zipcode`, `city`, `state`, `street + address` or 2. to store data like affinity scores e.g. `0.9`, `0.7` for + different products or web pages they visited e.g. `home`, `search`, `cart`. +* **Timestamps:** Are used for versioning. They are not system timestamps. 
+ They are user-defined and often used for event times like a sensor reading, + address change timestamp or date a social media post was written. They can + be used to expire items using TTL or move them to cold storage for cost + savings as well as time-travel queries to find the "as of" state of a + record. + +## Defining the Row Key Template + +Since Bigtable treats row keys as opaque bytes, defining a **Row Key Template** +is a manual design process based on your use case. To ensure consistency when +programmatically interacting with data, your application code must implement a +mechanism (such as string formatting or concatenation) to construct and parse +these keys. + +### 1. The Template Format + +Define your keys using a placeholder syntax: +`{tenant_id}#{entity_type}#{reversed_timestamp}#{uuid}` + +### 2. Implementation Pattern + +Use centralized factory functions to construct keys. + +* **Java:** `String.format("%s#%s#%d#%s", tenantId, entity, Long.MAX_VALUE - + ts, uuid)` +* **Go:** `fmt.Sprintf("%s#%s#%d#%s", tenantID, entity, math.MaxInt64-ts, + uuid)` + +### 3. Delimiter Selection + +Use `#`, `:`, or `|`. Ensure delimiters don't appear in the field data. + +## Structured Row Keys + +Bigtable supports **Structured Row Keys** to define the structure of your row +keys. This metadata helps external tools (like BigQuery) and the Bigtable SQL +interface understand how to parse your keys. + +### Why use Structured Row Keys? + +* **Automatic Parsing:** SQL queries can reference individual segments by name + instead of using string functions. +* **Integration:** Improves the experience when querying Bigtable from + BigQuery or Spark. +* **Validation:** Helps prevent malformed keys. 
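As an illustration of the `{tenant_id}#{entity_type}#{reversed_timestamp}#{uuid}` template from "Defining the Row Key Template", the following Python sketch shows one way to centralize key construction and parsing. The function names `build_row_key` and `parse_row_key` are hypothetical, and padding the reversed timestamp to 19 digits is an assumption made here so that keys always sort correctly as strings; the Java and Go one-liners above do not pad.

```python
MAX_INT64 = 2**63 - 1  # same bound as Long.MAX_VALUE / math.MaxInt64
DELIMITER = "#"


def build_row_key(tenant_id, entity_type, timestamp_ms, uuid):
    """Build a key of the form tenant#entity#reversed_timestamp#uuid."""
    for field in (tenant_id, entity_type, uuid):
        # Delimiters inside field data would make the key impossible to parse.
        if DELIMITER in field:
            raise ValueError(f"field {field!r} contains the delimiter {DELIMITER!r}")
    if not 0 <= timestamp_ms <= MAX_INT64:
        raise ValueError("timestamp_ms must fit in a non-negative int64")
    # Reversing the timestamp makes newer events sort before older ones.
    reversed_ts = MAX_INT64 - timestamp_ms
    # Zero pad to 19 digits (the width of MAX_INT64) for stable string ordering.
    return DELIMITER.join([tenant_id, entity_type, f"{reversed_ts:019d}", uuid])


def parse_row_key(row_key):
    """Split a key produced by build_row_key back into its four fields."""
    tenant_id, entity_type, reversed_ts, uuid = row_key.split(DELIMITER)
    return tenant_id, entity_type, MAX_INT64 - int(reversed_ts), uuid
```

Keeping both directions in one module means every reader and writer agrees on field order, padding, and delimiter, which is the consistency the template is meant to provide.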
+ +### Managing via gcloud + +You can define the structure when creating a table or update an existing one: + +```bash +gcloud bigtable instances tables update {table_id} \ + --instance=${BIGTABLE_INSTANCE} \ + --row-key-schema-definition-file={row_key_schema_definition_file} +``` + +Where `{row_key_schema_definition_file}` is a YAML file. A template is provided +in `assets/row_key_schema.yaml`. You can copy this template to create your +schema definition. + +## Row Key Design & Hotspotting + +If row keys are autoincrement or are prefixed by date or timestamp, all writes +will hit a single node, creating a "hotspot" and taxing the overall system +performance. Bigtable's in-memory tier addresses hotspotting for reads (e.g. +trending content on social media) but keys should be designed by keeping writes +in mind. + +### Distribution Strategy + +To ensure high performance, agents must validate that row keys are designed for +**high cardinality**. + +* **Avoid:** Sequential timestamps at the start of the key. +* **Prefer:** Prefixes to divide up the key space or reversed timestamps + (e.g., `tenantID#reversedTimestamp#objectID`). + +#### Field Salting Example + +If a user must use a low-cardinality prefix, recommend "salting" the key: +`salt = hash(original_key) % number_of_nodes` `new_row_key = salt + "#" + +original_key` + +## Counters for real-time metrics + +For frequent updates on single row metrics e.g. number of ad views, social media +post likes, API calls or daily unique viewers (using data sketches like HLL), +create an aggregate family to use Bigtable counters for much higher throughput +and lower latency compared to read-modify-write. + +## Materialized views + +### For real-time analytics + +Materialized views can be used for real-time aggregations across one or more +rows for any type of data including aggregate families (note that approximate +count distinct sketches will need to be aggregated using HLL_COUNT.MERGE). 
+Frequently used metrics can be pre-aggregated efficiently as these views are +incrementally maintained, then further filtered and aggregated at read time +using SQL. Telemetry, merchant analytics, ad performance monitoring and +real-time features for machine learning are some common use cases. Below is an +example query that can be used as a materialized view. It returns the hourly +count of messages in each chat room for a messaging application that uses the +chat room's unique identifier as the first row key token. + +```sql +SELECT SPLIT(_key, '#')[0] AS chatroom, TIMESTAMP_TRUNC(_timestamp, HOUR) AS time_bucket, +COUNT(_key) AS total_messages FROM UNPACK((SELECT * FROM messages(WITH_HISTORY=>TRUE))) +GROUP BY 1, 2 +``` + +### For secondary indexing + +Materialized views can be used as asynchronous global secondary indexes. Given +the wide range of SQL functions supported, even geospatial index (using +`S2_CELLIDFROMPOINT`) and inverted index use cases can be served with +materialized views. Below is an example of an inverted index that allows fast +search for rows that contain a given word in any cell of the `user_profile` +family or in a particular qualifier. + +```sql +SELECT +u.value AS indexed_value, +u.key AS indexed_qualifier, +ARRAY_AGG(_key) AS user_keys +FROM users, UNNEST(MAP_ENTRIES(user_profile)) u +GROUP BY 1,2 +``` + +Filtering this index view on `indexed_value` alone returns all occurrences as +an array of row keys from the `users` table, while filtering on both +`indexed_value` and `indexed_qualifier` returns results for only that qualifier. + +## Performance Checklist (Agent Verification) + +When reviewing or generating schema-related code, verify the following: + +- [ ] **Row Key Size:** Must be < 4 KB (Ideal: 10–100 bytes). Large keys + increase memory pressure and disk usage. +- [ ] **Uniqueness:** Ensure row keys are globally unique. Duplicate keys will + overwrite existing data. 
+- [ ] **Character Set:** Use `^[a-zA-Z0-9\-_#]+$`. Stick to alphanumeric, + underscores, and hashes. Zero pad all numbers to ensure correct string + sorting. +- [ ] **Column Qualifier Size:** Keep < 16 KB to minimize storage footprint. +- [ ] **Column Family Count:** Limit to < 100 families. Keep names short. +- [ ] **Cell Field Size:** Keep < 10 MB (100 MB is the hard limit). Larger + cells slow down retrieval. +- [ ] **Row Size:** Keep < 100 MB. Note that Bigtable enforces a hard limit of + 256 MB at read time. diff --git a/skills/cloud/bigtable-basics/references/sql_guide.md b/skills/cloud/bigtable-basics/references/sql_guide.md new file mode 100644 index 0000000000..d7152fa78f --- /dev/null +++ b/skills/cloud/bigtable-basics/references/sql_guide.md @@ -0,0 +1,190 @@ +# Bigtable SQL Guide for Agents + +This document outlines key aspects of Google Bigtable's SQL dialect which +extends GoogleSQL to support a multi-version wide-column data model. Bigtable +currently only supports SELECT statements over single tables i.e. JOIN and UNION +operations are not supported with the exception of JOINs with UNNEST(array) to +support working with nested objects. + +## Table of Contents + +* [Bigtable SQL Data Structures](#bigtable-sql-data-structures) [L19-L190] + * [Columns](#columns) [L21-L42] + * [Primitive types](#primitive-types) [L43-L65] + * [Timestamps](#timestamps) [L66-L124] + * [Row keys](#row-keys) [L125-L157] + * [Maps](#maps) [L158-L164] + * [Protocol buffers (protos)](#protocol-buffers-protos) [L165-L181] + * [Functions and operators](#functions-and-operators) [L182-L190] + +## Bigtable SQL Data Structures + +### Columns + +Unless table metadata indicates otherwise, columns are of **Map type** which +hold versioned key-value pairs. In legacy Bigtable APIs, SQL maps correspond to +**column families**. + +* **Access Pattern:** You cannot select a column directly as a scalar value. + You must access the specific key within the column family. 
+* **Correct Syntax Example:** + + ```sql + -- CORRECT: Use map-like bracket notation for the column qualifier. + SELECT cf1['text'] FROM messages; + ``` + +* **Incorrect Syntax Example:** + + ```sql + -- INCORRECT: Dot notation is not supported and will fail. + SELECT cf1.text FROM messages; + ``` + +### Primitive types + +There is often no type information associated with column values. You should +infer the type from the name of the column and include an explicit cast in +generated SQL queries. + +* *Example:* `SELECT CAST(info['address'] AS STRING) AS address FROM + table_name` +* *Example:* `SELECT CAST(CAST(info['age'] AS STRING) AS INT64) AS age FROM + table_name` +* *Example:* `SELECT SAFE_CAST(info['address'] AS STRING) FROM table_name;` +* *Example:* `SELECT TO_INT64(cf['age']) AS age FROM table_name` +* *Example:* `SELECT CAST(CAST(cf['checkin_date'] AS STRING) AS DATE) AS + checkin_date FROM table_name` +* *Example:* `SELECT TIMESTAMP(CAST(CAST(cf['checkin_date'] AS STRING) AS + DATE)) AS checkin_date_time FROM table_name` +* *Example:* `SELECT CAST(CAST(cf['is_booked'] AS STRING) AS BOOL) AS + is_booked FROM table_name` - if "true" or "false" was stored +* *Example:* `SELECT CAST(TO_INT64(cf['is_booked']) AS BOOL) AS is_booked FROM + table_name` - if 1 or 0 was stored +* *Example:* `SELECT CAST(SAFE_CONVERT_BYTES_TO_STRING(cf['is_booked']) AS + BOOL) AS is_booked FROM table_name` - if "true" or "false" was stored + +### Timestamps + +By default Bigtable returns the **latest value** of each column when queried +with a standard SELECT statement. You can use the flags explained below to get +prior versions. Using any flag other than "as_of" and "with_history => FALSE" +will return timestamp-value pairs. The SQL interface exposes Bigtable +timestamps as the SQL TIMESTAMP type. This section is not an exhaustive list of +version management flags; for more details, refer to the Google Cloud +documentation. 
+ +* **Access Pattern:** To retrieve values as of a certain point in time (on or + immediately prior to the provided timestamp), use the as_of flag. + +```sql +SELECT * FROM table_name(as_of => TIMESTAMP("2025-03-28 14:13:40-0400")) +``` + +-------------------------------------------------------------------------------- + +* **Access Pattern:** To retrieve all versions treat table name as a + table-valued function and set the with_history flag to TRUE. + +```sql +SELECT * FROM table_name(with_history => TRUE) +``` + +-------------------------------------------------------------------------------- + +* **Access Pattern:** To retrieve last 5 versions treat table name as a + table-valued function and use the with_history flag. + +```sql +SELECT * FROM table_name(with_history => TRUE, latest_n => 5) +``` + +-------------------------------------------------------------------------------- + +* **Access Pattern:** To retrieve a range of timestamps treat table name as a + table-valued function and use the before, after, after_or_equal, and + before_or_equal flags. + +```sql +SELECT * FROM table_name(with_history => true, after => TIMESTAMP("2025-03-28 14:13:40-0400"), before_or_equal => TIMESTAMP("2025-03-28 14:15:10-04:00")) +``` + +-------------------------------------------------------------------------------- + +* **Access Pattern:** Convert timestamped values into a flat table and perform + time bucketing and aggregations. + +```sql +SELECT TIMESTAMP_TRUNC(_timestamp, HOUR) AS hourly, AVG(temp_versioned) AS average_temperature FROM +UNPACK((SELECT metrics['temperature'] AS temp_versioned FROM sensorReadings(with_history => true, after => TIMESTAMP('2023-01-14T23:00:00.000Z'), before => TIMESTAMP('2023-01-21T01:00:00.000Z')) +WHERE _key LIKE 'sensorA%')) +GROUP BY 1 +``` + +-------------------------------------------------------------------------------- + +### Row keys + +Each Bigtable row is identified with a unique key. 
Bigtable SQL interface has a +pseudo-column named **_key** that is used to query by row key. + +#### Standard Scans (Opaque Key) + +Success with Bigtable depends on translating logical queries into efficient +physical scans. + +* **Point Lookup:** `SELECT * FROM table_name WHERE _key = 'row_key'` +* **Prefix Scan:** `SELECT * FROM table_name WHERE STARTS_WITH(_key, + 'prefix#')` +* **Range Scan:** `SELECT * FROM table_name WHERE _key >= 'start#key' AND _key + < 'end#key'` +* **Fuzzy Match:** `SELECT * FROM table_name WHERE _key LIKE '%#pattern#%'` + **(Warning: Causes full table scan)** + +#### Querying with Structured Row Keys + +If a **Structured Row Key** is defined for the table (see `schema_design.md`), +you can reference segments directly by name in the `WHERE` clause. This allows +for cleaner, more expressive queries. + +* **Syntax:** `SELECT * FROM table_name WHERE segment_name = 'value'` +* **Example:** If your structure defines `tenant_id` and `timestamp`, you can + query: `SELECT * FROM table_name WHERE tenant_id = '123' AND timestamp > + 1713300000` + +**Agent Action:** When generating queries that are not looking for exact matches +(point lookups), key ranges, or prefixes, warn the user that the query will +result in a full table scan. + +### Maps + +Inspect qualifiers within a column family. + +* *Example:* `SELECT MAP_KEYS(cf1) FROM table_name` +* *Example:* `SELECT MAP_ENTRIES(cf1) FROM table_name` + +### Protocol buffers (protos) + +Some users prefer to serialize data to protobufs and store them as blobs in +Bigtable. Protobufs reduce the storage footprint and are often faster to read, +advantageous for data that is not frequently updated. Protobufs can be directly +queried using SQL after their schemas are registered using `gcloud bigtable +schema-bundles create`. 
+ +* **Example:** If you have a `profile` proto with attributes `user_name`, + `gender` and `birth_year`, registered under the `accounts` bundle with + package_name `publisher` and stored under family `user`, column qualifier + `info`, you can query: + + ```sql + SELECT CAST(user['info'] AS accounts.publisher.profile).user_name FROM accounts; + ``` + +### Functions and operators + +Bigtable offers a wide range of +[SQL functions](https://docs.cloud.google.com/bigtable/docs/reference/sql/functions-all), +[operators](https://docs.cloud.google.com/bigtable/docs/reference/sql/operators), +and +[conditional expressions](https://docs.cloud.google.com/bigtable/docs/reference/sql/conditional_expressions).