diff --git a/docs/defradb/Concepts/secondary-index.md b/docs/defradb/Concepts/secondary-index.md new file mode 100644 index 0000000..68e5a7e --- /dev/null +++ b/docs/defradb/Concepts/secondary-index.md @@ -0,0 +1,359 @@ +--- +sidebar_label: Secondary index +sidebar_position: 10 +--- + +# Secondary indexes + +## Overview + +Secondary indexes in DefraDB enable efficient document lookups by creating optimized data structures that map field values to documents. Instead of scanning entire collections, indexes allow DefraDB to quickly locate documents matching specific criteria. + +**Key Points** + +DefraDB's secondary indexing system uses the `@index` directive on GraphQL schema fields to create indexes that **significantly improve query performance on filtered queries**. + +**Core capabilities:** + +- **Field-level indexes** – Index individual fields for fast lookups +- **Composite indexes** – Index multiple fields together for complex queries +- **Unique constraints** – Enforce uniqueness at the index level +- **Relationship indexes** – Index foreign key relationships between documents +- **JSON field indexes** – Index nested paths within JSON fields using inverted indexes +- **Array field indexes** – Index values within array fields + +**Performance trade-off:** Indexes improve read performance but add write overhead, as each document update must also update all relevant indexes. Indexing arrays and JSON fields can fill up storage quickly with large data. + +**Best practices:** Index frequently filtered fields, avoid indexing rarely queried fields, and plan indexes based on your application's query patterns. + +## How indexes work + +### Basic concept + +An index is a data structure that maps field values to document identifiers. Instead of scanning every document in a collection (a "table scan"), DefraDB can use the index to directly locate matching documents. 
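To make the contrast concrete, here is a minimal Python sketch of a full scan next to an index lookup. It is an illustrative model only, not DefraDB's implementation: the `users` data and the `scan` and `build_index` helpers are hypothetical names, and a plain dictionary stands in for the index.

```python
# Illustrative model only: DefraDB's real index lives in its key-value store.
from collections import defaultdict

# Hypothetical sample collection: document ID -> document fields.
users = {
    "doc_id_1": {"name": "Alice", "age": 30},
    "doc_id_2": {"name": "Bob", "age": 25},
    "doc_id_3": {"name": "Carol", "age": 30},
}


def scan(docs, field, value):
    """Full scan: inspect every document and keep the ones that match."""
    return sorted(doc_id for doc_id, doc in docs.items() if doc.get(field) == value)


def build_index(docs, field):
    """Map each value of `field` to the IDs of the documents holding it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        index[doc[field]].add(doc_id)
    return index


age_index = build_index(users, "age")

# Both approaches return the same documents; the index skips non-matching ones.
scan_result = scan(users, "age", 30)
index_result = sorted(age_index.get(30, set()))
print(scan_result, index_result)
```

The scan touches every document on each query, while the index pays its cost once at build time (and again on every write) so that a lookup only touches matching entries.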
+ +**Without an index:** + +``` +Query: Find users with age = 30 +Process: Scan all user documents → Check each age field → Return matches +Cost: O(n) where n = total documents +``` + +**With an index on age:** + +``` +Query: Find users with age = 30 +Process: Look up "30" in age index → Return matching document IDs +Cost: O(1) for lookup + O(m) for retrieval where m = matching documents +``` + +### Index structure + +For regular indexes, DefraDB stores index entries as key-value pairs where the document ID is part of the key and the value is empty: + +``` +/col_id/ind_id/field_values/_docID → {} +``` + +For unique indexes, the document ID is stored as the value instead: + +``` +/col_id/ind_id/field_values → _docID +``` + +For a User collection with an indexed `name` field, the entries look like: + +``` +Index entries: +"Alice/doc_id_1" → {} +"Bob/doc_id_2" → {} +"Bob/doc_id_3" → {} +"Charlie/doc_id_4" → {} +``` + +When you query for `name = "Bob"`, DefraDB looks up "Bob" in the index and retrieves matching documents one by one (e.g., `doc_id_2`, then `doc_id_3`). If a `limit: 1` is applied, only the first match is fetched. + +## Index types + +### Single-field indexes + +The simplest form of index covers a single field: + +```graphql +type User { + name: String @index + email: String @index(unique: true) +} +``` + +Each indexed field creates a separate index structure. The `unique: true` parameter adds a constraint ensuring no duplicate values. 
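The two key layouts from the index structure section, together with the uniqueness constraint, can be modeled with a toy key-value store. This is a minimal Python sketch under stated assumptions: `ToyIndexStore`, `regular_key`, and `unique_key` are hypothetical names, keys are plain strings rather than DefraDB's encoded keys, and a dictionary stands in for the datastore.

```python
# Illustrative model of the key layouts above; not DefraDB's actual code.

def regular_key(col_id, ind_id, value, doc_id):
    """Regular index: the document ID is the last key segment, value is empty."""
    return f"/{col_id}/{ind_id}/{value}/{doc_id}"


def unique_key(col_id, ind_id, value):
    """Unique index: the key stops at the field value; the doc ID is the value."""
    return f"/{col_id}/{ind_id}/{value}"


class ToyIndexStore:
    """A dict standing in for the underlying key-value store (hypothetical)."""

    def __init__(self):
        self.entries = {}

    def add_regular(self, col_id, ind_id, value, doc_id):
        self.entries[regular_key(col_id, ind_id, value, doc_id)] = {}

    def add_unique(self, col_id, ind_id, value, doc_id):
        key = unique_key(col_id, ind_id, value)
        # The extra read that a unique index needs before every write.
        if key in self.entries:
            raise ValueError(f"duplicate value for unique index: {value!r}")
        self.entries[key] = doc_id

    def lookup_regular(self, col_id, ind_id, value):
        """Prefix scan: collect the doc IDs stored under one field value."""
        prefix = f"/{col_id}/{ind_id}/{value}/"
        return sorted(k[len(prefix):] for k in self.entries if k.startswith(prefix))


store = ToyIndexStore()
store.add_regular(1, 1, "Bob", "doc_id_2")
store.add_regular(1, 1, "Bob", "doc_id_3")
store.add_unique(1, 2, "bob@example.com", "doc_id_2")

bob_ids = store.lookup_regular(1, 1, "Bob")
try:
    store.add_unique(1, 2, "bob@example.com", "doc_id_3")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(bob_ids, duplicate_rejected)
```

Putting the document ID inside the key lets a regular index hold many entries per value, while leaving it out of the key is what makes a second write with the same value collide in a unique index.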
+ +### Composite indexes + +Composite indexes span multiple fields and are optimized for queries filtering on those fields together: + +```graphql +type Article @index(includes: [ + {field: "status"}, + {field: "publishedAt"} +]) { + status: String + publishedAt: DateTime +} +``` + +**Index structure:** + +``` +published/2024-01-15/doc_id_1 → {} +published/2024-01-16/doc_id_2 → {} +published/2024-01-16/doc_id_3 → {} +draft/2024-01-15/doc_id_4 → {} +``` + +(Note: `col_id` and `index_id` are always prefixed but omitted here for clarity.) + +Composite indexes are efficient for queries like: + +```graphql +filter: { + status: {_eq: "published"} + publishedAt: {_gt: "2017-07-23T03:46:56-05:00"} +} +``` + +Queries filtering only on the second field (`publishedAt` alone) will not use this index at all. + +### Unique indexes + +Unique indexes enforce uniqueness constraints at the database level: + +```graphql +type User { + email: String @index(unique: true) +} +``` + +When you try to create a document with a duplicate email, DefraDB will reject it. This is more efficient than manually checking for duplicates in your application code. + +**Performance impact:** Unique indexes require an additional read operation on every insert or update to check for existing values. + +## Relationship indexing + +### How relationship indexes work + +When you index a relationship field, DefraDB creates an index on the foreign key reference: + +```graphql +type User { + address: Address @primary @index +} + +type Address { + city: String @index +} +``` + +This creates two indexes: + +1. User → Address foreign key index +2. Address city field index + +### Query optimization with relationship indexes + +Consider this query: + +```graphql +User(filter: {address: {city: {_eq: "Montreal"}}}) +``` + +**Without indexes:** + +1. Scan all User documents +2. For each User, fetch the related Address +3. Check if city matches "Montreal" +4. Return matching Users + +**With indexes:** + +1. 
Look up "Montreal" in the Address city index → Get Address IDs +2. Look up those Address IDs in the User→Address relationship index → Get User IDs +3. Retrieve those User documents + +The indexed approach avoids scanning the entire User collection and performs direct lookups instead. + +### Enforcing relationship cardinality + +Unique relationship indexes enforce one-to-one relationships: + +```graphql +type User { + address: Address @primary @index(unique: true) +} +``` + +Without the unique constraint, the relationship defaults to one-to-many (multiple Users could reference the same Address). The unique index ensures exactly one User per Address. + +Note: 1-to-2-sided relations are automatically constrained by a unique index to enforce the 1-to-1 invariant. + +## JSON field indexing + +JSON fields present unique indexing challenges because they're hierarchical and semi-structured. DefraDB uses a specialized approach to handle them efficiently. + +> **Storage warning:** Indexing JSON fields can consume significant disk space with large data, as every leaf node at every path is indexed separately. + +### Path-aware indexing + +Unlike scalar fields (String, Int, Bool), JSON fields contain nested structures. DefraDB indexes every leaf node in the JSON tree along with its complete path: + +**Example document:** + +```json +{ + "user": { + "device": { + "model": "iPhone", + "version": "15" + }, + "location": { + "city": "Montreal" + } + } +} +``` + +**Index entries created** (using `/col_id/ind_id/` prefix, JSON path parts separated by `.`): + +``` +/1/1/user.device.model/iPhone/doc_id_1 → {} +/1/1/user.device.version/15/doc_id_1 → {} +/1/1/user.location.city/Montreal/doc_id_1 → {} +``` + +Each entry includes the full path to the value, ensuring DefraDB knows not just what the value is, but where it exists within the document structure. + +### Inverted indexes for JSON + +DefraDB uses **inverted indexes** for JSON fields. 
An inverted index breaks each JSON document into its path/value pairs and maps every pair back to the documents that contain it. + +For context, a primary (non-inverted) index might look like: + +``` +/1/1/iPhone → {"user": {"device": {"model": "iPhone"}}} +``` + +The inverted secondary index instead maps paths and values to document IDs: + +``` +/1/1/user.device.model/iPhone/doc_id_1 → {} +/1/1/user.device.model/Android/doc_id_2 → {} +``` + +When you query for a specific path and value, DefraDB directly looks it up in the inverted index and retrieves all matching documents. For more on inverted indexes, see the [CockroachDB RFC on inverted indexes](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20171020_inverted_indexes.md). + +### Query execution with JSON indexes + +**Query:** + +```graphql +Collection(filter: { + jsonField: { + user: { + device: { + model: {_eq: "iPhone"} + } + } + } +}) +``` + +**Without index:** + +1. Scan all documents +2. Parse each JSON field +3. Navigate to `user.device.model` +4. Compare value to "iPhone" +5. Return matches + +**With index:** + +1. Look up `/user.device.model/iPhone` in inverted index +2. Retrieve matching document IDs +3. Return those documents + +The indexed approach avoids JSON parsing and navigation during query execution. + +### Key format for JSON indexes + +DefraDB uses a hierarchical key format for JSON index entries: + +``` +/col_id/ind_id/json_path/value/_docID +``` + +Example (using numeric collection ID `1` and index ID `1`): + +``` +/1/1/user.device.model/iPhone/doc_id_1 +/1/1/user.location.city/Montreal/doc_id_1 +``` + +This format allows efficient prefix scanning for partial path matches and supports complex queries on nested JSON structures. + +## Performance considerations + +### Read vs write trade-off + +Every index improves read performance but adds write overhead. On reads, an `_eq` filter on an indexed field is O(1) for the lookup, plus O(m) to retrieve the m matching documents. 
On writes, each indexed field requires updating the index in addition to the document itself, so more indexes mean slower writes. + +### When to use indexes + +Fields that are frequently used in query filters, foreign key relationships, or uniqueness constraints are good candidates. Fields that are rarely queried, change frequently without being filtered, or are in large JSON/array structures with big data volumes are generally poor candidates. + +### Composite vs multiple single-field indexes + +A composite index like `@index(includes: [{field: "status"}, {field: "date"}])` is best when queries regularly filter on both fields together. Multiple single-field indexes offer more flexibility when queries filter on either field independently, at the cost of slightly slower multi-field queries. + +## Direction and ordering + +Index direction (ASC or DESC) plays a significant role primarily for **composite indexes**. For single-field indexes, the index fetcher can traverse entries in reverse order just as efficiently as the default order, so direction has minimal practical impact there. + +Direction is set with the `direction` argument. On a single field: + +```graphql +type Article { + publishedAt: DateTime @index(direction: DESC) +} +``` + +Each field in a composite index can have its own direction: + +```graphql +@index(includes: [ + {field: "status", direction: ASC}, + {field: "publishedAt", direction: DESC} +]) +``` + +When the index direction matches the query's sort order, DefraDB can use the index directly without a separate sorting step. + +## Managing indexes + +Indexes can be added or deleted at any time using CLI commands or the embedded client. GraphQL-based index management is not yet available. + +Refer to the CLI reference for commands to create and drop indexes on existing collections. + +## Limitations and considerations + +### Query pattern dependency + +Indexes only help queries that use the indexed fields. 
If your query patterns change, you may need to adjust your indexing strategy. + +### Write amplification + +Heavy indexing can significantly slow down write operations. Monitor write performance and adjust your indexing strategy if writes become a bottleneck. + +### Storage overhead + +Large collections with many indexes — especially on JSON or array fields — can consume significant disk space. Plan storage capacity accordingly. diff --git a/docs/defradb/How-to Guides/secondary-index-how-to.md b/docs/defradb/How-to Guides/secondary-index-how-to.md new file mode 100644 index 0000000..81990f3 --- /dev/null +++ b/docs/defradb/How-to Guides/secondary-index-how-to.md @@ -0,0 +1,313 @@ +--- +sidebar_label: Secondary index +sidebar_position: 10 +--- + + +# Secondary indexes + +This guide provides step-by-step instructions for creating and using secondary indexes in DefraDB to improve query performance. + +:::tip[Key Points] + +DefraDB's secondary indexing system enables efficient document lookups using the `@index` directive on GraphQL schema fields. Indexes trade write overhead for significantly faster read performance on filtered queries. + +**Best practices:** Index frequently filtered fields, avoid indexing rarely queried fields, and use unique indexes sparingly (they add a read operation on every write). Plan indexes based on query patterns to balance read/write performance. + +::: + +## Prerequisites + +Before following this guide, ensure you have: + +- DefraDB installed and running +- A defined schema for your collections +- Understanding of [secondary index concepts](/defradb/next/Concepts/secondary-index) + +## Create a basic index + +Add the `@index` directive to a field in your schema to create an index. + +### Index a single field + +```graphql +type User { + name: String @index + age: Int +} +``` + +This creates an ascending (ASC) index on the `name` field. 
+ +### Specify index direction + +```graphql +type User { + name: String @index(direction: DESC) + age: Int +} +``` + +Use `direction: DESC` for descending order or `direction: ASC` (default) for ascending order. + +:::note +Direction plays a significant role only for composite indexes. For single-field indexes, the fetcher can traverse entries in either direction equally efficiently. +::: + +### Add the schema + +```bash +defradb client schema add -f schema.graphql +``` + +## Manage indexes with the CLI + +Indexes can be added or deleted at any time using CLI commands — you do not need to redefine the schema from scratch. + +```bash +# Create an index on an existing collection +defradb client index create --collection User --fields name + +# Create a unique index +defradb client index create --collection User --fields email --unique + +# Drop an index +defradb client index drop --collection User --name + +# List indexes on a collection +defradb client index list --collection User +``` + +:::note +GraphQL-based index management is not yet available. Use the CLI or embedded client. +::: + +## Create a unique index + +Unique indexes ensure no two documents have the same value for the indexed field. + +```graphql +type User { + email: String @index(unique: true) + name: String +} +``` + +This prevents duplicate email addresses in your User collection. + +## Create a composite index + +Composite indexes span multiple fields, useful for queries filtering on multiple fields simultaneously. + +### Using schema-level directive + +```graphql +type User @index(includes: [{field: "name"}, {field: "age"}]) { + name: String + age: Int + email: String +} +``` + +### Specify different directions per field + +```graphql +type User @index(includes: [ + {field: "name", direction: ASC}, + {field: "age", direction: DESC} +]) { + name: String + age: Int +} +``` + +:::note +A composite index is only used when the query filters on the leading field(s) of the index. 
Filtering on only a non-leading field (e.g. `age` alone in the example above) will not use this index at all. +::: + +## Index relationships + +Index relationship fields to improve query performance across related documents. + +### Basic relationship index + +```graphql +type User { + name: String + age: Int + address: Address @primary @index +} + +type Address { + user: User + city: String @index + street: String +} +``` + +This indexes both: + +- The relationship between User and Address +- The city field in Address + +### Query with relationship index + +```graphql +query { + User(filter: { + address: {city: {_eq: "Montreal"}} + }) { + name + } +} +``` + +With the indexes, DefraDB: + +1. Quickly finds Address documents with `city = "Montreal"` +2. Retrieves the related User documents efficiently + +### Enforce unique relationships + +Use a unique index to enforce one-to-one relationships: + +```graphql +type User { + name: String + age: Int + address: Address @primary @index(unique: true) +} + +type Address { + user: User + city: String + street: String +} +``` + +This ensures no two Users can reference the same Address document. Note that 1-to-2-sided relations are automatically constrained by a unique index to enforce the 1-to-1 invariant. + +## Index JSON fields + +DefraDB supports indexing JSON fields for efficient queries on nested data. + +> **Storage warning:** Indexing JSON fields can consume significant disk space with large datasets, as every leaf node at every path is indexed separately. + +### Define a schema with JSON field + +```graphql +type Product { + name: String + metadata: JSON @index +} +``` + +### Query nested JSON paths + +```graphql +query { + Product(filter: { + metadata: { + user: { + device: { + model: {_eq: "iPhone"} + } + } + } + }) { + name + } +} +``` + +The index enables direct lookup of documents matching the nested path and value. + +## Name your indexes + +Assign custom names to indexes for easier identification. 
+ +```graphql +type User { + name: String @index(name: "user_name_idx") + email: String @index(name: "user_email_unique_idx", unique: true) +} +``` + +Default names are auto-generated from field names and direction. + +## Query patterns for best performance + +### Index frequently filtered fields + +```graphql +type Article { + title: String + content: String + status: String @index # Frequently filtered + publishedAt: DateTime @index # Frequently filtered + author: String +} +``` + +Index fields commonly used in `filter` clauses. + +### Use composite indexes for multi-field filters + +```graphql +type Article @index(includes: [ + {field: "status"}, + {field: "publishedAt"} +]) { + title: String + status: String + publishedAt: DateTime +} +``` + +```graphql +query { + Article(filter: { + status: {_eq: "published"} + publishedAt: {_gt: "2017-07-23T03:46:56-05:00"} + }) { + title + } +} +``` + +This composite index efficiently handles queries filtering on both `status` and `publishedAt` together. If you only filter on `publishedAt` alone, this index won't be used — add a separate single-field index on `publishedAt` if that query pattern is also common. + +### Avoid over-indexing + +Every index adds write overhead, so only index fields that are actually queried. Fields like `middleName` or `internalNote` that are rarely used in filters don't need indexes. + +## Performance considerations + +Analyze your application's queries and index the fields used in filters. Use the [explain systems](explain-systems-how-to.md) to verify that indexes are being used as expected. + +Unique indexes should be used only when uniqueness is a hard requirement — they require an additional read on every insert and update. For JSON and array fields, be mindful that indexing large datasets can consume significant disk space. + +## Troubleshooting + +### Queries still slow after adding indexes + +**Issue**: Query performance hasn't improved after adding indexes. 
+ +**Solutions**: + +- Verify the index was created successfully using `defradb client index list` +- Ensure your query filter uses the indexed field +- For composite indexes, confirm you are filtering on the leading field +- Check if you're querying in the reverse direction of a relationship (may need to index the other side) + +### Unique constraint violations + +**Issue**: Cannot insert documents due to unique index constraint. + +**Solution**: Check for existing documents with the same value. Unique indexes prevent duplicates, so you must either update the existing document or use a different value. + +### Write performance degraded + +**Issue**: Document creation/updates are slower after adding indexes. + +**Solution**: This is expected — indexes trade write performance for read performance. Review your indexes and remove any that aren't serving active query patterns. diff --git a/docs/defradb/How-to Guides/secondary-index.md b/docs/defradb/How-to Guides/secondary-index.md deleted file mode 100644 index 18d2d1f..0000000 --- a/docs/defradb/How-to Guides/secondary-index.md +++ /dev/null @@ -1,255 +0,0 @@ ---- -sidebar_label: Secondary Indexes -sidebar_position: 60 ---- -# Seconday Indexes - -:::tip[Key Points] - -DefraDB's secondary indexing system enables efficient document lookups using the `@index` directive on GraphQL schema fields. Indexes trade write overhead for significantly faster read performance on filtered queries. - -**Best practices:** Index frequently filtered fields, avoid indexing rarely queried fields, and use unique indexes sparingly (they add validation overhead). Plan indexes based on query patterns to balance read/write performance. - -::: - -## Introduction - -DefraDB provides a powerful and flexible secondary indexing system that enables efficient document lookups and queries. - -## Usage - -The `@index` directive can be used on GraphQL schema objects and field definitions to configure indexes. 
- -```graphql -@index(name: String, unique: Bool, direction: ORDERING, includes: [{ field: String, direction: ORDERING }]) -``` - -### `name` -Sets the index name. Defaults to concatenated field names with direction. - -### `unique` -Makes the index unique. Defaults to false. - -### `direction` -Sets the default index direction for all fields. Can be one of ASC (ascending) or DESC (descending). Defaults to ASC. - -If a field in the includes list does not specify a direction the default direction from this value will be used instead. - -### `includes` -Sets the fields the index is created on. - -When used on a field definition and the field is not in the includes list it will be implicitly added as the first entry. - -## Examples - -### Field level usage - -Creates an index on the User name field with DESC direction. - -```graphql -type User { - name: String @index(direction: DESC) -} -``` - -### Schema level usage - -Creates an index on the User name field with default direction (ASC). - -```graphql -type User @index(includes: {field: "name"}) { - name: String - age: Int -} -``` - -### Unique index - -Creates a unique index on the User name field with default direction (ASC). - -```graphql -type User { - name: String @index(unique: true) -} -``` - -### Composite index - -Creates a composite index on the User name and age fields with default direction (ASC). - -```graphql -type User @index(includes: [{field: "name"}, {field: "age"}]) { - name: String - age: Int -} -``` - -### Relationship index - -Creates a unique index on the User relationship to Address. The unique index constraint ensures that no two Users can reference the same Address document. - -```graphql -type User { - name: String - age: Int - address: Address @primary @index(unique: true) -} - -type Address { - user: User - city: String - street: String -} -``` - -## Performance considerations - -Indexes can greatly improve query performance, but they also impact system performance during writes. 
Each index adds write overhead since every document update must also update the relevant indexes. Despite this, the boost in read performance for indexed queries usually makes this trade-off worthwhile. - -#### To optimize performance: - -- Choose indexes based on your query patterns. Focus on fields frequently used in query filters to maximize efficiency. -- Avoid indexing rarely queried fields. Doing so adds unnecessary overhead. -- Be cautious with unique indexes. These require extra validation, making their performance impact more significant. - -Plan your indexes carefully to balance read and write performance. - -### Indexing related objects - -DefraDB supports indexing relationships between documents, allowing for efficient queries across related data. - -#### Example schema: Users and addresses - -```graphql -type User { - name: String - age: Int - address: Address @primary @index -} - -type Address { - user: User - city: String @index - street: String -} -``` - -Key indexes in this schema: - -- **City field in address:** Indexed to enable efficient queries by city. -- **Relationship between user and address**: Indexed to support fast lookups based on relationships. - -#### Query example - -The following query retrieves all users living in Montreal: - -```graphql -query { - User(filter: { - address: {city: {_eq: "Montreal"}} - }) { - name - } -} -``` - -#### How indexing improves efficiency - -**Without indexes:** -- Fetch all user documents. -- For each user, retrieve the corresponding Address. This approach becomes slow with large datasets. - -**With indexes:** -- Fetch address documents matching the city value directly. -- Retrieve the corresponding User documents. This method is much faster because indexes enable direct lookups. - -### Enforcing unique relationships -Indexes can also enforce one-to-one relationships. 
For instance, to ensure each User has exactly one unique Address: - -```graphql -type User { - name: String - age: Int - address: Address @primary @index(unique: true) -} - -type Address { - user: User - city: String @index - street: String -} -``` - -Here, the @index(unique: true) constraint ensures no two Users can share the same Address. Without it, the relationship defaults to one-to-many, allowing multiple Users to reference a single Address. - -By combining relationship indexing with cardinality constraints, you can create highly efficient and logically consistent data structures. - -## JSON field indexing - -DefraDB offers a specialized indexing system for JSON fields, designed to handle their hierarchical structure efficiently. - -### Overview - -JSON fields differ from other field types (e.g., Int, String, Bool) because they are semi-structured and encoded. DefraDB uses a path-aware system to manage these complexities, enabling traversal and indexing of all leaf nodes in a JSON document. - -### Example - -```json -{ - "user": { - "device": { - "model": "iPhone" - } - } -} -``` - -Here, the `iPhone` value is represented with its complete path: [`user`, `device`, `model`]. This path-aware representation ensures that the system knows not just the value, but where it resides within the document. - -Retrieve documents where the model is "iPhone": - -```graphql -query { - Collection(filter: { - jsonField: { - user: { - device: { - model: {_eq: "iPhone"} - } - } - } - }) -} -``` - -With indexes, the system directly retrieves matching documents, avoiding the need to scan and parse the JSON during queries. - -### How it works - -#### Inverted Indexes for JSON -DefraDB uses inverted indexes for JSON fields. These indexes reverse the traditional "document-to-value" relationship by starting with a value and quickly locating all documents containing that value. - -- Regular fields map to a single index entry. 
-- JSON fields generate multiple entries—one for each leaf node, incorporating both the path and the value. - -During indexing, the system traverses the entire JSON structure, creating these detailed index entries. - -#### Value normalization in JSON -DefraDB normalizes JSON leaf values to ensure consistency in ordering and comparisons. For example: - -- JSON values include their normalized value and path information. -- Scalar types (e.g., integers) are normalized to a standard type, such as `int64`. - -This ensures that operations like filtering and sorting are reliable and efficient. - -#### How indexing works -When indexing a document with JSON fields, the system: - -1. Traverses the JSON structure using the JSON interface. -1. Generates index entries for every leaf node, combining path and normalized value. -1. Stores entries efficiently, enabling direct querying. - -#### Benefits of JSON field indexing -- **Efficient queries**: Leverages inverted indexes for fast lookups, even in deeply nested structures. -- **Precise path tracking**: Maintains path information for accurate indexing and retrieval. -- **Scalable structure**: Handles complex JSON documents with minimal performance overhead.