359 changes: 359 additions & 0 deletions docs/defradb/Concepts/secondary-index.md
@@ -0,0 +1,359 @@
---
sidebar_label: Secondary index
sidebar_position: 10
---

# Secondary indexes

## Overview

Secondary indexes in DefraDB enable efficient document lookups by creating optimized data structures that map field values to documents. Instead of scanning entire collections, indexes allow DefraDB to quickly locate documents matching specific criteria.

**Key Points**

DefraDB's secondary indexing system uses the `@index` directive on GraphQL schema fields to create indexes that **significantly improve query performance on filtered queries**.

**Core capabilities:**

- **Field-level indexes** – Index individual fields for fast lookups
- **Composite indexes** – Index multiple fields together for complex queries
- **Unique constraints** – Enforce uniqueness at the index level
- **Relationship indexes** – Index foreign key relationships between documents
- **JSON field indexes** – Index nested paths within JSON fields using inverted indexes
- **Array field indexes** – Index values within array fields

**Performance trade-off:** Indexes improve read performance but add write overhead, because each document update must also update every relevant index. Indexes on array and JSON fields can consume storage quickly on large datasets.

**Best practices:** Index frequently filtered fields, avoid indexing rarely queried fields, and plan indexes based on your application's query patterns.

## How indexes work

### Basic concept

An index is a data structure that maps field values to document identifiers. Instead of scanning every document in a collection (a "table scan"), DefraDB can use the index to directly locate matching documents.

**Without an index:**

```
Query: Find users with age = 30
Process: Scan all user documents → Check each age field → Return matches
Cost: O(n) where n = total documents
```

**With an index on age:**

```
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n docs/defradb/Concepts/secondary-index.md | sed -n '40,55p'

Repository: sourcenetwork/docs.source.network

Length of output: 598


🏁 Script executed:

cat -n docs/defradb/Concepts/secondary-index.md | sed -n '300,315p'

Repository: sourcenetwork/docs.source.network

Length of output: 1059


🏁 Script executed:

wc -l docs/defradb/Concepts/secondary-index.md

Repository: sourcenetwork/docs.source.network

Length of output: 121


🏁 Script executed:

cd docs && find . -type f \( -name "*.md" -o -name "*.mdx" \) -exec grep -l "BadgerDB\|storage engine\|LSM" {} \;

Repository: sourcenetwork/docs.source.network

Length of output: 173


🏁 Script executed:

# Search for DefraDB architecture documentation
find docs -type f \( -name "*.md" -o -name "*.mdx" \) | xargs grep -i "backend\|storage" | head -20

Repository: sourcenetwork/docs.source.network

Length of output: 5221


🏁 Script executed:

# Look for any mentions of indexing complexity or performance
cat -n docs/defradb/Concepts/secondary-index.md | grep -i "o(1)\|o(log\|o(n)\|complexity\|time"

Repository: sourcenetwork/docs.source.network

Length of output: 752


🏁 Script executed:

# Search for BadgerDB mentions
find docs -type f \( -name "*.md" -o -name "*.mdx" \) | xargs grep -i "badger" 2>/dev/null

Repository: sourcenetwork/docs.source.network

Length of output: 664


🏁 Script executed:

# Check deployment.md for backend info
cat -n docs/defradb/How-to\ Guides/deployment.md | head -50

Repository: sourcenetwork/docs.source.network

Length of output: 3319


🏁 Script executed:

# Search for DefraDB core documentation about storage
find docs -type f -name "*.md" | xargs grep -i "storage engine\|database backend\|merkle" 2>/dev/null | head -15

Repository: sourcenetwork/docs.source.network

Length of output: 3389


🏁 Script executed:

# Look at the main DefraDB documentation structure
ls -la docs/defradb/

Repository: sourcenetwork/docs.source.network

Length of output: 572


🏁 Script executed:

cat -n docs/defradb/Concepts/secondary-index.md | sed -n '50,100p'

Repository: sourcenetwork/docs.source.network

Length of output: 1694


🏁 Script executed:

# Check if there's more context about index implementation
cat -n docs/defradb/Concepts/secondary-index.md | sed -n '1,60p'

Repository: sourcenetwork/docs.source.network

Length of output: 2792


🌐 Web query:

BadgerDB LSM tree lookup complexity point query performance

💡 Result:

Web search failed: Server error: no LLM provider could handle the message


Update index lookup complexity from O(1) to O(log n) to reflect BadgerDB's LSM-tree implementation.

DefraDB uses BadgerDB (an LSM-tree-based key-value store) as its default storage backend. Point lookups in LSM trees have worst-case complexity of O(log n) due to checking the memtable and potentially multiple SSTable levels, not O(1). Lines 48 and 308 should reflect this accurate complexity.

Changes needed

Line 48:

-Cost: O(1) for lookup + O(m) for retrieval where m = matching documents
+Cost: O(log n) for lookup + O(m) for retrieval where m = matching documents

Line 308:

-On reads, an `_eq` filter on an indexed field is O(1) for the lookup, plus O(m) to retrieve the m matching documents.
+On reads, an `_eq` filter on an indexed field is O(log n) for the lookup, plus O(m) to retrieve the m matching documents.
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 45-45: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/defradb/Concepts/secondary-index.md` at line 45, Update the documented
index lookup complexity in docs/defradb/Concepts/secondary-index.md: replace the
incorrect "O(1)" complexity with "O(log n)" in the two locations called out (the
text around line 48 and the paragraph around line 308) and mention that this
reflects BadgerDB's LSM-tree behavior (memtable + multiple SSTable levels) so
point lookups are O(log n) in the worst case.

Query: Find users with age = 30
Process: Look up "30" in age index → Return matching document IDs
Cost: O(1) for lookup + O(m) for retrieval where m = matching documents
```
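The contrast between the two access paths can be sketched with a toy in-memory model. This is only an illustration: the documents and docIDs are made up, and a Python dict is a hash map, whereas DefraDB keeps index entries in its key-value store, so the sketch shows why the scan is avoided rather than how the lookup is implemented.

```python
from collections import defaultdict

# Toy collection; docIDs and field values are invented for illustration.
users = {
    "doc_id_1": {"name": "Alice", "age": 30},
    "doc_id_2": {"name": "Bob", "age": 25},
    "doc_id_3": {"name": "Carol", "age": 30},
}

def full_scan(docs, field, value):
    """Check every document: work grows with the size of the collection."""
    return [doc_id for doc_id, doc in docs.items() if doc.get(field) == value]

def build_index(docs, field):
    """Map each field value to the docIDs that hold it."""
    index = defaultdict(list)
    for doc_id, doc in docs.items():
        index[doc[field]].append(doc_id)
    return index

age_index = build_index(users, "age")

scan_result = full_scan(users, "age", 30)   # touches all three documents
index_result = age_index.get(30, [])        # touches only the index entry
```

Both calls return the same docIDs; the difference is how many documents had to be inspected to find them.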

### Index structure

For regular indexes, DefraDB stores index entries as key-value pairs where the document ID is part of the key and the value is empty:

```
/col_id/ind_id/field_values/_docID → {}
```

For unique indexes, the document ID is stored as the value instead:

```
/col_id/ind_id/field_values → _docID
```

For a User collection with an indexed `name` field, the entries look like:

```
Index entries:
"Alice/doc_id_1" → {}
"Bob/doc_id_2" → {}
"Bob/doc_id_3" → {}
"Charlie/doc_id_4" → {}
```

When you query for `name = "Bob"`, DefraDB looks up "Bob" in the index and retrieves matching documents one by one (e.g., `doc_id_2`, then `doc_id_3`). If a `limit: 1` is applied, only the first match is fetched.
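A minimal sketch of that lookup, assuming the index keys are held in sorted order and read with a prefix scan. The collection and index IDs (`1` and `1`, `1` and `2`), the email address, and the `prefix_scan` helper are hypothetical; the key layout follows the patterns shown above.

```python
import bisect

# Regular index: the docID is the last key segment and the value is empty.
regular_keys = sorted([
    "/1/1/Alice/doc_id_1",
    "/1/1/Bob/doc_id_2",
    "/1/1/Bob/doc_id_3",
    "/1/1/Charlie/doc_id_4",
])

def prefix_scan(sorted_keys, prefix, limit=None):
    """Return docIDs for keys starting with prefix, stopping early at limit."""
    results = []
    start = bisect.bisect_left(sorted_keys, prefix)  # seek to the first candidate
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # left the matching range
        results.append(key.rsplit("/", 1)[1])
        if limit is not None and len(results) >= limit:
            break  # e.g. limit: 1 fetches only the first match
    return results

all_bobs = prefix_scan(regular_keys, "/1/1/Bob/")
first_bob = prefix_scan(regular_keys, "/1/1/Bob/", limit=1)

# Unique index: the key ends at the field value and the docID is the value.
unique_index = {"/1/2/alice@example.com": "doc_id_1"}
unique_hit = unique_index.get("/1/2/alice@example.com")
```

Because the docID is part of the key in a regular index, several documents can share a field value without their entries colliding; a unique index needs no such suffix.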

## Index types

### Single-field indexes

The simplest form of index covers a single field:

```graphql
type User {
  name: String @index
  email: String @index(unique: true)
}
```

Each indexed field creates a separate index structure. The `unique: true` parameter adds a constraint ensuring no duplicate values.

### Composite indexes

Composite indexes span multiple fields and are optimized for queries filtering on those fields together:

```graphql
type Article @index(includes: [
  {field: "status"},
  {field: "publishedAt"}
]) {
  status: String
  publishedAt: DateTime
}
```

**Index structure:**

```
published/2024-01-15/doc_id_1 → {}
published/2024-01-16/doc_id_2 → {}
published/2024-01-16/doc_id_3 → {}
draft/2024-01-15/doc_id_4 → {}
```

(Note: the `col_id` and `ind_id` prefix is always present but omitted here for clarity.)

Composite indexes are efficient for queries like:

```graphql
filter: {
  status: {_eq: "published"}
  publishedAt: {_gt: "2017-07-23T03:46:56-05:00"}
}
```

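Queries that filter only on the second field (`publishedAt` alone) will not use this index at all, because entries are ordered by `status` first and such a query has no `status` prefix to seek to.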
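The effect of field order can be sketched with sorted tuples standing in for the composite keys. This is a toy model under stated assumptions: the helper names are hypothetical, dates are compared as plain strings, and the `HIGH` sentinel is just a trick to seek past every docID for a given date.

```python
import bisect

# Composite entries ordered by status, then publishedAt, then docID.
keys = sorted([
    ("published", "2024-01-15", "doc_id_1"),
    ("published", "2024-01-16", "doc_id_2"),
    ("published", "2024-01-16", "doc_id_3"),
    ("draft", "2024-01-15", "doc_id_4"),
])

HIGH = chr(0x10FFFF)  # sorts after any docID used here

def eq_status_gt_date(sorted_keys, status, after):
    """Seek past (status, after, *) and read until the status changes."""
    start = bisect.bisect_right(sorted_keys, (status, after, HIGH))
    matches = []
    for entry_status, _, doc_id in sorted_keys[start:]:
        if entry_status != status:
            break
        matches.append(doc_id)
    return matches

def date_only(sorted_keys, date):
    """No leading field to seek on: every entry must be inspected."""
    return [doc_id for _, entry_date, doc_id in sorted_keys if entry_date == date]

recent_published = eq_status_gt_date(keys, "published", "2024-01-15")
same_day = date_only(keys, "2024-01-15")
```

The first query touches only the contiguous run of matching entries, while the second has to walk the whole structure, which is why a composite index is of no help to it.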

### Unique indexes

Unique indexes enforce uniqueness constraints at the database level:

```graphql
type User {
  email: String @index(unique: true)
}
```

When you try to create a document with a duplicate email, DefraDB will reject it. This is more efficient than manually checking for duplicates in your application code.

**Performance impact:** Unique indexes require an additional read operation on every insert or update to check for existing values.
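The extra read can be pictured with a small sketch. Everything here is hypothetical (the `UniqueIndexError` name, the `insert_user` helper, the dict standing in for the index); it only shows the check-then-write pattern, not DefraDB's actual code path.

```python
class UniqueIndexError(Exception):
    """Raised when a value already exists in the unique index (hypothetical name)."""

unique_email_index = {}  # field value -> docID, as in a unique index entry

def insert_user(doc_id, email):
    # Extra read: look for an existing entry before writing the new one.
    existing = unique_email_index.get(email)
    if existing is not None and existing != doc_id:
        raise UniqueIndexError(f"email {email!r} already used by {existing}")
    unique_email_index[email] = doc_id

insert_user("doc_id_1", "alice@example.com")
try:
    insert_user("doc_id_2", "alice@example.com")
    rejected = False
except UniqueIndexError:
    rejected = True
```

The second insert is rejected and the index still points at the first document.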

## Relationship indexing

### How relationship indexes work

When you index a relationship field, DefraDB creates an index on the foreign key reference:

```graphql
type User {
  address: Address @primary @index
}

type Address {
  city: String @index
}
```

This creates two indexes:

1. User → Address foreign key index
2. Address city field index

### Query optimization with relationship indexes

Consider this query:

```graphql
User(filter: {address: {city: {_eq: "Montreal"}}})
```

**Without indexes:**

1. Scan all User documents
2. For each User, fetch the related Address
3. Check if city matches "Montreal"
4. Return matching Users

**With indexes:**

1. Look up "Montreal" in the Address city index → Get Address IDs
2. Look up those Address IDs in the User→Address relationship index → Get User IDs
3. Retrieve those User documents

The indexed approach avoids scanning the entire User collection and performs direct lookups instead.
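The indexed path amounts to two chained lookups, which can be sketched as follows. The two dicts are stand-ins for the city index and the User→Address relationship index, and all IDs are invented for illustration.

```python
# Toy stand-ins for the two indexes; all IDs are made up.
address_city_index = {
    "Montreal": ["addr_1", "addr_2"],
    "Toronto": ["addr_3"],
}
user_address_index = {
    "addr_1": ["user_1"],
    "addr_2": ["user_2", "user_3"],
    "addr_3": ["user_4"],
}

def users_in_city(city):
    """Hop 1: city -> Address IDs. Hop 2: Address ID -> User IDs."""
    user_ids = []
    for address_id in address_city_index.get(city, []):
        user_ids.extend(user_address_index.get(address_id, []))
    return user_ids

montreal_users = users_in_city("Montreal")
```

Only the Addresses in Montreal and the Users pointing at them are touched; `user_4` is never read.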

### Enforcing relationship cardinality

Unique relationship indexes enforce one-to-one relationships:

```graphql
type User {
  address: Address @primary @index(unique: true)
}
```

Without the unique constraint, the relationship defaults to one-to-many (multiple Users could reference the same Address). The unique index ensures exactly one User per Address.

Note: 1-to-2-sided relations are automatically constrained by a unique index to enforce the 1-to-1 invariant.
⚠️ Potential issue | 🟡 Minor

Clarify the non-standard "1-to-2-sided" terminology.

"1-to-2-sided relations" is not a recognised cardinality term and is likely confusing to readers. Based on context (enforcing a 1-to-1 invariant), this probably means "one-to-one two-sided (bidirectional)" relations.

🔧 Proposed fix
-Note: 1-to-2-sided relations are automatically constrained by a unique index to enforce the 1-to-1 invariant.
+Note: One-to-one (bidirectional) relations are automatically constrained by a unique index to enforce the 1-to-1 invariant.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Note: 1-to-2-sided relations are automatically constrained by a unique index to enforce the 1-to-1 invariant.
Note: One-to-one (bidirectional) relations are automatically constrained by a unique index to enforce the 1-to-1 invariant.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/defradb/Concepts/secondary-index.md` at line 198, Replace the
non-standard phrase "1-to-2-sided relations" with a clearer term such as
"one-to-one two-sided (bidirectional) relations" and update the note so it reads
along the lines of: "Note: one-to-one two-sided (bidirectional) relations are
automatically constrained by a unique index to enforce the one-to-one
invariant." Ensure the replacement appears wherever the original phrase appears
in the Concepts/secondary-index.md content and preserve the intent about
automatic unique-index enforcement.


## JSON field indexing

JSON fields present unique indexing challenges because they're hierarchical and semi-structured. DefraDB uses a specialized approach to handle them efficiently.

> **Storage warning:** Indexing JSON fields can consume significant disk space with large data, as every leaf node at every path is indexed separately.

### Path-aware indexing

Unlike scalar fields (String, Int, Bool), JSON fields contain nested structures. DefraDB indexes every leaf node in the JSON tree along with its complete path:

**Example document:**

```json
{
  "user": {
    "device": {
      "model": "iPhone",
      "version": "15"
    },
    "location": {
      "city": "Montreal"
    }
  }
}
```

**Index entries created** (using `/col_id/ind_id/` prefix, JSON path parts separated by `.`):

```
/1/1/user.device.model/iPhone/doc_id_1 → {}
/1/1/user.device.version/15/doc_id_1 → {}
/1/1/user.location.city/Montreal/doc_id_1 → {}
```

Each entry includes the full path to the value, ensuring DefraDB knows not just what the value is, but where it exists within the document structure.
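A rough sketch of how such entries could be derived by walking the JSON tree. The `index_entries` and `to_key` helpers are hypothetical, the IDs `1` and `1` are placeholders, and the sketch handles nested objects only (arrays are left out).

```python
def index_entries(doc_id, value, path=()):
    """Return one (path, leaf value, docID) entry per leaf of a nested dict."""
    if isinstance(value, dict):
        entries = []
        for key, child in value.items():
            entries.extend(index_entries(doc_id, child, path + (key,)))
        return entries
    return [(".".join(path), value, doc_id)]

def to_key(col_id, ind_id, entry):
    """Format an entry in the /col_id/ind_id/path/value/docID layout."""
    json_path, leaf, doc_id = entry
    return f"/{col_id}/{ind_id}/{json_path}/{leaf}/{doc_id}"

document = {
    "user": {
        "device": {"model": "iPhone", "version": "15"},
        "location": {"city": "Montreal"},
    }
}

keys = [to_key(1, 1, e) for e in index_entries("doc_id_1", document)]
```

One document with three leaves produces three index entries, which is where the storage cost of JSON indexes comes from.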

### Inverted indexes for JSON

DefraDB uses **inverted indexes** for JSON fields. Each path through the JSON document, together with the leaf value it ends in, becomes an index key that maps back to the documents containing it.

For context, a primary (non-inverted) index might look like:

```
/1/1/iPhone → {"user": {"device": {"model": "iPhone"}}}
```

The inverted secondary index instead maps paths and values to document IDs:

```
/1/1/user.device.model/iPhone/doc_id_1 → {}
/1/1/user.device.model/Android/doc_id_2 → {}
```

When you query for a specific path and value, DefraDB directly looks it up in the inverted index and retrieves all matching documents. For more on inverted indexes, see the [CockroachDB RFC on inverted indexes](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20171020_inverted_indexes.md).

I'm not sure if we should reference CockroachDB in our documentation. @jsimnz what do you think?


### Query execution with JSON indexes

**Query:**

```graphql
Collection(filter: {
  jsonField: {
    user: {
      device: {
        model: {_eq: "iPhone"}
      }
    }
  }
})
```

**Without index:**

1. Scan all documents
2. Parse each JSON field
3. Navigate to `user.device.model`
4. Compare value to "iPhone"
5. Return matches

**With index:**

1. Look up `/user.device.model/iPhone` in the inverted index
2. Retrieve matching document IDs
3. Return those documents

The indexed approach avoids JSON parsing and navigation during query execution.

### Key format for JSON indexes

DefraDB uses a hierarchical key format for JSON index entries:

```
/<collection_id>/<index_id>/<json_path>/<json_value>/<doc_id>
```

Example (using numeric collection ID `1` and index ID `1`):

```
/1/1/user.device.model/iPhone/doc_id_1
/1/1/user.location.city/Montreal/doc_id_1
```

This format allows efficient prefix scanning for partial path matches and supports complex queries on nested JSON structures.

## Performance considerations

### Read vs write trade-off

Every index improves read performance but adds write overhead. On reads, an `_eq` filter on an indexed field is O(1) for the lookup, plus O(m) to retrieve the m matching documents. On writes, each indexed field requires updating the index in addition to the document itself — so more indexes means slower writes.
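The write-side cost can be estimated with a back-of-the-envelope sketch. It assumes, purely for illustration, that changing an indexed field removes one old entry and adds one new entry; the `index_writes_for_update` helper and the sample documents are hypothetical and do not describe DefraDB's actual write path.

```python
def index_writes_for_update(old_doc, new_doc, indexed_fields):
    """Count index entry writes caused by an update, under the assumption above."""
    writes = 0
    for field in indexed_fields:
        if old_doc.get(field) != new_doc.get(field):
            writes += 2  # remove the old entry, add the new one
    return writes

old = {"name": "Bob", "age": 30, "bio": "x"}
new = {"name": "Bob", "age": 31, "bio": "y"}

two_indexes = index_writes_for_update(old, new, ["name", "age"])
three_indexes = index_writes_for_update(old, new, ["name", "age", "bio"])
```

Adding an index on `bio` doubles the index maintenance for this particular update, even though the document write itself is unchanged.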

### When to use indexes

Good candidates are fields that appear frequently in query filters, foreign key relationships, or uniqueness constraints. Poor candidates are fields that are rarely queried, fields that change often but are seldom filtered on, and large JSON or array fields in collections with high data volumes.

### Composite vs multiple single-field indexes

A composite index like `@index(includes: [{field: "status"}, {field: "date"}])` is best when queries regularly filter on both fields together. Multiple single-field indexes offer more flexibility when queries filter on either field independently, at the cost of slightly slower multi-field queries.

## Direction and ordering

Index direction (ASC or DESC) plays a significant role primarily for **composite indexes**. For single-field indexes, the index fetcher can traverse entries in reverse order just as efficiently as the default order, so direction has minimal practical impact there.
⚠️ Potential issue | 🟡 Minor

Single-field direction example is misplaced under the "composite index" direction sub-section.

Line 322 says "For composite indexes, specifying direction can matter:" but the code block that immediately follows (Lines 324–328) demonstrates a single-field index (publishedAt: DateTime @index(direction: DESC)). The composite example doesn't appear until Lines 332–337. This contradicts the preceding prose (Line 320) which explicitly states direction has minimal impact on single-field indexes.

Either move the single-field snippet to illustrate the single-field case earlier, or replace it with a composite index example that matches the section heading.

🔧 Proposed fix — replace misplaced single-field example with the composite one
 For composite indexes, specifying direction can matter:
 
-```graphql
-type Article {
-  publishedAt: DateTime `@index`(direction: DESC)
-}
-```
-
-Each field in a composite index can have its own direction:
-
 ```graphql
 `@index`(includes: [
   {field: "status", direction: ASC},
   {field: "publishedAt", direction: DESC}
 ])
 ```





Also applies to: 322-322, 324-324, 328-328

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/defradb/Concepts/secondary-index.md at line 320, The single-field
example showing publishedAt: DateTime @index(direction: DESC) is incorrectly
placed under the "composite index" direction subsection; replace that
single-field snippet with a composite-index example that matches the heading —
use an @index(includes: [...]) example showing per-field directions (e.g.,
{field: "status", direction: ASC}, {field: "publishedAt", direction: DESC}) and
move or re-add the single-field publishedAt example to the earlier single-field
discussion if you want to keep it as an illustration; ensure the composite
@index(includes: ...) snippet replaces the current single-field code block so
the prose and example align.


</details>



For composite indexes, specifying direction can matter:

```graphql
type Article {
  publishedAt: DateTime @index(direction: DESC)
}
```

Each field in a composite index can have its own direction:

```graphql
@index(includes: [
  {field: "status", direction: ASC},
  {field: "publishedAt", direction: DESC}
])
```

When the index direction matches the query's sort order, DefraDB can use the index directly without a separate sorting step.
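A small sketch of why matching directions removes the sort step, using a list ordered the way a `status` ASC, `publishedAt` DESC index would hold its entries. The data is invented, and the two-pass stable sort is only a convenient way to build that ordering in Python.

```python
entries = [
    ("published", "2024-01-15", "doc_id_1"),
    ("published", "2024-01-16", "doc_id_2"),
    ("published", "2024-01-16", "doc_id_3"),
    ("draft", "2024-01-15", "doc_id_4"),
]

# Build the stored order: status ASC, then publishedAt DESC.
# Python's sort is stable, so sorting by the secondary key first works.
by_date_desc = sorted(entries, key=lambda e: e[1], reverse=True)
stored_order = sorted(by_date_desc, key=lambda e: e[0])

# A query ordered by status ASC, publishedAt DESC can stream entries as stored.
streamed = [doc_id for _, _, doc_id in stored_order]
```

A query asking for a different mix of directions, for example both fields ascending, could not simply read this list front to back or back to front and would need a separate sort.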

## Managing indexes

Indexes can be added or deleted at any time using CLI commands or the embedded client. GraphQL-based index management is not yet available.

Refer to the CLI reference for commands to create and drop indexes on existing collections.

## Limitations and considerations

### Query pattern dependency

Indexes only help queries that use the indexed fields. If your query patterns change, you may need to adjust your indexing strategy.

### Write amplification

Heavy indexing can significantly slow down write operations. Monitor write performance and adjust your indexing strategy if writes become a bottleneck.

### Storage overhead

Large collections with many indexes — especially on JSON or array fields — can consume significant disk space. Plan storage capacity accordingly.