diff --git a/docs/indexing/fts-index.mdx b/docs/indexing/fts-index.mdx
index 64fac40..056589d 100644
--- a/docs/indexing/fts-index.mdx
+++ b/docs/indexing/fts-index.mdx
@@ -34,6 +34,8 @@ Check FTS index status using the API:
+`wait_for_index(...)` waits until the named FTS index exists and `index_stats(...)` reports `num_unindexed_rows == 0`. It can time out if writes keep adding rows faster than the index catches up. If a table has multiple FTS indexes, specify the target text column when querying instead of relying on implicit selection.
+
### Asynchronous API
When using async connections (`connect_async`), use `create_index` with the `FTS` configuration:
@@ -48,6 +50,8 @@ When using async connections (`connect_async`), use `create_index` with the `FTS
The `create_fts_index` method is not available on `AsyncTable`. Use `create_index` with `FTS` config instead.
+In TypeScript, create an FTS index with `table.createIndex("text", { config: lancedb.Index.fts() })` and query it with `table.query().nearestToText(...)`.
+
## Configuration Options
### FTS Parameters
diff --git a/docs/indexing/gpu-indexing.mdx b/docs/indexing/gpu-indexing.mdx
index 325a217..70240a1 100644
--- a/docs/indexing/gpu-indexing.mdx
+++ b/docs/indexing/gpu-indexing.mdx
@@ -26,6 +26,8 @@ into a synchronous process by waiting until the index is built.
+`wait_for_index(...)` waits for the index to exist and for `index_stats(...)` to report `num_unindexed_rows == 0`. It can time out if the table is receiving continuous writes while the build is trying to catch up.
+
## Manual GPU indexing in LanceDB OSS
You can use the Python SDK to manually create the `IVF_PQ` index on a GPU. You'll need
@@ -62,4 +64,3 @@ to enable GPU training on your device.
If you encounter the error `AssertionError: Torch not compiled with CUDA enabled`,
you need to [install PyTorch with CUDA support](https://pytorch.org/get-started/locally/).
-
diff --git a/docs/indexing/index.mdx b/docs/indexing/index.mdx
index 35d5634..a6e72a8 100644
--- a/docs/indexing/index.mdx
+++ b/docs/indexing/index.mdx
@@ -39,6 +39,12 @@ LanceDB provides a comprehensive suite of indexing strategies for different data
TypeScript currently doesn't support `IvfSq` (IVF with Scalar Quantization).
+
+**Operational checks**
+
+For vector indexes, use the same distance metric when creating the index and searching it. After appends or other writes, use `optimize()` to fold new rows into existing indexes, then check `index_stats(...)` or call `wait_for_index(...)` if you need to confirm the index has caught up. `wait_for_index(...)` waits until the named indexes exist and report `num_unindexed_rows == 0`; it can time out if writes keep adding unindexed rows.
+
+
### Quantization Types
Vector indexes can use different quantization methods to compress vectors and improve search performance:
diff --git a/docs/indexing/quantization.mdx b/docs/indexing/quantization.mdx
index d6fda72..a6057aa 100644
--- a/docs/indexing/quantization.mdx
+++ b/docs/indexing/quantization.mdx
@@ -15,12 +15,15 @@ Use quantization when:
LanceDB currently exposes multiple quantized vector index types, including:
- `IVF_PQ` -- Inverted File index with Product Quantization (default). See the [vector indexing guide](/indexing/vector-index) for `IVF_PQ` examples.
+- `IVF_SQ` -- Inverted File index with Scalar Quantization. This is available in Python and Rust; TypeScript does not currently expose `IvfSq`.
- `IVF_RQ` -- Inverted File index with **RaBitQ** quantization (binary, 1 bit per dimension). Requires vector dimensions divisible by `8`. See [below](#rabitq-quantization) for details.
- `IVF_HNSW_SQ` -- IVF partitions with an **HNSW graph per partition** plus **Scalar Quantization**. Strong recall/latency/size trade-off for most workloads.
- `IVF_HNSW_PQ` -- IVF partitions with an **HNSW graph per partition** plus **Product Quantization**. Prefer when PQ-level compression matters and you still want HNSW-style in-partition search.
Two axes are being combined here: whether partitions are searched flatly or via an HNSW graph (`IVF_*` vs. `IVF_HNSW_*`), and which quantizer compresses the vectors (`PQ`, `RQ`, or `SQ`). `IVF_PQ` is the default and works well in many cases. For more drastic compression, RaBitQ (`IVF_RQ`) is a reasonable option. For higher recall at low latency, the HNSW-backed variants are usually the right pick. The ["Choose the Right Index"](/indexing/vector-index#choose-the-right-index) table on the vector indexing page is the canonical decision tool.
+Use the same distance metric when training the index and running queries against it. For IVF-based indexes, `num_partitions` sets the number of IVF partitions the vectors are grouped into, and `sample_rate` sets how many training vectors are sampled per partition, so the training sample is roughly `sample_rate * num_partitions` vectors.
+
## RaBitQ quantization
RaBitQ is a binary quantization method that represents each normalized embedding using **1 bit per dimension**, plus a couple of small corrective scalars. In practice, a 1,024-dimensional `float32` vector that would normally take 4 KB can be compressed to roughly a few hundred bytes with RaBitQ, while still maintaining reasonable recall.
diff --git a/docs/indexing/reindexing.mdx b/docs/indexing/reindexing.mdx
index 76c9ffd..b90516a 100644
--- a/docs/indexing/reindexing.mdx
+++ b/docs/indexing/reindexing.mdx
@@ -23,7 +23,7 @@ Table optimization performs three maintenance operations:
1. **Compaction**: merges small fragments into larger ones to improve read performance
2. **Pruning/Cleanup**: removes files from versions older than a retention window (7 days by default)
-3. **Index update**: adds newly-ingested data to existing indexes
+3. **Index update**: adds newly-ingested data to existing vector, scalar, and FTS indexes
@@ -36,7 +36,7 @@ Table optimization performs three maintenance operations:
LanceDB Enterprise support incremental reindexing through an automated background process. When new data is added to a table, the system automatically triggers a new index build. As the dataset grows, indexes are asynchronously updated in the background.
- While indexes are being rebuilt, queries use brute force methods on unindexed rows, which may temporarily increase latency. To avoid this, set `fast_search=True` to search only indexed data.
-- Use `index_stats()` to view the number of unindexed rows. This will be zero when indexes are fully up-to-date.
+- Use `index_stats()` to view the number of unindexed rows. This will be zero when indexes are fully up-to-date. If you call `wait_for_index(...)`, it polls the same status and can time out while continuous writes keep adding unindexed rows.
The benefit of using LanceDB Enterprise is that it automates the reindexing process
and operates continuously in the background, minimizing the impact on latency under high loads.
@@ -57,4 +57,3 @@ If you need to reclaim space more aggressively in OSS, use a shorter retention w
```
-
diff --git a/docs/indexing/scalar-index.mdx b/docs/indexing/scalar-index.mdx
index e41980b..686c8da 100644
--- a/docs/indexing/scalar-index.mdx
+++ b/docs/indexing/scalar-index.mdx
@@ -57,6 +57,8 @@ If you are using LanceDB Enterprise, the `create_scalar_index` API returns immed
+`wait_for_index(...)` waits until the named scalar indexes exist and `index_stats(...)` reports `num_unindexed_rows == 0`. If a table is receiving steady writes, that fully indexed state may not stabilize before the timeout.
+
### 3. Update the Index
Updating the table data (adding, deleting, or modifying records) requires that you also update the scalar index. This can be done by calling `optimize`, which will trigger an update to the existing scalar index.
@@ -139,4 +141,3 @@ LanceDB supports scalar indexes on UUID columns (stored as `FixedSizeBinary(16)`
{ScalarIndexUuidUpsert}
-
diff --git a/docs/indexing/vector-index.mdx b/docs/indexing/vector-index.mdx
index 6054595..5eea636 100644
--- a/docs/indexing/vector-index.mdx
+++ b/docs/indexing/vector-index.mdx
@@ -53,6 +53,8 @@ You can call `create_index()` with different parameters to create a new index --
Although the `create_index` API returns immediately, the building of the vector index is asynchronous. To wait until all data is fully indexed, you can specify the `wait_timeout` parameter.
+Use the same distance metric for index creation and search. Once a vector index exists, queries use the metric stored with that index. If you need to confirm an async build or refresh is finished, `wait_for_index(...)` waits for the named index to exist and for `index_stats(...)` to report `num_unindexed_rows == 0`; it can time out if new writes keep arriving.
+
## Choose the Right Index
Use this table as a quick starting point for choosing the right index type and quantization method for your use case:
diff --git a/docs/reranking/custom-reranker.mdx b/docs/reranking/custom-reranker.mdx
index da9015d..a9bbd64 100644
--- a/docs/reranking/custom-reranker.mdx
+++ b/docs/reranking/custom-reranker.mdx
@@ -16,6 +16,9 @@ cover, and only override the ones you need. The base class leaves `rerank_vector
overridden raises `NotImplementedError` rather than silently returning unsorted results. That's a
useful guard, but worth knowing about before you wire up a query path you didn't plan for.
+The Python base class exposes hybrid, vector-only, and FTS-only rerank hooks. TypeScript and Rust
+currently expose the custom reranker interface for hybrid reranking.
+
## Interface
The `Reranker` base interface comes with a `merge_results()` method that can be used to combine the
@@ -23,7 +26,8 @@ results of semantic and full-text search. This is a vanilla merging algorithm th
the results and removes the duplicates without taking the scores into consideration. It only keeps the
first copy of the row encountered. This works well in cases that don't require the scores of semantic
and full-text search to combine the results. If you want to use the scores or want to support
-`return_score="all"`, you'll need to implement your own merging algorithm.
+`return_score="all"`, you'll need to implement your own merging algorithm. The base
+`return_score` option accepts only `"relevance"` and `"all"`.
Whichever methods you override, your reranker has one job on the way out: attach a
`_relevance_score` column with the most relevant rows at the top. LanceDB will reject the result
diff --git a/docs/reranking/eval.mdx b/docs/reranking/eval.mdx
index 9fa06f2..6e37155 100644
--- a/docs/reranking/eval.mdx
+++ b/docs/reranking/eval.mdx
@@ -26,6 +26,11 @@ score-based path most readers encounter first; `LinearCombinationReranker` is an
score-based strategy you opt into explicitly.
+By default, rerankers return `_relevance_score`. Pass `return_score="all"` when a reranker
+supports it and you also need the original vector or FTS scores for debugging.
+Evaluation code can rely on returned rows being ordered by descending `_relevance_score`. Empty
+reranked result sets still include the `_relevance_score` column.
+
The hybrid `rerank(...)` method also accepts a `normalize` argument that controls how the raw
vector and FTS scores are made comparable before reranking:
diff --git a/docs/reranking/index.mdx b/docs/reranking/index.mdx
index b3c7d05..0cdcca7 100644
--- a/docs/reranking/index.mdx
+++ b/docs/reranking/index.mdx
@@ -14,6 +14,9 @@ with models from Cohere, Sentence-Transformers, and more.
To use a reranker, you perform a search and then pass the results to the `rerank()` method.
+Note that `CohereReranker()` requires the `cohere` package and either
+`COHERE_API_KEY` in the environment or an `api_key` argument.
+
```python Python icon="python"
import lancedb
@@ -42,6 +45,17 @@ LanceDB supports several rerankers out of the box. Here are a few examples:
You can find more details about these and other rerankers in the [integrations](/integrations/reranking) section.
+Python also includes score-based rerankers such as `RRFReranker`, `LinearCombinationReranker`,
+and `MRRReranker`, plus provider rerankers for OpenAI, Jina, Voyage AI, Answer.AI, and Cohere.
+Provider rerankers usually need the provider package installed and either an API key argument or
+the provider-specific environment variable.
+
+Rerankers add `_relevance_score` and return rows ordered by descending relevance. Python rerankers
+accept `return_score="relevance"` or `return_score="all"`; use `"all"` when you want to keep the
+original vector distance or FTS score columns for debugging. Model-based rerankers read from
+`column="text"` by default, so either return that column in the search results or pass a different
+column.
+
**SDK coverage differs across languages**
@@ -84,4 +98,4 @@ the `deduplicate` flag.
LanceDB also allows you to create custom rerankers by extending the base `Reranker` class. The custom reranker
should implement the `rerank` method that takes a list of search results and returns a reranked list of
-search results. This is covered in more detail in the [creating custom rerankers](/reranking/custom-reranker/) section.
\ No newline at end of file
+search results. This is covered in more detail in the [creating custom rerankers](/reranking/custom-reranker/) section.
diff --git a/docs/search/filtering.mdx b/docs/search/filtering.mdx
index 99a7b9a..6602430 100644
--- a/docs/search/filtering.mdx
+++ b/docs/search/filtering.mdx
@@ -12,6 +12,8 @@ with filtering capabilities even on datasets containing billions of records.
**Pre-filtering** means LanceDB applies the metadata `where(...)` condition before running vector search, so the search only considers rows that already match the filter. **Post-filtering** means LanceDB runs vector search first and only then filters the returned candidates. Pre-filtering is enabled by default. In practice, pre-filtering is better when the filter is part of the result contract; post-filtering can be lower-latency for expensive or non-indexable filters, but it can return fewer than `limit` rows, or even zero, if the nearest neighbors do not pass the filter.
+On hybrid queries, the same `where(...)` filter is applied to both the vector and full-text halves of the query. The prefilter or postfilter choice controls whether that happens before each subquery scores candidates or after the subquery top-k is produced.
+
## Example: Metadata Filtering
To illustrate filtering capabilities, let's try four data points with combinations of vectors and metadata:
@@ -228,4 +230,4 @@ For a column of type LIST(T), you can use `LABEL_LIST` to create a scalar index.
Both **pre-filtering** and **post-filtering** can yield false positives. For pre-filtering, if the filter is too selective, it might eliminate relevant items that the vector search would have otherwise identified as a good match. In this case, increasing `nprobes` parameter will help reduce such false positives. It is recommended to call `bypass_vector_index()` if you know that the filter is highly selective.
-Similarly, a highly selective post-filter can lead to false positives. Increasing both `nprobes` and `refine_factor` can mitigate this issue. When deciding between pre-filtering and post-filtering, pre-filtering is generally the safer choice if you're uncertain.
\ No newline at end of file
+Similarly, a highly selective post-filter can lead to false positives. Increasing both `nprobes` and `refine_factor` can mitigate this issue. When deciding between pre-filtering and post-filtering, pre-filtering is generally the safer choice if you're uncertain.
diff --git a/docs/search/full-text-search.mdx b/docs/search/full-text-search.mdx
index de77d72..c80a9ed 100644
--- a/docs/search/full-text-search.mdx
+++ b/docs/search/full-text-search.mdx
@@ -130,6 +130,8 @@ If you want to specify which columns to search use `fts_columns="text"`
LanceDB automatically searches on the existing FTS index if the input to the search is of type `str`. If you provide a vector as input, LanceDB will search the ANN index instead.
+If a table has more than one FTS index, specify the indexed text column in the query. In Python you can use `fts_columns` or the query builder's `nearest_to_text(..., columns=...)`; in TypeScript, use `query().nearestToText(..., columns)`. The newer Lance-native FTS does not accept legacy Tantivy-only index parameters.
+
### Keeping the index up to date
Rows you add after building an FTS index aren't part of the index until you optimize the table. Until then, queries fall back to a flat scan over the unindexed fragments to keep results complete, which slows them down as the unindexed tail grows. Call `table.optimize()` to fold new rows into the existing index — it's the same operation used for vector indexes:
diff --git a/docs/search/hybrid-search.mdx b/docs/search/hybrid-search.mdx
index e6bf4da..d5785f3 100644
--- a/docs/search/hybrid-search.mdx
+++ b/docs/search/hybrid-search.mdx
@@ -244,6 +244,10 @@ text_query = "flower moon"
Hybrid queries inherit the same builder API as vector and FTS queries, so the same knobs for filtering, distance bounds, and row identity apply. These compose with `.rerank(...)` and the explicit `.vector()` / `.text()` form shown above.
+
+Always set `.limit(...)` on production hybrid queries. Without an explicit cap, you have no fixed top-k to tune against, and the query may materialize more rows than you intended before reranking.
+
+
### Returning row IDs
Pass `with_row_id(True)` (Python) or `withRowId()` (TypeScript) to include the internal `_rowid` column in the results. This is useful for joining hybrid results back to a primary table, or for deduping across multiple queries:
diff --git a/docs/search/index.mdx b/docs/search/index.mdx
index e5c822b..739ce0a 100644
--- a/docs/search/index.mdx
+++ b/docs/search/index.mdx
@@ -13,3 +13,10 @@ icon: "list"
| [Hybrid Search](/search/hybrid-search/) | Combines vector and full-text search with reranking |
| [Filtering](/search/filtering/) | Filter results based on metadata fields |
| [SQL Queries](/search/sql/index) | SQL query capabilities for data exploration and analytics |
+
+## Before you search
+
+- Vector search can run without an ANN index as an exhaustive scan. That's useful while prototyping, but build a vector index before relying on low-latency searches over larger tables.
+- Full-text and hybrid text search require an FTS index on the text column you query. If a table has multiple FTS indexes, specify the target column. FTS also supports phrase, boolean, boosted, multi-match, and fuzzy query forms when you need more than plain terms.
+- Multivector search currently uses cosine similarity and accepts either one query vector or a matrix of query vectors; every query vector must match the inner dimension of the multivector column.
+- Set an explicit `.limit(...)` for production queries. Query builders also support controls such as prefilter/postfilter, distance ranges, row-id inclusion, offset pagination, and Arrow/Pandas/list result materialization.
diff --git a/docs/search/multivector-search.mdx b/docs/search/multivector-search.mdx
index 65900d3..429ce25 100644
--- a/docs/search/multivector-search.mdx
+++ b/docs/search/multivector-search.mdx
@@ -19,6 +19,8 @@ Each item in your dataset can have a column containing multiple vectors, which L
Currently, only the `cosine` metric is supported for multivector search. The vector value type can be `float16`, `float32`, or `float64`.
+Each query vector must match the inner vector dimension in the multivector column. This applies to both single-vector queries and multi-vector query matrices.
+
## Computing Similarity
MaxSim (Maximum Similarity) is a key concept in late-interaction models that:
diff --git a/docs/search/optimize-queries.mdx b/docs/search/optimize-queries.mdx
index 512dd64..3b9d7b2 100644
--- a/docs/search/optimize-queries.mdx
+++ b/docs/search/optimize-queries.mdx
@@ -33,6 +33,11 @@ Executes the query and provides detailed runtime metrics, including:
Together, these tools offer a comprehensive view of query performance, from planning to execution. Use `explain_plan` to verify your query structure and `analyze_plan` to measure and optimize actual performance.
+Metadata filters are prefiltered by default, which usually shows the filter pushed into the
+`LanceScan` or index scan. If you set `prefilter=False`, expect a separate `FilterExec` after
+search instead; that can be useful for some expensive filters, but it changes both latency and
+the number of rows available after filtering.
+
## Reading the Execution Plan
To demonstrate query performance analysis, we'll use a table containing 1.2M rows sampled from the [Wikipedia dataset](https://huggingface.co/datasets/wikimedia/wikipedia). Initially, the table has no indices, allowing us to observe the impact of optimization.
diff --git a/docs/search/sql/fts-sql.mdx b/docs/search/sql/fts-sql.mdx
index f45aa43..91bc70e 100644
--- a/docs/search/sql/fts-sql.mdx
+++ b/docs/search/sql/fts-sql.mdx
@@ -15,6 +15,8 @@ thoroughly and being prepared to update your queries as newer versions of LanceD
LanceDB provides support for full-text search via SQL queries using the `fts()` User-Defined Table Function (UDTF). This allows you to incorporate keyword-based search (based on BM25) in your SQL queries for powerful text retrieval.
+The SQL `fts()` table function expects exactly two string literals: the table name and the JSON FTS query. Build the JSON query in your application, pass it as a SQL string literal, and keep filtering, grouping, or joining in the surrounding SQL.
+
## Table Setup
First, set up your FlightSQL client connection. See [SQL Queries documentation](/search/sql) for detailed client setup instructions.
diff --git a/docs/search/sql/index.mdx b/docs/search/sql/index.mdx
index db80afe..6551982 100644
--- a/docs/search/sql/index.mdx
+++ b/docs/search/sql/index.mdx
@@ -41,6 +41,8 @@ you can use a wide variety of SQL syntax and functions to query your data. For
information on the SQL syntax and functions supported by DataFusion, please refer to the
[DataFusion documentation](https://datafusion.apache.org/user-guide/sql/index.html).
+The FlightSQL endpoint executes one SQL statement per request and is intended for queries. Use the LanceDB SDKs for DDL and table-management operations such as creating tables, adding columns, or building indexes.
+
### Setting Up the Client
Establish a connection to your LanceDB Enterprise SQL endpoint using your preferred FlightSQL client:
@@ -152,4 +154,6 @@ for (const row of (await plan.collectToObjects()) as Array<{ plan_type: string;
```
-The operators that show up in the SQL plan are the same ones documented on the [Optimize Query Performance](/search/optimize-queries) page (`LanceScan`, `ScalarIndexQuery`, `KNNVectorDistance`, `ANNIvfPartition`, and so on), so the same reasoning about index coverage and filter pushdown applies — just read the plan from a SQL client instead of a query builder.
\ No newline at end of file
+The operators that show up in the SQL plan are the same ones documented on the [Optimize Query Performance](/search/optimize-queries) page (`LanceScan`, `ScalarIndexQuery`, `KNNVectorDistance`, `ANNIvfPartition`, and so on), so the same reasoning about index coverage and filter pushdown applies; just read the plan from a SQL client instead of a query builder.
+
+For full-text search from SQL, use the dedicated [`fts()` table function](/search/sql/fts-sql). It takes two string-literal arguments: the table name and the JSON-encoded FTS query, which can include operator-style terms, OR, phrase, and fuzzy query forms.
diff --git a/docs/search/vector-search.mdx b/docs/search/vector-search.mdx
index e3731b4..7f3046d 100644
--- a/docs/search/vector-search.mdx
+++ b/docs/search/vector-search.mdx
@@ -64,6 +64,8 @@ const results2 = await (
Here you can see the same search but using `cosine` similarity instead of `l2` distance. The result focuses on vector direction rather than absolute distance, which works better for normalized embeddings.
+Set `.limit(...)` on vector searches you run in applications. You can page through results with `.offset(...)` and include LanceDB's internal row id with `.with_row_id()` / `.withRowId()` when you need a stable handle for follow-up operations.
+
## Vector Search With ANN Index
Instead of performing an exhaustive search on the entire database for each and every query, approximate nearest neighbour (ANN) algorithms use an index to narrow down the search space, which significantly reduces query latency.
@@ -271,6 +273,7 @@ the filter condition after obtaining the nearest neighbors based on vector simil
Use multivector search when your documents contain multiple embeddings and you need sophisticated matching between query and document vector pairs. The late interaction approach finds the most relevant combinations across all available embeddings and provides nuanced similarity scoring.
Only `cosine` similarity is supported as the distance metric for multivector search operations.
+Every query vector must match the inner dimension of the multivector column; LanceDB rejects mismatched query dimensions rather than guessing how to reshape them.
```python Python icon="python"
diff --git a/docs/snippets/search.mdx b/docs/snippets/search.mdx
index 2c61812..591ef0b 100644
--- a/docs/snippets/search.mdx
+++ b/docs/snippets/search.mdx
@@ -8,10 +8,10 @@ export const PyBasicHybridSearch = "data = [\n {\"text\": \"rebel spaceships
export const PyBasicHybridSearchAsync = "uri = \"data/sample-lancedb\"\nasync_db = await lancedb.connect_async(uri)\ndata = [\n {\"text\": \"rebel spaceships striking from a hidden base\"},\n {\"text\": \"have won their first victory against the evil Galactic Empire\"},\n {\"text\": \"during the battle rebel spies managed to steal secret plans\"},\n {\"text\": \"to the Empire's ultimate weapon the Death Star\"},\n]\nasync_tbl = await async_db.create_table(\"documents_async\", schema=Documents)\n# ingest docs with auto-vectorization\nawait async_tbl.add(data)\n# Create a fts index before the hybrid search\nawait async_tbl.create_index(\"text\", config=FTS())\ntext_query = \"flower moon\"\n# hybrid search with default re-ranker\nawait (await async_tbl.search(\"flower moon\", query_type=\"hybrid\")).to_pandas()\n";
-export const PyClassDocuments = "class Documents(LanceModel):\n vector: Vector(embeddings.ndims()) = embeddings.VectorField()\n text: str = embeddings.SourceField()\n";
-
export const PyClassDefinition = "class Metadata(BaseModel):\n source: str\n timestamp: datetime\n\n\nclass Document(BaseModel):\n content: str\n meta: Metadata\n\n\nclass LanceSchema(LanceModel):\n id: str\n vector: Vector(1536)\n payload: Document\n";
+export const PyClassDocuments = "class Documents(LanceModel):\n vector: Vector(embeddings.ndims()) = embeddings.VectorField()\n text: str = embeddings.SourceField()\n";
+
export const PyCreateTableAsyncWithNestedSchema = "# Let's add 100 sample rows to our dataset\ndata = [\n LanceSchema(\n id=f\"id{i}\",\n vector=np.random.randn(1536),\n payload=Document(\n content=f\"document{i}\",\n meta=Metadata(source=f\"source{i % 10}\", timestamp=datetime.now()),\n ),\n )\n for i in range(100)\n]\n\nasync_tbl = await async_db.create_table(\n \"documents_async\", data=data, mode=\"overwrite\"\n)\n";
export const PyCreateTableWithNestedSchema = "# Let's add 100 sample rows to our dataset\ndata = [\n LanceSchema(\n id=f\"id{i}\",\n vector=np.random.randn(1536),\n payload=Document(\n content=f\"document{i}\",\n meta=Metadata(source=f\"source{i % 10}\", timestamp=datetime.now()),\n ),\n )\n for i in range(100)\n]\n\n# Synchronous client\ntbl = db.create_table(\"documents\", data=data, mode=\"overwrite\")\n";
diff --git a/docs/tables/consistency.mdx b/docs/tables/consistency.mdx
index fa248a4..1f84c4d 100644
--- a/docs/tables/consistency.mdx
+++ b/docs/tables/consistency.mdx
@@ -97,6 +97,11 @@ To manually check for updates, use `checkout_latest` / `checkoutLatest`:
+For reproducible reads, you can also pin a table to a specific snapshot with `checkout(...)` or
+a tag, restore a table to a prior version, then return to the live table with
+`checkout_latest` / `checkoutLatest`. See
+[Versioning](/tables/versioning/) for the full version and tag workflow.
+
## Handle bad vectors
diff --git a/docs/tables/create.mdx b/docs/tables/create.mdx
index 6f9f663..60682bc 100644
--- a/docs/tables/create.mdx
+++ b/docs/tables/create.mdx
@@ -70,6 +70,8 @@ Depending on the SDK, LanceDB can ingest arrays of records, Arrow tables or reco
You can provide a list of objects to create a table. The Python and TypeScript SDKs
support lists/arrays of dictionaries, while the Rust SDK supports lists of structs.
+In Python, pass a list or other batch-like object; a single bare `dict` or single
+`LanceModel` is rejected.
@@ -136,6 +138,11 @@ You can define a custom Arrow schema for the table. This is useful when you want
+An explicit schema is also where you control nullability. If later writes omit a
+non-nullable column, or provide actual nulls for it, ingestion fails; nullable columns can be
+omitted or written with null values. Without an explicit schema, Python infers list-like vector
+values as fixed-size `float32` vector fields from the observed dimension.
+
### From an Arrow Table
You can also create LanceDB tables directly from Arrow tables.
Rust uses an Arrow `RecordBatchReader` for the same Arrow-native ingest flow.
diff --git a/docs/tables/index.mdx b/docs/tables/index.mdx
index eb54017..e8ee192 100644
--- a/docs/tables/index.mdx
+++ b/docs/tables/index.mdx
@@ -262,7 +262,10 @@ initial testing).
-If you want to avoid overwriting an existing table, omit the overwrite mode.
+If you want to avoid overwriting an existing table, omit the overwrite mode. For append-only
+ingestion into a table that already exists, open the table and call `add(...)` instead of
+`create_table(...)`. For repeatable setup with `exist_ok` / `existOk`, see
+[Handle existing tables](/tables/create#handle-existing-tables).
### From Pandas DataFrames
diff --git a/docs/tables/schema.mdx b/docs/tables/schema.mdx
index 83f48ac..0426d1c 100644
--- a/docs/tables/schema.mdx
+++ b/docs/tables/schema.mdx
@@ -69,6 +69,10 @@ LanceDB supports three primary schema evolution operations:
Schema evolution operations are applied immediately but do not typically require rewriting all data. However, data type changes may involve more substantial operations.
+Each schema evolution operation commits a new table version and returns status metadata such as
+the committed `version`. Run these operations from a mutable table handle; if you checked out an
+older version for reads, call `checkout_latest` / `checkoutLatest` before modifying the schema.
+
## Add new columns
You can add new columns to a table with the [`add_columns`](https://lancedb.github.io/lancedb/python/python/#lancedb.table.Table.add_columns)
@@ -234,6 +238,9 @@ You can alter columns to contain NULL values:
+Changing a column to nullable affects future writes and merges too: missing values are accepted
+only when the target column is nullable.
+
### Multiple changes at once
Apply several alterations in a single operation:
@@ -366,4 +373,3 @@ Remove several columns at once for efficiency:
Dropping columns cannot be undone. Make sure you have backups or are certain before removing columns.
-
diff --git a/docs/tables/update.mdx b/docs/tables/update.mdx
index f18368e..34b0142 100644
--- a/docs/tables/update.mdx
+++ b/docs/tables/update.mdx
@@ -133,6 +133,14 @@ table creation patterns (Pandas, Polars, Pydantic, iterators, etc.) -- see the [
| `merge_insert` | `.when_matched_update_all()` + `.when_not_matched_insert_all()` | You want both behaviors together (often called **upsert**: update existing keys **and** insert missing keys in the same operation). |
| `merge_insert` | `.when_not_matched_by_source_delete(...)` | You want to remove target rows that are missing from the incoming source set. |
+Write operations return status metadata. For example, `update` reports updated row count and
+committed `version`, `delete` reports deleted row count and `version`, and `merge_insert` reports
+inserted, updated, deleted, retry-attempt, and `version` fields. Writes require a mutable table
+handle; after checking out an older version for reads, call `checkout_latest` / `checkoutLatest`
+before modifying data.
+The committed `version` advances even for writes that affect zero rows, such as a delete predicate
+that matches nothing.
+
## Update rows
Use `update` when you already know which target rows to modify and you do not need to compare against an incoming dataset.
@@ -204,6 +212,10 @@ In merge operations, rows are split into three groups:
- **Not matched**: key exists only in source.
- **Not matched by source**: key exists only in target.
+Conditional merge clauses can compare old and new values. Use the `target.` prefix for the
+existing table row and `source.` for the incoming row, for example
+`target.last_update < source.last_update`.
+
**Use scalar indexes to speed up merge insert**
diff --git a/docs/tables/versioning.mdx b/docs/tables/versioning.mdx
index 0933c01..61fd91e 100644
--- a/docs/tables/versioning.mdx
+++ b/docs/tables/versioning.mdx
@@ -270,6 +270,12 @@ On a fresh table, the snippets in this guide produce this version sequence:
Read-only and checkout operations shown here (`list_versions`/`listVersions`, `version`, `checkout`, `checkout_latest`/`checkoutLatest`) do not create new versions.
+The version metadata fields can differ by backend. Direct table-backed version listing exposes a
+timestamp, while namespace-backed listing may expose fields such as `manifest_path`,
+`manifest_size`, `e_tag`, and `timestamp_millis`. In deployments that use managed versioning,
+prefer the table version APIs exposed by LanceDB Enterprise or the namespace service instead of
+mixing in lower-level Lance file operations.
+
**System Operations**
diff --git a/skills/docs-writer/SKILL.md b/skills/docs-writer/SKILL.md
index 1a06fd7..c8c1909 100644
--- a/skills/docs-writer/SKILL.md
+++ b/skills/docs-writer/SKILL.md
@@ -18,6 +18,8 @@ Code examples on docs pages are **not** written directly into MDX. They live ins
LLM writing at times feels very formulaic, using very similar phrasing. The goal is to make the docs feel approachable and human, not like a dry manual that was written by a robot. Avoid repeating the same sentence structures, vary your word choice, and inject a bit of personality where appropriate. The content should be clear and accurate, but also engaging to read.
+When closing audit findings, keep the fix as small as the reader's need allows. Fold related gaps into existing paragraphs, notes, or lists; avoid adding a new section for every missing edge case. Favor usage-critical guidance, prerequisites, and common failure modes over exhaustive parameter inventories. The docs should help users move confidently, not force them through implementation minutiae.
+
Avoid the following extremely common patterns:
- "It's not this, it's that."
- "Paying the ___ tax" (e.g., "paying the import tax", "paying the setup tax") − the words "pay" and "tax" are heavily overused by AI
diff --git a/workflows/docs-audit/.env.example b/workflows/docs-audit/.env.example
new file mode 100644
index 0000000..6f6ee11
--- /dev/null
+++ b/workflows/docs-audit/.env.example
@@ -0,0 +1,10 @@
+OPENAI_API_KEY=your_openai_api_key_here
+
+DOCS_AUDIT_DB_URI=db://docs-audit
+DOCS_AUDIT_OPENAI_MODEL=gpt-5.5
+DOCS_AUDIT_OPENAI_REASONING_EFFORT=high
+DOCS_AUDIT_EMBEDDING_MODEL=text-embedding-3-large
+
+LANCEDB_API_KEY=...
+LANCEDB_HOST_OVERRIDE=https://...
+LANCEDB_REGION=us-east-1
diff --git a/workflows/docs-audit/README.md b/workflows/docs-audit/README.md
index 6bce640..19d3009 100644
--- a/workflows/docs-audit/README.md
+++ b/workflows/docs-audit/README.md
@@ -8,11 +8,12 @@ This workspace orchestrates a weekly documentation-gap audit across three local
The goal is to find what is missing from the docs, especially conceptual and imperative guidance that exists in code, tests, UI copy, request schemas, config comments, or integration scenarios but is not conveyed clearly in the public docs.
-This is a research workflow, not a production service. The design favors:
+This is a research workflow with a scheduled cloud runner. The design favors:
- compact deterministic preprocessing
-- page-scoped LLM work by the running agent
+- page-scoped LLM work through the OpenAI API
- saved local artifacts for inspection and reuse
+- durable storage of completed reports and parsed findings
- simple extension through manifests
## Non-goals
@@ -22,7 +23,7 @@ This workspace does not:
- clone or vendor source code from the watched repos
- attempt to enforce a hard token quota in the agent runtime
- produce doc fixes automatically
-- behave like a production CI system
+- automatically author or rewrite area manifests during scheduled runs
## Watched Repos
@@ -48,20 +49,24 @@ Each weekly run follows the same sequence:
- then include rotating extra pages for broader coverage
- if no pages changed, the rotating extra pages become the selected pages
- the rotation walks through the pages in manifest order and advances as rotating pages are added
-7. Use page-scoped LLM passes on the selected page bundles to extract:
+7. Use OpenAI API-driven page-scoped LLM passes on the selected page bundles to extract:
- code claims
- doc claims
- candidate gaps and final markdown observations
8. Save artifacts under a timestamped run directory.
9. Mark the run complete and update state.
-10. Surface the final markdown report through an inbox item.
+10. Parse the completed `report.md` into durable findings.
+11. Embed each public finding with OpenAI embeddings.
+12. Store the run and findings in LanceDB Enterprise under `db://docs-audit`.
+13. Surface a concise summary from the filtered public findings.
## Workspace Layout
- `config.toml`: repo paths, enabled areas, selection rules, and output paths
- `manifests/`: docs-area manifests
- `prompts/`: reusable agent prompt templates
-- `scripts/`: deterministic extraction, refresh, selection, and state utilities
+- `docs_audit/`: deterministic runner, OpenAI helpers, report parser, and Enterprise storage code
+- `scripts/run_weekly_audit.py`: user-facing weekly audit entrypoint for local and EC2 cron runs
- `state/`: lightweight run state and rotation cursor
- `artifacts/`: per-run evidence bundles, LLM outputs, and reports
- `README.md`: maintainer-oriented workflow and extension guide
@@ -85,7 +90,7 @@ The deterministic layer intentionally keeps evidence compact so the LLM does not
## LLM-Assisted Layer
-The semantic layer runs through the automation prompt. For each selected page bundle, the LLM should:
+The semantic layer runs through the OpenAI API. For each selected page bundle, the LLM should:
1. infer normalized code claims from the evidence bundle
2. infer normalized doc claims from the docs bundle
@@ -99,12 +104,16 @@ The saved artifacts should include:
- candidate gaps
- final markdown report
+The scheduled cloud workflow should use existing manifest files only. Manifest authoring and manifest
+maintenance are manual maintainer activities; they may use `skills/area-manifest-authoring/SKILL.md`,
+but the weekly cloud run should not edit manifests as part of normal execution.
+
## Running a Manual Audit
From this workspace root:
```bash
-uv run python scripts/run_audit.py select-areas --refresh --advance
+uv run python -m docs_audit.deterministic_runner select-areas --refresh --advance
```
This chooses a bounded list of enabled area manifests for the weekly run. The selector uses
@@ -113,7 +122,7 @@ weekly slots are filled by rotating through `enabled_areas`. Use the printed `se
for the per-area `prepare` commands.
```bash
-uv run python scripts/run_audit.py prepare --area indexing
+uv run python -m docs_audit.deterministic_runner prepare --area indexing
```
`--area` is the manifest name, not a hardcoded value in the script. The runner loads:
@@ -123,7 +132,7 @@ uv run python scripts/run_audit.py prepare --area indexing
So `--area indexing` maps to `manifests/indexing.toml`. If you add `manifests/search.toml`, you would run:
```bash
-uv run python scripts/run_audit.py prepare --area search
+uv run python -m docs_audit.deterministic_runner prepare --area search
```
This creates a pending run directory under `artifacts/pending//` and prints a JSON summary to stdout.
@@ -156,7 +165,7 @@ selected area through `[selection].rotation_extra_pages`.
After the LLM phase writes the expected outputs into that pending run directory, complete the run with:
```bash
-uv run python scripts/run_audit.py complete --run-id
+uv run python -m docs_audit.deterministic_runner complete --run-id
```
Completion publishes the directory to `artifacts/runs//`. Directories under `artifacts/runs/`
@@ -167,19 +176,19 @@ Run multiple area `prepare` commands sequentially. The intended workflow is one
To clean up old generated run artifacts, use:
```bash
-uv run python scripts/run_audit.py cleanup --days 30
+uv run python -m docs_audit.deterministic_runner cleanup --days 30
```
The retention window is configurable with `--days`, and you can preview deletions without removing anything:
```bash
-uv run python scripts/run_audit.py cleanup --days 14 --dry-run
+uv run python -m docs_audit.deterministic_runner cleanup --days 14 --dry-run
```
For manual testing of the fallback path, you can simulate an unrefreshable repo:
```bash
-uv run python scripts/run_audit.py prepare \
+uv run python -m docs_audit.deterministic_runner prepare \
--area indexing \
--refresh \
--simulate-refresh-failure docs
@@ -324,7 +333,7 @@ A practical workflow:
After adding a new manifest, run:
```bash
-uv run python scripts/run_audit.py prepare --area
+uv run python -m docs_audit.deterministic_runner prepare --area
```
Then inspect:
@@ -345,25 +354,206 @@ The runner is designed so new docs areas should generally require a new manifest
## Weekly Automation
-The weekly automation should use this workspace as its cwd and follow `prompts/weekly_automation.md`.
+The weekly automation should use this workspace as its cwd and follow the deterministic selection and
+prepare/complete flow described above. In the cloud, the semantic pass is performed with OpenAI API
+credentials rather than a Codex Desktop agent.
The automation should:
-- review each enabled area manifest before running the audit
-- use `skills/area-manifest-authoring/SKILL.md` to detect docs-page drift and newly relevant evidence files in the watched repos
-- update a manifest when the area boundary or source mapping has materially changed
-- run the deterministic prepare step
+- load the enabled area manifests as read-only workflow inputs
+- run `select-areas --refresh --advance`
+- run `prepare` sequentially for each selected area
- inspect the generated selected page bundles
-- perform the page-scoped LLM passes
+- perform the page-scoped LLM passes through the OpenAI API
- write outputs under the run directory
- keep `report.md` limited to the missing-doc summary itself, not routine workflow or refresh-status narration
- call the completion step
-- return a concise markdown summary for the inbox item
+- parse the completed `report.md` into finding records
+- filter out findings that should not be exposed to end users, including helm chart and enterprise deployment observations
+- write the completed run and public findings to LanceDB Enterprise
+- return a concise markdown summary from the stored public findings
+
+The cloud runner needs these secrets or environment variables:
+
+- `OPENAI_API_KEY`: used for page-level semantic passes and finding embeddings
+- `DOCS_AUDIT_OPENAI_MODEL`: chat/reasoning model for page-level audits; defaults to `gpt-5.5`
+- `DOCS_AUDIT_OPENAI_REASONING_EFFORT`: reasoning effort for page-level audits; defaults to `high`
+- `DOCS_AUDIT_EMBEDDING_MODEL`: embedding model for finding search vectors; defaults to `text-embedding-3-large`
+- `LANCEDB_API_KEY`: LanceDB Enterprise API key
+- `LANCEDB_HOST_OVERRIDE`: LanceDB Enterprise host URL
+- `LANCEDB_REGION`: LanceDB Enterprise region; defaults to `us-east-1`
+- `DOCS_AUDIT_DB_URI`: optional override for the Enterprise database URI; defaults to `db://docs-audit`
+
+Use GPT-5.5 with high reasoning for the page-audit semantic pass. The audit is intentionally
+judgment-heavy: the model has to compare compact evidence bundles against docs claims, avoid
+implementation summaries, and emit only missing public documentation observations. Embeddings are a
+separate step used only after `report.md` is complete and parsed.
+
+The LanceDB Enterprise connection should follow the same remote-only pattern used by neighboring
+internal tooling:
+
+```python
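+import os
+
+import lancedb
+
+# Credentials come from the environment variables listed above.
+db = lancedb.connect(
+    uri="db://docs-audit",
+    api_key=os.environ["LANCEDB_API_KEY"],
+    host_override=os.environ["LANCEDB_HOST_OVERRIDE"],
+    region=os.environ.get("LANCEDB_REGION", "us-east-1"),
+)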
+```
+
+## EC2 Cron Deployment
+
+An EC2 cron job is a suitable deployment target for this workflow. The instance should keep the
+watched repositories checked out side by side so the relative paths in `config.toml` continue to
+resolve:
+
+```text
+/opt/lancedb-docs-audit/
+ lancedb/
+ docs/
+ workflows/docs-audit/
+ sophon/
+```
+
+Relative to `workflows/docs-audit` inside the `docs` checkout, the workspace still expects:
+
+- `../../../lancedb`
+- `../..`
+- `../../../sophon`
+
+If the EC2 checkout layout differs, update `workflows/docs-audit/config.toml` instead of adding
+path translation logic to the runner.
+
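As a quick sanity check of the layout above, the relative paths can be collapsed against the workspace directory without touching the filesystem. This is a minimal sketch: the `RELATIVE_REPO_PATHS` labels and the `resolve_repo_paths` helper are illustrative names, not part of the runner, and the workspace location is assumed from the example layout.

```python
import posixpath
from pathlib import PurePosixPath

# Assumed workspace location on the instance, taken from the example layout.
WORKSPACE = PurePosixPath("/opt/lancedb-docs-audit/docs/workflows/docs-audit")

# Relative paths the workspace expects; the dictionary keys are illustrative labels.
RELATIVE_REPO_PATHS = {
    "lancedb": "../../../lancedb",
    "docs": "../..",
    "sophon": "../../../sophon",
}


def resolve_repo_paths(workspace: PurePosixPath) -> dict[str, str]:
    """Collapse each relative path against the workspace, purely lexically."""
    return {
        name: posixpath.normpath(str(workspace / relative))
        for name, relative in RELATIVE_REPO_PATHS.items()
    }


resolved = resolve_repo_paths(WORKSPACE)
for name, path in resolved.items():
    print(f"{name}: {path}")
```

If any resolved path lands outside the shared parent directory, the checkout layout differs from the one assumed here and `config.toml` is the place to adjust it.
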
+Create a local environment file at `workflows/docs-audit/.env` on the instance. `.env` files are
+ignored by this repo and must not be committed:
+
+```bash
+OPENAI_API_KEY=...
+DOCS_AUDIT_OPENAI_MODEL=gpt-5.5
+DOCS_AUDIT_OPENAI_REASONING_EFFORT=high
+DOCS_AUDIT_EMBEDDING_MODEL=text-embedding-3-large
+
+LANCEDB_API_KEY=...
+LANCEDB_HOST_OVERRIDE=https://...
+LANCEDB_REGION=us-east-1
+DOCS_AUDIT_DB_URI=db://docs-audit
+```
+
+Cron should call a single cloud-runner entrypoint from the docs-audit workspace. Use a lock so a slow
+run cannot overlap the next scheduled run, and write logs outside the repo:
+
+```cron
+17 13 * * 1 cd /opt/lancedb-docs-audit/docs/workflows/docs-audit && flock -n /tmp/docs-audit.lock uv run python scripts/run_weekly_audit.py >> /var/log/docs-audit/weekly.log 2>&1
+```
+
+The cloud runner should:
+
+- load `workflows/docs-audit/.env` before reading configuration
+- use GPT-5.5 with high reasoning for page-level audit calls
+- use OpenAI embeddings only after `report.md` has been generated and parsed
+- write completed runs and findings to `db://docs-audit`
+- exit non-zero when refresh, OpenAI, parsing, or Enterprise writes fail
+
+The EC2 instance needs outbound network access to Git remotes, the OpenAI API, and the LanceDB
+Enterprise host. Prefer instance IAM or deploy keys for repository access, and keep OpenAI and
+LanceDB credentials in the local `.env` or the instance's secret-management layer.
+
+## Testing the Cloud Runner Locally
+
+Copy `workflows/docs-audit/.env.example` to `workflows/docs-audit/.env` and fill in the secrets.
+
+The runner has four practical modes:
+
+| Command | Selects areas | Calls GPT-5.5 | Calls embeddings | Writes LanceDB | Use for |
+| --- | --- | --- | --- | --- | --- |
+| `run_weekly_audit.py --ingest-run-dir artifacts/runs/ --skip-write` | No | No | No | No | Cheapest parser/report smoke test |
+| `run_weekly_audit.py --no-refresh --no-advance --skip-write` | Yes | Yes | No | No | Local report-generation test |
+| `run_weekly_audit.py --ingest-run-dir artifacts/runs/` | No | No | Yes | Yes | Backfill/test Enterprise writes for one completed run |
+| `run_weekly_audit.py` | Yes | Yes | Yes | Yes | Real weekly EC2 cron run |
+
+Argument meanings:
+
+- `--ingest-run-dir`: bypasses weekly selection and report generation; parses an existing completed run.
+- `--skip-write`: skips finding embeddings and LanceDB Enterprise writes. It does not skip GPT-5.5 if the command is generating a new report.
+- `--no-refresh`: skips `git pull --ff-only` during weekly area selection.
+- `--no-advance`: prevents the area rotation cursor from moving, which makes local tests repeatable.
+
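The mode table and flag descriptions above can be read as a small decision function. The sketch below is an assumption about how the flags combine, written only from that table; `build_parser` and `planned_steps` are hypothetical names and may not match the real `scripts/run_weekly_audit.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names come from the table above; defaults are assumptions.
    parser = argparse.ArgumentParser(prog="run_weekly_audit.py")
    parser.add_argument("--ingest-run-dir", default=None)
    parser.add_argument("--skip-write", action="store_true")
    parser.add_argument("--no-refresh", action="store_true")
    parser.add_argument("--no-advance", action="store_true")
    return parser


def planned_steps(argv: list[str]) -> dict[str, bool]:
    """Return which stages a given flag combination would run."""
    args = build_parser().parse_args(argv)
    # Ingesting an existing run bypasses selection and report generation.
    generating = args.ingest_run_dir is None
    # --skip-write disables both embeddings and Enterprise writes.
    writing = not args.skip_write
    return {
        "select_areas": generating,
        "refresh_repos": generating and not args.no_refresh,
        "advance_cursor": generating and not args.no_advance,
        "call_audit_model": generating,
        "call_embeddings": writing,
        "write_lancedb": writing,
    }


print(planned_steps(["--no-refresh", "--no-advance", "--skip-write"]))
```

Note that `--skip-write` leaves `call_audit_model` untouched, which matches the point that it does not skip the page-audit model when a new report is being generated.
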
+To run the cloud workflow locally without refreshing repos or writing to LanceDB Enterprise:
+
+```bash
+cd workflows/docs-audit
+uv run python scripts/run_weekly_audit.py --no-refresh --no-advance --skip-write
+```
+
+This still calls the OpenAI page-audit model and writes a completed local run artifact, but it skips
+finding embeddings and Enterprise writes.
+
+To backfill or test parsing for an existing completed run without OpenAI or Enterprise calls:
+
+```bash
+cd workflows/docs-audit
+uv run python scripts/run_weekly_audit.py \
+ --ingest-run-dir artifacts/runs/ \
+ --skip-write
+```
+
+To test the Enterprise write path for an existing completed run, omit `--skip-write`. That path
+uses OpenAI embeddings and writes to `docs_audit_runs` and `docs_audit_findings`.
+
+## Enterprise Storage
+
+The durable audit output is the completed `report.md`. Intermediate files under `llm_outputs/` and
+`page_bundles/` are useful for inspection, but they are not the primary historical record.
+
+Store completed report data in two LanceDB Enterprise tables under `db://docs-audit`:
+
+### `docs_audit_runs`
+
+One row per completed run.
+
+| Column | Purpose |
+| --- | --- |
+| `run_id` | Primary run identifier |
+| `completed_at` | Run completion timestamp |
+| `areas` | Selected areas for the run |
+| `report_text` | Raw completed `report.md` text, stored once per run |
+| `report_path` | Original artifact path or cloud artifact URL |
+| `repo_shas` | Docs, LanceDB, and Sophon commit SHAs |
+| `selected_pages` | Pages audited in the run |
+| `changed_pages` | Pages whose evidence fingerprints changed |
+| `refresh` | Watched repo refresh metadata |
+| `metadata` | Extra run metadata |
+
+### `docs_audit_findings`
+
+One row per parsed missing-doc observation from the completed report.
+
+| Column | Purpose |
+| --- | --- |
+| `id` | Stable finding id, such as `{run_id}:{finding_index}` with the index zero-padded to three digits |
+| `run_id` | Link back to `docs_audit_runs` |
+| `completed_at` | Copied from the run for time filtering |
+| `area` | Docs audit area |
+| `page_id` | Manifest page id when it can be resolved |
+| `page_title` | Human-readable page or report heading |
+| `page_path` | Docs source path when it can be resolved |
+| `report_heading` | Heading from `report.md` |
+| `finding_index` | Finding order in the report |
+| `finding_text` | Parsed missing-doc observation |
+| `finding_hash` | SHA-256 hash of the normalized area, page path, report heading, and finding text |
+| `visibility_class` | `public-doc-gap` or `excluded` |
+| `embedding_text` | Compact text sent to the OpenAI embeddings API |
+| `embedding` | Vector used for later semantic search |
+| `metadata` | Small finding-level extras |
+
+For reruns or backfills of the same `run_id`, delete existing rows for that `run_id` from both tables
+and replace them. Do not rewrite historical rows during routine weekly runs.
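
The delete-then-replace rule for a rerun can be illustrated with plain Python lists standing in for the two tables. This is a sketch under stated assumptions: `run_id_predicate` mirrors the single-quote escaping used in `enterprise_store.py`, while `replace_run_rows` is a hypothetical in-memory helper, not a LanceDB call.

```python
from typing import Any

Row = dict[str, Any]


def run_id_predicate(run_id: str) -> str:
    """Build a SQL-style filter, doubling single quotes inside the id."""
    escaped = str(run_id).replace("'", "''")
    return f"run_id = '{escaped}'"


def replace_run_rows(rows: list[Row], run_id: str, new_rows: list[Row]) -> list[Row]:
    """Drop every existing row for run_id, then append the fresh rows.

    Rows that belong to other runs are carried over unchanged.
    """
    kept = [row for row in rows if row["run_id"] != run_id]
    return kept + list(new_rows)


existing = [
    {"run_id": "run-a", "finding_index": 1},
    {"run_id": "run-b", "finding_index": 1},
    {"run_id": "run-b", "finding_index": 2},
]
fresh = [{"run_id": "run-b", "finding_index": 1}]

updated = replace_run_rows(existing, "run-b", fresh)
print(run_id_predicate("run-b"))
print(updated)
```

Because the replacement is keyed on `run_id`, repeating the same backfill yields the same rows and leaves other runs untouched.
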
## Maintainer Notes
- Keep reports focused on what is missing in the docs, not on implementation summaries or fix proposals.
- Do not spend report tokens on routine success status such as clean repo refreshes.
- Prefer evidence from doc comments, tested snippets, request schemas, UI copy, config comments, and integration tests over deep implementation internals.
-- If a new feature lands and the docs area should notice it, update the manifest first.
+- If a new feature lands and the docs area should notice it, update the manifest manually first.
+- The scheduled cloud workflow should not use `skills/area-manifest-authoring/SKILL.md`; maintainers can use that skill manually when adding or refreshing manifests.
- If the semantic pass grows too expensive, reduce weekly selection breadth before shrinking evidence quality.
diff --git a/workflows/docs-audit/config.toml b/workflows/docs-audit/config.toml
index 8df3eb3..3ab5b58 100644
--- a/workflows/docs-audit/config.toml
+++ b/workflows/docs-audit/config.toml
@@ -1,7 +1,7 @@
version = 1
enabled_areas = [
- "indexing",
- "search",
+ # "indexing",
+ # "search",
"table-operations",
"reranking",
"embeddings",
diff --git a/workflows/docs-audit/docs_audit/__init__.py b/workflows/docs-audit/docs_audit/__init__.py
new file mode 100644
index 0000000..0b384c4
--- /dev/null
+++ b/workflows/docs-audit/docs_audit/__init__.py
@@ -0,0 +1,2 @@
+"""Cloud runner helpers for the docs-audit workflow."""
+
diff --git a/workflows/docs-audit/docs_audit/config.py b/workflows/docs-audit/docs_audit/config.py
new file mode 100644
index 0000000..85cfb32
--- /dev/null
+++ b/workflows/docs-audit/docs_audit/config.py
@@ -0,0 +1,67 @@
+from __future__ import annotations
+
+import os
+from dataclasses import dataclass
+from pathlib import Path
+
+
+ROOT = Path(__file__).resolve().parent.parent
+
+
+def _unquote_env_value(value: str) -> str:
+ stripped = value.strip()
+ if len(stripped) >= 2 and stripped[0] == stripped[-1] and stripped[0] in {"'", '"'}:
+ return stripped[1:-1]
+ return stripped
+
+
+def load_env_file(path: Path | None = None) -> None:
+ env_path = path or ROOT / ".env"
+ if not env_path.exists():
+ return
+ for raw_line in env_path.read_text(encoding="utf-8").splitlines():
+ line = raw_line.strip()
+ if not line or line.startswith("#"):
+ continue
+ if line.startswith("export "):
+ line = line[len("export ") :].strip()
+ if "=" not in line:
+ continue
+ key, raw_value = line.split("=", 1)
+ key = key.strip()
+ if not key or key in os.environ:
+ continue
+ os.environ[key] = _unquote_env_value(raw_value)
+
+
+@dataclass(frozen=True)
+class Settings:
+ openai_api_key: str
+ audit_model: str
+ audit_reasoning_effort: str
+ embedding_model: str
+ openai_timeout_seconds: int
+ lancedb_api_key: str
+ lancedb_host_override: str
+ lancedb_region: str
+ docs_audit_db_uri: str
+
+
+def settings_from_env() -> Settings:
+ return Settings(
+ openai_api_key=(os.getenv("OPENAI_API_KEY") or "").strip(),
+ audit_model=(os.getenv("DOCS_AUDIT_OPENAI_MODEL") or "gpt-5.5").strip(),
+ audit_reasoning_effort=(
+ os.getenv("DOCS_AUDIT_OPENAI_REASONING_EFFORT") or "high"
+ ).strip(),
+ embedding_model=(
+ os.getenv("DOCS_AUDIT_EMBEDDING_MODEL") or "text-embedding-3-large"
+ ).strip(),
+ # Fall back to 900 when the variable is unset or empty; int("") would raise ValueError.
+ openai_timeout_seconds=int(os.getenv("DOCS_AUDIT_OPENAI_TIMEOUT_SECONDS") or "900"),
+ lancedb_api_key=(os.getenv("LANCEDB_API_KEY") or "").strip(),
+ lancedb_host_override=(os.getenv("LANCEDB_HOST_OVERRIDE") or "").strip(),
+ lancedb_region=(os.getenv("LANCEDB_REGION") or "us-east-1").strip()
+ or "us-east-1",
+ docs_audit_db_uri=(os.getenv("DOCS_AUDIT_DB_URI") or "db://docs-audit").strip(),
+ )
+
diff --git a/workflows/docs-audit/scripts/run_audit.py b/workflows/docs-audit/docs_audit/deterministic_runner.py
similarity index 100%
rename from workflows/docs-audit/scripts/run_audit.py
rename to workflows/docs-audit/docs_audit/deterministic_runner.py
diff --git a/workflows/docs-audit/docs_audit/enterprise_store.py b/workflows/docs-audit/docs_audit/enterprise_store.py
new file mode 100644
index 0000000..14300b1
--- /dev/null
+++ b/workflows/docs-audit/docs_audit/enterprise_store.py
@@ -0,0 +1,205 @@
+from __future__ import annotations
+
+import json
+import time
+from datetime import datetime, timezone
+from typing import Any
+from urllib.parse import urlparse
+
+import lancedb
+import pyarrow as pa
+
+
+RUNS_TABLE = "docs_audit_runs"
+FINDINGS_TABLE = "docs_audit_findings"
+TABLE_READY_TIMEOUT_SECONDS = 30.0
+TABLE_READY_SLEEP_SECONDS = 0.5
+TABLE_READY_MAX_ATTEMPTS = int(TABLE_READY_TIMEOUT_SECONDS / TABLE_READY_SLEEP_SECONDS)
+
+
+RUNS_SCHEMA = pa.schema(
+ [
+ pa.field("run_id", pa.string()),
+ pa.field("completed_at", pa.timestamp("us", tz="UTC")),
+ pa.field("areas", pa.list_(pa.string())),
+ pa.field("report_text", pa.string()),
+ pa.field("report_path", pa.string()),
+ pa.field("repo_shas", pa.string()),
+ pa.field("selected_pages", pa.list_(pa.string())),
+ pa.field("changed_pages", pa.list_(pa.string())),
+ pa.field("refresh", pa.string()),
+ pa.field("metadata", pa.string()),
+ ]
+)
+
+
+def findings_schema(embedding_dimension: int) -> pa.Schema:
+ return pa.schema(
+ [
+ pa.field("id", pa.string()),
+ pa.field("run_id", pa.string()),
+ pa.field("completed_at", pa.timestamp("us", tz="UTC")),
+ pa.field("area", pa.string()),
+ pa.field("page_id", pa.string()),
+ pa.field("page_title", pa.string()),
+ pa.field("page_path", pa.string()),
+ pa.field("report_heading", pa.string()),
+ pa.field("finding_index", pa.int64()),
+ pa.field("finding_text", pa.string()),
+ pa.field("finding_hash", pa.string()),
+ pa.field("visibility_class", pa.string()),
+ pa.field("embedding_text", pa.string()),
+ pa.field("embedding", pa.list_(pa.float32(), list_size=embedding_dimension)),
+ pa.field("metadata", pa.string()),
+ ]
+ )
+
+
+class DocsAuditEnterpriseStore:
+ def __init__(
+ self,
+ *,
+ uri: str,
+ api_key: str,
+ host_override: str,
+ region: str,
+ ) -> None:
+ if not api_key:
+ raise RuntimeError("Missing LANCEDB_API_KEY")
+ self._validate_host_override(host_override)
+ self.db = lancedb.connect(
+ uri=uri,
+ api_key=api_key,
+ host_override=host_override,
+ region=region,
+ )
+
+ @staticmethod
+ def _validate_host_override(host_override: str) -> None:
+ if not host_override:
+ raise RuntimeError("Missing LANCEDB_HOST_OVERRIDE")
+ parsed = urlparse(host_override)
+ if parsed.scheme not in {"http", "https"} or not parsed.netloc:
+ raise RuntimeError("Invalid LANCEDB_HOST_OVERRIDE")
+
+ def ensure_tables(self, *, embedding_dimension: int) -> None:
+ self._create_table_ready(RUNS_TABLE, RUNS_SCHEMA)
+ self._create_table_ready(FINDINGS_TABLE, findings_schema(embedding_dimension))
+
+ def replace_run(
+ self,
+ *,
+ run_row: dict[str, Any],
+ finding_rows: list[dict[str, Any]],
+ ) -> dict[str, int]:
+ self._create_table_ready(RUNS_TABLE, RUNS_SCHEMA)
+ runs = self._open_table(RUNS_TABLE)
+ findings = None
+ if finding_rows:
+ embedding_dimension = len(finding_rows[0]["embedding"])
+ self.ensure_tables(embedding_dimension=embedding_dimension)
+ findings = self._open_table(FINDINGS_TABLE)
+ run_id = str(run_row["run_id"])
+ run_id_sql = run_id.replace("'", "''")
+
+ try:
+ runs.delete(f"run_id = '{run_id_sql}'")
+ except Exception as exc:
+ if not self._is_table_not_found_error(exc):
+ raise
+ if findings is not None:
+ try:
+ findings.delete(f"run_id = '{run_id_sql}'")
+ except Exception as exc:
+ if not self._is_table_not_found_error(exc):
+ raise
+
+ runs.add([run_row], mode="append")
+ if finding_rows and findings is not None:
+ findings.add(finding_rows, mode="append")
+ return {"runs": 1, "findings": len(finding_rows)}
+
+ def _create_table_ready(self, table_name: str, schema: pa.Schema) -> None:
+ last_error: Exception | None = None
+ for _attempt in range(TABLE_READY_MAX_ATTEMPTS):
+ try:
+ self.db.create_table(table_name, schema=schema, mode="exist_ok")
+ return
+ except Exception as exc:
+ last_error = exc
+ if self._is_terminal_table_error(exc):
+ raise RuntimeError(
+ f"Terminal error while ensuring table '{table_name}': {exc}"
+ ) from exc
+ time.sleep(TABLE_READY_SLEEP_SECONDS)
+ raise RuntimeError(
+ f"Timed out ensuring table '{table_name}' after "
+ f"{TABLE_READY_TIMEOUT_SECONDS:.0f}s. Last error: {last_error}"
+ ) from last_error
+
+ def _open_table(self, table_name: str):
+ last_error: Exception | None = None
+ for _attempt in range(TABLE_READY_MAX_ATTEMPTS):
+ try:
+ return self.db.open_table(table_name)
+ except Exception as exc:
+ last_error = exc
+ if self._is_terminal_table_error(exc):
+ raise RuntimeError(
+ f"Terminal error while opening table '{table_name}': {exc}"
+ ) from exc
+ time.sleep(TABLE_READY_SLEEP_SECONDS)
+ raise RuntimeError(
+ f"Timed out opening table '{table_name}' after "
+ f"{TABLE_READY_TIMEOUT_SECONDS:.0f}s. Last error: {last_error}"
+ ) from last_error
+
+ @staticmethod
+ def _is_terminal_table_error(exc: Exception) -> bool:
+ msg = str(exc).lower()
+ terminal_tokens = (
+ "401",
+ "403",
+ "unauthorized",
+ "forbidden",
+ "permission denied",
+ "invalid api key",
+ "invalid url",
+ "relativeurlwithoutbase",
+ "schema",
+ "type mismatch",
+ "invalid type",
+ )
+ transient_tokens = (
+ "404",
+ "503",
+ "table not found",
+ "was not found",
+ "_versions",
+ "service unavailable",
+ "temporarily unavailable",
+ "retry limit",
+ "timed out",
+ )
+ if any(token in msg for token in transient_tokens):
+ return False
+ return any(token in msg for token in terminal_tokens)
+
+ @staticmethod
+ def _is_table_not_found_error(exc: Exception) -> bool:
+ msg = str(exc).lower()
+ return "not found" in msg or "_versions" in msg
+
+
+def parse_timestamp(value: str | None) -> datetime:
+ if not value:
+ return datetime.now(timezone.utc)
+ normalized = value.replace("Z", "+00:00")
+ parsed = datetime.fromisoformat(normalized)
+ if parsed.tzinfo is None:
+ return parsed.replace(tzinfo=timezone.utc)
+ return parsed.astimezone(timezone.utc)
+
+
+def json_string(value: Any) -> str:
+ return json.dumps(value, sort_keys=True, separators=(",", ":"))
diff --git a/workflows/docs-audit/docs_audit/openai_client.py b/workflows/docs-audit/docs_audit/openai_client.py
new file mode 100644
index 0000000..7591c57
--- /dev/null
+++ b/workflows/docs-audit/docs_audit/openai_client.py
@@ -0,0 +1,82 @@
+from __future__ import annotations
+
+import json
+import urllib.error
+import urllib.request
+from typing import Any
+
+
+OPENAI_API_BASE = "https://api.openai.com/v1"
+
+
+class OpenAIClient:
+ def __init__(self, *, api_key: str, timeout_seconds: int = 900) -> None:
+ if not api_key:
+ raise RuntimeError("Missing OPENAI_API_KEY")
+ self.api_key = api_key
+ self.timeout_seconds = timeout_seconds
+
+ def _post_json(self, path: str, payload: dict[str, Any]) -> dict[str, Any]:
+ body = json.dumps(payload).encode("utf-8")
+ request = urllib.request.Request(
+ f"{OPENAI_API_BASE}{path}",
+ data=body,
+ headers={
+ "Authorization": f"Bearer {self.api_key}",
+ "Content-Type": "application/json",
+ },
+ method="POST",
+ )
+ try:
+ with urllib.request.urlopen(request, timeout=self.timeout_seconds) as response:
+ return json.loads(response.read().decode("utf-8"))
+ except urllib.error.HTTPError as exc:
+ detail = exc.read().decode("utf-8", errors="replace")
+ raise RuntimeError(f"OpenAI API request failed: {exc.code} {detail}") from exc
+
+ def response_text(
+ self,
+ *,
+ model: str,
+ reasoning_effort: str,
+ input_text: str,
+ ) -> str:
+ payload = {
+ "model": model,
+ "input": input_text,
+ "reasoning": {"effort": reasoning_effort},
+ "store": False,
+ }
+ data = self._post_json("/responses", payload)
+ text = extract_response_text(data)
+ if not text.strip():
+ raise RuntimeError("OpenAI response did not include output text")
+ return text
+
+ def embeddings(self, *, model: str, inputs: list[str]) -> list[list[float]]:
+ if not inputs:
+ return []
+ data = self._post_json("/embeddings", {"model": model, "input": inputs})
+ rows = sorted(data.get("data", []), key=lambda item: item.get("index", 0))
+ embeddings = [row.get("embedding") for row in rows]
+ if len(embeddings) != len(inputs) or any(not item for item in embeddings):
+ raise RuntimeError("OpenAI embeddings response did not match input count")
+ return embeddings
+
+
+def extract_response_text(data: dict[str, Any]) -> str:
+ top_level = data.get("output_text")
+ if isinstance(top_level, str) and top_level.strip():
+ return top_level
+
+ chunks: list[str] = []
+ for item in data.get("output", []):
+ if item.get("type") != "message":
+ continue
+ for content in item.get("content", []):
+ if content.get("type") in {"output_text", "text"}:
+ text = content.get("text")
+ if isinstance(text, str):
+ chunks.append(text)
+ return "\n".join(chunks)
+
diff --git a/workflows/docs-audit/docs_audit/report_parser.py b/workflows/docs-audit/docs_audit/report_parser.py
new file mode 100644
index 0000000..1754b7a
--- /dev/null
+++ b/workflows/docs-audit/docs_audit/report_parser.py
@@ -0,0 +1,106 @@
+from __future__ import annotations
+
+import hashlib
+import re
+from dataclasses import dataclass
+
+from .visibility import visibility_class
+
+
+HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
+BULLET_RE = re.compile(r"^\s*[-*]\s+(.+?)\s*$")
+
+
+@dataclass(frozen=True)
+class Finding:
+    id: str
+    run_id: str
+    area: str
+    report_heading: str
+    finding_index: int
+    finding_text: str
+    finding_hash: str
+    visibility_class: str
+    page_id: str | None = None
+    page_title: str | None = None
+    page_path: str | None = None
+
+
+def normalize_finding_text(text: str) -> str:
+    return " ".join(text.strip().split())
+
+
+def finding_hash(area: str, page_path: str | None, heading: str, text: str) -> str:
+    normalized = "\n".join(
+        [
+            area.strip().casefold(),
+            (page_path or "").strip().casefold(),
+            heading.strip().casefold(),
+            normalize_finding_text(text).casefold(),
+        ]
+    )
+    return "sha256:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()
+
+
+def parse_report_findings(
+    report_text: str,
+    *,
+    run_id: str,
+    area: str,
+    page_lookup: dict[str, dict[str, str]] | None = None,
+) -> list[Finding]:
+    page_lookup = page_lookup or {}
+    current_heading = ""
+    findings: list[Finding] = []
+    pending_index = 0
+
+    for line in report_text.splitlines():
+        heading_match = HEADING_RE.match(line)
+        if heading_match:
+            level = len(heading_match.group(1))
+            title = heading_match.group(2).strip()
+            if level >= 2:
+                current_heading = title
+            continue
+
+        bullet_match = BULLET_RE.match(line)
+        if not bullet_match:
+            continue
+
+        text = normalize_finding_text(bullet_match.group(1))
+        if not text:
+            continue
+
+        pending_index += 1
+        page_info = page_lookup.get(current_heading.casefold(), {})
+        page_path = page_info.get("page_path")
+        digest = finding_hash(area, page_path, current_heading, text)
+        findings.append(
+            Finding(
+                id=f"{run_id}:{pending_index:03d}",
+                run_id=run_id,
+                area=area,
+                report_heading=current_heading,
+                finding_index=pending_index,
+                finding_text=text,
+                finding_hash=digest,
+                visibility_class=visibility_class(text),
+                page_id=page_info.get("page_id"),
+                page_title=page_info.get("page_title") or current_heading or None,
+                page_path=page_path,
+            )
+        )
+
+    return findings
+
+
+def embedding_text(finding: Finding) -> str:
+    parts = [
+        f"Area: {finding.area}",
+        f"Page: {finding.page_title or finding.report_heading or 'Unknown'}",
+    ]
+    if finding.page_path:
+        parts.append(f"Docs path: {finding.page_path}")
+    parts.append(f"Finding: {finding.finding_text}")
+    return "\n".join(parts)
+
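Review note: `finding_hash` casefolds every component and collapses whitespace in the finding text, so the same finding reworded only in case or spacing should hash identically across runs. The sketch below re-implements that normalization on its own to show the effect. The names `normalize` and `sketch_finding_hash` and the sample strings are illustrative assumptions, not code from this change.

```python
import hashlib


def normalize(text: str) -> str:
    # Collapse all runs of whitespace to single spaces.
    return " ".join(text.strip().split())


def sketch_finding_hash(area: str, page_path: str, heading: str, text: str) -> str:
    """Illustrative stand-in mirroring the normalization described above."""
    normalized = "\n".join(
        [
            area.strip().casefold(),
            page_path.strip().casefold(),
            heading.strip().casefold(),
            normalize(text).casefold(),
        ]
    )
    return "sha256:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two spellings of the same (made-up) finding.
first = sketch_finding_hash(
    "Indexing", "docs/indexing/fts.mdx", "Full-Text Search Index", "Tokenizers  are\tmissing."
)
second = sketch_finding_hash(
    " indexing ", "DOCS/indexing/fts.mdx", "full-text search index", "tokenizers are missing."
)
# A finding with different wording.
third = sketch_finding_hash(
    "Indexing", "docs/indexing/fts.mdx", "Full-Text Search Index", "Stemming is missing."
)
```

One consequence worth keeping in mind: the heading is part of the hash, so renaming a page title changes the hash for every finding under it.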
diff --git a/workflows/docs-audit/docs_audit/visibility.py b/workflows/docs-audit/docs_audit/visibility.py
new file mode 100644
index 0000000..35ec476
--- /dev/null
+++ b/workflows/docs-audit/docs_audit/visibility.py
@@ -0,0 +1,22 @@
+from __future__ import annotations
+
+
+EXCLUDED_TOPIC_TERMS = (
+ "helm",
+ "helm chart",
+ "kubernetes chart",
+ "enterprise deployment",
+ "enterprise deploy",
+)
+
+
+def visibility_class(text: str) -> str:
+    normalized = text.casefold()
+    if any(term in normalized for term in EXCLUDED_TOPIC_TERMS):
+        return "excluded"
+    return "public-doc-gap"
+
+
+def is_public_finding(text: str) -> bool:
+    return visibility_class(text) == "public-doc-gap"
+
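Review note: `visibility_class` uses plain substring matching, so the bare term `"helm"` also matches inside unrelated words such as "overwhelm", and it already covers `"helm chart"`. If that turns out to matter in practice, one possible alternative is a word-boundary pattern. The sketch below is only a suggestion under that assumption: `EXCLUDED_PATTERN` and `visibility_class_word_boundary` are hypothetical names, and the pattern is not what this change implements.

```python
import re

# Hypothetical alternative: match "helm" only as a whole word, and keep
# prefix matching for the multi-word terms ("enterprise deploy" still
# matches "enterprise deployment").
EXCLUDED_PATTERN = re.compile(r"\bhelm\b|\bkubernetes chart|\benterprise deploy")


def visibility_class_word_boundary(text: str) -> str:
    if EXCLUDED_PATTERN.search(text.casefold()):
        return "excluded"
    return "public-doc-gap"


sample = "The tuning guide can overwhelm new users."
substring_hit = "helm" in sample.casefold()           # substring check fires here
boundary_result = visibility_class_word_boundary(sample)
chart_result = visibility_class_word_boundary("The Helm chart values are undocumented.")
deploy_result = visibility_class_word_boundary("Enterprise deployment sizing is missing.")
```

The trade-off is that a whole-word match on "helm" no longer catches inflected forms such as "helms", which may or may not be desirable for this filter.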
diff --git a/workflows/docs-audit/prompts/weekly_automation.md b/workflows/docs-audit/prompts/weekly_automation.md
index 1d7d669..1ec6749 100644
--- a/workflows/docs-audit/prompts/weekly_automation.md
+++ b/workflows/docs-audit/prompts/weekly_automation.md
@@ -6,7 +6,8 @@ You are running the weekly docs-gap audit from this workspace root.
Produce a concise markdown report that lists only what is missing from the docs for the selected pages. Focus on conceptual and imperative gaps, not implementation summaries or fix proposals.
-This workflow also includes manifest maintenance. Before each audit run, review the enabled area manifests to see whether the docs pages or evidence sources have drifted and update them when needed.
+The scheduled workflow uses existing enabled area manifests as read-only inputs. Manifest authoring
+and manifest maintenance are manual maintainer activities.
## Files to read first
@@ -14,7 +15,6 @@ This workflow also includes manifest maintenance. Before each audit run, review
- `AGENTS.md`
- `config.toml`
- `prompts/page_audit_guidelines.md`
-- `skills/area-manifest-authoring/SKILL.md`
Then select the area manifests for this run using the deterministic area selector.
@@ -22,58 +22,40 @@ Then select the area manifests for this run using the deterministic area selecto
1. Read `config.toml` and determine the enabled areas from `enabled_areas`.
2. Select the areas for this weekly run:
- - `uv run python scripts/run_audit.py select-areas --refresh --advance`
+ - `uv run python -m docs_audit.deterministic_runner select-areas --refresh --advance`
- Use the printed `selected_areas` list for the rest of this workflow.
- The selector refreshes watched repos once, detects changed enabled areas, and fills the remaining weekly slots by area rotation.
- Do not run unselected enabled areas in this weekly pass.
-3. For each selected area, run a manifest maintenance pass before `prepare`.
- - Use `skills/area-manifest-authoring/SKILL.md` as the procedure.
- - Read the current `manifests/.toml`.
- - Check whether the docs area boundary has changed:
- - new or renamed docs pages in the same docs section
- - stale page paths
- - page IDs that no longer match the current docs layout
- - Check whether the evidence mapping has drifted:
- - new snippets, tests, request schemas, config files, UI surfaces, or public API files related to the area
- - stale source paths that should be removed
- - source blocks whose `applies_to` mapping is now too broad or too narrow
- - Keep the manifest compact. Do not add files just because they mention the topic; add them only if they are likely to expose user-visible behavior the docs may be missing.
- - If the manifest changes, save the updated `manifests/.toml` before preparing the run.
-4. Run the deterministic prepare step for each selected area.
+3. Run the deterministic prepare step for each selected area.
- Run prepare commands sequentially, one area at a time. Do not parallelize `prepare`.
- Repos were already refreshed by `select-areas`, so skip `--refresh` here:
- - `uv run python scripts/run_audit.py prepare --area `
-5. Read the JSON summary printed by each `prepare` command and locate each pending run directory.
+ - `uv run python -m docs_audit.deterministic_runner prepare --area `
+4. Read the JSON summary printed by each `prepare` command and locate each pending run directory.
- Use the printed `run_dir`; it should point under `artifacts/pending/`.
- Do not create or write directly under `artifacts/runs/` before completion.
-6. For each pending run directory, read `selected_pages.json` and the corresponding files in `page_bundles/`.
-7. For each selected page bundle:
+5. For each pending run directory, read `selected_pages.json` and the corresponding files in `page_bundles/`.
+6. For each selected page bundle:
- apply `prompts/page_audit_guidelines.md` as the page-level review rubric
- infer normalized code claims from the evidence bundle
- infer normalized doc claims from the docs bundle
- identify only the missing documentation
-8. Write semantic outputs under `llm_outputs/` in each pending run directory.
+7. Write semantic outputs under `llm_outputs/` in each pending run directory.
- one file per page for code claims
- one file per page for doc claims
- one file per page for candidate gaps
-9. Write `report.md` in each pending run directory.
+8. Write `report.md` in each pending run directory.
- `report.md` is the docs-gap summary only.
- Do not include refresh status, manifest-maintenance notes, selected-pages bookkeeping, or any other workflow narration in `report.md`.
- Include operational notes only if they materially affected audit quality, such as an unrefreshable repo, missing source files, or a manifest ambiguity that changes confidence in the findings.
-10. Complete each run:
- - `uv run python scripts/run_audit.py complete --run-id `
+ - Do not include helm chart or enterprise deployment findings.
+9. Complete each run:
+ - `uv run python -m docs_audit.deterministic_runner complete --run-id `
- Completion publishes the pending directory to `artifacts/runs/` and updates `artifacts/latest_run.json`.
- Only completed runs with `report.md` should appear under `artifacts/runs/`.
+10. Parse the completed `report.md` into findings, generate embeddings, and write the completed run
+ plus public findings to LanceDB Enterprise.
11. Return a concise markdown summary suitable for the inbox item.
-## Manifest maintenance rules
-
-- Prefer updating the manifest when the docs area or user-facing evidence has clearly evolved.
-- Prefer stability over churn. Do not rewrite a manifest just to reorganize it.
-- Prefer compact source lists over exhaustive source lists.
-- Prefer user-facing evidence over internal implementation detail.
-- If you find a likely new source file but its relevance is ambiguous, mention it in the final summary as a follow-up risk instead of forcing it into the manifest.
-
## Report rules
- Describe only what is missing from the docs.
diff --git a/workflows/docs-audit/scripts/run_weekly_audit.py b/workflows/docs-audit/scripts/run_weekly_audit.py
new file mode 100644
index 0000000..e83d52b
--- /dev/null
+++ b/workflows/docs-audit/scripts/run_weekly_audit.py
@@ -0,0 +1,521 @@
+#!/usr/bin/env python3
+from __future__ import annotations
+
+import argparse
+import json
+import subprocess
+import sys
+import textwrap
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+ROOT = Path(__file__).resolve().parent.parent
+if str(ROOT) not in sys.path:
+ sys.path.insert(0, str(ROOT))
+
+from docs_audit.config import load_env_file, settings_from_env
+from docs_audit.enterprise_store import (
+ DocsAuditEnterpriseStore,
+ json_string,
+ parse_timestamp,
+)
+from docs_audit.openai_client import OpenAIClient
+from docs_audit.report_parser import Finding, embedding_text, parse_report_findings
+from docs_audit.visibility import is_public_finding
+
+
+def log(message: str) -> None:
+ timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z")
+ print(f"[{timestamp}] {message}", flush=True)
+
+
+def run_json_command(args: list[str]) -> dict[str, Any]:
+ log(f"running deterministic step: {' '.join(args)}")
+ completed = subprocess.run(
+ [sys.executable, "-m", "docs_audit.deterministic_runner", *args],
+ cwd=ROOT,
+ text=True,
+ capture_output=True,
+ check=False,
+ )
+ if completed.returncode != 0:
+ raise RuntimeError(
+ f"Command failed: python -m docs_audit.deterministic_runner {' '.join(args)}\n"
+ f"stdout:\n{completed.stdout}\n\nstderr:\n{completed.stderr}"
+ )
+ payload = json.loads(completed.stdout)
+ if args and args[0] == "select-areas":
+ log(
+ "selected areas: "
+ + ", ".join(payload.get("selected_areas", []))
+ + f" (changed: {', '.join(payload.get('changed_areas', [])) or 'none'})"
+ )
+ elif args and args[0] == "prepare":
+ log(
+ f"prepared area={args[args.index('--area') + 1] if '--area' in args else '?'} "
+ f"run={payload.get('run_id')} pages={len(payload.get('selected_pages', []))}"
+ )
+ elif args and args[0] == "complete":
+ log(f"completed run={payload.get('run_id')} dir={payload.get('run_dir')}")
+ return payload
+
+
+def read_json(path: Path) -> Any:
+ return json.loads(path.read_text(encoding="utf-8"))
+
+
+def write_json(path: Path, payload: Any) -> None:
+ path.parent.mkdir(parents=True, exist_ok=True)
+ path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
+
+
+def extract_json_object(text: str) -> dict[str, Any]:
+    stripped = text.strip()
+    if stripped.startswith("```"):
+        lines = stripped.splitlines()
+        if lines and lines[0].startswith("```"):
+            lines = lines[1:]
+        if lines and lines[-1].startswith("```"):
+            lines = lines[:-1]
+        stripped = "\n".join(lines).strip()
+    try:
+        data = json.loads(stripped)
+    except json.JSONDecodeError:
+        start = stripped.find("{")
+        end = stripped.rfind("}")
+        if start < 0 or end < start:
+            raise
+        data = json.loads(stripped[start : end + 1])
+    if not isinstance(data, dict):
+        raise RuntimeError("Expected the page audit response to be a JSON object")
+    return data
+
+
+def list_of_strings(value: Any) -> list[str]:
+    if not isinstance(value, list):
+        return []
+    return [" ".join(str(item).strip().split()) for item in value if str(item).strip()]
+
+
+def page_audit_prompt(
+ *,
+ guidelines: str,
+ page_bundle: dict[str, Any],
+) -> str:
+ return f"""You are running one page-scoped pass of the LanceDB docs-gap audit.
+
+Apply these page audit guidelines:
+
+{guidelines}
+
+Return only a JSON object with this exact shape:
+
+{{
+ "code_claims": ["stable user-visible claims from the evidence bundle"],
+ "doc_claims": ["claims already present in the docs signals"],
+ "candidate_gaps": ["missing-doc candidates, including lower-confidence candidates"],
+ "report_observations": ["final concise missing-doc observations for report.md"]
+}}
+
+Rules:
+- Include only missing documentation observations.
+- Do not propose documentation patches.
+- Do not summarize implementation details unless they expose user-visible behavior missing from docs.
+- Do not include helm chart or enterprise deployment observations.
+- If there are no material gaps, use an empty report_observations array.
+- Keep report_observations self-contained and suitable as markdown bullets.
+
+Page bundle JSON:
+
+```json
+{json.dumps(page_bundle, indent=2, sort_keys=True)}
+```
+"""
+
+
+def audit_page(
+ *,
+ client: OpenAIClient,
+ model: str,
+ reasoning_effort: str,
+ guidelines: str,
+ page_bundle: dict[str, Any],
+) -> dict[str, list[str]]:
+ response_text = client.response_text(
+ model=model,
+ reasoning_effort=reasoning_effort,
+ input_text=page_audit_prompt(guidelines=guidelines, page_bundle=page_bundle),
+ )
+ data = extract_json_object(response_text)
+ return {
+ "code_claims": list_of_strings(data.get("code_claims")),
+ "doc_claims": list_of_strings(data.get("doc_claims")),
+ "candidate_gaps": list_of_strings(data.get("candidate_gaps")),
+ "report_observations": [
+ text for text in list_of_strings(data.get("report_observations")) if is_public_finding(text)
+ ],
+ }
+
+
+def write_page_outputs(
+ *,
+ llm_dir: Path,
+ page_id: str,
+ output: dict[str, list[str]],
+) -> None:
+ write_json(llm_dir / f"{page_id}.code_claims.json", output["code_claims"])
+ write_json(llm_dir / f"{page_id}.doc_claims.json", output["doc_claims"])
+ write_json(llm_dir / f"{page_id}.candidate_gaps.json", output["candidate_gaps"])
+ write_json(llm_dir / f"{page_id}.report_observations.json", output["report_observations"])
+
+
+def build_report(page_outputs: list[tuple[dict[str, Any], dict[str, list[str]]]]) -> str:
+ lines = ["# Missing Documentation Observations", ""]
+ wrote_any = False
+ for bundle, output in page_outputs:
+ observations = output["report_observations"]
+ if not observations:
+ continue
+ wrote_any = True
+ lines.extend([f"## {bundle['page_title']}", ""])
+ for observation in observations:
+ lines.append(f"- {observation}")
+ lines.append("")
+ if not wrote_any:
+ lines.append("No material missing documentation observations for the selected pages.")
+ lines.append("")
+ return "\n".join(lines)
+
+
+def page_lookup_from_bundles(bundles: list[dict[str, Any]]) -> dict[str, dict[str, str]]:
+ lookup: dict[str, dict[str, str]] = {}
+ for bundle in bundles:
+ title = str(bundle.get("page_title") or "")
+ if not title:
+ continue
+ lookup[title.casefold()] = {
+ "page_id": str(bundle.get("page_id") or ""),
+ "page_title": title,
+ "page_path": str(bundle.get("page_path") or ""),
+ }
+ return lookup
+
+
+def audit_pending_run(
+ *,
+ run_dir: Path,
+ client: OpenAIClient,
+ model: str,
+ reasoning_effort: str,
+) -> None:
+ selected = read_json(run_dir / "selected_pages.json")
+ guidelines = (ROOT / "prompts" / "page_audit_guidelines.md").read_text(encoding="utf-8")
+ page_outputs: list[tuple[dict[str, Any], dict[str, list[str]]]] = []
+ log(f"auditing run={run_dir.name} selected_pages={len(selected['selected_pages'])}")
+ for page_id in selected["selected_pages"]:
+ bundle = read_json(run_dir / "page_bundles" / f"{page_id}.json")
+ log(
+ f"auditing page run={run_dir.name} page={page_id} "
+ f"model={model} effort={reasoning_effort}"
+ )
+ output = audit_page(
+ client=client,
+ model=model,
+ reasoning_effort=reasoning_effort,
+ guidelines=guidelines,
+ page_bundle=bundle,
+ )
+ write_page_outputs(llm_dir=run_dir / "llm_outputs", page_id=page_id, output=output)
+ log(
+ f"page audited run={run_dir.name} page={page_id} "
+ f"observations={len(output['report_observations'])}"
+ )
+ page_outputs.append((bundle, output))
+ report_text = build_report(page_outputs)
+ (run_dir / "report.md").write_text(report_text, encoding="utf-8")
+ log(f"wrote report run={run_dir.name} bytes={len(report_text.encode('utf-8'))}")
+
+
+def prepare_selected_runs(*, refresh: bool, advance: bool) -> list[Path]:
+ log(f"selecting areas refresh={refresh} advance={advance}")
+ select_args = ["select-areas"]
+ if refresh:
+ select_args.append("--refresh")
+ if advance:
+ select_args.append("--advance")
+ selection = run_json_command(select_args)
+ run_dirs: list[Path] = []
+ for area in selection["selected_areas"]:
+ log(f"preparing area={area}")
+ prepared = run_json_command(["prepare", "--area", area])
+ run_dirs.append(Path(prepared["run_dir"]))
+ return run_dirs
+
+
+def complete_run(run_dir: Path) -> Path:
+ run_id = run_dir.name
+ log(f"completing run={run_id}")
+ completed = run_json_command(["complete", "--run-id", run_id])
+ return Path(completed["run_dir"])
+
+
+def repo_shas(metadata: dict[str, Any]) -> dict[str, str]:
+ shas: dict[str, str] = {}
+ for item in metadata.get("refresh", []):
+ repo = item.get("repo")
+ sha = item.get("sha_after") or item.get("sha_before")
+ if repo and sha:
+ shas[str(repo)] = str(sha)
+ return shas
+
+
+def run_row_from_metadata(
+ *,
+ metadata: dict[str, Any],
+ report_text: str,
+ report_path: Path,
+) -> dict[str, Any]:
+ return {
+ "run_id": metadata["run_id"],
+ "completed_at": parse_timestamp(metadata.get("completed_at")),
+ "areas": [metadata["area"]],
+ "report_text": report_text,
+ "report_path": str(report_path),
+ "repo_shas": json_string(repo_shas(metadata)),
+ "selected_pages": [str(item) for item in metadata.get("selected_pages", [])],
+ "changed_pages": [str(item) for item in metadata.get("changed_pages", [])],
+ "refresh": json_string(metadata.get("refresh", [])),
+ "metadata": json_string(metadata),
+ }
+
+
+def finding_rows(
+ *,
+ findings: list[Finding],
+ embeddings: list[list[float]],
+ completed_at: Any,
+) -> list[dict[str, Any]]:
+ rows = []
+ for finding, vector in zip(findings, embeddings, strict=True):
+ rows.append(
+ {
+ "id": finding.id,
+ "run_id": finding.run_id,
+ "completed_at": completed_at,
+ "area": finding.area,
+ "page_id": finding.page_id or "",
+ "page_title": finding.page_title or "",
+ "page_path": finding.page_path or "",
+ "report_heading": finding.report_heading,
+ "finding_index": finding.finding_index,
+ "finding_text": finding.finding_text,
+ "finding_hash": finding.finding_hash,
+ "visibility_class": finding.visibility_class,
+ "embedding_text": embedding_text(finding),
+ "embedding": [float(value) for value in vector],
+ "metadata": json_string({}),
+ }
+ )
+ return rows
+
+
+def debug_finding_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
+ debug_rows: list[dict[str, Any]] = []
+ for row in rows:
+ item = dict(row)
+ completed_at = item.get("completed_at")
+ if hasattr(completed_at, "isoformat"):
+ item["completed_at"] = completed_at.isoformat()
+ item["embedding"] = []
+ debug_rows.append(item)
+ return debug_rows
+
+
+def ingest_completed_run(
+ *,
+ run_dir: Path,
+ openai_client: OpenAIClient | None,
+ embedding_model: str,
+ store: DocsAuditEnterpriseStore | None,
+) -> dict[str, Any]:
+ metadata = read_json(run_dir / "metadata.json")
+ log(f"parsing completed report run={metadata['run_id']} area={metadata['area']}")
+ report_path = run_dir / "report.md"
+ report_text = report_path.read_text(encoding="utf-8")
+ bundles = [
+ read_json(path)
+ for path in sorted((run_dir / "page_bundles").glob("*.json"))
+ ]
+ findings = [
+ finding
+ for finding in parse_report_findings(
+ report_text,
+ run_id=metadata["run_id"],
+ area=metadata["area"],
+ page_lookup=page_lookup_from_bundles(bundles),
+ )
+ if finding.visibility_class == "public-doc-gap"
+ ]
+ texts = [embedding_text(finding) for finding in findings]
+ log(f"parsed findings run={metadata['run_id']} public_findings={len(findings)}")
+ vectors: list[list[float]]
+ if store is None:
+ log(f"skipping embeddings and Enterprise write run={metadata['run_id']}")
+ vectors = [[] for _finding in findings]
+ else:
+ if openai_client is None:
+ raise RuntimeError("OpenAI client is required when writing embedded findings")
+ log(
+ f"generating embeddings run={metadata['run_id']} "
+ f"count={len(texts)} model={embedding_model}"
+ )
+ vectors = openai_client.embeddings(model=embedding_model, inputs=texts)
+ run_row = run_row_from_metadata(
+ metadata=metadata,
+ report_text=report_text,
+ report_path=report_path,
+ )
+ rows = finding_rows(
+ findings=findings,
+ embeddings=vectors,
+ completed_at=run_row["completed_at"],
+ )
+ write_json(run_dir / "llm_outputs" / "parsed_findings.json", debug_finding_rows(rows))
+ stored = {"runs": 0, "findings": 0}
+ if store is not None:
+ log(f"writing Enterprise rows run={metadata['run_id']} findings={len(rows)}")
+ stored = store.replace_run(run_row=run_row, finding_rows=rows)
+ log(
+ f"wrote Enterprise rows run={metadata['run_id']} "
+ f"runs={stored['runs']} findings={stored['findings']}"
+ )
+ return {
+ "run_id": metadata["run_id"],
+ "area": metadata["area"],
+ "findings": len(findings),
+ "stored": stored,
+ "report_path": str(report_path),
+ }
+
+
+def build_summary(results: list[dict[str, Any]]) -> str:
+ total = sum(int(result["findings"]) for result in results)
+ lines = [f"Docs audit completed: {len(results)} run(s), {total} public finding(s)."]
+ for result in results:
+ lines.append(
+ f"- {result['area']} `{result['run_id']}`: {result['findings']} finding(s), report {result['report_path']}"
+ )
+ return "\n".join(lines)
+
+
+def build_parser() -> argparse.ArgumentParser:
+ parser = argparse.ArgumentParser(
+ description="OpenAI API-driven docs-gap audit runner",
+ formatter_class=argparse.RawDescriptionHelpFormatter,
+ epilog=textwrap.dedent(
+ """\
+ Common modes:
+
+ No-API parser smoke test:
+ uv run python scripts/run_weekly_audit.py --ingest-run-dir artifacts/runs/ --skip-write
+
+ Local audit generation without Enterprise writes:
+ uv run python scripts/run_weekly_audit.py --no-refresh --no-advance --skip-write
+
+ Backfill one completed run into Enterprise:
+ uv run python scripts/run_weekly_audit.py --ingest-run-dir artifacts/runs/
+
+ Weekly EC2 cron mode:
+ uv run python scripts/run_weekly_audit.py
+
+ Cost/network behavior:
+ --ingest-run-dir with --skip-write does not call OpenAI and does not write to LanceDB.
+ --skip-write skips embeddings and LanceDB writes, but audit generation still calls the configured OpenAI audit model.
+ Omitting --no-refresh allows select-areas to run git pull --ff-only on watched repos.
+ Omitting --no-advance updates the weekly area rotation cursor.
+ """
+ ),
+ )
+ parser.add_argument(
+ "--no-refresh",
+ action="store_true",
+ help="Skip git pull --ff-only during area selection. Useful for local tests.",
+ )
+ parser.add_argument(
+ "--no-advance",
+ action="store_true",
+ help="Do not advance the area rotation cursor. Useful for repeatable local tests.",
+ )
+ parser.add_argument(
+ "--skip-write",
+ action="store_true",
+ help="Do not generate embeddings or write LanceDB Enterprise rows. Audit generation may still call OpenAI.",
+ )
+ parser.add_argument(
+ "--ingest-run-dir",
+ type=Path,
+ help="Parse an existing completed run directory instead of selecting areas or generating a new report.",
+ )
+ return parser
+
+
+def main() -> int:
+    args = build_parser().parse_args()
+    log("starting weekly docs audit runner")
+    load_env_file()
+    settings = settings_from_env()
+    log(
+        f"mode ingest_run_dir={args.ingest_run_dir or 'none'} "
+        f"skip_write={args.skip_write} no_refresh={args.no_refresh} no_advance={args.no_advance}"
+    )
+    needs_openai = args.ingest_run_dir is None or not args.skip_write
+    openai_client = None
+    if needs_openai:
+        log("initializing OpenAI client")
+        openai_client = OpenAIClient(
+            api_key=settings.openai_api_key,
+            timeout_seconds=settings.openai_timeout_seconds,
+        )
+    store = None
+    if not args.skip_write:
+        log(f"connecting to LanceDB Enterprise uri={settings.docs_audit_db_uri}")
+        store = DocsAuditEnterpriseStore(
+            uri=settings.docs_audit_db_uri,
+            api_key=settings.lancedb_api_key,
+            host_override=settings.lancedb_host_override,
+            region=settings.lancedb_region,
+        )
+
+    completed_dirs: list[Path] = []
+    if args.ingest_run_dir is not None:
+        completed_dirs = [args.ingest_run_dir]
+    else:
+        pending_dirs = prepare_selected_runs(refresh=not args.no_refresh, advance=not args.no_advance)
+        if openai_client is None:
+            raise RuntimeError("OpenAI client is required for audit generation")
+        for pending_dir in pending_dirs:
+            audit_pending_run(
+                run_dir=pending_dir,
+                client=openai_client,
+                model=settings.audit_model,
+                reasoning_effort=settings.audit_reasoning_effort,
+            )
+            completed_dirs.append(complete_run(pending_dir))
+
+    results = [
+        ingest_completed_run(
+            run_dir=run_dir,
+            openai_client=openai_client,
+            embedding_model=settings.embedding_model,
+            store=store,
+        )
+        for run_dir in completed_dirs
+    ]
+    print(build_summary(results))
+    log("weekly docs audit runner finished")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
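Review note: the interaction between `--ingest-run-dir` and `--skip-write` in `main()` is easy to misread, so the sketch below restates it as a small pure function. It mirrors the `needs_openai` expression and the `store` condition from `main()`. The function name `plan_mode` and the returned keys are assumptions made for illustration and do not exist in the script.

```python
from __future__ import annotations

from pathlib import Path


def plan_mode(ingest_run_dir: Path | None, skip_write: bool) -> dict[str, bool]:
    """Hypothetical helper summarizing which side effects a flag combination implies."""
    return {
        # A new report is generated only when no existing run dir is ingested.
        "generates_report": ingest_run_dir is None,
        # OpenAI is needed for audit generation, or for embeddings when writing.
        "needs_openai": ingest_run_dir is None or not skip_write,
        # The Enterprise store is opened only when writes are enabled.
        "writes_store": not skip_write,
    }


# Placeholder path, used only to mark "an existing run directory was given".
existing_run = Path("artifacts/runs/example-run")

smoke_test = plan_mode(existing_run, skip_write=True)
local_audit = plan_mode(None, skip_write=True)
backfill = plan_mode(existing_run, skip_write=False)
weekly_cron = plan_mode(None, skip_write=False)
```

Read this way, only the parser smoke test (existing run directory plus `--skip-write`) avoids both OpenAI calls and LanceDB writes, which matches the "Cost/network behavior" notes in the parser epilog.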
diff --git a/workflows/docs-audit/skills/area-manifest-authoring/SKILL.md b/workflows/docs-audit/skills/area-manifest-authoring/SKILL.md
index 247ac2b..fb47e4f 100644
--- a/workflows/docs-audit/skills/area-manifest-authoring/SKILL.md
+++ b/workflows/docs-audit/skills/area-manifest-authoring/SKILL.md
@@ -128,7 +128,7 @@ If you create a new area, also update `enabled_areas` in `config.toml` if the au
After drafting the manifest, run:
```bash
-uv run python scripts/run_audit.py prepare --area
+uv run python -m docs_audit.deterministic_runner prepare --area
```
Inspect:
diff --git a/workflows/docs-audit/skills/docs-writer/SKILL.md b/workflows/docs-audit/skills/docs-writer/SKILL.md
index 021901e..b79c161 100644
--- a/workflows/docs-audit/skills/docs-writer/SKILL.md
+++ b/workflows/docs-audit/skills/docs-writer/SKILL.md
@@ -42,11 +42,14 @@ If the source code disagrees with the gap report, trust the source and flag the
Edit the affected MDX pages directly. Keep the change scoped to the gap; don't sweep in unrelated improvements unless the user asked for them.
+Default to minimally invasive edits. The goal is to help readers use LanceDB correctly, not to reproduce every detail from the audit report. When several findings point at the same concept, fold them into one short paragraph, note, or existing list item instead of adding separate sections. Prefer practical usage guidance, prerequisites, and failure modes over exhaustive API inventory. Add a new section only when the surrounding page has no natural place for the information.
+
For prose:
- **Placement**: put new sections where readers will encounter the concept naturally, not in the next empty slot at the bottom of the page.
- **Depth and tone**: match the heading depth and voice of surrounding sections.
- **Cross-links**: link to related pages with anchor links when it helps the reader, without spraying too many links and making the prose look cluttered.
+- **Brevity**: make the smallest change that closes the gap for a working user. Avoid verbose caveat lists unless the caveat changes what the user should do.
For code examples:
diff --git a/workflows/docs-audit/tests/conftest.py b/workflows/docs-audit/tests/conftest.py
new file mode 100644
index 0000000..e4370f1
--- /dev/null
+++ b/workflows/docs-audit/tests/conftest.py
@@ -0,0 +1,10 @@
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+ROOT = Path(__file__).resolve().parents[1]
+if str(ROOT) not in sys.path:
+ sys.path.insert(0, str(ROOT))
+
diff --git a/workflows/docs-audit/tests/test_enterprise_store.py b/workflows/docs-audit/tests/test_enterprise_store.py
new file mode 100644
index 0000000..dd318c1
--- /dev/null
+++ b/workflows/docs-audit/tests/test_enterprise_store.py
@@ -0,0 +1,51 @@
+from __future__ import annotations
+
+import pytest
+
+from docs_audit import enterprise_store
+from docs_audit.enterprise_store import DocsAuditEnterpriseStore
+
+
+def test_store_requires_api_key() -> None:
+ with pytest.raises(RuntimeError, match="Missing LANCEDB_API_KEY"):
+ DocsAuditEnterpriseStore(
+ uri="db://docs-audit",
+ api_key="",
+ host_override="https://enterprise.example.com",
+ region="us-east-1",
+ )
+
+
+def test_store_rejects_invalid_host_override() -> None:
+ with pytest.raises(RuntimeError, match="Invalid LANCEDB_HOST_OVERRIDE"):
+ DocsAuditEnterpriseStore(
+ uri="db://docs-audit",
+ api_key="key",
+ host_override="enterprise.example.com",
+ region="us-east-1",
+ )
+
+
+def test_store_uses_enterprise_connection_kwargs(monkeypatch) -> None:
+ captured: dict[str, object] = {}
+
+ def fake_connect(**kwargs):
+ captured.update(kwargs)
+ return object()
+
+ monkeypatch.setattr(enterprise_store.lancedb, "connect", fake_connect)
+
+ DocsAuditEnterpriseStore(
+ uri="db://docs-audit",
+ api_key="enterprise-key",
+ host_override="https://enterprise-host.example.com",
+ region="us-west-2",
+ )
+
+ assert captured == {
+ "uri": "db://docs-audit",
+ "api_key": "enterprise-key",
+ "host_override": "https://enterprise-host.example.com",
+ "region": "us-west-2",
+ }
+
diff --git a/workflows/docs-audit/tests/test_openai_client.py b/workflows/docs-audit/tests/test_openai_client.py
new file mode 100644
index 0000000..a8632d4
--- /dev/null
+++ b/workflows/docs-audit/tests/test_openai_client.py
@@ -0,0 +1,24 @@
+from __future__ import annotations
+
+from docs_audit.openai_client import extract_response_text
+
+
+def test_extract_response_text_prefers_top_level_output_text() -> None:
+ assert extract_response_text({"output_text": "hello"}) == "hello"
+
+
+def test_extract_response_text_reads_message_content() -> None:
+ payload = {
+ "output": [
+ {
+ "type": "message",
+ "content": [
+ {"type": "output_text", "text": "first"},
+ {"type": "output_text", "text": "second"},
+ ],
+ }
+ ]
+ }
+
+ assert extract_response_text(payload) == "first\nsecond"
+
diff --git a/workflows/docs-audit/tests/test_report_parser.py b/workflows/docs-audit/tests/test_report_parser.py
new file mode 100644
index 0000000..f6c3585
--- /dev/null
+++ b/workflows/docs-audit/tests/test_report_parser.py
@@ -0,0 +1,47 @@
+from __future__ import annotations
+
+from docs_audit.report_parser import embedding_text, parse_report_findings
+
+
+def test_parse_report_findings_resolves_page_metadata() -> None:
+ report = """# Missing Documentation Observations
+
+## Full-Text Search Index
+
+- Model-backed FTS tokenizers are missing from the tokenizer documentation.
+- The n-gram parameter names are inconsistent with the Python API.
+"""
+ findings = parse_report_findings(
+ report,
+ run_id="run-1",
+ area="indexing",
+ page_lookup={
+ "full-text search index": {
+ "page_id": "fts-index",
+ "page_title": "Full-Text Search Index",
+ "page_path": "docs/indexing/fts.mdx",
+ }
+ },
+ )
+
+ assert [finding.finding_index for finding in findings] == [1, 2]
+ assert findings[0].id == "run-1:001"
+ assert findings[0].page_id == "fts-index"
+ assert findings[0].visibility_class == "public-doc-gap"
+ assert findings[0].finding_hash.startswith("sha256:")
+ assert "Area: indexing" in embedding_text(findings[0])
+ assert "Docs path: docs/indexing/fts.mdx" in embedding_text(findings[0])
+
+
+def test_parse_report_findings_marks_excluded_topics() -> None:
+ report = """# Missing Documentation Observations
+
+## Deployment
+
+- The helm chart values for enterprise deployment are not documented.
+"""
+ findings = parse_report_findings(report, run_id="run-1", area="storage")
+
+ assert len(findings) == 1
+ assert findings[0].visibility_class == "excluded"
+
diff --git a/workflows/docs-audit/tests/test_run_weekly_audit.py b/workflows/docs-audit/tests/test_run_weekly_audit.py
new file mode 100644
index 0000000..31cafc9
--- /dev/null
+++ b/workflows/docs-audit/tests/test_run_weekly_audit.py
@@ -0,0 +1,23 @@
+from __future__ import annotations
+
+from datetime import datetime, timezone
+
+from scripts.run_weekly_audit import debug_finding_rows
+
+
+def test_debug_finding_rows_serializes_datetime_and_strips_embedding() -> None:
+ rows = [
+ {
+ "id": "run:001",
+ "completed_at": datetime(2026, 5, 18, tzinfo=timezone.utc),
+ "embedding": [1.0, 2.0],
+ }
+ ]
+
+ assert debug_finding_rows(rows) == [
+ {
+ "id": "run:001",
+ "completed_at": "2026-05-18T00:00:00+00:00",
+ "embedding": [],
+ }
+ ]