Add RFC for Vector Search by gggrace14 · Pull Request #56 · prestodb/rfcs

gggrace14 · 2026-02-10T15:50:38Z

This RFC proposes the syntax and SPI to natively support Approximate Nearest Neighbor (ANN) vector search within Presto.

NivinCS · 2026-02-15T05:49:32Z

Thanks, @gggrace14, for the write-up.

The proposal currently does not describe the lifecycle management of vector indexes with respect to underlying table mutations. In a lakehouse environment such as Iceberg, operations like appends, partition backfills, deletes, file rewrites (compaction), and snapshot expiration can make the index stale or physically invalid, especially if the index stores file-level row references.

Could we clarify how index validity is expected to be maintained across such operations? Specifically:

1. How is index staleness detected?
2. Are indexes expected to be snapshot-bound?
3. How should index cleanup or rebuild be handled after file rewrites or snapshot expiry?
4. Is automatic invalidation or maintenance of index metadata in scope?

This would help ensure correctness and avoid cases where ANN search may return outdated or invalid results.

NivinCS · 2026-02-15T06:02:17Z

RFC-0023-vector-search.md

+"I need the index definition created now, but the index should only build when I instruct."
+
+```sql
+CREATE VECTOR INDEX vector_index ON TABLE candidates_table(id, embedding) WITH (


After creating the vector index metadata without building it for any partitions, the expected query behavior is not entirely clear. For example, if a user runs:

SELECT * FROM candidates_table ORDER BY embedding <-> query_vector LIMIT 10;

when the index is defined but not yet populated, should the engine fall back to a full scan, ignore the index, emit a warning, or fail the query? It would be helpful to clarify the intended behavior in this scenario to ensure predictable query execution.

NivinCS · 2026-02-15T06:20:22Z

RFC-0023-vector-search.md

+
+**ON `TABLE candidates_table(id, embedding)` clause**
+- `embedding` is the embedding column to create the index on.
+- `id` is the unique identifier column that the user needs to provide.


It would be beneficial to keep the PRIMARY KEY / ROW_ID specification optional rather than mandatory for vector index creation.

From the Iceberg V3 perspective, tables with Row Level Lineage enabled can provide a stable row-level identifier via $row_id. In such cases, the identifier column can be made optional, allowing the connector to automatically fall back to $row_id when a user-defined id is not specified.

This would avoid requiring users to explicitly provide a unique identifier for tables where lineage-based row identity is already available.
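The fallback suggested in this comment could look roughly like the sketch below. It is written in Python purely for illustration: the helper name `resolve_id_column` and the `row_lineage_enabled` flag are hypothetical and are not part of the RFC or of any Presto or Iceberg API. Only the `$row_id` column name comes from the discussion.

```python
from typing import Optional

ROW_ID_COLUMN = "$row_id"  # hidden column named in the discussion


def resolve_id_column(user_id_column: Optional[str], row_lineage_enabled: bool) -> str:
    """Pick the identifier column for a vector index (hypothetical helper)."""
    if user_id_column:
        # An explicitly provided id column always wins.
        return user_id_column
    if row_lineage_enabled:
        # Assumption: the connector exposes a stable row id when lineage is on.
        return ROW_ID_COLUMN
    raise ValueError("an id column is required when row lineage is not available")
```

With this shape, `resolve_id_column(None, True)` yields `$row_id`, while a table without lineage still requires an explicit column.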

NivinCS · 2026-02-15T06:50:50Z

RFC-0023-vector-search.md

+    - Alternatively, use the hidden column $row_id for id and remove the requirement from users.
+
+**`UPDATING FOR` clause**
+- For each of the index partitions mapped to the range, an index will be created.


It may be beneficial to make the partition specification in UPDATE VECTOR INDEX (e.g., FOR PARTITION (...)) optional rather than mandatory.

Instead of requiring users to manually specify partition ranges for index updates, the connector could automatically detect changed partitions and perform incremental index refresh, similar to materialized view incremental refresh.

This would reduce the need for manual orchestration and improve usability for incremental index maintenance.
cc: @tdcmeehan
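One way a connector might detect changed partitions is sketched below, in Python for illustration. It assumes the connector can obtain a last-modified marker per base table partition and a built-at marker per index partition. Both maps and the function name are hypothetical, and an Iceberg connector might compare snapshot identifiers instead of timestamps.

```python
from typing import Dict, List


def partitions_needing_refresh(base_modified: Dict[str, int],
                               index_built: Dict[str, int]) -> List[str]:
    """Return base partitions whose index partition is missing or older than the data."""
    stale = []
    for partition, modified_at in base_modified.items():
        built_at = index_built.get(partition)
        # Missing index partition, or base data changed after the index was built.
        if built_at is None or built_at < modified_at:
            stale.append(partition)
    return sorted(stale)


# Illustrative values only: ds=2026-01-04 was backfilled, ds=2026-01-05 is new.
base = {"ds=2026-01-03": 100, "ds=2026-01-04": 250, "ds=2026-01-05": 300}
index = {"ds=2026-01-03": 120, "ds=2026-01-04": 200}
to_refresh = partitions_needing_refresh(base, index)
```

Here `to_refresh` contains the backfilled and the new partition, which is the set an incremental refresh without an explicit partition clause would need to rebuild.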

NivinCS · 2026-02-15T07:33:34Z

RFC-0023-vector-search.md

+
+#### OSS Implementation
+
+OSS implementation will be exact vector search and implement a ConnectorPlanOptimizer to rewrite to a plan with MAX_BY().


It will be beneficial to include a reusable toolkit/library as part of the proposal to handle internal query rewrites for vector search operations.

In many cases, executing ANN search would otherwise require users to explicitly join the base table with the corresponding vector index using a row identifier (e.g., user-defined id or $row_id). This can expose index implementation details and make query syntax more complex, requiring users to understand how embeddings and index data are related internally.

Providing a reusable component that transparently rewrites vector search predicates (e.g., ANN_SEARCH(...)) into the appropriate join between the base table and vector index would enable a simpler and more seamless user experience. This would allow users to express ANN queries without manually specifying join logic, while the underlying connector leverages the index automatically.

Such a library could be integrated by connectors (e.g., Iceberg) to perform these rewrites behind the scenes, avoiding duplication of rewrite logic across connectors and keeping index implementation details abstracted from end users.
cc: @tdcmeehan
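The join that such a rewrite would hide from users can be illustrated with a small Python sketch. It assumes the index lookup returns `(id, distance)` pairs and that base rows are keyed by the same identifier. The function name and the data shapes are hypothetical, and silently skipping hits whose row no longer exists is just one possible policy.

```python
from typing import Any, Dict, List, Tuple


def join_index_hits(hits: List[Tuple[int, float]],
                    base_rows: Dict[int, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Join (id, distance) hits from an index lookup back to base table rows."""
    joined = []
    # Closest hits first, so the result is already ordered by distance.
    for row_id, distance in sorted(hits, key=lambda hit: hit[1]):
        row = base_rows.get(row_id)
        if row is None:
            # Assumption: a hit for a row missing from the base table is dropped.
            continue
        joined.append({**row, "distance": distance})
    return joined
```

A user-facing ANN query would then only name the base table and the query vector, while the connector performs this join internally.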

aditi-pandit

Thanks @gggrace14 for this writeup. Overall I'm in favor of this proposal.

I'm also curious, since FAISS functions were recently exposed in C++: are you planning to do any follow-up work with the library?

aditi-pandit · 2026-02-18T00:22:46Z

RFC-0023-vector-search.md

+
+##### Journey 4: Search with implicit index lookup
+
+"I want to run ANN search automatically using the index registered on my table."


Does the optimizer know about the index as such? The use of the index should be abstracted by the VECTOR_SEARCH function, no?

aditi-pandit · 2026-02-18T00:39:04Z

RFC-0023-vector-search.md

+
+#### Update Partitioned Index
+
+```sql


Are we missing the SPI for these? Can you elaborate?

aditi-pandit · 2026-02-18T00:40:58Z

RFC-0023-vector-search.md

+```sql
+SET SESSION vector_search_index = 'di:vector_index:nprobe=50';
+
+SELECT


Have you considered implementing VECTOR_SEARCH as a table function? It is easier to express the input tables, partitioning, and parameters that way.

aditi-pandit · 2026-02-18T00:45:12Z

RFC-0023-vector-search.md

+
+#### Execution and SPI
+
+Our proposal is to introduce generic SPI nodes: VectorSearchNode and VectorSearchOptions, which together provide an abstract interface for vector search. This abstract interface will only include mandatory parameters, allowing actual implementations to extend it.


Can you give more details about VectorSearchNode and VectorSearchOptions? Will VectorSearchNode be a PlanNode?

bibith4 · 2026-02-19T06:23:26Z

RFC-0023-vector-search.md

+- The individual table partition is large.
+- A new partition is added every day or every hour.
+- Past partitions are occasionally backfilled.
+


It would be helpful to clarify the behavior when base table partitions are dropped, since index partitions are mapped to them. Should corresponding index partitions be automatically cleaned up or require manual action?

Additionally, it may be beneficial to define lifecycle DDL such as DROP VECTOR INDEX for operational usability.

bibith4 · 2026-02-19T06:33:11Z

RFC-0023-vector-search.md

+Update only the index partition mapped to the base table partition ds = '2026-01-04'.
+
+```sql
+UPDATE VECTOR INDEX vector_index


It would be helpful to clarify query behavior during index build or refresh. For example, can queries read partially built partitions?

Dilli-Babu-Godari · 2026-02-16T13:26:45Z

RFC-0023-vector-search.md

+CREATE VECTOR INDEX vector_index ON TABLE candidates_table(id, embedding) WITH (
+    index_type = 'ivf_rabitq4',
+    distance_metric = 'cosine',
+    index_options = 'nlist=100000,nb=8',


What do the parameters nlist and nb represent?

How do these parameters affect indexing performance, recall, and query latency?

What guidelines should users follow to choose optimal values for different dataset sizes and vector dimensions?

Is there any automatic tuning mechanism available, or are users expected to benchmark and manually determine the best configuration?
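For readers unfamiliar with the `index_options` string in the snippet above, here is a minimal Python sketch of how a comma-separated `key=value` list such as `'nlist=100000,nb=8'` could be parsed. The function name is hypothetical, the assumption that every value is an integer is mine, and the sketch says nothing about what `nlist` and `nb` mean in the RFC.

```python
from typing import Dict


def parse_index_options(options: str) -> Dict[str, int]:
    """Parse 'key=value,key=value' into a dict (values assumed to be integers)."""
    parsed: Dict[str, int] = {}
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed option: {item!r}")
        parsed[key.strip()] = int(value)
    return parsed


options = parse_index_options("nlist=100000,nb=8")
```

Validation of allowed keys and value ranges would belong to the index implementation and is left out here.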

gggrace14 · 2026-03-03T06:08:21Z


Hi @NivinCS, in general the design defines only the minimum required parts as SPI and leaves individual connectors maximum flexibility to define their own behavior.

Specifically, with respect to how index validity is maintained, we could leave it to each connector's ConnectorPlanOptimizer to detect whether an index partition is stale and to return a reasonable result when it plans or rewrites the vector search query.

For the SPI, thinking through your questions, I think what could be added is a getVectorSearchIndexStatus() method in the SPI classes MetadataManager and ConnectorMetadata, similar to how materialized views detect staleness. Let me know if there is anything else you have in mind that needs to be added to the SPI.

Below I describe what we will put into the HiveConnector at Meta to maintain index partition validity, by answering your questions above. Meta has an internal HiveConnector extension. I think you can implement the Iceberg behaviors in the counterpart classes of the IcebergConnector. Again, I think these belong to the connectors, not the SPI.

1. How is index staleness detected?
By looking at the metadata of the index partition in the HiveMetastore. Since an index partition is just a Hive table partition, if the index partition is stale, we'll drop it from the HiveMetastore. We can also put a dirty mark on the index partition's metadata.

We'll have a pub-sub service that subscribes to the metadata of the candidate table (base table). If the corresponding base table partition is updated, overwritten, or deleted, it will trigger dropping the index partition.

Then, in the search path, we have a VectorSearchRewritePlanOptimizer that implements ConnectorPlanOptimizer. In the VectorSearchRewritePlanOptimizer, we compare the available partitions of the index against those of the base table in the query range. If an index partition is available, we'll use it to get the kNN of that partition. Otherwise, we will build the index on the fly. Another option for an unavailable index partition is to fall back to exact search. We then aggregate across all partitions in the query range and compute the final kNN from the per-partition kNNs.

2. Are indexes expected to be snapshot-bound?
I might not have enough context for Iceberg. However, I think you can decide the reasonable behavior in IcebergConnector.

3. How should index cleanup or rebuild be handled after file rewrites or snapshot expiry?
As mentioned above, we will have a pub-sub service. Once it detects an update or deletion event on the base table from the HiveMetastore, it will trigger at least two actions in sequence: 1) drop the corresponding index partition; 2) call UPDATE VECTOR INDEX vector_index WHERE <filter>, where <filter> corresponds to the dropped partition.

4. Is automatic invalidation or maintenance of index metadata in scope?
No, automatic invalidation is out of scope for Presto as well as for this RFC. Index validity is stored as index metadata, and when and how to update that metadata is out of scope for Presto. Presto only provides SQL as the interface to create and update the index.
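The search path described under question 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual VectorSearchRewritePlanOptimizer: partitions are modeled as in-memory dicts, an index partition is modeled as a callable returning `(distance, id)` pairs, vectors are assumed non-zero, and unavailable index partitions take the exact-search fallback rather than an on-the-fly build. All names are hypothetical.

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Hit = Tuple[float, int]  # (distance, row id)


def cosine_distance(a: Vector, b: Vector) -> float:
    """Cosine distance; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def exact_knn(rows: Dict[int, Vector], query: Vector, k: int) -> List[Hit]:
    """Brute-force kNN over one partition (the exact-search fallback)."""
    return heapq.nsmallest(
        k, ((cosine_distance(vec, query), rid) for rid, vec in rows.items()))


def search(base_partitions: Dict[str, Dict[int, Vector]],
           index_partitions: Dict[str, Callable[[Vector, int], List[Hit]]],
           query: Vector, k: int) -> List[Hit]:
    """Use the index partition when present, else exact search, then merge."""
    per_partition: List[Hit] = []
    for name, rows in base_partitions.items():
        ann_lookup = index_partitions.get(name)
        if ann_lookup is not None:
            per_partition.extend(ann_lookup(query, k))
        else:
            per_partition.extend(exact_knn(rows, query, k))
    # The global kNN is the k smallest distances among all per-partition kNNs.
    return heapq.nsmallest(k, per_partition)
```

Taking the top k from each partition before merging is sufficient because any row in the global top k must also be in the top k of its own partition, provided each per-partition lookup is exact. With an approximate index the merged result inherits the recall of the per-partition lookups.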

prestodb-ci added the from:Meta label (PRs from Meta) Feb 10, 2026

gggrace14 force-pushed the vsrfc branch 4 times, most recently from 44b3f03 to 6b4e9f7 February 10, 2026 16:14

Add RFC for Vector Search

0df52e7

gggrace14 force-pushed the vsrfc branch from 6b4e9f7 to 0df52e7 February 10, 2026 17:04

NivinCS reviewed Feb 15, 2026


aditi-pandit reviewed Feb 18, 2026


bibith4 reviewed Feb 19, 2026


Dilli-Babu-Godari reviewed Feb 19, 2026


skyelves mentioned this pull request Feb 25, 2026

feat: Add syntax support for CREATE VECTOR INDEX [DO NOT MERGE] prestodb/presto#27027

Open

gggrace14 marked this pull request as ready for review February 26, 2026 21:41




6 participants
