Thanks, @gggrace14, for the write-up. The proposal currently does not describe the lifecycle management of vector indexes with respect to underlying table mutations. In a lakehouse environment such as Iceberg, operations like appends, partition backfills, deletes, file rewrites (compaction), and snapshot expiration can make the index stale or physically invalid, especially if the index stores file-level row references. Could we clarify how index validity is expected to be maintained across such operations? Specifically:
1. How is index staleness detected?
2. Are indexes expected to be snapshot-bound?
3. How should index cleanup or rebuild be handled after file rewrites or snapshot expiry?
4. Is automatic invalidation or maintenance of index metadata in scope?

This would help ensure correctness and avoid cases where ANN search may return outdated or invalid results.
| "I need the index definition created now, but the index should only build when I instruct." | ||
|
|
||
| ```sql | ||
| CREATE VECTOR INDEX vector_index ON TABLE candidates_table(id, embedding) WITH ( |
After creating the vector index metadata without building it for any partitions, the expected query behavior is not entirely clear. For example, if a user runs:
SELECT *
FROM candidates_table
ORDER BY embedding <-> query_vector
LIMIT 10;
when the index is defined but not yet populated, should the engine fall back to a full scan, ignore the index, emit a warning, or fail the query? It would be helpful to clarify the intended behavior in this scenario to ensure predictable query execution.
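To make the options concrete, here is a minimal Python sketch of the decision the planner would have to make. The `FallbackPolicy` enum, the `choose_plan` function, and the plan names are hypothetical; none of them come from the proposal or from any Presto API.

```python
from enum import Enum


class FallbackPolicy(Enum):
    FULL_SCAN = "full_scan"          # silently use exact search
    WARN_AND_SCAN = "warn_and_scan"  # exact search plus a warning
    FAIL = "fail"                    # reject the query


def choose_plan(built_partitions, policy):
    """Return (plan, warning) for a top-k distance query.

    built_partitions: number of index partitions that are populated.
    """
    if built_partitions > 0:
        return "index_lookup", None
    if policy is FallbackPolicy.FAIL:
        raise RuntimeError("vector index is defined but has not been built")
    if policy is FallbackPolicy.WARN_AND_SCAN:
        return "full_scan", "vector index is empty; using exact search"
    return "full_scan", None
```

Whichever policy is chosen, documenting it (or exposing it as a session property) would make the behavior predictable.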
|
|
||
| **ON `TABLE candidates_table(id, embedding)` clause** | ||
| - `embedding` is the embedding column to create the index on. | ||
| - `id` is the unique identifier column that the user needs to provide. |
It would be beneficial to keep the PRIMARY KEY / ROW_ID specification optional rather than mandatory for vector index creation.
From the Iceberg V3 perspective, tables with Row Level Lineage enabled can provide a stable row-level identifier via $row_id. In such cases, the identifier column can be made optional, allowing the connector to automatically fall back to $row_id when a user-defined id is not specified.
This would avoid requiring users to explicitly provide a unique identifier for tables where lineage-based row identity is already available.
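The fallback could be as simple as the following Python sketch. The function name and the `row_lineage_enabled` flag are hypothetical; only the `$row_id` column name comes from this thread.

```python
def resolve_id_column(user_id_column, row_lineage_enabled):
    """Pick the row identifier column for a vector index.

    user_id_column: column named in the ON TABLE clause, or None if omitted.
    row_lineage_enabled: whether the table exposes a stable $row_id.
    """
    if user_id_column is not None:
        return user_id_column      # an explicit choice always wins
    if row_lineage_enabled:
        return "$row_id"           # lineage-based fallback
    raise ValueError(
        "no id column given and the table has no row lineage identifier"
    )
```

With this shape, `ON TABLE candidates_table(embedding)` would be accepted for lineage-enabled tables and rejected with a clear error otherwise.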
| - Alternatively, use the hidden column $row_id for id and remove the requirement from users. | ||
|
|
||
| **`UPDATING FOR` clause** | ||
| - For each of the index partitions mapped to the range, an index will be created. |
It may be beneficial to make the partition specification in UPDATE VECTOR INDEX (e.g., FOR PARTITION (...)) optional rather than mandatory.
Instead of requiring users to manually specify partition ranges for index updates, the connector could automatically detect changed partitions and perform incremental index refresh, similar to materialized view incremental refresh.
This would reduce the need for manual orchestration and improve usability for incremental index maintenance.
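As a rough illustration of the detection step, the Python sketch below compares a per-partition version marker on the base table with the marker recorded when each index partition was built. The function name and the idea of an integer version per partition (for example, the snapshot ID that last touched it) are assumptions for illustration, not part of the proposal.

```python
def partitions_needing_refresh(base_versions, indexed_versions):
    """Return (stale, orphaned) partition keys.

    base_versions: partition key -> current version marker of the base table.
    indexed_versions: partition key -> marker recorded at index build time.
    """
    # Missing from the index, or built against an older version.
    stale = sorted(
        p for p, v in base_versions.items() if indexed_versions.get(p) != v
    )
    # Indexed, but the base partition no longer exists.
    orphaned = sorted(p for p in indexed_versions if p not in base_versions)
    return stale, orphaned


base = {"2026-01-03": 10, "2026-01-04": 12, "2026-01-05": 13}
indexed = {"2026-01-02": 7, "2026-01-03": 10, "2026-01-04": 11}
stale, orphaned = partitions_needing_refresh(base, indexed)
```

Here `stale` holds the backfilled partition `2026-01-04` and the new partition `2026-01-05`, while `orphaned` holds `2026-01-02`, whose base partition was dropped.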
cc: @tdcmeehan
|
|
||
| #### OSS Implementation | ||
|
|
||
| OSS implementation will be exact vector search and implement a ConnectorPlanOptimizer to rewrite to a plan with MAX_BY(). |
It will be beneficial to include a reusable toolkit/library as part of the proposal to handle internal query rewrites for vector search operations.
In many cases, executing ANN search would otherwise require users to explicitly join the base table with the corresponding vector index using a row identifier (e.g., user-defined id or $row_id). This can expose index implementation details and make query syntax more complex, requiring users to understand how embeddings and index data are related internally.
Providing a reusable component that transparently rewrites vector search predicates (e.g., ANN_SEARCH(...)) into the appropriate join between the base table and vector index would enable a simpler and more seamless user experience. This would allow users to express ANN queries without manually specifying join logic, while the underlying connector leverages the index automatically.
Such a library could be integrated by connectors (e.g., Iceberg) to perform these rewrites behind the scenes, avoiding duplication of rewrite logic across connectors and keeping index implementation details abstracted from end users.
cc: @tdcmeehan
aditi-pandit
left a comment
Thanks @gggrace14 for this write-up. Overall I'm in favor of this proposal.
I'm also curious, since FAISS functions were recently exposed in C++: are you planning any follow-up work with the library?
|
|
||
| ##### Journey 4: Search with implicit index lookup | ||
|
|
||
| "I want to run ANN search automatically using the index registered on my table." |
Does the optimizer know about the index as such? The use of the index should be abstracted by the VECTOR_SEARCH function, no?
|
|
||
| #### Update Partitioned Index | ||
|
|
||
| ```sql |
Are we missing the SPI for these? Can you elaborate?
| ```sql | ||
| SET SESSION vector_search_index = 'di:vector_index:nprobe=50'; | ||
|
|
||
| SELECT |
Have you considered implementing VECTOR_SEARCH as a table function? It is easier to express the input tables, partitioning, and parameters that way.
|
|
||
| #### Execution and SPI | ||
|
|
||
| Our proposal is to introduce generic SPI nodes: VectorSearchNode and VectorSearchOptions, which together provide an abstract interface for vector search. This abstract interface will only include mandatory parameters, allowing actual implementations to extend it. |
Can you give more details about VectorSearchNode and VectorSearchOptions? Will VectorSearchNode be a PlanNode?
| - The individual table partition is large. | ||
| - A new partition is added every day or every hour. | ||
| - Past partitions are occasionally backfilled. | ||
|
|
It would be helpful to clarify the behavior when base table partitions are dropped, since index partitions are mapped to them. Should corresponding index partitions be automatically cleaned up or require manual action?
Additionally, it may be beneficial to define lifecycle DDL such as DROP VECTOR INDEX for operational usability.
| Update only the index partition mapped to the base table partition ds = '2026-01-04'. | ||
|
|
||
| ```sql | ||
| UPDATE VECTOR INDEX vector_index |
It would be helpful to clarify query behavior during index build or refresh. For example, can queries read partially built partitions?
| CREATE VECTOR INDEX vector_index ON TABLE candidates_table(id, embedding) WITH ( | ||
| index_type = 'ivf_rabitq4', | ||
| distance_metric = 'cosine', | ||
| index_options = 'nlist=100000,nb=8', |
What do the parameters nlist and nb represent?
- How do these parameters affect indexing performance, recall, and query latency?
- What guidelines should users follow to choose optimal values for different dataset sizes and vector dimensions?
- Is there any automatic tuning mechanism available, or are users expected to benchmark and manually determine the best configuration?
Hi @NivinCS, in general the design defines only the minimum required parts as SPI and leaves individual connectors maximum flexibility to define their own behavior. Specifically, with respect to how index validity is maintained, we could leave it to the individual ConnectorPlanOptimizer to detect whether an index partition is stale and to return a reasonable result when it plans/rewrites the vector search query.

For SPI, thinking through your questions, I think what could be added here is a

I could describe below what we will put into the HiveConnector to maintain index partition validity at Meta, by answering your questions above. Meta has an internal HiveConnector extension. I think you can implement the Iceberg behaviors in the counterpart classes of IcebergConnector. Again, I think these belong to the connectors and not to the SPI.

1. How is index staleness detected?

   We'll have a pub-sub service that subscribes to the metadata of the candidate table (base table). If the corresponding base table partition is updated, overwritten, or deleted, it will trigger dropping the index partition. Then, in the search path, we have a VectorSearchRewritePlanOptimizer that implements ConnectorPlanOptimizer. In the VectorSearchRewritePlanOptimizer, we compare the available partitions of the index against those of the base table in the query range. If an index partition is available, we'll use it to get the kNN of that partition. Otherwise, we will build the index on the fly. Another option for an unavailable index partition is to fall back to exact search. We then aggregate across all partitions in the query range and compute the final kNN from the kNN results of each partition.

2. Are indexes expected to be snapshot-bound?
3. How should index cleanup or rebuild be handled after file rewrites or snapshot expiry?
4. Is automatic invalidation or maintenance of index metadata in scope?
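The search path described in the answer to question 1 can be sketched roughly as follows. This is an illustrative Python sketch, not the actual VectorSearchRewritePlanOptimizer: `search_range`, `ann_search`, and `exact_search` are hypothetical names, hits are assumed to be `(distance, row_id)` tuples, and the sketch uses the exact-search fallback rather than building the index on the fly.

```python
import heapq


def search_range(partitions, indexed, k, ann_search, exact_search):
    """Return the global k nearest hits across a partition range.

    partitions: partition keys in the query range.
    indexed: set of partition keys with an available index partition.
    ann_search, exact_search: callables (partition, k) -> [(distance, row_id)].
    """
    candidates = []
    for partition in partitions:
        # Use the index when it exists; otherwise fall back to exact search.
        search = ann_search if partition in indexed else exact_search
        candidates.extend(search(partition, k))
    # Merge the per-partition kNN lists into one global top-k by distance.
    return heapq.nsmallest(k, candidates)


data = {"p1": [(0.1, 1), (0.5, 2)], "p2": [(0.2, 3), (0.3, 4)]}
used = []


def fake_ann(partition, k):
    used.append(("ann", partition))
    return sorted(data[partition])[:k]


def fake_exact(partition, k):
    used.append(("exact", partition))
    return sorted(data[partition])[:k]


result = search_range(["p1", "p2"], {"p1"}, 2, fake_ann, fake_exact)
```

Because each partition returns its own k nearest neighbors, the global top-k is always contained in the union of the per-partition results, so the merge step loses nothing beyond the approximation already introduced by the ANN index.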
This RFC proposes the syntax and SPI to natively support Approximate Nearest Neighbor (ANN) vector search within Presto.