background processes simulator application for re-building vector indexes #32
Open
galsalomon66 wants to merge 10 commits into lancedb:main
Collaborator:
@galsalomon66 please split to 2 PRs:
…es to enable the rebuild of all indexes while put-vector and remove-vector requests run concurrently. (The relevant functionality resides in examples/s3vector_concurrent_service.cpp; the other C++ files are not relevant.)
On each put/remove, the index state is checked for its number of non-indexed vectors; once a configurable threshold is crossed, the index is rebuilt asynchronously (after the put/remove request completes).
A search request uses the most recently built index and runs a brute-force pass (scanning and checking the distance of each non-indexed vector).
None of these operations should be blocked by any other operation.
include/lancedb.h: adds the LanceDBIndexStats struct and the lancedb_table_index_stats() C API declaration. The struct exposes num_indexed_rows, num_unindexed_rows, and num_indices. The function retrieves index statistics for a named index directly from the LanceDB manifest.
examples/s3vector_concurrent_service.cpp: demonstrates an S3 vector concurrent service with dynamic schema, index rebuild, and flock-based coordination.
tests/create_table_insert_vectors_and_query_demo.sh: shell test/demo that creates a bucket, creates an index, inserts vectors, and queries with and without filters.
Signed-off-by: gal salomon <gal.salomon@gmail.com>
…ata is part of the table metadata, instead of being saved to the local filesystem.
Signed-off-by: gal salomon <gal.salomon@gmail.com>
When S3V_BACKEND=s3, LanceDB connects with s3:// URIs and S3 storage options are passed to the connection builder.
The application can specify S3 storage options via environment variables:
- S3V_ENDPOINT
- S3V_REGION
- S3V_ACCESS_KEY_ID
- S3V_SECRET_ACCESS_KEY
- S3V_BUCKET
LanceDBHelper::connect() adds S3 storage options (endpoint, aws_region, aws_access_key_id, etc.) when in S3 mode.
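As a rough illustration of how the environment variables above could be turned into storage options, here is a minimal sketch. The helper names are hypothetical and are not the code in LanceDBHelper::connect(). The option keys endpoint, aws_region and aws_access_key_id come from the description; aws_secret_access_key and the s3://<bucket> URI shape are assumptions.

```cpp
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical helper: collects S3 settings from the environment into a
// key/value map that a connection builder could consume.
// "aws_secret_access_key" is an assumed key name (the description only
// lists endpoint, aws_region, aws_access_key_id, "etc.").
std::map<std::string, std::string> s3_storage_options_from_env() {
    static const struct { const char* env; const char* key; } kMapping[] = {
        {"S3V_ENDPOINT",          "endpoint"},
        {"S3V_REGION",            "aws_region"},
        {"S3V_ACCESS_KEY_ID",     "aws_access_key_id"},
        {"S3V_SECRET_ACCESS_KEY", "aws_secret_access_key"},
    };
    std::map<std::string, std::string> options;
    for (const auto& m : kMapping) {
        const char* value = std::getenv(m.env);
        // Unset or empty variables are skipped rather than passed through.
        if (value != nullptr && *value != '\0') {
            options[m.key] = value;
        }
    }
    return options;
}

// Builds an s3:// URI from S3V_BUCKET; returns an empty string when the
// variable is unset. The exact URI layout is an assumption of this sketch.
std::string s3_uri_from_env() {
    const char* bucket = std::getenv("S3V_BUCKET");
    return bucket ? std::string("s3://") + bucket : std::string();
}
```

Skipping unset variables keeps the map limited to options the user actually supplied, so any defaults in the underlying client would still apply.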
Index state (config, dimension, thresholds, build flags) is stored in LanceDB table metadata using lancedb_table_set_metadata()/lancedb_table_get_metadata() with the key s3v_index_state.
The state is now embedded in the Lance manifest, a single S3 object per version.
Added LanceDBHelper::save_table_metadata() and load_table_metadata() helpers.
S3Lock — Distributed Lock via S3 Conditional Writes.
S3Lock class implements Base_lock using S3 conditional PUT (If-None-Match: *). Lock objects are stored at s3://<bucket>/.locks/<index>_index.lock.
Lock protocol:
- lock_exclusive(): PUT lock object with If-None-Match: * — only one process succeeds. Retries with configurable interval.
- unlock(): DELETE the lock object.
- Crash recovery: Lock objects contain a timestamp. check_stale_and_reclaim() reads the lock, checks if it exceeds TTL (default 5 min), deletes and re-acquires if stale.
This provides a simple distributed locking mechanism without external services, suitable for coordinating index builds across multiple processes or machines while concurrent operations run on the same LanceDB table.
Note that LanceDB internals already allow concurrent put-vector/get-vector/query-vector/delete-vector operations while maintaining consistency and correctness; S3Lock is not involved in the vector operations.
S3Lock is specifically for coordinating index builds, which are critical sections that require exclusive access to ensure the integrity of the index state.
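The lock protocol above can be sketched with an in-memory stand-in for the bucket. Everything here is hypothetical (FakeObjectStore, SketchLock and the max_attempts parameter are not part of the PR); put_if_absent() only models the effect of a PUT with If-None-Match: *, and a steady clock replaces the timestamp stored in the real lock object.

```cpp
#include <chrono>
#include <map>
#include <mutex>
#include <string>
#include <thread>

using Clock = std::chrono::steady_clock;

// In-memory stand-in for an S3 bucket (hypothetical; a real client would
// issue signed HTTP requests instead).
class FakeObjectStore {
public:
    // Models PUT with If-None-Match: * : succeeds only if the key is absent.
    bool put_if_absent(const std::string& key, Clock::time_point stamp) {
        std::lock_guard<std::mutex> g(mu_);
        return objects_.emplace(key, stamp).second;
    }
    // Reads the stored timestamp; returns false if the key does not exist.
    bool get(const std::string& key, Clock::time_point& out) {
        std::lock_guard<std::mutex> g(mu_);
        auto it = objects_.find(key);
        if (it == objects_.end()) return false;
        out = it->second;
        return true;
    }
    void remove(const std::string& key) {
        std::lock_guard<std::mutex> g(mu_);
        objects_.erase(key);
    }
private:
    std::mutex mu_;
    std::map<std::string, Clock::time_point> objects_;
};

class SketchLock {
public:
    SketchLock(FakeObjectStore& store, const std::string& index,
               std::chrono::milliseconds ttl, std::chrono::milliseconds retry)
        : store_(store), key_(".locks/" + index + "_index.lock"),
          ttl_(ttl), retry_(retry) {}

    // Tries the conditional PUT up to max_attempts times, reclaiming a
    // stale lock along the way and sleeping between attempts.
    bool lock_exclusive(int max_attempts) {
        for (int i = 0; i < max_attempts; ++i) {
            if (store_.put_if_absent(key_, Clock::now())) return true;
            if (check_stale_and_reclaim()) return true;
            std::this_thread::sleep_for(retry_);
        }
        return false;
    }

    void unlock() { store_.remove(key_); }

    // Deletes and re-acquires the lock if its timestamp is older than the TTL.
    bool check_stale_and_reclaim() {
        Clock::time_point stamp;
        if (!store_.get(key_, stamp)) return false;
        if (Clock::now() - stamp <= ttl_) return false;
        store_.remove(key_);
        return store_.put_if_absent(key_, Clock::now());
    }
private:
    FakeObjectStore& store_;
    std::string key_;
    std::chrono::milliseconds ttl_;
    std::chrono::milliseconds retry_;
};
```

In this sketch the delete and the re-acquire during reclaim are two separate steps, so two reclaimers can race; the final conditional put still lets only one of them win, but a freshly re-acquired lock could be deleted by a slower reclaimer, which a real implementation would need to guard against.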
Backend selection in FrontendLocker:
- LOCAL mode: FileLock (flock-based, for testing/debugging)
- S3 mode: S3Lock (distributed, production-ready)
Added S3 HTTP layer: libcurl + SigV4 signing.
The S3 operations (lock objects, bucket creation) need authenticated HTTP requests with AWS SigV4 signing. Using the full AWS SDK would add many libraries as dependencies.
This layer is a lightweight S3 HTTP client that uses libcurl plus manual SigV4 signing (OpenSSL HMAC-SHA256).
Added s3_http.h/cpp with all S3 HTTP operations:
- s3_put_object() (with optional If-None-Match)
- s3_get_object()
- s3_delete_object()
- s3_create_bucket()
- s3_head_bucket()
Signed-off-by: gal salomon <gal.salomon@gmail.com>
Added lancedb_query_explain_plan() and lancedb_vector_query_explain_plan(). Both call explain_plan(verbose) instead of execute(); the query object remains valid and can be executed afterwards. The plan string is returned via plan_out and must be freed by the caller with lancedb_free_string(). Also added handle_error to the imports.
include/lancedb.h: added declarations for lancedb_query_explain_plan() and lancedb_vector_query_explain_plan().
examples/s3vector_concurrent_service.cpp: application changes
1. explainPlan flag in QueryVectors: when "explainPlan": true is set in the request, lancedb_vector_query_explain_plan() is called before lancedb_vector_query_execute(). The plan string is returned in the response as "queryPlan".
2. filterOnly mode in QueryVectors: when "filterOnly": true is set, the handler uses lancedb_query_new() (regular query) instead of lancedb_vector_query_new() (vector query), skipping vector similarity search entirely. queryVector is not required in this mode; a filter is required. The distanceMetric field is omitted from the response since no distance computation occurs.
3. LanceDBHelper::filter_only_query(): new static method that builds and executes a regular (non-vector) query with scalar filters. Supports explainPlan via lancedb_query_explain_plan().
4. LanceDBHelper::query_vectors(): added bool explain_plan and std::string& explain_plan_output parameters.
Signed-off-by: gal salomon <gal.salomon@gmail.com>
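The request rules for filterOnly described in this commit (queryVector optional, filter mandatory, no distanceMetric in the response) could be checked along these lines. This is a minimal sketch: QueryRequestSketch, validate_query_request() and the error strings are hypothetical and do not reflect the handler's actual types.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory form of a QueryVectors request; an empty vector
// or string stands for "field not provided".
struct QueryRequestSketch {
    bool explain_plan = false;        // "explainPlan" in the request
    bool filter_only = false;         // "filterOnly" in the request
    std::vector<float> query_vector;  // "queryVector"
    std::string filter;               // scalar filter expression
};

// Returns an empty string when the request is acceptable, otherwise a
// short reason (the messages are illustrative only).
std::string validate_query_request(const QueryRequestSketch& r) {
    if (r.filter_only) {
        // Regular (non-vector) query: a filter is mandatory, the vector is not.
        if (r.filter.empty()) return "filterOnly requires a filter";
        return "";
    }
    // Vector query: the query vector is mandatory, the filter is optional.
    if (r.query_vector.empty()) return "queryVector is required";
    return "";
}

// distanceMetric is only meaningful when a similarity search actually ran.
bool response_has_distance_metric(const QueryRequestSketch& r) {
    return !r.filter_only;
}
```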
…fter table creation), list all indexes with detailed info, and drop indexes by name, all via the existing CLI interface.
New struct:
- LanceDBIndexInfo: struct with name, index_type, columns, num_columns
New functions:
- lancedb_table_list_indices_detailed(): calls the LanceDB SDK's list_indices() and returns full IndexConfig data (name, type, columns) instead of just names
- lancedb_free_index_list_detailed(): frees the struct array and all nested allocations
- sdk_index_type_to_c(): maps the SDK's IndexType enum to the C API's LanceDBIndexType
examples/s3vector_concurrent_service.cpp: adds
- CreateScalarIndex: each column can have a different type (BTREE, BITMAP, LABELLIST; default BTREE)
- ListScalarIndexes: lists all indexes on the table with name, type, and columns
- DropScalarIndex: drops an index by its auto-generated name (e.g., category_idx)
Unit tests (tests/test_scalar_index.cpp): several unit tests covering the new lancedb_table_list_indices_detailed C binding
Signed-off-by: gal salomon <gal.salomon@gmail.com>
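To make the per-column type selection and the auto-generated name concrete, here is a small sketch. ScalarIndexKind is a hypothetical enum (not the LanceDBIndexType from lancedb.h), the case-insensitive parsing is an assumption, and the <column>_idx naming is only inferred from the category_idx example above.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical enum for this sketch only.
enum class ScalarIndexKind { BTree, Bitmap, LabelList };

// Parses a type name; an empty name falls back to BTREE, matching the
// default stated above. Returns false for unknown names.
bool parse_scalar_index_kind(std::string name, ScalarIndexKind& out) {
    // Upper-case the input so "btree" and "BTREE" are treated alike
    // (case-insensitivity is an assumption of this sketch).
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    if (name.empty() || name == "BTREE") { out = ScalarIndexKind::BTree; return true; }
    if (name == "BITMAP") { out = ScalarIndexKind::Bitmap; return true; }
    if (name == "LABELLIST") { out = ScalarIndexKind::LabelList; return true; }
    return false;
}

// Derives the index name from the column name, following the pattern
// suggested by the category_idx example.
std::string default_index_name(const std::string& column) {
    return column + "_idx";
}
```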
2badb0e to 1eec173
Signed-off-by: gal salomon <gal.salomon@gmail.com>
…ke is missing ${TEST_ENV_PREFIX}, which causes a failure in the test run
Signed-off-by: gal salomon <gal.salomon@gmail.com>
This utility provides a basic simulation of the background processes that enable the rebuild of all indexes while put-vector and remove-vector requests run concurrently.
(The relevant functionality resides in examples/s3vector_concurrent_service.cpp.)
On each put/remove, the index state is checked for its number of non-indexed vectors; once a configurable threshold is crossed, the index is rebuilt asynchronously (after the put/remove request completes).
A search request uses the most recently built index and runs a brute-force pass (scanning and checking the distance of each non-indexed vector).
None of these operations should be blocked by any other operation.
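One way to picture the search path described above is a merge of the hits returned by the last built index with a brute-force scan over the rows that are not indexed yet. The sketch below is an assumption about how the two result sets might be combined, not the service's code; Hit, l2_squared() and merge_with_unindexed() are hypothetical names, and squared L2 is used only as an example metric.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Hit {
    std::uint64_t id;
    float distance;
};

// Squared L2 distance over the common prefix of both vectors.
float l2_squared(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Adds a brute-force distance for every unindexed row to the hits already
// returned by the index, then keeps the k nearest overall. The index hits
// are assumed to carry distances in the same metric.
std::vector<Hit> merge_with_unindexed(
    std::vector<Hit> index_hits,
    const std::vector<std::pair<std::uint64_t, std::vector<float>>>& unindexed,
    const std::vector<float>& query,
    std::size_t k) {
    for (const auto& row : unindexed) {
        index_hits.push_back(Hit{row.first, l2_squared(row.second, query)});
    }
    std::sort(index_hits.begin(), index_hits.end(),
              [](const Hit& x, const Hit& y) { return x.distance < y.distance; });
    if (index_hits.size() > k) index_hits.resize(k);
    return index_hits;
}
```

The cost of the scan grows with the number of unindexed rows, which is why bounding that number with a rebuild threshold matters.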
include/lancedb.h : Adds the LanceDBIndexStats struct and the lancedb_table_index_stats() C API declaration. The struct exposes num_indexed_rows, num_unindexed_rows, and num_indices. The function retrieves index statistics for a named index directly from the LanceDB manifest.
examples/s3vector_concurrent_service.cpp : demonstrates an S3 vector concurrent service with dynamic schema, index rebuild, and flock-based coordination.
tests/create_table_insert_vectors_and_query_demo.sh : shell test/demo that creates a bucket, creates an index, inserts vectors, and queries with/without filters.
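The rebuild trigger described above can be expressed as a small predicate over the counters that LanceDBIndexStats is said to expose. This is a sketch under assumptions: IndexStatsSketch is a hypothetical mirror of those fields (the real field types are not shown here), and treating the threshold as a fraction of total rows is a guess, since the description only says the threshold is configurable.

```cpp
#include <cstdint>

// Hypothetical mirror of the counters described for LanceDBIndexStats.
struct IndexStatsSketch {
    std::uint64_t num_indexed_rows;
    std::uint64_t num_unindexed_rows;
    std::uint64_t num_indices;
};

// Returns true when the share of unindexed rows exceeds the configured
// fraction. An empty table never triggers a rebuild.
bool needs_rebuild(const IndexStatsSketch& stats, double max_unindexed_fraction) {
    const std::uint64_t total = stats.num_indexed_rows + stats.num_unindexed_rows;
    if (total == 0) return false;
    const double fraction = static_cast<double>(stats.num_unindexed_rows) /
                            static_cast<double>(total);
    return fraction > max_unindexed_fraction;
}
```

A caller would evaluate this after each put/remove completes and, when it returns true, schedule the rebuild in the background so the request itself is not delayed.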
TODO: explain-plan is missing from the description; create-scalar-index and filtering by scalar index only are missing as well.