
background processes simulator application for re-building vector indexes #32

Open
galsalomon66 wants to merge 10 commits into lancedb:main from galsalomon66:s3vector_background_process


@galsalomon66
Contributor

@galsalomon66 galsalomon66 commented Mar 16, 2026

This utility provides a basic simulation of the background processes that rebuild all indexes while put-vector and remove-vector operations run concurrently.

(The relevant functionality resides in examples/s3vector_concurrent_service.cpp.)

After each put/remove, the index state is checked for its level of non-indexed vectors; once a configurable threshold is reached, the index is rebuilt asynchronously (after the put/remove request completes).

A search request uses the most recently built index and runs a brute-force pass (scanning and computing the distance for each non-indexed vector).
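To make the search path concrete, here is a minimal sketch of merging index results with a brute-force scan over rows that are not indexed yet. It assumes squared L2 distance and in-memory vectors; `Hit`, `l2_distance_sq` and `merge_with_unindexed` are hypothetical names for illustration and are not part of this PR or of the LanceDB API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical result record: row id plus squared L2 distance to the query.
struct Hit {
    std::uint64_t id;
    float distance;
};

// Squared L2 distance between two vectors of equal dimension.
inline float l2_distance_sq(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Combine candidates returned by the latest built index with a brute-force
// scan of the rows written after that index was built, then keep the k nearest.
inline std::vector<Hit> merge_with_unindexed(
    std::vector<Hit> indexed_hits,
    const std::vector<std::pair<std::uint64_t, std::vector<float>>>& unindexed_rows,
    const std::vector<float>& query,
    std::size_t k) {
    for (const auto& row : unindexed_rows) {
        indexed_hits.push_back({row.first, l2_distance_sq(row.second, query)});
    }
    std::sort(indexed_hits.begin(), indexed_hits.end(),
              [](const Hit& x, const Hit& y) { return x.distance < y.distance; });
    if (indexed_hits.size() > k) indexed_hits.resize(k);
    return indexed_hits;
}
```

The cost of the brute-force pass grows with the number of non-indexed rows, which is why the rebuild threshold matters for query latency.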

None of these operations should be blocked by any other operation.

include/lancedb.h : Adds the LanceDBIndexStats struct and the lancedb_table_index_stats() C API declaration. The struct exposes num_indexed_rows, num_unindexed_rows, and num_indices. The function retrieves index statistics for a named index directly from the LanceDB manifest.
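As an illustration of how these statistics could drive the rebuild decision, the sketch below checks the share of non-indexed rows against a threshold. `IndexStatsSketch` is a local stand-in that only mirrors the field names listed above; the field types, the ratio-based threshold and the name `should_rebuild` are assumptions, not the PR's actual code.

```cpp
#include <cassert>
#include <cstddef>

// Local stand-in mirroring the fields listed for LanceDBIndexStats.
// The field types are an assumption; the real struct lives in include/lancedb.h.
struct IndexStatsSketch {
    std::size_t num_indexed_rows;
    std::size_t num_unindexed_rows;
    std::size_t num_indices;
};

// Returns true when the fraction of non-indexed rows reaches the threshold
// (a ratio in [0, 1]). Whether the PR uses a ratio or a row count is not
// stated, so the ratio is an assumption.
inline bool should_rebuild(const IndexStatsSketch& s, double unindexed_ratio_threshold) {
    assert(unindexed_ratio_threshold >= 0.0 && unindexed_ratio_threshold <= 1.0);
    const std::size_t total = s.num_indexed_rows + s.num_unindexed_rows;
    if (total == 0) return false;  // empty table: nothing to index
    const double ratio =
        static_cast<double>(s.num_unindexed_rows) / static_cast<double>(total);
    return ratio >= unindexed_ratio_threshold;
}
```

A caller would run this check after a put/remove has completed and, when it returns true, schedule the rebuild in the background instead of rebuilding inline.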

examples/s3vector_concurrent_service.cpp : Demonstrates an S3 vector concurrent service with dynamic schema, index rebuild, and flock-based coordination.

tests/create_table_insert_vectors_and_query_demo.sh : shell test/demo that creates a bucket, creates an index, inserts vectors, and queries with/without filters.

TODO : explain-plan is missing from this description; create-scalar-index and filtering by scalar index only are also missing.

@yuvalif
Collaborator

yuvalif commented Mar 24, 2026

@galsalomon66 please split into 2 PRs:

  • index stats api + simple unit tests
  • standalone background application

@galsalomon66
Contributor Author

#35

…es to enable the rebuild of all indexes, while put and remove vectors runs concurrently.

(The relevant functionality resides in examples/s3vector_concurrent_service.cpp; the other C++ files are not relevant.)

After each put/remove, the index state is checked for its level of non-indexed vectors; once a configurable threshold is reached, the index is rebuilt asynchronously (after the put/remove request completes).

A search request uses the most recently built index and runs a brute-force pass (scanning and computing the distance for each non-indexed vector).

None of these operations should be blocked by any other operation.

include/lancedb.h : Adds the LanceDBIndexStats struct and the lancedb_table_index_stats() C API declaration. The struct exposes num_indexed_rows, num_unindexed_rows, and num_indices. The function retrieves index statistics for a named index directly from the LanceDB manifest.
examples/s3vector_concurrent_service.cpp : Demonstrates an S3 vector concurrent service with dynamic schema, index rebuild, and flock-based coordination.
tests/create_table_insert_vectors_and_query_demo.sh : shell test/demo that creates a bucket, creates an index, inserts vectors, and queries with/without filters.

Signed-off-by: gal salomon <gal.salomon@gmail.com>
…ata is part of the table metadata.

This replaces saving it to the local filesystem.

Signed-off-by: gal salomon <gal.salomon@gmail.com>
When S3V_BACKEND=s3, LanceDB connects with s3:// URIs and S3 storage options are passed to the connection builder.
The application reads its S3 storage options from environment variables:
 - S3V_ENDPOINT
 - S3V_REGION
 - S3V_ACCESS_KEY_ID
 - S3V_SECRET_ACCESS_KEY
 - S3V_BUCKET

LanceDBHelper::connect() adds S3 storage options (endpoint, aws_region, aws_access_key_id, etc.) when in S3 mode.
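A minimal sketch of collecting those options from the environment follows. The option keys `endpoint`, `aws_region` and `aws_access_key_id` come from the description above; the key `aws_secret_access_key` and the helper names `add_option_from_env` and `s3_storage_options_from_env` are assumptions made for illustration, and the real LanceDBHelper::connect() may be structured differently.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <string>

// Copy an environment variable into the options map under `key`, but only if
// the variable is set and non-empty.
inline void add_option_from_env(std::map<std::string, std::string>& opts,
                                const std::string& key, const char* env_name) {
    assert(env_name != nullptr);
    const char* value = std::getenv(env_name);
    if (value != nullptr && *value != '\0') opts[key] = value;
}

// Build the storage options for S3 mode; returns an empty map in any other mode.
// The "aws_secret_access_key" key name is an assumption.
inline std::map<std::string, std::string> s3_storage_options_from_env() {
    std::map<std::string, std::string> opts;
    const char* backend = std::getenv("S3V_BACKEND");
    if (backend == nullptr || std::string(backend) != "s3") return opts;
    add_option_from_env(opts, "endpoint", "S3V_ENDPOINT");
    add_option_from_env(opts, "aws_region", "S3V_REGION");
    add_option_from_env(opts, "aws_access_key_id", "S3V_ACCESS_KEY_ID");
    add_option_from_env(opts, "aws_secret_access_key", "S3V_SECRET_ACCESS_KEY");
    return opts;
}
```

Unset variables are skipped rather than passed as empty strings, so any defaults of the underlying client stay in effect.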

Index state (config, dimension, thresholds, build flags) is stored in LanceDB table metadata using lancedb_table_set_metadata()/lancedb_table_get_metadata() under the key s3v_index_state.
The state is now embedded in the Lance manifest, a single S3 object per version.
Added LanceDBHelper::save_table_metadata() and load_table_metadata() helpers.

S3Lock — Distributed Lock via S3 Conditional Writes.
S3Lock class implements Base_lock using S3 conditional PUT (If-None-Match: *). Lock objects are stored at s3://<bucket>/.locks/<index>_index.lock.
Lock protocol:
 - lock_exclusive(): PUT lock object with If-None-Match: * — only one process succeeds. Retries with configurable interval.
 - unlock(): DELETE the lock object.
 - Crash recovery: Lock objects contain a timestamp. check_stale_and_reclaim() reads the lock, checks if it exceeds TTL (default 5 min), deletes and re-acquires if stale.
This provides a simple distributed locking mechanism without external services, suitable for coordinating index builds across multiple processes or machines while concurrent operations run on the same LanceDB table.
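The lock protocol can be sketched against an in-memory stand-in for S3. `FakeObjectStore::put_if_absent` models a PUT with If-None-Match: *, and `LockSketch` is a hypothetical class written for illustration; it is not the PR's S3Lock, and it takes the current time as a parameter so the TTL logic is easy to follow.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <optional>
#include <string>
#include <utility>

using WallClock = std::chrono::system_clock;

// In-memory stand-in for a bucket. put_if_absent models PUT with
// If-None-Match: *, which succeeds only when no object exists under the key.
class FakeObjectStore {
public:
    bool put_if_absent(const std::string& key, WallClock::time_point created) {
        return objects_.emplace(key, created).second;
    }
    std::optional<WallClock::time_point> get(const std::string& key) const {
        const auto it = objects_.find(key);
        if (it == objects_.end()) return std::nullopt;
        return it->second;
    }
    void remove(const std::string& key) { objects_.erase(key); }

private:
    std::map<std::string, WallClock::time_point> objects_;
};

class LockSketch {
public:
    LockSketch(FakeObjectStore& store, std::string key, WallClock::duration ttl)
        : store_(store), key_(std::move(key)), ttl_(ttl) {
        assert(ttl_ > WallClock::duration::zero());
    }

    // One acquisition attempt; a lock_exclusive() would retry around this.
    bool try_lock(WallClock::time_point now) {
        if (store_.put_if_absent(key_, now)) return true;
        return check_stale_and_reclaim(now);
    }
    void unlock() { store_.remove(key_); }

private:
    // Delete a lock whose timestamp is older than the TTL, then try to take it.
    bool check_stale_and_reclaim(WallClock::time_point now) {
        const auto created = store_.get(key_);
        if (!created || now - *created <= ttl_) return false;
        store_.remove(key_);
        return store_.put_if_absent(key_, now);
    }

    FakeObjectStore& store_;
    std::string key_;
    WallClock::duration ttl_;
};
```

Against real S3 the read, delete and re-PUT in the reclaim step are separate requests, so two reclaimers can interleave; the sketch does not address that case.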

Note that the LanceDB internals already allow concurrent put-vector/get-vector/query-vector/delete-vector operations while maintaining consistency and correctness. The S3Lock is not involved in the vector operations.

The S3Lock is specifically for coordinating index builds, which are critical sections that require exclusive access to ensure the integrity of the index state.

Backend selection in FrontendLocker:
 - LOCAL mode: FileLock (flock-based, for testing/debugging)
 - S3 mode: S3Lock (distributed, production-ready)
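For the LOCAL mode, a flock-based lock can be sketched as a small RAII wrapper around the POSIX flock(2) call. `FlockGuard` is a hypothetical name used for illustration; the PR's FileLock may differ in interface and in how it blocks or retries.

```cpp
#include <cassert>
#include <fcntl.h>
#include <string>
#include <sys/file.h>
#include <unistd.h>

// RAII wrapper around flock(2). The lock belongs to the open file description,
// so two independent open() calls on the same path contend with each other,
// even inside one process.
class FlockGuard {
public:
    explicit FlockGuard(const std::string& path)
        : fd_(::open(path.c_str(), O_CREAT | O_RDWR, 0644)) {
        assert(!path.empty());
    }
    ~FlockGuard() {
        if (fd_ >= 0) {
            if (locked_) ::flock(fd_, LOCK_UN);
            ::close(fd_);
        }
    }
    FlockGuard(const FlockGuard&) = delete;
    FlockGuard& operator=(const FlockGuard&) = delete;

    // Non-blocking exclusive lock; returns false if another holder has it
    // or the file could not be opened.
    bool try_lock_exclusive() {
        if (fd_ < 0) return false;
        locked_ = (::flock(fd_, LOCK_EX | LOCK_NB) == 0);
        return locked_;
    }
    void unlock() {
        if (locked_) {
            ::flock(fd_, LOCK_UN);
            locked_ = false;
        }
    }

private:
    int fd_;
    bool locked_ = false;
};
```

The kernel drops a flock when the holding process exits, so this variant needs no TTL-based stale-lock handling, unlike a lock object stored in S3.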

Added S3 HTTP layer: libcurl + SigV4 signing
  The S3 operations (lock objects, bucket creation) need authenticated HTTP requests with AWS SigV4 signing. Using the full AWS SDK would add many libraries as dependencies.
  This layer is a lightweight S3 HTTP client built on libcurl with manual SigV4 signing (OpenSSL HMAC-SHA256).
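To show what manual SigV4 signing involves, the sketch below builds two of the text pieces that SigV4 defines: the credential scope and the string to sign. The function names are hypothetical, and the HMAC-SHA256 signing-key chain and the canonical request hash, which the PR computes with OpenSSL, are left out so the example needs only the standard library.

```cpp
#include <cassert>
#include <string>

// Credential scope as defined by SigV4: <YYYYMMDD>/<region>/<service>/aws4_request
inline std::string sigv4_credential_scope(const std::string& date_yyyymmdd,
                                          const std::string& region,
                                          const std::string& service) {
    assert(date_yyyymmdd.size() == 8);
    return date_yyyymmdd + "/" + region + "/" + service + "/aws4_request";
}

// String to sign: algorithm name, request timestamp (ISO 8601 basic format,
// for example 20260409T093300Z), credential scope, and the hex-encoded SHA-256
// of the canonical request, joined by newlines.
inline std::string sigv4_string_to_sign(const std::string& amz_date,
                                        const std::string& scope,
                                        const std::string& canonical_request_sha256_hex) {
    return "AWS4-HMAC-SHA256\n" + amz_date + "\n" + scope + "\n" +
           canonical_request_sha256_hex;
}
```

The final signature is the hex-encoded HMAC-SHA256 of this string under a key derived from the secret access key, date, region and service.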

Added s3_http.h/cpp with all S3 HTTP operations:
 - s3_put_object() (with optional If-None-Match)
 - s3_get_object()
 - s3_delete_object()
 - s3_create_bucket()
 - s3_head_bucket()

Signed-off-by: gal salomon <gal.salomon@gmail.com>
Added lancedb_query_explain_plan() and lancedb_vector_query_explain_plan(). Both call explain_plan(verbose) instead of execute().
The query object remains valid and can be executed afterwards. The plan string is returned via plan_out and must be freed by the caller with lancedb_free_string().
Also added handle_error to the imports.

include/lancedb.h : Added declarations for lancedb_query_explain_plan() and lancedb_vector_query_explain_plan().

examples/s3vector_concurrent_service.cpp — Application changes

1. explainPlan flag in QueryVectors: When "explainPlan": true is set in the request, lancedb_vector_query_explain_plan() is called before lancedb_vector_query_execute(). The plan string is returned in the response as "queryPlan".

2. filterOnly mode in QueryVectors: When "filterOnly": true is set, the handler uses lancedb_query_new() (regular query) instead of lancedb_vector_query_new() (vector query), skipping vector similarity search entirely. queryVector is not required in this mode. A filter is required. The distanceMetric field is omitted from the response since no distance computation occurs.

3. LanceDBHelper::filter_only_query(): New static method that builds and executes a regular (non-vector) query with scalar filters. Supports explainPlan via lancedb_query_explain_plan().

4. LanceDBHelper::query_vectors(): Added bool explain_plan and std::string& explain_plan_output parameters.
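The request rules in items 1 and 2 can be summarised in a small validation sketch. `QueryRequestSketch`, `QueryPath`, `choose_query_path` and `optional_response_fields` are hypothetical names for illustration; only the JSON field names "queryPlan" and "distanceMetric" come from the description above, and the error texts are made up.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical in-memory form of a QueryVectors request.
struct QueryRequestSketch {
    bool filter_only = false;
    bool explain_plan = false;
    std::optional<std::string> filter;
    std::vector<float> query_vector;
};

enum class QueryPath { Vector, FilterOnly };

// Pick the query path, or return std::nullopt with `error` set when the
// request violates the rules: filterOnly needs a filter, and a vector query
// needs a queryVector.
inline std::optional<QueryPath> choose_query_path(const QueryRequestSketch& req,
                                                  std::string& error) {
    error.clear();
    if (req.filter_only) {
        if (!req.filter || req.filter->empty()) {
            error = "filterOnly requires a filter";
            return std::nullopt;
        }
        return QueryPath::FilterOnly;  // queryVector is not needed here
    }
    if (req.query_vector.empty()) {
        error = "queryVector is required unless filterOnly is set";
        return std::nullopt;
    }
    return QueryPath::Vector;
}

// Optional response fields: "distanceMetric" only when a distance was
// computed, "queryPlan" only when a plan was requested.
inline std::vector<std::string> optional_response_fields(QueryPath path, bool explain_plan) {
    std::vector<std::string> fields;
    if (path == QueryPath::Vector) fields.push_back("distanceMetric");
    if (explain_plan) fields.push_back("queryPlan");
    return fields;
}
```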

Signed-off-by: gal salomon <gal.salomon@gmail.com>
…fter table creation), list all indexes with detailed info, and drop indexes by name — all via the existing CLI interface.

New struct: LanceDBIndexInfo — struct with name, index_type, columns, num_columns

New functions:
- lancedb_table_list_indices_detailed() — calls the LanceDB SDK's list_indices() and returns full IndexConfig data (name, type, columns) instead of just names
- lancedb_free_index_list_detailed() — frees the struct array and all nested allocations
- sdk_index_type_to_c() — maps the SDK's IndexType enum to the C API's LanceDBIndexType
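The "frees the struct array and all nested allocations" contract is easier to see with a concrete layout. The sketch below uses a local stand-in, `IndexInfoSketch`, that mirrors the four field names listed above; the field types, the malloc-based allocation and the helper names are assumptions, and the real binding may allocate and release this memory differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Local stand-in for LanceDBIndexInfo. Field names follow the PR text; the
// field types and the allocator are assumptions.
struct IndexInfoSketch {
    char* name;
    int index_type;
    char** columns;
    std::size_t num_columns;
};

// Heap-allocate a copy of a C string with malloc.
inline char* dup_cstr(const char* s) {
    assert(s != nullptr);
    const std::size_t n = std::strlen(s) + 1;
    char* out = static_cast<char*>(std::malloc(n));
    if (out != nullptr) std::memcpy(out, s, n);
    return out;
}

// Fill one entry, taking copies of the name and of every column name.
inline void fill_index_info(IndexInfoSketch& info, const char* name, int index_type,
                            const char* const* columns, std::size_t num_columns) {
    info.name = dup_cstr(name);
    info.index_type = index_type;
    info.num_columns = num_columns;
    info.columns = static_cast<char**>(std::calloc(num_columns, sizeof(char*)));
    for (std::size_t i = 0; info.columns != nullptr && i < num_columns; ++i) {
        info.columns[i] = dup_cstr(columns[i]);
    }
}

// Free the array and every nested allocation, innermost first.
inline void free_index_list(IndexInfoSketch* list, std::size_t count) {
    if (list == nullptr) return;
    for (std::size_t i = 0; i < count; ++i) {
        if (list[i].columns != nullptr) {
            for (std::size_t c = 0; c < list[i].num_columns; ++c) {
                std::free(list[i].columns[c]);
            }
            std::free(list[i].columns);
        }
        std::free(list[i].name);
    }
    std::free(list);
}
```

Callers of the real API should release the list only through lancedb_free_index_list_detailed(), since the allocator behind the binding is not necessarily malloc.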

examples/s3vector_concurrent_service.cpp :
CreateScalarIndex : Creates scalar indexes; each column can have a different type (BTREE, BITMAP, LABELLIST; default BTREE)
ListScalarIndexes : Lists all indexes on the table with name, type, and columns
DropScalarIndex :  Drops an index by its auto-generated name (e.g., category_idx)
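A short sketch of the type handling for CreateScalarIndex and of the index name used by DropScalarIndex follows. `ScalarIndexKind`, `parse_scalar_index_kind` and `default_index_name` are hypothetical names; case-insensitive parsing is an assumption, and the `<column>_idx` pattern is inferred from the category_idx example rather than confirmed by the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <optional>
#include <string>

enum class ScalarIndexKind { BTree, Bitmap, LabelList };

// Parse the index type from a request. An empty value falls back to BTREE,
// the documented default; an unknown value yields std::nullopt.
inline std::optional<ScalarIndexKind> parse_scalar_index_kind(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    if (text.empty() || text == "BTREE") return ScalarIndexKind::BTree;
    if (text == "BITMAP") return ScalarIndexKind::Bitmap;
    if (text == "LABELLIST") return ScalarIndexKind::LabelList;
    return std::nullopt;
}

// Index name in the `<column>_idx` form (assumed from the category_idx example).
inline std::string default_index_name(const std::string& column) {
    assert(!column.empty());
    return column + "_idx";
}
```

Rejecting unknown type names up front keeps a typo in a request from silently creating a BTREE index.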

Unit tests (tests/test_scalar_index.cpp) : several unit tests covering the new lancedb_table_list_indices_detailed C binding

Signed-off-by: gal salomon <gal.salomon@gmail.com>
@galsalomon66 galsalomon66 force-pushed the s3vector_background_process branch from 2badb0e to 1eec173 on April 9, 2026 09:33
Signed-off-by: gal salomon <gal.salomon@gmail.com>
Signed-off-by: gal salomon <gal.salomon@gmail.com>
…ke is missing ${TEST_ENV_PREFIX}, which causes a failure in the test run

Signed-off-by: gal salomon <gal.salomon@gmail.com>