
feat(fts): add configurable posting block size#7466

Open
BubbleCal wants to merge 8 commits into main from yang/oss-1344-make-fts-index-block-size-configurable

@BubbleCal BubbleCal commented Jun 25, 2026

Contributor

Feature

Linear: OSS-1344

What is the new feature?

FTS inverted index creation now accepts a block_size parameter for compressed posting blocks. Supported values are 128 and 256.

Why do we need this feature?

The posting block size was previously fixed at 128, which made the block-max granularity impossible to tune for different datasets and query profiles.
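
As a rough illustration of what "granularity" means here (not code from this PR), the sketch below counts how many posting blocks a single term's posting list splits into. It assumes one block-max entry per full or partial block; the helper name `num_blocks` is hypothetical.

```python
def num_blocks(num_docs: int, block_size: int) -> int:
    """Number of posting blocks needed to hold num_docs postings.

    Assumption: the last block may be partially filled, so this is a
    ceiling division.
    """
    return (num_docs + block_size - 1) // block_size


if __name__ == "__main__":
    posting_len = 1_000_000  # illustrative posting-list length
    for block_size in (128, 256):
        print(block_size, num_blocks(posting_len, block_size))
```

In general, smaller blocks tend to give tighter per-block score bounds at the cost of more block metadata, while larger blocks halve the number of entries but loosen the bounds. Which side wins depends on the dataset and query mix.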

How does it work?

  • Adds block_size to InvertedIndexParams, protobuf details, posting-list schema metadata, and cache headers.
  • Uses 128 as the default for newly created indexes.
  • Treats older serialized params, schema metadata, and cache entries that omit block_size as legacy 128.
  • Rejects unsupported values, including 512, with a clear validation error.
  • Uses Lance-owned BitPacker4x for physical 128-value posting blocks and BitPacker8x for physical 256-value posting blocks.
  • Marks block_size=256 as experimental in public API docs because it may introduce breaking changes.
  • Keeps position-stream packing on the legacy 128-value block format.
  • Keeps downgrade compatibility tests on explicit legacy block_size=128, since older wheels cannot read the physical 256-value FTS posting blocks written by current builds.
  • Threads the configured block size through FTS build, read, iterator, WAND, cache, and MemWAL flush paths.
  • Exposes the parameter in Python and Java FTS index creation APIs, with docs and focused tests.
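
The defaulting and validation rules in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's Rust implementation: the names `resolve_block_size` and `packer_for` are hypothetical, and a missing value is modeled as `None`, which is an assumption about how absence is represented.

```python
from typing import Optional

LEGACY_BLOCK_SIZE = 128
SUPPORTED_BLOCK_SIZES = (128, 256)


def resolve_block_size(raw: Optional[int]) -> int:
    """Map a possibly missing block_size to a validated value.

    A missing value (older params, schema metadata, or cache entries)
    is treated as the legacy size of 128.
    """
    if raw is None:
        return LEGACY_BLOCK_SIZE
    if raw not in SUPPORTED_BLOCK_SIZES:
        raise ValueError(
            f"unsupported FTS block_size {raw}; supported values are 128 and 256"
        )
    return raw


def packer_for(block_size: int) -> str:
    """Name of the bit packer the PR description pairs with each size."""
    return {128: "BitPacker4x", 256: "BitPacker8x"}[block_size]


if __name__ == "__main__":
    print(resolve_block_size(None), packer_for(resolve_block_size(None)))
    print(resolve_block_size(256), packer_for(256))
```

Resolving the value once at the boundary, under these assumptions, would let the build, read, and cache paths work with a plain validated integer instead of re-checking for absence.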

Validation

  • cargo fmt --all
  • cargo fmt --all --check
  • git diff --check
  • CARGO_TARGET_DIR=/tmp/lance-target-a479-no512 cargo test -p lance-index block_size -- --nocapture
  • CARGO_TARGET_DIR=/tmp/lance-target-a479-no512 cargo clippy -p lance-index --tests -- -D warnings
  • uv run make build from python/
  • uv run pytest python/tests/test_scalar_index.py::test_create_scalar_index_fts_block_size from python/
  • uv run ruff format --check python/tests/test_scalar_index.py python/lance/dataset.py from python/
  • uv run ruff check python/tests/test_scalar_index.py python/lance/dataset.py from python/
  • CARGO_TARGET_DIR=/tmp/lance-target-a479-merge-main cargo test -p lance-index block_size -- --nocapture
  • CARGO_TARGET_DIR=/tmp/lance-target-a479-merge-main cargo test -p lance-index test_256_posting_block_uses_single_physical_bitpack_chunk -- --nocapture
  • CARGO_TARGET_DIR=/tmp/lance-target-a479-merge-main cargo test -p lance-bitpacking
  • CARGO_TARGET_DIR=/tmp/lance-target-a479-merge-main cargo clippy -p lance-bitpacking -p lance-index --tests -- -D warnings
  • uv run ruff format --check python/tests/compat/test_scalar_indices.py from python/
  • uv run ruff check python/tests/compat/test_scalar_indices.py from python/
  • uv run pytest --run-compat -vvv -s python/tests/compat/test_scalar_indices.py::test_FtsIndex_downgrade --durations=30 from python/
  • CARGO_TARGET_DIR=/tmp/lance-a479-target cargo test -p lance-index test_new_training_request_defaults_missing_block_size_to_128
  • CARGO_TARGET_DIR=/tmp/lance-a479-target cargo test -p lance-index block_size
  • uv run ruff format --check python/lance/dataset.py from python/
  • uv run ruff check python/lance/dataset.py from python/

Not run locally: Java focused test / spotless check, because this machine has no Java Runtime installed (Unable to locate a Java Runtime).

@github-actions

Contributor

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

@github-actions github-actions Bot added A-python Python bindings A-index Vector index, linalg, tokenizer A-java Java bindings + JNI A-format On-disk format: protos and format spec docs enhancement New feature or request labels Jun 25, 2026
@BubbleCal BubbleCal force-pushed the yang/oss-1344-make-fts-index-block-size-configurable branch from dd4ac88 to d9f0acb Compare June 25, 2026 07:09
@BubbleCal BubbleCal force-pushed the yang/oss-1344-make-fts-index-block-size-configurable branch from d9f0acb to 23e4810 Compare June 25, 2026 09:13
@BubbleCal BubbleCal force-pushed the yang/oss-1344-make-fts-index-block-size-configurable branch from 23e4810 to 059ae90 Compare June 25, 2026 09:23
@BubbleCal BubbleCal marked this pull request as ready for review June 29, 2026 10:50
@BubbleCal

Contributor Author

@claude review

@Xuanwo Xuanwo left a comment

Collaborator

I think this needs a compatibility boundary before merge.

  1. block_size=256 changes the persisted posting-block layout, but the stored index version still looks like the existing FTS format. Older readers will ignore the new metadata/details field and try to decode the blocks as legacy 128-doc BitPacker4x blocks. That can fail open as wrong FTS results or decode panics instead of cleanly ignoring the index. We should either bump the FTS/index version for non-legacy block sizes or reject writing 256 until older readers can be gated out.

  2. Legacy segments with no block_size and newly written default-128 segments with block_size=128 are semantically identical, but the multi-segment details check compares the raw protobuf values. Mixed old/new default-128 segments can be rejected as inconsistent. The comparison should canonicalize missing block_size to 128 before comparing.
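
To illustrate the second point, here is a minimal sketch of a canonicalizing comparison. It is an assumption-laden illustration rather than the actual multi-segment check: the function names are hypothetical, and a segment without a stored block_size is modeled as `None`.

```python
from typing import Optional, Sequence

LEGACY_BLOCK_SIZE = 128


def canonical_block_size(raw: Optional[int]) -> int:
    """Treat a missing block_size as the legacy value of 128."""
    return LEGACY_BLOCK_SIZE if raw is None else raw


def segments_consistent(block_sizes: Sequence[Optional[int]]) -> bool:
    """True when all segments agree after canonicalization."""
    return len({canonical_block_size(b) for b in block_sizes}) <= 1


if __name__ == "__main__":
    # Legacy segment (no field) next to a new default-128 segment.
    print(segments_consistent([None, 128]))
    # Legacy segment next to a 256 segment.
    print(segments_consistent([None, 256]))
```

Comparing canonical values instead of raw fields would accept mixed old and new default-128 segments while still flagging a genuine 128/256 mismatch.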


Labels

A-format On-disk format: protos and format spec docs A-index Vector index, linalg, tokenizer A-java Java bindings + JNI A-python Python bindings enhancement New feature or request


2 participants