Skip to content

fix(index): defer scalar index parser selection until operator context is known#7509

Open
ztorchan wants to merge 1 commit into
lance-format:mainfrom
ztorchan:fix-defer_index_selection
Open

fix(index): defer scalar index parser selection until operator context is known#7509
ztorchan wants to merge 1 commit into
lance-format:mainfrom
ztorchan:fix-defer_index_selection

Conversation

@ztorchan

Copy link
Copy Markdown
Contributor

Problem

When a single column carries more than one scalar index (for example a bloom filter and a btree), some queries that should be index-accelerated silently degrade to a full scan.

Concretely, with both a bloom filter and a btree on column x, the query x > 7 is not accelerated by the btree even though the btree can serve the range. Whichever parser happens to be registered first wins, regardless of whether it can actually handle the query shape.

Fixes #7091.

Root cause

Matching a filter leaf to an index involves two distinct decisions:

Decision Question Context required
Reference validation "Does this index cover this column / JSON path?" The full Expr (incl. JSON path) — but not the operator
Capability matching "Can this index serve this query shape?" The column, value, and operator

MultiQueryParser::select (introduced in #7072 to route JSON queries to the correct path) answered both with is_valid_reference alone, committing to the first parser that accepted the reference. Because is_valid_reference
cannot see the operator, it could not tell that the bloom filter is unable to serve a range:

  1. visit_comparison calls maybe_indexed_column(x) — only Column("x") is visible, the > lives on the parent node.
  2. select asks each parser's is_valid_reference; both bloom and btree accept a bare column.
  3. select returns the first one (bloom).
  4. bloom.visit_comparison(x, 7, Gt) returns None (equality only).
  5. The leaf is left unindexed and falls back to a full scan.

This was effectively a regression from #7072: before that change the column's whole MultiQueryParser was returned and its visit_* find_map fallback naturally skipped bloom and picked the btree. #7072 fixed JSON path routing but
bypassed that capability fallback.

Fix

Keep reference validation and capability matching as separate phases:

  • MultiQueryParser::select now collects every parser whose is_valid_reference accepts the expression and returns a borrowed MultiQueryParserView over that subset (plus the referenced data type).
  • MultiQueryParserView behaves exactly like its parent MultiQueryParser — each visit_* fans out to its parsers with "first match wins" semantics — but operates on a borrowed subset instead of owning the parsers. The
    capability decision is therefore deferred to the visit_* calls, where the operator is finally known.

This unifies both scenarios:

Scenario Candidates after select Result
JSON multi-path 1 (path is unique) Same as before — correct routing (#7072 preserved)
bloom + btree multiple visit_* find_map picks by capability (#7091 fixed)

The change is intentionally local: maybe_indexed_column swaps its returned parser type for MultiQueryParserView, and none of the ~7 downstream visitors (visit_comparison, visit_between, visit_in_list, visit_is_bool,
visit_is_null, maybe_range, visit_scalar_fn, visit_like_expr) needed to change — the fan-out/fallback logic stays in one place.

A borrowed view (rather than returning an owned MultiQueryParser) is required because select only has &self, the parsers are stored as Box<dyn ScalarQueryParser> (not Clone), and the dispatch table is shared across all leaves of a filter tree within a query, so it must not be mutated or pruned in place.

Tests

Added test_multi_index_defers_to_operator_capability, which registers a bloom filter first (so the pre-fix "first valid reference wins" behavior would route everything to it) and a btree second on the same column, then asserts:

  • size > 7 is served by the btree as a SargableQuery::Range (the bug scenario — previously a full scan).
  • size = 7 is still served by the bloom filter as BloomFilterQuery::Equals (the fallback does not regress the equality shape the bloom filter supports).

The existing test_multi_json_indices_route_by_path continues to pass, confirming JSON path routing from #7072 is unaffected.

@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer bug Something isn't working labels Jun 29, 2026
@ztorchan

Copy link
Copy Markdown
Contributor Author

hi @westonpace, please take a look at this pr

@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-index Vector index, linalg, tokenizer bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Defer index parser selection until full operator context is available

1 participant