
feat: index and query multi-valued numeric fields end-to-end (PR2/2 of #281)#287

Merged
mosuka merged 1 commit into main from feat/multi-valued-numeric-lexical-store on Apr 25, 2026
Conversation

@mosuka (Owner) commented Apr 25, 2026

Summary

  • Plumb multi-valued numeric fields all the way through the lexical-store indexing pipeline. PR #286 (feat: multi-valued numeric DataValue and schema flag, PR1/2 of #281) put the data shape (Int64Array / Float64Array) and the multi_valued schema flag in place; this PR makes range queries actually match the right documents and adopts Lucene-style constant scoring.
  • AnalyzedDocument::point_values now carries a list of points per field. Single-valued fields contribute one point; multi-valued numeric fields contribute one 1D point per element. The BKD writer registers each as a distinct entry, and the BKD reader's range_search already deduplicates doc_ids, so a multi-valued document is reported at most once per query — the Lucene contract.
  • Range query evaluation gains "any value matches" semantics on the fallback path and constant scoring on the scorer.
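The per-element point expansion described above can be sketched in plain Rust. This is a minimal, dependency-free illustration — `DataValue`, `points_for`, and `build_point_values` are hypothetical stand-ins (and std `HashMap` stands in for `AHashMap`), not the actual laurus API:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
enum DataValue {
    Int64(i64),
    Int64Array(Vec<i64>),
}

/// Expand one field value into the list of 1D points it contributes.
fn points_for(value: &DataValue) -> Vec<Vec<f64>> {
    match value {
        // A single-valued field contributes exactly one point.
        DataValue::Int64(v) => vec![vec![*v as f64]],
        // A multi-valued field contributes one 1D point per element.
        DataValue::Int64Array(vs) => vs.iter().map(|v| vec![*v as f64]).collect(),
    }
}

/// Build a point_values-style map: a list of points per field,
/// mirroring the new AHashMap<String, Vec<Vec<f64>>> shape.
fn build_point_values(fields: &[(&str, DataValue)]) -> HashMap<String, Vec<Vec<f64>>> {
    let mut out: HashMap<String, Vec<Vec<f64>>> = HashMap::new();
    for (name, value) in fields {
        out.entry((*name).to_string())
            .or_default()
            .extend(points_for(value));
    }
    out
}
```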

Why

PR #286 documented a transitional state: Int64Array / Float64Array values round-tripped through the engine but range queries against multi-valued fields could not reliably match. PR2 closes that gap.

What changed

Storage / indexing pipeline

  • laurus/src/lexical/core/analyzed.rs: point_values is now AHashMap<String, Vec<Vec<f64>>> (a list of multi-dimensional points per field).
  • laurus/src/lexical/core/parser.rs: DocumentParser::parse emits a separate 1D point per Int64Array / Float64Array element (and an appropriately-shaped point for existing scalar / geo cases).
  • laurus/src/lexical/index/inverted/writer.rs:
    • The inline parser path mirrors DocumentParser; the previous _ => {} wildcard that silently dropped Int64Array / Float64Array is replaced with explicit handlers that build N analyzed terms and N BKD points.
    • write_bkd_trees flattens the new structure: it pushes one (point, doc_id) per element into the BKDWriter, leaning on the existing BKDReader::range_search dedup step for unique-doc-id semantics.
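The flatten-then-dedup contract above can be sketched as follows. `flatten_points` mimics what write_bkd_trees does (one `(point, doc_id)` entry per element) and `range_search_1d` stands in for the dedup step in BKDReader::range_search; both names are illustrative, not the laurus API:

```rust
use std::collections::BTreeSet;

/// Flatten per-doc point lists into one (point, doc_id) entry per
/// element, as the BKD writer registers them.
fn flatten_points(docs: &[(u32, Vec<Vec<f64>>)]) -> Vec<(Vec<f64>, u32)> {
    let mut entries = Vec::new();
    for (doc_id, points) in docs {
        for p in points {
            entries.push((p.clone(), *doc_id));
        }
    }
    entries
}

/// Stand-in for a 1D range search: collect matching doc_ids into a set
/// so a multi-valued doc with several in-range elements is reported once.
fn range_search_1d(entries: &[(Vec<f64>, u32)], lo: f64, hi: f64) -> Vec<u32> {
    let hits: BTreeSet<u32> = entries
        .iter()
        .filter(|(p, _)| p[0] >= lo && p[0] <= hi)
        .map(|(_, id)| *id)
        .collect();
    hits.into_iter().collect()
}
```

Even though doc 0 below contributes two in-range points, the dedup step reports it once — the Lucene contract the PR text describes.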

Range query (Lucene parity)

  • laurus/src/lexical/query/range.rs:
    • NumericRangeQuery::count_matching_documents and the fallback matcher recognise Int64Array / Float64Array values via arr.iter().any(...). The BKD path inherits "any match" from dedup.
    • RangeScorer::score returns boost unconditionally — Lucene's PointRangeQuery uses constant scoring, and per-value IDF / proximity weighting interacts poorly with multi-valued "any match" semantics. The legacy range_idf / proximity_score heuristics and the unused lower_bound / upper_bound / range_width / total_docs fields were removed; RangeScorer::new keeps its old signature so existing call sites continue to compile.
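A minimal sketch of the two semantics above — "any value matches" on the fallback path and constant scoring. The enum and functions here are illustrative stand-ins, not the actual types in range.rs:

```rust
enum FieldValue {
    Int64(i64),
    Int64Array(Vec<i64>),
}

/// Fallback matcher: a multi-valued field matches if ANY element
/// falls inside the inclusive range.
fn matches_range(value: &FieldValue, lo: i64, hi: i64) -> bool {
    match value {
        FieldValue::Int64(v) => (lo..=hi).contains(v),
        FieldValue::Int64Array(arr) => arr.iter().any(|v| (lo..=hi).contains(v)),
    }
}

/// Constant scoring, as in Lucene's PointRangeQuery: every match
/// scores `boost`, regardless of how many elements were in range.
fn score(boost: f32) -> f32 {
    boost
}
```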

doc_values (intentionally untouched)

A multi-valued document stores one FieldValue::Int64Array(...) or Float64Array(...) per (doc_id, field) — sufficient for retrieval and round-trip without restructuring storage. A Lucene-style NumericDocValues / SortedNumericDocValues split is only relevant for sort selectors (MIN / MAX / MEDIAN / SUM), which we left as a future enhancement (issue follow-up).
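The untouched doc_values layout can be pictured as one whole value per (doc_id, field) key. This is a hypothetical sketch of that shape — the map type and helper names are illustrative, not the laurus storage code:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum FieldValue {
    Int64Array(Vec<i64>),
}

/// doc_values keyed by (doc_id, field): the whole array is stored as a
/// single value, which suffices for retrieval / round-trip. Sort
/// selectors (MIN / MAX / MEDIAN / SUM) would need a
/// SortedNumericDocValues-style split; this layout defers that.
type DocValues = HashMap<(u32, String), FieldValue>;

fn store(dv: &mut DocValues, doc_id: u32, field: &str, value: FieldValue) {
    dv.insert((doc_id, field.to_string()), value);
}

fn load<'a>(dv: &'a DocValues, doc_id: u32, field: &str) -> Option<&'a FieldValue> {
    dv.get(&(doc_id, field.to_string()))
}
```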

Tests

laurus/tests/multi_valued_numeric_test.rs (new) exercises the full writer → BKD → reader → range query pipeline:

  • int64_array_any_value_in_range — three docs with varied integer arrays; only those with any value in [80, 100] match (vec![0, 2]), and a disjoint range [10, 20] returns no matches.
  • int64_array_dedups_doc_when_multiple_values_match — a single doc with three values inside [50, 100] is reported exactly once.
  • float64_array_any_value_in_range — same shape for Float64Array.
  • single_valued_field_unchanged_by_multi_valued_changes — regression for the existing single-valued integer path (range query still produces vec![0] for age = 30 in [25, 35]).

All 698 unit tests + every existing integration test (including BKD / range / scoring) pass without modification.

Test plan

  • cargo fmt --check
  • cargo clippy -p laurus -p laurus-server -p laurus-cli -p laurus-mcp -p laurus-python -p laurus-nodejs -p laurus-wasm --all-targets -- -D warnings
  • cargo test -p laurus -p laurus-server -p laurus-cli -p laurus-mcp — 698 unit tests + new e2e tests + existing integration tests all pass

Closes #281

…#281)

PR1 (#286) put `Int64Array` / `Float64Array` and the `multi_valued`
schema flag in place but stopped at the storage boundary — the lexical
store still treated each `(field, doc_id)` as a single BKD point, so
range queries on multi-valued fields could not match more than one
element per document. This PR plumbs multi-value support through the
indexing pipeline and switches range-query scoring to Lucene's
constant-score model.

Storage / indexing pipeline:

- `AnalyzedDocument::point_values` is now
  `AHashMap<String, Vec<Vec<f64>>>` — a list of points per field rather
  than a single point. Single-valued numeric / datetime / geo fields
  contribute one point; multi-valued numeric fields contribute one
  point per element.
- `DocumentParser::parse` (`laurus/src/lexical/core/parser.rs`) and the
  inverted index writer's inline parser
  (`laurus/src/lexical/index/inverted/writer.rs`) emit a separate 1D
  point per `Int64Array` / `Float64Array` element and a single
  appropriately-shaped point for the existing scalar / geo cases.
- `InvertedIndexWriter::write_bkd_trees` flattens the new structure into
  one BKD entry per point. The BKD reader's `range_search` already
  deduplicates `doc_id`s, so a multi-valued document is reported at
  most once per query — the Lucene contract.

Range query semantics (Lucene parity):

- `NumericRangeQuery::count_matching_documents` and the fallback
  matcher (`laurus/src/lexical/query/range.rs`) recognise
  `Int64Array` / `Float64Array` values and apply "any value matches"
  via `arr.iter().any(...)`. The BKD path inherits "any match" from
  the BKD-level dedup.
- `RangeScorer::score` now returns `boost` unconditionally — Lucene
  `PointRangeQuery` uses constant scoring, and per-value IDF /
  proximity weighting interacts poorly with multi-valued "any match"
  semantics. The legacy `range_idf` / `proximity_score` heuristics and
  the unused `lower_bound` / `upper_bound` / `range_width` /
  `total_docs` fields were removed; `RangeScorer::new` keeps its
  signature so existing call sites continue to compile.

doc_values:

- Intentionally left as the existing `(doc_id, FieldValue)` map. A
  multi-valued document stores one `FieldValue::Int64Array(...)` or
  `Float64Array(...)` per `(doc_id, field)`, which is sufficient for
  retrieval and round-trip without restructuring storage. A
  Lucene-style `NumericDocValues` / `SortedNumericDocValues` split
  would only matter for sort selectors (MIN / MAX / MEDIAN / SUM),
  which is left as a future task.

Tests:

- `laurus/tests/multi_valued_numeric_test.rs` (new) exercises the full
  writer → BKD → reader → range query pipeline:
  - `int64_array_any_value_in_range`: 3 docs with varied integer
    arrays and inclusive bounds; only the docs with any value in range
    match.
  - `int64_array_dedups_doc_when_multiple_values_match`: a single doc
    with three values inside the range is reported exactly once.
  - `float64_array_any_value_in_range`: same shape as integer, for
    `Float64Array`.
  - `single_valued_field_unchanged_by_multi_valued_changes`:
    regression for the existing single-valued integer path.
- All 698 unit tests + every existing integration test pass without
  modification, including range / BKD / scoring tests that exercised
  the old IDF + proximity scorer.

Closes #281
@mosuka mosuka merged commit ea43870 into main Apr 25, 2026
22 checks passed
@mosuka mosuka deleted the feat/multi-valued-numeric-lexical-store branch April 25, 2026 12:05
Merging this pull request closed the linked issue: Add native support for multi-valued numeric fields (Int64Array / Float64Array) (#281).