
feat: index and query multi-valued numeric fields end-to-end (PR2/2 of #281)#287

Merged
mosuka merged 1 commit into main from feat/multi-valued-numeric-lexical-store on Apr 25, 2026
Conversation

@mosuka (Owner) commented Apr 25, 2026

Summary

  • Plumb multi-valued numeric fields all the way through the lexical-store indexing pipeline. PR #286 (feat: multi-valued numeric DataValue and schema flag, PR1/2 of #281) put the data shape (Int64Array / Float64Array) and the multi_valued schema flag in place; this PR makes range queries actually match the right documents and adopts Lucene-style constant scoring.
  • AnalyzedDocument::point_values now carries a list of points per field. Single-valued fields contribute one point; multi-valued numeric fields contribute one 1D point per element. The BKD writer registers each as a distinct entry, and the BKD reader's range_search already deduplicates doc_ids, so a multi-valued document is reported at most once per query — the Lucene contract.
  • Range query evaluation gains "any value matches" semantics on the fallback path and constant scoring on the scorer.
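The per-element point expansion described above can be sketched in plain Rust. This is a minimal, dependency-free illustration — `DataValue`, `points_for`, and `build_point_values` are hypothetical stand-ins (and std `HashMap` stands in for `AHashMap`), not the actual laurus API:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
enum DataValue {
    Int64(i64),
    Int64Array(Vec<i64>),
}

/// Expand one field value into the list of 1D points it contributes.
fn points_for(value: &DataValue) -> Vec<Vec<f64>> {
    match value {
        // A single-valued field contributes exactly one point.
        DataValue::Int64(v) => vec![vec![*v as f64]],
        // A multi-valued field contributes one 1D point per element.
        DataValue::Int64Array(vs) => vs.iter().map(|v| vec![*v as f64]).collect(),
    }
}

/// Build a point_values-style map: a list of points per field,
/// mirroring the new AHashMap<String, Vec<Vec<f64>>> shape.
fn build_point_values(fields: &[(&str, DataValue)]) -> HashMap<String, Vec<Vec<f64>>> {
    let mut out: HashMap<String, Vec<Vec<f64>>> = HashMap::new();
    for (name, value) in fields {
        out.entry((*name).to_string())
            .or_default()
            .extend(points_for(value));
    }
    out
}
```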

Why

PR #286 documented a transitional state: Int64Array / Float64Array values round-tripped through the engine but range queries against multi-valued fields could not reliably match. PR2 closes that gap.

What changed

Storage / indexing pipeline

  • laurus/src/lexical/core/analyzed.rs: point_values is now AHashMap<String, Vec<Vec<f64>>> (a list of multi-dimensional points per field).
  • laurus/src/lexical/core/parser.rs: DocumentParser::parse emits a separate 1D point per Int64Array / Float64Array element (and an appropriately-shaped point for existing scalar / geo cases).
  • laurus/src/lexical/index/inverted/writer.rs:
    • The inline parser path mirrors DocumentParser; the previous _ => {} wildcard that silently dropped Int64Array / Float64Array is replaced with explicit handlers that build N analyzed terms and N BKD points.
    • write_bkd_trees flattens the new structure: it pushes one (point, doc_id) per element into the BKDWriter, leaning on the existing BKDReader::range_search dedup step for unique-doc-id semantics.
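The flatten-then-dedup contract above can be sketched as follows. `flatten_points` mimics what write_bkd_trees does (one `(point, doc_id)` entry per element) and `range_search_1d` stands in for the dedup step in BKDReader::range_search; both names are illustrative, not the laurus API:

```rust
use std::collections::BTreeSet;

/// Flatten per-doc point lists into one (point, doc_id) entry per
/// element, as the BKD writer registers them.
fn flatten_points(docs: &[(u32, Vec<Vec<f64>>)]) -> Vec<(Vec<f64>, u32)> {
    let mut entries = Vec::new();
    for (doc_id, points) in docs {
        for p in points {
            entries.push((p.clone(), *doc_id));
        }
    }
    entries
}

/// Stand-in for a 1D range search: collect matching doc_ids into a set
/// so a multi-valued doc with several in-range elements is reported once.
fn range_search_1d(entries: &[(Vec<f64>, u32)], lo: f64, hi: f64) -> Vec<u32> {
    let hits: BTreeSet<u32> = entries
        .iter()
        .filter(|(p, _)| p[0] >= lo && p[0] <= hi)
        .map(|(_, id)| *id)
        .collect();
    hits.into_iter().collect()
}
```

Even though doc 0 below contributes two in-range points, the dedup step reports it once — the Lucene contract the PR text describes.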

Range query (Lucene parity)

  • laurus/src/lexical/query/range.rs:
    • NumericRangeQuery::count_matching_documents and the fallback matcher recognise Int64Array / Float64Array values via arr.iter().any(...). The BKD path inherits "any match" from dedup.
    • RangeScorer::score returns boost unconditionally — Lucene's PointRangeQuery uses constant scoring, and per-value IDF / proximity weighting interacts poorly with multi-valued "any match" semantics. The legacy range_idf / proximity_score heuristics and the unused lower_bound / upper_bound / range_width / total_docs fields were removed; RangeScorer::new keeps its old signature so existing call sites continue to compile.
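A minimal sketch of the two semantics above — "any value matches" on the fallback path and constant scoring. The enum and functions here are illustrative stand-ins, not the actual types in range.rs:

```rust
enum FieldValue {
    Int64(i64),
    Int64Array(Vec<i64>),
}

/// Fallback matcher: a multi-valued field matches if ANY element
/// falls inside the inclusive range.
fn matches_range(value: &FieldValue, lo: i64, hi: i64) -> bool {
    match value {
        FieldValue::Int64(v) => (lo..=hi).contains(v),
        FieldValue::Int64Array(arr) => arr.iter().any(|v| (lo..=hi).contains(v)),
    }
}

/// Constant scoring, as in Lucene's PointRangeQuery: every match
/// scores `boost`, regardless of how many elements were in range.
fn score(boost: f32) -> f32 {
    boost
}
```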

doc_values (intentionally untouched)

A multi-valued document stores one FieldValue::Int64Array(...) or Float64Array(...) per (doc_id, field) — sufficient for retrieval and round-trip without restructuring storage. A Lucene-style NumericDocValues / SortedNumericDocValues split is only relevant for sort selectors (MIN / MAX / MEDIAN / SUM), which we left as a future enhancement (issue follow-up).
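The untouched doc_values layout can be pictured as one whole value per (doc_id, field) key. This is a hypothetical sketch of that shape — the map type and helper names are illustrative, not the laurus storage code:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum FieldValue {
    Int64Array(Vec<i64>),
}

/// doc_values keyed by (doc_id, field): the whole array is stored as a
/// single value, which suffices for retrieval / round-trip. Sort
/// selectors (MIN / MAX / MEDIAN / SUM) would need a
/// SortedNumericDocValues-style split; this layout defers that.
type DocValues = HashMap<(u32, String), FieldValue>;

fn store(dv: &mut DocValues, doc_id: u32, field: &str, value: FieldValue) {
    dv.insert((doc_id, field.to_string()), value);
}

fn load<'a>(dv: &'a DocValues, doc_id: u32, field: &str) -> Option<&'a FieldValue> {
    dv.get(&(doc_id, field.to_string()))
}
```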

Tests

laurus/tests/multi_valued_numeric_test.rs (new) exercises the full writer → BKD → reader → range query pipeline:

  • int64_array_any_value_in_range — three docs with varied integer arrays; only those with any value in [80, 100] match (vec![0, 2]), and a disjoint range [10, 20] returns no matches.
  • int64_array_dedups_doc_when_multiple_values_match — a single doc with three values inside [50, 100] is reported exactly once.
  • float64_array_any_value_in_range — same shape for Float64Array.
  • single_valued_field_unchanged_by_multi_valued_changes — regression for the existing single-valued integer path (range query still produces vec![0] for age = 30 in [25, 35]).

All 698 unit tests + every existing integration test (including BKD / range / scoring) pass without modification.

Test plan

  • cargo fmt --check
  • cargo clippy -p laurus -p laurus-server -p laurus-cli -p laurus-mcp -p laurus-python -p laurus-nodejs -p laurus-wasm --all-targets -- -D warnings
  • cargo test -p laurus -p laurus-server -p laurus-cli -p laurus-mcp — 698 unit tests + new e2e tests + existing integration tests all pass

Closes #281

…#281)

PR1 (#286) put `Int64Array` / `Float64Array` and the `multi_valued`
schema flag in place but stopped at the storage boundary — the lexical
store still treated each `(field, doc_id)` as a single BKD point, so
range queries on multi-valued fields could not match more than one
element per document. This PR plumbs multi-value support through the
indexing pipeline and switches range-query scoring to Lucene's
constant-score model.

Storage / indexing pipeline:

- `AnalyzedDocument::point_values` is now
  `AHashMap<String, Vec<Vec<f64>>>` — a list of points per field rather
  than a single point. Single-valued numeric / datetime / geo fields
  contribute one point; multi-valued numeric fields contribute one
  point per element.
- `DocumentParser::parse` (`laurus/src/lexical/core/parser.rs`) and the
  inverted index writer's inline parser
  (`laurus/src/lexical/index/inverted/writer.rs`) emit a separate 1D
  point per `Int64Array` / `Float64Array` element and a single
  appropriately-shaped point for the existing scalar / geo cases.
- `InvertedIndexWriter::write_bkd_trees` flattens the new structure into
  one BKD entry per point. The BKD reader's `range_search` already
  deduplicates `doc_id`s, so a multi-valued document is reported at
  most once per query — the Lucene contract.

Range query semantics (Lucene parity):

- `NumericRangeQuery::count_matching_documents` and the fallback
  matcher (`laurus/src/lexical/query/range.rs`) recognise
  `Int64Array` / `Float64Array` values and apply "any value matches"
  via `arr.iter().any(...)`. The BKD path inherits "any match" from
  the BKD-level dedup.
- `RangeScorer::score` now returns `boost` unconditionally — Lucene
  `PointRangeQuery` uses constant scoring, and per-value IDF /
  proximity weighting interacts poorly with multi-valued "any match"
  semantics. The legacy `range_idf` / `proximity_score` heuristics and
  the unused `lower_bound` / `upper_bound` / `range_width` /
  `total_docs` fields were removed; `RangeScorer::new` keeps its
  signature so existing call sites continue to compile.

doc_values:

- Intentionally left as the existing `(doc_id, FieldValue)` map. A
  multi-valued document stores one `FieldValue::Int64Array(...)` or
  `Float64Array(...)` per `(doc_id, field)`, which is sufficient for
  retrieval and round-trip without restructuring storage. A
  Lucene-style `NumericDocValues` / `SortedNumericDocValues` split
  would only matter for sort selectors (MIN / MAX / MEDIAN / SUM),
  which is left as a future task.

Tests:

- `laurus/tests/multi_valued_numeric_test.rs` (new) exercises the full
  writer → BKD → reader → range query pipeline:
  - `int64_array_any_value_in_range`: 3 docs with varied integer
    arrays and inclusive bounds; only the docs with any value in range
    match.
  - `int64_array_dedups_doc_when_multiple_values_match`: a single doc
    with three values inside the range is reported exactly once.
  - `float64_array_any_value_in_range`: same shape as integer, for
    `Float64Array`.
  - `single_valued_field_unchanged_by_multi_valued_changes`:
    regression for the existing single-valued integer path.
- All 698 unit tests + every existing integration test pass without
  modification, including range / BKD / scoring tests that exercised
  the old IDF + proximity scorer.

Closes #281
@mosuka mosuka merged commit ea43870 into main Apr 25, 2026
22 checks passed
@mosuka mosuka deleted the feat/multi-valued-numeric-lexical-store branch April 25, 2026 12:05
Merging this pull request closed the linked issue: Add native support for multi-valued numeric fields (Int64Array / Float64Array) (#281).