## Summary

PR1 (#286) put `Int64Array` / `Float64Array` and the `multi_valued` schema flag in place; this PR makes range queries actually match the right documents and adds Lucene-style constant scoring. `AnalyzedDocument::point_values` now carries a list of points per field. Single-valued fields contribute one point; multi-valued numeric fields contribute one 1D point per element. The BKD writer registers each as a distinct entry, and the BKD reader's `range_search` already deduplicates `doc_id`s, so a multi-valued document is reported at most once per query (the Lucene contract).

## Why
PR1 (#286) documented a transitional state: `Int64Array` / `Float64Array` values round-tripped through the engine, but the lexical store still treated each `(field, doc_id)` as a single BKD point, so a range query on a multi-valued field could not match more than one element per document. PR2 closes that gap.
## What changed
### Storage / indexing pipeline
- `laurus/src/lexical/core/analyzed.rs`: `point_values` is now `AHashMap<String, Vec<Vec<f64>>>` (a list of multi-dimensional points per field).
- `laurus/src/lexical/core/parser.rs`: `DocumentParser::parse` emits a separate 1D point per `Int64Array` / `Float64Array` element (and an appropriately-shaped point for the existing scalar / geo cases).
- `laurus/src/lexical/index/inverted/writer.rs`:
  - The inline parser now mirrors `DocumentParser`; the previous `_ => {}` wildcard that silently dropped `Int64Array` / `Float64Array` is replaced with explicit handlers that build N analyzed terms and N BKD points.
  - `write_bkd_trees` flattens the new structure: it pushes one `(point, doc_id)` per element into the `BKDWriter`, leaning on the existing `BKDReader::range_search` dedup step for unique-doc-id semantics.
### Range query (Lucene parity)

- `laurus/src/lexical/query/range.rs`:
  - `NumericRangeQuery::count_matching_documents` and the fallback matcher recognise `Int64Array` / `Float64Array` values via `arr.iter().any(...)`. The BKD path inherits "any match" from the dedup step.
  - `RangeScorer::score` returns `boost` unconditionally: Lucene's `PointRangeQuery` uses constant scoring, and per-value IDF / proximity weighting interacts poorly with multi-valued "any match" semantics. The legacy `range_idf` / `proximity_score` heuristics and the unused `lower_bound` / `upper_bound` / `range_width` / `total_docs` fields were removed; `RangeScorer::new` keeps its old signature so existing call sites continue to compile.
### doc_values (intentionally untouched)

A multi-valued document stores one `FieldValue::Int64Array(...)` or `Float64Array(...)` per `(doc_id, field)`, which is sufficient for retrieval and round-trip without restructuring storage. A Lucene-style `NumericDocValues` / `SortedNumericDocValues` split is only relevant for sort selectors (MIN / MAX / MEDIAN / SUM), which is left as a follow-up issue.
### Tests

- `laurus/tests/multi_valued_numeric_test.rs` (new) exercises the full writer → BKD → reader → range query pipeline:
  - `int64_array_any_value_in_range`: three docs with varied integer arrays; only those with any value in `[80, 100]` match (`vec![0, 2]`), and a disjoint range `[10, 20]` returns no matches.
  - `int64_array_dedups_doc_when_multiple_values_match`: a single doc with three values inside `[50, 100]` is reported exactly once.
  - `float64_array_any_value_in_range`: same shape, for `Float64Array`.
  - `single_valued_field_unchanged_by_multi_valued_changes`: regression for the existing single-valued integer path (a range query still produces `vec![0]` for `age = 30` in `[25, 35]`).
- All 698 unit tests and every existing integration test (including BKD / range / scoring) pass without modification.
## Test plan
- `cargo fmt --check`
- `cargo clippy -p laurus -p laurus-server -p laurus-cli -p laurus-mcp -p laurus-python -p laurus-nodejs -p laurus-wasm --all-targets -- -D warnings`
- `cargo test -p laurus -p laurus-server -p laurus-cli -p laurus-mcp`: 698 unit tests + new e2e tests + existing integration tests all pass

Closes #281