feat: multi-valued numeric DataValue and schema flag (PR1/2 of #281) #286
Merged
Introduce `DataValue::Int64Array` and `DataValue::Float64Array` and the `multi_valued: bool` flag on `IntegerOption` / `FloatOption`. Lucene-style semantics: array values per document, range queries match if any value satisfies the predicate (constant scoring), single values are auto-wrapped into a one-element array, and arrays sent to a single-valued field are rejected rather than silently truncated.

This PR is the first of two for #281. It establishes the data flow end-to-end (DataValue, type inference, type coercion, proto, REST/gRPC conversion, all six language bindings, documentation). The lexical store internals (`doc_values` Single/Multi split, BKD multi-point registration, range query "any match" evaluation) are deferred to PR2 so the storage contract is changed in isolation.

Core (laurus crate):
- `DataValue::Int64Array(Vec<i64>)` / `Float64Array(Vec<f64>)` with `as_int64_array` / `as_float64_array` accessors, `From<Vec<i64>>` / `From<Vec<f64>>`, and `DocumentBuilder::add_int64_array` / `add_float64_array`.
- `IntegerOption.multi_valued` / `FloatOption.multi_valued` (default `false`, `#[serde(default)]`).
- `infer_from_array` now returns `Int64Array` or `Float64Array` (with `multi_valued: true`) instead of "not yet supported".
- `coerce_to_integer` / `coerce_to_float` branch on `multi_valued`: multi-valued accepts arrays, single values, bool, and parseable text (auto-wrap); single-valued rejects arrays explicitly.
- `coerce_to_vector` accepts `Int64Array` / `Float64Array` (cast element-wise to f32) so REST/JSON callers can still send pre-computed embeddings as plain numeric arrays to vector fields.
- All match sites in lexical writer/reader/parser/merge_engine, result_processor, scoring/similarity, examples/common updated.

Protocol / server:
- New `Int64ArrayValue` and `Float64ArrayValue` proto messages added to the `Value` oneof.
- `IntegerOption` / `FloatOption` proto messages gain `multi_valued`.
- `laurus-server`: `convert::document` round-trips the new variants; `convert::schema` round-trips the flag; gateway JSON conversion recognises `multi_valued` on input and surfaces array-typed values on output; MCP `convert::proto_value_to_json` handles the new variants.

Bindings (six):
- `laurus-{python,nodejs,wasm,ruby,php}::Schema::add_integer_field` / `add_float_field` accept a `multi_valued` parameter (default `false`). Each binding's `data_value_to_*` converter emits the new variants using each language's idiomatic array type.
- `laurus-cli::create::run_schema` prompts for `multi_valued` for numeric field types; `output::format_data_value` / `data_value_to_json` render arrays.

Tests:
- New unit tests in `engine::type_inference` for array → multi-valued inference (replaces the "not yet supported" assertion).
- New integration test `dynamic_auto_adds_int64_array_field` in `dynamic_schema_test` verifies a JSON `[85, 72, 95]` is auto-added with `multi_valued = true` and survives round-trip retrieval.
- All 698 existing unit tests + 10 dynamic-schema integration tests pass.

Documentation (EN + JA):
- `concepts/schema_and_fields`: type-inference table now lists arrays; new "Multi-valued numeric fields" section explaining the contract.
- `laurus-cli/schema_format`: TOML `multi_valued` option documented.
- `laurus-server/grpc_api`: option struct signatures updated.
- `laurus-{python,nodejs,wasm,ruby,php}/api_reference`: signatures updated with the new parameter and a one-line description.

Out of scope (PR2):
- `FieldDocValues` Single/Multi split (storage layer).
- BKD multi-point registration for the same `(field, doc)` key.
- `RangeScorer` constant scoring with "any value matches" evaluation.

Refs #281
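The auto-wrap / reject contract for `coerce_to_integer` might look roughly like the sketch below. This is not the laurus implementation: the `DataValue` enum here is a cut-down stand-in, and the function signature, the `String` error type, and the assumption that single-valued fields also accept bool and parseable text are all hypothetical.

```rust
// Hypothetical, reduced stand-in for laurus' DataValue; only the variant
// names are taken from the PR description.
#[derive(Debug, Clone, PartialEq)]
enum DataValue {
    Int64(i64),
    Bool(bool),
    Text(String),
    Int64Array(Vec<i64>),
}

// Sketch of the integer coercion contract. Signature and error type are
// assumptions made for illustration.
fn coerce_to_integer(value: DataValue, multi_valued: bool) -> Result<DataValue, String> {
    // Resolve scalar-like inputs to one i64; arrays are handled up front.
    let scalar = match value {
        DataValue::Int64Array(values) => {
            return if multi_valued {
                // Multi-valued fields accept arrays unchanged.
                Ok(DataValue::Int64Array(values))
            } else {
                // Single-valued fields reject arrays instead of truncating.
                Err("array sent to a single-valued integer field".to_string())
            };
        }
        DataValue::Int64(n) => n,
        DataValue::Bool(b) => i64::from(b),
        DataValue::Text(s) => s
            .trim()
            .parse::<i64>()
            .map_err(|e| format!("cannot parse {s:?} as integer: {e}"))?,
    };
    if multi_valued {
        // Auto-wrap: a single value becomes a one-element array.
        Ok(DataValue::Int64Array(vec![scalar]))
    } else {
        Ok(DataValue::Int64(scalar))
    }
}
```

With this sketch, `coerce_to_integer(DataValue::Int64(7), true)` yields a one-element `Int64Array`, while passing an `Int64Array` with `multi_valued = false` returns an error.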
mosuka added a commit that referenced this pull request on Apr 25, 2026
…#281) (#287)

PR1 (#286) put `Int64Array` / `Float64Array` and the `multi_valued` schema flag in place but stopped at the storage boundary: the lexical store still treated each `(field, doc_id)` as a single BKD point, so range queries on multi-valued fields could not match more than one element per document. This PR plumbs multi-value support through the indexing pipeline and switches range-query scoring to Lucene's constant-score model.

Storage / indexing pipeline:
- `AnalyzedDocument::point_values` is now `AHashMap<String, Vec<Vec<f64>>>`, a list of points per field rather than a single point. Single-valued numeric / datetime / geo fields contribute one point; multi-valued numeric fields contribute one point per element.
- `DocumentParser::parse` (`laurus/src/lexical/core/parser.rs`) and the inverted index writer's inline parser (`laurus/src/lexical/index/inverted/writer.rs`) emit a separate 1D point per `Int64Array` / `Float64Array` element and a single appropriately-shaped point for the existing scalar / geo cases.
- `InvertedIndexWriter::write_bkd_trees` flattens the new structure into one BKD entry per point. The BKD reader's `range_search` already deduplicates `doc_id`s, so a multi-valued document is reported at most once per query, which is the Lucene contract.

Range query semantics (Lucene parity):
- `NumericRangeQuery::count_matching_documents` and the fallback matcher (`laurus/src/lexical/query/range.rs`) recognise `Int64Array` / `Float64Array` values and apply "any value matches" via `arr.iter().any(...)`. The BKD path inherits "any match" from the BKD-level dedup.
- `RangeScorer::score` now returns `boost` unconditionally: Lucene `PointRangeQuery` uses constant scoring, and per-value IDF / proximity weighting interacts poorly with multi-valued "any match" semantics. The legacy `range_idf` / `proximity_score` heuristics and the unused `lower_bound` / `upper_bound` / `range_width` / `total_docs` fields were removed; `RangeScorer::new` keeps its signature so existing call sites continue to compile.

doc_values:
- Intentionally left as the existing `(doc_id, FieldValue)` map. A multi-valued document stores one `FieldValue::Int64Array(...)` or `Float64Array(...)` per `(doc_id, field)`, which is sufficient for retrieval and round-trip without restructuring storage. A Lucene-style `NumericDocValues` / `SortedNumericDocValues` split would only matter for sort selectors (MIN / MAX / MEDIAN / SUM), which is left as a future task.

Tests:
- `laurus/tests/multi_valued_numeric_test.rs` (new) exercises the full writer → BKD → reader → range query pipeline:
  - `int64_array_any_value_in_range`: 3 docs with varied integer arrays and inclusive bounds; only the docs with any value in range match.
  - `int64_array_dedups_doc_when_multiple_values_match`: a single doc with three values inside the range is reported exactly once.
  - `float64_array_any_value_in_range`: same shape as integer, for `Float64Array`.
  - `single_valued_field_unchanged_by_multi_valued_changes`: regression for the existing single-valued integer path.
- All 698 unit tests + every existing integration test pass without modification, including range / BKD / scoring tests that exercised the old IDF + proximity scorer.

Closes #281
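To illustrate how "one BKD entry per point" combines with doc-level dedup to give "any value matches", here is a simplified sketch. It is not the laurus code: a flat `Vec` stands in for the BKD tree, and `DocId`, `flatten_points`, `range_search`, and `constant_score` are hypothetical names chosen for this example.

```rust
use std::collections::BTreeSet;

// Hypothetical document id type for this sketch.
type DocId = u64;

/// Flatten per-document points into one (value, doc_id) entry per point,
/// mirroring the "one entry per array element" idea.
fn flatten_points(docs: &[(DocId, Vec<Vec<f64>>)]) -> Vec<(f64, DocId)> {
    let mut entries = Vec::new();
    for (doc_id, points) in docs {
        for point in points {
            // Numeric fields are 1D here, so only the first coordinate is used.
            if let Some(&value) = point.first() {
                entries.push((value, *doc_id));
            }
        }
    }
    entries
}

/// Inclusive range search that reports each matching document at most once.
/// Collecting into a set is what turns per-point hits into "any match".
fn range_search(entries: &[(f64, DocId)], min: f64, max: f64) -> Vec<DocId> {
    let matched: BTreeSet<DocId> = entries
        .iter()
        .filter(|(value, _)| *value >= min && *value <= max)
        .map(|(_, doc_id)| *doc_id)
        .collect();
    matched.into_iter().collect()
}

/// Constant scoring: every matching document receives the boost,
/// regardless of how many of its values fell inside the range.
fn constant_score(boost: f32) -> f32 {
    boost
}
```

In this model a document with values 20, 25, and 30 produces three entries, yet a query for the inclusive range 15 to 35 returns its id once and scores it with the plain boost.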
Summary
- `DataValue::Int64Array` / `Float64Array` plus the `multi_valued: bool` flag on `IntegerOption` / `FloatOption`. Lucene semantics: array values per document, range queries match if any value satisfies the predicate (constant scoring), single → multi auto-wraps, multi → single is rejected (no silent truncation).
- `DataValue` plumbing, type inference, type coercion, proto, REST / gRPC / MCP, and all six language bindings, with English and Japanese documentation refreshed.

What changed

Core (`laurus`)
- `DataValue::Int64Array(Vec<i64>)` and `Float64Array(Vec<f64>)`, accessors, `From<Vec<i64>>` / `From<Vec<f64>>`, and `DocumentBuilder::add_int64_array` / `add_float64_array`.
- `IntegerOption.multi_valued` / `FloatOption.multi_valued` flags (default `false`, `#[serde(default)]`).
- `infer_from_array` now returns the new variants with `multi_valued: true` instead of "not yet supported".
- `coerce_to_integer` / `coerce_to_float` branch on `multi_valued`. Multi-valued accepts arrays, single values, bool, and parseable text (auto-wraps); single-valued rejects arrays explicitly.
- `coerce_to_vector` also accepts `Int64Array` / `Float64Array` so REST/JSON callers can still pass pre-computed embeddings as plain numeric arrays to vector fields.

Protocol & server
- `Int64ArrayValue` / `Float64ArrayValue` proto messages added to the `Value` oneof.
- `IntegerOption` / `FloatOption` proto gain `multi_valued`.
- `laurus-server::convert::document` round-trips the new variants; `convert::schema` round-trips the flag; gateway JSON converters recognise `multi_valued` on input and emit array values on output.
- `laurus-mcp::convert` handles them too.

Bindings
- `laurus-{python,nodejs,wasm,ruby,php}::Schema::add_integer_field` / `add_float_field` accept a `multi_valued` parameter (default `false`). Each binding's `data_value_to_*` converter emits arrays using its language's idiomatic type.
- `laurus-cli::create` prompts for `multi_valued` on numeric fields; `output::format_data_value` and `data_value_to_json` render the new variants.

Documentation (EN + JA)
- `concepts/schema_and_fields`: type-inference table now lists arrays, plus a new "Multi-valued numeric fields" section that documents the auto-wrap / reject behaviour and the constant-score range-query contract.
- `laurus-cli/schema_format`: TOML `multi_valued` documented.
- `laurus-server/grpc_api`: `IntegerOption` / `FloatOption` field lists updated.
- `laurus-{python,nodejs,wasm,ruby,php}/api_reference`: signatures plus one-line descriptions of `multi_valued`.

Out of scope (PR2)
- `FieldDocValues` Single/Multi split.
- BKD multi-point registration for the same `(field, doc)` key.
- `RangeScorer` constant scoring with "any value matches" semantics.

Until PR2 lands, multi-valued integers/floats round-trip through the engine surface (DataValue, schema, ingest, retrieve via stored fields), but range queries against multi-valued fields are not yet evaluated correctly; that requires the storage-layer changes in PR2.
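The `coerce_to_vector` change listed under Core (numeric arrays accepted and cast element-wise to f32) could be sketched as below. The enum is a reduced stand-in, and the signature and `String` error type are assumptions rather than the actual laurus API.

```rust
// Hypothetical, reduced stand-in for laurus' DataValue.
#[derive(Debug, Clone, PartialEq)]
enum DataValue {
    Int64Array(Vec<i64>),
    Float64Array(Vec<f64>),
    Text(String),
}

// Sketch: plain numeric arrays are accepted for vector fields and cast
// element-wise to f32. Other inputs are rejected in this simplified model.
fn coerce_to_vector(value: &DataValue) -> Result<Vec<f32>, String> {
    match value {
        DataValue::Int64Array(values) => Ok(values.iter().map(|&v| v as f32).collect()),
        DataValue::Float64Array(values) => Ok(values.iter().map(|&v| v as f32).collect()),
        other => Err(format!("cannot coerce {other:?} to a vector")),
    }
}
```

Note that the `as f32` cast is lossy for large integers and high-precision floats, which is usually acceptable for embeddings but worth keeping in mind for callers.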
Test plan
- `cargo fmt --check`
- `cargo clippy -p laurus -p laurus-server -p laurus-cli -p laurus-mcp -p laurus-python -p laurus-nodejs -p laurus-wasm --all-targets -- -D warnings`
- `cargo test -p laurus -p laurus-server -p laurus-cli -p laurus-mcp`: 698 unit tests + 10 dynamic-schema integration tests pass.
- `engine::type_inference::tests::infer_integer_array_to_int64_array` / `infer_float_array_to_float64_array`.
- `dynamic_auto_adds_int64_array_field` in `dynamic_schema_test`.
- `markdownlint-cli2 "docs/src/**/*.md" "docs/ja/src/**/*.md"`: 0 errors.
- `mdbook build docs` and `mdbook build docs/ja` succeed.

Refs #281