feat: multi-valued numeric DataValue and schema flag (PR1/2 of #281) by mosuka · Pull Request #286 · mosuka/laurus

mosuka · 2026-04-25T07:33:52Z

Summary

Add DataValue::Int64Array / Float64Array plus the multi_valued: bool flag on IntegerOption / FloatOption. Lucene semantics: array values per document, range queries match if any value satisfies the predicate (constant scoring), single → multi auto-wraps, multi → single is rejected (no silent truncation).
Wire the new variants and the schema flag end-to-end through DataValue plumbing, type inference, type coercion, proto, REST / gRPC / MCP, and all six language bindings, with English and Japanese documentation refreshed.
Lexical-store internals (doc_values Single/Multi split, BKD multi-point registration, range query "any match" evaluation) are deferred to PR2 so the storage contract changes in isolation.
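
The "any value matches" contract with constant scoring can be illustrated with a minimal sketch. The helper names `any_value_in_range` and `constant_score` are hypothetical and are not part of the laurus API; they only show the intended semantics for an inclusive integer range.

```rust
/// Hypothetical helper (not a laurus function): returns true when at least
/// one element of a multi-valued field lies inside the inclusive range.
fn any_value_in_range(values: &[i64], lower: i64, upper: i64) -> bool {
    values.iter().any(|v| (lower..=upper).contains(v))
}

/// Hypothetical helper: under constant scoring every matching document
/// receives the same score (the query boost), regardless of how many of
/// its values matched or how close they are to the bounds.
fn constant_score(matched: bool, boost: f32) -> Option<f32> {
    matched.then_some(boost)
}
```

An empty array never matches, and a document with several values inside the range scores the same as a document with exactly one.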

What changed

Core (`laurus`)

DataValue::Int64Array(Vec<i64>) and Float64Array(Vec<f64>), accessors, From<Vec<i64>> / From<Vec<f64>>, and DocumentBuilder::add_int64_array / add_float64_array.
IntegerOption.multi_valued / FloatOption.multi_valued flags (default false, #[serde(default)]).
infer_from_array now returns the new variants with multi_valued: true instead of "not yet supported".
coerce_to_integer / coerce_to_float branch on multi_valued. Multi-valued accepts arrays, single values, bool, and parseable text (auto-wraps); single-valued rejects arrays explicitly.
coerce_to_vector also accepts Int64Array / Float64Array so REST/JSON callers can still pass pre-computed embeddings as plain numeric arrays to vector fields.
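
The auto-wrap / reject rules for integer coercion might look roughly like the sketch below. The `DataValue` enum here is a simplified stand-in with only four variants, and the function signature and error type are assumptions; the real `coerce_to_integer` in laurus may differ. How a single-valued field treats bool and text inputs, and the mapping of bool to 0/1, are also assumptions made for illustration.

```rust
// Simplified, hypothetical stand-in for the laurus `DataValue` enum.
#[derive(Debug, Clone, PartialEq)]
enum DataValue {
    Int64(i64),
    Int64Array(Vec<i64>),
    Bool(bool),
    Text(String),
}

/// Sketch of integer coercion driven by the `multi_valued` schema flag.
/// This is an illustration of the described behaviour, not the laurus code.
fn coerce_to_integer(value: DataValue, multi_valued: bool) -> Result<DataValue, String> {
    // Arrays are handled first: accepted as-is by multi-valued fields,
    // rejected explicitly by single-valued fields (no silent truncation).
    let scalar = match value {
        DataValue::Int64Array(values) => {
            return if multi_valued {
                Ok(DataValue::Int64Array(values))
            } else {
                Err("array value sent to a single-valued integer field".to_string())
            };
        }
        DataValue::Int64(v) => v,
        // Assumption: bool maps to 0 / 1.
        DataValue::Bool(b) => i64::from(b),
        DataValue::Text(s) => s
            .trim()
            .parse::<i64>()
            .map_err(|e| format!("cannot parse {s:?} as integer: {e}"))?,
    };
    // Scalars are auto-wrapped into a one-element array for multi-valued fields.
    Ok(if multi_valued {
        DataValue::Int64Array(vec![scalar])
    } else {
        DataValue::Int64(scalar)
    })
}
```

Rejecting arrays on single-valued fields keeps the failure visible at ingest time instead of dropping all but one element.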

Protocol & server

New Int64ArrayValue / Float64ArrayValue proto messages added to the Value oneof. IntegerOption / FloatOption proto gain multi_valued.
laurus-server::convert::document round-trips the new variants; convert::schema round-trips the flag; gateway JSON converters recognise multi_valued on input and emit array values on output. laurus-mcp::convert handles them too.

Bindings

laurus-{python,nodejs,wasm,ruby,php}::Schema::add_integer_field / add_float_field accept a multi_valued parameter (default false). Each binding's data_value_to_* converter emits arrays using its language's idiomatic type.
laurus-cli::create prompts for multi_valued on numeric fields; output::format_data_value and data_value_to_json render the new variants.

Documentation (EN + JA)

concepts/schema_and_fields — type-inference table now lists arrays, plus a new "Multi-valued numeric fields" section that documents the auto-wrap / reject behaviour and the constant-score range-query contract.
laurus-cli/schema_format — TOML multi_valued documented.
laurus-server/grpc_api — IntegerOption / FloatOption field lists updated.
laurus-{python,nodejs,wasm,ruby,php}/api_reference — signatures plus one-line descriptions of multi_valued.

Out of scope (PR2)

FieldDocValues Single/Multi split.
BKD multi-point registration for the same (field, doc) key.
RangeScorer constant scoring with "any value matches" semantics.

Until PR2 lands, multi-valued integers/floats round-trip through the engine surface (DataValue, schema, ingest, retrieve via stored fields) but range queries against multi-valued fields are not yet evaluated correctly — that requires the storage-layer changes in PR2.

Test plan

cargo fmt --check
cargo clippy -p laurus -p laurus-server -p laurus-cli -p laurus-mcp -p laurus-python -p laurus-nodejs -p laurus-wasm --all-targets -- -D warnings
cargo test -p laurus -p laurus-server -p laurus-cli -p laurus-mcp — 698 unit tests + 10 dynamic-schema integration tests pass.
New unit tests in engine::type_inference::tests::infer_integer_array_to_int64_array / infer_float_array_to_float64_array.
New integration test dynamic_auto_adds_int64_array_field in dynamic_schema_test.
markdownlint-cli2 "docs/src/**/*.md" "docs/ja/src/**/*.md" (0 errors).
mdbook build docs and mdbook build docs/ja succeed.

Refs #281

Introduce `DataValue::Int64Array` and `DataValue::Float64Array` and the `multi_valued: bool` flag on `IntegerOption` / `FloatOption`. Lucene-style semantics: array values per document, range queries match if any value satisfies the predicate (constant scoring), single values are auto-wrapped into a one-element array, arrays sent to a single-valued field are rejected rather than silently truncating.

This PR is the first of two for #281. It establishes the data flow end-to-end (DataValue, type inference, type coercion, proto, REST/gRPC conversion, all six language bindings, documentation). The lexical store internals (`doc_values` Single/Multi split, BKD multi-point registration, range query "any match" evaluation) are deferred to PR2 so the storage contract is changed in isolation.

Core (laurus crate):

- `DataValue::Int64Array(Vec<i64>)` / `Float64Array(Vec<f64>)` with `as_int64_array` / `as_float64_array` accessors, `From<Vec<i64>>` / `From<Vec<f64>>`, and `DocumentBuilder::add_int64_array` / `add_float64_array`.
- `IntegerOption.multi_valued` / `FloatOption.multi_valued` (default `false`, `#[serde(default)]`).
- `infer_from_array` now returns `Int64Array` or `Float64Array` (with `multi_valued: true`) instead of "not yet supported".
- `coerce_to_integer` / `coerce_to_float` branch on `multi_valued`: multi-valued accepts arrays, single values, bool, and parseable text (auto-wrap); single-valued rejects arrays explicitly.
- `coerce_to_vector` accepts `Int64Array` / `Float64Array` (cast element-wise to f32) so REST/JSON callers can still send pre-computed embeddings as plain numeric arrays to vector fields.
- All match sites in lexical writer/reader/parser/merge_engine, result_processor, scoring/similarity, examples/common updated.

Protocol / server:

- New `Int64ArrayValue` and `Float64ArrayValue` proto messages added to the `Value` oneof.
- `IntegerOption` / `FloatOption` proto messages gain `multi_valued`.
- `laurus-server`: `convert::document` round-trips the new variants; `convert::schema` round-trips the flag; gateway JSON conversion recognises `multi_valued` on input and surfaces array-typed values on output; MCP `convert::proto_value_to_json` handles the new variants.

Bindings (six):

- `laurus-{python,nodejs,wasm,ruby,php}::Schema::add_integer_field` / `add_float_field` accept a `multi_valued` parameter (default `false`). Each binding's `data_value_to_*` converter emits the new variants using each language's idiomatic array type.
- `laurus-cli::create::run_schema` prompts for `multi_valued` for numeric field types; `output::format_data_value` / `data_value_to_json` render arrays.

Tests:

- New unit tests in `engine::type_inference` for array → multi-valued inference (replaces the "not yet supported" assertion).
- New integration test `dynamic_auto_adds_int64_array_field` in `dynamic_schema_test` verifies a JSON `[85, 72, 95]` is auto-added with `multi_valued = true` and survives round-trip retrieval.
- All 698 existing unit tests + 10 dynamic-schema integration tests pass.

Documentation (EN + JA):

- `concepts/schema_and_fields`: type-inference table now lists arrays; new "Multi-valued numeric fields" section explaining the contract.
- `laurus-cli/schema_format`: TOML `multi_valued` option documented.
- `laurus-server/grpc_api`: option struct signatures updated.
- `laurus-{python,nodejs,wasm,ruby,php}/api_reference`: signatures updated with the new parameter and a one-line description.

Out of scope (PR2):

- `FieldDocValues` Single/Multi split (storage layer).
- BKD multi-point registration for the same `(field, doc)` key.
- `RangeScorer` constant scoring with "any value matches" evaluation.

Refs #281

…#281) (#287)

PR1 (#286) put `Int64Array` / `Float64Array` and the `multi_valued` schema flag in place but stopped at the storage boundary: the lexical store still treated each `(field, doc_id)` as a single BKD point, so range queries on multi-valued fields could not match more than one element per document. This PR plumbs multi-value support through the indexing pipeline and switches range-query scoring to Lucene's constant-score model.

Storage / indexing pipeline:

- `AnalyzedDocument::point_values` is now `AHashMap<String, Vec<Vec<f64>>>`, a list of points per field rather than a single point. Single-valued numeric / datetime / geo fields contribute one point; multi-valued numeric fields contribute one point per element.
- `DocumentParser::parse` (`laurus/src/lexical/core/parser.rs`) and the inverted index writer's inline parser (`laurus/src/lexical/index/inverted/writer.rs`) emit a separate 1D point per `Int64Array` / `Float64Array` element and a single appropriately-shaped point for the existing scalar / geo cases.
- `InvertedIndexWriter::write_bkd_trees` flattens the new structure into one BKD entry per point. The BKD reader's `range_search` already deduplicates `doc_id`s, so a multi-valued document is reported at most once per query (the Lucene contract).

Range query semantics (Lucene parity):

- `NumericRangeQuery::count_matching_documents` and the fallback matcher (`laurus/src/lexical/query/range.rs`) recognise `Int64Array` / `Float64Array` values and apply "any value matches" via `arr.iter().any(...)`. The BKD path inherits "any match" from the BKD-level dedup.
- `RangeScorer::score` now returns `boost` unconditionally: Lucene `PointRangeQuery` uses constant scoring, and per-value IDF / proximity weighting interacts poorly with multi-valued "any match" semantics. The legacy `range_idf` / `proximity_score` heuristics and the unused `lower_bound` / `upper_bound` / `range_width` / `total_docs` fields were removed; `RangeScorer::new` keeps its signature so existing call sites continue to compile.

doc_values:

- Intentionally left as the existing `(doc_id, FieldValue)` map. A multi-valued document stores one `FieldValue::Int64Array(...)` or `Float64Array(...)` per `(doc_id, field)`, which is sufficient for retrieval and round-trip without restructuring storage. A Lucene-style `NumericDocValues` / `SortedNumericDocValues` split would only matter for sort selectors (MIN / MAX / MEDIAN / SUM), which is left as a future task.

Tests:

- `laurus/tests/multi_valued_numeric_test.rs` (new) exercises the full writer → BKD → reader → range query pipeline:
  - `int64_array_any_value_in_range`: 3 docs with varied integer arrays and inclusive bounds; only the docs with any value in range match.
  - `int64_array_dedups_doc_when_multiple_values_match`: a single doc with three values inside the range is reported exactly once.
  - `float64_array_any_value_in_range`: same shape as integer, for `Float64Array`.
  - `single_valued_field_unchanged_by_multi_valued_changes`: regression for the existing single-valued integer path.
- All 698 unit tests + every existing integration test pass without modification, including range / BKD / scoring tests that exercised the old IDF + proximity scorer.

Closes #281
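
The one-point-per-element layout and the doc-level deduplication described in this commit message can be sketched as follows. `PointEntry`, `flatten_points`, and this `range_search` are hypothetical, simplified stand-ins: a linear scan over a flat list replaces the real BKD tree, and the names do not reflect the actual laurus types or signatures.

```rust
use std::collections::BTreeSet;

/// Hypothetical flattened index entry: one 1D point tied to a document.
#[derive(Debug, Clone, PartialEq)]
struct PointEntry {
    doc_id: u64,
    value: f64,
}

/// Emit one entry per element, so a multi-valued document contributes
/// several points and a single-valued document contributes exactly one.
fn flatten_points(docs: &[(u64, Vec<f64>)]) -> Vec<PointEntry> {
    docs.iter()
        .flat_map(|(doc_id, values)| {
            values.iter().map(move |&value| PointEntry {
                doc_id: *doc_id,
                value,
            })
        })
        .collect()
}

/// Linear-scan stand-in for a BKD range search over an inclusive range.
/// Collecting into a set deduplicates doc ids, so a document with several
/// matching values is reported once; this is what yields "any match".
fn range_search(entries: &[PointEntry], lower: f64, upper: f64) -> Vec<u64> {
    let hits: BTreeSet<u64> = entries
        .iter()
        .filter(|e| e.value >= lower && e.value <= upper)
        .map(|e| e.doc_id)
        .collect();
    hits.into_iter().collect()
}
```

Because deduplication happens at the doc-id level, the query layer does not need to know how many values a document holds.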

mosuka merged commit 1aef56c into main Apr 25, 2026
22 checks passed

mosuka deleted the feat/multi-valued-numeric-data-value branch April 25, 2026 10:44

mosuka mentioned this pull request Apr 25, 2026

feat: index and query multi-valued numeric fields end-to-end (PR2/2 of #281) #287

Merged

3 tasks

mosuka mentioned this pull request Apr 25, 2026

refactor: route HTTP gateway JSON through engine type inference #288

Merged

6 tasks

mosuka merged 1 commit into main from feat/multi-valued-numeric-data-value
