feat: multi-valued numeric DataValue and schema flag (PR1/2 of #281) #286

Merged: mosuka merged 1 commit into main from feat/multi-valued-numeric-data-value on Apr 25, 2026

mosuka (Owner) commented on Apr 25, 2026

Summary

  • Add DataValue::Int64Array / Float64Array plus the multi_valued: bool flag on IntegerOption / FloatOption. Lucene semantics: array values per document, range queries match if any value satisfies the predicate (constant scoring), single → multi auto-wraps, multi → single is rejected (no silent truncation).
  • Wire the new variants and the schema flag end-to-end through DataValue plumbing, type inference, type coercion, proto, REST / gRPC / MCP, and all six language bindings, with English and Japanese documentation refreshed.
  • Lexical-store internals (doc_values Single/Multi split, BKD multi-point registration, range query "any match" evaluation) are deferred to PR2 so the storage contract changes in isolation.

What changed

Core (laurus)

  • DataValue::Int64Array(Vec<i64>) and Float64Array(Vec<f64>), accessors, From<Vec<i64>> / From<Vec<f64>>, and DocumentBuilder::add_int64_array / add_float64_array.
  • IntegerOption.multi_valued / FloatOption.multi_valued flags (default false, #[serde(default)]).
  • infer_from_array now returns the new variants with multi_valued: true instead of "not yet supported".
  • coerce_to_integer / coerce_to_float branch on multi_valued. Multi-valued accepts arrays, single values, bool, and parseable text (auto-wraps); single-valued rejects arrays explicitly.
  • coerce_to_vector also accepts Int64Array / Float64Array so REST/JSON callers can still pass pre-computed embeddings as plain numeric arrays to vector fields.
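
The auto-wrap / reject rule for integer coercion can be illustrated with a minimal sketch. The `Value` enum below is a hypothetical stand-in for the relevant subset of `DataValue`, and the function signature, the error type, the bool-to-0/1 mapping, and the whitespace trimming are all assumptions made for illustration, not the crate's actual code.

```rust
/// Hypothetical stand-in for the subset of `DataValue` relevant here.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int64(i64),
    Bool(bool),
    Text(String),
    Int64Array(Vec<i64>),
}

/// Sketch of coercion into an integer field (assumed signature).
pub fn coerce_to_integer(value: Value, multi_valued: bool) -> Result<Value, String> {
    // Normalise scalars to a single i64; arrays take an early exit.
    let scalar = match value {
        Value::Int64Array(items) => {
            // Arrays are only legal on multi-valued fields; never truncate silently.
            return if multi_valued {
                Ok(Value::Int64Array(items))
            } else {
                Err("array sent to a single-valued integer field".to_string())
            };
        }
        Value::Int64(n) => n,
        // Assumption: booleans map to 0 / 1.
        Value::Bool(b) => i64::from(b),
        // Assumption: surrounding whitespace is ignored when parsing text.
        Value::Text(s) => s
            .trim()
            .parse::<i64>()
            .map_err(|e| format!("cannot parse {s:?} as integer: {e}"))?,
    };

    if multi_valued {
        // Single -> multi: auto-wrap into a one-element array.
        Ok(Value::Int64Array(vec![scalar]))
    } else {
        Ok(Value::Int64(scalar))
    }
}
```

Under these assumptions, `coerce_to_integer(Value::Int64(7), true)` yields a one-element array, while passing an array with `multi_valued = false` returns an error instead of dropping elements.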

Protocol & server

  • New Int64ArrayValue / Float64ArrayValue proto messages added to the Value oneof. IntegerOption / FloatOption proto gain multi_valued.
  • laurus-server::convert::document round-trips the new variants; convert::schema round-trips the flag; gateway JSON converters recognise multi_valued on input and emit array values on output. laurus-mcp::convert handles them too.

Bindings

  • laurus-{python,nodejs,wasm,ruby,php}::Schema::add_integer_field / add_float_field accept a multi_valued parameter (default false). Each binding's data_value_to_* converter emits arrays using its language's idiomatic type.
  • laurus-cli::create prompts for multi_valued on numeric fields; output::format_data_value and data_value_to_json render the new variants.

Documentation (EN + JA)

  • concepts/schema_and_fields — type-inference table now lists arrays, plus a new "Multi-valued numeric fields" section that documents the auto-wrap / reject behaviour and the constant-score range-query contract.
  • laurus-cli/schema_format — TOML multi_valued documented.
  • laurus-server/grpc_api: IntegerOption / FloatOption field lists updated.
  • laurus-{python,nodejs,wasm,ruby,php}/api_reference — signatures plus one-line descriptions of multi_valued.

Out of scope (PR2)

  • FieldDocValues Single/Multi split.
  • BKD multi-point registration for the same (field, doc) key.
  • RangeScorer constant scoring with "any value matches" semantics.

Until PR2 lands, multi-valued integers/floats round-trip through the engine surface (DataValue, schema, ingest, retrieve via stored fields) but range queries against multi-valued fields are not yet evaluated correctly — that requires the storage-layer changes in PR2.

Test plan

  • cargo fmt --check
  • cargo clippy -p laurus -p laurus-server -p laurus-cli -p laurus-mcp -p laurus-python -p laurus-nodejs -p laurus-wasm --all-targets -- -D warnings
  • cargo test -p laurus -p laurus-server -p laurus-cli -p laurus-mcp — 698 unit tests + 10 dynamic-schema integration tests pass.
  • New unit tests in engine::type_inference::tests::infer_integer_array_to_int64_array / infer_float_array_to_float64_array.
  • New integration test dynamic_auto_adds_int64_array_field in dynamic_schema_test.
  • markdownlint-cli2 "docs/src/**/*.md" "docs/ja/src/**/*.md" reports 0 errors.
  • mdbook build docs and mdbook build docs/ja succeed.

Refs #281

Introduce `DataValue::Int64Array` and `DataValue::Float64Array` and the
`multi_valued: bool` flag on `IntegerOption` / `FloatOption`. Lucene-style
semantics: array values per document, range queries match if any value
satisfies the predicate (constant scoring), single values are auto-wrapped
into a one-element array, and arrays sent to a single-valued field are
rejected rather than silently truncated.

This PR is the first of two for #281. It establishes the data flow
end-to-end (DataValue, type inference, type coercion, proto, REST/gRPC
conversion, all six language bindings, documentation). The lexical store
internals — `doc_values` Single/Multi split, BKD multi-point registration,
range query "any match" evaluation — are deferred to PR2 so the storage
contract is changed in isolation.

Core (laurus crate):

- `DataValue::Int64Array(Vec<i64>)` / `Float64Array(Vec<f64>)` with
  `as_int64_array` / `as_float64_array` accessors, `From<Vec<i64>>` /
  `From<Vec<f64>>`, and `DocumentBuilder::add_int64_array` /
  `add_float64_array`.
- `IntegerOption.multi_valued` / `FloatOption.multi_valued` (default
  `false`, `#[serde(default)]`).
- `infer_from_array` now returns `Int64Array` or `Float64Array` (with
  `multi_valued: true`) instead of "not yet supported".
- `coerce_to_integer` / `coerce_to_float` branch on `multi_valued`:
  multi-valued accepts arrays, single values, bool, and parseable text
  (auto-wrap); single-valued rejects arrays explicitly.
- `coerce_to_vector` accepts `Int64Array` / `Float64Array` (cast
  element-wise to f32) so REST/JSON callers can still send pre-computed
  embeddings as plain numeric arrays to vector fields.
- All match sites in lexical writer/reader/parser/merge_engine,
  result_processor, scoring/similarity, examples/common updated.
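
The element-wise cast performed for vector fields might look roughly like the sketch below. `NumericArray` is a hypothetical stand-in for the two array variants of `DataValue`, and the function signature is assumed; only the "cast element-wise to f32" behaviour comes from the description above.

```rust
/// Hypothetical stand-in for the two numeric array variants of `DataValue`.
#[derive(Debug, Clone, PartialEq)]
pub enum NumericArray {
    Int64Array(Vec<i64>),
    Float64Array(Vec<f64>),
}

/// Sketch: turn a plain numeric array into an f32 embedding.
/// `as f32` is a lossy cast, so very large integers and high-precision
/// floats are rounded to the nearest representable f32.
pub fn coerce_to_vector(value: &NumericArray) -> Vec<f32> {
    match value {
        NumericArray::Int64Array(items) => items.iter().map(|&n| n as f32).collect(),
        NumericArray::Float64Array(items) => items.iter().map(|&x| x as f32).collect(),
    }
}
```

This is why a JSON body such as `[0.5, -1.25]` can still reach a vector field even though type inference now classifies it as a `Float64Array`.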

Protocol / server:

- New `Int64ArrayValue` and `Float64ArrayValue` proto messages added to
  the `Value` oneof.
- `IntegerOption` / `FloatOption` proto messages gain `multi_valued`.
- `laurus-server`: `convert::document` round-trips the new variants;
  `convert::schema` round-trips the flag; gateway JSON conversion
  recognises `multi_valued` on input and surfaces array-typed values on
  output; MCP `convert::proto_value_to_json` handles the new variants.

Bindings (six):

- `laurus-{python,nodejs,wasm,ruby,php}::Schema::add_integer_field` /
  `add_float_field` accept a `multi_valued` parameter (default `false`).
  Each binding's `data_value_to_*` converter emits the new variants
  using each language's idiomatic array type.
- `laurus-cli::create::run_schema` prompts for `multi_valued` for
  numeric field types; `output::format_data_value` /
  `data_value_to_json` render arrays.

Tests:

- New unit tests in `engine::type_inference` for array → multi-valued
  inference (replaces the "not yet supported" assertion).
- New integration test `dynamic_auto_adds_int64_array_field` in
  `dynamic_schema_test` verifies a JSON `[85, 72, 95]` is auto-added
  with `multi_valued = true` and survives round-trip retrieval.
- All 698 existing unit tests + 10 dynamic-schema integration tests pass.
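
As a rough illustration of what the new inference tests assert, the sketch below infers a multi-valued type from an array of JSON numbers. `JsonNumber`, `Inferred`, the widening of mixed arrays to floats, and the handling of empty arrays are assumptions for this example; the real `infer_from_array` may treat those cases differently.

```rust
/// Hypothetical model of a JSON number as seen by type inference.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JsonNumber {
    Int(i64),
    Float(f64),
}

/// Hypothetical inference result: the array value plus the schema flag.
#[derive(Debug, Clone, PartialEq)]
pub enum Inferred {
    Int64Array { values: Vec<i64>, multi_valued: bool },
    Float64Array { values: Vec<f64>, multi_valued: bool },
}

/// Sketch of array inference.
/// Assumptions: an empty array cannot be inferred, and an array mixing
/// integers and floats is widened to `Float64Array`.
pub fn infer_from_array(items: &[JsonNumber]) -> Option<Inferred> {
    if items.is_empty() {
        return None;
    }

    // Collecting into Option<Vec<_>> yields None as soon as a float is seen.
    let all_ints: Option<Vec<i64>> = items
        .iter()
        .map(|n| match n {
            JsonNumber::Int(i) => Some(*i),
            JsonNumber::Float(_) => None,
        })
        .collect();

    let inferred = match all_ints {
        Some(values) => Inferred::Int64Array { values, multi_valued: true },
        None => Inferred::Float64Array {
            values: items
                .iter()
                .map(|n| match n {
                    JsonNumber::Int(i) => *i as f64,
                    JsonNumber::Float(f) => *f,
                })
                .collect(),
            multi_valued: true,
        },
    };
    Some(inferred)
}
```

With this sketch, the JSON array `[85, 72, 95]` from the integration test maps to `Int64Array` with `multi_valued` set to `true`.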

Documentation (EN + JA):

- `concepts/schema_and_fields`: type-inference table now lists arrays;
  new "Multi-valued numeric fields" section explaining the contract.
- `laurus-cli/schema_format`: TOML `multi_valued` option documented.
- `laurus-server/grpc_api`: option struct signatures updated.
- `laurus-{python,nodejs,wasm,ruby,php}/api_reference`: signatures
  updated with the new parameter and a one-line description.

Out of scope (PR2):

- `FieldDocValues` Single/Multi split (storage layer).
- BKD multi-point registration for the same `(field, doc)` key.
- `RangeScorer` constant scoring with "any value matches" evaluation.

Refs #281
mosuka merged commit 1aef56c into main on Apr 25, 2026
22 checks passed
mosuka deleted the feat/multi-valued-numeric-data-value branch on April 25, 2026 10:44
mosuka added a commit that referenced this pull request Apr 25, 2026
…#281) (#287)

PR1 (#286) put `Int64Array` / `Float64Array` and the `multi_valued`
schema flag in place but stopped at the storage boundary — the lexical
store still treated each `(field, doc_id)` as a single BKD point, so
range queries on multi-valued fields could not match more than one
element per document. This PR plumbs multi-value support through the
indexing pipeline and switches range-query scoring to Lucene's
constant-score model.

Storage / indexing pipeline:

- `AnalyzedDocument::point_values` is now
  `AHashMap<String, Vec<Vec<f64>>>` — a list of points per field rather
  than a single point. Single-valued numeric / datetime / geo fields
  contribute one point; multi-valued numeric fields contribute one
  point per element.
- `DocumentParser::parse` (`laurus/src/lexical/core/parser.rs`) and the
  inverted index writer's inline parser
  (`laurus/src/lexical/index/inverted/writer.rs`) emit a separate 1D
  point per `Int64Array` / `Float64Array` element and a single
  appropriately-shaped point for the existing scalar / geo cases.
- `InvertedIndexWriter::write_bkd_trees` flattens the new structure into
  one BKD entry per point. The BKD reader's `range_search` already
  deduplicates `doc_id`s, so a multi-valued document is reported at
  most once per query — the Lucene contract.
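
The "one BKD entry per point, one hit per document" contract can be sketched as follows. This is not the laurus implementation: a linear scan stands in for the BKD tree, `std::collections::HashMap` stands in for `AHashMap`, and `BkdEntry`, `flatten_points`, `range_search_1d`, `multi_valued_doc`, and the `u64` document id are hypothetical names and types chosen for illustration.

```rust
use std::collections::{BTreeSet, HashMap};

/// Per-document points: field name -> list of points (each point is a Vec<f64>).
pub type PointValues = HashMap<String, Vec<Vec<f64>>>;

/// Hypothetical flattened entry: one per point, tagged with its document.
#[derive(Debug, Clone, PartialEq)]
pub struct BkdEntry {
    pub point: Vec<f64>,
    pub doc_id: u64,
}

/// Build the points of a multi-valued integer field: one 1D point per element.
pub fn multi_valued_doc(doc_id: u64, field: &str, values: &[i64]) -> (u64, PointValues) {
    let points = values.iter().map(|&v| vec![v as f64]).collect();
    (doc_id, HashMap::from([(field.to_string(), points)]))
}

/// Flatten every point of `field` across all documents into one entry each.
pub fn flatten_points(docs: &[(u64, PointValues)], field: &str) -> Vec<BkdEntry> {
    let mut entries = Vec::new();
    for (doc_id, points) in docs {
        if let Some(field_points) = points.get(field) {
            for point in field_points {
                entries.push(BkdEntry { point: point.clone(), doc_id: *doc_id });
            }
        }
    }
    entries
}

/// Inclusive 1D range search; the set deduplicates doc ids so a document
/// with several matching values is reported once.
pub fn range_search_1d(entries: &[BkdEntry], min: f64, max: f64) -> Vec<u64> {
    let hits: BTreeSet<u64> = entries
        .iter()
        .filter(|e| e.point.len() == 1 && e.point[0] >= min && e.point[0] <= max)
        .map(|e| e.doc_id)
        .collect();
    hits.into_iter().collect()
}
```

A document holding `[85, 72, 95]` contributes three entries, yet a search over `70.0..=100.0` returns its id a single time, which is the "any match" behaviour the BKD path inherits from deduplication.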

Range query semantics (Lucene parity):

- `NumericRangeQuery::count_matching_documents` and the fallback
  matcher (`laurus/src/lexical/query/range.rs`) recognise
  `Int64Array` / `Float64Array` values and apply "any value matches"
  via `arr.iter().any(...)`. The BKD path inherits "any match" from
  the BKD-level dedup.
- `RangeScorer::score` now returns `boost` unconditionally — Lucene
  `PointRangeQuery` uses constant scoring, and per-value IDF /
  proximity weighting interacts poorly with multi-valued "any match"
  semantics. The legacy `range_idf` / `proximity_score` heuristics and
  the unused `lower_bound` / `upper_bound` / `range_width` /
  `total_docs` fields were removed; `RangeScorer::new` keeps its
  signature so existing call sites continue to compile.
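
A minimal sketch of the fallback "any value matches" check and of constant scoring is given below. The `FieldValue` subset, `matches_range`, the inclusive bounds, and `ConstantRangeScorer` are hypothetical names and choices for this example; the real `RangeScorer::new` keeps its existing signature, which is not reproduced here.

```rust
/// Hypothetical subset of the stored field value type.
#[derive(Debug, Clone, PartialEq)]
pub enum FieldValue {
    Int64(i64),
    Float64(f64),
    Int64Array(Vec<i64>),
    Float64Array(Vec<f64>),
}

/// Inclusive range check: scalars match directly, arrays match when at
/// least one element falls inside the range.
pub fn matches_range(value: &FieldValue, min: f64, max: f64) -> bool {
    let in_range = |x: f64| x >= min && x <= max;
    match value {
        FieldValue::Int64(n) => in_range(*n as f64),
        FieldValue::Float64(x) => in_range(*x),
        FieldValue::Int64Array(arr) => arr.iter().any(|&n| in_range(n as f64)),
        FieldValue::Float64Array(arr) => arr.iter().any(|&x| in_range(x)),
    }
}

/// Hypothetical constant scorer: every matching document gets the same
/// score, regardless of how many of its values matched or how close they
/// are to the bounds.
pub struct ConstantRangeScorer {
    boost: f32,
}

impl ConstantRangeScorer {
    pub fn new(boost: f32) -> Self {
        Self { boost }
    }

    pub fn score(&self, _doc_id: u64) -> f32 {
        self.boost
    }
}
```

Constant scoring sidesteps the question of which element of a multi-valued field should drive an IDF or proximity weight: a document either matches or it does not.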

doc_values:

- Intentionally left as the existing `(doc_id, FieldValue)` map. A
  multi-valued document stores one `FieldValue::Int64Array(...)` or
  `Float64Array(...)` per `(doc_id, field)`, which is sufficient for
  retrieval and round-trip without restructuring storage. A
  Lucene-style `NumericDocValues` / `SortedNumericDocValues` split
  would only matter for sort selectors (MIN / MAX / MEDIAN / SUM),
  which is left as a future task.

Tests:

- `laurus/tests/multi_valued_numeric_test.rs` (new) exercises the full
  writer → BKD → reader → range query pipeline:
  - `int64_array_any_value_in_range`: 3 docs with varied integer
    arrays and inclusive bounds; only the docs with any value in range
    match.
  - `int64_array_dedups_doc_when_multiple_values_match`: a single doc
    with three values inside the range is reported exactly once.
  - `float64_array_any_value_in_range`: same shape as integer, for
    `Float64Array`.
  - `single_valued_field_unchanged_by_multi_valued_changes`:
    regression for the existing single-valued integer path.
- All 698 unit tests + every existing integration test pass without
  modification, including range / BKD / scoring tests that exercised
  the old IDF + proximity scorer.

Closes #281