fix: fulltext() scoring on configured predicates with novelty present, plus fulltext and vector scoring perf#1430
Conversation
defbd6b to
4d4430c
Compare
…:Lit rows Whenever a ledger carries any novelty, BinaryScanOperator disables late materialization (binary_scan.rs overlay-epoch gate), so every scanned value reaches eval_fulltext as Binding::Lit rather than EncodedLit. The Lit arm only scored @fulltext-datatype values and returned unbound for everything else — so fulltext() on f:fullTextDefaults-configured predicates went null for EVERY row (indexed or not) the moment one unindexed commit existed. A ledger queried exactly at index_t worked, which is why the memory-only reproducers kept passing while S3/Lambda deployments (indexing and querying in separate processes, rarely fully caught up) kept regressing. Generalize the Lit arm to mirror the EncodedLit arm: resolve the arena bucket from the row's lang tag (dict-assigned lang_id, English bucket for untagged/@fulltext values), analyze with the bucket's analyzer, and score via unified BM25 with the novelty delta. No-arena semantics are unchanged: non-@fulltext strings stay unbound, @fulltext values keep the TF-saturation fallback. Also extend ExecutableQuery::uses_fulltext to expression positions outside WHERE (ORDER BY binds, post-aggregation binds, HAVING) so the provider map is attached for those shapes too. New persisted-backend coverage (file storage, fresh process per phase, mirroring the two-Lambda deployment): - full reindex + reload scores untagged and @en probes - incremental build fetches and extends the persisted arena - unindexed novelty commit scores; indexed values keep scoring (the failing reproducer for this bug) - language-tagged values build per-language buckets (French stemming, cross-bucket isolation) that round-trip the FIR6 root
… row fulltext() evaluated per bound row and repeated query-invariant work every time: Analyzer::for_language clones the language's entire stopword set per call, the query string was re-tokenized and re-stemmed per row, and effective stats / df / IDF were recomputed per row with two binary searches over the ~60k-string term dictionary per term (once for df, once for tf). On materialized Binding::Lit rows — the norm whenever any novelty exists — the document text was also re-analyzed per row through the novelty path even when the doc was already in the arena. At 211k values that added up to ~27s per query in the field. - Analyzer::shared(lang): process-wide per-language analyzers, built once and leaked (bounded by the Language enum); for_language stays for build-time callers that want an owned instance. - PreparedFulltextQuery: bucket analyzer choice, query stems, arena term ids, IDF weights, and effective corpus stats computed once per (store_id, epoch, to_t, g_id, p_id, lang_id, query), moka-cached with a thread-local one-slot memo in front. store_id in the key pins prepared term ids to the arena instance they index into. Per-row scoring is now a doc_bow probe plus one u32 binary search per term, replacing score_bm25_indexed/score_bm25_novelty (same formula, same term order — scores are unchanged). - Binding::Lit indexed fast path: resolve the value's string_id (persisted dict, then dict novelty) and score from the precomputed arena BoW; only genuinely unindexed values analyze document text. Perf smoke harness (ignored test, run manually): 50k configured labels, full reindex, one unindexed commit to force the materialized-row path. Before: 1.45s per query; after: 61ms (~24x) in release on M-series.
The inline vector functions (dotProduct, cosineSimilarity, euclideanDistance) resolved both arguments through ComparableValue, which allocs and copies the full vector per row — including the query vector, which never changes across rows (~12 KB memcpy per row for a 1536-dim embedding on both sides). The math itself used plain iterator closures: cosine made three separate passes over both vectors, and the runtime-dispatched SIMD kernels in eval::vector_math had no production callers at all (only a bench imported them). - eval_binary_vector_fn resolves arguments to borrowed &[f64] slices straight from the row's Binding::Lit / the expression's constant (decoding EncodedLit once into an owned scratch), skipping the ComparableValue round-trip and both per-row copies. - Compute routes through vector_math::dot_f64 / cosine_f64 / l2_f64 (runtime AVX/SSE2/NEON dispatch above the 256-length threshold). - cosine_f64/f32 now use three dispatched dot-kernel passes above the SIMD threshold instead of the single scalar fused pass (below the threshold the fused scalar loop still wins). Perf smoke harness (ignored test, run manually): 50k novelty rows of 256-dim vectors, results verified against scalar recomputation of the same generated vectors (exact hit counts, top-5 ranks, scores within 1e-6). End-to-end gains are modest because the scan/materialize floor (~145 ms for 50k rows) dominates flat ranking, not the scoring eval: dotProduct 163->148 ms, cosine 205->151 ms, euclidean 167->149 ms in release on M-series; at 1024 dims the scan floor swallows nearly all of it. Larger wins for flat vector search at scale would need Arc-based vector values through scan/materialization, or HNSW top-k routing.
FlakeValue::Vector held Vec<f64>, so every clone through the flake → binding → batch → output pipeline deep-copied the whole embedding (12 KB per clone for a 1536-dim vector), and the same cost hit transact and novelty merges on vector-heavy ledgers. Switch to Arc<[f64]>: clones are now a refcount bump everywhere a FlakeValue is cloned. - Serde stays a plain number array via a local with-adapter (no global serde rc opt-in; wire form identical to the previous Vec<f64> — covered by a new round-trip test). - FlakeValue::max() keeps the empty-vector sentinel via a shared static. - FlakeValue <-> ComparableValue vector conversions become free Arc clones instead of buffer copies; VecArg in eval/vector.rs drops its Owned variant (EncodedLit decode now hands back the Arc directly). - Remaining call sites are mechanical: .into() at construction, .iter() where loops relied on Vec's IntoIterator. Measured on the flat-rank harness (release, M-series): ~5-8% at 50k x 256-dim, ~6% at 20k x 1024-dim — the flat-rank floor is per-row operator machinery, not vector copies. The primary value is the O(1) clone guarantee for embeddings across transact/novelty/query paths and removing the standing foot-gun where any binding clone silently paid O(dims).
Indexed vector arguments (EncodedLit VECTOR_ID) reached eval through decode_value, which collects a fresh Vec<f64> and converts it to an Arc per row — two allocations plus a full widening copy for every scored row, even though LazyVectorArena::lookup_vector already returns a zero-copy, shard-pinned f32 slice. - BinaryIndexStore::vector_slice(g_id, p_id, handle): public zero-copy lookup of the packed shard data. - eval resolves EncodedLit VECTOR_ID args to the borrowed f32 slice and widens into a reused thread-local scratch at compute time — same widened values, same f64 kernels, so scores are unchanged. Arena misses (ephemeral novelty handles, no arena for the predicate) fall back to the decode path. - Perf harness gains an indexed phase: full reindex + reload with empty novelty so rows late-materialize and eval sees EncodedLit, verified against the same scalar ground truth. Measured (50k x 256-dim, release, M-series): indexed warm run 27.0 -> 21.0 ms (~22%; the eliminated per-row allocs). Novelty-resident timing unchanged as expected — that path materializes values in the scan, not eval. Incidental finding worth recording: the indexed pipeline runs the identical query 6.5x faster than novelty-resident (21 ms vs 137 ms per 50k rows), so keeping vector corpora indexed dominates any eval-side tuning.
Detects the canonical vector-similarity shape — single triple over a
vector predicate, bind ?score dotProduct(?vec, <const|single-row VALUES
target>), optional threshold filter, optional ORDER BY DESC(?score),
optional LIMIT/OFFSET, projection over {?s, ?score} — and executes it
as a direct scan of the vector arena instead of driving every row
through the scan → bind → filter → sort pipeline.
Executor (fast_vector_topk.rs):
- rows come from the shared overlay-merging cursor (novelty asserts
appear, retracted base rows cancel — the same merge lane the count
fast paths use); novelty vector values (ephemeral handles) are
pre-decoded once before the scan
- VectorArenaSnapshot pins every shard once for zero-probe reads;
scoring widens f32 shard data to f64 and uses the same SIMD dot
kernel as eval, so scores are bit-identical to the pipeline
- large base scans partition by subject ranges at leaf/leaflet
boundaries onto the global rayon pool, each partition with its
subject-sliced overlay ops and a bounded top-k heap; deterministic
tie-break (score desc, then s_id)
- bails to the generic pipeline on any uncertainty: non-vector rows on
the predicate, missing arena with novelty present, dims mismatch,
unresolvable overlay ops, policy / multi-ledger / time-travel
Parity test (vector_topk_fast_path_parity) verifies against scalar
ground truth across lanes: indexed top-k, overlay with a novelty assert
plus a retraction via upsert, multi-cardinality subjects, and fallback
shapes (?vec projected) staying correct via the pipeline.
Measured (50k x 256-dim indexed, release, M-series): dotProduct
threshold + ORDER BY query 21.0 ms -> 3.3 ms (6.4x; 8x vs the
pre-zero-copy 27 ms), identical results. Novelty-resident ledgers
(no arena) keep the pipeline path unchanged.
The vector and fulltext scaling guidance predated the flat-rank performance work (prepared fulltext scoring, zero-copy SIMD vector eval, the vector top-k fast path) and treated all query shapes alike. vector-search.md: - Document the flat-rank top-k fast path: the canonical signature it serves, exact/strongly-consistent semantics, bail conditions, and fresh 50k x 768-dim measurements (novelty 324 ms / pipeline 30 ms / fast path 8.1 ms, ~6M vec/s). - Split the HNSW rule-of-thumb table into two columns: top-k-signature queries stay interactive to ~1M vectors (and viable to ~10M memory-resident), while general-shape queries keep roughly the old thresholds. Call out the consistency trade (HNSW graph source is eventually consistent; flat scan is exact and read-your-writes). - Replace the aspirational normalized-embedding bullet with actionable guidance: prefer dotProduct for unit-normalized embeddings (identical ranking to cosine, cheaper, and fast-path-eligible). fulltext.md: - Note that indexed-path numbers now hold with unindexed commits present (prepared per-query scoring + BoW lookup on materialized rows) — previously the with-novelty path was drastically slower. - Quantify the 500K-1M tier (~1-2 us/doc, no top-k pruning inline) and correct the WAND mention (graph-source feature, not inline); add the consistency trade note. README.md: qualified flat-scan vs HNSW summary to match. Harness: vector_flatrank_perf_50k now also times the general-pipeline lane (duplicate threshold filter defeats fast-path detection) so both columns of the guidance stay measurable in one run.
The docs led with the @fulltext datatype and framed the #config-graph approach as the advanced option. That inverts the actual guidance: the config path is the standards-conformant one — searchable values keep their xsd:string / rdf:langString datatypes (external consumers, validation, federation all see ordinary literals) and language-tagged values get per-language analysis. Tagging a value @fulltext replaces its stored datatype with the Fluree-specific f:fullText, which the docs never surfaced as a trade-off. - fulltext.md: intro recommends f:fullTextDefaults first and states the datatype-replacement trade-off plainly; @fulltext is positioned as a quick-start for siloed databases or properties orthogonal to the core data model. 'Why a dedicated datatype?' and Portability now cover the external-RDF caveat; the when-to-use table leads with the config path. - indexing-and-search README: entry-point ordering and decision list updated to match. - cookbook-search.md: quick start keeps the self-contained @fulltext example but points standards-focused readers at f:fullTextDefaults up front (queries are identical either way). - concepts/datatypes.md: @fulltext entry notes the stored-datatype consequence and the config alternative.
eab70ae to
a2a8b4d
Compare
| if order_desc || limit.is_some() { | ||
| rows.sort_unstable_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0))); | ||
| } |
There was a problem hiding this comment.
For a detected query with LIMIT but no ORDER BY, the fast path still sorts by score-desc (order_desc || limit.is_some()) and returns the top-k by score, whereas the generic pipeline returns k arbitrary rows (scan/subject order). SPARQL leaves bare-LIMIT order unspecified, so this is spec-legal, but it silently changes which rows come back once this optimization triggers, and it breaks the PR's stated "bit-identical to the pipeline" invariant for that shape. Cleanest fix is to decline the shape in detection so it routes to the pipeline unchanged:
// in detect_vector_topk, after computing order_var and before building the spec:
// A bare LIMIT with no ORDER BY means "arbitrary k rows"; imposing top-k-by-score
// here would diverge from the pipeline. Only serve LIMIT alongside DESC(?score).
if query.limit.is_some() && order_var.is_none() {
return None;
}(If serving bare-LIMIT is intentional, please add a comment at line 536 stating the divergence is accepted.)
| for partial in partials { | ||
| match partial? { | ||
| Some(rows) => { | ||
| for (s, score) in rows { | ||
| merged.push(s, score); | ||
| } | ||
| } | ||
| None => return Ok(None), // a partition hit a bail condition | ||
| } | ||
| } |
There was a problem hiding this comment.
scan_partitioned returns Ok(None) for two semantically different outcomes: "too few partitions, fall back to the serial lane" (lines 647/673/684) and "a partition hit a genuine Bail, defer to the pipeline" (line 735). The caller (lines 526–532) maps every None to "try scan_serial", so a genuine bail triggers a full redundant serial re-scan that will hit the same bail row and then defer. Correct result, wasted work on the (rare) bail path. Distinguish the two, e.g.:
// return type -> Result<Option<Result<Vec<(u64,f64)>, Bail>>> or an enum:
// Ok(None) => too few partitions, try serial
// Ok(Some(Err(Bail))) => a partition bailed, defer straight to pipeline
// Ok(Some(Ok(rows))) => rows
None => return Ok(Some(Err(Bail))), // was: return Ok(None)and have the caller short-circuit to Ok(None) (pipeline) on the Bail variant instead of re-scanning serially.
| fn new(need: Option<usize>) -> Self { | ||
| match need { | ||
| Some(n) => Fold::TopK { | ||
| need: n, | ||
| heap: BinaryHeap::with_capacity(n + 1), | ||
| }, | ||
| None => Fold::All(Vec::new()), | ||
| } | ||
| } |
There was a problem hiding this comment.
Fold::TopK eagerly pre-allocates BinaryHeap::with_capacity(need + 1) where need = limit.saturating_add(offset) (line 494) is a user literal. The realistic concern (verified): a large-but-legal limit (e.g. LIMIT 10000000 over a few thousand vectors) eagerly allocates one oversized heap per partition (up to MAX_PARTITIONS = 16, plus the merge heap and the serial-lane heap). The need + 1 overflow→debug-panic sub-case is real but pathological — it needs LIMIT ≈ 2^64, and release builds wrap to with_capacity(0) and stay correct. Clamp the pre-allocation to the data:
Some(n) => Fold::TopK {
need: n,
// cap the eager reservation; the heap grows if genuinely needed
heap: BinaryHeap::with_capacity(n.saturating_add(1).min(4096)),
},(The per-partition need could also be clamped to min(need, partition_row_estimate) in scan_partitioned, but a fixed cap is simplest.)
| /// End-to-end parity for the flat-rank vector fast path against scalar | ||
| /// ground truth, across its lanes: | ||
| /// | ||
| /// - indexed base (epoch 0): threshold + ORDER BY DESC + LIMIT top-k | ||
| /// - overlay lane: novelty assert (new top vector) + retraction of an | ||
| /// indexed vector via upsert — both must be reflected without reindexing | ||
| /// - multi-cardinality: a subject with two vectors yields two scored rows | ||
| /// - fallback shapes: an extra hydration triple and a projected `?vec` | ||
| /// must not detect (results still correct via the pipeline) | ||
| #[tokio::test] | ||
| async fn vector_topk_fast_path_parity() { |
There was a problem hiding this comment.
vector_topk_fast_path_parity exercises the detection-time bails (projected ?vec, extra hydration triple) but not the runtime Scorer::score → Err(Bail) → pipeline-fallback path — i.e. a non-vector object physically stored on the vector predicate with a bare {?s, ?score} projection (which passes detection, then bails mid-scan). vector_search_mixed_datatypes (line 399) has a non-vector value but projects it, so it never reaches the fast path. Add a case: insert a non-vector literal on the vector predicate, run the canonical {?s ?score} shape, and assert results equal the scalar ground truth (proving the runtime bail correctly defers to the generic plan and emits nothing partial).
…ime bail Three review-driven fixes to the fast vector top-k path: - detect_vector_topk: decline a bare LIMIT with no ORDER BY. The scan lane sorts by score whenever a limit is present, so without this the fast path returned top-k-by-score while the generic pipeline returns k arbitrary scan-order rows — spec-legal but a silent divergence that broke the "bit-identical to the pipeline" invariant. Now only DESC(?score)+LIMIT is served; bare LIMIT routes to the pipeline unchanged. - Fold::new: clamp the eager BinaryHeap reservation to min(need+1, 4096). `need` is a user literal (limit+offset), so a large-but-legal LIMIT would otherwise reserve an oversized heap per partition; push already caps growth at need. saturating_add also removes the pathological need+1 overflow debug-panic. - test: add vector_topk_fast_path_runtime_bail_defers_to_pipeline — a non-vector literal on the vector predicate passes detection then trips Scorer::score -> Err(Bail) mid-scan; asserts the whole fast path defers and returns the pipeline's exact rows (no partial output). Complements the detection-time bail coverage already present.
…gs in a struct - scan_partitioned now returns a PartitionScan enum (TooFewPartitions / Bailed / Rows) instead of an overloaded Ok(None). The caller short-circuits a genuine bail straight to the pipeline instead of re-scanning serially (which would hit the same bail row and waste a full pass); "too few partitions" still falls back to the serial lane. Correct before, but clearer and no redundant scan. - Group the six params shared by both scan lanes (ctx, store, g_id, p_id, scorer, need) into a VectorScan struct. scan_partitioned drops from 9 args to 2 and scan_serial from 7 to 2, removing the too_many_arguments allow; to_t / epoch are derived from ctx inside the partitioned lane.
Problem
Field reports from Solo (Lambda + S3, ledger with ~212k configured
rdfs:labelvalues) showedfulltext(?v, "…")returning unbound for every row on anf:fullTextDefaults-configured predicate — indexed or not, lang-tagged or plain — even though the BM25 arena built correctly. A follow-up report measured ~27s wall time per query once scoring worked. While in the area, the inline (flat-rank) vector scoring path was audited for the same per-row inefficiencies.Root cause (fulltext correctness)
BinaryScanOperatoronly uses late materialization (Binding::EncodedLit) when the overlay epoch is 0. The moment the ledger has any unindexed commit, every scanned row is eagerly materialized intoBinding::Lit— and theLitarm ofeval_fulltextonly scored@fulltext-datatype values, returning unbound for configured plain/lang-tagged strings. So inline full-text search worked only while a ledger was exactly caught up to its index and silently went dark one commit later. The existing memory-only reproducers always query at exactlyindex_t(empty novelty), which is why they kept passing while deployments with separate indexing/query processes kept regressing.Fix: the
Litarm now mirrors theEncodedLitarm — resolve the arena bucket from the row's lang tag (dict-assigned lang_id; English bucket for untagged/@fulltextvalues) and score via unified BM25 with the novelty delta. No-arena semantics are unchanged: non-@fulltextstrings stay unbound,@fulltextvalues keep the TF-saturation fallback.ExecutableQuery::uses_fulltextalso now covers expression positions outside WHERE (ORDER BY binds, post-aggregation binds, HAVING) so the provider map is attached for those shapes.Root cause (fulltext performance)
eval_fulltextruns once per bound row and repeated query-invariant work every time:Analyzer::for_languageclones the language's entire stopword set per call — and was called per rowLitrows re-analyzed the document text through the novelty path even when the doc was already in the arenaFix:
Analyzer::shared(lang)— process-wide per-language analyzers, built oncePreparedFulltextQuery— bucket analyzer choice, query stems, arena term ids, IDF weights, and effective corpus stats computed once per(store_id, epoch, to_t, g_id, p_id, lang_id, query), moka-cached with a thread-local one-slot memo in front.store_idin the key pins prepared term ids to the arena instance they index into. Per-row scoring is adoc_bowprobe plus one u32 binary search per query term. Scores are unchanged (same formula, same term order).Litindexed fast path: resolve the value'sstring_id(persisted dict, then dict novelty) and score from the precomputed arena BoW; only genuinely unindexed values analyze textMeasured (fulltext)
50k configured labels, file-backed, one unindexed commit to force the materialized-row path (the production shape), release build:
Extrapolated to the field's 211k labels, the ~27s query should drop to ~1–1.5s, dominated by the base scan itself. The next lever, if ever needed, is posting-driven retrieval (WAND top-k pushdown, as the graph-source BM25 already does) — a planner feature, deliberately out of scope here.
Vector (flat-rank) scoring
The inline vector functions (
dotProduct,cosineSimilarity,euclideanDistance) had the same class of per-row waste, audited and fixed in two commits:Zero-copy eval wired to SIMD — both arguments were resolved through
ComparableValue, which allocs and copies the full vector per row (including the query vector, which never changes: ~12 KB of memcpy per row on both sides for a 1536-dim embedding). The math used plain iterator closures — cosine made three separate passes — while the runtime-dispatched SIMD kernels ineval::vector_mathhad no production callers at all. Now: arguments resolve to borrowed&[f64]slices straight from the binding or the expression constant, compute routes through the AVX/SSE2/NEON-dispatched kernels, andcosine_f64/f32use three dispatched dot passes above the SIMD length threshold instead of one scalar fused pass.Arc-backed
FlakeValue::Vector— the core type heldVec<f64>, so everyFlakeValueclone through flake → binding → batch → output (and through transact/novelty merges on vector-heavy ledgers) deep-copied the whole embedding. NowArc<[f64]>: clones are refcount bumps everywhere. Serde stays a plain number array via a local adapter (wire form identical, covered by a round-trip test);FlakeValue↔ComparableValuevector conversions become freeArcclones. 21 files, almost entirely mechanical.Zero-copy f32 shard reads for indexed vectors — indexed vector arguments (
EncodedLit VECTOR_ID) reached eval throughdecode_value, which collects a freshVec<f64>+Arcper row even though the arena already hands out a zero-copy, shard-pinned f32 slice. Eval now borrows the packed f32 data via a newBinaryIndexStore::vector_sliceaccessor and widens into a reused thread-local scratch — same widened values, same f64 kernels, identical scores. Arena misses (ephemeral novelty handles) fall back to the decode path.Measured (vector, honest numbers)
50k rows × 256-dim, release, results verified identical against scalar recomputation. Novelty-resident (materialized-row path):
Indexed (vector arena, late-materialized rows), same query and verification:
Novelty-resident gains are modest because that path's floor is per-row operator machinery (scan, binding construction, join, sort, format), not scoring or copies — measured directly by a 1024-dim run where the delta stays ~6%. The Arc change's primary value is the O(1)-clone guarantee for embeddings across transact/novelty/query and removing the foot-gun where any binding clone silently paid O(dims).
Flat-rank top-k fast path — the step-change. Detects the canonical similarity shape (single triple over a vector predicate +
bind ?score dotProduct(?vec, target)+ optional threshold /ORDER BY DESC(?score)/LIMIT, projecting{?s, ?score}) and executes it as a direct scan of the packed f32 shards instead of the operator pipeline: shards pinned once (VectorArenaSnapshot, zero per-row cache probes), rows from the shared overlay-merging cursor (novelty asserts appear, retractions cancel — same lane as the count fast paths), subject-range partitioned onto the global rayon pool with per-partition bounded top-k heaps, deterministic tie-break. Scores widen f32 → f64 through the same SIMD kernel as eval, so results are bit-identical to the pipeline; any uncertainty (non-vector rows, missing arena with novelty, dims mismatch, policy/multi-ledger/time-travel) bails to the generic plan. A dedicated parity test verifies indexed top-k, the overlay lane (novelty assert + retraction via upsert), multi-cardinality subjects, and fallback shapes against scalar ground truth.Notable finding along the way: the indexed pipeline runs the identical flat-rank query 6.5× faster than novelty-resident (21 ms vs 137 ms per 50k rows) — keeping vector corpora indexed matters more than any eval tuning. Novelty-resident (arena-less) ledgers keep the pipeline path unchanged.
Test coverage
The fulltext suite was memory-only. New persisted-backend tests build in one Fluree instance, drop it, and query from a fresh instance over the same file storage (mirroring the split indexing/query deployment):
@enprobesfulltext_perf_50k_labelsandvector_flatrank_perf_50k— ignored perf smoke harnesses, run manually with--ignored --nocapture; the vector harness verifies hit counts, top-5 ranks, and scores against scalar recomputation of the same generated vectors, across both a novelty-resident and a fully-indexed phasevector_topk_fast_path_parity— scalar-ground-truth parity for the top-k fast path across its lanes (indexed base, overlay assert + retraction, multi-cardinality, fallback shapes)FlakeValue::Vectorserde round-trip test (untagged enum +withadapter → plain number array)Validation: full-workspace clippy (all features, all targets) clean; query-lib (1168), core (693), binary-index (331), transact (274), grp_query (303), and flatrank (16) suites pass.