feat: add fts support#408
Conversation
a2882f1 to
122fb51
Compare
|
|
||
| import pytest | ||
|
|
||
| from zvec.model.param.query import Fts, Query |
There was a problem hiding this comment.
This naming "Fts" is a little bit too generic. Would it be more precise to name it after its underlying dependency, like _FtsQuery (binding) or FtsQueryParam (C++)?
There was a problem hiding this comment.
The current design is: Query is the top-level request containing multiple retrieval fields (vector, FTS, filter, etc.), and Fts is just one component within it — similar to how you'd have Vector or Filter as sibling fields.
Naming it FtsQuery would imply it's a standalone query type at the same level as Query, which is misleading. If we later split Query into multiple specialized types, FtsQuery would then be the right name for a full FTS query request.
We'll also align the C++ side to drop the "Query" suffix from the param struct for consistency.
What do you think? Any other naming suggestions?
There was a problem hiding this comment.
You’re right — FtsQuery doesn’t feel quite right.
That said, Fts alone also feels a bit awkward, since it doesn’t make it clear that this is a component of Query, not a standalone type.
Maybe we should consider a name that better reflects that role, such as FtsQueryParam, FtsClause, FtsCondition, FtsOption, or FtsFilter.
There was a problem hiding this comment.
We're adding VectorClause and FtsClause in the C++ SDK.
https://github.com/alibaba/zvec/pull/405/changes#diff-3c93a0b897a78e57cd5990f965fd452e6fc52765d8c93366ec47182fab03f2efR63
However, the Python SDK follows a more concise style. Since there's no clause abstraction in other Python scenarios, it would feel inconsistent to introduce FtsClause alone.
2120dc8 to
ee9bef9
Compare
| q = Query( | ||
| field_name="embedding", | ||
| vector=[0.1, 0.2, 0.3], | ||
| fts=Fts(match_string="deep learning"), |
There was a problem hiding this comment.
Also, would it be better to rename fts= here to something more explicit, like full_text= / full_text_search= / text_search=? It feels a bit clearer and makes the intent more obvious, especially for users who may not immediately recognize fts as full-text search.
There was a problem hiding this comment.
Appreciate the suggestion! I get that full_text= is more explicit. But fts is shorter and well-understood in this space, so I'd lean toward keeping it. Thanks for understanding!
BTW, sqlite simply uses fts5
CREATE VIRTUAL TABLE email USING fts5(sender, title, body);block_max_info_for() now returns {score, last_doc} in one binary search
(with a small cache), so the standalone current_block_max_score(),
skip_to_next_block(), block_max_score_for() and block_max_last_doc_for()
methods have no live callers. Remove them from BitPackedPostingIterator,
the DocIterator base, and TermDocIterator, along with the now-dead
current_block_max_score_ member and its decode_block assignment. Tests
adjusted to query via block_max_info_for().
When an invert filter is highly selective compared to the FTS posting size, posting-driven evaluation walks far more docs than necessary. Mirror the existing vector_recall pattern: when invert match_count is below fts_brute_force_by_keys_ratio * doc_count, extract the small id set and AND it into the FTS root via a new CandidateDocIterator. The candidate iterator becomes the lead by cost, turning the posting walk into per-candidate advance() + matches() + score() and fully reusing the existing AND / filter-pushdown / BM25 machinery. - new CandidateDocIterator: ascending segment-local ids, lower_bound advance, zero score contribution - FtsColumnIndexer::search wraps root_iter in Conjunction when FtsQueryParams.candidate_ids is non-empty - new GlobalConfig::fts_brute_force_by_keys_ratio (default 0.05, independent from the vector knob because per-candidate FTS cost is higher due to phrase phase-2 IO), wired through C API + Python binding - DocFilter::get_bf_by_keys_and_update now takes an explicit ratio so the two callers (vector vs FTS) pick the right knob; on the brute- force branch invert_filter_ is cleared so DocFilter never re-checks the same ids - 9 iterator unit tests + 7 reader equivalence tests (Term / OR / AND / Phrase / Nested, coexistence with IndexFilter, empty-candidate fallback) + config default / validation asserts
create_term_iterator_from_raw used to open BitPackedPostingIterator twice on the same buffer: an outer probe to read df/max_score, then a second open inside TermDocIterator's BitPacked constructor. Drop the probe and let the constructor read df/max_score from bp_iter_ after its single open(). On parse failure the iterator now reports cost()==0 (with the existing LOG_ERROR) and the caller skips it just like an empty term.
- FtsQueryParser now requires a tokenizer pipeline; phrase contents and bare terms are tokenized the same way the index was, so CJK queries match the doc-side segmentation instead of falling back to ASCII whitespace split. Multi-token bare terms compose via default_op. - JiebaTokenizer.position becomes a per-output sequence number. CutForSearch's overlapping sub-words shared a unicode_offset, which broke PhraseDocIterator's strict anchor+i adjacency check; sequence numbers stay contiguous so doc/query phrases align. - sqlengine_impl hoists pipeline creation above the query/match-string branch and feeds the parser.
Use DictTrie / HMMModel + per-mode segmenter directly so constructing JiebaTokenizer no longer force-loads idf.utf8 and stop_words.utf8 — files we never use, that previously forced ~12MB of required files on every deployment and ~10MB of resident memory at runtime. cppjieba aborts when those files are missing, which made the dict path a deploy-time foot-gun. Mode-aware required-field checks: cut_mode=hmm no longer demands dict_path, cut_mode=full no longer demands model_path. Drops the "idf_path" / "stop_word_path" config keys that were silently inert (KeywordExtractor uses them; segmenters don't).
When the analyzer drops every term (all stop-words, pure punctuation, or all-whitespace match_string), both query_string and match_string paths used to fail with InvalidArgument. Now they parse to an explicit EmptyNode that matches zero documents, mirroring Lucene's MatchNoDocsQuery and aligning the two entry points. The new node composes naturally with AND (whole conjunction matches nothing) and OR (child is skipped) via the existing null-iterator paths in build_iterator/search, so executor changes are limited to one switch case.
A single post-order pass over the FTS AST collapses sibling duplicates, flattens same-type composites for better block-max WAND pruning, detects must vs must_not contradictions, propagates EmptyNode, and folds single children. Duplicate TermNode/PhraseNode siblings are merged into one node whose boost is the linear sum, so the post-rewrite per-doc score equals the pre-rewrite "sum of N independent scorers" output exactly -- zero behavior change, only N posting lookups and N scorings collapsed to one per unique term. Flattening uses Lucene-style guards (pure-disjunction OR, no must_not in AND inner) to avoid altering match sets. Boost is plumbed through BM25Scorer::score_with_idf and applied to TermDocIterator's max_score so WAND pivot stays correct.
When an OrNode has any must_not children alongside positives, the rewriter now lifts them into a wrapping AndNode -- AND(OR(positives), must_nots...) -- so OrNode never carries must_not children downstream. Single-positive case skips the intermediate OR entirely, yielding the flatter AND(positive, must_nots...). Two follow-on cleanups: * build_or_iterator drops the disjunction-then-must_not-Conjunction wrap and now only handles SHOULD-style positives. A defensive error fires if a must_not slips through, surfacing rewriter bypasses loudly instead of silently producing wrong scores. * `apple -apple` style self-contradictions now collapse to EmptyNode for free: after canonicalization they become AND(apple, -apple), which the existing and_has_mustnot_conflict check rewrites to empty. FtsColumnIndexer::search() is reached via two paths: sqlengine (always runs simplify in parse_fts_query) and the indexer unit tests (helper search_ok now also calls simplify). Both paths therefore deliver a canonical AST, matching the new build_or_iterator invariant.
… anchor verify_phrase_positions previously issued one Get per (term, doc_id) *inside* the anchor loop, re-reading the same non-anchor term up to |anchor_positions| times. It also allocated a fresh key string per call, copied values into std::string, and always picked terms_[0] as anchor even when that term was the most frequent in the doc. Rewrite the phase-2 check to: - Dedup repeated terms within the phrase (e.g., "to be or not to be"). - Issue a single MultiGet across the unique terms, using PinnableSlice and a single reserved key buffer so Slice pointers stay valid and per-call allocations drop to O(1). - Decode every position list once into a local cache. - Pick the shortest position list as anchor so the outer loop iterates the rarest term, not the phrase-order first term. - Validate adjacency entirely in memory. Add append_doc_term_key as an in-place variant of make_doc_term_key for buffer-reuse callers. Add regression tests for repeated-term phrases and high-frequency leading terms.
Make jieba FTS work out of the box for users who just `import zvec` (or the equivalent in other SDKs), without requiring init(), env var, or per-field path configuration.
Per-segment FTS readers are drained by SegmentNode in LIFO order, so the BM25 ranking from individual segments was not preserved across the merged stream. Add a global order_by on the score column for FTS in the multi-segment branch of make_physical_plan, mirroring the existing behavior for vector queries. Cover the regression with a new fts_multi_segment_test that engineers segment stats so the highest- scoring doc lives in the LIFO-last segment.
FtsIndexParams previously leaked internal concerns into the SDK public header: create_pipeline(), the PipelinePtr alias, a forward declaration of fts::TokenizerPipeline, and pipeline cache fields (once_flag, shared_ptr, bool). Move the cache into an opaque detail::FtsState (Pimpl-style unique_ptr) and expose acquisition via internal-only detail::AcquireFtsPipeline() declared in src/db/index/common/fts_pipeline.h. The state is reconstructed lazily on first acquire so a moved-from instance remains usable. Internal callers (sqlengine_impl, fts_column_indexer, fts_bench) updated to use the new entry. SDK consumers see FtsIndexParams as a pure config class with zero pipeline references.
d15bb50 to
c1cdba4
Compare
Adapt FTS to the query-model refactor from alibaba#428/alibaba#405 on main: - VectorQuery is replaced by SearchQuery + QueryTarget (variant<VectorClause, FtsClause>) - Fold FTS validation into SearchQuery::validate_and_sanitize - c_api / python binding read & write FTS via FtsClause and target_.clause_ - Add QueryTarget::get_fts_clause() helper, replacing raw std::get_if calls - Migrate FTS tests to the new model; drop the vector/fts mutual-exclusion case that no longer applies under the variant
address #397