feat: add fts support by egolearner · Pull Request #408 · alibaba/zvec

egolearner · 2026-05-15T09:17:45Z

address #397

JalinWang · 2026-05-22T10:29:26Z

+
+import pytest
+
+from zvec.model.param.query import Fts, Query


This naming "Fts" is a little bit too generic. Would it be more precise to name it after its underlying dependency, like _FtsQuery (binding) or FtsQueryParam (C++)?

The current design is: Query is the top-level request containing multiple retrieval fields (vector, FTS, filter, etc.), and Fts is just one component within it — similar to how you'd have Vector or Filter as sibling fields.

Naming it FtsQuery would imply it's a standalone query type at the same level as Query, which is misleading. If we later split Query into multiple specialized types, FtsQuery would then be the right name for a full FTS query request.

We'll also align the C++ side to drop the "Query" suffix from the param struct for consistency.

What do you think? Any other naming suggestions?

You’re right — FtsQuery doesn’t feel quite right.

That said, Fts alone also feels a bit awkward, since it doesn’t make it clear that this is a component of Query, not a standalone type.

Maybe we should consider a name that better reflects that role, such as FtsQueryParam, FtsClause, FtsCondition, FtsOption, or FtsFilter.

We're adding VectorClause and FtsClause in the C++ SDK.
https://github.com/alibaba/zvec/pull/405/changes#diff-3c93a0b897a78e57cd5990f965fd452e6fc52765d8c93366ec47182fab03f2efR63

However, the Python SDK follows a more concise style. Since there's no clause abstraction in other Python scenarios, it would feel inconsistent to introduce FtsClause alone.

JalinWang · 2026-05-27T09:53:36Z

+        q = Query(
+            field_name="embedding",
+            vector=[0.1, 0.2, 0.3],
+            fts=Fts(match_string="deep learning"),


Also, would it be better to rename fts= here to something more explicit, like full_text= / full_text_search= / text_search=? It feels a bit clearer and makes the intent more obvious, especially for users who may not immediately recognize fts as full-text search.

Appreciate the suggestion! I get that full_text= is more explicit. But fts is shorter and well-understood in this space, so I'd lean toward keeping it. Thanks for understanding!

BTW, sqlite simply uses fts5

CREATE VIRTUAL TABLE email USING fts5(sender, title, body);

… subdirs

block_max_info_for() now returns {score, last_doc} in one binary search (with a small cache), so the standalone current_block_max_score(), skip_to_next_block(), block_max_score_for() and block_max_last_doc_for() methods have no live callers. Remove them from BitPackedPostingIterator, the DocIterator base, and TermDocIterator, along with the now-dead current_block_max_score_ member and its decode_block assignment. Tests adjusted to query via block_max_info_for().

When an invert filter is highly selective compared to the FTS posting size, posting-driven evaluation walks far more docs than necessary. Mirror the existing vector_recall pattern: when invert match_count is below fts_brute_force_by_keys_ratio * doc_count, extract the small id set and AND it into the FTS root via a new CandidateDocIterator. The candidate iterator becomes the lead by cost, turning the posting walk into per-candidate advance() + matches() + score() and fully reusing the existing AND / filter-pushdown / BM25 machinery. - new CandidateDocIterator: ascending segment-local ids, lower_bound advance, zero score contribution - FtsColumnIndexer::search wraps root_iter in Conjunction when FtsQueryParams.candidate_ids is non-empty - new GlobalConfig::fts_brute_force_by_keys_ratio (default 0.05, independent from the vector knob because per-candidate FTS cost is higher due to phrase phase-2 IO), wired through C API + Python binding - DocFilter::get_bf_by_keys_and_update now takes an explicit ratio so the two callers (vector vs FTS) pick the right knob; on the brute- force branch invert_filter_ is cleared so DocFilter never re-checks the same ids - 9 iterator unit tests + 7 reader equivalence tests (Term / OR / AND / Phrase / Nested, coexistence with IndexFilter, empty-candidate fallback) + config default / validation asserts

create_term_iterator_from_raw used to open BitPackedPostingIterator twice on the same buffer: an outer probe to read df/max_score, then a second open inside TermDocIterator's BitPacked constructor. Drop the probe and let the constructor read df/max_score from bp_iter_ after its single open(). On parse failure the iterator now reports cost()==0 (with the existing LOG_ERROR) and the caller skips it just like an empty term.

- FtsQueryParser now requires a tokenizer pipeline; phrase contents and bare terms are tokenized the same way the index was, so CJK queries match the doc-side segmentation instead of falling back to ASCII whitespace split. Multi-token bare terms compose via default_op. - JiebaTokenizer.position becomes a per-output sequence number. CutForSearch's overlapping sub-words shared a unicode_offset, which broke PhraseDocIterator's strict anchor+i adjacency check; sequence numbers stay contiguous so doc/query phrases align. - sqlengine_impl hoists pipeline creation above the query/match-string branch and feeds the parser.

Use DictTrie / HMMModel + per-mode segmenter directly so constructing JiebaTokenizer no longer force-loads idf.utf8 and stop_words.utf8 — files we never use, that previously forced ~12MB of required files on every deployment and ~10MB of resident memory at runtime. cppjieba aborts when those files are missing, which made the dict path a deploy-time foot-gun. Mode-aware required-field checks: cut_mode=hmm no longer demands dict_path, cut_mode=full no longer demands model_path. Drops the "idf_path" / "stop_word_path" config keys that were silently inert (KeywordExtractor uses them; segmenters don't).

When the analyzer drops every term (all stop-words, pure punctuation, or all-whitespace match_string), both query_string and match_string paths used to fail with InvalidArgument. Now they parse to an explicit EmptyNode that matches zero documents, mirroring Lucene's MatchNoDocsQuery and aligning the two entry points. The new node composes naturally with AND (whole conjunction matches nothing) and OR (child is skipped) via the existing null-iterator paths in build_iterator/search, so executor changes are limited to one switch case.

A single post-order pass over the FTS AST collapses sibling duplicates, flattens same-type composites for better block-max WAND pruning, detects must vs must_not contradictions, propagates EmptyNode, and folds single children. Duplicate TermNode/PhraseNode siblings are merged into one node whose boost is the linear sum, so the post-rewrite per-doc score equals the pre-rewrite "sum of N independent scorers" output exactly -- zero behavior change, only N posting lookups and N scorings collapsed to one per unique term. Flattening uses Lucene-style guards (pure-disjunction OR, no must_not in AND inner) to avoid altering match sets. Boost is plumbed through BM25Scorer::score_with_idf and applied to TermDocIterator's max_score so WAND pivot stays correct.

When an OrNode has any must_not children alongside positives, the rewriter now lifts them into a wrapping AndNode -- AND(OR(positives), must_nots...) -- so OrNode never carries must_not children downstream. Single-positive case skips the intermediate OR entirely, yielding the flatter AND(positive, must_nots...). Two follow-on cleanups: * build_or_iterator drops the disjunction-then-must_not-Conjunction wrap and now only handles SHOULD-style positives. A defensive error fires if a must_not slips through, surfacing rewriter bypasses loudly instead of silently producing wrong scores. * `apple -apple` style self-contradictions now collapse to EmptyNode for free: after canonicalization they become AND(apple, -apple), which the existing and_has_mustnot_conflict check rewrites to empty. FtsColumnIndexer::search() is reached via two paths: sqlengine (always runs simplify in parse_fts_query) and the indexer unit tests (helper search_ok now also calls simplify). Both paths therefore deliver a canonical AST, matching the new build_or_iterator invariant.

… anchor verify_phrase_positions previously issued one Get per (term, doc_id) *inside* the anchor loop, re-reading the same non-anchor term up to |anchor_positions| times. It also allocated a fresh key string per call, copied values into std::string, and always picked terms_[0] as anchor even when that term was the most frequent in the doc. Rewrite the phase-2 check to: - Dedup repeated terms within the phrase (e.g., "to be or not to be"). - Issue a single MultiGet across the unique terms, using PinnableSlice and a single reserved key buffer so Slice pointers stay valid and per-call allocations drop to O(1). - Decode every position list once into a local cache. - Pick the shortest position list as anchor so the outer loop iterates the rarest term, not the phrase-order first term. - Validate adjacency entirely in memory. Add append_doc_term_key as an in-place variant of make_doc_term_key for buffer-reuse callers. Add regression tests for repeated-term phrases and high-frequency leading terms.

Make jieba FTS work out of the box for users who just `import zvec` (or the equivalent in other SDKs), without requiring init(), env var, or per-field path configuration.

Per-segment FTS readers are drained by SegmentNode in LIFO order, so the BM25 ranking from individual segments was not preserved across the merged stream. Add a global order_by on the score column for FTS in the multi-segment branch of make_physical_plan, mirroring the existing behavior for vector queries. Cover the regression with a new fts_multi_segment_test that engineers segment stats so the highest- scoring doc lives in the LIFO-last segment.

FtsIndexParams previously leaked internal concerns into the SDK public header: create_pipeline(), the PipelinePtr alias, a forward declaration of fts::TokenizerPipeline, and pipeline cache fields (once_flag, shared_ptr, bool). Move the cache into an opaque detail::FtsState (Pimpl-style unique_ptr) and expose acquisition via internal-only detail::AcquireFtsPipeline() declared in src/db/index/common/fts_pipeline.h. The state is reconstructed lazily on first acquire so a moved-from instance remains usable. Internal callers (sqlengine_impl, fts_column_indexer, fts_bench) updated to use the new entry. SDK consumers see FtsIndexParams as a pure config class with zero pipeline references.

fix windows ci

Adapt FTS to the query-model refactor from alibaba#428/alibaba#405 on main: - VectorQuery is replaced by SearchQuery + QueryTarget (variant<VectorClause, FtsClause>) - Fold FTS validation into SearchQuery::validate_and_sanitize - c_api / python binding read & write FTS via FtsClause and target_.clause_ - Add QueryTarget::get_fts_clause() helper, replacing raw std::get_if calls - Migrate FTS tests to the new model; drop the vector/fts mutual-exclusion case that no longer applies under the variant

egolearner requested review from JalinWang, chinaux, feihongxu0824 and zhourrr as code owners May 15, 2026 09:17

github-actions Bot assigned egolearner May 15, 2026

egolearner force-pushed the feat/fts branch 6 times, most recently from a2882f1 to 122fb51 Compare May 22, 2026 07:43

egolearner requested a review from Cuiyus as a code owner May 22, 2026 07:43

egolearner changed the title ~~feat: add fts support in db layer~~ feat: add fts support May 22, 2026

JalinWang reviewed May 22, 2026

View reviewed changes

egolearner force-pushed the feat/fts branch 4 times, most recently from 2120dc8 to ee9bef9 Compare May 27, 2026 09:48

JalinWang reviewed May 27, 2026

View reviewed changes

feihongxu0824 reviewed May 28, 2026

View reviewed changes

egolearner added 10 commits May 28, 2026 15:46

feat: add fts support

9a29685

fix mac compile & ci

494aea5

refactor parse fts & add fts debug text

d9825a0

fix some problems

0d66e82

refactor(fts_column): reorganize into tokenizer/, posting/, iterator/…

40458a2

… subdirs

perf: or use multi_get

e27c0b3

perf: optimize disjunction iterator

32a8729

perf: fts use hashskiplist

150a0b4

refactor batch_get_postings

68b7da6

perf: optimize iterator virtual function

b498b3b

egolearner added 21 commits May 28, 2026 15:46

PartialMerge no optimize

09b2ba2

fix fts score

568bb46

python binding support fts

9e45a8a

feat: wire FTS reduce into Optimize compaction

4e16e97

feat: allow schema without vector fields

9371427

rename cpp FtsQuery to Fts and fix compile

6aa8c45

add c binding

12691b5

fix some problems

868433b

feat: auto-register bundled jieba dict on SDK import

c9f6aa3

Make jieba FTS work out of the box for users who just `import zvec` (or the equivalent in other SDKs), without requiring init(), env var, or per-field path configuration.

fix boost

166c8a0

egolearner force-pushed the feat/fts branch 2 times, most recently from d15bb50 to c1cdba4 Compare May 28, 2026 16:15

egolearner added 2 commits May 29, 2026 09:48

fixup! feat: auto-register bundled jieba dict on SDK import

4a0d164

fix windows ci

Create/Drop Index check supported index type

5c7de5c

egolearner force-pushed the feat/fts branch from c1cdba4 to 5c7de5c Compare May 29, 2026 06:54

egolearner added 4 commits May 29, 2026 18:38

fts field not allowed in filter

866ecfb

fix some invert check

0e840cf

fix bench

6f26688

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add fts support#408

feat: add fts support#408
egolearner wants to merge 46 commits into
alibaba:mainfrom
egolearner:feat/fts

egolearner commented May 15, 2026

Uh oh!

JalinWang May 22, 2026 •

edited

Loading

Uh oh!

egolearner May 25, 2026

Uh oh!

JalinWang May 27, 2026

Uh oh!

egolearner May 28, 2026

Uh oh!

JalinWang May 27, 2026

Uh oh!

egolearner May 28, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

egolearner commented May 15, 2026

Uh oh!

JalinWang May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

egolearner May 25, 2026

Choose a reason for hiding this comment

Uh oh!

JalinWang May 27, 2026

Choose a reason for hiding this comment

Uh oh!

egolearner May 28, 2026

Choose a reason for hiding this comment

Uh oh!

JalinWang May 27, 2026

Choose a reason for hiding this comment

Uh oh!

egolearner May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

JalinWang May 22, 2026 •

edited

Loading