Skip to content

feat: add fts support#408

Open
egolearner wants to merge 46 commits into
alibaba:mainfrom
egolearner:feat/fts
Open

feat: add fts support#408
egolearner wants to merge 46 commits into
alibaba:mainfrom
egolearner:feat/fts

Conversation

@egolearner
Copy link
Copy Markdown
Collaborator

address #397

@egolearner egolearner force-pushed the feat/fts branch 6 times, most recently from a2882f1 to 122fb51 Compare May 22, 2026 07:43
@egolearner egolearner requested a review from Cuiyus as a code owner May 22, 2026 07:43
@egolearner egolearner changed the title feat: add fts support in db layer feat: add fts support May 22, 2026

import pytest

from zvec.model.param.query import Fts, Query
Copy link
Copy Markdown
Collaborator

@JalinWang JalinWang May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naming "Fts" is a little bit too generic. Would it be more precise to name it after its underlying dependency, like _FtsQuery (binding) or FtsQueryParam (C++)?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current design is: Query is the top-level request containing multiple retrieval fields (vector, FTS, filter, etc.), and Fts is just one component within it — similar to how you'd have Vector or Filter as sibling fields.

Naming it FtsQuery would imply it's a standalone query type at the same level as Query, which is misleading. If we later split Query into multiple specialized types, FtsQuery would then be the right name for a full FTS query request.

We'll also align the C++ side to drop the "Query" suffix from the param struct for consistency.

What do you think? Any other naming suggestions?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You’re right — FtsQuery doesn’t feel quite right.

That said, Fts alone also feels a bit awkward, since it doesn’t make it clear that this is a component of Query, not a standalone type.

Maybe we should consider a name that better reflects that role, such as FtsQueryParam, FtsClause, FtsCondition, FtsOption, or FtsFilter.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're adding VectorClause and FtsClause in the C++ SDK.
https://github.com/alibaba/zvec/pull/405/changes#diff-3c93a0b897a78e57cd5990f965fd452e6fc52765d8c93366ec47182fab03f2efR63

However, the Python SDK follows a more concise style. Since there's no clause abstraction in other Python scenarios, it would feel inconsistent to introduce FtsClause alone.

@egolearner egolearner force-pushed the feat/fts branch 4 times, most recently from 2120dc8 to ee9bef9 Compare May 27, 2026 09:48
q = Query(
field_name="embedding",
vector=[0.1, 0.2, 0.3],
fts=Fts(match_string="deep learning"),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, would it be better to rename fts= here to something more explicit, like full_text= / full_text_search= / text_search=? It feels a bit clearer and makes the intent more obvious, especially for users who may not immediately recognize fts as full-text search.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the suggestion! I get that full_text= is more explicit. But fts is shorter and well-understood in this space, so I'd lean toward keeping it. Thanks for understanding!

BTW, sqlite simply uses fts5

CREATE VIRTUAL TABLE email USING fts5(sender, title, body);

Comment thread src/db/sqlengine/parser/select_info.h
Comment thread src/db/sqlengine/planner/query_planner.cc
Comment thread src/db/index/column/fts_column/iterator/fts_term_iterator.cc Outdated
Comment thread src/db/index/segment/segment.cc
Comment thread src/db/proto/zvec.proto
Comment thread src/include/zvec/db/index_params.h Outdated
egolearner added 21 commits May 28, 2026 15:46
block_max_info_for() now returns {score, last_doc} in one binary search
(with a small cache), so the standalone current_block_max_score(),
skip_to_next_block(), block_max_score_for() and block_max_last_doc_for()
methods have no live callers. Remove them from BitPackedPostingIterator,
the DocIterator base, and TermDocIterator, along with the now-dead
current_block_max_score_ member and its decode_block assignment. Tests
adjusted to query via block_max_info_for().
When an invert filter is highly selective compared to the FTS posting
size, posting-driven evaluation walks far more docs than necessary.
Mirror the existing vector_recall pattern: when invert match_count is
below fts_brute_force_by_keys_ratio * doc_count, extract the small id
set and AND it into the FTS root via a new CandidateDocIterator. The
candidate iterator becomes the lead by cost, turning the posting walk
into per-candidate advance() + matches() + score() and fully reusing
the existing AND / filter-pushdown / BM25 machinery.

- new CandidateDocIterator: ascending segment-local ids, lower_bound
  advance, zero score contribution
- FtsColumnIndexer::search wraps root_iter in Conjunction when
  FtsQueryParams.candidate_ids is non-empty
- new GlobalConfig::fts_brute_force_by_keys_ratio (default 0.05,
  independent from the vector knob because per-candidate FTS cost is
  higher due to phrase phase-2 IO), wired through C API + Python binding
- DocFilter::get_bf_by_keys_and_update now takes an explicit ratio so
  the two callers (vector vs FTS) pick the right knob; on the brute-
  force branch invert_filter_ is cleared so DocFilter never re-checks
  the same ids
- 9 iterator unit tests + 7 reader equivalence tests (Term / OR / AND /
  Phrase / Nested, coexistence with IndexFilter, empty-candidate
  fallback) + config default / validation asserts
create_term_iterator_from_raw used to open BitPackedPostingIterator twice
on the same buffer: an outer probe to read df/max_score, then a second
open inside TermDocIterator's BitPacked constructor. Drop the probe and
let the constructor read df/max_score from bp_iter_ after its single
open(). On parse failure the iterator now reports cost()==0 (with the
existing LOG_ERROR) and the caller skips it just like an empty term.
- FtsQueryParser now requires a tokenizer pipeline; phrase contents and
  bare terms are tokenized the same way the index was, so CJK queries
  match the doc-side segmentation instead of falling back to ASCII
  whitespace split. Multi-token bare terms compose via default_op.
- JiebaTokenizer.position becomes a per-output sequence number.
  CutForSearch's overlapping sub-words shared a unicode_offset, which
  broke PhraseDocIterator's strict anchor+i adjacency check; sequence
  numbers stay contiguous so doc/query phrases align.
- sqlengine_impl hoists pipeline creation above the query/match-string
  branch and feeds the parser.
Use DictTrie / HMMModel + per-mode segmenter directly so constructing
JiebaTokenizer no longer force-loads idf.utf8 and stop_words.utf8 — files
we never use, that previously forced ~12MB of required files on every
deployment and ~10MB of resident memory at runtime. cppjieba aborts when
those files are missing, which made the dict path a deploy-time foot-gun.

Mode-aware required-field checks: cut_mode=hmm no longer demands
dict_path, cut_mode=full no longer demands model_path. Drops the
"idf_path" / "stop_word_path" config keys that were silently inert
(KeywordExtractor uses them; segmenters don't).
When the analyzer drops every term (all stop-words, pure punctuation, or
all-whitespace match_string), both query_string and match_string paths
used to fail with InvalidArgument. Now they parse to an explicit
EmptyNode that matches zero documents, mirroring Lucene's
MatchNoDocsQuery and aligning the two entry points.

The new node composes naturally with AND (whole conjunction matches
nothing) and OR (child is skipped) via the existing null-iterator
paths in build_iterator/search, so executor changes are limited to
one switch case.
A single post-order pass over the FTS AST collapses sibling duplicates,
flattens same-type composites for better block-max WAND pruning, detects
must vs must_not contradictions, propagates EmptyNode, and folds single
children. Duplicate TermNode/PhraseNode siblings are merged into one
node whose boost is the linear sum, so the post-rewrite per-doc score
equals the pre-rewrite "sum of N independent scorers" output exactly --
zero behavior change, only N posting lookups and N scorings collapsed
to one per unique term.

Flattening uses Lucene-style guards (pure-disjunction OR, no must_not
in AND inner) to avoid altering match sets. Boost is plumbed through
BM25Scorer::score_with_idf and applied to TermDocIterator's max_score
so WAND pivot stays correct.
When an OrNode has any must_not children alongside positives, the
rewriter now lifts them into a wrapping AndNode -- AND(OR(positives),
must_nots...) -- so OrNode never carries must_not children downstream.
Single-positive case skips the intermediate OR entirely, yielding the
flatter AND(positive, must_nots...).

Two follow-on cleanups:

* build_or_iterator drops the disjunction-then-must_not-Conjunction
  wrap and now only handles SHOULD-style positives. A defensive error
  fires if a must_not slips through, surfacing rewriter bypasses
  loudly instead of silently producing wrong scores.

* `apple -apple` style self-contradictions now collapse to EmptyNode
  for free: after canonicalization they become AND(apple, -apple),
  which the existing and_has_mustnot_conflict check rewrites to empty.

FtsColumnIndexer::search() is reached via two paths: sqlengine (always
runs simplify in parse_fts_query) and the indexer unit tests (helper
search_ok now also calls simplify). Both paths therefore deliver a
canonical AST, matching the new build_or_iterator invariant.
… anchor

verify_phrase_positions previously issued one Get per (term, doc_id)
*inside* the anchor loop, re-reading the same non-anchor term up to
|anchor_positions| times. It also allocated a fresh key string per call,
copied values into std::string, and always picked terms_[0] as anchor
even when that term was the most frequent in the doc.

Rewrite the phase-2 check to:
- Dedup repeated terms within the phrase (e.g., "to be or not to be").
- Issue a single MultiGet across the unique terms, using PinnableSlice
  and a single reserved key buffer so Slice pointers stay valid and
  per-call allocations drop to O(1).
- Decode every position list once into a local cache.
- Pick the shortest position list as anchor so the outer loop iterates
  the rarest term, not the phrase-order first term.
- Validate adjacency entirely in memory.

Add append_doc_term_key as an in-place variant of make_doc_term_key for
buffer-reuse callers. Add regression tests for repeated-term phrases
and high-frequency leading terms.
Make jieba FTS work out of the box for users who just `import zvec`
(or the equivalent in other SDKs), without requiring init(), env var,
or per-field path configuration.
Per-segment FTS readers are drained by SegmentNode in LIFO order, so
the BM25 ranking from individual segments was not preserved across the
merged stream. Add a global order_by on the score column for FTS in the
multi-segment branch of make_physical_plan, mirroring the existing
behavior for vector queries. Cover the regression with a new
fts_multi_segment_test that engineers segment stats so the highest-
scoring doc lives in the LIFO-last segment.
FtsIndexParams previously leaked internal concerns into the SDK
public header: create_pipeline(), the PipelinePtr alias, a forward
declaration of fts::TokenizerPipeline, and pipeline cache fields
(once_flag, shared_ptr, bool).

Move the cache into an opaque detail::FtsState (Pimpl-style
unique_ptr) and expose acquisition via internal-only
detail::AcquireFtsPipeline() declared in
src/db/index/common/fts_pipeline.h. The state is reconstructed
lazily on first acquire so a moved-from instance remains usable.

Internal callers (sqlengine_impl, fts_column_indexer, fts_bench)
updated to use the new entry. SDK consumers see FtsIndexParams as
a pure config class with zero pipeline references.
@egolearner egolearner force-pushed the feat/fts branch 2 times, most recently from d15bb50 to c1cdba4 Compare May 28, 2026 16:15
Adapt FTS to the query-model refactor from alibaba#428/alibaba#405 on main:
- VectorQuery is replaced by SearchQuery + QueryTarget
  (variant<VectorClause, FtsClause>)
- Fold FTS validation into SearchQuery::validate_and_sanitize
- c_api / python binding read & write FTS via FtsClause and target_.clause_
- Add QueryTarget::get_fts_clause() helper, replacing raw std::get_if calls
- Migrate FTS tests to the new model; drop the vector/fts mutual-exclusion
  case that no longer applies under the variant
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants