feat: port TextQuery stopword filtering from Python redisvl (#16) #31
ymendez-redis wants to merge 15 commits into
Adds the implementation plan and ignores .augment/ per AGENTS.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tokens are now normalized (trim, strip leading/trailing commas, strip typographic quotes, lowercase) before stopword filtering and escaping, matching Python redisvl's _tokenize_and_escape_query (query.py:1445). Default 'stopwords' is 'english' (the embedded NLTK list); pass null to disable, or a string[]/Set<string> to override. Custom lists are stored as-is (Python parity at query.py:1426) — callers must lowercase entries. Closes #16 (Phase 1 — English only).
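The normalization order described in this commit can be sketched as below. This is a minimal illustration, not the code in src/query/text.ts: the names normalizeToken and sketchTextClause are hypothetical, whitespace tokenization is an assumption, escaping is omitted, and a plain Error stands in for the library's QueryValidationError.

```typescript
// Hypothetical helper: applies the four steps in the order the commit lists.
function normalizeToken(raw: string): string {
  return raw
    .trim()                          // 1. trim surrounding whitespace
    .replace(/^,+|,+$/g, '')         // 2. strip leading/trailing commas
    .replace(/[\u201C\u201D]/g, '')  // 3. strip typographic double quotes
    .toLowerCase();                  // 4. lowercase
}

// Hypothetical helper: normalize, drop empties and stopwords, then OR-join.
// Escaping of special characters is intentionally left out of this sketch.
function sketchTextClause(
  field: string,
  text: string,
  stopwordSet: ReadonlySet<string>,
): string {
  const tokens = text
    .split(/\s+/) // assumption: tokens are whitespace-separated
    .map(normalizeToken)
    .filter((t) => t.length > 0 && !stopwordSet.has(t));
  if (tokens.length === 0) {
    // Stand-in for QueryValidationError.
    throw new Error('text yielded no tokens after normalization and stopword filtering');
  }
  return `@${field}:(${tokens.join(' | ')})`;
}
```

Because normalization runs before filtering, a token such as `The,` is reduced to `the` and can then match a lowercase stopword entry.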
Was: 'text yielded no tokens after stopword removal'
Now: 'text yielded no tokens after normalization and stopword filtering'
The previous wording misled when normalization alone (not stopword filtering) emptied every token, e.g. text=',,,' with stopwords=null.
Adds a TextQuery integration test that confirms the default English stopword list strips 'for' before the query reaches Redis, producing a valid @title:(programming) clause that returns the expected document. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The other in-flight PR renames every 'redisvl' to 'redis-vl' across the docs. Pre-conform our new stopwords additions so the result is consistent after both merge.
Two related fixes from senior code review.
1. resolveStopwords('__proto__') (or 'toString', 'constructor', …) silently
returned the matching Object.prototype member instead of throwing. The
error then surfaced as 'TypeError: stopwordSet.has is not a function'
at buildQuery() time instead of QueryValidationError at construction.
Fix: gate the registry lookup with Object.hasOwn.
2. The undefined and 'english' branches returned the LANGUAGE_REGISTRY
singleton by identity. A user mutating stopwords.english (possible once
the readonly Set typing is cast away) poisoned every subsequent
default-stopwords TextQuery. Python copies on every construction
(set(...) at query.py:1414/:1426); we now do the same.
Tests: 14/14 pass — added a 4-case parametrized regression test for the
prototype-key edge cases; updated two existing assertions from toBe to
toEqual + not.toBe to verify the new copy semantics.
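The two fixes can be illustrated with a self-contained sketch. It is not the code in ./resolve.js: the registry contents are a three-word stand-in for the embedded list, resolveStopwordsSketch is a hypothetical name, a plain Error stands in for QueryValidationError, and returning an empty Set for null is an assumption. The commit uses Object.hasOwn; the sketch uses the equivalent Object.prototype.hasOwnProperty.call so that it also compiles against pre-ES2022 lib targets.

```typescript
// Stand-in registry; the real module embeds the NLTK English list.
const LANGUAGE_REGISTRY: Record<string, ReadonlySet<string>> = {
  english: new Set(['the', 'for', 'a']),
};

type StopwordsInputSketch =
  | string
  | ReadonlySet<string>
  | readonly string[]
  | null
  | undefined;

function resolveStopwordsSketch(input: StopwordsInputSketch): Set<string> {
  if (input === null) return new Set<string>(); // assumption: null disables filtering
  if (input === undefined) return new Set(LANGUAGE_REGISTRY.english); // fresh copy
  if (typeof input === 'string') {
    // Fix 1: only own keys count, so '__proto__' or 'toString' cannot
    // resolve to an Object.prototype member.
    if (!Object.prototype.hasOwnProperty.call(LANGUAGE_REGISTRY, input)) {
      throw new Error(`unknown stopwords language: ${input}`);
    }
    // Fix 2: copy instead of returning the registry singleton by identity.
    return new Set(LANGUAGE_REGISTRY[input]);
  }
  return new Set(input); // custom list or Set, stored as given
}
```

With the copy in place, mutating a resolved set leaves the registry entry untouched, which is what the toEqual + not.toBe assertions check.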
Python's _set_stopwords is a private method (leading underscore); the TS counterpart should also stay internal. The barrel now exposes only:
- english (the data Set)
- stopwords (the namespace const)
- StopwordsInput (the public type)
text.ts continues to import resolveStopwords directly from ./resolve.js for internal use. No public-facing change to typed consumers.
Without this, the test passes whether or not stopword filtering runs — Redis would match 'Laptop computer for programming' on 'programming' alone, with or without 'for' in the OR-clause. Adding an assertion on q.buildQuery() before the search ensures the filtering side of the contract is exercised by this test.
Thanks @ymendez-redis, this is a thorough port. NLTK attribution + SHA-256 pin + the
1. Drop the planning note from the diff
2. Confirm token normalization is Python-parity, not net-new
The new
3. All-stopwords input behavior
Right now
After those three are addressed I'll do a final pass and merge. Heads up: there's an in-flight refactor in #25 that converts
Closes #16.
Summary
Brings TS `TextQuery` to Python redisvl parity on stopword behavior:
- `stopwords?: string | ReadonlySet<string> | readonly string[] | null` on `TextQueryConfig`, default `'english'`.
- `trim → strip leading/trailing commas → strip U+201C/U+201D → lowercase` (matches `query.py:1445`).
- `98e1426`; raw-bytes SHA-256 recorded for drift detection.
- `import { stopwords } from 'redis-vl'` and `import { english } from 'redis-vl/stopwords'` (new subpath export).
- `THIRD_PARTY_NOTICES.md` shipped in the npm tarball (BSD-3 condition #2).

Phase 1 ships English only; the registry pattern under `src/utils/stopwords/` is designed so the 32 remaining NLTK languages drop in as a pure-data follow-up PR without touching `TextQuery`.

Behavior change (acceptable for `0.1.0-beta.0`)
- Multi-word queries with default settings now drop English stopwords. Single-word queries are unaffected. The old `:::note Minimal port` callout in the user-guide already flagged this gap.
- `'hello, world'` now renders as `@description:(hello | world)` instead of `@description:(hello\\, | world)`: the comma is normalized away pre-escape, mirroring Python.

To opt out: `stopwords: null`. To override: pass a `string[]` or `Set<string>` (entries stored verbatim, so pass lowercase to match the lowercased tokens; Python-parity foot-gun documented in JSDoc and the user-guide).

What's covered
- `src/utils/stopwords/{english,registry,resolve,index}.ts`: new module
- `src/query/text.ts`: pipeline rewrite
- `src/index.ts`: `stopwords` + `StopwordsInput` re-export
- `package.json`: `./stopwords` subpath + `THIRD_PARTY_NOTICES.md` in `files`
- `tests/unit/utils/stopwords/english.test.ts`
- `tests/unit/utils/stopwords/resolve.test.ts` (10 base + 4 prototype-key regression)
- `tests/unit/query/text.test.ts`; 1 existing test updated (`'hello, world'`)
- `tests/integration/query-types.test.ts`
- `website/docs/user-guide/filters-and-queries.md`: callout, options table, examples

Test counts: 455 unit / 540 total (was 421 / 506).

Out of scope (separate issues)
- `STOPWORDS` index option

Verification
```
npm run lint # clean (2 pre-existing any warnings in huggingface-vectorizer)
npm run type-check # ok
npm run type-check:tests # ok
npm run test:unit # 455/455
npm run test # 540/540 (incl. integration via testcontainers)
cd website && npm run build # ok
```
Manual smoke:
```ts
import { TextQuery, stopwords } from 'redis-vl';
new TextQuery({ text: 'the quick brown fox', textFieldName: 'd' }).buildQuery();
// → '@d:(quick | brown | fox)'
new TextQuery({ text: 'the quick', textFieldName: 'd', stopwords: null }).buildQuery();
// → '@d:(the | quick)'
new TextQuery({ text: 'foo bar quick', textFieldName: 'd', stopwords: new Set([...stopwords.english, 'foo']) }).buildQuery();
// → '@d:(bar | quick)'
```
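Since custom stopword entries are stored verbatim while tokens are lowercased before filtering, a caller may want to lowercase a custom list up front. The helper below is a hypothetical caller-side sketch (toLowercaseStopwords is not part of redis-vl); it only shows why a mixed-case entry would never match.

```typescript
// Hypothetical caller-side helper: lowercases every entry so the custom list
// can match tokens that were lowercased during normalization.
function toLowercaseStopwords(entries: Iterable<string>): Set<string> {
  return new Set(Array.from(entries, (entry) => entry.toLowerCase()));
}

// A verbatim mixed-case entry does not match a lowercased token...
const verbatim = new Set(['Foo']);
const missesToken = !verbatim.has('foo');

// ...while the lowercased copy does.
const lowered = toLowercaseStopwords(['Foo', 'BAR']);
const matchesToken = lowered.has('foo') && lowered.has('bar');
```

The resulting Set could then be passed as the `stopwords` option in the same way as the `new Set([...stopwords.english, 'foo'])` example above.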
Test plan
🤖 Generated with Claude Code