feat: port TextQuery stopword filtering from Python redisvl (#16) by ymendez-redis · Pull Request #31 · redis-developer/redis-vl-typescript

ymendez-redis · 2026-05-20T19:29:04Z

Closes #16.

Summary

Brings TS TextQuery to Python redisvl parity on stopword behavior:

stopwords?: string | ReadonlySet<string> | readonly string[] | null on TextQueryConfig, default 'english'.
Token normalization pre-filter pipeline: trim → strip leading/trailing commas → strip U+201C/U+201D → lowercase (matches query.py:1445).
Embedded NLTK English list (198 words, BSD-3-Clause Snowball) verbatim from pinned commit 98e1426; raw-bytes SHA-256 recorded for drift detection.
Public surface: import { stopwords } from 'redis-vl' and import { english } from 'redis-vl/stopwords' (new subpath export).
THIRD_PARTY_NOTICES.md shipped in the npm tarball (BSD-3 condition Fix HASH vector fetch decoding and score normalization #2).

Phase 1 ships English only; the registry pattern under src/utils/stopwords/ is designed so the 32 remaining NLTK languages drop in as a pure-data follow-up PR without touching TextQuery.

Behavior change (acceptable for `0.1.0-beta.0`)

Multi-word queries with default settings now drop English stopwords. Single-word queries are unaffected. The old :::note Minimal port callout in the user-guide already flagged this gap.

'hello, world' now renders as @description:(hello | world) instead of @description:(hello\\, | world) — the comma is normalized away pre-escape, mirroring Python.

To opt out: stopwords: null. To override: pass a string[] or Set<string> (entries stored verbatim — pass lowercase to match the lowercased tokens; Python-parity foot-gun documented in JSDoc and the user-guide).

What's covered

src/utils/stopwords/{english,registry,resolve,index}.ts — new module
src/query/text.ts — pipeline rewrite
src/index.ts — stopwords + StopwordsInput re-export
package.json — ./stopwords subpath + THIRD_PARTY_NOTICES.md in files
4 new unit tests in tests/unit/utils/stopwords/english.test.ts
14 new unit tests in tests/unit/utils/stopwords/resolve.test.ts (10 base + 4 prototype-key regression)
16 new tests in tests/unit/query/text.test.ts; 1 existing test updated ('hello, world')
1 integration test in tests/integration/query-types.test.ts
website/docs/user-guide/filters-and-queries.md — callout, options table, examples

Test counts: 455 unit / 540 total (was 421 / 506).

Out of scope (separate issues)

Phase 2: 32 more NLTK languages (mechanical, register + barrel)
Per-field / per-token text weights from Python
Server-side STOPWORDS index option

Verification

```
npm run lint # clean (2 pre-existing any warnings in huggingface-vectorizer)
npm run type-check # ok
npm run type-check:tests # ok
npm run test:unit # 455/455
npm run test # 540/540 (incl. integration via testcontainers)
cd website && npm run build # ok
```

Manual smoke:

```ts
import { TextQuery, stopwords } from 'redis-vl';
new TextQuery({ text: 'the quick brown fox', textFieldName: 'd' }).buildQuery();
// → '@d:(quick | brown | fox)'
new TextQuery({ text: 'the quick', textFieldName: 'd', stopwords: null }).buildQuery();
// → '@d:(the | quick)'
new TextQuery({ text: 'foo bar quick', textFieldName: 'd', stopwords: new Set([...stopwords.english, 'foo']) }).buildQuery();
// → '@d:(bar | quick)'
```

Test plan

CI green on PR
Review the Snowball / NLTK attribution in `THIRD_PARTY_NOTICES.md`
Confirm `stopwords` namespace shape is what we want for Phase 2
Confirm the behavior change on multi-word queries is acceptable pre-1.0

🤖 Generated with Claude Code

Adds the implementation plan and ignores .augment/ per AGENTS.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…426)

…export

Tokens are now normalized (trim, strip leading/trailing commas, strip typographic quotes, lowercase) before stopword filtering and escaping, matching Python redisvl's _tokenize_and_escape_query (query.py:1445). Default 'stopwords' is 'english' (the embedded NLTK list); pass null to disable, or a string[]/Set<string> to override. Custom lists are stored as-is (Python parity at query.py:1426) — callers must lowercase entries. Closes #16 (Phase 1 — English only).

Was: 'text yielded no tokens after stopword removal' Now: 'text yielded no tokens after normalization and stopword filtering' The previous wording misled when normalization alone (not stopword filtering) emptied every token — e.g. text=',,,' with stopwords=null.

@title

Adds a TextQuery integration test that confirms the default English stopword list strips 'for' before the query reaches Redis, producing a valid @title:(programming) clause that returns the expected document. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The other in-flight PR renames every 'redisvl' to 'redis-vl' across the docs. Pre-conform our new stopwords additions so the result is consistent after both merge.

Two related fixes from senior code review. 1. resolveStopwords('__proto__') (or 'toString', 'constructor', …) silently returned the matching Object.prototype member instead of throwing. The error then surfaced as 'TypeError: stopwordSet.has is not a function' at buildQuery() time instead of QueryValidationError at construction. Fix: gate the registry lookup with Object.hasOwn. 2. The undefined and 'english' branches returned the LANGUAGE_REGISTRY singleton by identity. A user mutating stopwords.english (or the readonly Set typed away) poisoned every subsequent default-stopwords TextQuery. Python copies on every construction (set(...) at query.py:1414/:1426); we now do the same. Tests: 14/14 pass — added a 4-case parametrized regression test for the prototype-key edge cases; updated two existing assertions from toBe to toEqual + not.toBe to verify the new copy semantics.

Python's _set_stopwords is a private method (leading underscore); the TS counterpart should also stay internal. The barrel now exposes only: - english (the data Set) - stopwords (the namespace const) - StopwordsInput (the public type) text.ts continues to import resolveStopwords directly from ./resolve.js for internal use. No public-facing change to typed consumers.

Without this, the test passes whether or not stopword filtering runs — Redis would match 'Laptop computer for programming' on 'programming' alone, with or without 'for' in the OR-clause. Adding an assertion on q.buildQuery() before the search ensures the filtering side of the contract is exercised by this test.

banker · 2026-05-22T20:26:29Z

Thanks @ymendez-redis — this is a thorough port. NLTK attribution + SHA-256 pin + the StopwordsInput shape all look great. A few asks before merge:

1. Drop the planning note from the diff

docs/superpowers/plans/2026-05-20-textquery-stopwords.md shouldn't ship in the repo per AGENTS.md → Files Never Committed ("Any other ad-hoc markdown files during development"). The .gitignore add for .augment/ suggests you're already on board with the rule — please remove this file (and add docs/superpowers/ to .gitignore if you want to keep planning notes locally).

2. Confirm token normalization is Python-parity, not net-new

The new normalizeToken does trim + strip leading/trailing commas + strip typographic quotes + lowercase. Lowercasing is required for the stopword set lookup to work, but the comma/curly-quote stripping is an extra behavior the previous TextQuery didn't have. Could you point at the equivalent in Python redisvl/query/query.py, or call out that this is an intentional improvement over Python? If it's net-new, I'd lean toward dropping it (or scoping it tighter) to keep the parity story clean — comma/quote handling can be a separate small PR.

3. All-stopwords input behavior

Right now new TextQuery({ text: "the and a" }).buildQuery() throws QueryValidationError("text yielded no tokens after normalization and stopword filtering"). What does Python do in this case — throw, emit an empty query string, or something else? Whichever Python does, please match it (and add a test that pins the chosen behavior).

After those three are addressed I'll do a final pass and merge. Heads up: there's an in-flight refactor in #25 that converts TextQuery from implements BaseQuery to extends BaseQuery — I'll coordinate the merge order so you don't have to rebase against it; this PR goes first.

yorbigmendez and others added 15 commits May 20, 2026 10:10

chore: scaffold TextQuery stopwords work (#16)

593dce0

Adds the implementation plan and ignores .augment/ per AGENTS.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs: add THIRD_PARTY_NOTICES.md for BSD-3 Snowball stopword list

d543d6f

feat(stopwords): embed NLTK English list (BSD-3 Snowball, pinned 98e1…

2beb85c

…426)

feat(stopwords): add registry and resolveStopwords helper

e7bb87c

feat(stopwords): add barrel with stopwords namespace + english named …

2a9a6b3

…export

feat(stopwords): re-export stopwords namespace from main entry

34495da

feat(stopwords): add redis-vl/stopwords subpath export + ship notices

85a1eca

docs: document TextQuery stopword filtering

c1c898e

docs: use canonical 'redis-vl' package name in stopwords examples

29095c7

The other in-flight PR renames every 'redisvl' to 'redis-vl' across the docs. Pre-conform our new stopwords additions so the result is consistent after both merge.

ajGingrich mentioned this pull request May 20, 2026

feat(query): port TextQuery per-token and per-field weights (#17) #32

Draft

4 tasks

ymendez-redis marked this pull request as ready for review May 21, 2026 15:43

ymendez-redis requested review from ajGingrich and booleanhunter May 21, 2026 15:43

banker mentioned this pull request May 22, 2026

feat(query): add shared query base classes #25

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: port TextQuery stopword filtering from Python redisvl (#16)#31

feat: port TextQuery stopword filtering from Python redisvl (#16)#31
ymendez-redis wants to merge 15 commits into
mainfrom
feat/textquery-stopwords-16

ymendez-redis commented May 20, 2026

Uh oh!

banker commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ymendez-redis commented May 20, 2026

Summary

Behavior change (acceptable for 0.1.0-beta.0)

What's covered

Out of scope (separate issues)

Verification

Test plan

Uh oh!

banker commented May 22, 2026

1. Drop the planning note from the diff

2. Confirm token normalization is Python-parity, not net-new

3. All-stopwords input behavior

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Behavior change (acceptable for `0.1.0-beta.0`)