Skip to content

feat: port TextQuery stopword filtering from Python redisvl (#16)#31

Open
ymendez-redis wants to merge 15 commits into
mainfrom
feat/textquery-stopwords-16
Open

feat: port TextQuery stopword filtering from Python redisvl (#16)#31
ymendez-redis wants to merge 15 commits into
mainfrom
feat/textquery-stopwords-16

Conversation

@ymendez-redis
Copy link
Copy Markdown
Collaborator

Closes #16.

Summary

Brings TS TextQuery to Python redisvl parity on stopword behavior:

  • stopwords?: string | ReadonlySet<string> | readonly string[] | null on TextQueryConfig, default 'english'.
  • Token normalization pre-filter pipeline: trim → strip leading/trailing commas → strip U+201C/U+201D → lowercase (matches query.py:1445).
  • Embedded NLTK English list (198 words, BSD-3-Clause Snowball) verbatim from pinned commit 98e1426; raw-bytes SHA-256 recorded for drift detection.
  • Public surface: import { stopwords } from 'redis-vl' and import { english } from 'redis-vl/stopwords' (new subpath export).
  • THIRD_PARTY_NOTICES.md shipped in the npm tarball (BSD-3 condition Fix HASH vector fetch decoding and score normalization #2).

Phase 1 ships English only; the registry pattern under src/utils/stopwords/ is designed so the 32 remaining NLTK languages drop in as a pure-data follow-up PR without touching TextQuery.

Behavior change (acceptable for 0.1.0-beta.0)

Multi-word queries with default settings now drop English stopwords. Single-word queries are unaffected. The old :::note Minimal port callout in the user-guide already flagged this gap.

'hello, world' now renders as @description:(hello | world) instead of @description:(hello\\, | world) — the comma is normalized away pre-escape, mirroring Python.

To opt out: stopwords: null. To override: pass a string[] or Set<string> (entries stored verbatim — pass lowercase to match the lowercased tokens; Python-parity foot-gun documented in JSDoc and the user-guide).

What's covered

  • src/utils/stopwords/{english,registry,resolve,index}.ts — new module
  • src/query/text.ts — pipeline rewrite
  • src/index.tsstopwords + StopwordsInput re-export
  • package.json./stopwords subpath + THIRD_PARTY_NOTICES.md in files
  • 4 new unit tests in tests/unit/utils/stopwords/english.test.ts
  • 14 new unit tests in tests/unit/utils/stopwords/resolve.test.ts (10 base + 4 prototype-key regression)
  • 16 new tests in tests/unit/query/text.test.ts; 1 existing test updated ('hello, world')
  • 1 integration test in tests/integration/query-types.test.ts
  • website/docs/user-guide/filters-and-queries.md — callout, options table, examples

Test counts: 455 unit / 540 total (was 421 / 506).

Out of scope (separate issues)

  • Phase 2: 32 more NLTK languages (mechanical, register + barrel)
  • Per-field / per-token text weights from Python
  • Server-side STOPWORDS index option

Verification

```
npm run lint # clean (2 pre-existing any warnings in huggingface-vectorizer)
npm run type-check # ok
npm run type-check:tests # ok
npm run test:unit # 455/455
npm run test # 540/540 (incl. integration via testcontainers)
cd website && npm run build # ok
```

Manual smoke:

```ts
import { TextQuery, stopwords } from 'redis-vl';
new TextQuery({ text: 'the quick brown fox', textFieldName: 'd' }).buildQuery();
// → '@d:(quick | brown | fox)'
new TextQuery({ text: 'the quick', textFieldName: 'd', stopwords: null }).buildQuery();
// → '@d:(the | quick)'
new TextQuery({ text: 'foo bar quick', textFieldName: 'd', stopwords: new Set([...stopwords.english, 'foo']) }).buildQuery();
// → '@d:(bar | quick)'
```

Test plan

  • CI green on PR
  • Review the Snowball / NLTK attribution in `THIRD_PARTY_NOTICES.md`
  • Confirm `stopwords` namespace shape is what we want for Phase 2
  • Confirm the behavior change on multi-word queries is acceptable pre-1.0

🤖 Generated with Claude Code

yorbigmendez and others added 15 commits May 20, 2026 10:10
Adds the implementation plan and ignores .augment/ per AGENTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tokens are now normalized (trim, strip leading/trailing commas, strip
typographic quotes, lowercase) before stopword filtering and escaping,
matching Python redisvl's _tokenize_and_escape_query (query.py:1445).

Default 'stopwords' is 'english' (the embedded NLTK list); pass null to
disable, or a string[]/Set<string> to override. Custom lists are stored
as-is (Python parity at query.py:1426) — callers must lowercase entries.

Closes #16 (Phase 1 — English only).
Was: 'text yielded no tokens after stopword removal'
Now: 'text yielded no tokens after normalization and stopword filtering'

The previous wording misled when normalization alone (not stopword filtering)
emptied every token — e.g. text=',,,' with stopwords=null.
Adds a TextQuery integration test that confirms the default English
stopword list strips 'for' before the query reaches Redis, producing
a valid @title:(programming) clause that returns the expected document.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The other in-flight PR renames every 'redisvl' to 'redis-vl' across the
docs. Pre-conform our new stopwords additions so the result is consistent
after both merge.
Two related fixes from senior code review.

1. resolveStopwords('__proto__') (or 'toString', 'constructor', …) silently
   returned the matching Object.prototype member instead of throwing. The
   error then surfaced as 'TypeError: stopwordSet.has is not a function'
   at buildQuery() time instead of QueryValidationError at construction.
   Fix: gate the registry lookup with Object.hasOwn.

2. The undefined and 'english' branches returned the LANGUAGE_REGISTRY
   singleton by identity. A user mutating stopwords.english (or the
   readonly Set typed away) poisoned every subsequent default-stopwords
   TextQuery. Python copies on every construction (set(...) at
   query.py:1414/:1426); we now do the same.

Tests: 14/14 pass — added a 4-case parametrized regression test for the
prototype-key edge cases; updated two existing assertions from toBe to
toEqual + not.toBe to verify the new copy semantics.
Python's _set_stopwords is a private method (leading underscore); the TS
counterpart should also stay internal. The barrel now exposes only:
  - english (the data Set)
  - stopwords (the namespace const)
  - StopwordsInput (the public type)

text.ts continues to import resolveStopwords directly from ./resolve.js
for internal use. No public-facing change to typed consumers.
Without this, the test passes whether or not stopword filtering runs —
Redis would match 'Laptop computer for programming' on 'programming'
alone, with or without 'for' in the OR-clause. Adding an assertion on
q.buildQuery() before the search ensures the filtering side of the
contract is exercised by this test.
@banker
Copy link
Copy Markdown
Contributor

banker commented May 22, 2026

Thanks @ymendez-redis — this is a thorough port. NLTK attribution + SHA-256 pin + the StopwordsInput shape all look great. A few asks before merge:

1. Drop the planning note from the diff

docs/superpowers/plans/2026-05-20-textquery-stopwords.md shouldn't ship in the repo per AGENTS.md → Files Never Committed ("Any other ad-hoc markdown files during development"). The .gitignore add for .augment/ suggests you're already on board with the rule — please remove this file (and add docs/superpowers/ to .gitignore if you want to keep planning notes locally).

2. Confirm token normalization is Python-parity, not net-new

The new normalizeToken does trim + strip leading/trailing commas + strip typographic quotes + lowercase. Lowercasing is required for the stopword set lookup to work, but the comma/curly-quote stripping is an extra behavior the previous TextQuery didn't have. Could you point at the equivalent in Python redisvl/query/query.py, or call out that this is an intentional improvement over Python? If it's net-new, I'd lean toward dropping it (or scoping it tighter) to keep the parity story clean — comma/quote handling can be a separate small PR.

3. All-stopwords input behavior

Right now new TextQuery({ text: "the and a" }).buildQuery() throws QueryValidationError("text yielded no tokens after normalization and stopword filtering"). What does Python do in this case — throw, emit an empty query string, or something else? Whichever Python does, please match it (and add a test that pins the chosen behavior).

After those three are addressed I'll do a final pass and merge. Heads up: there's an in-flight refactor in #25 that converts TextQuery from implements BaseQuery to extends BaseQuery — I'll coordinate the merge order so you don't have to rebase against it; this PR goes first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Port TextQuery stopword filtering from Python redisvl

3 participants