Skip to content

Default n-gram tokenizer keeps JSON delimiters " : { }#4

Merged
platypii merged 1 commit into
masterfrom
ngram-structural
Jun 29, 2026
Merged

Default n-gram tokenizer keeps JSON delimiters " : { }#4
platypii merged 1 commit into
masterfrom
ngram-structural

Conversation

@platypii

Copy link
Copy Markdown
Contributor

Structured and JSON greps like "role": were un-prunable: the tokenizer dropped all punctuation, so role and system matched as bare words in every block and the index could not narrow the candidate set at all.

This makes the default tokenizer keep the JSON delimiters " : { } inside n-gram runs, so n-grams span them and stay selective on structured text.

Results (WildChat, 3.2M LLM logs, full 13 GB source)

Index grows just 1.04x (1.20 to 1.24 GB), while JSON greps see order-of-magnitude wins. Code, URL, and prose greps are unchanged, with no regressions.

End-to-end to the first 100 results, alphanumeric vs structural:

query alphanumeric structural speedup
"role": 14.7 s / 678 MB 0.35 s / 20 MB 42x
"type": "object" 79.0 s / 3766 MB 2.2 s / 176 MB 35x
"temperature": 0 45.8 s / 2596 MB 1.2 s / 81 MB 38x
"content": 11.0 s / 514 MB 0.9 s / 56 MB 12x
{"name": 5.6 s / 281 MB 0.5 s / 32 MB 11x
system.out.println 0.46 s 0.48 s tie
self.assertequal 1.76 s 1.71 s tie
schopenhauer 0.87 s 0.85 s tie

We picked the kept-character set empirically: keeping only " : { } recovers essentially all the selectivity of keeping every punctuation character, but at 1.04x the index size instead of 2.6x, and without the one query regression that the keep-everything approach showed.

Backward compatibility

The kept-character set is recorded per index in kv metadata (hypgrep.ngram_chars) and read back at query time, so query and index always tokenize identically. An index built before this change has no such key and is read as the empty set (plain alphanumeric), so existing indexes keep working unchanged. No index format-version bump.

Tests

New unit and round-trip tests cover the structural default, the recorded character set, the regex-escaping of kept characters, and the pre-structural compatibility path. Full suite passes (95 tests), lint clean.

Structured/JSON greps like "role": were un-prunable because the tokenizer
dropped all punctuation, so "role" and "system" matched as bare words in
every block. Keep the JSON delimiters " : { } inside n-gram runs so n-grams
span them and stay selective.

Measured on WildChat (3.2M LLM logs, full 13 GB source): index grows just
1.04x (1.20 to 1.24 GB), versus 2.6x for keeping all punctuation, while JSON
greps see order-of-magnitude wins. End-to-end to the first 100 results,
"role": drops from 14.7 s / 678 MB scanned to 0.35 s / 20 MB (42x), and
"type": "object" from 79 s to 2.2 s (35x). Code, URL, and prose greps are
unchanged, with no regressions.

The kept-character set is recorded per index in kv metadata
(hypgrep.ngram_chars) and read back at query time so query and index tokenize
identically. An index built before this change has no such key and is read as
the empty set (plain alphanumeric), so it keeps working unchanged, with no
index format-version bump.
@platypii

Copy link
Copy Markdown
Contributor Author

Code Review: structural (JSON-delimiter) default tokenizer

Solid, well-scoped change with good test coverage and careful backward-compat reasoning. npm test passes (95/95). npm run lint reports errors only in untracked data/*.mjs files unrelated to this PR; the changed source is clean.

Correctness

praise: the regex character-class escaping is correct and complete (src/ngrams.js). The escape set /[\]\\^-]/ covers exactly the four metacharacters special inside a JS regex character class: ], \, ^, -. [ is a literal inside a class in JS, so deliberately not escaping it is right. This is the part most likely to harbor a bug, and it's right.

praise: backward AND forward compatible without a version bump, and that's justified. A pre-structural index (no hypgrep.ngram_chars) reads as '' -> plain alphanumeric, matching how it was built. The reverse also holds: a structural run is a superset that contains every plain alphanumeric run as a contiguous substring, so an old client reading a new index over-matches candidate blocks and the per-row substring filter still returns correct results. Both directions stay correct.

praise: query/index tokenizer agreement is enforced by construction. createIndex validates and records ngramChars; queryIndex reads it back and threads it into queryNgrams/literalsToNgrams. All internal callers pass chars explicitly.

Metadata round-trips, including the empty-string case (the IDX_PLAIN test confirms '' survives the kv write/read).

Nits

  • The exported extractNgrams/queryNgrams/literalsToNgrams default param changed to defaultNgramChars (non-empty). Internal callers all pass it explicitly so the library is self-consistent, but any external caller invoking them without a chars arg silently switches tokenization. Worth a CHANGELOG line.
  • Escaping test coverage is partial (test/ngrams.test.js). Tests cover ] and -; the \ and ^ branches aren't exercised. A one-line assertion (e.g. extractNgrams('a^b', 3, '^')) would lock it in.
  • README scope creep. The diff adds badges, a benchmark table, and a package.json description, all unrelated to the tokenizer change. Not wrong, but it muddies the PR.

Validation note

createIndex rejects alphanumerics and whitespace via /[a-z0-9\s]/i. Reasonable: whitespace is intentionally excluded (it would explode index size) and the i flag catches uppercase. Duplicate chars in a class are harmless, so no dedup needed.

Overall: ship it. Optionally add a CHANGELOG note on the default-param behavior change.

@platypii platypii merged commit 6dbe916 into master Jun 29, 2026
6 checks passed
@platypii platypii deleted the ngram-structural branch June 29, 2026 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant