Default n-gram tokenizer keeps JSON delimiters " : { } by platypii · Pull Request #4 · hyparam/hypgrep

platypii · 2026-06-29T16:43:44Z

Structured and JSON greps like "role": were un-prunable: the tokenizer dropped all punctuation, so role and system matched as bare words in every block and the index could not narrow the candidate set at all.

This makes the default tokenizer keep the JSON delimiters " : { } inside n-gram runs, so n-grams span them and stay selective on structured text.

Results (WildChat, 3.2M LLM logs, full 13 GB source)

Index grows just 1.04x (1.20 to 1.24 GB), while JSON greps see order-of-magnitude wins. Code, URL, and prose greps are unchanged, with no regressions.

End-to-end to the first 100 results, alphanumeric vs structural:

query	alphanumeric	structural	speedup
`"role":`	14.7 s / 678 MB	0.35 s / 20 MB	42x
`"type": "object"`	79.0 s / 3766 MB	2.2 s / 176 MB	35x
`"temperature": 0`	45.8 s / 2596 MB	1.2 s / 81 MB	38x
`"content":`	11.0 s / 514 MB	0.9 s / 56 MB	12x
`{"name":`	5.6 s / 281 MB	0.5 s / 32 MB	11x
`system.out.println`	0.46 s	0.48 s	tie
`self.assertequal`	1.76 s	1.71 s	tie
`schopenhauer`	0.87 s	0.85 s	tie

We picked the kept-character set empirically: keeping only " : { } recovers essentially all the selectivity of keeping every punctuation character, but at 1.04x the index size instead of 2.6x, and without the one query regression that the keep-everything approach showed.

Backward compatibility

The kept-character set is recorded per index in kv metadata (hypgrep.ngram_chars) and read back at query time, so query and index always tokenize identically. An index built before this change has no such key and is read as the empty set (plain alphanumeric), so existing indexes keep working unchanged. No index format-version bump.

Tests

New unit and round-trip tests cover the structural default, the recorded character set, the regex-escaping of kept characters, and the pre-structural compatibility path. Full suite passes (95 tests), lint clean.

Structured/JSON greps like "role": were un-prunable because the tokenizer dropped all punctuation, so "role" and "system" matched as bare words in every block. Keep the JSON delimiters " : { } inside n-gram runs so n-grams span them and stay selective. Measured on WildChat (3.2M LLM logs, full 13 GB source): index grows just 1.04x (1.20 to 1.24 GB), versus 2.6x for keeping all punctuation, while JSON greps see order-of-magnitude wins. End-to-end to the first 100 results, "role": drops from 14.7 s / 678 MB scanned to 0.35 s / 20 MB (42x), and "type": "object" from 79 s to 2.2 s (35x). Code, URL, and prose greps are unchanged, with no regressions. The kept-character set is recorded per index in kv metadata (hypgrep.ngram_chars) and read back at query time so query and index tokenize identically. An index built before this change has no such key and is read as the empty set (plain alphanumeric), so it keeps working unchanged, with no index format-version bump.

platypii · 2026-06-29T16:48:25Z

Code Review: structural (JSON-delimiter) default tokenizer

Solid, well-scoped change with good test coverage and careful backward-compat reasoning. npm test passes (95/95). npm run lint reports errors only in untracked data/*.mjs files unrelated to this PR; the changed source is clean.

Correctness

praise: the regex character-class escaping is correct and complete (src/ngrams.js). The escape set /[\]\\^-]/ covers exactly the four metacharacters special inside a JS regex character class: ], \, ^, -. [ is a literal inside a class in JS, so deliberately not escaping it is right. This is the part most likely to harbor a bug, and it's right.

praise: backward AND forward compatible without a version bump, and that's justified. A pre-structural index (no hypgrep.ngram_chars) reads as '' -> plain alphanumeric, matching how it was built. The reverse also holds: a structural run is a superset that contains every plain alphanumeric run as a contiguous substring, so an old client reading a new index over-matches candidate blocks and the per-row substring filter still returns correct results. Both directions stay correct.

praise: query/index tokenizer agreement is enforced by construction. createIndex validates and records ngramChars; queryIndex reads it back and threads it into queryNgrams/literalsToNgrams. All internal callers pass chars explicitly.

Metadata round-trips, including the empty-string case (the IDX_PLAIN test confirms '' survives the kv write/read).

Nits

The exported extractNgrams/queryNgrams/literalsToNgrams default param changed to defaultNgramChars (non-empty). Internal callers all pass it explicitly so the library is self-consistent, but any external caller invoking them without a chars arg silently switches tokenization. Worth a CHANGELOG line.
Escaping test coverage is partial (test/ngrams.test.js). Tests cover ] and -; the \ and ^ branches aren't exercised. A one-line assertion (e.g. extractNgrams('a^b', 3, '^')) would lock it in.
README scope creep. The diff adds badges, a benchmark table, and a package.json description, all unrelated to the tokenizer change. Not wrong, but it muddies the PR.

Validation note

createIndex rejects alphanumerics and whitespace via /[a-z0-9\s]/i. Reasonable: whitespace is intentionally excluded (it would explode index size) and the i flag catches uppercase. Duplicate chars in a class are harmless, so no dedup needed.

Overall: ship it. Optionally add a CHANGELOG note on the default-param behavior change.

platypii merged commit 6dbe916 into master Jun 29, 2026
6 checks passed

platypii deleted the ngram-structural branch June 29, 2026 16:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Default n-gram tokenizer keeps JSON delimiters " : { }#4

Default n-gram tokenizer keeps JSON delimiters " : { }#4
platypii merged 1 commit into
masterfrom
ngram-structural

platypii commented Jun 29, 2026

Uh oh!

platypii commented Jun 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

platypii commented Jun 29, 2026

Results (WildChat, 3.2M LLM logs, full 13 GB source)

Backward compatibility

Tests

Uh oh!

platypii commented Jun 29, 2026

Code Review: structural (JSON-delimiter) default tokenizer

Correctness

Nits

Validation note

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant