refactor: replace regex with stdlib re and brace counter #35
isaacbmiller wants to merge 1 commit into main
Conversation
- dpr.py: swap \p{L}/\p{N}/\p{M}/\p{Z}/\p{C} Unicode property escapes
for \w and \S approximations (covers all practical tokenization cases)
- json_adapter.py: replace recursive (?R) pattern with a brace-counting
_extract_outermost_json_object() function
- Remove regex>=2023.10.3 from direct dependencies (still transitive
via tiktoken)
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Greptile Summary

This PR removes `regex` as a direct dependency, replacing both of its usages with stdlib equivalents.

Confidence Score: 4/5. Safe to merge; the JSON adapter replacement is functionally correct, and the DPR tokeniser changes are minor approximations unlikely to affect production workloads. The tokenisation changes in dspy/dsp/utils/dpr.py deserve a second look for projects that rely on exact token boundaries.

Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["JSONAdapter.parse(completion)"] --> B["json_repair.loads(completion)"]
    B --> C{isinstance dict?}
    C -- Yes --> G[Return fields]
    C -- No --> D["_extract_outermost_json_object(completion)"]
    D --> E{found braces?}
    E -- No --> F["raise AdapterParseError"]
    E -- Yes --> H["Scan char-by-char\ntrack in_string + escape + depth"]
    H --> I{depth == 0?}
    I -- Yes --> J["Return text[start:i+1]"]
    I -- No --> H
    J --> K["json_repair.loads(extracted)"]
    K --> L{isinstance dict?}
    L -- Yes --> G
    L -- No --> F
```
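For reference, a minimal sketch of the brace-counting scan the flowchart describes (the function name matches the PR; the actual implementation in dspy/adapters/json_adapter.py may differ in details such as error handling):

```python
def _extract_outermost_json_object(text: str) -> str | None:
    """Return the first outermost {...} span in `text`, or None if absent."""
    start = text.find("{")
    if start == -1:
        return None  # no braces found -> caller raises AdapterParseError

    depth = 0          # current brace nesting level
    in_string = False  # currently inside a JSON string literal?
    escape = False     # previous char was a backslash inside a string?
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True   # braces inside strings must not affect depth
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None  # unbalanced braces
```

Per the flowchart, the extracted span is then fed back through json_repair.loads(), and the fields are returned only if the result is a dict.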
Reviews (1): Last reviewed commit: "refactor: replace regex with stdlib re a..."
```diff
 class SimpleTokenizer(Tokenizer):
-    ALPHA_NUM = r"[\p{L}\p{N}\p{M}]+"
-    NON_WS = r"[^\p{Z}\p{C}]"
+    ALPHA_NUM = r"[\w\u0300-\u036f]+"
```
\w includes underscore, silently changing tokenisation
Python's \w matches [a-zA-Z0-9_] plus all Unicode alphanumerics (Unicode matching is the default for str patterns), whereas the original [\p{L}\p{N}\p{M}] never included _. Under the old pattern hello_world produced three tokens (hello, _, world, with _ caught by NON_WS); under the new one it produces the single token hello_world. This changes how underscore-delimited technical terms are tokenised when SimpleTokenizer is used for retrieval, potentially shifting exact-match recall for queries or passages that contain underscores.
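A quick way to see the drift (a sketch; the combined alternation mirrors how SimpleTokenizer uses the two patterns, though its actual compile flags may differ):

```python
import re      # stdlib
import regex   # third-party module that supports \p{...} property escapes

old = regex.compile(r"[\p{L}\p{N}\p{M}]+|[^\p{Z}\p{C}]", regex.UNICODE)
new = re.compile(r"[\w\u0300-\u036f]+|\S", re.UNICODE)

# Old: underscore is Pc (connector punctuation), so it falls through to
# the single-char NON_WS branch. New: \w absorbs it into one token.
print([m.group() for m in old.finditer("hello_world")])  # ['hello', '_', 'world']
print([m.group() for m in new.finditer("hello_world")])  # ['hello_world']
```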
| ALPHA_NUM = r"[\p{L}\p{N}\p{M}]+" | ||
| NON_WS = r"[^\p{Z}\p{C}]" | ||
| ALPHA_NUM = r"[\w\u0300-\u036f]+" | ||
| NON_WS = r"\S" |
\S is not equivalent to [^\p{Z}\p{C}] for Unicode format characters
[^\p{Z}\p{C}] excludes all Unicode "Other" characters (\p{C}), which includes invisible format characters such as zero-width joiners (U+200D), the BOM (U+FEFF), and soft hyphens (U+00AD). \S (even with re.UNICODE) does not exclude most of these: they are non-whitespace, so \S matches them and they will surface as single-character NON_WS tokens where the original silently ignored them. For realistic NLP corpora this is unlikely to matter, but it is a semantic drift worth documenting.
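A sketch demonstrating the drift with a zero-width joiner (same hypothetical combined patterns as in the previous comment):

```python
import re
import regex  # third-party; needed for the original \p{...} escapes

text = "foo\u200dbar"  # zero-width joiner (category Cf) between two words

old = regex.compile(r"[\p{L}\p{N}\p{M}]+|[^\p{Z}\p{C}]", regex.UNICODE)
new = re.compile(r"[\w\u0300-\u036f]+|\S", re.UNICODE)

# Old: U+200D is \p{C}, so neither branch matches it and it is skipped.
# New: it is non-whitespace, so \S emits it as a one-char token.
print([m.group() for m in old.finditer(text)])  # ['foo', 'bar']
print([m.group() for m in new.finditer(text)])  # ['foo', '\u200d', 'bar']
```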
Summary

Remove `regex` as a direct dependency by replacing both usages with stdlib equivalents. `regex` remains transitively available via `tiktoken`.

Changes

- `dspy/dsp/utils/dpr.py`: Replace the `\p{L}\p{N}\p{M}` Unicode property escapes with `\w` plus the combining-diacriticals range, and `[^\p{Z}\p{C}]` with `\S`. Covers all practical tokenization cases (tested with accented Unicode text).
- `dspy/adapters/json_adapter.py`: Replace the recursive `(?R)` regex pattern with an `_extract_outermost_json_object()` brace-counting function that handles nested braces, string-embedded braces, and escaped quotes.
- `pyproject.toml`: Remove `regex>=2023.10.3` from direct dependencies.

Testing
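A quick illustration of the accented-Unicode case called out above (a sketch, not the project's test suite; the combined pattern assumes SimpleTokenizer joins ALPHA_NUM and NON_WS as alternatives):

```python
import re

ALPHA_NUM = r"[\w\u0300-\u036f]+"
NON_WS = r"\S"
# Assumed combination -- SimpleTokenizer's real compile flags may differ.
pattern = re.compile(f"({ALPHA_NUM})|({NON_WS})", re.UNICODE)

# Decomposed "café": 'e' followed by a combining acute accent (U+0301).
# The explicit \u0300-\u036f range keeps the accent inside its token;
# plain \w alone would not (combining marks are not alphanumeric).
tokens = [m.group() for m in pattern.finditer("cafe\u0301 au lait")]
assert tokens == ["cafe\u0301", "au", "lait"]
```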