refactor: replace regex with stdlib re and brace counter #35
isaacbmiller wants to merge 1 commit into main
Conversation
- dpr.py: swap \p{L}/\p{N}/\p{M}/\p{Z}/\p{C} Unicode property escapes
for \w and \S approximations (covers all practical tokenization cases)
- json_adapter.py: replace recursive (?R) pattern with a brace-counting
_extract_outermost_json_object() function
- Remove regex>=2023.10.3 from direct dependencies (still transitive
via tiktoken)
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Greptile Summary

This PR removes `regex` as a direct dependency, replacing both of its usages with stdlib equivalents.

Confidence Score: 4/5. Safe to merge; the JSON adapter replacement is functionally correct, and the DPR tokeniser changes are minor approximations unlikely to affect production workloads. The tokenisation changes in dspy/dsp/utils/dpr.py deserve a second look for projects that rely on exact token boundaries.

Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["JSONAdapter.parse(completion)"] --> B["json_repair.loads(completion)"]
    B --> C{isinstance dict?}
    C -- Yes --> G[Return fields]
    C -- No --> D["_extract_outermost_json_object(completion)"]
    D --> E{found braces?}
    E -- No --> F["raise AdapterParseError"]
    E -- Yes --> H["Scan char-by-char\ntrack in_string + escape + depth"]
    H --> I{depth == 0?}
    I -- Yes --> J["Return text[start:i+1]"]
    I -- No --> H
    J --> K["json_repair.loads(extracted)"]
    K --> L{isinstance dict?}
    L -- Yes --> G
    L -- No --> F
```
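For reference, a minimal sketch of the brace-counting scan the flowchart describes (the function name matches the PR; the actual implementation in dspy/adapters/json_adapter.py may differ in details such as error handling):

```python
def _extract_outermost_json_object(text: str) -> str | None:
    """Return the first outermost {...} span in `text`, or None if absent."""
    start = text.find("{")
    if start == -1:
        return None  # no braces found -> caller raises AdapterParseError

    depth = 0          # current brace nesting level
    in_string = False  # currently inside a JSON string literal?
    escape = False     # previous char was a backslash inside a string?
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True   # braces inside strings must not affect depth
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None  # unbalanced braces
```

Per the flowchart, the extracted span is then fed back through json_repair.loads(), and the fields are returned only if the result is a dict.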
Reviews (1): Last reviewed commit: "refactor: replace regex with stdlib re a..."
```diff
 class SimpleTokenizer(Tokenizer):
-    ALPHA_NUM = r"[\p{L}\p{N}\p{M}]+"
-    NON_WS = r"[^\p{Z}\p{C}]"
+    ALPHA_NUM = r"[\w\u0300-\u036f]+"
```
\w includes underscore, silently changing tokenisation
Python's \w matches [a-zA-Z0-9_] plus all Unicode alphanumerics (Unicode matching is the default for str patterns), whereas the original [\p{L}\p{N}\p{M}] never included _. Under the old pattern hello_world produced three tokens (hello, _, world, with _ caught by NON_WS); under the new one it produces the single token hello_world. This changes how underscore-delimited technical terms are tokenised when SimpleTokenizer is used for retrieval, potentially shifting exact-match recall for queries or passages that contain underscores.
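A quick way to see the drift (a sketch; the combined alternation mirrors how SimpleTokenizer uses the two patterns, though its actual compile flags may differ):

```python
import re      # stdlib
import regex   # third-party module that supports \p{...} property escapes

old = regex.compile(r"[\p{L}\p{N}\p{M}]+|[^\p{Z}\p{C}]", regex.UNICODE)
new = re.compile(r"[\w\u0300-\u036f]+|\S", re.UNICODE)

# Old: underscore is Pc (connector punctuation), so it falls through to
# the single-char NON_WS branch. New: \w absorbs it into one token.
print([m.group() for m in old.finditer("hello_world")])  # ['hello', '_', 'world']
print([m.group() for m in new.finditer("hello_world")])  # ['hello_world']
```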
| ALPHA_NUM = r"[\p{L}\p{N}\p{M}]+" | ||
| NON_WS = r"[^\p{Z}\p{C}]" | ||
| ALPHA_NUM = r"[\w\u0300-\u036f]+" | ||
| NON_WS = r"\S" |
\S is not equivalent to [^\p{Z}\p{C}] for Unicode format characters
[^\p{Z}\p{C}] excludes all Unicode "Other" characters (\p{C}), which includes invisible format characters such as zero-width joiners (U+200D), the BOM (U+FEFF), and soft hyphens (U+00AD). \S (even with re.UNICODE) does not exclude most of these: they are non-whitespace, so \S matches them and they will surface as single-character NON_WS tokens where the original silently ignored them. For realistic NLP corpora this is unlikely to matter, but it is a semantic drift worth documenting.
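A sketch demonstrating the drift with a zero-width joiner (same hypothetical combined patterns as in the previous comment):

```python
import re
import regex  # third-party; needed for the original \p{...} escapes

text = "foo\u200dbar"  # zero-width joiner (category Cf) between two words

old = regex.compile(r"[\p{L}\p{N}\p{M}]+|[^\p{Z}\p{C}]", regex.UNICODE)
new = re.compile(r"[\w\u0300-\u036f]+|\S", re.UNICODE)

# Old: U+200D is \p{C}, so neither branch matches it and it is skipped.
# New: it is non-whitespace, so \S emits it as a one-char token.
print([m.group() for m in old.finditer(text)])  # ['foo', 'bar']
print([m.group() for m in new.finditer(text)])  # ['foo', '\u200d', 'bar']
```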
Summary

Remove `regex` as a direct dependency by replacing both usages with stdlib equivalents. `regex` remains transitively available via `tiktoken`.

Changes

- `dspy/dsp/utils/dpr.py`: Replace the `\p{L}\p{N}\p{M}` Unicode property escapes with `\w` plus the combining-diacriticals range, and `[^\p{Z}\p{C}]` with `\S`. Covers all practical tokenization cases (tested with accented Unicode text).
- `dspy/adapters/json_adapter.py`: Replace the recursive `(?R)` regex pattern with an `_extract_outermost_json_object()` brace-counting function that handles nested braces, string-embedded braces, and escaped quotes.
- `pyproject.toml`: Remove `regex>=2023.10.3` from direct dependencies.

Testing
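A quick illustration of the accented-Unicode case called out above (a sketch, not the project's test suite; the combined pattern assumes SimpleTokenizer joins ALPHA_NUM and NON_WS as alternatives):

```python
import re

ALPHA_NUM = r"[\w\u0300-\u036f]+"
NON_WS = r"\S"
# Assumed combination -- SimpleTokenizer's real compile flags may differ.
pattern = re.compile(f"({ALPHA_NUM})|({NON_WS})", re.UNICODE)

# Decomposed "café": 'e' followed by a combining acute accent (U+0301).
# The explicit \u0300-\u036f range keeps the accent inside its token;
# plain \w alone would not (combining marks are not alphanumeric).
tokens = [m.group() for m in pattern.finditer("cafe\u0301 au lait")]
assert tokens == ["cafe\u0301", "au", "lait"]
```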