
refactor: replace regex with stdlib re and brace counter #35

Open
isaacbmiller wants to merge 1 commit into main from isaac/remove-regex

Conversation

@isaacbmiller

Summary

Remove regex as a direct dependency by replacing both usages with stdlib equivalents. regex remains transitively available via tiktoken.

Changes

  • dspy/dsp/utils/dpr.py: Replace the \p{L}\p{N}\p{M} Unicode property escapes with \w plus the combining-diacriticals range, and the negated class [^\p{Z}\p{C}] with \S. Covers all practical tokenization cases (tested with accented Unicode text).
  • dspy/adapters/json_adapter.py: Replace recursive (?R) regex pattern with a _extract_outermost_json_object() brace-counting function that handles nested braces, string-embedded braces, and escaped quotes.
  • pyproject.toml: Remove regex>=2023.10.3 from direct dependencies.

Testing

  • 28 json adapter tests pass
  • Manual DPR SimpleTokenizer verification with Unicode text (Café, résumé, naïve)
  • Edge case assertions for brace counter (nested, string-embedded braces, escaped quotes, no JSON)

- dpr.py: swap \p{L}/\p{N}/\p{M}/\p{Z}/\p{C} Unicode property escapes
  for \w and \S approximations (covers all practical tokenization cases)
- json_adapter.py: replace recursive (?R) pattern with a brace-counting
  _extract_outermost_json_object() function
- Remove regex>=2023.10.3 from direct dependencies (still transitive
  via tiktoken)

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@greptile-apps

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR removes regex as a direct dependency by substituting stdlib re for both usages: a brace-counting parser replaces the recursive (?R) pattern in the JSON adapter, and Unicode property escapes in SimpleTokenizer are approximated with \w + a combining-marks range.

  • json_adapter.py: _extract_outermost_json_object correctly handles nested braces, string-embedded braces, and escaped quotes; the logic is a faithful functional replacement for the removed recursive regex.
  • dsp/utils/dpr.py: The re substitutions are reasonable approximations for Latin-script text, but \w includes underscore (not in the original \p{L}\p{N}\p{M}), and \S admits Unicode format characters that the original [^\p{Z}\p{C}] excluded—both represent subtle tokenisation behaviour changes.
  • pyproject.toml / uv.lock: regex cleanly removed from direct dependencies.

Confidence Score: 4/5

Safe to merge; the JSON adapter replacement is functionally correct, and the DPR tokeniser changes are minor approximations unlikely to affect production workloads.

The brace-counting parser in json_adapter.py is a clean, well-tested replacement and poses no regression risk. The dpr.py changes introduce two quiet tokenisation shifts—\w treating underscores as alphanumeric characters and \S admitting certain Unicode format characters—that differ from the original regex semantics. Neither is likely to surface in typical use, but they are undocumented behaviour changes in a retrieval utility where exact token matching can matter.

dspy/dsp/utils/dpr.py deserves a second look for projects that rely on SimpleTokenizer with technical text (underscores) or text containing Unicode format characters.

Important Files Changed

Filename Overview
dspy/adapters/json_adapter.py Replaces recursive (?R) regex with a pure-Python brace-counting parser; logic handles nested braces, string-embedded braces, and escaped quotes correctly.
dspy/dsp/utils/dpr.py Swaps regex for stdlib re; \w in ALPHA_NUM silently adds underscore to alphanumeric tokens and \S in NON_WS admits Unicode format characters that the original [^\p{Z}\p{C}] excluded.
pyproject.toml Removes regex>=2023.10.3 from direct dependencies; regex remains transitively available via tiktoken.
uv.lock Lockfile updated to reflect removal of regex as a direct dependency; no other changes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["JSONAdapter.parse(completion)"] --> B["json_repair.loads(completion)"]
    B --> C{isinstance dict?}
    C -- Yes --> G[Return fields]
    C -- No --> D["_extract_outermost_json_object(completion)"]
    D --> E{found braces?}
    E -- No --> F["raise AdapterParseError"]
    E -- Yes --> H["Scan char-by-char\ntrack in_string + escape + depth"]
    H --> I{depth == 0?}
    I -- Yes --> J["Return text[start:i+1]"]
    I -- No --> H
    J --> K["json_repair.loads(extracted)"]
    K --> L{isinstance dict?}
    L -- Yes --> G
    L -- No --> F


Comment thread dspy/dsp/utils/dpr.py

 class SimpleTokenizer(Tokenizer):
-    ALPHA_NUM = r"[\p{L}\p{N}\p{M}]+"
-    NON_WS = r"[^\p{Z}\p{C}]"
+    ALPHA_NUM = r"[\w\u0300-\u036f]+"

P2 \w includes underscore, silently changing tokenisation

Python's \w is [a-zA-Z0-9_] (plus Unicode letters/digits with the UNICODE flag), whereas the original [\p{L}\p{N}\p{M}] never included _. Under the old pattern hello_world produced three tokens (hello, _, world via NON_WS); under the new one it produces a single token hello_world. This changes how underscore-delimited technical terms are tokenised when SimpleTokenizer is used for retrieval, potentially shifting exact-match recall for queries or passages that contain underscores.
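The shift can be demonstrated with stdlib re alone. The old \p{L}\p{N}\p{M} class is approximated here as [^\W_] (word characters minus underscore); that approximation is mine for illustration, not part of the PR:

```python
import re

text = "hello_world"

# New classes from the PR, combined into one tokenizer-style alternation:
# \w includes underscore, so the whole identifier is one token.
new_tokens = re.findall(r"[\w\u0300-\u036f]+|\S", text)

# Old behaviour approximated in stdlib re: [^\W_] is "word chars minus _",
# roughly \p{L}\p{N}; the underscore falls through to the single-char branch.
old_tokens = re.findall(r"[^\W_]+|\S", text)

print(new_tokens)  # ['hello_world']
print(old_tokens)  # ['hello', '_', 'world']
```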

Comment thread dspy/dsp/utils/dpr.py

-    ALPHA_NUM = r"[\p{L}\p{N}\p{M}]+"
-    NON_WS = r"[^\p{Z}\p{C}]"
+    ALPHA_NUM = r"[\w\u0300-\u036f]+"
+    NON_WS = r"\S"

P2 \S is not equivalent to [^\p{Z}\p{C}] for Unicode format characters

[^\p{Z}\p{C}] excludes all Unicode "Other" characters (\p{C}), which includes invisible format characters such as zero-width joiners (U+200D), the BOM (U+FEFF), and soft hyphens (U+00AD). \S (even with re.UNICODE) does not exclude most of these: they are non-whitespace, so \S matches them and they will surface as single-character NON_WS tokens where the original silently ignored them. For realistic NLP corpora this is unlikely to matter, but it is a semantic drift worth documenting.
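This is easy to check with a few lines of stdlib Python (my illustration, not part of the PR), using the zero-width joiner as an example format character:

```python
import re
import unicodedata

zwj = "\u200d"  # ZERO WIDTH JOINER, general category Cf (part of \p{C})

print(unicodedata.category(zwj))      # 'Cf'
print(zwj.isspace())                  # False -> \S will match it
print(re.findall(r"\S", f"a{zwj}b"))  # ['a', '\u200d', 'b']
```

Under the old [^\p{Z}\p{C}] the joiner would have been skipped; under \S it surfaces as its own single-character token.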

