
refactor: replace xxhash with stdlib hashlib.sha256#33

Open
isaacbmiller wants to merge 1 commit into main from isaac/remove-xxhash

@isaacbmiller

Replace the xxhash C-extension dependency with stdlib hashlib.sha256. The Hasher class (vendored from HuggingFace datasets) uses only xxh64(), .update(), and .hexdigest(), each of which has a call-compatible equivalent on hashlib.sha256 objects.
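To illustrate the swap described above, here is a minimal sketch of a hasher that relies only on a constructor, `.update()`, and `.hexdigest()`. `SketchHasher` and `hash_bytes` are hypothetical names written for this example; this is not the vendored Hasher code from the PR.

```python
import hashlib


class SketchHasher:
    """Hypothetical stand-in for the vendored Hasher; not the actual dspy code."""

    def __init__(self) -> None:
        # Assumption: before this PR the equivalent line constructed xxhash.xxh64().
        self._h = hashlib.sha256()

    def update(self, data: bytes) -> None:
        # Feed more bytes into the running hash state.
        self._h.update(data)

    def hexdigest(self) -> str:
        # Return the digest as a lowercase hex string.
        return self._h.hexdigest()


def hash_bytes(*chunks: bytes) -> str:
    """Hash several byte chunks incrementally and return the hex digest."""
    hasher = SketchHasher()
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest()


digest = hash_bytes(b"demo", b"-payload")
```

Because both libraries hash incrementally, feeding the chunks one at a time gives the same digest as hashing their concatenation in a single call.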

There are only two call sites in the codebase, both low-frequency:

  • bootstrap.py — seeds an RNG for demo selection during optimization
  • utils_finetune.py — generates a filename for finetuning data (once per job)

xxhash is ~4x faster than sha256, but at these call volumes (tens of calls per optimization run, not per LM call) the absolute difference is negligible.
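For anyone who wants to check the speed difference locally, a rough timing sketch follows. The 4 KiB payload and the repeat count are arbitrary choices for illustration, and the xxhash branch runs only if the package happens to be installed; results will vary by machine and input size.

```python
import hashlib
import timeit

PAYLOAD = b"x" * 4096  # arbitrary 4 KiB input, chosen only for illustration


def time_hash(make_hasher, repeats: int = 2000) -> float:
    """Return the average number of seconds taken to hash PAYLOAD once."""

    def run() -> None:
        h = make_hasher()
        h.update(PAYLOAD)
        h.hexdigest()

    return timeit.timeit(run, number=repeats) / repeats


timings = {"sha256": time_hash(hashlib.sha256)}

try:
    import xxhash  # optional; only available if the package is installed

    timings["xxh64"] = time_hash(xxhash.xxh64)
except ImportError:
    pass  # sha256-only timing is still reported

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.2f} microseconds per call")
```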

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@isaacbmiller
Author

@greptile

@greptile-apps

greptile-apps Bot commented May 7, 2026

Greptile Summary

Removes the xxhash C-extension dependency and replaces it with hashlib.sha256 from the Python stdlib. The Hasher class API is unchanged — only the underlying hash algorithm and digest length (16 → 64 hex chars) differ.
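The 16 → 64 change in digest length follows from the output sizes of the two hashes: each byte of output becomes two hex characters. A short sketch of the arithmetic, with the xxh64 size derived from its 64-bit output rather than from the xxhash package:

```python
import hashlib

sha256_bytes = hashlib.sha256().digest_size  # 32 bytes (256 bits)
xxh64_bytes = 64 // 8  # xxh64 produces a 64-bit value, i.e. 8 bytes

# Hex encoding uses two characters per byte.
sha256_hex_len = sha256_bytes * 2
xxh64_hex_len = xxh64_bytes * 2

print(xxh64_hex_len, "->", sha256_hex_len)
```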

  • dspy/utils/hasher.py: Both xxhash.xxh64() call sites replaced with hashlib.sha256(); update() and hexdigest() are drop-in compatible.
  • pyproject.toml / uv.lock: xxhash>=3.5.0 removed from dspy's direct dependencies; xxhash remains in the lockfile as a transitive dep of the optional datasets package, which is correct.

Confidence Score: 5/5

Safe to merge — the replacement is a drop-in at both call sites and removes a C-extension with no functional regression.

The two call sites (RNG seeding and filename generation) are both unaffected by the longer digest — random.Random accepts any-length string seeds, and the finetune filename write path never checks for a pre-existing file. The xxhash entry retained in uv.lock is correctly scoped to the optional datasets transitive dependency.
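A small sketch of why the longer digest is harmless for the two uses named above. The input bytes and the `.jsonl` filename pattern are assumptions for illustration; the actual code in bootstrap.py and utils_finetune.py is not shown in this PR page.

```python
import hashlib
import random

digest = hashlib.sha256(b"example-demo-set").hexdigest()  # 64 hex chars

# random.Random accepts a str seed of any length; the same seed
# reproduces the same sequence.
rng_a = random.Random(digest)
rng_b = random.Random(digest)
sample_a = [rng_a.randint(0, 99) for _ in range(5)]
sample_b = [rng_b.randint(0, 99) for _ in range(5)]

# A 16-character seed (the old digest length) is accepted in the same way.
short_rng = random.Random(digest[:16])

# Hypothetical filename pattern: the digest simply becomes a longer stem.
filename = f"{digest}.jsonl"
```

One thing worth keeping in mind: a different digest is a different seed, so if the hex string is used as the seed directly, selections made after this change would not be expected to match those made under xxhash.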

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| dspy/utils/hasher.py | Replaces xxhash.xxh64() with hashlib.sha256() throughout; API is identical (update/hexdigest), digest length changes from 16 to 64 hex chars but no callers depend on length. |
| pyproject.toml | Removes xxhash>=3.5.0 from the direct dependency list; correct and complete. |
| uv.lock | Removes xxhash from dspy's resolved dependencies; xxhash remains in the lockfile as a transitive dep of the optional datasets package, which is expected. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Caller: bootstrap.py\nSeeds random.Random] --> H[Hasher.hash]
    B[Caller: utils_finetune.py\nGenerates filename] --> H
    H --> HB[hash_bytes]
    HB --> SHA[hashlib.sha256\nnow instead of xxhash.xxh64]
    SHA --> D[64-char hex digest]
    D --> A2[random.Random seed\nany length OK]
    D --> B2[filename: hash.jsonl\n64 chars instead of 16]

Reviews (1): Last reviewed commit: "refactor: replace xxhash with stdlib has..."
