refactor: replace xxhash with stdlib hashlib.sha256 #33
isaacbmiller wants to merge 1 commit into main
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@greptile
Greptile Summary
Removes the

Confidence Score: 5/5
Safe to merge: the replacement is a drop-in at both call sites and removes a C-extension with no functional regression.

The two call sites (RNG seeding and filename generation) are both unaffected by the longer digest: `random.Random` accepts any-length string seeds, and the finetune filename write path never checks for a pre-existing file. The `xxhash` entry retained in `uv.lock` is correctly scoped to the optional `datasets` transitive dependency. No files require special attention.

Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Caller: bootstrap.py\nSeeds random.Random] --> H[Hasher.hash]
B[Caller: utils_finetune.py\nGenerates filename] --> H
H --> HB[hash_bytes]
HB --> SHA[hashlib.sha256\nnow instead of xxhash.xxh64]
SHA --> D[64-char hex digest]
D --> A2[random.Random seed\nany length OK]
D --> B2[filename: hash.jsonl\n64 chars instead of 16]
Reviews (1): Last reviewed commit: "refactor: replace xxhash with stdlib has..."
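The two effects the review describes (a 64-character digest instead of 16, and a string seed of any length being accepted by `random.Random`) can be checked with a short standalone snippet. This is only an illustrative sketch: the payload bytes and variable names are made up here and are not taken from `bootstrap.py` or `utils_finetune.py`.

```python
import hashlib
import random

# Hypothetical payload; the real callers hash their own serialized inputs.
digest = hashlib.sha256(b"example payload").hexdigest()

# A SHA-256 hex digest is 64 characters (xxh64 produced 16).
digest_length = len(digest)

# random.Random accepts a str seed of any length and is deterministic for it.
rng_a = random.Random(digest)
rng_b = random.Random(digest)
same_stream = rng_a.random() == rng_b.random()

# Filename built in the "<hash>.jsonl" shape shown in the flowchart.
filename = f"{digest}.jsonl"
```

The only observable change for the filename path is its length; nothing here depends on the digest being short.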
Replace the `xxhash` C-extension dependency with stdlib `hashlib.sha256`. The `Hasher` class (vendored from HuggingFace `datasets`) uses only `xxh64()`, `.update()`, and `.hexdigest()`, all of which have identical API equivalents in `hashlib.sha256`.

Only 2 call sites in the codebase, both low-frequency:
- `bootstrap.py`: seeds an RNG for demo selection during optimization
- `utils_finetune.py`: generates a filename for finetuning data (once per job)

xxhash is ~4x faster than sha256, but at these call volumes (tens of calls per optimization run, not per LM call) the absolute difference is negligible.
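As a rough illustration of why the swap is a drop-in, here is a minimal sketch of a `hash_bytes` helper written against `hashlib.sha256`. It is an assumption about the shape of the vendored code, not a copy of it: the standalone function signature and the handling of a list of byte chunks are hypothetical, and the real `Hasher` class may differ.

```python
import hashlib
from typing import List, Union


def hash_bytes(value: Union[bytes, List[bytes]]) -> str:
    """Return a hex digest for one bytes object or a list of byte chunks."""
    chunks = [value] if isinstance(value, bytes) else value
    m = hashlib.sha256()  # the only line that changes; previously xxhash.xxh64()
    for chunk in chunks:
        m.update(chunk)  # same incremental-update call as before
    return m.hexdigest()  # same call, now 64 hex characters instead of 16
```

Because `.update()` is incremental, hashing `[b"a", b"bc"]` gives the same digest as hashing `b"abc"` in one call, which matches how the xxhash object behaved.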