refactor: merge 4 C types into single XXHASHObject type#155
Conversation
Following CPython _hashlib's pattern (one HASH type for all algorithms), merge the four per-algorithm C types (PYXXH32Object, PYXXH64Object, PYXXH3_64Object, PYXXH3_128Object) into a single XXHASHObject type. Key changes: - Single struct with XXH_Algo enum + union of state pointers - Single set of tp_* methods with algo-based dispatch (switch) - Single methods/getseters/slots/spec (was 4 copies of each) - 4 module-level constructor functions (xxh32, xxh64, xxh3_64, xxh3_128) sharing a single _xxhash_new helper - One-shot functions consolidated via DEFINE_ONESHOT macro - PyCFunction constructors replace 4 separate type objects - Updated .pyi stubs: constructors are functions, not @Final classes Results: - C source reduced by 37% (2271 -> 1421 lines, -845 lines) - Zero warnings on all supported Python versions (3.9-3.15, 3.14t) - All 124 tests pass across all versions
There was a problem hiding this comment.
Code Review
This pull request updates the type stubs in xxhash/__init__.pyi to correctly define the xxh* constructors as functions returning a _Hasher instance rather than distinct classes, and updates the subinterpreter isolation tests accordingly. A review comment correctly identifies an incorrect alias assignment where xxh64 is assigned to xxh3_64 despite already being defined as a distinct function.
| # Aliases | ||
| xxh64 = xxh3_64 # type: ignore[assignment] |
There was a problem hiding this comment.
The assignment xxh64 = xxh3_64 is incorrect. xxh64 is a distinct algorithm and has already been correctly defined as a function on line 69. It is not an alias of xxh3_64 (unlike xxh128 which is an alias of xxh3_128). This assignment should be removed to prevent type-checking errors and confusion.
| # Aliases | |
| xxh64 = xxh3_64 # type: ignore[assignment] | |
| # Aliases |
Merging this PR will not alter performance
Comparing Footnotes
|
Replace the generic _xxhash_oneshot() dispatcher (switch(algo) x switch(return_type)) with fully specialized per-algorithm/per-return-type functions generated by DEFINE_ONESHOT_INT and DEFINE_ONESHOT_DH macros. Each function directly calls the right xxhash C function and formats the result, avoiding the double-dispatch overhead. Also extract _xxhash_hexdigest_from_bytes() and _xxhash_xxh128_intdigest_result() helpers, and remove the unused XXH_OneshotDispatch struct. Benchmark result (xxh32_hexdigest, 1KB data): Before: 421.2 ns/call After: 353.3 ns/call (~16% faster)
…l methods Define XXH_AlgoDispatch (create_state, free_state, reset, update, digest, intdigest, copy_state) and a const XXH_DISPATCH[] array indexed by XXH_Algo, eliminating all 10 switch(self->algo) statements in the stateful path (_xxhash_init_state, _xxhash_free_state, _xxhash_reset_state, _xxhash_do_update, _xxhash_new, XXHASH_init, XXHASH_digest, XXHASH_hexdigest, XXHASH_intdigest, XXHASH_copy). Also remove the unused _free_module callback (PyType_FromModuleAndSpec ties heap type lifetime to the module automatically). Benchmarks show performance at parity or slightly improved (constructor+hexdigest ~6-20% faster).
The comment explains why the extra state copy is needed: xxHash v0.7.3 and earlier have a bug where states reset with a seed will have a wild pointer to the original state when copied, causing a use-after-free if the original is freed.
The update() method is now on a unified XXHASHObject type supporting all four algorithms (XXH32, XXH64, XXH3_64, XXH3_128), not just xxh32.
… repr - XXHASH_new (tp_new) now raises TypeError instead of creating instances, matching _hashlib.HASH behavior. Only factory functions can create objects. - Add XXHASH_repr (tp_repr) showing algorithm name: <XXH32 _xxhash.HASH object> matching CPython's <md5 _hashlib.HASH object> pattern. - Rename type from 'xxhash.xxhash' to '_xxhash.HASH' to align with hashlib.
Replace the single _xxhash.HASH type with 4 named heap types (xxh32, xxh64, xxh3_64, xxh3_128), each with its own tp_vectorcall that hardcodes the algorithm. The implementation is shared through XXH_DISPATCH table — only the type name and vectorcall wrapper differ. This restores the tp_vectorcall fast path lost when constructors were changed from types to PyCFunction wrappers, matching the original master's calling convention.
…fy seed getter - Remove empty XXHASH_ModuleState (m_size=0 means no per-module state) - Drop redundant #include <string.h> (included by Python.h and xxhash.h) - Replace XXHASH_get_seed 10-line switch with 3-line if-else (no dead default)
|
@gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces alias tests for xxhash algorithms and updates the type stubs and subinterpreter tests to treat the xxh* constructors as functions returning a single _Hasher type. However, the review feedback correctly points out that these constructors are actually PyTypeObject classes at runtime. Changing them to functions in the type stubs is a regression that breaks type annotations and isinstance checks. It is recommended to keep them defined as classes in the stubs and revert the corresponding test workarounds that use type(xxhash.xxh32()) instead of referencing the classes directly.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| # All xxh* constructors return the same _Hasher type. | ||
| # The algorithm is determined internally and visible via the .name property. | ||
| def xxh32(data: _DataType = ..., seed: int = ...) -> _Hasher: ... | ||
| def xxh64(data: _DataType = ..., seed: int = ...) -> _Hasher: ... | ||
| def xxh3_64(data: _DataType = ..., seed: int = ...) -> _Hasher: ... | ||
| def xxh3_128(data: _DataType = ..., seed: int = ...) -> _Hasher: ... |
There was a problem hiding this comment.
At runtime, xxh32, xxh64, xxh3_64, and xxh3_128 are still exposed as PyTypeObjects (classes) via PyModule_AddType, not as standard Python functions.
Declaring them as functions in the type stub is a regression because:
- It prevents users from using these classes as type annotations (e.g.,
def process(hasher: xxh32)). - It breaks static analysis/type checking for
isinstancechecks (e.g.,isinstance(h, xxh32)). - The private
_Hasherclass is not exported in__all__, leaving users with no public type to use for annotations.
We should keep them defined as classes inheriting from _Hasher in the stub.
| # All xxh* constructors return the same _Hasher type. | |
| # The algorithm is determined internally and visible via the .name property. | |
| def xxh32(data: _DataType = ..., seed: int = ...) -> _Hasher: ... | |
| def xxh64(data: _DataType = ..., seed: int = ...) -> _Hasher: ... | |
| def xxh3_64(data: _DataType = ..., seed: int = ...) -> _Hasher: ... | |
| def xxh3_128(data: _DataType = ..., seed: int = ...) -> _Hasher: ... | |
| @final | |
| class xxh32(_Hasher): ... | |
| @final | |
| class xxh64(_Hasher): ... | |
| @final | |
| class xxh3_64(_Hasher): ... | |
| @final | |
| class xxh3_128(_Hasher): ... |
| """The single xxhash type is immutable inside subinterpreters (3.12+). | ||
| The constructors (xxh32, xxh64, xxh3_64, xxh3_128) are functions, | ||
| not types, so we get the type from a constructed instance.""" | ||
| self._run("""\ | ||
| import sys, xxhash | ||
| types = [xxhash.xxh32, xxhash.xxh64, xxhash.xxh3_64, xxhash.xxh3_128] | ||
| for t in types: | ||
| if sys.version_info >= (3, 12): | ||
| try: | ||
| t.newattr = 42 | ||
| raise AssertionError('type should be immutable') | ||
| except TypeError: | ||
| pass | ||
| assert t.__flags__ & (1 << 8), 'immutable flag not set' | ||
| t = type(xxhash.xxh32()) |
There was a problem hiding this comment.
The constructors xxh32, xxh64, xxh3_64, and xxh3_128 are actually PyTypeObjects (classes) at runtime, added to the module via PyModule_AddType. They are not standard functions.
Therefore, we can reference the type objects directly (e.g., xxhash.xxh32) instead of using type(xxhash.xxh32()), and we can simplify the docstring accordingly.
| """The single xxhash type is immutable inside subinterpreters (3.12+). | |
| The constructors (xxh32, xxh64, xxh3_64, xxh3_128) are functions, | |
| not types, so we get the type from a constructed instance.""" | |
| self._run("""\ | |
| import sys, xxhash | |
| types = [xxhash.xxh32, xxhash.xxh64, xxhash.xxh3_64, xxhash.xxh3_128] | |
| for t in types: | |
| if sys.version_info >= (3, 12): | |
| try: | |
| t.newattr = 42 | |
| raise AssertionError('type should be immutable') | |
| except TypeError: | |
| pass | |
| assert t.__flags__ & (1 << 8), 'immutable flag not set' | |
| t = type(xxhash.xxh32()) | |
| """Hash types are immutable inside subinterpreters (3.12+).""" | |
| self._run("""\ | |
| import sys, xxhash | |
| t = xxhash.xxh32 |
| import sys, xxhash | ||
| types = [xxhash.xxh32, xxhash.xxh64, xxhash.xxh3_64, xxhash.xxh3_128] | ||
| for t in types: | ||
| if sys.version_info >= (3, 12): | ||
| assert t.__flags__ & (1 << 8), 'immutable flag not set' | ||
| try: | ||
| t.newattr = 42 | ||
| raise AssertionError('type should be immutable') | ||
| except TypeError: | ||
| pass | ||
| t = type(xxhash.xxh32()) |
There was a problem hiding this comment.
Since xxhash.xxh32 is a type object at runtime, we can reference it directly instead of calling it and getting its type.
| import sys, xxhash | |
| types = [xxhash.xxh32, xxhash.xxh64, xxhash.xxh3_64, xxhash.xxh3_128] | |
| for t in types: | |
| if sys.version_info >= (3, 12): | |
| assert t.__flags__ & (1 << 8), 'immutable flag not set' | |
| try: | |
| t.newattr = 42 | |
| raise AssertionError('type should be immutable') | |
| except TypeError: | |
| pass | |
| t = type(xxhash.xxh32()) | |
| import sys, xxhash | |
| t = xxhash.xxh32 |
PyPy does not implement tp_vectorcall, so type_call falls back to tp_new + tp_init. XXHASH_new now determines the algorithm from the type's tp_name and allocates a default-initialized object; XXHASH_init then sets the real seed and processes the data. On CPython 3.9+ this path is normally bypassed by tp_vectorcall.
Following CPython _hashlib's pattern (one HASH type for all algorithms), merge the four per-algorithm C types (PYXXH32Object, PYXXH64Object, PYXXH3_64Object, PYXXH3_128Object) into a single XXHASHObject type.
Key changes:
Results: