Skip to content

refactor: merge 4 C types into single XXHASHObject type#155

Open
ifduyue wants to merge 15 commits into
masterfrom
one-type-multiple-algo
Open

refactor: merge 4 C types into single XXHASHObject type#155
ifduyue wants to merge 15 commits into
masterfrom
one-type-multiple-algo

Conversation

@ifduyue

@ifduyue ifduyue commented May 30, 2026

Copy link
Copy Markdown
Owner

Following CPython _hashlib's pattern (one HASH type for all algorithms), merge the four per-algorithm C types (PYXXH32Object, PYXXH64Object, PYXXH3_64Object, PYXXH3_128Object) into a single XXHASHObject type.

Key changes:

  • Single struct with XXH_Algo enum + union of state pointers
  • Single set of tp_* methods with algo-based dispatch (switch)
  • Single methods/getseters/slots/spec (was 4 copies of each)
  • 4 module-level constructor functions (xxh32, xxh64, xxh3_64, xxh3_128) sharing a single _xxhash_new helper
  • One-shot functions consolidated via DEFINE_ONESHOT macro
  • PyCFunction constructors replace 4 separate type objects
  • Updated .pyi stubs: constructors are functions, not @Final classes

Results:

  • C source reduced by 37% (2271 -> 1421 lines, -845 lines)
  • Zero warnings on all supported Python versions (3.9-3.15, 3.14t)
  • All 124 tests pass across all versions

Following CPython _hashlib's pattern (one HASH type for all algorithms),
merge the four per-algorithm C types (PYXXH32Object, PYXXH64Object,
PYXXH3_64Object, PYXXH3_128Object) into a single XXHASHObject type.

Key changes:
- Single struct with XXH_Algo enum + union of state pointers
- Single set of tp_* methods with algo-based dispatch (switch)
- Single methods/getseters/slots/spec (was 4 copies of each)
- 4 module-level constructor functions (xxh32, xxh64, xxh3_64, xxh3_128)
  sharing a single _xxhash_new helper
- One-shot functions consolidated via DEFINE_ONESHOT macro
- PyCFunction constructors replace 4 separate type objects
- Updated .pyi stubs: constructors are functions, not @Final classes

Results:
- C source reduced by 37% (2271 -> 1421 lines, -845 lines)
- Zero warnings on all supported Python versions (3.9-3.15, 3.14t)
- All 124 tests pass across all versions

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the type stubs in xxhash/__init__.pyi to correctly define the xxh* constructors as functions returning a _Hasher instance rather than distinct classes, and updates the subinterpreter isolation tests accordingly. A review comment correctly identifies an incorrect alias assignment where xxh64 is assigned to xxh3_64 despite already being defined as a distinct function.

Comment thread xxhash/__init__.pyi Outdated
Comment on lines +73 to +74
# Aliases
xxh64 = xxh3_64 # type: ignore[assignment]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The assignment xxh64 = xxh3_64 is incorrect. xxh64 is a distinct algorithm and has already been correctly defined as a function on line 69. It is not an alias of xxh3_64 (unlike xxh128 which is an alias of xxh3_128). This assignment should be removed to prevent type-checking errors and confusion.

Suggested change
# Aliases
xxh64 = xxh3_64 # type: ignore[assignment]
# Aliases

@codspeed-hq

codspeed-hq Bot commented May 30, 2026

Copy link
Copy Markdown
Contributor

Merging this PR will not alter performance

✅ 148 untouched benchmarks
⏩ 66 skipped benchmarks1


Comparing one-type-multiple-algo (d6fb0b4) with master (8114fd1)

Open in CodSpeed

Footnotes

  1. 66 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

ifduyue added 2 commits May 30, 2026 16:27
Replace the generic _xxhash_oneshot() dispatcher (switch(algo) x switch(return_type))
with fully specialized per-algorithm/per-return-type functions generated by
DEFINE_ONESHOT_INT and DEFINE_ONESHOT_DH macros. Each function directly calls
the right xxhash C function and formats the result, avoiding the double-dispatch
overhead.

Also extract _xxhash_hexdigest_from_bytes() and _xxhash_xxh128_intdigest_result()
helpers, and remove the unused XXH_OneshotDispatch struct.

Benchmark result (xxh32_hexdigest, 1KB data):
  Before: 421.2 ns/call
  After:  353.3 ns/call  (~16% faster)
…l methods

Define XXH_AlgoDispatch (create_state, free_state, reset, update,
digest, intdigest, copy_state) and a const XXH_DISPATCH[] array
indexed by XXH_Algo, eliminating all 10 switch(self->algo) statements
in the stateful path (_xxhash_init_state, _xxhash_free_state,
_xxhash_reset_state, _xxhash_do_update, _xxhash_new, XXHASH_init,
XXHASH_digest, XXHASH_hexdigest, XXHASH_intdigest, XXHASH_copy).

Also remove the unused _free_module callback (PyType_FromModuleAndSpec
ties heap type lifetime to the module automatically).

Benchmarks show performance at parity or slightly improved
(constructor+hexdigest ~6-20% faster).
ifduyue added 4 commits June 5, 2026 16:30
The comment explains why the extra state copy is needed: xxHash v0.7.3 and
earlier have a bug where states reset with a seed will have a wild pointer
to the original state when copied, causing a use-after-free if the original
is freed.
The update() method is now on a unified XXHASHObject type supporting all
four algorithms (XXH32, XXH64, XXH3_64, XXH3_128), not just xxh32.
… repr

- XXHASH_new (tp_new) now raises TypeError instead of creating instances,
  matching _hashlib.HASH behavior. Only factory functions can create objects.
- Add XXHASH_repr (tp_repr) showing algorithm name: <XXH32 _xxhash.HASH object>
  matching CPython's <md5 _hashlib.HASH object> pattern.
- Rename type from 'xxhash.xxhash' to '_xxhash.HASH' to align with hashlib.
ifduyue added 4 commits June 17, 2026 14:01
Replace the single _xxhash.HASH type with 4 named heap types
(xxh32, xxh64, xxh3_64, xxh3_128), each with its own tp_vectorcall
that hardcodes the algorithm.  The implementation is shared through
XXH_DISPATCH table — only the type name and vectorcall wrapper differ.

This restores the tp_vectorcall fast path lost when constructors were
changed from types to PyCFunction wrappers, matching the original
master's calling convention.
…fy seed getter

- Remove empty XXHASH_ModuleState (m_size=0 means no per-module state)
- Drop redundant #include <string.h> (included by Python.h and xxhash.h)
- Replace XXHASH_get_seed 10-line switch with 3-line if-else (no dead default)
@ifduyue

ifduyue commented Jun 17, 2026

Copy link
Copy Markdown
Owner Author

@gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces alias tests for xxhash algorithms and updates the type stubs and subinterpreter tests to treat the xxh* constructors as functions returning a single _Hasher type. However, the review feedback correctly points out that these constructors are actually PyTypeObject classes at runtime. Changing them to functions in the type stubs is a regression that breaks type annotations and isinstance checks. It is recommended to keep them defined as classes in the stubs and revert the corresponding test workarounds that use type(xxhash.xxh32()) instead of referencing the classes directly.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread xxhash/__init__.pyi
Comment on lines +66 to +71
# All xxh* constructors return the same _Hasher type.
# The algorithm is determined internally and visible via the .name property.
def xxh32(data: _DataType = ..., seed: int = ...) -> _Hasher: ...
def xxh64(data: _DataType = ..., seed: int = ...) -> _Hasher: ...
def xxh3_64(data: _DataType = ..., seed: int = ...) -> _Hasher: ...
def xxh3_128(data: _DataType = ..., seed: int = ...) -> _Hasher: ...

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

At runtime, xxh32, xxh64, xxh3_64, and xxh3_128 are still exposed as PyTypeObjects (classes) via PyModule_AddType, not as standard Python functions.

Declaring them as functions in the type stub is a regression because:

  1. It prevents users from using these classes as type annotations (e.g., def process(hasher: xxh32)).
  2. It breaks static analysis/type checking for isinstance checks (e.g., isinstance(h, xxh32)).
  3. The private _Hasher class is not exported in __all__, leaving users with no public type to use for annotations.

We should keep them defined as classes inheriting from _Hasher in the stub.

Suggested change
# All xxh* constructors return the same _Hasher type.
# The algorithm is determined internally and visible via the .name property.
def xxh32(data: _DataType = ..., seed: int = ...) -> _Hasher: ...
def xxh64(data: _DataType = ..., seed: int = ...) -> _Hasher: ...
def xxh3_64(data: _DataType = ..., seed: int = ...) -> _Hasher: ...
def xxh3_128(data: _DataType = ..., seed: int = ...) -> _Hasher: ...
@final
class xxh32(_Hasher): ...
@final
class xxh64(_Hasher): ...
@final
class xxh3_64(_Hasher): ...
@final
class xxh3_128(_Hasher): ...

Comment on lines +160 to +165
"""The single xxhash type is immutable inside subinterpreters (3.12+).
The constructors (xxh32, xxh64, xxh3_64, xxh3_128) are functions,
not types, so we get the type from a constructed instance."""
self._run("""\
import sys, xxhash
types = [xxhash.xxh32, xxhash.xxh64, xxhash.xxh3_64, xxhash.xxh3_128]
for t in types:
if sys.version_info >= (3, 12):
try:
t.newattr = 42
raise AssertionError('type should be immutable')
except TypeError:
pass
assert t.__flags__ & (1 << 8), 'immutable flag not set'
t = type(xxhash.xxh32())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The constructors xxh32, xxh64, xxh3_64, and xxh3_128 are actually PyTypeObjects (classes) at runtime, added to the module via PyModule_AddType. They are not standard functions.

Therefore, we can reference the type objects directly (e.g., xxhash.xxh32) instead of using type(xxhash.xxh32()), and we can simplify the docstring accordingly.

Suggested change
"""The single xxhash type is immutable inside subinterpreters (3.12+).
The constructors (xxh32, xxh64, xxh3_64, xxh3_128) are functions,
not types, so we get the type from a constructed instance."""
self._run("""\
import sys, xxhash
types = [xxhash.xxh32, xxhash.xxh64, xxhash.xxh3_64, xxhash.xxh3_128]
for t in types:
if sys.version_info >= (3, 12):
try:
t.newattr = 42
raise AssertionError('type should be immutable')
except TypeError:
pass
assert t.__flags__ & (1 << 8), 'immutable flag not set'
t = type(xxhash.xxh32())
"""Hash types are immutable inside subinterpreters (3.12+)."""
self._run("""\
import sys, xxhash
t = xxhash.xxh32

Comment on lines 355 to +356
import sys, xxhash
types = [xxhash.xxh32, xxhash.xxh64, xxhash.xxh3_64, xxhash.xxh3_128]
for t in types:
if sys.version_info >= (3, 12):
assert t.__flags__ & (1 << 8), 'immutable flag not set'
try:
t.newattr = 42
raise AssertionError('type should be immutable')
except TypeError:
pass
t = type(xxhash.xxh32())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since xxhash.xxh32 is a type object at runtime, we can reference it directly instead of calling it and getting its type.

Suggested change
import sys, xxhash
types = [xxhash.xxh32, xxhash.xxh64, xxhash.xxh3_64, xxhash.xxh3_128]
for t in types:
if sys.version_info >= (3, 12):
assert t.__flags__ & (1 << 8), 'immutable flag not set'
try:
t.newattr = 42
raise AssertionError('type should be immutable')
except TypeError:
pass
t = type(xxhash.xxh32())
import sys, xxhash
t = xxhash.xxh32

PyPy does not implement tp_vectorcall, so type_call falls back to
tp_new + tp_init.  XXHASH_new now determines the algorithm from the
type's tp_name and allocates a default-initialized object; XXHASH_init
then sets the real seed and processes the data.  On CPython 3.9+ this
path is normally bypassed by tp_vectorcall.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant