Optimize fingerprint hashing in preprocessing#818
psinger-prior wants to merge 7 commits into main from psi/fingerprint-speedup
Conversation
Code Review
This pull request introduces some excellent performance optimizations to the fingerprint hashing logic. Rounding the feature matrix once upfront and avoiding redundant hash calculations for non-colliding rows are great improvements, as demonstrated by the detailed benchmarks.
I've left one suggestion to refactor a small piece of duplicated code in the collision handling logic to improve maintainability.
Additionally, with these changes, the _float_hash_arr function appears to be no longer used. You might consider removing it in a follow-up change to clean up the dead code.
Overall, this is a solid contribution that significantly speeds up a hot path in the preprocessing pipeline.
src/tabpfn/preprocessing/steps/add_fingerprint_features_step.py
klemens-floege
left a comment
LGTM, thanks for adding!
Optimize fingerprint hashing in preprocessing
Summary
Speeds up `AddFingerprintFeaturesStep._transform()`:

- Rounds the full feature matrix once upfront (`np.around` on the whole array) instead of rounding each row individually inside the per-row hash loop.
- Adds a `_hash_row_bytes()` helper that works on pre-rounded row bytes + salt bytes, removing repeated `np.around` and `offset.to_bytes` overhead from the hot loop.

Performance
Aggregate speedup: 1.17x (geometric mean across 28 scenarios, 3 repeats).
The improvement scales with training set size since fingerprint hashing is O(n_train) per ensemble member:
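The cost difference between per-row and whole-array rounding is easy to confirm in isolation. This is a standalone micro-benchmark, not TabPFN code:

```python
import timeit

import numpy as np

X = np.random.default_rng(0).random((10_000, 20))

# 10,000 small np.around calls (one per row, as in the old hot loop)
per_row = timeit.timeit(
    lambda: [np.around(row, decimals=12) for row in X], number=10
)

# One vectorized np.around call over the whole matrix (the new approach)
once = timeit.timeit(lambda: np.around(X, decimals=12), number=10)

print(once < per_row)  # True: one vectorized call beats 10k small calls
```
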
Top individual improvements:
cls-10000x20: 1.55x, reg-5000x20: 1.98x, reg-30000x20: 1.43x. The speedup is entirely in fit time; predict is unaffected.

What changed
The previous implementation called `np.around(row, decimals=12)` on every row inside the hash loop, redundantly re-rounding the same data for each of the `n_train` rows. For training data (non-test), it also computed two SHA-256 hashes per row: one for the base hash and one for the final hash, even though these are identical when `add_to_hash == 0` (which is the case for all non-duplicate rows).

This PR rounds the matrix once before the loop and reuses the base hash when no collision offset is needed. Memory overhead is unchanged: only the rounded array is allocated (matching the original's cumulative per-row allocations).
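The pattern can be sketched roughly as follows. Note that `fingerprints`, the hash-to-float mapping, and the `_hash_row_bytes` signature here are illustrative assumptions, not the actual TabPFN implementation:

```python
import hashlib

import numpy as np


def _hash_row_bytes(row_bytes: bytes, salt_bytes: bytes) -> float:
    # Map SHA-256 of (row bytes + salt) to a float in [0, 1).
    digest = hashlib.sha256(row_bytes + salt_bytes).digest()
    return int.from_bytes(digest[:8], "little") / 2**64


def fingerprints(X: np.ndarray) -> np.ndarray:
    # New pattern: round the whole matrix once, outside the loop.
    X_rounded = np.around(X, decimals=12)
    seen: set = set()
    out = np.empty(len(X_rounded))
    for i, row in enumerate(X_rounded):
        row_bytes = row.tobytes()
        # Base hash (offset 0) is computed exactly once per row.
        h = _hash_row_bytes(row_bytes, (0).to_bytes(8, "little"))
        offset = 0
        while h in seen:  # only recompute on (rare) duplicate rows
            offset += 1
            h = _hash_row_bytes(row_bytes, offset.to_bytes(8, "little"))
        seen.add(h)
        out[i] = h
    return out
```

The old code instead called `np.around(row, ...)` and a second `hashlib.sha256` inside the loop body even when `offset` stayed at 0, which is the redundancy this PR removes.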
Detailed benchmark results
3 repeats, 1 warmup, 4x NVIDIA RTX PRO 6000 Blackwell.
Fit vs Predict breakdown
The speedup is entirely in fit time (preprocessing). Predict times are unchanged.
Caveat
Because the matrix is now rounded once upfront, a full copy of the data is allocated at one time. For very large datasets this can increase peak memory consumption.
If this trade-off is not deemed worth merging, we can keep the status quo.
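The size of that one-off copy is straightforward to quantify, since `np.around` allocates a new array with the same shape and dtype as its input (shapes here chosen to match the reg-30000x20 benchmark scenario):

```python
import numpy as np

# np.around returns a new array of identical shape/dtype, so the upfront
# copy costs exactly X.nbytes of extra peak memory.
X = np.zeros((30_000, 20), dtype=np.float64)
X_rounded = np.around(X, decimals=12)
print(X_rounded.nbytes)  # 30_000 * 20 * 8 = 4_800_000 bytes (~4.6 MiB)
```
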