Optimize fingerprint hashing in preprocessing #818

Open
psinger-prior wants to merge 7 commits into main from psi/fingerprint-speedup

Conversation

@psinger-prior (Contributor)

Optimize fingerprint hashing in preprocessing

Summary

  • Optimizes AddFingerprintFeaturesStep._transform() by rounding the full feature matrix once upfront (np.around on the whole array) instead of rounding each row individually inside the per-row hash loop
  • Avoids a redundant second SHA-256 call per row when there are no hash collisions (the common case), by reusing the base hash directly
  • Extracts a helper _hash_row_bytes() that works on pre-rounded row bytes + salt bytes, removing repeated np.around and offset.to_bytes overhead from the hot loop

Performance

Aggregate speedup: 1.17x (geometric mean across 28 scenarios, 3 repeats).

The improvement scales with training set size since fingerprint hashing is O(n_train) per ensemble member:

Dataset size             Speedup
Small (≤500 train)         1.06x
Medium (1k-5k train)       1.14x
Large (10k-30k train)      1.32x
Very large (50k train)     1.16x

Top individual improvements: reg-5000x20 1.98x, reg-1000x50 1.63x, cls-10000x20 1.55x. The speedup is entirely in fit time; predict is unaffected.

What changed

The previous implementation called np.around(row, decimals=12) on every row inside the hash loop — redundantly re-rounding the same data for each of the n_train rows. For training data (non-test), it also computed two SHA-256 hashes per row: one for the base hash and one for the final hash, even though they are identical when add_to_hash == 0 (which is the case for all non-duplicate rows).

This PR rounds the matrix once before the loop and reuses the base hash when no collision offset is needed. Memory overhead is unchanged: only the rounded array is allocated (matching the original's cumulative per-row allocations).
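The restructured loop could be sketched as follows. This is an illustrative reconstruction from the description above, not the PR's actual code; the function name `fingerprint_rows`, the salt handling, and the 8-byte little-endian offset encoding are assumptions.

```python
import hashlib

import numpy as np


def fingerprint_rows(X: np.ndarray, salt: bytes, seen: set[str]) -> list[str]:
    # Round the full matrix once, instead of calling np.around on every
    # row inside the hash loop.
    X_rounded = np.around(X, decimals=12)
    hashes = []
    for row in X_rounded:
        row_bytes = row.tobytes()
        base = hashlib.sha256(row_bytes + salt).hexdigest()
        # Common case: no collision, so the base hash is reused directly
        # and the second SHA-256 call is skipped entirely.
        digest = base
        offset = 0
        while digest in seen:
            # Rare case: a duplicate row forces a collision offset.
            offset += 1
            digest = hashlib.sha256(
                row_bytes + salt + offset.to_bytes(8, "little")
            ).hexdigest()
        seen.add(digest)
        hashes.append(digest)
    return hashes
```

With no duplicate rows, each iteration performs exactly one SHA-256 computation, which is where the per-row saving comes from.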

Detailed benchmark results

3 repeats, 1 warmup, 4x NVIDIA RTX PRO 6000 Blackwell.

Scenario               Task   Train  Test Feat  Base(s)   Opt(s)    Delta  Speedup
---------------------------------------------------------------------------------------------------------
cls-100x10             cls      100   100   10    0.204    0.195   +0.009    1.05x
cls-500x10             cls      500   200   10    0.217    0.197   +0.019    1.10x
cls-1000x10            cls     1000   500   10    0.221    0.202   +0.018    1.09x
cls-5000x20            cls     5000  1000   20    0.339    0.326   +0.013    1.04x
cls-10000x20           cls    10000  2000   20    0.636    0.409   +0.227    1.55x
cls-30000x20           cls    30000  5000   20    1.809    1.302   +0.508    1.39x
cls-50000x20           cls    50000 10000   20    3.625    2.818   +0.806    1.29x
cls-1000x5             cls     1000   500    5    0.221    0.198   +0.023    1.12x
cls-1000x50            cls     1000   500   50    0.246    0.226   +0.020    1.09x
cls-1000x100           cls     1000   500  100    0.261    0.251   +0.011    1.04x
cls-1000x500           cls     1000   500  500    1.438    1.553   -0.116    0.93x
cls-1000x1000          cls     1000   500 1000    1.475    1.443   +0.031    1.02x
cls-10000x100          cls    10000  2000  100    1.390    1.196   +0.194    1.16x
cls-50000x100          cls    50000 10000  100   12.719   11.856   +0.863    1.07x
cls-1000x10-5c         cls     1000   500   10    0.228    0.193   +0.035    1.18x
cls-1000x10-10c        cls     1000   500   10    0.223    0.196   +0.027    1.14x
reg-100x10             reg      100   100   10    0.176    0.171   +0.005    1.03x
reg-1000x10            reg     1000   500   10    0.205    0.179   +0.026    1.14x
reg-5000x20            reg     5000  1000   20    0.652    0.329   +0.323    1.98x
reg-10000x20           reg    10000  2000   20    0.775    0.595   +0.180    1.30x
reg-30000x20           reg    30000  5000   20    2.721    1.899   +0.822    1.43x
reg-50000x20           reg    50000 10000   20    4.864    3.920   +0.943    1.24x
reg-1000x50            reg     1000   500   50    0.457    0.281   +0.176    1.63x
reg-1000x100           reg     1000   500  100    0.325    0.310   +0.015    1.05x
reg-1000x500           reg     1000   500  500    1.607    1.787   -0.180    0.90x
reg-1000x1000          reg     1000   500 1000    1.875    1.679   +0.196    1.12x
reg-10000x100          reg    10000  2000  100    2.292    2.013   +0.280    1.14x
reg-50000x100          reg    50000 10000  100   18.748   17.916   +0.832    1.05x

Aggregate (geomean):                              0.797    0.681   +0.116    1.17x

Fit vs Predict breakdown

The speedup is entirely in fit time (preprocessing). Predict times are unchanged.

Scenario                Base fit   Opt fit    Fit Δ  Base pred   Opt pred   Pred Δ
------------------------------------------------------------------------------------------
cls-100x10                 0.115     0.104   +0.011      0.094      0.092   +0.002
cls-500x10                 0.122     0.105   +0.016      0.095      0.092   +0.003
cls-1000x10                0.122     0.110   +0.012      0.098      0.093   +0.005
cls-5000x20                0.223     0.217   +0.007      0.111      0.108   +0.003
cls-10000x20               0.406     0.193   +0.213      0.230      0.216   +0.013
cls-30000x20               0.847     0.370   +0.477      0.961      0.930   +0.031
cls-50000x20               1.372     0.630   +0.742      2.253      2.188   +0.065
cls-1000x5                 0.122     0.104   +0.018      0.099      0.094   +0.005
cls-1000x50                0.141     0.124   +0.017      0.105      0.100   +0.005
cls-1000x100               0.167     0.143   +0.024      0.093      0.098   -0.004
cls-1000x500               1.058     1.089   -0.031      0.373      0.375   -0.002
cls-1000x1000              1.017     1.018   -0.002      0.392      0.391   +0.001
cls-10000x100              0.594     0.415   +0.179      0.794      0.781   +0.013
cls-50000x100              3.484     2.659   +0.824      9.241      9.184   +0.058
cls-1000x10-5c             0.131     0.100   +0.031      0.099      0.093   +0.006
cls-1000x10-10c            0.125     0.102   +0.023      0.098      0.094   +0.004
reg-100x10                 0.102     0.100   +0.002      0.074      0.072   +0.003
reg-1000x10                0.126     0.105   +0.021      0.079      0.074   +0.005
reg-5000x20                0.525     0.216   +0.309      0.120      0.113   +0.006
reg-10000x20               0.513     0.347   +0.166      0.262      0.249   +0.013
reg-30000x20               1.602     0.811   +0.791      1.117      1.088   +0.029
reg-50000x20               2.264     1.375   +0.889      2.601      2.552   +0.049
reg-1000x50                0.363     0.188   +0.175      0.093      0.089   +0.004
reg-1000x100               0.209     0.197   +0.012      0.116      0.113   +0.002
reg-1000x500               1.250     1.435   -0.186      0.357      0.352   +0.006
reg-1000x1000              1.494     1.260   +0.234      0.381      0.371   +0.010
reg-10000x100              1.308     1.037   +0.271      0.984      0.976   +0.009
reg-50000x100              7.518     6.702   +0.816     11.215     11.214   +0.000

Caveat

Because the matrix is rounded once upfront, a full rounded copy of the data is held in memory for the duration of the loop. For very large datasets this can increase peak memory consumption.

If this trade-off is not worth merging, we can keep the status quo.
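To put a rough number on the caveat: `np.around` returns a new array of the same dtype, so the overhead is one extra copy of the feature matrix. A quick estimate for the largest benchmark shape (50k × 100 float64):

```python
import numpy as np

# np.around allocates a fresh array; the extra memory is one full copy.
X = np.random.rand(50_000, 100)        # largest shape in the benchmarks above
X_rounded = np.around(X, decimals=12)  # second 50_000 x 100 float64 array
extra_mb = X_rounded.nbytes / 1e6
print(f"extra memory: {extra_mb:.0f} MB")  # 50_000 * 100 * 8 bytes = 40 MB
```

So even at the largest benchmarked size the extra allocation is tens of megabytes, though it would grow linearly for bigger matrices.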

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces some excellent performance optimizations to the fingerprint hashing logic. Rounding the feature matrix once upfront and avoiding redundant hash calculations for non-colliding rows are great improvements, as demonstrated by the detailed benchmarks.

I've left one suggestion to refactor a small piece of duplicated code in the collision handling logic to improve maintainability.

Additionally, with these changes, the _float_hash_arr function appears to be no longer used. You might consider removing it in a follow-up change to clean up the dead code.

Overall, this is a solid contribution that significantly speeds up a hot path in the preprocessing pipeline.

@psinger-prior psinger-prior marked this pull request as ready for review March 17, 2026 12:19
@psinger-prior psinger-prior requested a review from a team as a code owner March 17, 2026 12:19
@psinger-prior psinger-prior requested review from klemens-floege and removed request for a team March 17, 2026 12:19
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@klemens-floege (Contributor) left a comment

LGTM, thanks for adding!
