Optimize fingerprint hashing in preprocessing #818

Open
psinger-prior wants to merge 7 commits into main from psi/fingerprint-speedup

Conversation

@psinger-prior (Contributor)

Optimize fingerprint hashing in preprocessing

Summary

  • Optimizes AddFingerprintFeaturesStep._transform() by rounding the full feature matrix once upfront (np.around on the whole array) instead of rounding each row individually inside the per-row hash loop
  • Avoids a redundant second SHA-256 call per row when there are no hash collisions (the common case), by reusing the base hash directly
  • Extracts a helper _hash_row_bytes() that works on pre-rounded row bytes + salt bytes, removing repeated np.around and offset.to_bytes overhead from the hot loop

Performance

Aggregate speedup: 1.17x (geometric mean across 28 scenarios, 3 repeats).

The improvement scales with training set size since fingerprint hashing is O(n_train) per ensemble member:

Dataset size             Speedup
Small (≤500 train)         1.06x
Medium (1k-5k train)       1.14x
Large (10k-30k train)      1.32x
Very large (50k train)     1.16x

Top individual improvements: reg-5000x20 1.98x, reg-1000x50 1.63x, cls-10000x20 1.55x. The speedup is entirely in fit time; predict is unaffected.

What changed

The previous implementation called np.around(row, decimals=12) on every row inside the hash loop — redundantly re-rounding the same data for each of the n_train rows. For training data (non-test), it also computed two SHA-256 hashes per row: one for the base hash and one for the final hash, even though they are identical when add_to_hash == 0 (which is the case for all non-duplicate rows).

This PR rounds the matrix once before the loop and reuses the base hash when no collision offset is needed. Memory overhead is unchanged: only the rounded array is allocated (matching the original's cumulative per-row allocations).
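The restructured loop could be sketched as follows. This is an illustrative reconstruction from the description above, not the PR's actual code; the function name `fingerprint_rows`, the salt handling, and the 8-byte little-endian offset encoding are assumptions.

```python
import hashlib

import numpy as np


def fingerprint_rows(X: np.ndarray, salt: bytes, seen: set[str]) -> list[str]:
    # Round the full matrix once, instead of calling np.around on every
    # row inside the hash loop.
    X_rounded = np.around(X, decimals=12)
    hashes = []
    for row in X_rounded:
        row_bytes = row.tobytes()
        base = hashlib.sha256(row_bytes + salt).hexdigest()
        # Common case: no collision, so the base hash is reused directly
        # and the second SHA-256 call is skipped entirely.
        digest = base
        offset = 0
        while digest in seen:
            # Rare case: a duplicate row forces a collision offset.
            offset += 1
            digest = hashlib.sha256(
                row_bytes + salt + offset.to_bytes(8, "little")
            ).hexdigest()
        seen.add(digest)
        hashes.append(digest)
    return hashes
```

With no duplicate rows, each iteration performs exactly one SHA-256 computation, which is where the per-row saving comes from.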

Detailed benchmark results

3 repeats, 1 warmup, 4x NVIDIA RTX PRO 6000 Blackwell.

Scenario               Task   Train  Test Feat  Base(s)   Opt(s)    Delta  Speedup
---------------------------------------------------------------------------------------------------------
cls-100x10             cls      100   100   10    0.204    0.195   +0.009    1.05x
cls-500x10             cls      500   200   10    0.217    0.197   +0.019    1.10x
cls-1000x10            cls     1000   500   10    0.221    0.202   +0.018    1.09x
cls-5000x20            cls     5000  1000   20    0.339    0.326   +0.013    1.04x
cls-10000x20           cls    10000  2000   20    0.636    0.409   +0.227    1.55x
cls-30000x20           cls    30000  5000   20    1.809    1.302   +0.508    1.39x
cls-50000x20           cls    50000 10000   20    3.625    2.818   +0.806    1.29x
cls-1000x5             cls     1000   500    5    0.221    0.198   +0.023    1.12x
cls-1000x50            cls     1000   500   50    0.246    0.226   +0.020    1.09x
cls-1000x100           cls     1000   500  100    0.261    0.251   +0.011    1.04x
cls-1000x500           cls     1000   500  500    1.438    1.553   -0.116    0.93x
cls-1000x1000          cls     1000   500 1000    1.475    1.443   +0.031    1.02x
cls-10000x100          cls    10000  2000  100    1.390    1.196   +0.194    1.16x
cls-50000x100          cls    50000 10000  100   12.719   11.856   +0.863    1.07x
cls-1000x10-5c         cls     1000   500   10    0.228    0.193   +0.035    1.18x
cls-1000x10-10c        cls     1000   500   10    0.223    0.196   +0.027    1.14x
reg-100x10             reg      100   100   10    0.176    0.171   +0.005    1.03x
reg-1000x10            reg     1000   500   10    0.205    0.179   +0.026    1.14x
reg-5000x20            reg     5000  1000   20    0.652    0.329   +0.323    1.98x
reg-10000x20           reg    10000  2000   20    0.775    0.595   +0.180    1.30x
reg-30000x20           reg    30000  5000   20    2.721    1.899   +0.822    1.43x
reg-50000x20           reg    50000 10000   20    4.864    3.920   +0.943    1.24x
reg-1000x50            reg     1000   500   50    0.457    0.281   +0.176    1.63x
reg-1000x100           reg     1000   500  100    0.325    0.310   +0.015    1.05x
reg-1000x500           reg     1000   500  500    1.607    1.787   -0.180    0.90x
reg-1000x1000          reg     1000   500 1000    1.875    1.679   +0.196    1.12x
reg-10000x100          reg    10000  2000  100    2.292    2.013   +0.280    1.14x
reg-50000x100          reg    50000 10000  100   18.748   17.916   +0.832    1.05x

Aggregate (geomean):                              0.797    0.681   +0.116    1.17x

Fit vs Predict breakdown

The speedup is entirely in fit time (preprocessing). Predict times are unchanged.

Scenario                Base fit   Opt fit    Fit Δ  Base pred   Opt pred   Pred Δ
------------------------------------------------------------------------------------------
cls-100x10                 0.115     0.104   +0.011      0.094      0.092   +0.002
cls-500x10                 0.122     0.105   +0.016      0.095      0.092   +0.003
cls-1000x10                0.122     0.110   +0.012      0.098      0.093   +0.005
cls-5000x20                0.223     0.217   +0.007      0.111      0.108   +0.003
cls-10000x20               0.406     0.193   +0.213      0.230      0.216   +0.013
cls-30000x20               0.847     0.370   +0.477      0.961      0.930   +0.031
cls-50000x20               1.372     0.630   +0.742      2.253      2.188   +0.065
cls-1000x5                 0.122     0.104   +0.018      0.099      0.094   +0.005
cls-1000x50                0.141     0.124   +0.017      0.105      0.100   +0.005
cls-1000x100               0.167     0.143   +0.024      0.093      0.098   -0.004
cls-1000x500               1.058     1.089   -0.031      0.373      0.375   -0.002
cls-1000x1000              1.017     1.018   -0.002      0.392      0.391   +0.001
cls-10000x100              0.594     0.415   +0.179      0.794      0.781   +0.013
cls-50000x100              3.484     2.659   +0.824      9.241      9.184   +0.058
cls-1000x10-5c             0.131     0.100   +0.031      0.099      0.093   +0.006
cls-1000x10-10c            0.125     0.102   +0.023      0.098      0.094   +0.004
reg-100x10                 0.102     0.100   +0.002      0.074      0.072   +0.003
reg-1000x10                0.126     0.105   +0.021      0.079      0.074   +0.005
reg-5000x20                0.525     0.216   +0.309      0.120      0.113   +0.006
reg-10000x20               0.513     0.347   +0.166      0.262      0.249   +0.013
reg-30000x20               1.602     0.811   +0.791      1.117      1.088   +0.029
reg-50000x20               2.264     1.375   +0.889      2.601      2.552   +0.049
reg-1000x50                0.363     0.188   +0.175      0.093      0.089   +0.004
reg-1000x100               0.209     0.197   +0.012      0.116      0.113   +0.002
reg-1000x500               1.250     1.435   -0.186      0.357      0.352   +0.006
reg-1000x1000              1.494     1.260   +0.234      0.381      0.371   +0.010
reg-10000x100              1.308     1.037   +0.271      0.984      0.976   +0.009
reg-50000x100              7.518     6.702   +0.816     11.215     11.214   +0.000

Caveat

Because the matrix is rounded once upfront, a full rounded copy of the data is held in memory for the duration of the loop. For very large datasets this can increase peak memory consumption.

If this trade-off is not worth merging, we can keep the status quo.
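To put a rough number on the caveat: `np.around` returns a new array of the same dtype, so the overhead is one extra copy of the feature matrix. A quick estimate for the largest benchmark shape (50k × 100 float64):

```python
import numpy as np

# np.around allocates a fresh array; the extra memory is one full copy.
X = np.random.rand(50_000, 100)        # largest shape in the benchmarks above
X_rounded = np.around(X, decimals=12)  # second 50_000 x 100 float64 array
extra_mb = X_rounded.nbytes / 1e6
print(f"extra memory: {extra_mb:.0f} MB")  # 50_000 * 100 * 8 bytes = 40 MB
```

So even at the largest benchmarked size the extra allocation is tens of megabytes, though it would grow linearly for bigger matrices.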

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces some excellent performance optimizations to the fingerprint hashing logic. Rounding the feature matrix once upfront and avoiding redundant hash calculations for non-colliding rows are great improvements, as demonstrated by the detailed benchmarks.

I've left one suggestion to refactor a small piece of duplicated code in the collision handling logic to improve maintainability.

Additionally, with these changes, the _float_hash_arr function appears to be no longer used. You might consider removing it in a follow-up change to clean up the dead code.

Overall, this is a solid contribution that significantly speeds up a hot path in the preprocessing pipeline.

@psinger-prior psinger-prior marked this pull request as ready for review March 17, 2026 12:19
@psinger-prior psinger-prior requested a review from a team as a code owner March 17, 2026 12:19
@psinger-prior psinger-prior requested review from klemens-floege and removed request for a team March 17, 2026 12:19
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@klemens-floege (Contributor) left a comment

LGTM, thanks for adding!
