GPU: Warp-aggregated DP emission (+10–30% perf) & new .dat v1.6 format #7
fmg75 wants to merge 5 commits into RetiredC:main
Conversation
Hi, I get this error
when Do you have any idea why? With Can you give an example of a scenario where you got the 10–30% boost in performance? Thanks
What version of which do you use?
Syntax: ./build.sh <USE_JACOBIAN 0|1> <profile: release|debug>
./build.sh 86 1 release # RTX 3060 (SM 8.6), Jacobian ON
./rckangaroo -dp 14 -range 70 -start 100000000000 -pubkey 0290e6900a58d33393bc1097b5aed31f2e4e7cbd3e5466af958665bc0121248483 -tames tames_dp14_71.dat
To see a performance improvement, you should first generate the .dat file!
I'm using
I see it's finally complete?
Same issue.
It could be an incompatibility with your newer 12.8 CUDA chip.
❯ ./rckangaroo -dp 14 -range 67 -tames tames_67.dat -max 10
This software is free and open-source: https://github.com/RetiredC
TAMES GENERATION MODE
Solving point: Range 67 bits, DP 14, start...
I got the same issue with an RTX 4070 Ti, single GPU.
I'd like to know why. What version of CUDA do you have installed?
The bug is in the GPU kernel code for newer GPUs (when OldGpuMode = No). There's an illegal memory access in the implementation optimized for GPUs with large L2 caches. I made a small change to the code. It should work!
Please let me know whether this fix worked, and whether you can run the code on an RTX 40XX or 50XX.
The error is gone, but unfortunately I don't see any progress. I can let it run for hours, but nothing:
Same issue as @talebi
Description (English)
This PR introduces performance improvements and a new compact .dat format (v1.6) to RCKangaroo.
🚀 Improvements
Warp-aggregated atomics for DP emission
Reduced per-thread atomics to a single warp-level atomic, with coalesced writes.
→ Results: +10–30% performance boost, depending on GPU and -dp.
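To illustrate the idea behind warp-aggregated emission, here is a minimal host-side C++ model. It is not the PR's kernel code. The names DpBuffer and emit_warp_aggregated are hypothetical, and the loop over lanes stands in for 32 GPU threads running in lockstep. On the GPU, the mask would typically come from __ballot_sync, the counts from __popc, and the reservation from a single atomicAdd issued by one leader lane. This sketch only shows the index arithmetic: one atomic reserves a contiguous range for the whole warp, and each lane writes at its rank within the mask, which is what makes the writes contiguous.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kWarpSize = 32;

// Portable population count (plays the role of CUDA's __popc).
inline int popcount32(std::uint32_t v) {
    int n = 0;
    while (v) { v &= v - 1; ++n; }  // clear the lowest set bit each round
    return n;
}

// Hypothetical DP output buffer: a shared counter plus a slot array.
struct DpBuffer {
    std::atomic<std::uint32_t> count{0};
    std::vector<std::uint64_t> slots;
    std::uint32_t atomic_ops = 0;  // instrumentation for this sketch only
    explicit DpBuffer(std::size_t capacity) : slots(capacity) {}
};

// Emit the DPs found by one warp.
// Bit i of `mask` is set if lane i found a DP; values[i] is lane i's payload.
inline void emit_warp_aggregated(DpBuffer& buf, std::uint32_t mask,
                                 const std::uint64_t (&values)[kWarpSize]) {
    if (mask == 0) return;  // no lane has a DP: no atomic at all
    const std::uint32_t total = static_cast<std::uint32_t>(popcount32(mask));
    assert(total <= static_cast<std::uint32_t>(kWarpSize));

    // One atomic for the whole warp reserves `total` consecutive slots.
    const std::uint32_t base = buf.count.fetch_add(total);
    ++buf.atomic_ops;

    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (!(mask & (1u << lane))) continue;
        // Rank = number of emitting lanes below this one.
        const std::uint32_t lower = mask & ((1u << lane) - 1u);
        const std::uint32_t rank = static_cast<std::uint32_t>(popcount32(lower));
        if (base + rank < buf.slots.size()) {
            buf.slots[base + rank] = values[lane];  // adjacent slots per warp
        }
    }
}
```

With the mask 0x80000005 (lanes 0, 2 and 31 emitting), the model performs one reservation of three slots instead of three separate atomics, and the three payloads land in adjacent slots.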
New .dat format (v1.6)
DP record reduced from 32B → 28B.
X tail: 5 bytes (was 9).
Distance: 22 bytes.
Type: 1 byte.
New tag: TMBM16.
Backward compatible: can read both v1.5 and v1.6 .dat files.
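The byte counts above can be checked with a small C++ sketch. The struct names, the field order, and the helper has_v16_tag are assumptions made for illustration. The description gives only the field sizes and the TMBM16 tag, not the on-disk order or where the tag sits in the file. The v1.5 layout here also assumes that the distance and type fields kept their sizes, which is consistent with 9 + 22 + 1 = 32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// v1.6 DP record: 5 + 22 + 1 = 28 bytes (field order is an assumption).
struct DpRecordV16 {
    std::uint8_t x_tail[5];     // tail of the point's X coordinate
    std::uint8_t distance[22];  // kangaroo distance
    std::uint8_t type;          // record type
};

// v1.5 DP record: 9 + 22 + 1 = 32 bytes (same assumption on order).
struct DpRecordV15 {
    std::uint8_t x_tail[9];
    std::uint8_t distance[22];
    std::uint8_t type;
};

// Byte-sized members need no padding, so sizeof matches the stated sizes.
static_assert(sizeof(DpRecordV16) == 28, "v1.6 record must be 28 bytes");
static_assert(sizeof(DpRecordV15) == 32, "v1.5 record must be 32 bytes");

// Fraction of record bytes saved by v1.6: 1 - 28/32 = 0.125.
constexpr double kSizeSaving =
    1.0 - static_cast<double>(sizeof(DpRecordV16)) / sizeof(DpRecordV15);

// Hypothetical helper: true if a header buffer starts with the v1.6 tag.
inline bool has_v16_tag(const char* header, std::size_t len) {
    assert(header != nullptr);
    static const char kTag[] = "TMBM16";
    const std::size_t tag_len = sizeof(kTag) - 1;  // exclude the trailing NUL
    return len >= tag_len && std::memcmp(header, kTag, tag_len) == 0;
}

// Payload size in bytes for n v1.6 records (header not included).
inline std::size_t file_bytes_v16(std::size_t n) {
    return n * sizeof(DpRecordV16);
}
```

A reader that supports both versions could branch on a check like has_v16_tag and fall back to the 32-byte layout otherwise. How the PR actually detects the version is not shown in the description.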
Memory coalescing improvements for PCIe transfers.
Documentation updated (README.md and README_es.md):
Added “What’s New in v1.6” sections.
Benchmarks and build recommendations updated.
📊 Benchmarks (RTX 3060)
v1.5: ~750 MKey/s @ -dp 16
v1.6: ~870 MKey/s @ -dp 16
+16% throughput with ~12.5% smaller .dat files.