GPU: Warp-aggregated DP emission (+10–30% perf) & new .dat v1.6 format #7

Open
fmg75 wants to merge 5 commits into RetiredC:main from fmg75:feature/dat-v16-gpu-optimizations

fmg75 commented Aug 22, 2025

Description (English)

This PR introduces performance improvements and a new compact .dat format (v1.6) to RCKangaroo.

🚀 Improvements

Warp-aggregated atomics for DP emission
Replaces per-thread atomics with a single warp-level atomic, followed by coalesced writes.
→ Result: +10–30% performance, depending on the GPU and the -dp value.

New .dat format (v1.6)

DP record reduced from 32B → 28B.

X tail: 5 bytes (was 9).

Distance: 22 bytes.

Type: 1 byte.

New tag: TMBM16.

Backward compatible: can read both v1.5 and v1.6 .dat files.

Memory coalescing improvements for PCIe transfers.

Documentation updated (README.md and README_es.md):

Added “What’s New in v1.6” sections.

Benchmarks and build recommendations updated.
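
The warp-aggregated emission described above can be modelled on the CPU. The following is a minimal sketch, not the kernel from this PR: `DpBuffer`, `emit_warp_aggregated` and `popcount32` are hypothetical names, and the loops stand in for what a CUDA kernel would do with `__ballot_sync`, `__popc` and `__shfl_sync`. It shows one atomic add per warp reserving a contiguous range, with each lane writing at its rank inside that range.

```cpp
#include <array>
#include <atomic>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kWarpSize = 32;  // width of a CUDA warp

// Hypothetical DP output buffer: one shared counter plus a slot array.
struct DpBuffer {
    std::atomic<uint32_t> count{0};
    std::vector<uint32_t> slots;
    uint32_t atomic_ops = 0;  // number of atomic adds issued, for illustration
    explicit DpBuffer(std::size_t capacity) : slots(capacity) {}
};

// Number of set bits (the role __popc plays on the GPU).
inline uint32_t popcount32(uint32_t x) {
    return static_cast<uint32_t>(std::bitset<32>(x).count());
}

// CPU model of one warp emitting its distinguished points.
// has_dp[lane] tells whether that lane found a DP; value[lane] is its payload.
void emit_warp_aggregated(DpBuffer& buf,
                          const std::array<bool, kWarpSize>& has_dp,
                          const std::array<uint32_t, kWarpSize>& value) {
    // Step 1: build the ballot mask, one bit per lane that has a DP.
    uint32_t mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (has_dp[lane]) mask |= (1u << lane);
    }
    if (mask == 0) return;  // nothing to emit, no atomic at all

    // Step 2: a single atomic add reserves slots for the whole warp.
    const uint32_t base = buf.count.fetch_add(popcount32(mask));
    ++buf.atomic_ops;

    // Step 3: each lane writes at base + (number of emitting lanes below it),
    // so the writes land in consecutive slots.
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (!has_dp[lane]) continue;
        const uint32_t rank = popcount32(mask & ((1u << lane) - 1u));
        assert(base + rank < buf.slots.size());
        buf.slots[base + rank] = value[lane];
    }
}
```

With three emitting lanes, this model issues one atomic add instead of three and fills three adjacent slots, which is the access pattern that makes the writes coalesce.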

📊 Benchmarks (RTX 3060)

v1.5: ~750 MKey/s @ -dp 16

v1.6: ~870 MKey/s @ -dp 16

+16% throughput with ~12.5% smaller .dat files.
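
The ~12.5% figure follows from the record sizes listed above (32 B → 28 B). As a sketch only: the structs below use the stated field widths, but the field order, the struct names and the helper `size_reduction_percent` are assumptions, not the layout from the PR's source.

```cpp
#include <cstddef>
#include <cstdint>

// v1.5 record as described: 9-byte X tail, 22-byte distance, 1-byte type.
// Field order is an assumption made for illustration.
struct DpRecordV15 {
    uint8_t x_tail[9];
    uint8_t distance[22];
    uint8_t type;
};

// v1.6 record as described: X tail shortened to 5 bytes.
struct DpRecordV16 {
    uint8_t x_tail[5];
    uint8_t distance[22];
    uint8_t type;
};

// Byte arrays carry no alignment padding, so the sizes match the sums.
static_assert(sizeof(DpRecordV15) == 32, "v1.5 record should be 32 bytes");
static_assert(sizeof(DpRecordV16) == 28, "v1.6 record should be 28 bytes");

// Per-record size reduction in percent: (32 - 28) / 32 * 100.
constexpr double size_reduction_percent() {
    return 100.0 * (sizeof(DpRecordV15) - sizeof(DpRecordV16)) /
           sizeof(DpRecordV15);
}
```

The per-record saving is exactly 12.5%; a whole file would differ slightly from that because of whatever header the format carries.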

@FreedomLabsIO

Hi, I get this error

GPU 0, CallGpuKernel failed: an illegal memory access was encountered

when USE_JACOBIAN=1

Do you have any idea why?

make clean; make SM=89 USE_JACOBIAN=1 PROFILE=release -j

rm -f RCKangaroo.o GpuKang.o Ec.o utils.o ./RCGpuCore.o rckangaroo
g++-9 -std=c++17 -I/usr/local/cuda-12.8/include -O3 -DNDEBUG -ffunction-sections -fdata-sections -DUSE_JACOBIAN=1 -c RCKangaroo.cpp -o RCKangaroo.o
g++-9 -std=c++17 -I/usr/local/cuda-12.8/include -O3 -DNDEBUG -ffunction-sections -fdata-sections -DUSE_JACOBIAN=1 -c GpuKang.cpp -o GpuKang.o
g++-9 -std=c++17 -I/usr/local/cuda-12.8/include -O3 -DNDEBUG -ffunction-sections -fdata-sections -DUSE_JACOBIAN=1 -c Ec.cpp -o Ec.o
g++-9 -std=c++17 -I/usr/local/cuda-12.8/include -O3 -DNDEBUG -ffunction-sections -fdata-sections -DUSE_JACOBIAN=1 -c utils.cpp -o utils.o
/usr/local/cuda-12.8/bin/nvcc -std=c++17 -arch=sm_89 -O3 -Xptxas -O3 -Xptxas -dlcm=ca -Xfatbin=-compress-all -DUSE_JACOBIAN=1 -Xcompiler -ffunction-sections -Xcompiler -fdata-sections -c RCGpuCore.cu -o RCGpuCore.o
g++-9 -std=c++17 -I/usr/local/cuda-12.8/include -O3 -DNDEBUG -ffunction-sections -fdata-sections -DUSE_JACOBIAN=1 -o rckangaroo RCKangaroo.o GpuKang.o Ec.o utils.o ./RCGpuCore.o -L/usr/local/cuda-12.8/lib64 -lcudart -pthread
./rckangaroo -start 1000000000000000000000 -range 84 -dp 18 -pubkey 0329c4574a4fd8c810b7e42a4b398882b381bcd85e40c6883712912d167c83e73a
********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 6, CUDA driver/runtime: 12.8/12.8
GPU 0: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 1, L2 size: 65536 KB
GPU 1: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 3, L2 size: 65536 KB
GPU 2: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 6, L2 size: 65536 KB
GPU 3: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 7, L2 size: 65536 KB
GPU 4: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 8, L2 size: 65536 KB
GPU 5: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 9, L2 size: 65536 KB
Total GPUs for work: 6

MAIN MODE

Solving public key
X: 29C4574A4FD8C810B7E42A4B398882B381BCD85E40C6883712912D167C83E73A
Y: 0E02C3AFD79913AB0961C95F12498F36A72FFA35C93AF27CEE30010FA6B51C53
Offset: 0000000000000000000000000000000000000000001000000000000000000000

Solving point: Range 84 bits, DP 18, start...
SOTA method, estimated ops: 2^42.202, RAM for DPs: 0.906 GB. DP and GPU overheads not included!
Estimated DPs per kangaroo: 6.542.
GPU 0: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 1: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 2: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 3: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 4: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 5: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPUs started...
GPU 2, CallGpuKernel failed: an illegal memory access was encountered
GPU 1, CallGpuKernel failed: an illegal memory access was encountered
GPU 0, CallGpuKernel failed: an illegal memory access was encountered
GPU 5, CallGpuKernel failed: an illegal memory access was encountered
GPU 4, CallGpuKernel failed: an illegal memory access was encountered
GPU 3, CallGpuKernel failed: an illegal memory access was encountered

With USE_JACOBIAN=0 it runs, but performance is slightly worse than with the original code.

Can you give an example of a scenario where you got the 10~30% boost in performance?

Thanks

fmg75 (Author) commented Nov 12, 2025

Which CUDA and compiler versions are you using?
Try compiling with:

Syntax: ./build.sh <SM> <USE_JACOBIAN 0|1> <profile: release|debug>

./build.sh 86 1 release # RTX 3060 (SM 8.6), Jacobian ON
./build.sh 86 0 release # Jacobian OFF (affine) for A/B

./rckangaroo -dp 14 -range 70 -start 100000000000 -pubkey 0290e6900a58d33393bc1097b5aed31f2e4e7cbd3e5466af958665bc0121248483 -tames tames_dp14_71.dat

To see a performance improvement, you should first generate the .dat file!

@FreedomLabsIO

./build.sh 89 1 release
== CCFLAGS:   -std=c++17 -I/usr/local/cuda-12.8/include -O3 -DNDEBUG -ffunction-sections -fdata-sections -DUSE_JACOBIAN=1
== NVCCFLAGS: -std=c++17 -arch=sm_89 -O3 -Xptxas -O3 -Xptxas -dlcm=ca -Xfatbin=-compress-all -DUSE_JACOBIAN=1 -Xcompiler -ffunction-sections -Xcompiler -fdata-sections
== Listo: ./rckangaroo
./rckangaroo -dp 14 -range 70 -start 100000000000 -pubkey 0290e6900a58d33393bc1097b5aed31f2e4e7cbd3e5466af958665bc0121248483 -tames tames_dp14_71.dat -max 10
********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 6, CUDA driver/runtime: 12.8/12.8
GPU 0: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 1, L2 size: 65536 KB
GPU 1: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 3, L2 size: 65536 KB
GPU 2: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 6, L2 size: 65536 KB
GPU 3: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 7, L2 size: 65536 KB
GPU 4: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 8, L2 size: 65536 KB
GPU 5: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 9, L2 size: 65536 KB
Total GPUs for work: 6

TAMES GENERATION MODE

Solving point: Range 70 bits, DP 14, start...
SOTA method, estimated ops: 2^35.202, RAM for DPs: 0.277 GB. DP and GPU overheads not included!
Max allowed number of ops: 2^38.524, max RAM for DPs: 1.086 GB
Estimated DPs per kangaroo: 0.818. DP overhead is big, use less DP value if possible!
GPU 0: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 1: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 2: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 3: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 4: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 5: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPUs started...
GPU 0, CallGpuKernel failed: an illegal memory access was encountered
GPU 1, CallGpuKernel failed: an illegal memory access was encountered
GPU 4, CallGpuKernel failed: an illegal memory access was encountered
GPU 2, CallGpuKernel failed: an illegal memory access was encountered
GPU 5, CallGpuKernel failed: an illegal memory access was encountered
GPU 3, CallGpuKernel failed: an illegal memory access was encountered

I'm using CUDA 12.8 and g++-9 (tried with g++-13 too).

fmg75 (Author) commented Nov 12, 2025

I see it's finally complete?
This looks suspicious; it may be a conflict between the multiple GPUs.
Could you try with a single GPU, for example -gpu 0?

@FreedomLabsIO

Same issue.

./rckangaroo -dp 16 -range 70 -start 100000000000 -pubkey 0290e6900a58d33393bc1097b5aed31f2e4e7cbd3e5466af958665bc0121248483 -tames tames_dp14_71.dat -max 10 -gpu 0

********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 6, CUDA driver/runtime: 12.8/12.8
GPU 0: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 1, L2 size: 65536 KB
Total GPUs for work: 1

TAMES GENERATION MODE

Solving point: Range 70 bits, DP 16, start...
SOTA method, estimated ops: 2^35.202, RAM for DPs: 0.210 GB. DP and GPU overheads not included!
Max allowed number of ops: 2^38.524, max RAM for DPs: 0.412 GB
Estimated DPs per kangaroo: 1.227. DP overhead is big, use less DP value if possible!
GPU 0: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPUs started...
GPU 0, CallGpuKernel failed: an illegal memory access was encountered

fmg75 (Author) commented Nov 12, 2025

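It could be an incompatibility with the newer CUDA 12.8 toolkit you are using. For comparison, this run completes on my machine (CUDA driver/runtime 12.4/12.0, RTX 3060 Laptop GPU):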

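./rckangaroo -dp 14 -range 67 -tames tames_67.dat -max 10
********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************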

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 1, CUDA driver/runtime: 12.4/12.0
GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU, 5.79 GB, 30 CUs, cap 8.6, PCI 1, L2 size: 3072 KB
Total GPUs for work: 1

TAMES GENERATION MODE

Solving point: Range 67 bits, DP 14, start...
SOTA method, estimated ops: 2^33.702, RAM for DPs: 0.219 GB. DP and GPU overheads not included!
Max allowed number of ops: 2^37.024, max RAM for DPs: 0.505 GB
Estimated DPs per kangaroo: 0.867. DP overhead is big, use less DP value if possible!
GPU 0: allocated 2899 MB, 983040 kangaroos. OldGpuMode: Yes
GPUs started...
GEN: Speed: 343 MKeys/s, Err: 0, DPs: 419K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 799 MKeys/s, Err: 0, DPs: 958K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 808 MKeys/s, Err: 0, DPs: 1438K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 802 MKeys/s, Err: 0, DPs: 1918K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 795 MKeys/s, Err: 0, DPs: 2399K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 792 MKeys/s, Err: 0, DPs: 2879K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 798 MKeys/s, Err: 0, DPs: 3358K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 804 MKeys/s, Err: 0, DPs: 3839K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 803 MKeys/s, Err: 0, DPs: 4379K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 802 MKeys/s, Err: 0, DPs: 4860K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 801 MKeys/s, Err: 0, DPs: 5340K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 800 MKeys/s, Err: 0, DPs: 5820K/852K, Time: 0d:00h:02m/0d:00h:00m
GEN: Speed: 799 MKeys/s, Err: 0, DPs: 6300K/852K, Time: 0d:00h:02m/0d:00h:00m
GEN: Speed: 798 MKeys/s, Err: 0, DPs: 6780K/852K, Time: 0d:00h:02m/0d:00h:00m
GEN: Speed: 796 MKeys/s, Err: 0, DPs: 7321K/852K, Time: 0d:00h:02m/0d:00h:00m
GEN: Speed: 794 MKeys/s, Err: 0, DPs: 7800K/852K, Time: 0d:00h:02m/0d:00h:00m
GEN: Speed: 793 MKeys/s, Err: 0, DPs: 8280K/852K, Time: 0d:00h:02m/0d:00h:00m
Operations limit reached
Stopping work ...
saving tames...
tames saved

talebi commented Nov 17, 2025

I get the same issue with an RTX 4070 Ti, single GPU.

fmg75 (Author) commented Nov 18, 2025

I'd like to know why. What version of CUDA do you have installed?
It would be great if you could test the performance of modular inversion with the Montgomery trick and Jacobian coordinates. It really does affect performance. Could you try using CUDA driver 12.4/12.0?
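
For readers unfamiliar with the Montgomery trick mentioned here: it replaces n modular inversions with one inversion plus about 3n multiplications. The sketch below is illustrative only and is not taken from RCKangaroo. It works over the toy prime 2^31 - 1 rather than the secp256k1 field, and `kP`, `mul_mod`, `pow_mod`, `inv_mod` and `batch_inverse` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy prime modulus (2^31 - 1), chosen so products fit in 64 bits.
// The real code works in the 256-bit secp256k1 field instead.
constexpr uint64_t kP = 2147483647ULL;

inline uint64_t mul_mod(uint64_t a, uint64_t b) { return (a * b) % kP; }

// Square-and-multiply exponentiation modulo kP.
inline uint64_t pow_mod(uint64_t base, uint64_t exp) {
    uint64_t result = 1;
    base %= kP;
    while (exp > 0) {
        if (exp & 1) result = mul_mod(result, base);
        base = mul_mod(base, base);
        exp >>= 1;
    }
    return result;
}

// Single inversion via Fermat's little theorem: a^(p-2) mod p.
inline uint64_t inv_mod(uint64_t a) { return pow_mod(a, kP - 2); }

// Montgomery's trick: invert every element with only one call to inv_mod.
// Inputs must be nonzero and already reduced below kP.
std::vector<uint64_t> batch_inverse(const std::vector<uint64_t>& a) {
    const std::size_t n = a.size();
    std::vector<uint64_t> prefix(n), out(n);

    // prefix[i] = a[0] * ... * a[i-1]; acc ends as the product of all inputs.
    uint64_t acc = 1;
    for (std::size_t i = 0; i < n; ++i) {
        assert(a[i] > 0 && a[i] < kP);
        prefix[i] = acc;
        acc = mul_mod(acc, a[i]);
    }

    // One expensive inversion of the full product.
    uint64_t inv = inv_mod(acc);

    // Walk backwards: peel off one factor per step.
    for (std::size_t i = n; i-- > 0;) {
        out[i] = mul_mod(inv, prefix[i]);  // equals a[i]^-1
        inv = mul_mod(inv, a[i]);          // now inverse of a[0..i-1] product
    }
    return out;
}
```

Because the inversion dominates the cost of each point addition in affine coordinates, the batch size over which it is amortised is one of the knobs that can shift throughput between builds.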

talebi commented Nov 18, 2025

./rckangaroo -dp 14 -range 67 -tames tames_67.dat -max 10
********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 1, CUDA driver/runtime: 13.0/12.1
GPU 0: NVIDIA GeForce RTX 4070 Ti, 11.99 GB, 60 CUs, cap 8.9, PCI 1, L2 size: 49152 KB
Total GPUs for work: 1

TAMES GENERATION MODE

Solving point: Range 67 bits, DP 14, start...
SOTA method, estimated ops: 2^33.702, RAM for DPs: 0.219 GB. DP and GPU overheads not included!
Max allowed number of ops: 2^37.024, max RAM for DPs: 0.505 GB
Estimated DPs per kangaroo: 2.313. DP overhead is big, use less DP value if possible!
GPU 0: allocated 1128 MB, 368640 kangaroos. OldGpuMode: No
GPUs started...
GPU 0, CallGpuKernel failed: an illegal memory access was encountered
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:01m/0d:00h:00m

fmg75 (Author) commented Nov 18, 2025

The bug is in the GPU kernel code for newer GPUs (when OldGpuMode = No). There's an illegal memory access in the implementation optimized for GPUs with large L2 caches.

I made a small change to the code. It should work!

fmg75 (Author) commented Nov 18, 2025

Please let me know whether this fix works, and whether you can run the code on an RTX 40XX or 50XX.

talebi commented Nov 19, 2025

The error is gone, but unfortunately I don't see any progress. I can let it run for hours and nothing happens:

root@88a1b4e6d576:/RCKangaroo# ./rckangaroo -dp 14 -range 67 -tames tames_67.dat -max 10
********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 1, CUDA driver/runtime: 13.0/12.1
GPU 0: NVIDIA GeForce RTX 4070 Ti, 11.99 GB, 60 CUs, cap 8.9, PCI 1, L2 size: 49152 KB
Total GPUs for work: 1

TAMES GENERATION MODE

Solving point: Range 67 bits, DP 14, start...
SOTA method, estimated ops: 2^33.702, RAM for DPs: 0.219 GB. DP and GPU overheads not included!
Max allowed number of ops: 2^37.024, max RAM for DPs: 0.505 GB
Estimated DPs per kangaroo: 0.434. DP overhead is big, use less DP value if possible!
GPU 0: allocated 5787 MB, 1966080 kangaroos. OldGpuMode: Yes
GPUs started...
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:01m/0d:00h:00m

@FreedomLabsIO

Same issue as @talebi

The "an illegal memory access was encountered" error is gone, but the DP count does not increase.

./rckangaroo -range 71 -dp 16 -start 0 -tames tames71_v15.dat -max 10 -pubkey 0290e6900a58d33393bc1097b5aed31f2e4e7cbd3e5466af958665bc0121248483

********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 1, CUDA driver/runtime: 13.0/12.8
GPU 0: NVIDIA GeForce RTX 5070 Ti, 15.47 GB, 70 CUs, cap 12.0, PCI 1, L2 size: 49152 KB
Total GPUs for work: 1

TAMES GENERATION MODE

Solving point: Range 71 bits, DP 16, start...
SOTA method, estimated ops: 2^35.702, RAM for DPs: 0.219 GB. DP and GPU overheads not included!
Max allowed number of ops: 2^39.024, max RAM for DPs: 0.505 GB
Estimated DPs per kangaroo: 0.372. DP overhead is big, use less DP value if possible!
GPU 0: allocated 6749 MB, 2293760 kangaroos. OldGpuMode: Yes
GPUs started...
GEN: Speed: 2293760 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
