GPU: Warp-aggregated DP emission (+10–30% perf) & new .dat v1.6 format by fmg75 · Pull Request #7 · RetiredC/RCKangaroo

fmg75 · 2025-08-22T20:21:08Z

Description (English)

This PR introduces performance improvements and a new compact .dat format (v1.6) to RCKangaroo.

🚀 Improvements

Warp-aggregated atomics for DP emission
Reduced per-thread atomics to a single warp-level atomic, with coalesced writes.
→ Results: +10–30% performance boost, depending on GPU and -dp.
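As a rough illustration of the idea, the following host-side C++ sketch emulates how a 32-lane warp could reserve output slots with a single atomic instead of one atomic per thread. It is not the PR's kernel code: `emit_warp`, `popcount32`, and the `uint32_t` payload are hypothetical names and stand-ins, and on a GPU the mask and rank steps would presumably be done with warp intrinsics rather than loops.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kWarpSize = 32;

// Portable population count for a 32-bit lane mask.
inline std::uint32_t popcount32(std::uint32_t x) {
    std::uint32_t n = 0;
    while (x) { x &= x - 1; ++n; }
    return n;
}

// Emulates one warp emitting distinguished points (DPs).
// has_dp[lane] tells whether that lane found a DP; value[lane] is a
// stand-in payload for the DP record. Returns the number of records written.
std::uint32_t emit_warp(const bool (&has_dp)[kWarpSize],
                        const std::uint32_t (&value)[kWarpSize],
                        std::atomic<std::uint32_t>& counter,
                        std::vector<std::uint32_t>& out) {
    // Step 1: build the mask of lanes that have something to emit.
    std::uint32_t mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane)
        if (has_dp[lane]) mask |= (1u << lane);
    if (mask == 0) return 0;

    // Step 2: ONE atomic for the whole warp reserves a contiguous range.
    const std::uint32_t total = popcount32(mask);
    const std::uint32_t base = counter.fetch_add(total);
    assert(base + total <= out.size());

    // Step 3: each emitting lane writes at base + rank, where rank is the
    // number of emitting lanes with a lower lane index. Consecutive slots
    // are what makes the writes coalesce.
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (!has_dp[lane]) continue;
        const std::uint32_t lower = mask & ((1u << lane) - 1u);
        out[base + popcount32(lower)] = value[lane];
    }
    return total;
}
```

With lanes 3, 10, and 31 emitting and the counter at 5, the warp performs one `fetch_add(3)` and writes to slots 5, 6, and 7.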

New .dat format (v1.6)

DP record reduced from 32B → 28B.

X tail: 5 bytes (was 9).

Distance: 22 bytes.

Type: 1 byte.

New tag: TMBM16.

Backward compatible: can read both v1.5 and v1.6 .dat files.
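The byte counts listed above add up as 5 + 22 + 1 = 28 for v1.6 and 9 + 22 + 1 = 32 for v1.5. A minimal C++ sketch of that arithmetic and of packing one v1.6 record follows. The field order (X tail, then distance, then type) and the names `pack_v16` and `record_size` are assumptions for illustration; the description gives only the field sizes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Field sizes as listed in the PR description.
constexpr std::size_t kXTailV15  = 9;
constexpr std::size_t kXTailV16  = 5;
constexpr std::size_t kDistBytes = 22;
constexpr std::size_t kTypeBytes = 1;

constexpr std::size_t kRecordV15 = kXTailV15 + kDistBytes + kTypeBytes;
constexpr std::size_t kRecordV16 = kXTailV16 + kDistBytes + kTypeBytes;
static_assert(kRecordV15 == 32, "v1.5 record is 32 bytes");
static_assert(kRecordV16 == 28, "v1.6 record is 28 bytes");

// Hypothetical helper: record stride a reader would use per format version.
constexpr std::size_t record_size(bool is_v16) {
    return is_v16 ? kRecordV16 : kRecordV15;
}

// Packs one v1.6 record. ASSUMED field order: X tail, distance, type.
std::array<std::uint8_t, kRecordV16> pack_v16(const std::uint8_t* x_tail,
                                              const std::uint8_t* dist,
                                              std::uint8_t type) {
    assert(x_tail != nullptr && dist != nullptr);
    std::array<std::uint8_t, kRecordV16> rec{};
    std::memcpy(rec.data(), x_tail, kXTailV16);
    std::memcpy(rec.data() + kXTailV16, dist, kDistBytes);
    rec[kXTailV16 + kDistBytes] = type;
    return rec;
}
```

A reader that supports both versions would only need to switch the stride (`record_size`) and the X-tail width once it knows which format the file uses.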

Memory coalescing improvements for PCIe transfers.

Documentation updated (README.md and README_es.md):

Added “What’s New in v1.6” sections.

Benchmarks and build recommendations updated.

📊 Benchmarks (RTX 3060)

v1.5: ~750 MKey/s @ -dp 16

v1.6: ~870 MKey/s @ -dp 16

+16% throughput with ~12.5% smaller .dat files.
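The two percentages follow directly from the figures above: (870 - 750) / 750 = 16% and (32 - 28) / 32 = 12.5%. A small C++ sketch of the calculation, with hypothetical helper names:

```cpp
#include <cassert>

// Relative increase from `before` to `after`, in percent.
double percent_gain(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

// Relative reduction from `before` to `after`, in percent.
double percent_saved(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

Here `percent_gain(750, 870)` gives the throughput gain in MKey/s terms, and `percent_saved(32, 28)` gives the per-record size reduction.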

FreedomLabsIO · 2025-11-12T14:26:51Z

Hi, I get this error

GPU 0, CallGpuKernel failed: an illegal memory access was encountered

when USE_JACOBIAN=1

Do you have any idea why?

make clean; make SM=89 USE_JACOBIAN=1 PROFILE=release -j

rm -f RCKangaroo.o GpuKang.o Ec.o utils.o ./RCGpuCore.o rckangaroo
g++-9 -std=c++17 -I/usr/local/cuda-12.8/include -O3 -DNDEBUG -ffunction-sections -fdata-sections -DUSE_JACOBIAN=1 -c RCKangaroo.cpp -o RCKangaroo.o
g++-9 -std=c++17 -I/usr/local/cuda-12.8/include -O3 -DNDEBUG -ffunction-sections -fdata-sections -DUSE_JACOBIAN=1 -c GpuKang.cpp -o GpuKang.o
g++-9 -std=c++17 -I/usr/local/cuda-12.8/include -O3 -DNDEBUG -ffunction-sections -fdata-sections -DUSE_JACOBIAN=1 -c Ec.cpp -o Ec.o
g++-9 -std=c++17 -I/usr/local/cuda-12.8/include -O3 -DNDEBUG -ffunction-sections -fdata-sections -DUSE_JACOBIAN=1 -c utils.cpp -o utils.o
/usr/local/cuda-12.8/bin/nvcc -std=c++17 -arch=sm_89 -O3 -Xptxas -O3 -Xptxas -dlcm=ca -Xfatbin=-compress-all -DUSE_JACOBIAN=1 -Xcompiler -ffunction-sections -Xcompiler -fdata-sections -c RCGpuCore.cu -o RCGpuCore.o
g++-9 -std=c++17 -I/usr/local/cuda-12.8/include -O3 -DNDEBUG -ffunction-sections -fdata-sections -DUSE_JACOBIAN=1 -o rckangaroo RCKangaroo.o GpuKang.o Ec.o utils.o ./RCGpuCore.o -L/usr/local/cuda-12.8/lib64 -lcudart -pthread

./rckangaroo -start 1000000000000000000000 -range 84 -dp 18 -pubkey 0329c4574a4fd8c810b7e42a4b398882b381bcd85e40c6883712912d167c83e73a
********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 6, CUDA driver/runtime: 12.8/12.8
GPU 0: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 1, L2 size: 65536 KB
GPU 1: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 3, L2 size: 65536 KB
GPU 2: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 6, L2 size: 65536 KB
GPU 3: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 7, L2 size: 65536 KB
GPU 4: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 8, L2 size: 65536 KB
GPU 5: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 9, L2 size: 65536 KB
Total GPUs for work: 6

MAIN MODE

Solving public key
X: 29C4574A4FD8C810B7E42A4B398882B381BCD85E40C6883712912D167C83E73A
Y: 0E02C3AFD79913AB0961C95F12498F36A72FFA35C93AF27CEE30010FA6B51C53
Offset: 0000000000000000000000000000000000000000001000000000000000000000

Solving point: Range 84 bits, DP 18, start...
SOTA method, estimated ops: 2^42.202, RAM for DPs: 0.906 GB. DP and GPU overheads not included!
Estimated DPs per kangaroo: 6.542.
GPU 0: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 1: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 2: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 3: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 4: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 5: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPUs started...
GPU 2, CallGpuKernel failed: an illegal memory access was encountered
GPU 1, CallGpuKernel failed: an illegal memory access was encountered
GPU 0, CallGpuKernel failed: an illegal memory access was encountered
GPU 5, CallGpuKernel failed: an illegal memory access was encountered
GPU 4, CallGpuKernel failed: an illegal memory access was encountered
GPU 3, CallGpuKernel failed: an illegal memory access was encountered

With USE_JACOBIAN=0 it runs, but performance is slightly worse than with the original code.

Can you give an example of a scenario where you got the 10~30% boost in performance?

Thanks
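For reference, the estimates printed in the startup log above are consistent with a simple model: about 1.15 · 2^(range/2) operations, divided by 2^dp to get the expected DP count, divided by the total number of kangaroos. The constant 1.15 is inferred from the log (2^42.202 / 2^42 ≈ 1.15), not taken from the source code, and the function names below are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// ASSUMPTION: constant inferred from the logs, e.g. 2^42.202 for an 84-bit range.
constexpr double kSotaK = 1.15;

// log2 of the estimated number of operations for a given range size.
double estimated_ops_log2(int range_bits) {
    return std::log2(kSotaK) + range_bits / 2.0;
}

// Expected number of distinguished points for the whole run.
double expected_dp_count(int range_bits, int dp_bits) {
    return std::exp2(estimated_ops_log2(range_bits) - dp_bits);
}

// Expected DPs found by each kangaroo over the run.
double dps_per_kangaroo(int range_bits, int dp_bits,
                        std::uint64_t total_kangaroos) {
    assert(total_kangaroos > 0);
    return expected_dp_count(range_bits, dp_bits) /
           static_cast<double>(total_kangaroos);
}
```

For the run above (range 84, DP 18, 6 GPUs × 491520 kangaroos) this model gives about 6.54 DPs per kangaroo, in line with the logged 6.542.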

fmg75 · 2025-11-12T15:09:31Z

Which versions (CUDA, compiler) are you using?
Try compiling with:

Syntax: ./build.sh <SM> <USE_JACOBIAN 0|1> <profile: release|debug>

./build.sh 86 1 release # RTX 3060 (SM 8.6), Jacobian ON
./build.sh 86 0 release # Jacobian OFF (affine) for A/B

./rckangaroo -dp 14 -range 70 -start 100000000000 -pubkey 0290e6900a58d33393bc1097b5aed31f2e4e7cbd3e5466af958665bc0121248483 -tames tames_dp14_71.dat

To see a performance improvement, you should first generate the .dat file!

FreedomLabsIO · 2025-11-12T15:33:16Z

./build.sh 89 1 release
== CCFLAGS:   -std=c++17 -I/usr/local/cuda-12.8/include -O3 -DNDEBUG -ffunction-sections -fdata-sections -DUSE_JACOBIAN=1
== NVCCFLAGS: -std=c++17 -arch=sm_89 -O3 -Xptxas -O3 -Xptxas -dlcm=ca -Xfatbin=-compress-all -DUSE_JACOBIAN=1 -Xcompiler -ffunction-sections -Xcompiler -fdata-sections
== Listo: ./rckangaroo

./rckangaroo -dp 14 -range 70 -start 100000000000 -pubkey 0290e6900a58d33393bc1097b5aed31f2e4e7cbd3e5466af958665bc0121248483 -tames tames_dp14_71.dat -max 10
********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 6, CUDA driver/runtime: 12.8/12.8
GPU 0: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 1, L2 size: 65536 KB
GPU 1: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 3, L2 size: 65536 KB
GPU 2: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 6, L2 size: 65536 KB
GPU 3: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 7, L2 size: 65536 KB
GPU 4: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 8, L2 size: 65536 KB
GPU 5: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 9, L2 size: 65536 KB
Total GPUs for work: 6

TAMES GENERATION MODE

Solving point: Range 70 bits, DP 14, start...
SOTA method, estimated ops: 2^35.202, RAM for DPs: 0.277 GB. DP and GPU overheads not included!
Max allowed number of ops: 2^38.524, max RAM for DPs: 1.086 GB
Estimated DPs per kangaroo: 0.818. DP overhead is big, use less DP value if possible!
GPU 0: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 1: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 2: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 3: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 4: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPU 5: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPUs started...
GPU 0, CallGpuKernel failed: an illegal memory access was encountered
GPU 1, CallGpuKernel failed: an illegal memory access was encountered
GPU 4, CallGpuKernel failed: an illegal memory access was encountered
GPU 2, CallGpuKernel failed: an illegal memory access was encountered
GPU 5, CallGpuKernel failed: an illegal memory access was encountered
GPU 3, CallGpuKernel failed: an illegal memory access was encountered

I'm using CUDA 12.8 and g++-9 (tried with g++-13 too).

fmg75 · 2025-11-12T16:20:59Z

I see it's finally complete?
Suspicious; it may be a conflict between the multiple GPUs.
Could you try with a single GPU, for example with -gpu 0?

FreedomLabsIO · 2025-11-12T17:09:52Z

Same issue.

./rckangaroo -dp 16 -range 70 -start 100000000000 -pubkey 0290e6900a58d33393bc1097b5aed31f2e4e7cbd3e5466af958665bc0121248483 -tames tames_dp14_71.dat -max 10 -gpu 0

********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 6, CUDA driver/runtime: 12.8/12.8
GPU 0: NVIDIA GeForce RTX 4080 SUPER, 15.58 GB, 80 CUs, cap 8.9, PCI 1, L2 size: 65536 KB
Total GPUs for work: 1

TAMES GENERATION MODE

Solving point: Range 70 bits, DP 16, start...
SOTA method, estimated ops: 2^35.202, RAM for DPs: 0.210 GB. DP and GPU overheads not included!
Max allowed number of ops: 2^38.524, max RAM for DPs: 0.412 GB
Estimated DPs per kangaroo: 1.227. DP overhead is big, use less DP value if possible!
GPU 0: allocated 1501 MB, 491520 kangaroos. OldGpuMode: No
GPUs started...
GPU 0, CallGpuKernel failed: an illegal memory access was encountered

fmg75 · 2025-11-12T17:51:49Z

It could be an incompatibility with your newer CUDA 12.8 setup. For comparison, this is what I get on my machine:

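./rckangaroo -dp 14 -range 67 -tames tames_67.dat -max 10
********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************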

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 1, CUDA driver/runtime: 12.4/12.0
GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU, 5.79 GB, 30 CUs, cap 8.6, PCI 1, L2 size: 3072 KB
Total GPUs for work: 1

TAMES GENERATION MODE

Solving point: Range 67 bits, DP 14, start...
SOTA method, estimated ops: 2^33.702, RAM for DPs: 0.219 GB. DP and GPU overheads not included!
Max allowed number of ops: 2^37.024, max RAM for DPs: 0.505 GB
Estimated DPs per kangaroo: 0.867. DP overhead is big, use less DP value if possible!
GPU 0: allocated 2899 MB, 983040 kangaroos. OldGpuMode: Yes
GPUs started...
GEN: Speed: 343 MKeys/s, Err: 0, DPs: 419K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 799 MKeys/s, Err: 0, DPs: 958K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 808 MKeys/s, Err: 0, DPs: 1438K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 802 MKeys/s, Err: 0, DPs: 1918K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 795 MKeys/s, Err: 0, DPs: 2399K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 792 MKeys/s, Err: 0, DPs: 2879K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 798 MKeys/s, Err: 0, DPs: 3358K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 804 MKeys/s, Err: 0, DPs: 3839K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 803 MKeys/s, Err: 0, DPs: 4379K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 802 MKeys/s, Err: 0, DPs: 4860K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 801 MKeys/s, Err: 0, DPs: 5340K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 800 MKeys/s, Err: 0, DPs: 5820K/852K, Time: 0d:00h:02m/0d:00h:00m
GEN: Speed: 799 MKeys/s, Err: 0, DPs: 6300K/852K, Time: 0d:00h:02m/0d:00h:00m
GEN: Speed: 798 MKeys/s, Err: 0, DPs: 6780K/852K, Time: 0d:00h:02m/0d:00h:00m
GEN: Speed: 796 MKeys/s, Err: 0, DPs: 7321K/852K, Time: 0d:00h:02m/0d:00h:00m
GEN: Speed: 794 MKeys/s, Err: 0, DPs: 7800K/852K, Time: 0d:00h:02m/0d:00h:00m
GEN: Speed: 793 MKeys/s, Err: 0, DPs: 8280K/852K, Time: 0d:00h:02m/0d:00h:00m
Operations limit reached
Stopping work ...
saving tames...
tames saved

talebi · 2025-11-17T13:09:26Z

I get the same issue with an RTX 4070 Ti, single GPU.

fmg75 · 2025-11-18T11:01:57Z

I'd like to know why. What version of CUDA do you have installed?
It would be great if you could test the performance of modular inversion with the Montgomery trick and Jacobian coordinates. It really does affect performance. Could you try using CUDA driver 12.4/12.0?
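For readers unfamiliar with the Montgomery trick mentioned here: it replaces n modular inversions with a single inversion plus about 3n multiplications, by inverting the product of all inputs and then peeling off one factor at a time. The C++ sketch below shows the technique over a small toy prime; it is not the project's code, which works with 256-bit secp256k1 field elements, and all names in it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy prime modulus (ASSUMPTION for illustration only).
constexpr std::uint64_t kP = 1000000007ULL;

// Operands are below kP < 2^30, so the product fits in 64 bits.
std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) { return (a * b) % kP; }

std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp) {
    std::uint64_t result = 1;
    base %= kP;
    while (exp > 0) {
        if (exp & 1) result = mul_mod(result, base);
        base = mul_mod(base, base);
        exp >>= 1;
    }
    return result;
}

// Single inversion via Fermat's little theorem: a^(p-2) mod p.
std::uint64_t inv_mod(std::uint64_t a) { return pow_mod(a, kP - 2); }

// Montgomery's trick: invert every element with only ONE call to inv_mod.
std::vector<std::uint64_t> batch_inverse(const std::vector<std::uint64_t>& v) {
    const std::size_t n = v.size();
    std::vector<std::uint64_t> out(n);
    if (n == 0) return out;

    // Forward pass: out[i] holds the product v[0] * ... * v[i-1].
    std::uint64_t acc = 1;
    for (std::size_t i = 0; i < n; ++i) {
        assert(v[i] > 0 && v[i] < kP);
        out[i] = acc;
        acc = mul_mod(acc, v[i]);
    }

    // One inversion of the full product.
    std::uint64_t inv = inv_mod(acc);

    // Backward pass: inv holds (v[0] * ... * v[i])^-1 at the start of step i.
    for (std::size_t i = n; i-- > 0;) {
        out[i] = mul_mod(out[i], inv);  // yields v[i]^-1
        inv = mul_mod(inv, v[i]);       // drop v[i] from the running inverse
    }
    return out;
}
```

Since one inversion costs far more than a multiplication, batching is generally where the saving comes from when many kangaroos step at once.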

talebi · 2025-11-18T13:05:07Z

./rckangaroo -dp 14 -range 67 -tames tames_67.dat -max 10
********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 1, CUDA driver/runtime: 13.0/12.1
GPU 0: NVIDIA GeForce RTX 4070 Ti, 11.99 GB, 60 CUs, cap 8.9, PCI 1, L2 size: 49152 KB
Total GPUs for work: 1

TAMES GENERATION MODE

Solving point: Range 67 bits, DP 14, start...
SOTA method, estimated ops: 2^33.702, RAM for DPs: 0.219 GB. DP and GPU overheads not included!
Max allowed number of ops: 2^37.024, max RAM for DPs: 0.505 GB
Estimated DPs per kangaroo: 2.313. DP overhead is big, use less DP value if possible!
GPU 0: allocated 1128 MB, 368640 kangaroos. OldGpuMode: No
GPUs started...
GPU 0, CallGpuKernel failed: an illegal memory access was encountered
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 751 MKeys/s, Err: 1, DPs: 0K/852K, Time: 0d:00h:01m/0d:00h:00m

fmg75 · 2025-11-18T13:32:54Z

The bug is in the GPU kernel code for newer GPUs (when OldGpuMode = No). There's an illegal memory access in the implementation optimized for GPUs with large L2 caches.

I made a small change to the code. It should work!
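Judging by the commit titles on this PR (`GpuKangs[GpuCnt]->IsOldGpu = true;`), the change appears to force the old-GPU code path for every device. The C++ sketch below shows the kind of selection logic this implies. The L2-size cutoff is a hypothetical value chosen only to agree with the logs in this thread (3072 KB reports OldGpuMode: Yes; 49152 KB and 65536 KB report No), and `select_path` and `KernelPath` are illustrative names, not identifiers from the repository.

```cpp
#include <cassert>
#include <cstddef>

// ASSUMPTION: hypothetical cutoff, consistent with the logs but not confirmed.
constexpr std::size_t kAssumedL2CutoffKB = 16 * 1024;

enum class KernelPath { OldGpu, LargeL2 };

// Chooses the kernel variant from the reported L2 cache size.
// force_old models the workaround: always take the old-GPU path.
KernelPath select_path(std::size_t l2_size_kb, bool force_old) {
    assert(l2_size_kb > 0);
    if (force_old) return KernelPath::OldGpu;
    return l2_size_kb < kAssumedL2CutoffKB ? KernelPath::OldGpu
                                           : KernelPath::LargeL2;
}
```

Forcing the old path avoids the faulty branch rather than repairing it, so the large-L2 path would still need a separate fix.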

fmg75 · 2025-11-18T19:36:06Z

Please let me know whether this fix worked and whether you can run the code on an RTX 40XX or 50XX.

talebi · 2025-11-19T00:51:30Z

The error is gone, but unfortunately I don't see any progress. I can let it run for hours, but nothing happens:

root@88a1b4e6d576:/RCKangaroo# ./rckangaroo -dp 14 -range 67 -tames tames_67.dat -max 10
********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 1, CUDA driver/runtime: 13.0/12.1
GPU 0: NVIDIA GeForce RTX 4070 Ti, 11.99 GB, 60 CUs, cap 8.9, PCI 1, L2 size: 49152 KB
Total GPUs for work: 1

TAMES GENERATION MODE

Solving point: Range 67 bits, DP 14, start...
SOTA method, estimated ops: 2^33.702, RAM for DPs: 0.219 GB. DP and GPU overheads not included!
Max allowed number of ops: 2^37.024, max RAM for DPs: 0.505 GB
Estimated DPs per kangaroo: 0.434. DP overhead is big, use less DP value if possible!
GPU 0: allocated 5787 MB, 1966080 kangaroos. OldGpuMode: Yes
GPUs started...
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:01m/0d:00h:00m
GEN: Speed: 1966080 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:01m/0d:00h:00m
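A stalled run like the one above can be detected mechanically by parsing the `GEN:` status lines and checking whether the DP counter ever increases. The C++ sketch below assumes the line format exactly as it appears in these logs; `GenStatus`, `parse_gen_line`, and `dp_counter_stalled` are hypothetical names.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Fields of one "GEN:" status line as printed in the logs of this thread.
struct GenStatus {
    unsigned long long speed_mkeys;  // reported speed, MKeys/s
    int err;                         // error counter
    unsigned long long dps_k;        // DPs collected so far, in thousands
    unsigned long long target_k;     // expected DP count, in thousands
};

// Returns true if the line matches the expected status format.
bool parse_gen_line(const std::string& line, GenStatus& out) {
    return std::sscanf(line.c_str(),
                       "GEN: Speed: %llu MKeys/s, Err: %d, DPs: %lluK/%lluK",
                       &out.speed_mkeys, &out.err,
                       &out.dps_k, &out.target_k) == 4;
}

// A run looks stalled if the DP counter did not grow between the first
// and the last sample.
bool dp_counter_stalled(const std::vector<GenStatus>& samples) {
    assert(samples.size() >= 2);
    return samples.back().dps_k <= samples.front().dps_k;
}
```

Note that in the stalled log the reported speed (1966080) equals the kangaroo count printed for the GPU, which suggests the speed figure is not meaningful in that run either.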

FreedomLabsIO · 2025-11-19T03:40:19Z

Same issue as @talebi.

The "an illegal memory access was encountered" error is gone, but the DP count does not increase.

./rckangaroo -range 71 -dp 16 -start 0 -tames tames71_v15.dat -max 10 -pubkey 0290e6900a58d33393bc1097b5aed31f2e4e7cbd3e5466af958665bc0121248483

********************************************************************************
*                    RCKangaroo v3.0  (c) 2024 RetiredCoder                    *
********************************************************************************

This software is free and open-source: https://github.com/RetiredC
It demonstrates fast GPU implementation of SOTA Kangaroo method for solving ECDLP
Linux version
CUDA devices: 1, CUDA driver/runtime: 13.0/12.8
GPU 0: NVIDIA GeForce RTX 5070 Ti, 15.47 GB, 70 CUs, cap 12.0, PCI 1, L2 size: 49152 KB
Total GPUs for work: 1

TAMES GENERATION MODE

Solving point: Range 71 bits, DP 16, start...
SOTA method, estimated ops: 2^35.702, RAM for DPs: 0.219 GB. DP and GPU overheads not included!
Max allowed number of ops: 2^39.024, max RAM for DPs: 0.505 GB
Estimated DPs per kangaroo: 0.372. DP overhead is big, use less DP value if possible!
GPU 0: allocated 6749 MB, 2293760 kangaroos. OldGpuMode: Yes
GPUs started...
GEN: Speed: 2293760 MKeys/s, Err: 0, DPs: 0K/852K, Time: 0d:00h:00m/0d:00h:00m

fmg75 added 3 commits August 22, 2025 16:54

GPU: Warp-aggregated DP emission (+10–30% perf) & new .dat v1.6 format

57bd5ff

"docs: add CHANGELOG for v1.6"

0ad194a

.gitingnore

ac4b043

fmg75 added 2 commits November 18, 2025 10:28

GpuKangs[GpuCnt]->IsOldGpu = true;

f9fb658

GpuKangs[GpuCnt]->IsOldGpu = true

75cec4c
