@nospecialize testf/_compare forwarders in CUDA tests. #3117
Merged
The main `testf` in test/helpers.jl and its private clones in lib/cublas, lib/cufft, and lib/cusolver are all thin forwarders that call out to `deepcopy` / `adapt` / `≈`. Each unique `(f, xs...)` call site was getting its own compiled specialization even though the body does no per-type compute.

This matches the GPUArrays-side fix on `TestSuite.compare` / `test_result`. That one was the #1 testsuite compile hotspot until it was `@nospecialize`'d: 322 events dropped to 0 on the broadcasting testset trace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
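For readers unfamiliar with the pattern, here is a minimal sketch of what such a forwarder might look like. It is not the code from this PR: `testf_sketch`, `to_device_sketch`, and `compare_sketch` are hypothetical names, and the `adapt` step is replaced by a plain `copy` so the sketch runs without a GPU. Note that `@nospecialize` is only a hint to the compiler, so the exact effect on compile time may vary between Julia versions.

```julia
using LinearAlgebra  # provides `isapprox` (≈) for arrays

# Hypothetical stand-in for `adapt(CuArray, x)`: it only copies host arrays,
# so this sketch needs no GPU.
to_device_sketch(@nospecialize(x)) = x isa AbstractArray ? copy(x) : x

# Hypothetical stand-in for the comparison helper; forwards keyword
# arguments such as `rtol` to `isapprox`.
compare_sketch(@nospecialize(a), @nospecialize(b); kwargs...) =
    isapprox(a, b; kwargs...)

# Forwarder in the style of `testf`. It shuffles arguments and calls `f`,
# but does no type-dependent work of its own, so it asks the compiler not
# to specialize on the types of `f` and `xs`.
function testf_sketch(@nospecialize(f), @nospecialize(xs...); kwargs...)
    cpu_in = map(deepcopy, xs)
    dev_in = map(to_device_sketch, xs)
    cpu_out = f(cpu_in...)   # dynamic call; `f` itself is still specialized
    dev_out = f(dev_in...)
    return compare_sketch(cpu_out, dev_out; kwargs...)
end
```

Call sites look the same as with a fully specialized helper, for example `testf_sketch(x -> x .* 2, rand(Float32, 4))` or `testf_sketch(sum, rand(3); rtol=1e-3)`. The trade-off is that calls inside the forwarder become dynamic dispatches, which is usually acceptable in test helpers where compile time dominates run time.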
CUDA.jl Benchmarks
| Benchmark suite | Current: 85552c4 | Previous: 22b2689 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 101042 ns | 102014.5 ns | 0.99 |
| array/accumulate/Float32/dims=1 | 76713 ns | 77428 ns | 0.99 |
| array/accumulate/Float32/dims=1L | 1586552 ns | 1587073 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143947.5 ns | 144782 ns | 0.99 |
| array/accumulate/Float32/dims=2L | 658019 ns | 661233 ns | 1.00 |
| array/accumulate/Int64/1d | 118225 ns | 118923 ns | 0.99 |
| array/accumulate/Int64/dims=1 | 80053.5 ns | 80472.5 ns | 0.99 |
| array/accumulate/Int64/dims=1L | 1695730 ns | 1706560 ns | 0.99 |
| array/accumulate/Int64/dims=2 | 156555 ns | 156801 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 962158 ns | 962520 ns | 1.00 |
| array/broadcast | 20369 ns | 20669 ns | 0.99 |
| array/construct | 1256.9 ns | 1256.8 ns | 1.00 |
| array/copy | 18009 ns | 18092 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 216616 ns | 217195 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 280853 ns | 284274 ns | 0.99 |
| array/copyto!/gpu_to_gpu | 10742 ns | 10873 ns | 0.99 |
| array/iteration/findall/bool | 134816 ns | 134986 ns | 1.00 |
| array/iteration/findall/int | 149572 ns | 151080 ns | 0.99 |
| array/iteration/findfirst/bool | 81437 ns | 81781 ns | 1.00 |
| array/iteration/findfirst/int | 83808 ns | 84023 ns | 1.00 |
| array/iteration/findmin/1d | 86422.5 ns | 87902.5 ns | 0.98 |
| array/iteration/findmin/2d | 117166 ns | 117417 ns | 1.00 |
| array/iteration/logical | 198169.5 ns | 201691.5 ns | 0.98 |
| array/iteration/scalar | 66685 ns | 66086 ns | 1.01 |
| array/permutedims/2d | 51987.5 ns | 52480 ns | 0.99 |
| array/permutedims/3d | 52434 ns | 53206 ns | 0.99 |
| array/permutedims/4d | 51399.5 ns | 52056 ns | 0.99 |
| array/random/rand/Float32 | 13113 ns | 13513 ns | 0.97 |
| array/random/rand/Int64 | 25032 ns | 25225 ns | 0.99 |
| array/random/rand!/Float32 | 8447.666666666666 ns | 9480 ns | 0.89 |
| array/random/rand!/Int64 | 21770 ns | 21874.5 ns | 1.00 |
| array/random/randn/Float32 | 43447 ns | 39930.5 ns | 1.09 |
| array/random/randn!/Float32 | 30894 ns | 30795 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 34244 ns | 35149.5 ns | 0.97 |
| array/reductions/mapreduce/Float32/dims=1 | 49219.5 ns | 40034.5 ns | 1.23 |
| array/reductions/mapreduce/Float32/dims=1L | 51349 ns | 51335 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 56457 ns | 56606 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 69468.5 ns | 69556 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 42976.5 ns | 42597 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1 | 43024.5 ns | 50815 ns | 0.85 |
| array/reductions/mapreduce/Int64/dims=1L | 87461 ns | 87295 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 59307 ns | 59460 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 84710 ns | 84825 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 34755 ns | 35215.5 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1 | 41015.5 ns | 40067 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1L | 51446 ns | 51370 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 56728.5 ns | 56646 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 70065 ns | 69814 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 42968 ns | 42572 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 42021.5 ns | 50859 ns | 0.83 |
| array/reductions/reduce/Int64/dims=1L | 87380 ns | 87191 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 59393 ns | 59743 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2L | 84703.5 ns | 84667 ns | 1.00 |
| array/reverse/1d | 17939 ns | 18121 ns | 0.99 |
| array/reverse/1dL | 68538 ns | 68646 ns | 1.00 |
| array/reverse/1dL_inplace | 65727 ns | 65864 ns | 1.00 |
| array/reverse/1d_inplace | 10232.833333333332 ns | 10292.666666666668 ns | 0.99 |
| array/reverse/2d | 20679 ns | 21138 ns | 0.98 |
| array/reverse/2dL | 72699 ns | 73135 ns | 0.99 |
| array/reverse/2dL_inplace | 65688 ns | 65713 ns | 1.00 |
| array/reverse/2d_inplace | 11080 ns | 11182 ns | 0.99 |
| array/sorting/1d | 2735481 ns | 2735147 ns | 1.00 |
| array/sorting/2d | 1068965 ns | 1073189 ns | 1.00 |
| array/sorting/by | 3304941 ns | 3303865.5 ns | 1.00 |
| cuda/synchronization/context/auto | 1182.1 ns | 1112 ns | 1.06 |
| cuda/synchronization/context/blocking | 938.6515151515151 ns | 888.829268292683 ns | 1.06 |
| cuda/synchronization/context/nonblocking | 7634.4 ns | 6852.5 ns | 1.11 |
| cuda/synchronization/stream/auto | 996 ns | 970.8 ns | 1.03 |
| cuda/synchronization/stream/blocking | 796.6421052631579 ns | 800.5102040816327 ns | 1.00 |
| cuda/synchronization/stream/nonblocking | 8034.1 ns | 7101.299999999999 ns | 1.13 |
| integration/byval/reference | 143711 ns | 143926 ns | 1.00 |
| integration/byval/slices=1 | 145657 ns | 145976 ns | 1.00 |
| integration/byval/slices=2 | 284549 ns | 284590 ns | 1.00 |
| integration/byval/slices=3 | 423087 ns | 423197 ns | 1.00 |
| integration/cudadevrt | 102345 ns | 102511 ns | 1.00 |
| integration/volumerhs | 23471812 ns | 23505751 ns | 1.00 |
| kernel/indexing | 13153 ns | 13253 ns | 0.99 |
| kernel/indexing_checked | 13896 ns | 13903 ns | 1.00 |
| kernel/launch | 2147.222222222222 ns | 2079.8888888888887 ns | 1.03 |
| kernel/occupancy | 664.1314102564102 ns | 693.52 ns | 0.96 |
| kernel/rand | 17278.5 ns | 15424 ns | 1.12 |
| latency/import | 3821736397.5 ns | 3843588033 ns | 0.99 |
| latency/precompile | 4587847485 ns | 4581549814.5 ns | 1.00 |
| latency/ttfp | 4396109132 ns | 4430341918 ns | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #3117 +/- ##
=======================================
Coverage 16.58% 16.58%
=======================================
Files 120 120
Lines 9586 9586
=======================================
Hits 1590 1590
Misses 7996 7996
☔ View full report in Codecov by Sentry.