JTP orders of magnitude slower than SEQ (even with substantial work sizes) #61
What steps will reproduce the problem?
1. Execute a kernel in JTP mode (see attached test code).
What is the expected output? What do you see instead?
Expected: the execution time of JTP is similar to or less than SEQ's.
Observed: the execution time of JTP is orders of magnitude higher than SEQ's for
work sizes up to about 32000.
What version of the product are you using? On what operating system?
aparapi-2012-05-06.zip (R#407, May 6).
Ubuntu 12.04 x64
nVidia gt540m, driver version 295.40, cuda toolkit 4.2.9
Intel Core i7-2630QM (2 GHz, quad-core; 8 logical threads with Hyper-Threading)
Please provide any additional information below.
I tested a kernel that applies numerous functions to the work item ID
(trigonometric, cube root, exponential), with the results added and multiplied
together. The kernel was tested over workloads ranging in size from 2 to
1048576, with 1024 iterations at each size.
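The attached test code is not reproduced here, but a per-work-item body of the kind described (transcendental functions of the work item ID, added and multiplied together) might look roughly like this sketch. The exact functions and constants are assumptions; the real attached kernel may differ.

```java
// Hypothetical per-work-item body approximating the kernel described above:
// several trigonometric, cube-root and exponential terms of the global ID,
// combined by addition and multiplication. In Aparapi this body would sit
// inside Kernel.run(), writing to out[getGlobalId()].
public class WorkloadSketch {
    public static double compute(int gid) {
        double x = gid;
        return Math.sin(x) + Math.cos(x)          // trigonometric terms
             + Math.cbrt(x) * Math.tan(x / 100.0) // cube root, multiplied in
             + Math.exp(-x / 1000.0);             // exponential term
    }

    public static void main(String[] args) {
        int size = 1024;
        double[] out = new double[size];
        for (int i = 0; i < size; i++) {
            out[i] = compute(i); // SEQ-style sequential execution over all IDs
        }
        System.out.println("out[" + (size - 1) + "] = " + out[size - 1]);
    }
}
```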
In a subsequent test I executed the kernel in JTP mode with a group size of 4
to match the number of physical CPU cores (rather than letting Aparapi choose
the group size). The results improved markedly for work sizes up to 262144
(though they were slightly worse for work sizes larger than this); see the
second set of results below. So perhaps this is simply a matter of working out
how to choose a good group size (i.e. number of threads?) in JTP mode.
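The effect of matching the thread count to the core count can be illustrated outside Aparapi with a plain fixed-size thread pool. This is a self-contained sketch, not the attached test code: it partitions the work items among 4 worker threads, mirroring the "group size = number of CPU cores" configuration (in the Aparapi API of that era, roughly `kernel.execute(Range.create(globalSize, 4))` with the execution mode set to JTP). The per-item body is a simplified stand-in for the real kernel.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: run a per-item workload on a pool of 4 threads (one per physical
// core) instead of oversubscribing with many more threads than cores.
public class PooledRun {
    // Simplified stand-in for the kernel body (assumption, not the real code).
    static double compute(int gid) {
        double x = gid;
        return Math.sin(x) + Math.cos(x) + Math.cbrt(x);
    }

    public static void main(String[] args) throws Exception {
        int globalSize = 32768;
        int threads = 4; // match physical cores rather than letting the runtime choose
        double[] out = new double[globalSize];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = globalSize / threads;
        Future<?>[] futures = new Future<?>[threads];
        for (int t = 0; t < threads; t++) {
            final int start = t * chunk;
            final int end = (t == threads - 1) ? globalSize : start + chunk;
            // Each worker handles a contiguous range of work item IDs.
            futures[t] = pool.submit(() -> {
                for (int i = start; i < end; i++) {
                    out[i] = compute(i);
                }
            });
        }
        for (Future<?> f : futures) {
            f.get(); // wait for all workers to finish
        }
        pool.shutdown();
        System.out.println("out[0] = " + out[0]);
    }
}
```

With a compute-bound body like this, threads beyond the core count mostly add scheduling and synchronization overhead, which would be consistent with the improvement seen below when the group size is pinned to 4.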
Results, letting Aparapi choose group size in JTP mode:
2: SEQ: 0.009s JTP: 0.148s GPU: 0.335s
4: SEQ: 0.015s JTP: 0.289s GPU: 0.135s
8: SEQ: 0.005s JTP: 0.361s GPU: 0.144s
16: SEQ: 0.01s JTP: 0.628s GPU: 0.123s
32: SEQ: 0.015s JTP: 1.193s GPU: 0.118s
64: SEQ: 0.028s JTP: 2.792s GPU: 0.117s
128: SEQ: 0.054s JTP: 6.153s GPU: 0.108s
256: SEQ: 0.112s JTP: 14.786s GPU: 0.12s
512: SEQ: 0.211s JTP: 15.251s GPU: 0.111s
1024: SEQ: 0.402s JTP: 15.263s GPU: 0.124s
2048: SEQ: 0.754s JTP: 15.662s GPU: 0.151s
4096: SEQ: 1.467s JTP: 15.655s GPU: 0.167s
8192: SEQ: 2.844s JTP: 15.806s GPU: 0.256s
16384: SEQ: 5.747s JTP: 15.932s GPU: 0.399s
32768: SEQ: 11.366s JTP: 16.49s GPU: 0.701s
65536: SEQ: 22.775s JTP: 17.414s GPU: 1.313s
131072: SEQ: 45.818s JTP: 21.927s GPU: 2.538s
262144: SEQ: 91.924s JTP: 32.749s GPU: 4.974s
524288: SEQ: 183.459s JTP: 56.879s GPU: 9.852s
1048576: SEQ: 369.247s JTP: 102.847s GPU: 19.615s
Results when specifying a group size matching the number of CPU cores in JTP mode:
2: SEQ: 0.008s JTP: 0.266s GPU: 0.325s
4: SEQ: 0.003s JTP: 0.218s GPU: 0.133s
8: SEQ: 0.005s JTP: 0.193s GPU: 0.131s
16: SEQ: 0.009s JTP: 0.187s GPU: 0.125s
32: SEQ: 0.014s JTP: 0.17s GPU: 0.122s
64: SEQ: 0.027s JTP: 0.175s GPU: 0.117s
128: SEQ: 0.054s JTP: 0.176s GPU: 0.116s
256: SEQ: 0.108s JTP: 0.191s GPU: 0.135s
512: SEQ: 0.219s JTP: 0.23s GPU: 0.1s
1024: SEQ: 0.403s JTP: 0.292s GPU: 0.11s
2048: SEQ: 0.749s JTP: 0.389s GPU: 0.13s
4096: SEQ: 1.454s JTP: 0.599s GPU: 0.157s
8192: SEQ: 2.872s JTP: 1.03s GPU: 0.235s
16384: SEQ: 5.714s JTP: 1.938s GPU: 0.389s
32768: SEQ: 11.297s JTP: 4.155s GPU: 0.695s
65536: SEQ: 22.803s JTP: 7.823s GPU: 1.305s
131072: SEQ: 46.006s JTP: 15.562s GPU: 2.525s
262144: SEQ: 92.34s JTP: 30.026s GPU: 4.968s
524288: SEQ: 184.077s JTP: 61.684s GPU: 9.839s
1048576: SEQ: 370.805s JTP: 121.218s GPU: 19.595s
Original issue reported on code.google.com by oliver.c...@gmail.com on 9 Aug 2012 at 6:07