JTP orders of magnitude slower than SEQ (even with substantial work sizes) #61

Description

What steps will reproduce the problem?
1. Execute a kernel in JTP mode (see attached test code).

What is the expected output? What do you see instead?
Expected: the execution time of JTP is similar to or less than SEQ.
Observed: the execution time of JTP is orders of magnitude higher than SEQ for work sizes up to about 32000.

What version of the product are you using? On what operating system?
aparapi-2012-05-06.zip (R#407, May 6).
Ubuntu 12.04 x64
nVidia gt540m, driver version 295.40, cuda toolkit 4.2.9
Intel Core i7-2630QM (2 GHz, quad-core, 8 logical cores with Hyper-Threading)

Please provide any additional information below.
I tested a kernel that applies a number of functions to the work-item ID 
(trigonometric, cube root, exponential), whose results are added or multiplied 
together. The kernel was tested over workloads ranging in size from 2 to 
1048576, with 1024 iterations for each size.
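The attached test code is not shown here, so the exact expression is unknown; the per-item workload described above might look something like this sketch in plain Java (the specific combination of functions is an assumption, chosen only to match the description: trigonometric, cube root, and exponential terms added or multiplied together):

```java
// Hypothetical reconstruction of the benchmark's per-item workload.
// The exact expression in the attached test code is unknown; this sketch
// only matches the description: trig, cube-root and exponential functions
// of the work-item ID, combined by addition and multiplication.
public class WorkloadSketch {
    static double compute(int gid) {
        double x = gid;
        return Math.sin(x) * Math.cos(x)          // trigonometric term
             + Math.cbrt(x + 1.0)                 // cube-root term
             + Math.exp(-x / 1024.0) * Math.tan(x * 0.001); // exponential term
    }

    public static void main(String[] args) {
        double[] out = new double[8];
        for (int gid = 0; gid < out.length; gid++) {
            out[gid] = compute(gid);
        }
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

Each work item depends only on its own ID, so the workload is embarrassingly parallel and the per-item cost is dominated by transcendental math, which is why SEQ scales linearly with work size in the tables below.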

In a subsequent test I executed the kernel in JTP mode with a group size 
of 4 to match the number of physical CPU cores (rather than letting Aparapi 
choose the group size). The results were much improved for work sizes up to 
about 262144 (but slightly worse for work sizes larger than this); see the 
second set of results below. So perhaps this is simply a matter of working out 
how to choose a good group size (i.e. number of threads) in JTP mode.
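The scheduling idea behind matching the group size to the core count can be sketched outside Aparapi. The following plain-Java analogue (this is not Aparapi's actual JTP dispatcher, just an illustration of the principle) splits the global work range into one contiguous chunk per thread, so thread count tracks the number of cores rather than the work size:

```java
// Plain-Java analogue of running one JTP "group" per CPU core: the global
// work range is split into numThreads contiguous chunks, one thread each.
// This sketches the scheduling idea only, not Aparapi's implementation.
public class ChunkedRun {
    static double[] run(int globalSize, int numThreads) {
        double[] out = new double[globalSize];
        Thread[] workers = new Thread[numThreads];
        int chunk = (globalSize + numThreads - 1) / numThreads;
        for (int t = 0; t < numThreads; t++) {
            final int start = t * chunk;
            final int end = Math.min(start + chunk, globalSize);
            workers[t] = new Thread(() -> {
                for (int gid = start; gid < end; gid++) {
                    // stand-in for the kernel body
                    out[gid] = Math.sin(gid) + Math.cbrt(gid) + Math.exp(-gid / 1024.0);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        double[] a = run(1 << 12, cores); // group size matched to cores
        double[] b = run(1 << 12, 1);     // sequential baseline
        // Same results regardless of chunking; only scheduling overhead differs.
        System.out.println(java.util.Arrays.equals(a, b)); // prints "true"
    }
}
```

With one thread per core there is a fixed startup/synchronization cost per launch, amortized over the whole work size; with thread count proportional to work size that overhead grows with the problem, which would be consistent with the first table below.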

Results, letting Aparapi choose group size in JTP mode:
Size        SEQ         JTP         GPU
2           0.009s      0.148s      0.335s
4           0.015s      0.289s      0.135s
8           0.005s      0.361s      0.144s
16          0.01s       0.628s      0.123s
32          0.015s      1.193s      0.118s
64          0.028s      2.792s      0.117s
128         0.054s      6.153s      0.108s
256         0.112s      14.786s     0.12s
512         0.211s      15.251s     0.111s
1024        0.402s      15.263s     0.124s
2048        0.754s      15.662s     0.151s
4096        1.467s      15.655s     0.167s
8192        2.844s      15.806s     0.256s
16384       5.747s      15.932s     0.399s
32768       11.366s     16.49s      0.701s
65536       22.775s     17.414s     1.313s
131072      45.818s     21.927s     2.538s
262144      91.924s     32.749s     4.974s
524288      183.459s    56.879s     9.852s
1048576     369.247s    102.847s    19.615s


Results when specifying a group size matching the number of CPU cores in JTP 
mode:
Size        SEQ         JTP         GPU
2           0.008s      0.266s      0.325s
4           0.003s      0.218s      0.133s
8           0.005s      0.193s      0.131s
16          0.009s      0.187s      0.125s
32          0.014s      0.17s       0.122s
64          0.027s      0.175s      0.117s
128         0.054s      0.176s      0.116s
256         0.108s      0.191s      0.135s
512         0.219s      0.23s       0.1s
1024        0.403s      0.292s      0.11s
2048        0.749s      0.389s      0.13s
4096        1.454s      0.599s      0.157s
8192        2.872s      1.03s       0.235s
16384       5.714s      1.938s      0.389s
32768       11.297s     4.155s      0.695s
65536       22.803s     7.823s      1.305s
131072      46.006s     15.562s     2.525s
262144      92.34s      30.026s     4.968s
524288      184.077s    61.684s     9.839s
1048576     370.805s    121.218s    19.595s

Original issue reported on code.google.com by oliver.c...@gmail.com on 9 Aug 2012 at 6:07
