Hardware runtime decision and utility #29
Brandon-Wilson wants to merge 4 commits into harrism:master from
Conversation
Hardware selection is now a runtime decision, so precompiled binaries can run on a system without a GPU without crashing. In addition, the execution target can be selected at runtime, enabling custom hardware-selection logic. Several utility features have been added, including device runtime functions (getting the number of devices, setting the active device), basic shared memory support, and 3D execution grids. All of these changes were made with careful consideration to avoid compatibility issues with pre-existing code.
NOTE: My advisors at JPL are interested in using HEMI / CUDA for image processing on upcoming missions, but would like to maintain full CPU compatibility. Preferably, these changes could be merged into the master branch for future support. Thank you for your time!
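The runtime fallback described in the PR could be sketched roughly as below. This is a host-only mock, not the PR's actual implementation: `hemi::device` and `queryForDevice()` appear in the diff later in this thread, but the simplified `ExecutionPolicy` and the `launchOn` helper here are hypothetical illustrations.

```cpp
#include <cassert>

namespace hemi {

enum Location { host, device };

// Stand-in for a CUDA device query; a real build would call
// cudaGetDeviceCount() and check for at least one usable device.
inline bool queryForDevice() { return false; } // assume no GPU in this sketch

// Hypothetical, pared-down policy; the PR's ExecutionPolicy carries
// more state (grid/block configuration, shared memory size, etc.).
struct ExecutionPolicy {
    Location loc = device;                // prefer the GPU by default
    Location getLocation() const { return loc; }
    void setLocation(Location l) { loc = l; }
};

// Runtime dispatch: prefer the GPU, but fall back to the host path when
// no device is present, so one precompiled binary runs everywhere.
template <typename Function, typename... Args>
Location launchOn(const ExecutionPolicy& p, Function f, Args... args) {
    if (p.getLocation() == device && queryForDevice()) {
        f(args...);   // real code would configure a grid and launch <<<...>>>
        return device;
    }
    f(args...);       // plain sequential host call
    return host;
}

} // namespace hemi
```

With `queryForDevice()` returning false, the same call site transparently takes the host path, which is the behavior the PR description claims for GPU-less systems.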
harrism
left a comment
Overall I'm not sure some of the added complexity is in the spirit of hemi's original design. Please address the comments. I also think there's too much in this for one pull request. Please separate the CPU/GPU runtime execution selection from all of the utility pieces. That would make it easier for me to test and integrate.
Also, I won't accept any new functionality without tests. Please add tests (see the google tests already included in hemi for reference) for your new functionality.
}

template <typename Function, typename... Arguments>
cudaError_t getMaxBlocksForDevice(unsigned int& NumBlocks, ExecutionPolicy &p)
A comment to explain what this getMaxBlocksForDevice... function is for vs. configureGrid would help me review this.
namespace hemi
{
///////Global Grid Gets//////////////////////////////////////////////////////
This comment doesn't match the style of other comments in the project. And what do you mean by "Gets"?
    return 0;
#endif
}
int globalThreadIndex() {
#ifdef HEMI_DEV_CODE
    return (blockIdx.x + blockIdx.y * gridDim.x + blockIdx.z * gridDim.x * gridDim.y) // get block idx
OK, so you went full-3D for the global index.... I think this needs more design thought. The common case is 1D, and this is going to make the common case slow. I would leave globalThreadIndex() alone, and then make new accessors such as : globalThreadIndex1D, globalThreadIndex2D, globalThreadIndex3D. Or something like that. Do you use a lot of 3D blocks/grids?
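The dimension-specific accessors suggested in the comment above might look like the sketch below. The names `globalThreadIndex1D`/`globalThreadIndex3D` are the reviewer's proposal, not existing hemi API, and only the host fallback path is exercised here; the device branch mirrors the index math from the diff.

```cpp
#include <cassert>

namespace hemi {

#ifdef HEMI_DEV_CODE
// Device versions: each accessor reads only the dimensions it needs,
// so the common 1D case pays for no y/z arithmetic.
__device__ inline int globalThreadIndex1D() {
    return blockIdx.x * blockDim.x + threadIdx.x;
}
__device__ inline int globalThreadIndex3D() {
    int block  = blockIdx.x + blockIdx.y * gridDim.x
               + blockIdx.z * gridDim.x * gridDim.y;
    int thread = threadIdx.x + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
    return block * (blockDim.x * blockDim.y * blockDim.z) + thread;
}
#else
// Host fallback: sequential execution behaves like a 1x1x1 grid, so
// every accessor reports thread 0 regardless of dimensionality.
inline int globalThreadIndex1D() { return 0; }
inline int globalThreadIndex3D() { return 0; }
#endif

} // namespace hemi
```

Keeping the accessors separate means existing 1D callers of `globalThreadIndex()` would be unaffected, which addresses the "common case is 1D" concern.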
HEMI_DEV_CALLABLE_INLINE
int globalThreadCount() {
#ifdef HEMI_DEV_CODE
    return blockDim.x * gridDim.x * blockDim.y * gridDim.y * blockDim.z * gridDim.z;
See my comment about globalThreadIndex above.
@@ -0,0 +1,26 @@
///////////////////////////////////////////////////////////////////////////////
I would consider renaming this hemi.h and renaming hemi.h hemi_core.h... (So the convenient default is to include everything, but if you want to pare it down you can).
hemi/launch.h
Outdated
// Automatic parallel launch
template <typename Function, typename... Arguments>
void launch(Arguments... args);
What is a launch without a kernel?
hemi/launch.h
Outdated
// Launch function with an explicit execution policy / configuration
template <typename Function, typename... Arguments>
void launch(const ExecutionPolicy &p, Arguments... args);
//
// Automatic parallel launch
//
template <typename Function, typename... Arguments>
How does one pass a concrete function reference to this launch()? Can you show a working example?
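One plausible answer to the reviewer's question, given the PR's signature (where `Function` is a template parameter rather than a function argument), is that the caller names a default-constructible functor type explicitly. The mock below is a host-only illustration, not the PR's implementation; `ScaleKernel` is a hypothetical example functor.

```cpp
#include <cassert>
#include <cstddef>

namespace hemi {

// Mock of the PR's launch: Function is supplied explicitly at the call
// site, e.g. hemi::launch<ScaleKernel>(args...). This host path simply
// default-constructs and invokes the functor; the device path would
// configure a grid and launch Kernel<<<...>>>(kernel, args...).
template <typename Function, typename... Arguments>
void launch(Arguments... args) {
    Function kernel;   // kernel type must be default-constructible
    kernel(args...);
}

} // namespace hemi

// A concrete kernel functor for illustration: scales an array in place.
struct ScaleKernel {
    void operator()(float* data, std::size_t n, float s) const {
        for (std::size_t i = 0; i < n; ++i) data[i] *= s;
    }
};
```

A call site would then read `hemi::launch<ScaleKernel>(data, n, 2.0f);`, making the kernel explicit even though it never appears in the argument list.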
ExecutionPolicy p = policy;
if (p.getLocation() == hemi::device && queryForDevice())
{
    checkCuda(configureGrid(p, Kernel<Function, Arguments...>));
Please fix spacing ("Kernel << <" --> "Kernel<<<").