VTK-m: perform CUDA kernel scheduling parameter sweep studies
Capturing discussion found at: !1576 (merged)
Tasks:
- Refactor GetGridsAndBlocks to call a function that computes the 1D or 3D grid and block values. This function would replace getNumSMs but use the same call_once caching strategy.
- Refactor the 3D GetGridsAndBlocks to use the computed values for the special cases (flat in X, small grids).
- Add environment controls to allow for parameter sweeps. These would go into the function that replaces getNumSMs so that the environment can control all of the values.
- Remove cuda/internal/TaskTuner.h and related code, which isn't needed now that we have environment controls.
- Do parameter sweep studies on all hardware we have access to, and build out a spreadsheet of the values that work best for each hardware component.
- Add these computed values to the function that computes the 1D/3D grid and block values.
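The first and third tasks could be sketched roughly as follows. This is only an illustration under stated assumptions: the struct `ScheduleParams`, the helper `EnvOr`, the accessor `GetScheduleParams`, the default numbers, and every `VTKM_CUDA_*` variable name are hypothetical and do not exist in VTK-m today. The only parts taken from the tasks above are the ideas of computing the values once behind `std::call_once` and letting the environment override them.

```cpp
#include <cassert>
#include <cstdlib>
#include <mutex>

// Hypothetical bundle of the values GetGridsAndBlocks would consume.
struct ScheduleParams
{
  int one_d_blocks_per_sm;
  int one_d_threads_per_block;
  int three_d_blocks_per_sm;
  int three_d_threads_per_block[3];
};

// Return the positive integer stored in environment variable `name`,
// or `fallback` when the variable is unset, empty, or not a sane number.
inline int EnvOr(const char* name, int fallback)
{
  assert(fallback > 0);
  const char* text = std::getenv(name);
  if (text == nullptr || *text == '\0')
  {
    return fallback;
  }
  char* end = nullptr;
  const long value = std::strtol(text, &end, 10);
  const bool valid = (*end == '\0') && value > 0 && value <= 65535;
  return valid ? static_cast<int>(value) : fallback;
}

// Compute the parameters on the first call and cache them for all later calls.
inline const ScheduleParams& GetScheduleParams()
{
  static std::once_flag computed;
  static ScheduleParams params;
  std::call_once(computed, []() {
    // Assumed defaults; a real version would start from per-hardware values.
    params = ScheduleParams{ 32, 128, 32, { 4, 4, 4 } };
    // Hypothetical variable names, one per tunable value.
    params.one_d_blocks_per_sm =
      EnvOr("VTKM_CUDA_1D_BLOCKS_PER_SM", params.one_d_blocks_per_sm);
    params.one_d_threads_per_block =
      EnvOr("VTKM_CUDA_1D_THREADS_PER_BLOCK", params.one_d_threads_per_block);
    params.three_d_blocks_per_sm =
      EnvOr("VTKM_CUDA_3D_BLOCKS_PER_SM", params.three_d_blocks_per_sm);
    params.three_d_threads_per_block[0] =
      EnvOr("VTKM_CUDA_3D_THREADS_X", params.three_d_threads_per_block[0]);
    params.three_d_threads_per_block[1] =
      EnvOr("VTKM_CUDA_3D_THREADS_Y", params.three_d_threads_per_block[1]);
    params.three_d_threads_per_block[2] =
      EnvOr("VTKM_CUDA_3D_THREADS_Z", params.three_d_threads_per_block[2]);
  });
  return params;
}
```

Because the values are read only once per process, a sweep script would set the variables before launching each benchmark run rather than changing them mid-run.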
A rough idea on how we can encode the best grid and block values is (https://godbolt.org/z/8aqMWM):
#include <vector>

// GPU generation and product tier used to key the scheduling presets.
enum struct GPU_ARCH { OTHER, PASCAL, VOLTA, TURING };
enum struct GPU_STRATA { ANY, CONSUMER, WORKSTATION, HPC };

// One row of scheduling values for an (architecture, strata) pair.
struct Presets
{
  GPU_ARCH architecture;
  GPU_STRATA strata;
  int one_d_blocks_per_sm;
  int one_d_grids_per_block;
  int three_d_blocks_per_sm;
  int three_d_grids_per_block[3];
};

// Fill `presets` with the table of known values. The numbers are illustrative
// until the parameter sweep studies provide measured ones.
void BuildSchedulingPresets(std::vector<Presets>& presets)
{
  presets = std::vector<Presets>{
    { GPU_ARCH::OTHER, GPU_STRATA::ANY, 32, 128, 32, { 4, 4, 4 } },
    { GPU_ARCH::PASCAL, GPU_STRATA::ANY, 32, 128, 32, { 4, 4, 4 } },
    { GPU_ARCH::PASCAL, GPU_STRATA::WORKSTATION, 32, 256, 32, { 8, 8, 8 } },
    { GPU_ARCH::PASCAL, GPU_STRATA::HPC, 128, 512, 128, { 8, 8, 8 } },
    { GPU_ARCH::VOLTA, GPU_STRATA::ANY, 32, 128, 32, { 4, 4, 4 } },
    { GPU_ARCH::VOLTA, GPU_STRATA::WORKSTATION, 32, 256, 32, { 8, 8, 8 } },
    { GPU_ARCH::VOLTA, GPU_STRATA::HPC, 128, 512, 128, { 8, 8, 8 } },
    { GPU_ARCH::TURING, GPU_STRATA::ANY, 32, 512, 21, { 16, 16, 16 } },
  };
}
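One way the preset table could be consulted is sketched below. It repeats the enum and struct declarations only so that it compiles on its own. The helpers `ArchFromComputeCapability` and `FindPreset` are hypothetical names, and the fallback order (exact match, then the architecture's ANY row, then the OTHER/ANY row) is an assumption about how the table is meant to be read, not something the discussion settled. The compute capability mapping (6.x Pascal, 7.0 and 7.2 Volta, 7.5 Turing) follows NVIDIA's published numbering. How the strata would be detected is left open.

```cpp
#include <cassert>
#include <vector>

enum struct GPU_ARCH { OTHER, PASCAL, VOLTA, TURING };
enum struct GPU_STRATA { ANY, CONSUMER, WORKSTATION, HPC };

struct Presets
{
  GPU_ARCH architecture;
  GPU_STRATA strata;
  int one_d_blocks_per_sm;
  int one_d_grids_per_block;
  int three_d_blocks_per_sm;
  int three_d_grids_per_block[3];
};

// Hypothetical helper: map a CUDA compute capability to a table key.
inline GPU_ARCH ArchFromComputeCapability(int major, int minor)
{
  if (major == 6)
  {
    return GPU_ARCH::PASCAL;
  }
  if (major == 7)
  {
    return (minor >= 5) ? GPU_ARCH::TURING : GPU_ARCH::VOLTA;
  }
  return GPU_ARCH::OTHER;
}

// Hypothetical helper: pick the most specific row for (arch, strata).
// Priority: exact match > (arch, ANY) > (OTHER, ANY) > first row.
inline const Presets& FindPreset(const std::vector<Presets>& presets,
                                 GPU_ARCH arch,
                                 GPU_STRATA strata)
{
  assert(!presets.empty());
  const Presets* best = &presets.front();
  int bestScore = -1;
  for (const Presets& p : presets)
  {
    int score = -1;
    if (p.architecture == arch && p.strata == strata)
    {
      score = 2;
    }
    else if (p.architecture == arch && p.strata == GPU_STRATA::ANY)
    {
      score = 1;
    }
    else if (p.architecture == GPU_ARCH::OTHER && p.strata == GPU_STRATA::ANY)
    {
      score = 0;
    }
    if (score > bestScore)
    {
      best = &p;
      bestScore = score;
    }
  }
  return *best;
}
```

With this ordering, a Turing card of any strata would resolve to the single TURING/ANY row, and an architecture missing from the table would fall back to OTHER/ANY.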