dim3 block(4, 2);
dim3 grid((nx+block.x-1)/block.x, (ny+block.y-1)/block.y);
I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication.
nx is the number of columns and
ny is the number of rows.
Can you explain how the grid size is computed? Why is
block.x added to
nx, and why is 1 then subtracted?
There is a preview (https://books.google.com/books?id=_Z7rnAEACAAJ&printsec=frontcover#v=onepage&q&f=false) but page 53 is missing.
Best How To :
This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") that completely cover the desired input. This could be expressed as
ceil(nx/block.x) using real-valued division, that is, figure out how many blocks are needed to cover the desired size, then round up.
But a full floating-point division followed by ceil is more expensive than necessary. Instead, since C integer division truncates toward zero (which is a "floor" for non-negative operands), you can add the divisor - 1 before dividing to get the effect of a "ceiling" operation.
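The rounding trick can be sketched in plain C, which also compiles as CUDA host code. The helper names `ceil_div` and `ceil_div_rem` are made up for illustration and do not come from the book. The sketch assumes a positive divisor and unsigned operands, as with the fields of `dim3`.

```c
/* Hypothetical helper: smallest number of blocks of size d that cover n items.
   Assumes d > 0 and that n + d - 1 does not overflow unsigned int. */
static unsigned int ceil_div(unsigned int n, unsigned int d)
{
    return (n + d - 1) / d;
}

/* Equivalent form without the addition: take the floor,
   then add one block if there is a remainder. */
static unsigned int ceil_div_rem(unsigned int n, unsigned int d)
{
    return n / d + (n % d != 0);
}
```

Both functions return the same value for any input that satisfies the assumptions. The second form is only there as a cross-check; the first is the one that matches the expression in the question.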
Try a few examples: If
nx = 10, then
nx + block.x - 1 is 13, and by integer division 13/4 is 3, so you need 3 blocks of size 4.
As you noted in the comment, the +block.x pushes the floor up to a ceiling, and the -1 handles sizes that are exact multiples of the divisor. For example, (12 + 4)/4 would be 4, when what we actually want is (12 + 4 - 1)/4, which is 3.
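To check both dimensions at once, here is a small sketch of the grid computation from the question. `struct dim2` is a hypothetical stand-in for CUDA's `dim3` so that the code builds as plain C, and `grid_for` and `blocks_without_minus_one` are illustrative names, not part of any API.

```c
/* Hypothetical stand-in for CUDA's dim3 (only x and y are needed here). */
struct dim2 { unsigned int x, y; };

/* Grid dimensions for an nx-by-ny problem, rounding up in each dimension.
   Assumes block.x > 0 and block.y > 0. */
static struct dim2 grid_for(unsigned int nx, unsigned int ny, struct dim2 block)
{
    struct dim2 grid;
    grid.x = (nx + block.x - 1) / block.x;
    grid.y = (ny + block.y - 1) / block.y;
    return grid;
}

/* What happens if the -1 is left out: sizes that are exact multiples
   of the block size get one block too many. */
static unsigned int blocks_without_minus_one(unsigned int n, unsigned int d)
{
    return (n + d) / d;
}
```

With a 4 by 2 block, `grid_for(10, 5, block)` gives a 3 by 3 grid, which spans 12 by 6 threads and so slightly overshoots the 10 by 5 elements. `grid_for(12, 4, block)` gives exactly 3 by 2, whereas `blocks_without_minus_one(12, 4)` would ask for 4 blocks in x.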