I would like to calculate the sum of all columns and the sum of all rows of a matrix in CUDA. One way of doing that is to use the `SGEMV` subroutine from BLAS, multiplying the matrix by a vector of 1s.

However, this leads to **two** scans of the matrix, assuming it is much bigger than the L1 cache: one for the rows and another for the columns. Additionally, I plan to further modify the code for other operators, which is why I am writing my own kernel.

My approach so far has been to break the matrix into submatrices of size `32 x 32`. Each thread block loads such a submatrix into shared memory, calculates the sums of the rows and columns of the submatrix, and adds them atomically to the appropriate output (`row` and `col` below). This way, the matrix data only needs to be read from VRAM once.

For simplicity, the code assumes that the matrix is `n x n`, that `n % 32 == 0`, and that the thread block is `32 x 32`:

```
__global__ void sum_cols_and_rows(size_t n, const float* matrix, float* col, float* row)
{
    __shared__ float sh[32][32];
    size_t x = blockDim.x * blockIdx.x + threadIdx.x;
    size_t y = blockDim.y * blockIdx.y + threadIdx.y;
    float sum = matrix[x + n * y];
    sh[threadIdx.x][threadIdx.y] = sum;
    // Warp-level reduction along x: each warp produces a partial row sum.
    for (unsigned w = 16; w >= 1; w /= 2)
        sum += __shfl_down_sync(0xffffffff, sum, w); // plain __shfl_down before CUDA 9
    const size_t laneID = threadIdx.x & 0x1f; // == threadIdx.x, since blockDim.x == 32
    if (laneID == 0)
        atomicAdd(row + y, sum);
    __syncthreads(); // the whole tile must be in shared memory before the transposed read
    sum = sh[threadIdx.y][threadIdx.x]; // swapped indexes: each warp now reduces a column
    for (unsigned w = 16; w >= 1; w /= 2)
        sum += __shfl_down_sync(0xffffffff, sum, w);
    if (laneID == 0)
        atomicAdd(col + blockDim.x * blockIdx.x + threadIdx.y, sum);
}
// launch (the shared memory is statically allocated, so no dynamic-size argument is needed):
sum_cols_and_rows<<<dim3(n/32, n/32), dim3(32, 32)>>>(n, matrix, col, row);
```

However, the performance is rather disappointing. I am seeing about 20% of the theoretical 224 GB/s memory bandwidth on a GTX 980, even on large matrices, *e.g.* 16384x16384.

Is there any way to bring this approach closer to the theoretical bandwidth limit?