What I want to do is feed in my m x n matrix, and in parallel, construct n square diagonal matrices for each column of the matrix, perform an operation on each square diagonal matrix, and then recombine the result. How do I do this?

So far, I start of with an m x n matrix; the result from a previous matrix computation where each element is calculated using the function y = f(g(x)).

This gives me a matrix with n column elements [f1, f2...fn] where each fn represents a column vector of height m.

From here, I want to differentiate each column of the matrix with respect to g(x). Differentiating fn(x) w.r.t. g(x) results in a square matrix with elements f'(x). Under constraint, this square matrix reduces to a Jacobian with the elements of each row along the diagonal of the square matrix, and equal to fn', all other elements equaling zero.

Hence the reason why it is necessary to construct the diagonal for each of the vector rows fn.

To do this, I take a target vector defined as A(hA x 1) which was extracted from the larger A(m x n) matrix. I then prepared a zeroed matrix defined as C(hA x hA) which will be used to hold the diagonals.

The aim is to diagonalize the vector A into a square matrix with each element of A sitting on the diagonal of C, everything else being zero.

There are probably more efficient ways to accomplish this using some pre-built routine without building a whole new kernel, but please be aware that for these purposes, this method is necessary.

The kernel code (which works) to accomplish this is shown here:

```
_cudaDiagonalizeTest << <5, 1 >> >(d_A, matrix_size.uiWA, matrix_size.uiHA, d_C, matrix_size.uiWC, matrix_size.uiHC);
__global__ void _cudaDiagonalizeTest(float *A, int wA, int hA, float *C, int wC, int hC)
{
int ix, iy, idx;
ix = blockIdx.x * blockDim.x + threadIdx.x;
iy = blockIdx.y * blockDim.y + threadIdx.y;
idx = iy * wA + ix;
C[idx * (wC + 1)] = A[idx];
}
```

I am a bit suspicious that this is a very naive approach to a solution and was wondering if someone could give an example of how I could do the same using

a) reduction

b) thrust

For vectors of large row size, I would like to be able to use the GPU's multithreading capabilities to chunk the task into small jobs, and combine each result at the end with __syncthreads().

The picture below shows what the desired result is.

I have read NVIDIA's article on reduction, but did not manage to achieve the desired results.

Any assistance or explanation would be very much welcomed.

Thanks.

Matrix A is the target with 4 columns. I want to take each column, and copy its elements into Matrix B as a diagonal, iterating through each column.