I'm working on a piece of CUDA C++ code in which I essentially need to access a 2D array in global memory in BOTH row-major AND column-major order. Specifically, I need each thread block to:
- Generate its own 1D array (let's say with gridDim elements, i.e. one per block)
- Write that array to global memory
- Read the n-th element of each written array, where n is the block ID.
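To make the pattern concrete, here is a minimal sketch of what I mean. The kernel names (`writeRows`, `readColumns`), the `colSum` reduction, and the sizes are just placeholders I made up for illustration, not my real code. I've assumed a 1D grid and split the work into two launches, since as far as I know blocks can't synchronize with each other inside a single kernel on this card.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Steps 1 and 2: block b generates its own array and writes it as row b.
// Consecutive threads write consecutive addresses, so this should coalesce.
__global__ void writeRows(float *data, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        data[blockIdx.x * n + i] = static_cast<float>(blockIdx.x * n + i);
}

// Step 3: block b reads element b of every row, i.e. column b.
// Consecutive threads are n floats apart, so each load touches a
// different region of memory (this is the access I'm worried about).
__global__ void readColumns(const float *data, float *colSum, int n)
{
    float partial = 0.0f;
    for (int row = threadIdx.x; row < n; row += blockDim.x)
        partial += data[row * n + blockIdx.x];
    // Placeholder "use" of the values: sum the column.
    atomicAdd(&colSum[blockIdx.x], partial);
}

int main()
{
    const int n = 64;        // assumed: number of blocks == elements per array
    const int threads = 32;  // assumed: one warp per block

    float *data = nullptr, *colSum = nullptr;
    cudaMalloc(&data, n * n * sizeof(float));
    cudaMalloc(&colSum, n * sizeof(float));
    cudaMemset(colSum, 0, n * sizeof(float));

    // Launches on the same stream run in order, so the reads see the writes.
    writeRows<<<n, threads>>>(data, n);
    readColumns<<<n, threads>>>(data, colSum, n);

    std::vector<float> host(n);
    cudaMemcpy(host.data(), colSum, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("colSum[0] = %.0f, colSum[%d] = %.0f\n", host[0], n - 1, host[n - 1]);

    cudaFree(data);
    cudaFree(colSum);
    return 0;
}
```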
The way I see it, only the write OR the read can be coalesced; the other will access a separate cache line for each element and perform terribly. I've read that texture memory has a 2D caching mechanism, but I don't know whether it can be used to improve this situation.
BTW, I am using a GTX 770, so it's a GK104 Kepler card with compute capability 3.0.
Any help or advice would be greatly appreciated! Thanks.