This blog post explains how memory bank conflicts kill the transpose function's performance.
Now I can't help but wonder: does the same happen on a "normal" CPU (in a multithreaded context)? Or is this specific to CUDA/OpenCL? Or does it not even appear on modern CPUs because of their relatively large cache sizes?
Best How To :
Bank conflicts have existed since the earliest vector-processing CPUs of the 1960s. They are caused by interleaved memory or multi-channel memory access.
Interleaved memory access, or multi-channel memory access (MCMA), works around slow RAM by spreading consecutive words of memory across different banks or channels, so that accesses to them can overlap. But there is a side effect: repeated accesses to the same bank take longer than accesses to adjacent banks, because the bank must finish one access before it can start the next.
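To make the side effect concrete, here is a rough toy model in Python. The bank count, the busy time, and the function name access_cycles are all made up for illustration and do not describe any real machine; the only point is that a stride equal to the number of banks keeps hitting the same bank.

```python
def access_cycles(addresses, num_banks=8, busy_cycles=4):
    """Count cycles needed to issue word accesses, in order, on a toy interleaved memory.

    Assumptions (hypothetical, for illustration only):
    - word address a lives in bank a % num_banks
    - at most one access is issued per cycle
    - a bank stays busy for busy_cycles after accepting an access
    """
    ready_at = [0] * num_banks  # cycle at which each bank is free again
    cycle = 0
    for addr in addresses:
        bank = addr % num_banks
        cycle = max(cycle, ready_at[bank])  # stall while the bank is still busy
        ready_at[bank] = cycle + busy_cycles
        cycle += 1  # the next access can issue one cycle later at the earliest
    return cycle


# 64 consecutive words: accesses rotate through all 8 banks, so nothing stalls.
unit_stride = access_cycles(range(0, 64))

# 64 words with stride 8: every access lands in bank 0 and waits for it.
bank_stride = access_cycles(range(0, 64 * 8, 8))

print(unit_stride, bank_stride)
```

In this model the strided walk takes roughly busy_cycles times as long as the sequential one, even though both touch the same number of words.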
From Wikipedia on the 1980s Cray-2 (http://en.wikipedia.org/wiki/Cray-2):
"Main memory banks were arranged in quadrants to be accessed at the same time, allowing programmers to scatter their data across memory to gain higher parallelism. The downside to this approach is that the cost of setting up the scatter/gather unit in the foreground processor was fairly high. Stride conflicts corresponding to the number of memory banks suffered a performance penalty (latency) as occasionally happened in power-of-2 FFT-based algorithms. As the Cray 2 had a much larger memory than Cray 1's or X-MPs, this problem was easily rectified by adding an extra unused element to an array to spread the work out"