I'm just looking at the following output and trying to wrap my mind around the numbers:
==2906== Profiling result: Time(%) Time Calls Avg Min Max Name 23.04% 10.9573s 16436 666.67us 64.996us 1.5927ms sgemm_sm35_ldg_tn_32x16x64x8x16 22.28% 10.5968s 14088 752.18us 612.13us 1.6235ms sgemm_sm_heavy_nt_ldg 18.09% 8.60573s 14088 610.86us 513.05us 1.2504ms sgemm_sm35_ldg_nn_128x8x128x16x16 16.48% 7.84050s 68092 115.15us 1.8240us 503.00us void axpy_kernel_val<float, int=0>(cublasAxpyParamsVal<float>) ... 0.25% 117.53ms 4744 24.773us 896ns 11.803ms [CUDA memcpy HtoD] 0.23% 107.32ms 37582 2.8550us 1.8880us 8.9556ms [CUDA memcpy DtoH] ... ==2906== API calls: Time(%) Time Calls Avg Min Max Name 83.47% 41.8256s 42326 988.18us 16.923us 13.332ms cudaMemcpy 9.27% 4.64747s 326372 14.239us 10.846us 11.601ms cudaLaunch 1.49% 745.12ms 1502720 495ns 379ns 1.7092ms cudaSetupArgument 1.37% 688.09ms 4702 146.34us 879ns 615.09ms cudaFree ...
When it comes to optimizing memory access, what are the numbers I really need to look at when comparing different implementations? It first looks like
memcpy only takes
117.53+107.32ms (in both directions), but then there is this API call
41.8256s, which is much more. Also, the min/avg/max columns don't add up between the upper and the lower output block.
Why is there a difference and what is the "true" number that is important for me to optimize the memory transfer?
EDIT: second question is: is there a way to figure out who is calling e.g.
axpy_kernel_val (and how many times)?