I have a problem with the shuffle instruction in CUDA 5.0.
This is a snippet of my kernel; it sits inside a loop. The printf calls are there only for debugging, because I can't use an ordinary debugger:
...
tex_val = tex2D(srcTexRef, threadIdx.x + w, y_pos);
if (threadIdx.x == 0)
{
    left = left_value[y_pos];
}
else
{
    printf("thread %d; shfl value: %f \n", threadIdx.x, __shfl_up(value, 1));
    left = __shfl_up(value, 1);
}
printf("thread %d; value: %f; tex_val: %f; left: %f \n", threadIdx.x, value, tex_val, left);
...
From that I get this output:
l0: ITERATION 1
l1: thread 0; value: 0; tex_val: 1; left: 4
l2:
l3: ITERATION 2
l4: thread 1; shfl value: 0
l5: thread 0; value: 5; tex_val: 1; left: 5
l6: thread 1; value: 0; tex_val: 1; left: 0
l7:
l8: ITERATION 3
l9: thread 1; shfl value: 0
l10: thread 2; shfl value: 1
l11: thread 0; value: 6; tex_val: 1; left: 6
l12: thread 1; value: 1; tex_val: 1; left: 0
l13: thread 2; value: 2; tex_val: 1; left: 1
...
From the output I can see that thread 1 never gets the value from thread 0 in any iteration, even though thread 0 clearly has a value (line 4: the shfl value is 0; line 5: the value is 5). Threads 2 and higher do get the value from the next lower thread. Where am I making a mistake? Is this happening because of the branching?
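For comparison, here is a minimal sketch of the variant I would test against, assuming the divergent branch really is the cause. As far as I understand the CUDA programming guide, a shuffle can only read from a lane that is actively taking part in the same shuffle, and the result is undefined otherwise. In the snippet above, thread 0 is in the if branch while the other threads shuffle in the else branch. The kernel name, the lane-index test values, and the one-element left_value array are hypothetical stand-ins, not my real kernel. It uses the CUDA 5.0-era __shfl_up intrinsic and needs compute capability 3.0 or higher (for example, compiling with -arch=sm_30); newer toolkits use __shfl_up_sync with an explicit mask instead.

```cuda
#include <cstdio>

// Hypothetical stand-in kernel (not the original one): every thread's value
// is simply its lane index, and left_value holds a single boundary value.
__global__ void shuffle_outside_branch(const float *left_value, float *out)
{
    float value = static_cast<float>(threadIdx.x);

    // All threads of the warp execute the shuffle, thread 0 included,
    // so the source lane is active at the moment it is read.
    float shifted = __shfl_up(value, 1);

    // Only the selection of the result depends on the thread index;
    // the shuffle itself is no longer inside a divergent branch.
    float left = (threadIdx.x == 0) ? left_value[0] : shifted;

    out[threadIdx.x] = left;
}

int main()
{
    const int n = 32;                 // one full warp
    float h_left = 4.0f;              // arbitrary boundary value for thread 0
    float h_out[n];
    float *d_left = NULL;
    float *d_out = NULL;

    cudaMalloc(&d_left, sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_left, &h_left, sizeof(float), cudaMemcpyHostToDevice);

    shuffle_outside_branch<<<1, n>>>(d_left, d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Print the first few lanes to compare with the output above.
    for (int i = 0; i < 4; ++i)
        printf("thread %d; left: %f\n", i, h_out[i]);

    cudaFree(d_left);
    cudaFree(d_out);
    return 0;
}
```

If thread 1 receives thread 0's value in this form but not in the branched form, that would point to the branching as the cause rather than to the shuffle itself.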