PCI Express is full duplex: it runs at full speed in both directions simultaneously. There should be no "deadlock" of the kind you may experience in synchronous MPI communication, which requires a handshake before proceeding. As Robert mentioned in a comment, "accessing data over PCIE bus is a lot slower than accessing it from on-board memory". However, it...
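As a minimal sketch of that full-duplex behavior (buffer names and sizes here are illustrative, not from the original answer): a host-to-device copy and a device-to-host copy issued into separate streams can overlap, since the bus carries traffic both ways at once.

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t N = 1 << 24;
    float *h_in, *h_out, *d_in, *d_out;
    cudaStream_t up, down;

    // Pinned host memory is required for truly asynchronous copies.
    cudaMallocHost(&h_in,  N * sizeof(float));
    cudaMallocHost(&h_out, N * sizeof(float));
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // The two copies travel in opposite directions over PCIe and
    // can proceed concurrently.
    cudaMemcpyAsync(d_in,  h_in,  N * sizeof(float), cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost, down);

    cudaDeviceSynchronize();
    return 0;
}
```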
So is it possible to choose the first GPU from the first host thread and the second GPU from the second host thread via a cudaSetDevice() call? Yes, it's possible. An example of this usage pattern is given in the cudaOpenMP sample code (excerpting): .... omp_set_num_threads(num_gpus); // create as...
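A sketch in the spirit of the cudaOpenMP sample (not a verbatim excerpt; the printf and modulo are illustrative): each OpenMP host thread selects its own GPU with cudaSetDevice(), after which that thread's allocations and kernel launches target that device.

```cuda
#include <cuda_runtime.h>
#include <omp.h>
#include <stdio.h>

int main()
{
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    omp_set_num_threads(num_gpus);   // one CPU thread per GPU
    #pragma omp parallel
    {
        int cpu_thread_id = omp_get_thread_num();
        // Thread 0 drives device 0, thread 1 drives device 1, and so on.
        cudaSetDevice(cpu_thread_id % num_gpus);

        int dev;
        cudaGetDevice(&dev);
        printf("CPU thread %d uses GPU %d\n", cpu_thread_id, dev);
        // ... launch kernels / allocate memory on this thread's device ...
    }
    return 0;
}
```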
You have to install the drivers for your integrated onboard GPU. This can be done by enabling the iGPU in the BIOS settings and booting with it; your PC should then be able to load the drivers it needs on its own. For my Ivy Bridge system, the BIOS steps are these: Go...
No. GPU atomics are only atomic across the GPU performing the operation. They do not work on host memory or on another device's memory. I'm sure addressing these limitations on future platforms, especially with NVLink, is a roadmap item for NVIDIA....
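An illustrative sketch of that scope rule (kernel name and launch configuration are my own, not from the answer): atomicAdd() is atomic with respect to all threads on the GPU executing it, but on these platforms it offers no atomicity guarantee toward the host or toward another GPU's memory.

```cuda
#include <cuda_runtime.h>

__global__ void count_hits(int *counter)
{
    // Safe: all contending threads run on the same GPU, and *counter
    // lives in this GPU's own global memory.
    atomicAdd(counter, 1);
}

int main()
{
    int *d_counter;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));

    count_hits<<<64, 256>>>(d_counter);
    cudaDeviceSynchronize();

    // NOT safe: issuing the same atomicAdd against zero-copy (mapped host)
    // memory, or against a peer GPU's memory, is not atomic with respect
    // to concurrent CPU accesses or the other GPU's atomics.
    cudaFree(d_counter);
    return 0;
}
```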
http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__DEVICE_g418c299b069c4803bfb7cab4943da383.html

cudaError_t cudaSetDevice ( int device )

Sets device as the current device for the calling host thread. Any device memory subsequently allocated from this host thread using cudaMalloc(), cudaMallocPitch(), or cudaMallocArray() will be physically resident on device. Any host memory allocated from this host thread using cudaMallocHost() or cudaHostAlloc()...
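A short sketch of the documented behavior (buffer names and sizes are illustrative): after cudaSetDevice(i), allocations made from the calling host thread are resident on device i, and the thread can switch devices freely.

```cuda
#include <cuda_runtime.h>

int main()
{
    float *d_buf0, *d_buf1;

    cudaSetDevice(0);                           // subsequent calls target GPU 0
    cudaMalloc(&d_buf0, 1024 * sizeof(float));  // resident on GPU 0

    cudaSetDevice(1);                           // switch the calling thread to GPU 1
    cudaMalloc(&d_buf1, 1024 * sizeof(float));  // resident on GPU 1

    // Free each buffer with its owning device current.
    cudaSetDevice(0);
    cudaFree(d_buf0);
    cudaSetDevice(1);
    cudaFree(d_buf1);
    return 0;
}
```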