I'm performing a simple test which is comparing the access latency of data allocated with malloc() and data allocated with cudaHostAlloc() from the host (the cpu is performing the accesses). I noticed that accessing data allocated with cudaHostAlloc() is much slower than accessing data allocated with malloc() on the Jetson Tk1.
This is not the case for discrete GPUs and seems to only be applicable to TK1. After some investigations, I found that data allocated with cudaHostAlloc() is memory mapped (mmap) into /dev/nvmap areas of the process address space. This is not the case for normal malloc'd data which is mapped on the process heap. I understand that this mapping might be necessary to allow the GPU to access the data since cudaHostAlloc'd data has to be visible from both the host and device.
My question is the following: Where does the overhead of accessing cudaHostAlloc'd data from the host come from? Is data mapped to /dev/nvmap uncached on the CPU caches?
Answer:
I believe I have found the reason for this behaviour. After further investigation (using Linux trace events and reading the nvmap driver code), I found that the overhead comes from the fact that data allocated with cudaHostAlloc() is marked uncacheable via the NVMAP_HANDLE_UNCACHEABLE flag: the driver calls pgprot_noncached() to ensure that the relevant PTEs are marked uncacheable.
The behaviour of host accesses to data allocated with cudaMallocManaged() is different: that data is cached (the NVMAP_HANDLE_CACHEABLE flag), so host accesses to it are as fast as accesses to malloc()'d data. It is also important to note that the CUDA runtime does not allow the device (GPU) to access cudaMallocManaged()'d data concurrently with the host; such an access would generate a segfault. The runtime does, however, allow concurrent device and host access to cudaHostAlloc()'d data, and I believe this is one of the reasons for making cudaHostAlloc()'d data uncacheable.
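The concurrency difference can be illustrated with a short sketch. This is hedged: it needs a CUDA-capable Jetson to build and run, the kernel and iteration count are arbitrary illustrations, and the "would segfault" comment reflects the managed-memory rule described above, not something this program demonstrates by itself.

```cuda
#include <cstdio>

// Toy kernel that keeps updating a mapped counter for a while.
__global__ void bump(int *p) {
    for (int i = 0; i < 1000000; ++i)
        atomicAdd(p, 1);
}

int main() {
    int *h = nullptr, *d = nullptr;
    // Zero-copy allocation: visible to both host and device.
    cudaHostAlloc(&h, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d, h, 0);
    *h = 0;
    bump<<<1, 1>>>(d);
    // Legal: the host may read h while the kernel is still running,
    // because the buffer is uncached and kept coherent. With
    // cudaMallocManaged'd memory on the TK1, touching the buffer at
    // this point would segfault instead.
    printf("observed while kernel runs: %d\n", *h);
    cudaDeviceSynchronize();
    printf("final: %d\n", *h);
    cudaFreeHost(h);
    return 0;
}
```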