Menu
  • HOME
  • TAGS

Is there any way I can have a barrier within Device code that is controlled by Host?

Tag: cuda,barrier

For example, my code is something like this (but it doesn't work and the kernel stalls):

__device__ __managed__ int x;

__global__ void kernel() {

    // do something 

    while(x == 1); // a barrier 

    // do the rest 
}

int main() {
    x = 1;
    kernel<<< 1, 1 >>>();
    x = 0;

    //...
}

Is there anyway I can do this?

Best How To :

You cannot do this with the current implementation of managed memory because managed memory requires exclusive access to managed data by the device, when kernels are running. Host access to managed data during the time when kernels are running will lead to undefined behavior, typically seg fault.

This should be possible using zero-copy techniques, however, including the volatile recommendation from @Cicada.

Here's a worked example:

$ cat t736.cu
#include <stdio.h>
#include <unistd.h>

__global__ void mykernel(volatile int *idata, volatile int *odata){

  *odata = *idata;
  while (*idata == 1);
  *odata = *idata+5;
}

int main(){

  int *idata, *odata;

  cudaHostAlloc(&idata, sizeof(int), cudaHostAllocMapped);
  cudaHostAlloc(&odata, sizeof(int), cudaHostAllocMapped);

  *odata = 0;
  *idata = 1;  // set barrier
  mykernel<<<1,1>>>(idata, odata);
  sleep(1);
  printf("odata = %d\n", *odata); // expect this to be 1
  *idata = 0; // release barrier
  sleep(1);
  printf("odata = %d\n", *odata); // expect this to be 5
  cudaDeviceSynchronize(); // if kernel is hung, we will hang
  return 0;
}


$ nvcc -o t736 t736.cu
$ cuda-memcheck ./t736
========= CUDA-MEMCHECK
odata = 1
odata = 5
========= ERROR SUMMARY: 0 errors
$

The above assumes a linux 64 bit environment.

NVCC CUDA cross compiling cannot find “-lcudart”

linux,cuda,ld,nvcc

It turns out that the CUDA installer I was using from NVIDIA will not allow me to cross compile for my CARMA board, but it has to be downloaded from the manufacturer SECO.

Update a D3D9 texture from CUDA

c#,cuda,sharpdx,direct3d9,managed-cuda

As hinted by the commenter, I’ve tried creating a single instance of CudaDirectXInteropResource along with the D3D texture. It worked. It’s counter-intuitive and undocumented, but it looks like cuGraphicsUnregisterResource destroys the newly written data. At least on my machine with GeForce GTX 960, Cuda 7.0 and Windows 8.1 x64. So,...

nvcc/CUDA 6.5 & c++11(future) - gcc 4.4.7

c++11,gcc,cuda,future

C++11 support is added officially in CUDA 7.0. And you need GCC version 4.7 or later to have C++11 support. See details here: http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-compiler-new-features

Linear algebra libraries and dynamic parallelism in CUDA

cuda,gpu,gpgpu

CUBLAS library functions can be called from device code. Thrust algorithms can be called from device code. Various CURAND functions can be called from device code. Other libraries that are part of the CUDA toolkit at this time (i.e. CUDA 7) -- CUFFT, CUSPARSE, CUSOLVER -- can only be used...

How does CUDA's cudaMemcpyFromSymbol work?

cuda

I believe the details are that for each __device__ variable, cudafe creates a normal global variable as in C and also a CUDA-specific PTX variable. The global C variable is used so that the host program can refer to the variable by its address, and the PTX variable is used...

thrust exception bulk_kernel_by_value in transform_reduce

c++,c++11,cuda

The problem is that f_obj within struct FuncEvalF is a const FunctionObj<T>&. It is instantiated as a temporary on the host FunctionObj<float>(), but the reference to it is not valid anymore later on. One way to fix this is to create a copy of it instead of holding a reference...

Stream compaction with Thrust; best practices and fastest way?

c++,cuda,gpgpu,thrust,sparse-array

What you have appeared to have overlooked is that copy_if returns an iterator which points to the end of the copied data from the stream compaction operation. So all that is required is this: //copies to device thrust::device_vector<int> d_src = h_src; //Result vector thrust::device_vector<int> d_res(d_src.size()); //Copy non-zero elements from d_src...

cudaMalloc vs cudaMalloc3D performance for a 2D array

c,cuda

The performance difference you observe is mostly due to the increased instruction overhead in the pitched memory indexing scheme. Because your array size is a large power of two in the major direction, it is very likely that the pitched array allocated with cudaMalloc3D is the same size as the...

Building CUDA-aware openMPI on Ubuntu 12.04 cannot find cuda.h

ubuntu,cuda,mpi

OK, I think I fixed the problem. The conftest.c seems to be looking for cuda.h in /usr/include, instead of the supposed /usr/local/cuda/include. The problem is solved once I created a soft link of cuda.h and cuda_runtime_api.h.

Understanding Dynamic Parallelism in CUDA

multithreading,cuda

Say I launch a child grid from one thread in a block at threadIdx.x==0. Can I assume that all other threads in the parent grid have finished executing up to the point I launched the child grid as well? No. You can make no assumptions about the state of...

Do I need to free device_ptr returned by thrust?

c++,pointers,cuda,thrust

thrust::min_element returns an iterator. You should not free it directly.

OpenCL barrier of finding max in a block

parallel-processing,max,opencl,gpgpu,barrier

Why the unroll part do not need to sync thread after each step is done? The sample is incorrect, a barrier is indeed required after each step. It looks like the sample is written in warp-synchronous style, which is a way of exploiting the underlying execution mechanism of the...

Understanding Memory Replays and In-Flight Requests

caching,cuda

Effective load throughput is not the only metric that determines the performance of your kernel! A kernel with perfectly coalesced loads will always have a lower effective load throughput than the equivalent, non coalesced kernel, but that alone says nothing about its execution time: in the end, the one metric...

Threads syncronization in CUDA

c++,multithreading,cuda

You can use a simple loop, and specify the threads you want to do the work in each iteration. Something like: for (int z = 0; z < zmax; z++) { if (threadIdx.z == z) { //do whatever with x and y } __syncthreads(); } In each iteration, threads with...

How can I pass a struct to a kernel in JCuda

java,struct,cuda,jni,jcuda

(The author of JCuda here (not "JCUDA", please)) As mentioned in the forum post linked from the comment: It is not impossible to use structs in CUDA kernels and fill them from JCuda side. It is just very complicated, and rarely beneficial. For the reason of why it is rarely...

How many parallel threads i can run on my nvidia graphic card in cuda programming?

cuda

That depends on the CUDA Version used I think. Compute capability(version) V1.0 V1.2 V2.x V3.0-X.X Maximum number of resident threads per multiprocessor 768 1024 1536 2048 Amount of local memory per thread 16 KB 512 KB Maximum number of threads per block 512 1024 If found this peace of Information...

Cuda cub:Device Scan

cuda,gpu,nvcc,cub,scan

CUB is an evolving library, which means that as new features are introduced in CUDA, CUB may evolve in newer releases to take advantage of those. If you then attempt to use a newer CUB release with an older CUDA version, you may run into compatibility issues. This usage by...

cudaMemcpyToSymbol in pycuda

python,cuda,pycuda

The PyCUDA implementation directly follows the CUDA driver API, so you can use any driver API code you can find as a model, but there are two things required to make this work: Use the module function module.get_global() to retrieve the device pointer to the symbol within the compiled source...

Faster Matrix Multiplication in CUDA

c,cuda,matrix-multiplication

Firstly, be really sure this is what you want to do. Without describing the manipulations you want to do, it's hard to comment on this, but be aware that matrix multiplication is an n-cubed operation. If your manipulations are not the same complexity, chances are you'll do better simply using...

purposely causing bank conflicts for shared memory on CUDA device

cuda,gpu,shared-memory,bank-conflict

Lets walk through the ptx that it generates. //Declare some registers .reg .s32 %r<5>; .reg .s64 %rd<4>; // demoted variable .shared .align 4 .u32 _Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp; //load tid in register r1 mov.u32 %r1, %tid.x; //multiple tid*5000+5 and store in r2 mad.lo.s32 %r2, %r1, 50000, 5; //store result in shared memory st.shared.u32...

cuda device function and templates

c++,templates,cuda

As you've already pointed out, this is not correct: template<typename Op> I believe it should be: template<Op op> The following code seems to work correctly for me (removing the texturing since it's extraneous to the issue): $ cat t755.cu #include <stdio.h> #include <math.h> typedef float(*Op)(float, float); __device__ float mymax(float a,...

Is prefix scan CUDA sample code in gpugems3 correct?

cuda,gpu,nvidia,prefix-sum

It seems that you've made at least 1 error in transcribing the code from the GPU Gems 3 chapter into your kernel. This line is incorrect: temp[bi] += g_idata[ai]; it should be: temp[bi] += temp[ai]; When I make that one change to the code you have now posted, it seems...

CUDA: Group every n-th point of array passed to GPU

c++,c,arrays,cuda

So, here's the solution I came up with after simplifying my case. There was a problem with memory usage - I tried to store / read different amount of data than I claimed to use when allocating it. I hope it will be helpful for anyone in the future: #include...

How to load data in global memory into shared memory SAFELY in CUDA?

c++,cuda,shared-memory

Consider one warp of the thread block finishing the first iteration and starting the next one, while other warps are still working on the first iteration. If you don't have __syncthreads at label sync2, you will end up with this warp writing to shared memory while others are reading from...

Practice computing grid size for CUDA

cuda,nvidia

This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") that completely cover the desired input. This could be expressed as ceil(nx/block.x), that is, figure out how many blocks are needed to cover the desired size, then round up. But full floating...

Interfacing cuSOLVER-sparse using PyCUDA

python,cuda,ctypes,pycuda,cusolver

Setting descrA to ctypes.c_void_p() and replacing cusparseMatDescr_t in the cusolverSpDcsrlsvqr wrapper with ctypes.c_void_p should solve the problem.

Using a data pointer with CUDA (and integrated memory)

c++,memory-management,cuda

The pointer has to be created (i.e. allocated) with cudaHostAlloc, even on integrated systems like Jetson. The reason for this is that the GPU requires (zero-copy) memory to be pinned, i.e. removed from the host demand-paging system. Ordinary allocations are subject to demand-paging, and may not be used as zero-copy...

Can an unsigned long long int be used to store the output from clock64()?

cuda

There are various atomic functions which support atomic operations on unsigned long long int (ie. a 64-bit unsigned integer), such as atomicCAS, atomicExch and atomicAdd. And if you have a cc3.5 or higher GPU you have even more options. Referring to the documentation on clock64(): long long int clock64(); when...

cuMemcpyDtoH yields CUDA_ERROR_INVALID_VALUE

java,scala,ubuntu,cuda,jcuda

Here val deviceOutput = new CUdeviceptr() cuMemAlloc(deviceOutput, SI) you are allocating SI bytes - which is 4 bytes, as the size of one int. Writing more than 4 bytes to this device pointer will mess up things. It should be cuMemAlloc(deviceOutput, SI * numElements) And similarly, I think that the...

Running CUDA programs on Quadro K620m

cuda,nvidia

This is normal, when the driver packaged in the CUDA installer is "older" than your GPU. You should retain your current GPU driver, and go ahead with the CUDA toolkit installation, but de-select the option to install the GPU driver. Your existing driver should work fine. ...

How do you build the example CUDA Thrust device sort?

c++,visual-studio-2010,sorting,cuda,thrust

As @JaredHoberock pointed out, probably the key issue is that you are trying to compile a .cpp file. You need to rename that file to .cu and also make sure it is being compiled by nvcc. After you fix that, you will probably run into another issue. This is not...

Building a tiny R package with CUDA and Rcpp

r,cuda,rcpp

Going through your package there are multiple aspects that need to be changed. You shouldn't use a 'Makefile' but a 'Makevars' file instead to improve compatibility for multiple architecture builds. Try to follow the standard variable names (e.g. CPPC should be CXX), this makes everything play together much better. Don't...

Problems with floating-point additions. Ignoring some small values

math,cuda,floating-point

The example implicitly declares a binary floating point format, which has an arbitrary precision exponent, but only 2 bits in the mantissa. All numbers are of format 1.xx * 2^n. When one performs floating point addition, one must de-normalize or scale the arguments to have the same exponent. 0.25 =...

Tesla k20m interoperability with Direct3D 11

cuda,direct3d,tesla

No, this won't be possible. K20m can be used (with some effort) with OpenGL graphics on Linux, but at least up through windows 8.x, you won't be able to use K20m as a D3D device in Windows. The K20m does not publish a VGA classcode in PCI configuration space, which...

'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D

c,cuda

The reason the error doesn't occur on this line: REAL tmp = unew_row[j]; // no error on this line is because the compiler is optimizing that line out. It doesn't do anything useful, and so the compiler completely eliminates it. The compiler warning: xxx.cu(87): warning: variable "tmp" was declared but...

direct global memory access using cuda

c++,cuda

q1- lets say i have copy one array onto device through stream1 using cudaMemCpyAsync; would i be able to access the values of that array in different stream say 2? Yes, the array da is accessible in both kernels you have shown. However, an important question is whether or...

Best way to achieve CUDA Vector Diagonalization

matrix,cuda

I created a simple example based on thrust. It uses column-major order to store the matrices in a thrust::device_vector. It should scale well with larger row/column counts. Another approach could be based off the thrust strided_range example. This example does what you want (fill the diagonals based on the input...

Why use memset when using CUDA?

c,cuda,nvidia

This is the answer given by njuffa in the comments: ...The content of GPU memory doesn't change between invocations of the application. In case of a program failure, we would want to avoid picking up good data from a previous run, which may lead (erroneously) to a belief that the...

CUDA strange behavior accessing vector

c++,cuda

I'm pretty sure this is not correct: fftKernel<<<BLOCK_SIZE, gridSize>>>(...); The first kernel configuration parameter should be the grid dimensions. The second should be the block dimensions. I was able to reproduce the error you indicate in the code you have posted. When I reverse those two parameters: fftKernel<<<gridSize, BLOCK_SIZE>>>(...); the...

How to pass struct containing array to the kernel in CUDA?

c,arrays,cuda

We have seen three different codes in this question in the first 24 hours of its existence. This answer addresses the final evolution. The underlying problem you are having is with this type of operation: cudaMalloc(&p_gpu, sizeof(Pass)); cudaMalloc(&p_gpu -> pass, 5 * sizeof(int)); The second cudaMalloc is illegal. This is...

Access violation reading location when calling cudaMemcpy2DToArray

c++,arrays,opencv,cuda

ImgSrc_f does not point to a contiguous 512x512 chunk of memory. Try changing float *ImgSrc_f[512]; for (int i=0; i<512; i++) ImgSrc_f[i] = (float *)malloc(512 * sizeof(float)); for(int i=0;i<512;i++) for(int j=0;j<512;j++) { ImgSrc_f[i][j]=ImgSrc.at<float>(i,j); } to something like float *ImgSrc_f; ImgSrc_f = (float *)malloc(512 * 512 * sizeof(float)); for(int i=0;i<512;i++) for(int j=0;j<512;j++)...

cuda-memcheck fails to detect memory leak in an R package

r,memory-leaks,cuda,valgrind

This is not valid CUDA code: extern "C" void someCUDAcode() { int a; CUDA_CALL(cudaMalloc((void**) &a, sizeof(int))); mykernel<<<1, 1>>>(1); // CUDA_CALL(cudaFree(&a)); } When we want to do a cudaMalloc operation, we use pointers in C, not ordinary variables, like this: int *a; CUDA_CALL(cudaMalloc((void**) &a, sizeof(int))); When we want to free a...

Why does Hyper-Q selectively overlap async HtoD and DtoH transfer on my cc5.2 hardware?

cuda

What you are observing is probably an artifact of running the code on a Windows WDDM platform. The WDDM subsystem has a lot of latency which other platforms are not hampered by, so to improve overall performance, the CUDA WDDM driver performs command batching. This can interfere with the expect...

Amount of cores per SM and threads per block in CUDA

cuda

Is that better to create grids with blocks, containing 128 threads each? Will such code run faster? Optimal block size depends on the problem. It's a idea for your block size to be a multiple of the warp size. Other factors are occupancy considerations, and shared memory usage. Does...

Reduce by key on device array

cuda,parallel-processing,thrust

Thrust interprets ordinary pointers as pointing to data on the host: thrust::reduce_by_key(d_list, d_list+n, d_ones, C, D,cmp); Therefore thrust will call the host path for the above algorithm, and it will seg fault when it attempts to dereference those pointers in host code. This is covered in the thrust getting started...

cuda thrust: selective copying and resizing results

cuda,thrust

Thrust doesn't resize vectors as part of any algorithm call. The size of vectors going into a thrust algorithm will be exactly the size of those vectors coming out of the algorithm. shrink_to_fit also has no impact on a vector's size, but it may impact the capacity, which has to...

how to generalize square matrix multiplication to handle arbitrary dimensions

c,cuda,parallel-processing,matrix-multiplication

This code will work for very specific dimensions but not for others. It will work for square matrix multiplication when width is exactly equal to the product of your block dimension (number of threads - 20 in the code you have shown) and your grid dimension (number of blocks -...

What is version of cuda for nvidia 304.125

ubuntu,cuda,ubuntu-14.04,nvidia

304.xx is a driver that will support CUDA 5 and previous (does not support newer CUDA versions.) If you want to reinstall ubuntu to create a clean setup, the linux getting started guide has all the instructions needed to set up CUDA 7 if that is your intent. I believe...

CUDA cuBlasGetmatrix / cublasSetMatrix fails | Explanation of arguments

cuda,gpgpu,gpu-programming,cublas

The only actual problem in your code is here: cudaMalloc( &d_x,sizeof(d_x) ); sizeof(d_x) is just the size of a pointer. You can fix it like this: cudaMalloc( &d_x,sizeof(x) ); If you want to find out if a CUBLAS API call is failing, then you should check the return code of...

pthread_barrier_wait hangs after creation of all threads

multithreading,pthreads,barrier

The barrier expects to be waited on NUM_THREADS times, but only one thread, the main thread, actually calls pthread_barrier_wait. If you want to synchronize main with your worker threads, you'll need to initialize the barrier for NUM_WORKER_THREADS + 1....