
CUDA grid size limitations

Tag: cuda

Are there limitations as to what I can set the grid size of a CUDA kernel to be? I ran into a problem where kernels were not launching with a grid size of 33 x 33 but were able to launch when the grid size was 32 x 32. Is there any reason for this to occur? Or is it likely that changing the number of blocks from 32 x 32 to 33 x 33 broke some other constraint?

dim3 blockSize(8, 8);
dim3 gridSize(32, 32);

cudaDeviceSynchronize();
set_start<<<gridSize, blockSize>>>(some_params);

The above works.

dim3 blockSize(8, 8);
dim3 gridSize(33, 33);

cudaDeviceSynchronize();
set_start<<<gridSize, blockSize>>>(some_params);

The above does not work.

kernel & main:

__global__
void set_start(double * const H, double * const HU, double * const HV,
               double * const E, const int Na)
{
    int j = threadIdx.x + blockIdx.x * blockDim.x + 1;
    int i = threadIdx.y + blockIdx.y * blockDim.y + 1;

    if (i >= Na - 1 || j >= Na - 1)
        return;

    H[i*Na+j]  = 1.0 + exp(-100.0 * ((E[j-1] - 0.75) * (E[j-1] - 0.75) + (E[i-1] - 0.75) * (E[i-1] - 0.75)))
                     + 0.5 * exp(-100.0 * ((E[j-1] - 0.75) * (E[j-1] - 0.75) + (E[i-1] - 0.25) * (E[i-1] - 0.25)));
    HU[i*Na+j] = 0;
    HV[i*Na+j] = 0;
}

int main(int argc, char** argv)
{
    double* E_d;
    cudaMalloc(&E_d, sizeof(double) * (Nh+1));
    set_E<<<64, (Nh/64) + 1>>>(E_d, dx, Nh);

    int Na = 259;
    double *H_d, *HU_d, *HV_d, *Ht_d, *HUt_d, *HVt_d;

    cudaMalloc(&H_d , sizeof(double) * Na * Na);
    cudaMalloc(&HU_d, sizeof(double) * Na * Na);
    cudaMalloc(&HV_d, sizeof(double) * Na * Na);

    dim3 blockSize(8, 8);
    //dim3 gridSize(((Na-1)/blockSize.x) + 1, ((Na-1)/blockSize.y) + 1);
    //dim3 gridSize(33, 33);
    dim3 gridSize(32, 32);

    cudaDeviceSynchronize();
    set_start<<<blockSize, gridSize>>>(H_d, HU_d, HV_d, E_d, Na);
}

This was on CUDA 7.0.

Answer:

You have block size and grid size mixed up when calling the kernel.

set_start<<<blockSize, gridSize>>>(H_d, HU_d, HV_d, E_d, Na);

should read:

set_start<<<gridSize, blockSize>>>(H_d, HU_d, HV_d, E_d, Na);

Because of this bug you are actually trying to launch a grid of size blockSize and blocks of size gridSize. It would appear that the maximum block size on your GPU is 1024 threads, so launching blocks of 33 x 33 fails: 33 x 33 = 1089 threads exceeds that limit, while 32 x 32 = 1024 just fits.
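As a side note, launch-configuration failures like this are easy to catch with runtime error checking. A minimal sketch, assuming <cstdio> is included (the kernel and arguments are the ones from the question):

set_start<<<gridSize, blockSize>>>(H_d, HU_d, HV_d, E_d, Na);

// cudaGetLastError() reports launch failures such as an invalid
// configuration (e.g. a block larger than 1024 threads).
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

// cudaDeviceSynchronize() surfaces errors raised while the kernel runs.
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));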

Faster Matrix Multiplication in CUDA

c,cuda,matrix-multiplication

Firstly, be really sure this is what you want to do. Without a description of the manipulations you want to perform, it's hard to comment, but be aware that matrix multiplication is an n-cubed operation. If your manipulations are not of the same complexity, chances are you'll do better simply using...

CUDA: Group every n-th point of array passed to GPU

c++,c,arrays,cuda

So, here's the solution I came up with after simplifying my case. There was a problem with memory usage: I tried to store/read a different amount of data than I had claimed when allocating it. I hope this will be helpful for anyone in the future: #include...

Building a tiny R package with CUDA and Rcpp

r,cuda,rcpp

Going through your package, there are multiple aspects that need to be changed. You shouldn't use a 'Makefile' but a 'Makevars' file instead, to improve compatibility for multiple-architecture builds. Try to follow the standard variable names (e.g. CPPC should be CXX); this makes everything play together much better. Don't...

purposely causing bank conflicts for shared memory on CUDA device

cuda,gpu,shared-memory,bank-conflict

Let's walk through the PTX that it generates:

// Declare some registers
.reg .s32 %r<5>;
.reg .s64 %rd<4>;
// demoted variable
.shared .align 4 .u32 _Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp;
// load tid into register r1
mov.u32 %r1, %tid.x;
// multiply tid*50000+5 and store in r2
mad.lo.s32 %r2, %r1, 50000, 5;
// store result in shared memory
st.shared.u32...

cudaMalloc vs cudaMalloc3D performance for a 2D array

c,cuda

The performance difference you observe is mostly due to the increased instruction overhead in the pitched memory indexing scheme. Because your array size is a large power of two in the major direction, it is very likely that the pitched array allocated with cudaMalloc3D is the same size as the...

cuda-memcheck fails to detect memory leak in an R package

r,memory-leaks,cuda,valgrind

This is not valid CUDA code:

extern "C" void someCUDAcode() {
    int a;
    CUDA_CALL(cudaMalloc((void**) &a, sizeof(int)));
    mykernel<<<1, 1>>>(1);
    // CUDA_CALL(cudaFree(&a));
}

When we want to do a cudaMalloc operation, we use pointers in C, not ordinary variables, like this:

int *a;
CUDA_CALL(cudaMalloc((void**) &a, sizeof(int)));

When we want to free a...

Building CUDA-aware openMPI on Ubuntu 12.04 cannot find cuda.h

ubuntu,cuda,mpi

OK, I think I fixed the problem. conftest.c seems to be looking for cuda.h in /usr/include instead of the expected /usr/local/cuda/include. The problem was solved once I created soft links for cuda.h and cuda_runtime_api.h.

cuMemcpyDtoH yields CUDA_ERROR_INVALID_VALUE

java,scala,ubuntu,cuda,jcuda

Here

val deviceOutput = new CUdeviceptr()
cuMemAlloc(deviceOutput, SI)

you are allocating SI bytes, which is 4 bytes, the size of one int. Writing more than 4 bytes to this device pointer will mess things up. It should be

cuMemAlloc(deviceOutput, SI * numElements)

And similarly, I think that the...

thrust exception bulk_kernel_by_value in transform_reduce

c++,c++11,cuda

The problem is that f_obj within struct FuncEvalF is a const FunctionObj<T>&. It is instantiated as a temporary on the host FunctionObj<float>(), but the reference to it is not valid anymore later on. One way to fix this is to create a copy of it instead of holding a reference...

Access violation reading location when calling cudaMemcpy2DToArray

c++,arrays,opencv,cuda

ImgSrc_f does not point to a contiguous 512x512 chunk of memory. Try changing

float *ImgSrc_f[512];
for (int i = 0; i < 512; i++)
    ImgSrc_f[i] = (float *)malloc(512 * sizeof(float));

for (int i = 0; i < 512; i++)
    for (int j = 0; j < 512; j++)
        ImgSrc_f[i][j] = ImgSrc.at<float>(i,j);

to something like

float *ImgSrc_f;
ImgSrc_f = (float *)malloc(512 * 512 * sizeof(float));
for (int i = 0; i < 512; i++)
    for (int j = 0; j < 512; j++)...

How do you build the example CUDA Thrust device sort?

c++,visual-studio-2010,sorting,cuda,thrust

As @JaredHoberock pointed out, probably the key issue is that you are trying to compile a .cpp file. You need to rename that file to .cu and also make sure it is being compiled by nvcc. After you fix that, you will probably run into another issue. This is not...

Problems with floating-point additions. Ignoring some small values

math,cuda,floating-point

The example implicitly declares a binary floating point format, which has an arbitrary precision exponent, but only 2 bits in the mantissa. All numbers are of format 1.xx * 2^n. When one performs floating point addition, one must de-normalize or scale the arguments to have the same exponent. 0.25 =...

cuda thrust: selective copying and resizing results

cuda,thrust

Thrust doesn't resize vectors as part of any algorithm call. The size of vectors going into a thrust algorithm will be exactly the size of those vectors coming out of the algorithm. shrink_to_fit also has no impact on a vector's size, but it may impact the capacity, which has to...

Using a data pointer with CUDA (and integrated memory)

c++,memory-management,cuda

The pointer has to be created (i.e. allocated) with cudaHostAlloc, even on integrated systems like Jetson. The reason for this is that the GPU requires (zero-copy) memory to be pinned, i.e. removed from the host demand-paging system. Ordinary allocations are subject to demand-paging, and may not be used as zero-copy...
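To illustrate, a minimal zero-copy sketch along those lines (the buffer size N and the kernel name are placeholders, not taken from the original question):

// Enable mapped pinned allocations, then allocate the host buffer.
cudaSetDeviceFlags(cudaDeviceMapHost);
float *h_buf;
cudaHostAlloc(&h_buf, N * sizeof(float), cudaHostAllocMapped);

// Obtain a device pointer aliasing the same pinned host memory.
float *d_buf;
cudaHostGetDevicePointer(&d_buf, h_buf, 0);

my_kernel<<<grid, block>>>(d_buf);  // GPU accesses host memory directly
cudaDeviceSynchronize();            // make the GPU's writes visible

cudaFreeHost(h_buf);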

MVAPICH on multi-GPU causes Segmentation fault

cuda,mvapich2

I'm afraid MVAPICH does not yet support using multiple GPUs in the same process (source: mailing list). Advanced memory transfer operations require storing device-specific structures, so unless there is explicit support for multiple devices, I'm afraid there is no way to make your code run. On the other hand, you...

How does CUDA's cudaMemcpyFromSymbol work?

cuda

I believe the details are that for each __device__ variable, cudafe creates a normal global variable as in C and also a CUDA-specific PTX variable. The global C variable is used so that the host program can refer to the variable by its address, and the PTX variable is used...

What is version of cuda for nvidia 304.125

ubuntu,cuda,ubuntu-14.04,nvidia

304.xx is a driver that supports CUDA 5 and earlier (it does not support newer CUDA versions). If you want to reinstall Ubuntu to create a clean setup, the Linux getting started guide has all the instructions needed to set up CUDA 7, if that is your intent. I believe...

Running CUDA programs on Quadro K620m

cuda,nvidia

This is normal, when the driver packaged in the CUDA installer is "older" than your GPU. You should retain your current GPU driver, and go ahead with the CUDA toolkit installation, but de-select the option to install the GPU driver. Your existing driver should work fine. ...

Amount of cores per SM and threads per block in CUDA

cuda

Is it better to create grids with blocks containing 128 threads each? Will such code run faster? Optimal block size depends on the problem. It's a good idea for your block size to be a multiple of the warp size. Other factors are occupancy considerations and shared memory usage. Does...

direct global memory access using cuda

c++,cuda

Q1: let's say I have copied one array onto the device through stream1 using cudaMemcpyAsync; would I be able to access the values of that array in a different stream, say stream 2? Yes, the array da is accessible in both kernels you have shown. However, an important question is whether or...

Update a D3D9 texture from CUDA

c#,cuda,sharpdx,direct3d9,managed-cuda

As hinted by the commenter, I’ve tried creating a single instance of CudaDirectXInteropResource along with the D3D texture. It worked. It’s counter-intuitive and undocumented, but it looks like cuGraphicsUnregisterResource destroys the newly written data. At least on my machine with GeForce GTX 960, Cuda 7.0 and Windows 8.1 x64. So,...

Why does Hyper-Q selectively overlap async HtoD and DtoH transfer on my cc5.2 hardware?

cuda

What you are observing is probably an artifact of running the code on a Windows WDDM platform. The WDDM subsystem has a lot of latency which other platforms are not hampered by, so to improve overall performance, the CUDA WDDM driver performs command batching. This can interfere with the expect...

Reduce by key on device array

cuda,parallel-processing,thrust

Thrust interprets ordinary pointers as pointing to data on the host:

thrust::reduce_by_key(d_list, d_list+n, d_ones, C, D, cmp);

Therefore thrust will call the host path for the above algorithm, and it will seg fault when it attempts to dereference those pointers in host code. This is covered in the thrust getting started...
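For reference, a hedged sketch of the fix the guide describes, wrapping the raw device pointers so the device path is taken (the element type int is an assumption; d_list, d_ones, C, D, n and cmp come from the snippet above):

#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

// Wrapping raw device pointers tells Thrust the data lives on the device.
thrust::device_ptr<int> keys(d_list);
thrust::device_ptr<int> vals(d_ones);
thrust::device_ptr<int> keys_out(C);
thrust::device_ptr<int> vals_out(D);

thrust::reduce_by_key(keys, keys + n, vals, keys_out, vals_out, cmp);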

'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D

c,cuda

The reason the error doesn't occur on this line:

REAL tmp = unew_row[j]; // no error on this line

is that the compiler is optimizing that line out. It doesn't do anything useful, and so the compiler completely eliminates it. The compiler warning:

xxx.cu(87): warning: variable "tmp" was declared but...

Why use memset when using CUDA?

c,cuda,nvidia

This is the answer given by njuffa in the comments: ...The content of GPU memory doesn't change between invocations of the application. In case of a program failure, we would want to avoid picking up good data from a previous run, which may lead (erroneously) to a belief that the...

How to pass struct containing array to the kernel in CUDA?

c,arrays,cuda

We have seen three different versions of the code in this question in the first 24 hours of its existence. This answer addresses the final evolution. The underlying problem you are having is with this type of operation:

cudaMalloc(&p_gpu, sizeof(Pass));
cudaMalloc(&p_gpu->pass, 5 * sizeof(int));

The second cudaMalloc is illegal. This is...

CUDA strange behavior accessing vector

c++,cuda

I'm pretty sure this is not correct:

fftKernel<<<BLOCK_SIZE, gridSize>>>(...);

The first kernel configuration parameter should be the grid dimensions; the second should be the block dimensions. I was able to reproduce the error you indicate with the code you have posted. When I reverse those two parameters:

fftKernel<<<gridSize, BLOCK_SIZE>>>(...);

the...

Stream compaction with Thrust; best practices and fastest way?

c++,cuda,gpgpu,thrust,sparse-array

What you appear to have overlooked is that copy_if returns an iterator which points to the end of the data copied by the stream compaction operation. So all that is required is this:

// copies to device
thrust::device_vector<int> d_src = h_src;
// Result vector
thrust::device_vector<int> d_res(d_src.size());
// Copy non-zero elements from d_src...
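A self-contained sketch of the whole pattern, assuming zero is the value being compacted away (h_src comes from the snippet; the predicate is invented for the example):

#include <thrust/device_vector.h>
#include <thrust/copy.h>

struct is_nonzero {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Copy input to the device and reserve worst-case space for the result.
thrust::device_vector<int> d_src = h_src;
thrust::device_vector<int> d_res(d_src.size());

// copy_if returns an iterator one past the last element written...
thrust::device_vector<int>::iterator new_end =
    thrust::copy_if(d_src.begin(), d_src.end(), d_res.begin(), is_nonzero());

// ...so the compacted size is just the iterator distance.
d_res.resize(new_end - d_res.begin());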

Understanding Dynamic Parallelism in CUDA

multithreading,cuda

Say I launch a child grid from one thread in a block at threadIdx.x==0. Can I assume that all other threads in the parent grid have finished executing up to the point I launched the child grid as well? No. You can make no assumptions about the state of...

Thread synchronization in CUDA

c++,multithreading,cuda

You can use a simple loop and specify which threads do the work in each iteration. Something like:

for (int z = 0; z < zmax; z++) {
    if (threadIdx.z == z) {
        // do whatever with x and y
    }
    __syncthreads();
}

In each iteration, threads with...

How can I pass a struct to a kernel in JCuda

java,struct,cuda,jni,jcuda

(The author of JCuda here (not "JCUDA", please)) As mentioned in the forum post linked from the comment: It is not impossible to use structs in CUDA kernels and fill them from JCuda side. It is just very complicated, and rarely beneficial. For the reason of why it is rarely...

cuda device function and templates

c++,templates,cuda

As you've already pointed out, this is not correct:

template<typename Op>

I believe it should be:

template<Op op>

The following code seems to work correctly for me (removing the texturing since it's extraneous to the issue):

$ cat t755.cu
#include <stdio.h>
#include <math.h>
typedef float(*Op)(float, float);
__device__ float mymax(float a,...

Interfacing cuSOLVER-sparse using PyCUDA

python,cuda,ctypes,pycuda,cusolver

Setting descrA to ctypes.c_void_p() and replacing cusparseMatDescr_t in the cusolverSpDcsrlsvqr wrapper with ctypes.c_void_p should solve the problem.

how to generalize square matrix multiplication to handle arbitrary dimensions

c,cuda,parallel-processing,matrix-multiplication

This code will work for very specific dimensions but not for others. It will work for square matrix multiplication when width is exactly equal to the product of your block dimension (number of threads - 20 in the code you have shown) and your grid dimension (number of blocks -...
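The usual generalization is to round the grid up and guard out-of-range threads; a hedged sketch (the kernel and variable names are invented for the illustration, not taken from the question):

__global__ void matmul(const float *A, const float *B, float *C, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Guard: the rounded-up grid may extend past the matrix edge.
    if (row >= width || col >= width)
        return;

    float sum = 0.0f;
    for (int k = 0; k < width; ++k)
        sum += A[row * width + k] * B[k * width + col];
    C[row * width + col] = sum;
}

// Launch with a grid rounded up to cover any width:
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (width + block.y - 1) / block.y);
// matmul<<<grid, block>>>(A_d, B_d, C_d, width);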

Is prefix scan CUDA sample code in gpugems3 correct?

cuda,gpu,nvidia,prefix-sum

It seems that you've made at least one error in transcribing the code from the GPU Gems 3 chapter into your kernel. This line is incorrect:

temp[bi] += g_idata[ai];

It should be:

temp[bi] += temp[ai];

When I make that one change to the code you have now posted, it seems...

How many parallel threads i can run on my nvidia graphic card in cuda programming?

cuda

That depends on the compute capability (version) of your device, I think.

Compute capability (version)                   V1.0    V1.2    V2.x    V3.0-X.X
Maximum resident threads per multiprocessor    768     1024    1536    2048
Amount of local memory per thread              16 KB   16 KB   512 KB  512 KB
Maximum number of threads per block            512     512     1024    1024

I found this piece of information...

cudaMemcpyToSymbol in pycuda

python,cuda,pycuda

The PyCUDA implementation directly follows the CUDA driver API, so you can use any driver API code you can find as a model, but there are two things required to make this work: Use the module function module.get_global() to retrieve the device pointer to the symbol within the compiled source...

Practice computing grid size for CUDA

cuda,nvidia

This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") that completely cover the desired input. This could be expressed as ceil(nx/block.x), that is, figure out how many blocks are needed to cover the desired size, then round up. But full floating...
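In code, the integer round-up version of that idiom looks like this (nx, ny and the kernel name are placeholders):

// (nx + block.x - 1) / block.x computes ceil(nx / block.x) in
// pure integer arithmetic, with no floating-point conversion.
dim3 block(16, 16);
dim3 grid((nx + block.x - 1) / block.x,
          (ny + block.y - 1) / block.y);

// my_kernel<<<grid, block>>>(d_data, nx, ny);
// Each thread must still check its indices against nx and ny.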

NVCC CUDA cross compiling cannot find “-lcudart”

linux,cuda,ld,nvcc

It turns out that the CUDA installer I was using from NVIDIA will not allow me to cross compile for my CARMA board, but it has to be downloaded from the manufacturer SECO.

Can an unsigned long long int be used to store the output from clock64()?

cuda

There are various atomic functions which support atomic operations on unsigned long long int (i.e. a 64-bit unsigned integer), such as atomicCAS, atomicExch and atomicAdd. And if you have a cc3.5 or higher GPU, you have even more options. Referring to the documentation on clock64():

long long int clock64();

when...
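For example, inside a kernel the conversion is just a cast, since the counter returned by clock64() is non-negative (a sketch; the variable names are invented):

// clock64() returns long long int; the value is non-negative, so the
// cast to unsigned long long int is safe, and the result can be passed
// to atomics such as atomicAdd(unsigned long long int*, ...).
long long int start = clock64();
// ... timed device code ...
long long int stop = clock64();
unsigned long long int elapsed = (unsigned long long int)(stop - start);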

Do I need to free device_ptr returned by thrust?

c++,pointers,cuda,thrust

thrust::min_element returns an iterator. You should not free it directly.

Why does cuSOLVER cusolverSpDcsrlsvchol not work?

c++,cuda,linear-algebra,solver,cusolver

1. cusolverSpScsrlsvchol returns wrong results for x:

1 3.33333 2.33333 1 3.33333 2.33333 1.33333 1 2.33333 1.33333 0.666667 1 1 1 1 1

You said: A is positive definite and symmetric. No, it is not: it is not symmetric. cusolverSpcsrlsvqr() has no requirement that the A matrix be symmetric. cusolverSpcsrlsvchol()...

Linear algebra libraries and dynamic parallelism in CUDA

cuda,gpu,gpgpu

CUBLAS library functions can be called from device code. Thrust algorithms can be called from device code. Various CURAND functions can be called from device code. Other libraries that are part of the CUDA toolkit at this time (i.e. CUDA 7) -- CUFFT, CUSPARSE, CUSOLVER -- can only be used...

How to load data in global memory into shared memory SAFELY in CUDA?

c++,cuda,shared-memory

Consider one warp of the thread block finishing the first iteration and starting the next one, while other warps are still working on the first iteration. If you don't have __syncthreads at label sync2, you will end up with this warp writing to shared memory while others are reading from...
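A hedged sketch of the double-barrier pattern being described (the buffer names and sizes are invented; sync2 marks the barrier in question):

__shared__ float tile[256];

for (int it = 0; it < num_iters; ++it) {
    // Each thread loads one element of the current chunk.
    tile[threadIdx.x] = g_in[it * 256 + threadIdx.x];
    __syncthreads();  // sync1: all writes to tile are finished

    // Threads read elements written by other threads.
    float v = tile[255 - threadIdx.x];

    __syncthreads();  // sync2: without this, a fast warp could start the
                      // next iteration and overwrite tile while slower
                      // warps are still reading it
    g_out[it * 256 + threadIdx.x] = v;
}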

nvcc/CUDA 6.5 & c++11(future) - gcc 4.4.7

c++11,gcc,cuda,future

C++11 support was officially added in CUDA 7.0, and you need GCC version 4.7 or later for C++11 support. See details here: http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-compiler-new-features

Understanding Memory Replays and In-Flight Requests

caching,cuda

Effective load throughput is not the only metric that determines the performance of your kernel! A kernel with perfectly coalesced loads will always have a lower effective load throughput than the equivalent, non coalesced kernel, but that alone says nothing about its execution time: in the end, the one metric...

Cuda cub:Device Scan

cuda,gpu,nvcc,cub,scan

CUB is an evolving library, which means that as new features are introduced in CUDA, CUB may evolve in newer releases to take advantage of those. If you then attempt to use a newer CUB release with an older CUDA version, you may run into compatibility issues. This usage by...

CUDA cuBlasGetmatrix / cublasSetMatrix fails | Explanation of arguments

cuda,gpgpu,gpu-programming,cublas

The only actual problem in your code is here:

cudaMalloc( &d_x, sizeof(d_x) );

sizeof(d_x) is just the size of a pointer. You can fix it like this:

cudaMalloc( &d_x, sizeof(x) );

If you want to find out whether a CUBLAS API call is failing, you should check the return code of...
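A minimal sketch of that kind of status checking (rows and cols are placeholders, and double elements are assumed; d_x and x come from the snippet above):

cublasStatus_t stat;

// Every CUBLAS API call returns a status code.
stat = cublasSetMatrix(rows, cols, sizeof(double), x, rows, d_x, rows);
if (stat != CUBLAS_STATUS_SUCCESS)
    fprintf(stderr, "cublasSetMatrix failed with status %d\n", (int)stat);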

Best way to achieve CUDA Vector Diagonalization

matrix,cuda

I created a simple example based on thrust. It uses column-major order to store the matrices in a thrust::device_vector. It should scale well with larger row/column counts. Another approach could be based off the thrust strided_range example. This example does what you want (fill the diagonals based on the input...

Tesla k20m interoperability with Direct3D 11

cuda,direct3d,tesla

No, this won't be possible. K20m can be used (with some effort) with OpenGL graphics on Linux, but at least up through windows 8.x, you won't be able to use K20m as a D3D device in Windows. The K20m does not publish a VGA classcode in PCI configuration space, which...