
How to load data from global memory into shared memory SAFELY in CUDA?

c++,cuda,shared-memory

Consider one warp of the thread block finishing the first iteration and starting the next one, while other warps are still working on the first iteration. If you don't have __syncthreads at label sync2, you will end up with this warp writing to shared memory while others are reading from...
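A minimal sketch of the pattern the answer describes (the kernel name, data layout, and 256-thread block size are all assumptions, not from the question):

    __global__ void iterate(const float *g_in, float *g_out, int niter)
    {
        __shared__ float tile[256];              // assumes blockDim.x == 256

        for (int i = 0; i < niter; ++i) {
            int idx = i * blockDim.x + threadIdx.x;
            tile[threadIdx.x] = g_in[idx];       // fill the tile
            __syncthreads();                     // sync1: tile complete before reads
            g_out[idx] = tile[blockDim.x - 1 - threadIdx.x];
            __syncthreads();                     // sync2: reads done before the next
        }                                        // iteration overwrites the tile
    }

Without the second barrier, a warp that finishes early would start overwriting the tile while slower warps are still reading it.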

Can an unsigned long long int be used to store the output from clock64()?

cuda

There are various atomic functions which support atomic operations on unsigned long long int (i.e., a 64-bit unsigned integer), such as atomicCAS, atomicExch and atomicAdd. And if you have a cc3.5 or higher GPU you have even more options. Referring to the documentation on clock64(): long long int clock64(); when...
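A hedged sketch of the usage in question; the conversion from the long long int return value to unsigned long long int is well defined, and the result can then feed the 64-bit atomics mentioned above:

    __global__ void timed(unsigned long long *elapsed)
    {
        unsigned long long start = clock64();    // implicit conversion is fine
        // ... code to be timed goes here ...
        unsigned long long stop = clock64();
        *elapsed = stop - start;                  // 64-bit clock delta
    }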

batch CUDA solution of sparse banded Ax=b for various b's

c++,cuda,sparse-matrix,matrix-factorization,cusolver

I'm currently working on something similar myself. I decided to basically wrap the conjugate gradient and level-0 incomplete Cholesky preconditioned conjugate gradient solver samples that come with the CUDA SDK in a small class. You can find them in your CUDA_HOME directory under the path: samples/7_CUDALibraries/conjugateGradient and /Developer/NVIDIA/CUDA-samples/7_CUDALibraries/conjugateGradientPrecond Basically,...

MVAPICH on multi-GPU causes Segmentation fault

cuda,mvapich2

I'm afraid MVAPICH does not yet support using multiple GPUs in the same process (source: mailing list). Advanced memory transfer operations require storing device-specific structures, so unless there is explicit support for multiple devices, I'm afraid there is no way to make your code run. On the other hand, you...

Reduce by key on device array

cuda,parallel-processing,thrust

Thrust interprets ordinary pointers as pointing to data on the host: thrust::reduce_by_key(d_list, d_list+n, d_ones, C, D, cmp); Therefore thrust will call the host path for the above algorithm, and it will seg fault when it attempts to dereference those pointers in host code. This is covered in the thrust getting started...
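A minimal sketch of the usual fix, assuming int data (the pointer names reuse those from the excerpt, the rest is scaffolding): wrap each raw device pointer in thrust::device_ptr so thrust dispatches to the device path.

    #include <thrust/device_vector.h>
    #include <thrust/device_ptr.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>

    int main()
    {
        const int n = 6;
        int h_keys[n] = {1, 1, 2, 2, 2, 3};
        thrust::device_vector<int> keys(h_keys, h_keys + n), ones(n, 1);
        thrust::device_vector<int> kout(n), vout(n);

        // raw device pointers standing in for d_list / d_ones / C / D
        int *d_list = thrust::raw_pointer_cast(keys.data());
        int *d_ones = thrust::raw_pointer_cast(ones.data());
        int *C = thrust::raw_pointer_cast(kout.data());
        int *D = thrust::raw_pointer_cast(vout.data());

        // device_ptr tells thrust these pointers live on the device
        thrust::device_ptr<int> k(d_list), v(d_ones), ko(C), vo(D);
        thrust::reduce_by_key(k, k + n, v, ko, vo, thrust::equal_to<int>());
        return 0;
    }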

Cuda cub:Device Scan

cuda,gpu,nvcc,cub,scan

CUB is an evolving library, which means that as new features are introduced in CUDA, CUB may evolve in newer releases to take advantage of those. If you then attempt to use a newer CUB release with an older CUDA version, you may run into compatibility issues. This usage by...

Why did my CUDA program become slower after using 128 threads per block?

c++,cuda,tesla

After providing us with the exact time differences in one of the comments, i.e.: 280ms for up to 128 threads, 386ms for 129+ threads, I think it indirectly supports my theory that the issue is related to warp scheduling. Look at the whitepaper for GK210, the chip used in the K80:...

using dictionary in pycuda

python,cuda,pycuda

None of what you are proposing is possible. In PyCUDA, you cannot: pass a dictionary to a kernel; pass a list to a kernel; directly translate a dictionary to a C++ structure in device code; or directly translate a list to a C++ linear array in device code. PyCUDA can use Python classes...

nvcc/CUDA 6.5 & c++11(future) - gcc 4.4.7

c++11,gcc,cuda,future

C++11 support was officially added in CUDA 7.0, and you need GCC version 4.7 or later for C++11 support. See details here: http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-compiler-new-features

CudaMemCpy returns cudaErrorInvalidValue on copying vector

c++,opencv,cuda

Yes, you can copy from std::vector using cudaMemcpy. You don't have your sizes set up correctly: void computeDepthChangeMap(unsigned char* depthChangeMap, size_t size, std::vector<cv::Point3f>* input, float dcf, int width, int height) { ... cudaStatus = cudaMalloc((void**)&dev_input, size); // <-- here cudaStatus = cudaMemcpy(dev_input, &input[0], sizeof(cv::Point3f)*size, cudaMemcpyHostToDevice); // <-- and here These size parameters should all be...
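A minimal self-contained sketch of the corrected pattern (float3 stands in for cv::Point3f to avoid the OpenCV dependency; the byte count is derived from the element count):

    #include <cuda_runtime.h>
    #include <vector>

    int main()
    {
        std::vector<float3> input(512);
        size_t nbytes = input.size() * sizeof(float3);   // bytes, not elements

        float3 *dev_input = 0;
        cudaMalloc((void**)&dev_input, nbytes);
        cudaMemcpy(dev_input, input.data(), nbytes, cudaMemcpyHostToDevice);

        cudaFree(dev_input);
        return 0;
    }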

Why does Hyper-Q selectively overlap async HtoD and DtoH transfer on my cc5.2 hardware?

cuda

What you are observing is probably an artifact of running the code on a Windows WDDM platform. The WDDM subsystem has a lot of latency which other platforms are not hampered by, so to improve overall performance, the CUDA WDDM driver performs command batching. This can interfere with the expect...

Building a tiny R package with CUDA and Rcpp

r,cuda,rcpp

Going through your package, there are multiple aspects that need to be changed. You shouldn't use a 'Makefile' but a 'Makevars' file instead, to improve compatibility for multiple-architecture builds. Try to follow the standard variable names (e.g. CPPC should be CXX); this makes everything play together much better. Don't...

Running CUDA programs on Quadro K620m

cuda,nvidia

This is normal, when the driver packaged in the CUDA installer is "older" than your GPU. You should retain your current GPU driver, and go ahead with the CUDA toolkit installation, but de-select the option to install the GPU driver. Your existing driver should work fine. ...

Shuffle instruction in CUDA not working

c++,multithreading,cuda,shuffle

Yes, it's because of the branching. Quoting from the CUDA programming guide B.14.2: The __shfl() intrinsics permit exchanging of a variable between threads within a warp without use of shared memory. The exchange occurs simultaneously for all active threads within the warp, ... and Threads may only read data from...
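A hedged sketch of the difference (pre-CUDA 9 __shfl, as in the question; the lane arithmetic is illustrative only):

    __global__ void shfl_demo(int *out)
    {
        int value = threadIdx.x;

        // Broken: placing __shfl inside the divergent branch means the
        // source lanes may be inactive, making the result undefined:
        //   if (threadIdx.x % 2 == 0) value = __shfl(value, threadIdx.x + 1);

        // Working: every lane executes the shuffle; the result is then
        // used conditionally afterwards.
        int other = __shfl(value, (threadIdx.x + 1) % 32);
        if (threadIdx.x % 2 == 0)
            value = other;

        out[threadIdx.x] = value;
    }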

What's wrong with this simple cuda program?

cuda

(Posting my comment as an answer) Please check all your API calls for errors. Most likely is that your block size of width*height threads is too large for a single block....
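For illustration, a minimal error-checking sketch (the macro and the deliberately oversized launch are assumptions for demonstration): a launch with too many threads per block fails silently unless you ask the runtime for the error.

    #include <cstdio>
    #include <cstdlib>

    #define CUDA_CHECK(call)                                            \
        do {                                                            \
            cudaError_t err = (call);                                   \
            if (err != cudaSuccess) {                                   \
                fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                        cudaGetErrorString(err), __FILE__, __LINE__);   \
                exit(1);                                                \
            }                                                           \
        } while (0)

    __global__ void kernel() {}

    int main()
    {
        kernel<<<1, 2048>>>();                 // 2048 threads/block: invalid
        CUDA_CHECK(cudaGetLastError());        // reports the bad configuration
        CUDA_CHECK(cudaDeviceSynchronize());   // reports later execution errors
        return 0;
    }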

Linear algebra libraries and dynamic parallelism in CUDA

cuda,gpu,gpgpu

CUBLAS library functions can be called from device code. Thrust algorithms can be called from device code. Various CURAND functions can be called from device code. Other libraries that are part of the CUDA toolkit at this time (i.e. CUDA 7) -- CUFFT, CUSPARSE, CUSOLVER -- can only be used...

cuda calc distance of two points

cuda

There are at least 4 ideas, some of which have already been stated in the comments: Transform your point distance storage from AoS format: struct DataPoint { float pfDimens[3]; }; to SoA format: struct DataPoint { float pfDimens_x[NPTS]; float pfDimens_y[NPTS]; float pfDimens_z[NPTS]; }; this will enable full coalescing on loading...
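A short sketch of the SoA idea in kernel form (names and sizes are hypothetical): with one array per coordinate, adjacent threads read adjacent floats, which coalesces fully.

    #define NPTS 1048576

    struct DataPointSoA {
        float x[NPTS];
        float y[NPTS];
        float z[NPTS];
    };

    __global__ void dist(const DataPointSoA *pts, float3 q, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float dx = pts->x[i] - q.x;   // thread i reads element i:
            float dy = pts->y[i] - q.y;   // consecutive threads touch
            float dz = pts->z[i] - q.z;   // consecutive addresses
            out[i] = sqrtf(dx * dx + dy * dy + dz * dz);
        }
    }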

Update a D3D9 texture from CUDA

c#,cuda,sharpdx,direct3d9,managed-cuda

As hinted by the commenter, I’ve tried creating a single instance of CudaDirectXInteropResource along with the D3D texture. It worked. It’s counter-intuitive and undocumented, but it looks like cuGraphicsUnregisterResource destroys the newly written data. At least on my machine with GeForce GTX 960, Cuda 7.0 and Windows 8.1 x64. So,...

CUDA, “illegal memory access was encountered” in Memcpy

c++,cuda

There are two things at issue here. The first is this: __constant__ Sphere s[SPHERES_COUNT]; int main () { ...... kernel<<<grids, threads>>>(dev_bitmap, s); ...... In host code, s is a host memory variable which provides a handle for the CUDA runtime to hook up with the device constant memory symbol. It...
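A hedged sketch of the resulting pattern (the Sphere fields are invented; the point is that the kernel reads the __constant__ symbol directly, populated via cudaMemcpyToSymbol, rather than receiving s as an argument):

    #define SPHERES_COUNT 20
    struct Sphere { float x, y, z, radius; };

    __constant__ Sphere s[SPHERES_COUNT];     // device-side symbol

    __global__ void kernel(unsigned char *dev_bitmap)
    {
        float r = s[0].radius;                // no kernel argument needed
        (void)r; (void)dev_bitmap;
    }

    int main()
    {
        Sphere h_s[SPHERES_COUNT] = {};       // host staging copy
        cudaMemcpyToSymbol(s, h_s, sizeof(Sphere) * SPHERES_COUNT);
        kernel<<<64, 256>>>(0);
        cudaDeviceSynchronize();
        return 0;
    }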

Losing Unified Memory support in CUDA 7.0 under Windows 10

cuda

Windows 10 is not a supported platform for CUDA at this time. Switch to a supported platform, and your Unified Memory operations should begin working again.

Compiling and Linking pure C and CUDA code [warning: implicit declaration of function]

c,compilation,cuda,gcc-warning,nvcc

this link: https://devtalk.nvidia.com/default/topic/388072/calling-cuda-functions-from-a-c-file/ answers your question. Basically, in the .c file: #include <stdio.h> #include <stdlib.h> #include <string.h> #include <cuda.h> extern void kernel_wrapper(int *a, int *b); int main(int argc, char *argv[]) { int a = 2; int b = 3; kernel_wrapper(&a, &b); return 0; } and in the .cu file: __global__...
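For completeness, a minimal sketch of what the .cu side might look like (the kernel body is invented; the essential part is the extern "C" on the wrapper so its name matches the unmangled declaration in the .c file):

    // kernel.cu
    __global__ void kernel(int *a, int *b)
    {
        *a += *b;
    }

    extern "C" void kernel_wrapper(int *a, int *b)
    {
        int *d_a, *d_b;
        cudaMalloc((void**)&d_a, sizeof(int));
        cudaMalloc((void**)&d_b, sizeof(int));
        cudaMemcpy(d_a, a, sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, sizeof(int), cudaMemcpyHostToDevice);
        kernel<<<1, 1>>>(d_a, d_b);
        cudaMemcpy(a, d_a, sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d_a);
        cudaFree(d_b);
    }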

Thread synchronization in CUDA

c++,multithreading,cuda

You can use a simple loop, and specify the threads you want to do the work in each iteration. Something like: for (int z = 0; z < zmax; z++) { if (threadIdx.z == z) { //do whatever with x and y } __syncthreads(); } In each iteration, threads with...

Tesla k20m interoperability with Direct3D 11

cuda,direct3d,tesla

No, this won't be possible. The K20m can be used (with some effort) with OpenGL graphics on Linux, but at least up through Windows 8.x, you won't be able to use the K20m as a D3D device in Windows. The K20m does not publish a VGA classcode in PCI configuration space, which...

cudaMemcpyToSymbol in pycuda

python,cuda,pycuda

The PyCUDA implementation directly follows the CUDA driver API, so you can use any driver API code you can find as a model, but there are two things required to make this work: Use the module function module.get_global() to retrieve the device pointer to the symbol within the compiled source...

Interfacing cuSOLVER-sparse using PyCUDA

python,cuda,ctypes,pycuda,cusolver

Setting descrA to ctypes.c_void_p() and replacing cusparseMatDescr_t in the cusolverSpDcsrlsvqr wrapper with ctypes.c_void_p should solve the problem.

cuda-memcheck fails to detect memory leak in an R package

r,memory-leaks,cuda,valgrind

This is not valid CUDA code: extern "C" void someCUDAcode() { int a; CUDA_CALL(cudaMalloc((void**) &a, sizeof(int))); mykernel<<<1, 1>>>(1); // CUDA_CALL(cudaFree(&a)); } When we want to do a cudaMalloc operation, we use pointers in C, not ordinary variables, like this: int *a; CUDA_CALL(cudaMalloc((void**) &a, sizeof(int))); When we want to free a...

How many parallel threads can I run on my NVIDIA graphics card in CUDA programming?

cuda

That depends on the compute capability of the device, I think:

Compute capability (version)                           V1.0   V1.2   V2.x   V3.0-X.X
Maximum number of resident threads per multiprocessor   768   1024   1536       2048
Amount of local memory per thread                     16 KB  16 KB 512 KB     512 KB
Maximum number of threads per block                     512    512   1024       1024

I found this piece of information...

Practice computing grid size for CUDA

cuda,nvidia

This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") that completely cover the desired input. This could be expressed as ceil(nx/block.x), that is, figure out how many blocks are needed to cover the desired size, then round up. But full floating...
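A small self-contained sketch of the idiom (sizes and kernel are placeholders):

    __global__ void scale(float *data, int nx)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        if (x < nx)                      // guard against the round-up overshoot
            data[x] *= 2.0f;
    }

    int main()
    {
        int nx = 1000;
        float *d_data;
        cudaMalloc((void**)&d_data, nx * sizeof(float));

        dim3 block(256);
        dim3 grid((nx + block.x - 1) / block.x);   // integer ceil(nx/block.x)
        scale<<<grid, block>>>(d_data, nx);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }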

How to pass struct containing array to the kernel in CUDA?

c,arrays,cuda

We have seen three different codes in this question in the first 24 hours of its existence. This answer addresses the final evolution. The underlying problem you are having is with this type of operation: cudaMalloc(&p_gpu, sizeof(Pass)); cudaMalloc(&p_gpu -> pass, 5 * sizeof(int)); The second cudaMalloc is illegal. This is...
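A minimal sketch of the legal two-step version (reusing the names from the excerpt): allocate the device buffer into a host-side staging struct, then copy that struct, which now carries a valid device pointer, to the device.

    struct Pass { int *pass; };

    int main()
    {
        Pass h_p;                                        // host staging copy
        cudaMalloc((void**)&h_p.pass, 5 * sizeof(int));  // device buffer first

        Pass *p_gpu;
        cudaMalloc((void**)&p_gpu, sizeof(Pass));
        cudaMemcpy(p_gpu, &h_p, sizeof(Pass), cudaMemcpyHostToDevice);
        return 0;
    }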

Understanding Dynamic Parallelism in CUDA

multithreading,cuda

Say I launch a child grid from one thread in a block at threadIdx.x==0. Can I assume that all other threads in the parent grid have finished executing up to the point I launched the child grid as well? No. You can make no assumptions about the state of...

CUDA strange behavior accessing vector

c++,cuda

I'm pretty sure this is not correct: fftKernel<<<BLOCK_SIZE, gridSize>>>(...); The first kernel configuration parameter should be the grid dimensions. The second should be the block dimensions. I was able to reproduce the error you indicate in the code you have posted. When I reverse those two parameters: fftKernel<<<gridSize, BLOCK_SIZE>>>(...); the...

cusolverSpDcsrlsvlu or QR method using CUDA

cuda,cusolver

You are using the Host version of the API, but you are passing device variables to it: cudaMalloc((void **)&dCsrValA, sizeof(double)* 16 ); ... cusolverStatus_t pakao = cusolverSpDcsrlsvluHost(handleSolver, 5, totalNnz, descr, dCsrValA, dCsrRowPtrA, dCsrColIndA, dY, tol, reorder, dX, singularity); Note the Host suffix on the function and the device pointers (dCsrValA, etc.) passed to it. Referring to the cusolver documentation: We see that for the host...

Function pointer (to other kernel) as kernel arg in CUDA

c++,cuda,function-pointers,gpgpu

It seems like this question comes up from time to time, although the previous examples I can think of have to do with calling __device__ functions instead of __global__ functions. In general it's illegal to take the address of a device entity (variable, function) in host code. One possible method...
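One common workaround, sketched here with a __device__ function rather than a __global__ one (all names are hypothetical): take the address in device code via a __device__ variable's static initializer, copy it back with cudaMemcpyFromSymbol, then pass it as an ordinary kernel argument.

    typedef float (*op_t)(float, float);

    __device__ float mymax(float a, float b) { return a > b ? a : b; }

    __device__ op_t d_op = mymax;     // address taken in *device* code

    __global__ void apply(op_t op, const float *a, const float *b, float *c)
    {
        *c = op(*a, *b);
    }

    int main()
    {
        op_t h_op;
        cudaMemcpyFromSymbol(&h_op, d_op, sizeof(op_t));
        // ... allocate a, b, c, then: apply<<<1, 1>>>(h_op, a, b, c);
        return 0;
    }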

CUDA thread execution order

multithreading,cuda

However, the thread order in every block is always 0,1,2,3. Why is this happening? I thought it would be random too With 4 threads per block you are only launching one warp per block. A warp is the unit of execution (and scheduling, and resource assignment) in CUDA, not...

Understanding CUDA profiler output (nvprof)

cuda,memcpy,nvprof

The difference in total time is due to the fact that work is launched to the GPU asynchronously. If you have a long running kernel or set of kernels with no explicit synchronisation to the host, and follow them with a call to cudaMemcpy, the cudaMemcpy call will be launched...

GPU Programming Strategy

c++,cuda,gpu

This is my 2 cents. The drawbacks of having 1 very large array: harder to resize, so consider whether you intend on resizing individual layers before you go for a large block. As Daniel said, it might be hard to find a contiguous block of memory (keep in mind that something might feel large....

how to generalize square matrix multiplication to handle arbitrary dimensions

c,cuda,parallel-processing,matrix-multiplication

This code will work for very specific dimensions but not for others. It will work for square matrix multiplication when width is exactly equal to the product of your block dimension (number of threads - 20 in the code you have shown) and your grid dimension (number of blocks -...
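A sketch of the usual generalization (hypothetical kernel, square n x n case): compute enough blocks to cover any n, and guard each thread against the overshoot.

    __global__ void matmul(const float *A, const float *B, float *C, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {                // grid may overshoot n
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];
            C[row * n + col] = sum;
        }
    }

    // launch covering any n, not just multiples of the block size:
    //   dim3 block(16, 16);
    //   dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    //   matmul<<<grid, block>>>(dA, dB, dC, n);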

How to give texture to a Point cloud? OpenGL / CUDA / C++

c++,opencv,opengl,cuda,kernel

I found a solution. In the kernel part I have a new uchar4 mapped input for the color; both input pointers must have the same size for a correct relation between point and color (float4 *pos and uchar4 *color). Since d_image is a gray scale image (disparity image) and the one...

cuda device function and templates

c++,templates,cuda

As you've already pointed out, this is not correct: template<typename Op> I believe it should be: template<Op op> The following code seems to work correctly for me (removing the texturing since it's extraneous to the issue): $ cat t755.cu #include <stdio.h> #include <math.h> typedef float(*Op)(float, float); __device__ float mymax(float a,...

Is prefix scan CUDA sample code in gpugems3 correct?

cuda,gpu,nvidia,prefix-sum

It seems that you've made at least 1 error in transcribing the code from the GPU Gems 3 chapter into your kernel. This line is incorrect: temp[bi] += g_idata[ai]; it should be: temp[bi] += temp[ai]; When I make that one change to the code you have now posted, it seems...

Does cuda 7 fully support lambda on device code?

c++,c++11,lambda,cuda

Currently it is not possible. Quoting from Mark Harris: That isn't supported today in CUDA, because the lambda is host code. Passing lambdas from host to device is a challenging problem, but it is something we will investigate for a future CUDA release. What you can do in CUDA 7...

Using a data pointer with CUDA (and integrated memory)

c++,memory-management,cuda

The pointer has to be created (i.e. allocated) with cudaHostAlloc, even on integrated systems like Jetson. The reason for this is that the GPU requires (zero-copy) memory to be pinned, i.e. removed from the host demand-paging system. Ordinary allocations are subject to demand-paging, and may not be used as zero-copy...
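A minimal sketch of the allocation the answer calls for (size and names are placeholders):

    int main()
    {
        const int n = 1024;
        float *h_data, *d_data;

        // pinned + mapped: locked out of demand paging, GPU-accessible
        cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);
        // device-side alias of the same physical memory (zero-copy)
        cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

        // kernel<<<grid, block>>>(d_data);   // no cudaMemcpy required

        cudaFreeHost(h_data);
        return 0;
    }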

Extracting raw data from template for use in CUDA

c++,templates,opencv,cuda

As it turns out you can pass the raw data as you would using a normal vector. void computation(ps::IntegralImage2D<float, 3> iim_xyz){ cv::Vec<double, 3>* d_img_fst = 0; cudaErrorCheck(cudaMalloc((void**)&d_img_fst, sizeof(cv::Vec<double, 3>)*(iim_xyz.img_fst.size()))); cudaErrorCheck(cudaMemcpy(d_img_fst, &iim_xyz.img_fst[0], sizeof(cv::Vec<double, 3>)*(iim_xyz.img_fst.size()), cudaMemcpyHostToDevice)); //.. } __device__ double* getFirstOrderSum(unsigned start_x, unsigned start_y, unsigned width, unsigned...

Storing non-square images in a volume

cuda

The width and height arguments were switched in the kernel call, and that was the reason for the error. Everything works once that was corrected. This is the corrected kernel: __global__ void copySlice2Volume2(float *buffer, float *slice, int width, int height, int frameIdx) { int tid = (blockIdx.x * width) +...

How to run CUDA/OpenGL interop (particle) sample from a remote machine

ubuntu,cuda,x11,freeglut,nsight

I'm going to provide a lengthy answer because I just worked through this. However, before proceeding down this path, I'd encourage you to try a solution like NoMachine NX which should already have some of this capability built-in. It may meet your needs. You can get access to a remote...

Problems with floating-point additions. Ignoring some small values

math,cuda,floating-point

The example implicitly declares a binary floating point format, which has an arbitrary precision exponent, but only 2 bits in the mantissa. All numbers are of format 1.xx * 2^n. When one performs floating point addition, one must de-normalize or scale the arguments to have the same exponent. 0.25 =...
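The same absorption effect is easy to reproduce in IEEE binary32, which has a 24-bit significand instead of the toy 2-bit one (the values below are chosen for illustration):

    #include <stdio.h>

    int main(void)
    {
        float big = 16777216.0f;   /* 2^24; neighbouring floats are 2 apart  */
        float sum = big + 0.5f;    /* 0.5 is below half the spacing...       */
        printf("%.1f\n", sum);     /* ...so it is absorbed: prints 16777216.0 */
        return 0;
    }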

Why cuSparse is much slower than cuBlas for sparse matrix multiplication

matrix,cuda,multiplication,sparse,cublas

I don't think that you can classify a matrix with half zeros as "sparse": the timing you have found are reasonable (actually the sparse algorithm is behaving pretty well!). Sparse algorithms are efficient only when considering matrices where most of the elements are zeros (for example, matrices coming out from...

cuda thrust: selective copying and resizing results

cuda,thrust

Thrust doesn't resize vectors as part of any algorithm call. The size of vectors going into a thrust algorithm will be exactly the size of those vectors coming out of the algorithm. shrink_to_fit also has no impact on a vector's size, but it may impact the capacity, which has to...
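A short sketch of the resulting pattern with thrust::copy_if (the predicate and sizes are illustrative): size the output for the worst case, then shrink it yourself using the returned iterator.

    #include <thrust/device_vector.h>
    #include <thrust/copy.h>

    struct is_nonzero {
        __host__ __device__ bool operator()(int x) const { return x != 0; }
    };

    int main()
    {
        thrust::device_vector<int> src(8, 0);
        src[1] = 3; src[5] = 7;
        thrust::device_vector<int> res(src.size());   // worst-case size

        // copy_if returns one past the last element it wrote
        thrust::device_vector<int>::iterator end =
            thrust::copy_if(src.begin(), src.end(), res.begin(), is_nonzero());
        res.resize(end - res.begin());                // now res.size() == 2
        return 0;
    }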

sharing cuda texture among multiple streams

cuda,textures

Textures in CUDA work as cached memory. Having multiple streams on an SMX looking for memory at the same texture location could only improve cache hits.

How to create a Cuda module without a host compiler

cuda

It doesn't look like there is a method to do as you'd like. I would compile with `nvcc --keep --ptx code.cu` and go through the compilation step by step. Doing this, I could not see evidence that what you'd like to do is possible using nvcc.

cuda 7.0 installation compatible hardware not found

windows,cuda,installation

You can't use CUDA 7 with a GeForce 8400 GS. That is a compute capability 1.1 device and support for that was dropped in CUDA 7. Install CUDA 6.5 instead. You can keep your 341.44 driver that you already have installed. If you use CUDA 6.5, be sure to select...

caffe Debug build: stray '"' character in nvcc command

compilation,cuda,nvcc,caffe

Turns out it was a typo. In project properties->Debug->CUDA C/C++->Device, instead of compute_30,sm_30 I had `compute_30, sm_30` that is, with a space separator....

Passing dynamic array of structs to GPU kernel

c++,cuda,structure,dynamic-memory-allocation

You need to allocate memory for the struct member array: Test *test = (Test*)malloc(sizeof(Test)*n); for(int i = 0; i < n; i++) test[i].array = (char*)malloc(size); I would suggest reading this answer to cope with other issues after this fix....

Best way to achieve CUDA Vector Diagonalization

matrix,cuda

I created a simple example based on thrust. It uses column-major order to store the matrices in a thrust::device_vector. It should scale well with larger row/column counts. Another approach could be based off the thrust strided_range example. This example does what you want (fill the diagonals based on the input...

Cuda cusolver can't link in Visual studio 2013

c++,visual-studio,visual-studio-2013,cuda,cusolver

I've just tested this on Windows 7 64-bit, Visual Studio 2013 Community, CUDA 7. start by opening the vectorAdd cuda sample code. Be sure you can build this code correctly. (It should be a x64 project. CUDA 7 does not support 32-bit projects or operating systems on windows.) If you...

Reading event counters with concurrent exection

cuda,profiling,nvidia

No. nvprof v7.5 and earlier does not support collection of performance counters in a way that is useful for investigating the performance of concurrent kernels. I recommend you submit a feature request through the NVIDIA developer program. This is on the team's task list. Customer feedback helps move features up...

Amount of cores per SM and threads per block in CUDA

cuda

Is it better to create grids with blocks containing 128 threads each? Will such code run faster? Optimal block size depends on the problem. It's a good idea for your block size to be a multiple of the warp size. Other factors are occupancy considerations, and shared memory usage. Does...

Why use max and min macro in __global__ kernel of CUDA not giving correct answer?

c++,cuda,macros

Getting rid of the (float) to clear the clutter, your macros look like this: #define min(a, b) (a > b)? b: a #define max(a, b) (a > b)? a: b And example use (simplifying a few variable names): imgx = min(max(x + aipx, 0), nc-1); will expand to: imgx =...
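The conventional fix, for illustration: parenthesize both the arguments and the whole expansion so surrounding operators cannot regroup the ternary after substitution (in device code, fminf/fmaxf sidestep the macro problem entirely).

    #define min(a, b) (((a) > (b)) ? (b) : (a))
    #define max(a, b) (((a) > (b)) ? (a) : (b))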

Is there any way I can have a barrier within Device code that is controlled by Host?

cuda,barrier

You cannot do this with the current implementation of managed memory because managed memory requires exclusive access to managed data by the device, when kernels are running. Host access to managed data during the time when kernels are running will lead to undefined behavior, typically seg fault. This should be...

Template __host__ __device__ calling host defined functions

cuda

You cannot prevent instantiation of either half of a __host__ __device__ function template instantiation. If you instantiate the function by calling it on the host (device), the compiler will also instantiate the device (host) half. The best you can do for your use case as of CUDA 7.0 is to...

'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D

c,cuda

The reason the error doesn't occur on this line: REAL tmp = unew_row[j]; // no error on this line is because the compiler is optimizing that line out. It doesn't do anything useful, and so the compiler completely eliminates it. The compiler warning: xxx.cu(87): warning: variable "tmp" was declared but...

NVCC CUDA cross compiling cannot find “-lcudart”

linux,cuda,ld,nvcc

It turns out that the CUDA installer I was using from NVIDIA will not allow me to cross compile for my CARMA board, but it has to be downloaded from the manufacturer SECO.

cuMemcpyDtoH yields CUDA_ERROR_INVALID_VALUE

java,scala,ubuntu,cuda,jcuda

Here val deviceOutput = new CUdeviceptr() cuMemAlloc(deviceOutput, SI) you are allocating SI bytes - which is 4 bytes, as the size of one int. Writing more than 4 bytes to this device pointer will mess up things. It should be cuMemAlloc(deviceOutput, SI * numElements) And similarly, I think that the...

direct global memory access using cuda

c++,cuda

Q1: let's say I have copied one array onto the device through stream1 using cudaMemcpyAsync; would I be able to access the values of that array in a different stream, say 2? Yes, the array da is accessible in both kernels you have shown. However, an important question is whether or...

OpenGL 4.5 Buffer Texture : extensions support

opengl,cuda,textures,buffer,opengl-4

I solved this. I was in a hurry and stupidly missed a very important thing on a wiki page: https://www.opengl.org/wiki/Buffer_Texture Access in shaders: In GLSL, buffer textures can only be accessed with the texelFetch function. This function takes pixel offsets into the texture rather than normalized texture coordinates. The sampler...

CUDA parallel program flow

c++,c,multithreading,cuda

The root cause of this problem was supplying an uninitialised device variable as a kernel argument. In this kernel call: dijkstras<<<1,N>>>(d_a, d_b, d_n); d_n had been allocated memory, but never assigned a value, resulting in undefined behaviour within the kernel. I would contend this proved hard for the original poster...

Calling a CUDA “Hello World” from Haskell using the FFI gives wrong results

haskell,cuda,ffi

Maybe unrelated, but I was able to reproduce your error on a Mac with separate on-board and discrete graphics cards. When "Automatic graphics switching" is enabled in System Preferences (and no 3D graphics applications are running), I get the same "no CUDA-capable device is detected" error. When I turn off...

cudaMalloc vs cudaMalloc3D performance for a 2D array

c,cuda

The performance difference you observe is mostly due to the increased instruction overhead in the pitched memory indexing scheme. Because your array size is a large power of two in the major direction, it is very likely that the pitched array allocated with cudaMalloc3D is the same size as the...

Returning to host code in pyCUDA after asynchronous kernel launch

python,python-2.7,cuda,ipython,pycuda

I figured this out, so am posting my solution. Even though asynchronous memcpy operations are non-blocking, I discovered that doing a memcpy using the same stream as a running kernel does not work. My solution was to create another stream: strm2 = driver.Stream() and then change d_inShot like so: d_inShot.set_async(h_inShot.astype(np.uint16), stream...

Understanding Memory Replays and In-Flight Requests

caching,cuda

Effective load throughput is not the only metric that determines the performance of your kernel! A kernel with perfectly coalesced loads will always have a lower effective load throughput than the equivalent, non coalesced kernel, but that alone says nothing about its execution time: in the end, the one metric...

cudaMemcpy doesn't work in 64 bits

cuda,32bit-64bit

This is not correct: int* a_h[100]; It should be: int a_h[100]; You want an array of 100 int values. Not an array of 100 int pointers. int values don't change their size between 32 and 64 bits. int pointers do change their size....

CUDA Nsight Debug Focus Block not active

c++,cuda,nsight

The Nsight VSE CUDA Debugger is a hardware debugger, which means that it can only show state for thread blocks that are allocated to SMs at the time you stop execution. The error "Block not active" means that the block you are requesting is not currently allocated to an SM....

Memory allocation on GPU for dynamic array of structs

c,struct,cuda,dynamic-memory-allocation

This is allocating a new pointer to host memory: test[i].array = (char*)malloc(size * sizeof(char)); This is copying data to that region in host memory: memcpy(test[i].array, temp, size * sizeof(char)); This is overwriting the previously allocated pointer to host memory (from step 1 above) with a new pointer to device...
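Putting the steps together, a minimal sketch of the correct deep copy (names follow the excerpt; sizes are invented):

    struct Test { char *array; };

    int main()
    {
        const int n = 4, size = 16;
        char temp[size] = "hello";
        Test test[n];

        Test *d_test;
        cudaMalloc((void**)&d_test, n * sizeof(Test));

        for (int i = 0; i < n; i++) {
            char *d_buf;                          // device buffer first
            cudaMalloc((void**)&d_buf, size);
            cudaMemcpy(d_buf, temp, size, cudaMemcpyHostToDevice);
            test[i].array = d_buf;                // host struct holds device ptr
        }
        // copy the structs (now carrying device pointers) in one shot
        cudaMemcpy(d_test, test, n * sizeof(Test), cudaMemcpyHostToDevice);
        return 0;
    }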

How does CUDA's cudaMemcpyFromSymbol work?

cuda

I believe the details are that for each __device__ variable, cudafe creates a normal global variable as in C and also a CUDA-specific PTX variable. The global C variable is used so that the host program can refer to the variable by its address, and the PTX variable is used...
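For reference, a minimal usage sketch (the variable and kernel are hypothetical): the symbol is named directly, and the runtime resolves it to the device-side variable.

    __device__ int result;

    __global__ void compute() { result = 42; }

    int main()
    {
        compute<<<1, 1>>>();
        int h_result = 0;
        cudaMemcpyFromSymbol(&h_result, result, sizeof(int));
        return 0;
    }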

Cuda grid size limitations

cuda

You have block size and grid size mixed up when calling the kernel. set_start<<<blockSize, gridSize>>>(H_d, HU_d, HV_d, E_d, Na); should read: set_start<<<gridSize, blockSize>>>(H_d, HU_d, HV_d, E_d, Na); Because of this bug you are actually trying to launch a grid of size blockSize, and blocks of size gridSize. It would appear...

Cuda: Copy host data to shared memory array

c++,cuda

If you understand the scope and size limitations of shared memory, then the question appears to be how to dynamically reserved memory for the shared memory array how to use the dynamic shared memory within the kernel Your kernel becomes something like this: __shared__ Mystruct *d_s; __global__ void init(Mystruct *theStructArray){...
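A hedged sketch of the dynamic shared memory mechanics (the struct contents and sizes are invented; the key points are the extern __shared__ declaration and the third launch parameter giving the byte count):

    struct Mystruct { float a; int b; };

    __global__ void init(const Mystruct *theStructArray, int n)
    {
        extern __shared__ Mystruct s[];    // sized at launch time
        if (threadIdx.x < n)
            s[threadIdx.x] = theStructArray[threadIdx.x];
        __syncthreads();
        // ... work on the shared copy ...
    }

    // launch: init<<<1, 128, 128 * sizeof(Mystruct)>>>(d_array, 128);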

Branch and predicated instructions

cuda,simd

Instruction predication means that an instruction is conditionally executed by a thread depending on a predicate. Threads for which the predicate is true execute the instruction, the rest do nothing. For example: var = 0; // Not taken by all threads if (condition) { var = 1; } else {...

Why does cuSOLVER cusolverSpDcsrlsvchol not work?

c++,cuda,linear-algebra,solver,cusolver

1.cusolverSpScsrlsvchol returns wrong results for x: 1 3.33333 2.33333 1 3.33333 2.33333 1.33333 1 2.33333 1.33333 0.666667 1 1 1 1 1 You said: A is positive definite and symmetric. No, it is not. It is not symmetric. cusolverSpcsrlsvqr() has no requirement that the A matrix be symmetric. cusolverSpcsrlsvchol()...

Making some, but not all, (CUDA) memory accesses uncached

caching,cuda,gpgpu

Only if you compile that kernel individually, because this is an instruction level feature which is enabled by code generation. You could also use inline PTX assembler to issue ld.global.cg instructions for a particular load operation within a kernel [see here for details]. No, it is an instruction level...
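For illustration, a hedged sketch of the inline PTX route for a single 32-bit load (the wrapper name is invented; .cg caches at L2 only):

    __device__ __forceinline__ int load_cg(const int *ptr)
    {
        int value;
        asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(value) : "l"(ptr));
        return value;
    }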

Compiler 'cl.exe' in PATH different than the one specified with -ccbin

visual-studio-2013,cuda,intel-c++

Only the Microsoft cl.exe compiler is supported on Windows. The supported platforms are indicated in the Windows getting started document. The Intel compiler is supported as the host compiler on Linux, however....

thrust exception bulk_kernel_by_value in transform_reduce

c++,c++11,cuda

The problem is that f_obj within struct FuncEvalF is a const FunctionObj<T>&. It is instantiated as a temporary on the host FunctionObj<float>(), but the reference to it is not valid anymore later on. One way to fix this is to create a copy of it instead of holding a reference...

How do you build the example CUDA Thrust device sort?

c++,visual-studio-2010,sorting,cuda,thrust

As @JaredHoberock pointed out, probably the key issue is that you are trying to compile a .cpp file. You need to rename that file to .cu and also make sure it is being compiled by nvcc. After you fix that, you will probably run into another issue. This is not...

Stream compaction with Thrust; best practices and fastest way?

c++,cuda,gpgpu,thrust,sparse-array

What you have appeared to have overlooked is that copy_if returns an iterator which points to the end of the copied data from the stream compaction operation. So all that is required is this: //copies to device thrust::device_vector<int> d_src = h_src; //Result vector thrust::device_vector<int> d_res(d_src.size()); //Copy non-zero elements from d_src...

CUDA Matrix Addition Execution Time with Variation in Block and Grid Dimensions

cuda

There are (at least) 3 or 4 issues: CUDA threadblocks are limited to 1024 threads total. That means the product of block.x * block.y * block.z must be less than or equal to 1024. So your block dimensions above 32x32 are simply failing to run the kernel at all. You...

Access violation reading location when calling cudaMemcpy2DToArray

c++,arrays,opencv,cuda

ImgSrc_f does not point to a contiguous 512x512 chunk of memory. Try changing float *ImgSrc_f[512]; for (int i=0; i<512; i++) ImgSrc_f[i] = (float *)malloc(512 * sizeof(float)); for(int i=0;i<512;i++) for(int j=0;j<512;j++) { ImgSrc_f[i][j]=ImgSrc.at<float>(i,j); } to something like float *ImgSrc_f; ImgSrc_f = (float *)malloc(512 * 512 * sizeof(float)); for(int i=0;i<512;i++) for(int j=0;j<512;j++)...

CUDA: Group every n-th point of array passed to GPU

c++,c,arrays,cuda

So, here's the solution I came up with after simplifying my case. There was a problem with memory usage - I tried to store / read different amount of data than I claimed to use when allocating it. I hope it will be helpful for anyone in the future: #include...

In CUDA / Thrust, how can I access a vector element's neighbor during a for-each operation?

c++,cuda,thrust

Fancy iterators are the key to this sort of operation, which isn't all that intuitive in thrust. You can use the zip_iterator to create tuples of values which can then be iterated over, so for a typical f(x[i-1], x[i], x[i+1]) type function, you get something like this: #include <iostream> #include...
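A compact, self-contained variant of that idea (the averaging functor is illustrative): zip three shifted views of the same vector so each tuple is (x[i-1], x[i], x[i+1]).

    #include <thrust/device_vector.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/transform.h>
    #include <thrust/tuple.h>

    struct stencil {
        template <typename T>
        __host__ __device__ float operator()(const T &t) const {
            // f(x[i-1], x[i], x[i+1]): here, a three-point average
            return (thrust::get<0>(t) + thrust::get<1>(t) +
                    thrust::get<2>(t)) / 3.0f;
        }
    };

    int main()
    {
        thrust::device_vector<float> x(8, 1.0f), y(8, 0.0f);
        thrust::transform(
            thrust::make_zip_iterator(
                thrust::make_tuple(x.begin(), x.begin() + 1, x.begin() + 2)),
            thrust::make_zip_iterator(
                thrust::make_tuple(x.end() - 2, x.end() - 1, x.end())),
            y.begin() + 1,             // interior points y[1] .. y[n-2]
            stencil());
        return 0;
    }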

Do I need to free device_ptr returned by thrust?

c++,pointers,cuda,thrust

thrust::min_element returns an iterator. You should not free it directly.
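A one-screen sketch of what that means in practice:

    #include <thrust/device_vector.h>
    #include <thrust/extrema.h>

    int main()
    {
        thrust::device_vector<float> v(4, 1.0f);
        v[2] = -3.0f;
        // the result is an iterator into v, not a separate allocation
        thrust::device_vector<float>::iterator it =
            thrust::min_element(v.begin(), v.end());
        float min_val = *it;            // dereference copies the value to host
        int   min_pos = it - v.begin(); // index of the minimum
        (void)min_val; (void)min_pos;
        return 0;
    }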

How can I pass a struct to a kernel in JCuda

java,struct,cuda,jni,jcuda

(The author of JCuda here (not "JCUDA", please)) As mentioned in the forum post linked from the comment: It is not impossible to use structs in CUDA kernels and fill them from JCuda side. It is just very complicated, and rarely beneficial. For the reason of why it is rarely...

Building CUDA-aware openMPI on Ubuntu 12.04 cannot find cuda.h

ubuntu,cuda,mpi

OK, I think I fixed the problem. The conftest.c seems to be looking for cuda.h in /usr/include instead of the expected /usr/local/cuda/include. The problem was solved once I created soft links to cuda.h and cuda_runtime_api.h.

Why use memset when using CUDA?

c,cuda,nvidia

This is the answer given by njuffa in the comments: ...The content of GPU memory doesn't change between invocations of the application. In case of a program failure, we would want to avoid picking up good data from a previous run, which may lead (erroneously) to a belief that the...

purposely causing bank conflicts for shared memory on CUDA device

cuda,gpu,shared-memory,bank-conflict

Let's walk through the PTX that it generates. //Declare some registers .reg .s32 %r<5>; .reg .s64 %rd<4>; // demoted variable .shared .align 4 .u32 _Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp; //load tid into register r1 mov.u32 %r1, %tid.x; //multiply tid by 50000, add 5, and store in r2 mad.lo.s32 %r2, %r1, 50000, 5; //store result in shared memory st.shared.u32...

Finding reason for the inaccurate results, copying code from research paper

cuda,parallel-processing,research

My problem is that when I launch more than 22 threads per block, most of the time I get inaccurate results (sometimes all zeros). Can somebody kindly point out what I might be doing wrong? I wasn't able to build the code because the header files seem to be missing...

Faster Matrix Multiplication in CUDA

c,cuda,matrix-multiplication

Firstly, be really sure this is what you want to do. Without describing the manipulations you want to do, it's hard to comment on this, but be aware that matrix multiplication is an n-cubed operation. If your manipulations are not the same complexity, chances are you'll do better simply using...

What version of CUDA works with nvidia 304.125?

ubuntu,cuda,ubuntu-14.04,nvidia

304.xx is a driver that will support CUDA 5 and previous (does not support newer CUDA versions.) If you want to reinstall ubuntu to create a clean setup, the linux getting started guide has all the instructions needed to set up CUDA 7 if that is your intent. I believe...