cuda,gpu,shared-memory,bank-conflict
Let's walk through the PTX that it generates. //Declare some registers .reg .s32 %r<5>; .reg .s64 %rd<4>; // demoted variable .shared .align 4 .u32 _Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp; //load tid into register r1 mov.u32 %r1, %tid.x; //multiply tid by 50000, add 5, and store in r2 mad.lo.s32 %r2, %r1, 50000, 5; //store result in shared memory st.shared.u32...
No. nvprof v7.5 and earlier does not support collection of performance counters in a way that is useful for investigating the performance of concurrent kernels. I recommend you submit a feature request through the NVIDIA developer program. This is on the team's task list. Customer feedback helps move features up...
Say I launch a child grid from one thread in a block at threadIdx.x==0. Can I assume that all other threads in the parent grid have finished executing up to the point I launched the child grid as well? No. You can make no assumptions about the state of...
ImgSrc_f does not point to a contiguous 512x512 chunk of memory. Try changing float *ImgSrc_f[512]; for (int i=0; i<512; i++) ImgSrc_f[i] = (float *)malloc(512 * sizeof(float)); for(int i=0;i<512;i++) for(int j=0;j<512;j++) { ImgSrc_f[i][j]=ImgSrc.at<float>(i,j); } to something like float *ImgSrc_f; ImgSrc_f = (float *)malloc(512 * 512 * sizeof(float)); for(int i=0;i<512;i++) for(int j=0;j<512;j++)...
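A minimal runnable sketch of the contiguous layout described above (the 512x512 size is taken from the question; the fill expression is a stand-in for ImgSrc.at<float>(i,j), which isn't reproduced here):

#include <cstdio>
#include <cstdlib>

int main()
{
    const int N = 512;
    float *ImgSrc_f = (float *)malloc(N * N * sizeof(float));

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            ImgSrc_f[i * N + j] = (float)(i + j);   // stand-in for ImgSrc.at<float>(i, j)

    // ImgSrc_f now points to one contiguous 512x512 block, so a single
    // cudaMemcpy of N*N*sizeof(float) bytes can copy the whole image.
    printf("%f\n", ImgSrc_f[N * N - 1]);
    free(ImgSrc_f);
    return 0;
}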
This is my 2 cents. The drawback of having 1 very large array: it is harder to resize, which matters if you intend to resize individual layers; otherwise, go for a large block. As Daniel said, it might be hard to find a contiguous block of memory (keep in mind that something might feel large....
CUB is an evolving library, which means that as new features are introduced in CUDA, CUB may evolve in newer releases to take advantage of those. If you then attempt to use a newer CUB release with an older CUDA version, you may run into compatibility issues. This usage by...
If you understand the scope and size limitations of shared memory, then the question appears to be how to dynamically reserved memory for the shared memory array how to use the dynamic shared memory within the kernel Your kernel becomes something like this: __shared__ Mystruct *d_s; __global__ void init(Mystruct *theStructArray){...
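One common way to implement this is dynamically sized shared memory via extern __shared__ plus the third kernel launch parameter; a minimal sketch under that assumption (Mystruct's fields and the kernel body are placeholders, since the full original code isn't shown):

#include <cuda_runtime.h>

struct Mystruct { float x; int id; };

__global__ void init(Mystruct *theStructArray)
{
    // Size is not known at compile time: declare the array extern and pass
    // the byte count as the third launch parameter.
    extern __shared__ Mystruct s_buf[];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    s_buf[threadIdx.x] = theStructArray[gid];
    __syncthreads();
    // ... work on s_buf ...
    theStructArray[gid] = s_buf[threadIdx.x];
}

int main()
{
    const int n = 128;
    Mystruct *d_array;
    cudaMalloc(&d_array, n * sizeof(Mystruct));
    // Third launch parameter = dynamic shared memory bytes per block.
    init<<<1, n, n * sizeof(Mystruct)>>>(d_array);
    cudaDeviceSynchronize();
    cudaFree(d_array);
    return 0;
}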
Thrust doesn't resize vectors as part of any algorithm call. The size of vectors going into a thrust algorithm will be exactly the size of those vectors coming out of the algorithm. shrink_to_fit also has no impact on a vector's size, but it may impact the capacity, which has to...
I created a simple example based on thrust. It uses column-major order to store the matrices in a thrust::device_vector. It should scale well with larger row/column counts. Another approach could be based off the thrust strided_range example. This example does what you want (fill the diagonals based on the input...
You have block size and grid size mixed up when calling the kernel. set_start<<<blockSize, gridSize>>>(H_d, HU_d, HV_d, E_d, Na); should read: set_start<<<gridSize, blockSize>>>(H_d, HU_d, HV_d, E_d, Na); Because of this bug you are actually trying to launch a grid of size blockSize, and blocks of size gridSize. It would appear...
python,python-2.7,cuda,ipython,pycuda
I figured this out, so am posting my solution. Even though asynchronous memcpy's are non-blocking, I discovered that doing a memcpy using the same stream as a running kernel does not work. My solution was to create another stream: strm2 = driver.Stream() and then change d_inShot like so: d_inShot.set_async(h_inShot.astype(np.uint16), stream...
No, this won't be possible. K20m can be used (with some effort) with OpenGL graphics on Linux, but at least up through windows 8.x, you won't be able to use K20m as a D3D device in Windows. The K20m does not publish a VGA classcode in PCI configuration space, which...
This is not valid CUDA code: extern "C" void someCUDAcode() { int a; CUDA_CALL(cudaMalloc((void**) &a, sizeof(int))); mykernel<<<1, 1>>>(1); // CUDA_CALL(cudaFree(&a)); } When we want to do a cudaMalloc operation, we use pointers in C, not ordinary variables, like this: int *a; CUDA_CALL(cudaMalloc((void**) &a, sizeof(int))); When we want to free a...
c,cuda,parallel-processing,matrix-multiplication
This code will work for very specific dimensions but not for others. It will work for square matrix multiplication when width is exactly equal to the product of your block dimension (number of threads - 20 in the code you have shown) and your grid dimension (number of blocks -...
That depends on the compute capability of the device, I think.

Compute capability (version):                 1.0     1.2     2.x     3.0 and later
Maximum resident threads per multiprocessor:  768     1024    1536    2048
Local memory per thread:                      16 KB (cc 1.x), 512 KB (cc 2.x and later)
Maximum threads per block:                    512 (cc 1.x), 1024 (cc 2.x and later)

I found this piece of information
Getting rid of the (float) to clear the clutter, your macros look like this: #define min(a, b) (a > b)? b: a #define max(a, b) (a > b)? a: b An example use (simplifying a few variable names): imgx = min(max(x + aipx, 0), nc-1); will expand to: imgx =...
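The usual remedy (not quoted in the snippet above) is to parenthesize both the macro arguments and the whole expansion, so nested calls like min(max(...), ...) expand the way the call site expects. A minimal sketch, written here with uppercase names:

#include <stdio.h>

#define MIN(a, b) (((a) > (b)) ? (b) : (a))
#define MAX(a, b) (((a) > (b)) ? (a) : (b))

int main(void)
{
    int x = 7, aipx = -3, nc = 5;
    int imgx = MIN(MAX(x + aipx, 0), nc - 1);   // clamps x + aipx into [0, nc-1]
    printf("imgx = %d\n", imgx);                // prints 4
    return 0;
}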
cuda,parallel-processing,thrust
Thrust interprets ordinary pointers as pointing to data on the host: thrust::reduce_by_key(d_list, d_list+n, d_ones, C, D,cmp); Therefore thrust will call the host path for the above algorithm, and it will seg fault when it attempts to dereference those pointers in host code. This is covered in the thrust getting started...
(Posting my comment as an answer) Please check all your API calls for errors. Most likely is that your block size of width*height threads is too large for a single block....
Instruction predication means that an instruction is conditionally executed by a thread depending on a predicate. Threads for which the predicate is true execute the instruction; the rest do nothing. For example: var = 0; // Not taken by all threads if (condition) { var = 1; } else {...
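To make the idea concrete, here is a small hedged sketch: a kernel with a short if/else that the compiler can turn into predicated instructions, followed by the kind of PTX it may emit (the PTX is illustrative, not the output of an actual compilation):

__global__ void k(int *out, const int *in)
{
    int i = threadIdx.x;
    int var = 0;
    if (in[i] > 0) {      // short body: a good candidate for predication
        var = 1;
    } else {
        var = 2;
    }
    out[i] = var;
}

/* Conceptually, the compiler may emit something like:
     setp.gt.s32  %p1, %r_in, 0;   // set the predicate from the condition
     @%p1  mov.s32 %r_var, 1;      // takes effect only where %p1 is true
     @!%p1 mov.s32 %r_var, 2;      // takes effect only where %p1 is false
   Every thread issues both mov instructions; the predicate decides
   whether each one actually writes its result. */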
Textures in CUDA work as cached memory. Having multiple streams on a SMX looking for memory in the same texture location could only improve cache hits.
Q1: let's say I have copied one array onto the device through stream1 using cudaMemcpyAsync; would I be able to access the values of that array in a different stream, say stream 2? Yes, the array da is accessible in both kernels you have shown. However, an important question is whether or...
However, the thread order in every block is always 0,1,2,3. Why is this happening? I thought it would be random too. With 4 threads per block you are only launching one warp per block. A warp is the unit of execution (and scheduling, and resource assignment) in CUDA, not...
Fancy iterators are the key to this sort of operation, which isn't all that intuitive in thrust. You can use the zip_iterator to create tuples of values which can then be iterated over, so for a typical f(x[i-1], x[i], x[i+1]) type function, you get something like this: #include <iostream> #include...
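Since the full example above is cut off, here is a minimal self-contained sketch of the same zip_iterator pattern; the stencil functor (a three-point average) is just a placeholder for whatever f(x[i-1], x[i], x[i+1]) you need:

#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>
#include <iostream>

struct stencil
{
    template <typename Tuple>
    __host__ __device__
    float operator()(const Tuple &t) const
    {
        // t holds (x[i-1], x[i], x[i+1])
        return (thrust::get<0>(t) + thrust::get<1>(t) + thrust::get<2>(t)) / 3.0f;
    }
};

int main()
{
    thrust::device_vector<float> x(8, 1.0f);
    thrust::device_vector<float> y(8, 0.0f);

    // Zip three shifted views of x so each tuple is (x[i-1], x[i], x[i+1]),
    // and write the result for the interior points into y.
    thrust::transform(
        thrust::make_zip_iterator(thrust::make_tuple(x.begin(), x.begin() + 1, x.begin() + 2)),
        thrust::make_zip_iterator(thrust::make_tuple(x.end() - 2, x.end() - 1, x.end())),
        y.begin() + 1,
        stencil());

    for (int i = 0; i < 8; i++) std::cout << y[i] << " ";
    std::cout << std::endl;
    return 0;
}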
The PyCUDA implementation directly follows the CUDA driver API, so you can use any driver API code you can find as a model, but there are two things required to make this work: Use the module function module.get_global() to retrieve the device pointer to the symbol within the compiled source...
Maybe unrelated, but I was able to reproduce your error on a Mac with separate on-board and discrete graphics cards. When "Automatic graphics switching" is enabled in System Preferences (and no 3D graphics applications are running), I get the same "no CUDA-capable device is detected" error. When I turn off...
There are (at least) 3 or 4 issues: CUDA threadblocks are limited to 1024 threads total. That means the product of block.x * block.y * block.z must be less than or equal to 1024. So your block dimensions above 32x32 are simply failing to run the kernel at all. You...
c++,cuda,linear-algebra,solver,cusolver
1.cusolverSpScsrlsvchol returns wrong results for x: 1 3.33333 2.33333 1 3.33333 2.33333 1.33333 1 2.33333 1.33333 0.666667 1 1 1 1 1 You said: A is positive definite and symmetric. No, it is not. It is not symmetric. cusolverSpcsrlsvqr() has no requirement that the A matrix be symmetric. cusolverSpcsrlsvchol()...
cuda,parallel-processing,research
My problem is that when I launch more than 22 threads per block, most of the time I get inaccurate results (sometimes all zeros). Can somebody kindly point out what I might be doing wrong? I wasn't able to build the code because the header files seem to be missing...
ubuntu,cuda,x11,freeglut,nsight
I'm going to provide a lengthy answer because I just worked through this. However, before proceeding down this path, I'd encourage you to try a solution like NoMachine NX which should already have some of this capability built-in. It may meet your needs. You can get access to a remote...
c++,visual-studio,visual-studio-2013,cuda,cusolver
I've just tested this on Windows 7 64-bit, Visual Studio 2013 Community, CUDA 7. Start by opening the vectorAdd CUDA sample code. Be sure you can build this code correctly. (It should be an x64 project. CUDA 7 does not support 32-bit projects or operating systems on Windows.) If you...
OK, I think I fixed the problem. The conftest.c seems to be looking for cuda.h in /usr/include instead of the expected /usr/local/cuda/include. The problem was solved once I created soft links for cuda.h and cuda_runtime_api.h.
c#,cuda,sharpdx,direct3d9,managed-cuda
As hinted by the commenter, I’ve tried creating a single instance of CudaDirectXInteropResource along with the D3D texture. It worked. It’s counter-intuitive and undocumented, but it looks like cuGraphicsUnregisterResource destroys the newly written data. At least on my machine with GeForce GTX 960, Cuda 7.0 and Windows 8.1 x64. So,...
c,struct,cuda,dynamic-memory-allocation
This is allocating a new pointer to host memory: test[i].array = (char*)malloc(size * sizeof(char)); This is copying data to that region in host memory: memcpy(test[i].array, temp, size * sizeof(char)); This is overwriting the previously allocated pointer to host memory (from step 1 above) with a new pointer to device...
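A hedged sketch of one correct way to do the deep copy being described: keep the device pointer in a separate variable instead of overwriting the host pointer, and copy a struct whose member already points to device memory (the struct layout here is a simplified stand-in for the one in the question):

#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

struct Test {
    char *array;
    int   size;
};

int main()
{
    const int size = 16;
    char temp[size];
    memset(temp, 'a', size);

    // Host-side struct with a host allocation, filled from temp.
    Test h_test;
    h_test.size  = size;
    h_test.array = (char *)malloc(size);
    memcpy(h_test.array, temp, size);

    // Device allocation for the embedded array, held in its own variable
    // so the host pointer is not overwritten.
    char *d_array;
    cudaMalloc(&d_array, size);
    cudaMemcpy(d_array, h_test.array, size, cudaMemcpyHostToDevice);

    // Build a device-ready copy of the struct whose member points to device memory.
    Test h_staging = h_test;
    h_staging.array = d_array;

    Test *d_test;
    cudaMalloc(&d_test, sizeof(Test));
    cudaMemcpy(d_test, &h_staging, sizeof(Test), cudaMemcpyHostToDevice);

    // ... launch kernels that dereference d_test->array safely ...

    cudaFree(d_array);
    cudaFree(d_test);
    free(h_test.array);
    return 0;
}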
c++,multithreading,cuda,shuffle
Yes, it's because of the branching. Quoting from the CUDA programming guide B.14.2: The __shfl() intrinsics permit exchanging of a variable between threads within a warp without use of shared memory. The exchange occurs simultaneously for all active threads within the warp, ... and Threads may only read data from...
This is not correct: int* a_h[100]; It should be: int a_h[100]; You want an array of 100 int values. Not an array of 100 int pointers. int values don't change their size between 32 and 64 bits. int pointers do change their size....
matrix,cuda,multiplication,sparse,cublas
I don't think that you can classify a matrix with half zeros as "sparse": the timings you have found are reasonable (actually the sparse algorithm is behaving pretty well!). Sparse algorithms are efficient only when considering matrices where most of the elements are zeros (for example, matrices coming out from...
Windows 10 is not a supported platform for CUDA at this time. Switch to a supported platform, and your Unified Memory operations should begin working again.
Here val deviceOutput = new CUdeviceptr() cuMemAlloc(deviceOutput, SI) you are allocating SI bytes - which is 4 bytes, as the size of one int. Writing more than 4 bytes to this device pointer will mess up things. It should be cuMemAlloc(deviceOutput, SI * numElements) And similarly, I think that the...
The width and height arguments were switched in the kernel call, and that was the reason for the error. Everything works once that was corrected. This is the corrected kernel: __global__ void copySlice2Volume2(float *buffer, float *slice, int width, int height, int frameIdx) { int tid = (blockIdx.x * width) +...
After providing us with the exact time differences in one of the comments, i.e.: 280ms for up to 128 threads, 386ms for 129+ threads, I think it indirectly supports my theory of issue being related to warp scheduling. Look at the GK210 whitepaper, which is a chip used in K80:...
Consider one warp of the thread block finishing the first iteration and starting the next one, while other warps are still working on the first iteration. If you don't have __syncthreads at label sync2, you will end up with this warp writing to shared memory while others are reading from...
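A minimal sketch of the read/compute/write pattern being described; the labels in the comments are illustrative and don't necessarily match the labels in the original code:

#define BLOCK 128

__global__ void iterate(float *data, int iters)
{
    __shared__ float buf[BLOCK];
    int t = threadIdx.x;
    buf[t] = data[t];
    __syncthreads();                                        // buffer fully written

    for (int i = 0; i < iters; i++) {
        float v = 0.5f * (buf[t] + buf[(t + 1) % BLOCK]);   // every warp reads neighbours
        __syncthreads();   // "sync2": no warp may write until all warps finish reading
        buf[t] = v;
        __syncthreads();   // no warp may start the next read until all writes land
    }
    data[t] = buf[t];
}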
c++,cuda,structure,dynamic-memory-allocation
You need to allocate memory for the struct member array: Test *test = malloc(sizeof(Test)*n); for(int i = 0; i < n; i++) test[i].array = malloc(size); Note that test[i] is a struct, not a pointer, so its member is accessed with . rather than ->. I would suggest reading this answer to cope with other issues after this fix....
The Nsight VSE CUDA Debugger is a hardware debugger which means that it can only show state for thread blocks that are allocated to SMs at the time you stop execution. The error "Block not active" means that that block you are requesting is not currently allocated to a SM....
I found a solution. In the kernel part I added a new uchar4 mapped input for the color; both input pointers must have the same size for a correct correspondence between point and color (float4 *pos and uchar4 *color). Since d_image is a grayscale image (disparity image) and the one...
So, here's the solution I came up with after simplifying my case. There was a problem with memory usage - I tried to store / read different amount of data than I claimed to use when allocating it. I hope it will be helpful for anyone in the future: #include...
visual-studio-2013,cuda,intel-c++
Only the Microsoft cl.exe compiler is supported on Windows. The supported platforms are indicated in the Windows getting started document. The Intel compiler is, however, supported as the host compiler on Linux....
The performance difference you observe is mostly due to the increased instruction overhead in the pitched memory indexing scheme. Because your array size is a large power of two in the major direction, it is very likely that the pitched array allocated with cudaMalloc3D is the same size as the...
This is normal, when the driver packaged in the CUDA installer is "older" than your GPU. You should retain your current GPU driver, and go ahead with the CUDA toolkit installation, but de-select the option to install the GPU driver. Your existing driver should work fine. ...
What you are observing is probably an artifact of running the code on a Windows WDDM platform. The WDDM subsystem has a lot of latency which other platforms are not hampered by, so to improve overall performance, the CUDA WDDM driver performs command batching. This can interfere with the expect...
c++,cuda,gpgpu,thrust,sparse-array
What you appear to have overlooked is that copy_if returns an iterator which points to the end of the copied data from the stream compaction operation. So all that is required is this: //copies to device thrust::device_vector<int> d_src = h_src; //Result vector thrust::device_vector<int> d_res(d_src.size()); //Copy non-zero elements from d_src...
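Since the code above is truncated, here is a self-contained sketch of the same idea: the iterator returned by copy_if determines the compacted length (the is_non_zero predicate is an assumption about what "non-zero elements" means in the question):

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <iostream>

struct is_non_zero
{
    __host__ __device__ bool operator()(const int x) const { return x != 0; }
};

int main()
{
    int h_src[8] = {0, 3, 0, 0, 7, 1, 0, 9};
    thrust::device_vector<int> d_src(h_src, h_src + 8);
    thrust::device_vector<int> d_res(d_src.size());

    // copy_if returns an iterator one past the last copied element,
    // so the compacted length is new_end - d_res.begin().
    thrust::device_vector<int>::iterator new_end =
        thrust::copy_if(d_src.begin(), d_src.end(), d_res.begin(), is_non_zero());
    d_res.resize(new_end - d_res.begin());

    for (size_t i = 0; i < d_res.size(); i++) std::cout << d_res[i] << " ";
    std::cout << std::endl;   // prints: 3 7 1 9
    return 0;
}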
The problem is that f_obj within struct FuncEvalF is a const FunctionObj<T>&. It is instantiated as a temporary on the host FunctionObj<float>(), but the reference to it is not valid anymore later on. One way to fix this is to create a copy of it instead of holding a reference...
Yes, you can copy from std::vector using cudaMemcpy. You don't have your sizes set up correctly: void computeDepthChangeMap(unsigned char* depthChangeMap, size_t size, std::vector<cv::Point3f>* input, float dcf, int width, int height) { ... cudaStatus = cudaMalloc((void**)&dev_input, size); ^^^^ cudaStatus = cudaMemcpy(dev_input, &input[0], sizeof(cv::Point3f)*size, cudaMemcpyHostToDevice); ^^^^^^^^^^^^^^^^^ These size parameters should all be...
It seems that you've made at least 1 error in transcribing the code from the GPU Gems 3 chapter into your kernel. This line is incorrect: temp[bi] += g_idata[ai]; it should be: temp[bi] += temp[ai]; When I make that one change to the code you have now posted, it seems...
(The author of JCuda here (not "JCUDA", please)) As mentioned in the forum post linked from the comment: It is not impossible to use structs in CUDA kernels and fill them from JCuda side. It is just very complicated, and rarely beneficial. For the reason of why it is rarely...
Turns out it was a typo. In project properties->Debug->CUDA C/C++->Device, instead of compute_30,sm_30 I had `compute_30, sm_30` that is, with a space separator....
thrust::min_element returns an iterator. You should not free it directly.
Going through your package there are multiple aspects that need to be changed. You shouldn't use a 'Makefile' but a 'Makevars' file instead to improve compatibility for multiple architecture builds. Try to follow the standard variable names (e.g. CPPC should be CXX), this makes everything play together much better. Don't...
You cannot do this with the current implementation of managed memory because managed memory requires exclusive access to managed data by the device, when kernels are running. Host access to managed data during the time when kernels are running will lead to undefined behavior, typically seg fault. This should be...
The reason the error doesn't occur on this line: REAL tmp = unew_row[j]; // no error on this line is because the compiler is optimizing that line out. It doesn't do anything useful, and so the compiler completely eliminates it. The compiler warning: xxx.cu(87): warning: variable "tmp" was declared but...
As it turns out you can pass the raw data as you would using a normal vector. void computation(ps::IntegralImage2D<float, 3> iim_xyz){ cv::Vec<double, 3>* d_img_fst = 0; cudaErrorCheck(cudaMalloc((void**)&d_img_fst, sizeof(cv::Vec<double, 3>)*(iim_xyz.img_fst.size()))); cudaErrorCheck(cudaMemcpy(d_img_fst, &iim_xyz.img_fst[0], sizeof(cv::Vec<double, 3>)*(iim_xyz.img_fst.size()), cudaMemcpyHostToDevice)); //.. } __device__ double* getFirstOrderSum(unsigned start_x, unsigned start_y, unsigned width, unsigned...
This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") that completely cover the desired input. This could be expressed as ceil(nx/block.x), that is, figure out how many blocks are needed to cover the desired size, then round up. But full floating...
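A minimal sketch of that idiom in context (nx, ny and the 16x16 block shape are placeholder values):

#include <cuda_runtime.h>

__global__ void kernel(int nx, int ny) { /* ... */ }

int main()
{
    int nx = 1000, ny = 700;
    dim3 block(16, 16);
    // (nx + block.x - 1) / block.x rounds up in pure integer arithmetic,
    // so the grid always covers the full nx x ny domain without any
    // floating point division.
    dim3 grid((nx + block.x - 1) / block.x,
              (ny + block.y - 1) / block.y);
    kernel<<<grid, block>>>(nx, ny);
    cudaDeviceSynchronize();
    return 0;
}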
There are various atomic functions which support atomic operations on unsigned long long int (ie. a 64-bit unsigned integer), such as atomicCAS, atomicExch and atomicAdd. And if you have a cc3.5 or higher GPU you have even more options. Referring to the documentation on clock64(): long long int clock64(); when...
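A minimal sketch combining the two (clock64() plus a 64-bit atomic); the accumulation itself is only illustrative:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void accumulate(unsigned long long *total)
{
    // clock64() returns a 64-bit counter value; add it atomically into
    // a 64-bit unsigned accumulator shared by all threads.
    unsigned long long t = (unsigned long long)clock64();
    atomicAdd(total, t);
}

int main()
{
    unsigned long long *d_total, h_total = 0;
    cudaMalloc(&d_total, sizeof(unsigned long long));
    cudaMemcpy(d_total, &h_total, sizeof(unsigned long long), cudaMemcpyHostToDevice);
    accumulate<<<1, 32>>>(d_total);
    cudaMemcpy(&h_total, d_total, sizeof(unsigned long long), cudaMemcpyDeviceToHost);
    printf("sum of clock readings: %llu\n", h_total);
    cudaFree(d_total);
    return 0;
}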
You can't use CUDA 7 with a GeForce 8400 GS. That is a compute capability 1.1 device and support for that was dropped in CUDA 7. Install CUDA 6.5 instead. You can keep your 341.44 driver that you already have installed. If you use CUDA 6.5, be sure to select...
It turns out that the CUDA installer I was using from NVIDIA will not allow me to cross compile for my CARMA board, but it has to be downloaded from the manufacturer SECO.
There are at least 4 ideas, some of which have already been stated in the comments: Transform your point distance storage from AoS format: struct DataPoint { float pfDimens[3]; }; to SoA format: struct DataPoint { float pfDimens_x[NPTS]; float pfDimens_y[NPTS]; float pfDimens_z[NPTS]; }; this will enable full coalescing on loading...
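A compile-only sketch contrasting the two layouts (NPTS and the field names follow the structs quoted above; the kernels are placeholders just to show the access pattern):

#define NPTS 1024

struct DataPointAoS { float pfDimens[3]; };

struct DataPointSoA {
    float pfDimens_x[NPTS];
    float pfDimens_y[NPTS];
    float pfDimens_z[NPTS];
};

// AoS: consecutive threads read addresses 12 bytes apart -> poorly coalesced.
__global__ void load_aos(const DataPointAoS *pts, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = pts[i].pfDimens[0] + pts[i].pfDimens[1] + pts[i].pfDimens[2];
}

// SoA: consecutive threads read consecutive floats -> fully coalesced.
__global__ void load_soa(const DataPointSoA *pts, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = pts->pfDimens_x[i] + pts->pfDimens_y[i] + pts->pfDimens_z[i];
}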
I believe the details are that for each __device__ variable, cudafe creates a normal global variable as in C and also a CUDA-specific PTX variable. The global C variable is used so that the host program can refer to the variable by its address, and the PTX variable is used...
Is it better to create grids with blocks containing 128 threads each? Will such code run faster? Optimal block size depends on the problem. It's a good idea for your block size to be a multiple of the warp size. Other factors are occupancy considerations and shared memory usage. Does...
c++,visual-studio-2010,sorting,cuda,thrust
As @JaredHoberock pointed out, probably the key issue is that you are trying to compile a .cpp file. You need to rename that file to .cu and also make sure it is being compiled by nvcc. After you fix that, you will probably run into another issue. This is not...
The root cause of this problem was supplying an uninitialised device variable as a kernel argument. In this kernel call: dijkstras<<<1,N>>>(d_a, d_b, d_n); d_n had been allocated memory, but never assigned a value, resulting in undefined behaviour within the kernel. I would contend this proved hard for the original poster...
I'm pretty sure this is not correct: fftKernel<<<BLOCK_SIZE, gridSize>>>(...); The first kernel configuration parameter should be the grid dimensions. The second should be the block dimensions. I was able to reproduce the error you indicate in the code you have posted. When I reverse those two parameters: fftKernel<<<gridSize, BLOCK_SIZE>>>(...); the...
Firstly, be really sure this is what you want to do. Without describing the manipulations you want to do, it's hard to comment on this, but be aware that matrix multiplication is an n-cubed operation. If your manipulations are not the same complexity, chances are you'll do better simply using...
Currently it is not possible. Quoting from Mark Harris: That isn't supported today in CUDA, because the lambda is host code. Passing lambdas from host to device is a challenging problem, but it is something we will investigate for a future CUDA release. What you can do in CUDA 7...
c++,cuda,sparse-matrix,matrix-factorization,cusolver
I'm currently working on something similar myself. I decided to basically wrap the conjugate gradient and level-0 incomplete cholesky preconditioned conjugate gradient solvers utility samples that came with the CUDA SDK into a small class. You can find them in your CUDA_HOME directory under the path: samples/7_CUDALibraries/conjugateGradient and /Developer/NVIDIA/CUDA-samples/7_CUDALibraries/conjugateGradientPrecond Basically,...
cuda,gpgpu,gpu-programming,cublas
The only actual problem in your code is here: cudaMalloc( &d_x,sizeof(d_x) ); sizeof(d_x) is just the size of a pointer. You can fix it like this: cudaMalloc( &d_x,sizeof(x) ); If you want to find out if a CUBLAS API call is failing, then you should check the return code of...
The difference in total time is due to the fact that work is launched to the GPU asynchronously. If you have a long running kernel or set of kernels with no explicit synchronisation to the host, and follow them with a call to cudaMemcpy, the cudaMemcpy call will be launched...
C++11 support is added officially in CUDA 7.0. And you need GCC version 4.7 or later to have C++11 support. See details here: http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-compiler-new-features
It doesn't look like there is a method to do as you'd like. I would compile with `nvcc --keep --ptx code.cu` and go through the compilation step by step. Doing this, I could not see evidence that what you'd like to do is possible using nvcc.
c,compilation,cuda,gcc-warning,nvcc
This link: https://devtalk.nvidia.com/default/topic/388072/calling-cuda-functions-from-a-c-file/ answers your question. Basically, in the .c file: #include <stdio.h> #include <stdlib.h> #include <string.h> #include <cuda.h> extern void kernel_wrapper(int *a, int *b); int main(int argc, char *argv[]) { int a = 2; int b = 3; kernel_wrapper(&a, &b); return 0; } and in the .cu file: __global__...
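Since the .cu side is cut off above, here is a hedged sketch of what it typically looks like: the wrapper is declared extern "C" so the C main can link against it, and the kernel body here is only illustrative:

#include <cuda_runtime.h>

__global__ void mykernel(int *a, int *b)
{
    *a += *b;   // placeholder work
}

extern "C" void kernel_wrapper(int *a, int *b)
{
    int *d_a, *d_b;
    cudaMalloc(&d_a, sizeof(int));
    cudaMalloc(&d_b, sizeof(int));
    cudaMemcpy(d_a, a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, sizeof(int), cudaMemcpyHostToDevice);
    mykernel<<<1, 1>>>(d_a, d_b);
    cudaMemcpy(a, d_a, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
}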
ubuntu,cuda,ubuntu-14.04,nvidia
304.xx is a driver that will support CUDA 5 and previous (does not support newer CUDA versions.) If you want to reinstall ubuntu to create a clean setup, the linux getting started guide has all the instructions needed to set up CUDA 7 if that is your intent. I believe...
We have seen three different codes in this question in the first 24 hours of its existence. This answer addresses the final evolution. The underlying problem you are having is with this type of operation: cudaMalloc(&p_gpu, sizeof(Pass)); cudaMalloc(&p_gpu -> pass, 5 * sizeof(int)); The second cudaMalloc is illegal. This is...
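A minimal sketch of one legal way to set up that nested allocation, assuming Pass simply wraps an int* member named pass as in the question: allocate the inner device array first, patch its address into a host copy of the struct, then copy the struct to the device:

#include <cuda_runtime.h>

struct Pass { int *pass; };

int main()
{
    // 1. Allocate the inner device array, keeping its address in a host variable.
    int *d_inner;
    cudaMalloc(&d_inner, 5 * sizeof(int));

    // 2. Build a host copy of the struct that already holds the device pointer.
    Pass h_p;
    h_p.pass = d_inner;

    // 3. Allocate the outer struct on the device and copy the host copy into it.
    Pass *p_gpu;
    cudaMalloc(&p_gpu, sizeof(Pass));
    cudaMemcpy(p_gpu, &h_p, sizeof(Pass), cudaMemcpyHostToDevice);

    // ... kernels can now safely dereference p_gpu->pass ...

    cudaFree(d_inner);
    cudaFree(p_gpu);
    return 0;
}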
CUBLAS library functions can be called from device code. Thrust algorithms can be called from device code. Various CURAND functions can be called from device code. Other libraries that are part of the CUDA toolkit at this time (i.e. CUDA 7) -- CUFFT, CUSPARSE, CUSOLVER -- can only be used...
The pointer has to be created (i.e. allocated) with cudaHostAlloc, even on integrated systems like Jetson. The reason for this is that the GPU requires (zero-copy) memory to be pinned, i.e. removed from the host demand-paging system. Ordinary allocations are subject to demand-paging, and may not be used as zero-copy...
As you've already pointed out, this is not correct: template<typename Op> I believe it should be: template<Op op> The following code seems to work correctly for me (removing the texturing since it's extraneous to the issue): $ cat t755.cu #include <stdio.h> #include <math.h> typedef float(*Op)(float, float); __device__ float mymax(float a,...
You can use a simple loop, and specify the threads you want to do the work in each iteration. Something like: for (int z = 0; z < zmax; z++) { if (threadIdx.z == z) { //do whatever with x and y } __syncthreads(); } In each iteration, threads with...
There are two things at issue here. The first is this: __constant__ Sphere s[SPHERES_COUNT]; int main () { ...... kernel<<<grids, threads>>>(dev_bitmap, s); ...... In host code, s is a host memory variable which provides a handle for the CUDA runtime to hook up with the device constant memory symbol. It...
You are using the Host version of the API, but you are passing device variables to it: cudaMalloc((void **)&dCsrValA, sizeof(double)* 16 ); ... cusolverStatus_t pakao = cusolverSpDcsrlsvluHost(handleSolver,5, totalNnz, descr, dCsrValA, dCsrRowPtrA, dCsrColIndA, dY, tol, reorder, dX, singularity); ^^^^ ^^ Referring to the cusolver documentation: We see that for the host...
Effective load throughput is not the only metric that determines the performance of your kernel! A kernel with perfectly coalesced loads will always have a lower effective load throughput than the equivalent, non coalesced kernel, but that alone says nothing about its execution time: in the end, the one metric...
Only if you compile that kernel individually, because this is an instruction level feature which is enabled by code generation. You could also use inline PTX assembler to issue ld.global.cg instructions for a particular load operation within a kernel [see here for details]. No, it is an instruction level...
None of what you are proposing is possible. In PyCUDA, you cannot Pass a dictionary to a kernel Pass a list to a kernel Directly translate a dictionary to C++ structure in device code Directly translate a list to a C++ linear array in device PyCUDA can use Python classes...
The example implicitly declares a binary floating point format, which has an arbitrary precision exponent, but only 2 bits in the mantissa. All numbers are of format 1.xx * 2^n. When one performs floating point addition, one must de-normalize or scale the arguments to have the same exponent. 0.25 =...
I'm afraid MVAPICH does not support yet using multiple GPUs in the same process (source: mailing list). Advanced memory transfer operations require storing device-specific structures, so unless there is explicit support for multiple devices, I'm afraid there is no way to make your code run. On the other side, you...
This is the answer given by njuffa in the comments: ...The content of GPU memory doesn't change between invocations of the application. In case of a program failure, we would want to avoid picking up good data from a previous run, which may lead (erroneously) to a belief that the...
opengl,cuda,textures,buffer,opengl-4
I solved this. I was in a hurry and stupidly missed a very important thing on a wiki page: https://www.opengl.org/wiki/Buffer_Texture Access in shaders In GLSL, buffer textures can only be accessed with the texelFetch function. This function takes pixel offsets into the texture rather than normalized texture coordinates. The sampler...