
Finding the reason for inaccurate results when copying code from a research paper

Tags: cuda, parallel-processing, research

I am trying to replicate the linear programming solver described in this report:

http://www.idi.ntnu.no/~elster/master-studs/spampinato/spampinato-linear-prog-gpu-report.pdf.

First off, the device I am using is a Quadro FX 1800M with compute capability 1.2.

My problem is that when I launch more than 22 threads per block, most of the time I get inaccurate results (sometimes all zeros); however, on rare occasions I get accurate results even when I launch 512 threads per block.

Here are some test runs that I made. (Sequential refers to a CPU-based version used for comparison.)

Iteration No 1 : of Sequential Version
Optimum Found 24.915583
Elapsed time: 0.001049725

Iteration No 1: of Parallel Version
BS-(Number of Threads) = : 20
Optimum found: 24.915583

Iteration No 2: of Parallel Version
BS-(Number of Threads) = : 256
Optimum found: 24.915607

Iteration No 3: of Parallel Version
BS-(Number of Threads) = : 512
Optimum found: 24.917068

Iteration No 4: of Parallel Version
BS-(Number of Threads) = : 2
Optimum found: 24.915583

Iteration No 5: of Parallel Version
BS-(Number of Threads) = : 456 
Optimum found: -30693000299230806209574138333792043008.000000

Iteration No 6: of Parallel Version
BS-(Number of Threads) = : 456
Problem unsolvable: either qth==0 or loop too long.

Iteration No 7: of Parallel Version
BS-(Number of Threads) = : 512
Optimum found: 25.010513

Iteration No 8: of Parallel Version
BS-(Number of Threads) = : 256
Problem unsolvable: either qth==0 or loop too long.

Iteration No 9: of Parallel Version
BS-(Number of Threads) = : 256
Optimum found: 0.000000

Iteration No 10: of Parallel Version
BS-(Number of Threads) = : 512
Optimum found: 0.000000

Can somebody kindly point out what I might be doing wrong? I know that I haven't posted the code, but I am assuming that the code is correct since I am copying it from the research paper, and that the problem is on my end.

I should also point out that I am getting the following warning when compiling the CUDA code:

ptxas /tmp/tmpxft_000017e7_00000000-10_culiblp.ptx, line 263; warning : Double is not supported. Demoting to float

Might this be a reason for the inaccurate results?

Thank you for your attention.

Best How To:

My problem is that when I launch more than 22 threads per block, most of the time I get inaccurate results (sometimes all zeros).

Can somebody kindly point out what I might be doing wrong?

I wasn't able to build the code because the header files seem to be missing from the paper. I could try and construct those, but one thing I noticed is that the variable (or constant) BS doesn't seem to be defined anywhere. So I'm guessing it was originally defined in culiblp.h (which is not provided.)

Looking at culiblp.cu in the paper, I notice some kernel launches like this:

init_AInD<<<dim3(kn, km1), dim3(BS, BS)>>>(devA, devD, m, n);
                           ^^^^^^^^^^^^

This is creating a 2D threadblock of dimensions BS x BS, i.e. BS*BS threads per block. So if you set BS to any value greater than 22, the product exceeds 512 threads, which is the maximum threads per block for your cc1.x GPU, and that kernel launch will fail.

I believe this is certainly a contributing factor to the code failing when BS is larger than 22.

You could prove this out by running your code with cuda-memcheck. Also, if you plan to work with this code, I'd suggest adding proper CUDA error checking.
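
For example, here is a minimal sketch of what that checking could look like around the launch quoted above; the helper function and its name are my own addition, not from the paper, while the launch line is the one from culiblp.cu:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal error-checking helper (my own name and style, not the paper's)
static void cudaCheck(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// After each kernel launch, check both the launch itself and its completion:
init_AInD<<<dim3(kn, km1), dim3(BS, BS)>>>(devA, devD, m, n);
cudaCheck(cudaGetLastError(), "init_AInD launch");          // catches an invalid launch configuration
cudaCheck(cudaDeviceSynchronize(), "init_AInD execution");  // catches errors during kernel execution

With BS set above 22 on a cc1.x device, the first check should report an invalid configuration argument at launch time.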

The apparent, occasional successes with values higher than 22 for BS could possibly be explained if you had done a successful run (say, with BS at 22 or less) immediately prior. Even with a failed kernel launch, if the intermediate data from that previous successful run is still left in device memory, the program may appear to produce correct results.

Access violation reading location when calling cudaMemcpy2DToArray

c++,arrays,opencv,cuda

ImgSrc_f does not point to a contiguous 512x512 chunk of memory. Try changing float *ImgSrc_f[512]; for (int i=0; i<512; i++) ImgSrc_f[i] = (float *)malloc(512 * sizeof(float)); for(int i=0;i<512;i++) for(int j=0;j<512;j++) { ImgSrc_f[i][j]=ImgSrc.at<float>(i,j); } to something like float *ImgSrc_f; ImgSrc_f = (float *)malloc(512 * 512 * sizeof(float)); for(int i=0;i<512;i++) for(int j=0;j<512;j++)...
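
For illustration, a minimal sketch of the contiguous layout being suggested (the 512x512 dimensions follow the question; the helper and the placeholder value are my own, and the OpenCV read is left as a comment since cv::Mat is not shown here):

#include <cstdlib>

void prepareImage(float **out)                                 // illustrative helper, not from the question
{
    const int W = 512, H = 512;
    float *ImgSrc_f = (float *)malloc(W * H * sizeof(float));  // one contiguous 512x512 block
    for (int i = 0; i < H; i++)
        for (int j = 0; j < W; j++)
            ImgSrc_f[i * W + j] = 0.0f;                        // e.g. ImgSrc.at<float>(i, j) in the original code
    *out = ImgSrc_f;  // pass to cudaMemcpy2DToArray with a source pitch of W * sizeof(float)
}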

cuMemcpyDtoH yields CUDA_ERROR_INVALID_VALUE

java,scala,ubuntu,cuda,jcuda

Here val deviceOutput = new CUdeviceptr() cuMemAlloc(deviceOutput, SI) you are allocating SI bytes - which is 4 bytes, as the size of one int. Writing more than 4 bytes to this device pointer will mess up things. It should be cuMemAlloc(deviceOutput, SI * numElements) And similarly, I think that the...

Async await usage for MongoDB repository

c#,mongodb,asynchronous,parallel-processing,async-await

When you call an async method you should await the returned task, which you can only do in an async method, etc. Awaiting the task makes sure you continue execution only after the operation completed, otherwise the operation and the code after it would run concurrently. So your code should...

running a thread in parallel

c#,.net,multithreading,parallel-processing

I suggest looking into Task Parallel Library. Starting with the .NET Framework 4, the TPL is the preferred way to write multithreaded and parallel code. Since you also need the result back from GetTXPower, I would use a Task<double> for it. Task<double> task = Task.Factory.StartNew<double>(GetTXPower); Depending on when you need...

Tesla k20m interoperability with Direct3D 11

cuda,direct3d,tesla

No, this won't be possible. K20m can be used (with some effort) with OpenGL graphics on Linux, but at least up through Windows 8.x, you won't be able to use K20m as a D3D device in Windows. The K20m does not publish a VGA classcode in PCI configuration space, which...

OpenMP Matrix-Vector Multiplication Executes on Only One Thread

c++,multithreading,parallel-processing,openmp,mex

Did you include this snippet in #pragma omp parallel {...}? You might be missing the word parallel.

Understanding Memory Replays and In-Flight Requests

caching,cuda

Effective load throughput is not the only metric that determines the performance of your kernel! A kernel with perfectly coalesced loads will always have a lower effective load throughput than the equivalent non-coalesced kernel, but that alone says nothing about its execution time: in the end, the one metric...

Why does Hyper-Q selectively overlap async HtoD and DtoH transfer on my cc5.2 hardware?

cuda

What you are observing is probably an artifact of running the code on a Windows WDDM platform. The WDDM subsystem has a lot of latency which other platforms are not hampered by, so to improve overall performance, the CUDA WDDM driver performs command batching. This can interfere with the expect...

Understanding Dynamic Parallelism in CUDA

multithreading,cuda

Say I launch a child grid from one thread in a block at threadIdx.x==0. Can I assume that all other threads in the parent grid have finished executing up to the point I launched the child grid as well? No. You can make no assumptions about the state of...

Algorithm for [inclusive/exclusive]_scan in parallel proposal N3554

c++,algorithm,parallel-processing,c++14

Parallel prefix sum is a classical distributed programming algorithm, which elegantly uses a reduction followed by a distribution (as illustrated in the article). The key observation is that you can compute parts of the partial sums before you know the leading terms.

Faster Matrix Multiplication in CUDA

c,cuda,matrix-multiplication

Firstly, be really sure this is what you want to do. Without describing the manipulations you want to do, it's hard to comment on this, but be aware that matrix multiplication is an n-cubed operation. If your manipulations are not the same complexity, chances are you'll do better simply using...

using “foreach” for running different classifiers in R

r,foreach,parallel-processing

The main problem is that you're not enclosing the body of the foreach loop in curly braces. Because %dopar% is a binary operator, you have to be careful about precedence, which is why I recommend always using curly braces. Also, you shouldn't use c as the combine function. Since svm...

Reactive pipeline - how to control parallelism?

c#,.net,parallel-processing,system.reactive

Merge provides an overload which takes a max concurrency. Its signature looks like: IObservable<T> Merge<T>(this IObservable<IObservable<T>> source, int maxConcurrency); Here is what it would look like with your example (I refactored some of the other code as well, which you can take or leave): return Observable //Reactive while loop also...

NVCC CUDA cross compiling cannot find “-lcudart”

linux,cuda,ld,nvcc

It turns out that the CUDA installer I was using from NVIDIA will not allow me to cross compile for my CARMA board, but it has to be downloaded from the manufacturer SECO.

Proper way to handle a group of asynchronous calls in parallel

c#,parallel-processing,async-await,task

Change this: foreach(var service in services) { Console.WriteLine("Running " + service.Name); var _serviceResponse = await client.PostAsync(_baseURL + service.Id.ToString(), null); Console.WriteLine(service.Name + " responded with " + _serviceRepsonse.StatusCode); } to this: var serviceCallTaskList = new List<Task<HttpResponseMessage>>(); foreach(var service in services) { Console.WriteLine("Running " + service.Name); serviceCallTaskList.Add(client.PostAsync(_baseURL + service.Id.ToString(), null)); } HttpResponseMessage[]...

cuda-memcheck fails to detect memory leak in an R package

r,memory-leaks,cuda,valgrind

This is not valid CUDA code: extern "C" void someCUDAcode() { int a; CUDA_CALL(cudaMalloc((void**) &a, sizeof(int))); mykernel<<<1, 1>>>(1); // CUDA_CALL(cudaFree(&a)); } When we want to do a cudaMalloc operation, we use pointers in C, not ordinary variables, like this: int *a; CUDA_CALL(cudaMalloc((void**) &a, sizeof(int))); When we want to free a...

Parallel.ForEach loop is performing like a serial loop

c#,tfs,parallel-processing,invalidoperationexception,parallel.foreach

Please help me figure out what I'm doing wrong regarding the dictionaries. The exception is thrown because List<T> is not thread-safe. You have a shared resource which needs to be modified, using Parallel.ForEach won't really help, as you're moving the bottleneck to the lock, causing the contention there, which...

How to perform parallel processes for different groups in a folder?

bash,unix,parallel-processing,groups

So my whole ordeal was with trying to use my code on a directory with a lot of files. In order to get rid of the error stating that there are too many arguments, I used this code that I gathered from previous Ole Tange posts: ls ./ | grep...

Using a data pointer with CUDA (and integrated memory)

c++,memory-management,cuda

The pointer has to be created (i.e. allocated) with cudaHostAlloc, even on integrated systems like Jetson. The reason for this is that the GPU requires (zero-copy) memory to be pinned, i.e. removed from the host demand-paging system. Ordinary allocations are subject to demand-paging, and may not be used as zero-copy...
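
As a rough, self-contained sketch of that pattern (the kernel, sizes, and names here are my own illustration, not the code from the question):

#include <cuda_runtime.h>

__global__ void scale(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main()
{
    const int N = 1024;
    float *h_data = NULL, *d_alias = NULL;
    cudaSetDeviceFlags(cudaDeviceMapHost);                                    // allow mapped (zero-copy) allocations
    cudaHostAlloc((void **)&h_data, N * sizeof(float), cudaHostAllocMapped);  // pinned + mapped host allocation
    for (int i = 0; i < N; i++) h_data[i] = 1.0f;
    cudaHostGetDevicePointer((void **)&d_alias, h_data, 0);                   // device-side alias of the same memory
    scale<<<(N + 255) / 256, 256>>>(d_alias, N);                              // kernel works on host memory directly
    cudaDeviceSynchronize();                                                  // h_data now holds the updated values
    cudaFreeHost(h_data);
    return 0;
}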

How to perform synchronous parallel functions in Clojure?

clojure,parallel-processing

Since you have exactly two branches here, it'll be best to dispatch parallel jobs to separate threads using the future function. future will return a future object (a special promise which is automatically resolved when the job completes). Here is how it will look: (defn some-entry-point [obja...

how to make the program pause when actor is running

multithreading,scala,parallel-processing,actor

If you need synchronous calls in akka, use the ask pattern. Like Await.result(ping ? "ping") Also, you'd better use an actor system to create actors. import akka.actor.{ActorRef, Props, Actor, ActorSystem} import akka.pattern.ask import akka.util.Timeout import scala.concurrent.Await import scala.concurrent.duration._ import scala.concurrent.ExecutionContext.Implicits.global object Test extends App { implicit val timeout = Timeout(3 second) val...

Basic Java threading (4 threads) slower than non-threading

java,multithreading,concurrency,parallel-processing

The huge difference in execution time is caused by the Math.random() method. If you dig into its implementation, you will see that it uses a static randomNumberGenerator that is shared across all threads. If you go one step deeper, you will notice that execution relies on the int next(int) method,...

Java 8, using .parallel in a stream causes OOM error

java,parallel-processing,java-8,java-stream

Here you create an infinite stream and limit it afterwards. There are known problems about processing infinite streams in parallel. In particular there's no way to split the task to equal parts effectively. Internally some heuristics are used which are not well suitable for every task. In your case it's...

Threads syncronization in CUDA

c++,multithreading,cuda

You can use a simple loop, and specify the threads you want to do the work in each iteration. Something like: for (int z = 0; z < zmax; z++) { if (threadIdx.z == z) { //do whatever with x and y } __syncthreads(); } In each iteration, threads with...

Java 8 parallelStream for concurrent Database / REST call

java,multithreading,concurrency,parallel-processing,java-8

You can do the operation with map instead of forEach - that will guarantee thread safety (and is cleaner from a functional programming perspective): List<String> allResult = partitions.parallelStream() .map(this::callRestAPI) .flatMap(List::stream) //flattens the lists .collect(toList()); And your callRestAPI method: private void callRestAPI(List<String> serverList) { List<String> result = //Do a REST call....

Reduce by key on device array

cuda,parallel-processing,thrust

Thrust interprets ordinary pointers as pointing to data on the host: thrust::reduce_by_key(d_list, d_list+n, d_ones, C, D,cmp); Therefore thrust will call the host path for the above algorithm, and it will seg fault when it attempts to dereference those pointers in host code. This is covered in the thrust getting started...
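
As a self-contained sketch of one way to fix it (the data is made up for illustration; the question's own types and comparator are not reproduced here), wrapping the raw device pointers in thrust::device_ptr tells Thrust to take the device path:

#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int h_keys[5] = {1, 1, 2, 2, 2}, h_ones[5] = {1, 1, 1, 1, 1};  // reduce-by-key input: count per key
    int *d_keys, *d_ones, *d_okeys, *d_ovals;
    cudaMalloc((void **)&d_keys, sizeof(h_keys));
    cudaMalloc((void **)&d_ones, sizeof(h_ones));
    cudaMalloc((void **)&d_okeys, sizeof(h_keys));
    cudaMalloc((void **)&d_ovals, sizeof(h_ones));
    cudaMemcpy(d_keys, h_keys, sizeof(h_keys), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ones, h_ones, sizeof(h_ones), cudaMemcpyHostToDevice);

    // Wrapping the raw device pointers selects Thrust's device dispatch path
    thrust::device_ptr<int> keys(d_keys), vals(d_ones), okeys(d_okeys), ovals(d_ovals);
    thrust::reduce_by_key(keys, keys + 5, vals, okeys, ovals);

    int h_okeys[5], h_ovals[5];
    cudaMemcpy(h_okeys, d_okeys, sizeof(h_okeys), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_ovals, d_ovals, sizeof(h_ovals), cudaMemcpyDeviceToHost);
    printf("%d:%d  %d:%d\n", h_okeys[0], h_ovals[0], h_okeys[1], h_ovals[1]);  // prints 1:2  2:3
    return 0;
}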

OpenMP shared variable seems to be private

c,parallel-processing,openmp

Each thread is effectively resetting n to zero. Only the thread with tid==0 will increment n prior to printing. Even then, you may see the program print I'am 0, this is my n: 0 instead of the expected I'am 0, this is my n: 1 since you...

Julia parallel programming - Making existing function available to all workers

for-loop,parallel-processing,julia-lang

The approach to return the function seems elegant but unfortunately, unlike JavaScript, Julia does not resolve all the variables when creating the functions. Technically, your training function could produce the source code of the function with literal values for all the trained parameters. Then pass it to each of the...

direct global memory access using cuda

c++,cuda

q1 - let's say I have copied one array onto the device through stream 1 using cudaMemcpyAsync; would I be able to access the values of that array in a different stream, say 2? Yes, the array da is accessible in both kernels you have shown. However, an important question is whether or...

How does CUDA's cudaMemcpyFromSymbol work?

cuda

I believe the details are that for each __device__ variable, cudafe creates a normal global variable as in C and also a CUDA-specific PTX variable. The global C variable is used so that the host program can refer to the variable by its address, and the PTX variable is used...

Multiprocessing a python script

python,parallel-processing,multiprocessing,python-multiprocessing

After discussion, the straight answer is: No. Simply because multi-processing is not some magical trick that automatically offloads the burden on one processor to another. The developer needs to know how a program should split up a task, and specify that each task should take up a new process. So...

What is version of cuda for nvidia 304.125

ubuntu,cuda,ubuntu-14.04,nvidia

304.xx is a driver that will support CUDA 5 and earlier (it does not support newer CUDA versions). If you want to reinstall Ubuntu to create a clean setup, the Linux getting started guide has all the instructions needed to set up CUDA 7 if that is your intent. I believe...

Update a D3D9 texture from CUDA

c#,cuda,sharpdx,direct3d9,managed-cuda

As hinted by the commenter, I’ve tried creating a single instance of CudaDirectXInteropResource along with the D3D texture. It worked. It’s counter-intuitive and undocumented, but it looks like cuGraphicsUnregisterResource destroys the newly written data. At least on my machine with GeForce GTX 960, Cuda 7.0 and Windows 8.1 x64. So,...

How to run DEoptim in parallel?

r,parallel-processing

As you mention in your question, you need to use parVar and packages. The packages vector should list any packages that you use, e.g. if you use a random number generator that is found in another package. The parVar vector should contain any functions or variables that are called by your...

How do you build the example CUDA Thrust device sort?

c++,visual-studio-2010,sorting,cuda,thrust

As @JaredHoberock pointed out, probably the key issue is that you are trying to compile a .cpp file. You need to rename that file to .cu and also make sure it is being compiled by nvcc. After you fix that, you will probably run into another issue. This is not...

clEnqueueNDRangeKernel fills up entire memory

c++,memory,parallel-processing,opencl

You are doing two mallocs that are never freed on each iteration of the loop. This is why you are running out of memory. Also, your loop is using an unsigned int variable, which could be a problem depending on the value of maxGloablThreads....

Can an unsigned long long int be used to store the output from clock64()?

cuda

There are various atomic functions which support atomic operations on unsigned long long int (i.e. a 64-bit unsigned integer), such as atomicCAS, atomicExch and atomicAdd. And if you have a cc3.5 or higher GPU you have even more options. Referring to the documentation on clock64(): long long int clock64(); when...

ElasticSearch Multiple Scrolls Java API

java,scroll,elasticsearch,parallel-processing

After searching some more, I got the impression that this (same scrollId) is by design. After the timeout has expired (which is reset after each call Elasticsearch scan and scroll - add to new index). So you can only get one opened scroll per index. https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html states: Scrolling is not...

how to generalize square matrix multiplication to handle arbitrary dimensions

c,cuda,parallel-processing,matrix-multiplication

This code will work for very specific dimensions but not for others. It will work for square matrix multiplication when width is exactly equal to the product of your block dimension (number of threads - 20 in the code you have shown) and your grid dimension (number of blocks -...
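
A sketch of the bounds-checked form that handles widths that are not a multiple of the block size (the names and the 16x16 block shape are illustrative, not the asker's values):

__global__ void matmul(const float *A, const float *B, float *C, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {                      // guard threads that fall outside the matrix
        float sum = 0.0f;
        for (int k = 0; k < width; k++)
            sum += A[row * width + k] * B[k * width + col];
        C[row * width + col] = sum;
    }
}

// Launch with enough blocks to cover the whole matrix in each dimension:
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (width + block.y - 1) / block.y);
// matmul<<<grid, block>>>(dA, dB, dC, width);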

How to load data in global memory into shared memory SAFELY in CUDA?

c++,cuda,shared-memory

Consider one warp of the thread block finishing the first iteration and starting the next one, while other warps are still working on the first iteration. If you don't have __syncthreads at label sync2, you will end up with this warp writing to shared memory while others are reading from...
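
A small illustrative kernel showing the double-synchronization pattern being described (it assumes blockDim.x == 256 and n a multiple of 256; this is my own example, not the code from the question):

__global__ void iterate(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[256];
    for (int base = 0; base < n; base += 256) {
        tile[threadIdx.x] = g_in[base + threadIdx.x];  // each thread loads one element of the tile
        __syncthreads();                               // sync1: whole tile loaded before anyone reads it
        float v = tile[(threadIdx.x + 1) % 256];       // read an element loaded by another thread
        g_out[base + threadIdx.x] = v;
        __syncthreads();                               // sync2: keep fast warps from overwriting the tile
    }                                                  //        while slow warps are still reading it
}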

cudaMalloc vs cudaMalloc3D performance for a 2D array

c,cuda

The performance difference you observe is mostly due to the increased instruction overhead in the pitched memory indexing scheme. Because your array size is a large power of two in the major direction, it is very likely that the pitched array allocated with cudaMalloc3D is the same size as the...

Is prefix scan CUDA sample code in gpugems3 correct?

cuda,gpu,nvidia,prefix-sum

It seems that you've made at least 1 error in transcribing the code from the GPU Gems 3 chapter into your kernel. This line is incorrect: temp[bi] += g_idata[ai]; it should be: temp[bi] += temp[ai]; When I make that one change to the code you have now posted, it seems...

CUDA cuBlasGetmatrix / cublasSetMatrix fails | Explanation of arguments

cuda,gpgpu,gpu-programming,cublas

The only actual problem in your code is here: cudaMalloc( &d_x,sizeof(d_x) ); sizeof(d_x) is just the size of a pointer. You can fix it like this: cudaMalloc( &d_x,sizeof(x) ); If you want to find out if a CUBLAS API call is failing, then you should check the return code of...

How to measure if a program was run in parallel over multiple cores in Linux?

linux,parallel-processing,benchmarking,perf

You can try using the command top in another terminal while the program is running. It will show the usage of all the cores on your machine.

How many parallel threads i can run on my nvidia graphic card in cuda programming?

cuda

That depends on the compute capability (version) of the device, I think:

Compute capability (version):                 1.0 / 1.2 / 2.x / 3.0-x.x
Maximum resident threads per multiprocessor:  768 / 1024 / 1536 / 2048
Amount of local memory per thread:            16 KB (1.x) / 512 KB (2.x and later)
Maximum threads per block:                    512 (1.x) / 1024 (2.x and later)

I found this piece of information...
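
These limits can also be queried at run time instead of being looked up in a table; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                                         // properties of device 0
    printf("max threads per block:       %d\n", prop.maxThreadsPerBlock);
    printf("max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("multiprocessor count:        %d\n", prop.multiProcessorCount);
    return 0;
}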

Running java application parallely on multicore cluster nodes

java,multithreading,concurrency,parallel-processing

Are there any frameworks available in Java like the EXECUTOR framework which can do this task? I suggest you take a look at the Akka framework for writing powerful concurrent & distributed applications. Akka uses the Actor Model together with Software Transactional Memory to raise the abstraction level and provide...

How can I pass a struct to a kernel in JCuda

java,struct,cuda,jni,jcuda

(The author of JCuda here (not "JCUDA", please)) As mentioned in the forum post linked from the comment: It is not impossible to use structs in CUDA kernels and fill them from JCuda side. It is just very complicated, and rarely beneficial. For the reason of why it is rarely...

'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D

c,cuda

The reason the error doesn't occur on this line: REAL tmp = unew_row[j]; // no error on this line is because the compiler is optimizing that line out. It doesn't do anything useful, and so the compiler completely eliminates it. The compiler warning: xxx.cu(87): warning: variable "tmp" was declared but...