Menu
  • HOME
  • TAGS

'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D

Tag: c,cuda

I am trying to allocate and copy memory of a flattened 2D array on to the device using cudaMalloc3D to test the performance of cudaMalloc3D. But when I try to write to the array from the kernel it throws 'an illegal memory access was encountered' exception. The program runs fine if I am just reading from the array but when I try to write to it, there is an error. Any help on this will be greatly appreciated. Below is my code and the syntax for compiling the code.

Compile using

nvcc -O2 -arch sm_20 test.cu 

Code: test.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.14159265 
#define NX 8192     /* includes boundary points on both end */
#define NY 4096     /* includes boundary points on both end */
#define NZ 1        /* needed for cudaMalloc3D */

#define N_THREADS_X 16
#define N_THREADS_Y 16
#define N_BLOCKS_X NX/N_THREADS_X 
#define N_BLOCKS_Y NY/N_THREADS_Y 

#define LX 4.0    /* length of the domain in x-direction  */
#define LY 2.0    /* length of the domain in x-direction  */
#define dx       (REAL) ( LX/( (REAL) (NX) ) )
#define cSqrd     5.0
#define dt       (REAL) ( 0.4 * dx / sqrt(cSqrd) )
#define FACTOR   ( cSqrd * (dt*dt)/(dx*dx) )

#define IC  (i + j*NX)       /* (i,j)   */
#define IM1 (i + j*NX - 1)   /* (i-1,j) */
#define IP1 (i + j*NX + 1)   /* (i+1,j) */
#define JM1 (i + (j-1)*NX)   /* (i,j-1) */
#define JP1 (i + (j+1)*NX)   /* (i,j+1) */


// Macro for checking CUDA errors following a CUDA launch or API call
#define cudaCheckError() {\
  cudaError_t e = cudaGetLastError();\
  if( e != cudaSuccess ) {\
    printf("\nCuda failure %s:%d: '%s'\n",__FILE__,__LINE__,cudaGetErrorString(e));\
    exit(EXIT_FAILURE);\
  }\
}

typedef double REAL;
typedef int    INT;


void meshGrid ( REAL *x, REAL *y )
{

  INT i,j;
  REAL a;
  for (j=0; j<NY; j++) {
    a = dx * ( (REAL) j );
    for (i=0; i<NX; i++) {
      x[IC] =  dx * ( (REAL) i );
      y[IC] = a;
    }
  }
}


void initWave ( REAL *u, REAL *uold, REAL *x, REAL *y )
{                    
  INT i,j;
  for (j=1; j<NY-1; j++) {
    for (i=1; i<NX-1; i++) {
      u[IC] =  0.1 * (4.0*x[IC]-x[IC]*x[IC]) * ( 2.0*y[IC] - y[IC]*y[IC] );
    }
  }

  for (j=1; j<NY-1; j++) {
    for (i=1; i<NX-1; i++) {
      uold[IC] = u[IC] + 0.5*FACTOR*( u[IP1] + u[IM1] + u[JP1] + u[JM1] - 4.0*u[IC] );
    }
  }
}


__global__ void solveWaveGPU ( cudaPitchedPtr uold, cudaPitchedPtr u, cudaPitchedPtr unew )
{

 INT i,j;

 i = blockIdx.x*blockDim.x + threadIdx.x;
 j = blockIdx.y*blockDim.y + threadIdx.y;

 if (i>0 && i < (NX-1) && j>0 && j < (NY-1) ) {

  char *unewPtr  = (char *) unew.ptr;
  REAL *unew_row = (REAL *) (unewPtr + i * unew.pitch);

  REAL tmp = unew_row[j]; // no error on this line
  unew_row[j] = 1.2; // this is where I get the error
 }

}


INT main(INT argc, char *argv[])
{

  INT nTimeSteps = 10;  

  // pointers for the host side
  REAL *unew, *u, *uold, *uFinal, *x, *y;

  // allocate memory on the host
  unew        = (REAL *)calloc(NX*NY,sizeof(REAL));
  u           = (REAL *)calloc(NX*NY,sizeof(REAL));
  uold        = (REAL *)calloc(NX*NY,sizeof(REAL));
  uFinal      = (REAL *)calloc(NX*NY,sizeof(REAL));
  x           = (REAL *)calloc(NX*NY,sizeof(REAL));
  y           = (REAL *)calloc(NX*NY,sizeof(REAL));


  // pointer for the device side
  size_t pitch = NX * sizeof(REAL);
  cudaPitchedPtr  d_u, d_uold, d_unew, d_tmp;
  cudaExtent myExtent = make_cudaExtent(pitch, NY, NZ);

  // allocate 3D memory on the device
  cudaMalloc3D( &d_u, myExtent );    cudaCheckError();
  cudaMalloc3D( &d_uold, myExtent ); cudaCheckError();
  cudaMalloc3D( &d_unew, myExtent ); cudaCheckError();


  // initialize grid and wave
  meshGrid( x, y );
  initWave( u, uold, x, y );


  // copy host memory to 3D device memory
  cudaMemcpy3DParms cpy3D = { 0 };
  cpy3D.kind = cudaMemcpyHostToDevice;

  // copying u to d_u
  cpy3D.srcPtr = make_cudaPitchedPtr(u, pitch, NX, NY);
  cpy3D.dstPtr = d_u;
  cpy3D.extent = myExtent;
  cudaMemcpy3D( &cpy3D ); cudaCheckError();  

  // copying uold to d_uold
  cpy3D.srcPtr = make_cudaPitchedPtr(uold, pitch, NX, NY);
  cpy3D.dstPtr = d_uold;
  cpy3D.extent = myExtent;
  cudaMemcpy3D( &cpy3D ); cudaCheckError();  


  //  set up the GPU grid/block model
  dim3 dimGrid  ( N_BLOCKS_X , N_BLOCKS_Y  );
  dim3 dimBlock ( N_THREADS_X, N_THREADS_Y );

  for ( INT n = 1; n < nTimeSteps + 1; n++ ) {
    solveWaveGPU <<< dimGrid, dimBlock >>> ( d_uold, d_u, d_unew );
    cudaThreadSynchronize();
    cudaCheckError();

    d_tmp  = d_uold;
    d_uold = d_u;
    d_u    = d_unew;
    d_unew = d_tmp;
  }

  // copy the memory back to host
  cpy3D.kind = cudaMemcpyDeviceToHost;

  // copying d_unew to uFinal
  cpy3D.srcPtr = d_unew;
  cpy3D.dstPtr = make_cudaPitchedPtr(uFinal, pitch, NX, NY);
  cpy3D.extent = myExtent;
  cudaMemcpy3D( &cpy3D ); cudaCheckError();  

  free(u);    cudaFree(d_u.ptr);
  free(unew); cudaFree(d_unew.ptr);
  free(uold); cudaFree(d_uold.ptr);

  free(uFinal); free(x); free(y);

  return EXIT_SUCCESS;
}

Best How To :

The reason the error doesn't occur on this line:

REAL tmp = unew_row[j]; // no error on this line

is because the compiler is optimizing that line out. It doesn't do anything useful, and so the compiler completely eliminates it. The compiler warning:

xxx.cu(87): warning: variable "tmp" was declared but never referenced

is a hint to that effect.

Your code is very nearly correct. The issue is here:

REAL *unew_row = (REAL *) (unewPtr + i * unew.pitch);

It should be:

REAL *unew_row = (REAL *) (unewPtr + j * unew.pitch);

The i variable in your kernel is the width (i.e. X) dimension. The j variable is the height (i.e. Y) dimension.

The height is the one that refers to which row you are on, therefore the row pitch should be multiplied by the height parameter, i.e. j, not i.

Similarly, although it's not the source of the specific failure for your particular dimensions, this code may be not what you intended either:

REAL tmp = unew_row[j]; // no error on this line
unew_row[j] = 1.2; // this is where I get the error

If, for example, you were intending to compute the offset to the row and then index into the row (perhaps to set every element in the alocation, for example) then I think you would want to use i not j as your final index:

REAL tmp = unew_row[i]; // no error on this line
unew_row[i] = 1.2; // this is where I get the error

However, for this particular example, this is not the actual source of the illegal memory access.

How convert unsigned int to unsigned char array

c++,c

#include <stdio.h> int main() { unsigned int i = 0x557e89f3; unsigned char c[4]; c[0] = i & 0xFF; c[1] = (i>>8) & 0xFF; c[2] = (i>>16) & 0xFF; c[3] = (i>>24) & 0xFF; printf("c[0] = %x \n", c[0]); printf("c[1] = %x \n", c[1]); printf("c[2] = %x \n", c[2]); printf("c[3] =...

C language, vector of struct, miss something?

c,vector,struct

What is happening is that tPeca pecaJogo[tam]; is a local variable, and as such the whole array is allocated in the stack frame of the function, which means that it will be deallocated along with the stack frame where the function it self is loaded. The reason it's working is...

Recursive signal call using kill function

c,signals

Historically, lots of details about how signals work have changed. For instance, in the earliest variant, the processing of the signal reverted to default when the handler was called, and the handler had to re-establish itself. In this situation, sending the signal from the handler would kill the process. Currently,...

Unexpected result when calculating a percentage - even when factoring in integer division rules

c,percentage,integer-overflow,integer-division

Diagnosis The value you expect is, presumably, 91. The problem appears to be that your compiler is using 16-bit int values. You should identify the platform on which you're working and include information about unusual situations such as 16-bit int types. It is reasonable for us to assume 32-bit or...

Does realloc() invalidate all pointers?

c,pointers,dynamic-memory-allocation,behavior,realloc

Yes, ptr2 is unaffected by realloc(), it has no connection to realloc() call whatsoever(as per the current code). However, FWIW, as per the man page of realloc(), (emphasis mine) The realloc() function returns a pointer to the newly allocated memory, which is suitably aligned for any kind of variable and...

How does this code print odd and even?

c,if-statement,macros,logic

In binary any numbers LSB (Least Significant Bit) is set or 1 means the number is odd, and LSB 0 means the number is even. Lets take a look: Decimal binary 1 001 (odd) 2 010 (even) 3 011 (odd) 4 100 (even) 5 101 (odd) SO, the following line...

C programming - Confusion regarding curly braces

c,scope

The only difference between the two is the scope of the else. Without the braces, it spans until the end of the full statement, which is the next ;, i.e the next line: else putchar(ch); /* end of else */ lastch = ch; /* outside of if-else */ With the...

Text justification C language

c,text,alignment

From printf's manual: The field width An optional decimal digit string (with nonzero first digit) specifying a minimum field width. If the converted value has fewer characters than the field width, it will be padded with spaces on the left (or right, if the left-adjustment flag has been given). Instead...

Array breaking in Pebble C

c,arrays,pebble-watch,cloudpebble

The problem is this line static char *die_label = "D"; That points die_label to a region of memory that a) should not be written to, and b) only has space for two characters, the D and the \0 terminator. So the strcat is writing into memory that it shouldn't be....

getchar() not working in c

c,while-loop,char,scanf,getchar

That's because scanf() left the trailing newline in input. I suggest replacing this: ch = getchar(); With: scanf(" %c", &ch); Note the leading space in the format string. It is needed to force scanf() to ignore every whitespace character until a non-whitespace is read. This is generally more robust than...

Is post-increment operator guaranteed to run instantly?

c,c89,post-increment,ansi-c

This code is broken for two reasons: Accessing a variable twice between sequence points, for other purposes than to determine which value to store, is undefined behavior. There are no sequence points between the evaluation of function parameters. Meaning anything could happen, your program might crash & burn (or more...

Galois LFSR - how to specify the output bit number

c,prng,shift-register

If you need bit k (k = 0 ..15), you can do the following: return (lfsr >> k) & 1; This shifts the register kbit positions to the right and masks the least significant bit....

What does `strcpy(x+1, SEQX)` do?

c,strcpy

The pointer + offset notation is used as a convenient means to reference memory locations. In your case, the pointer is provided by malloc() after allocating sufficient heap memory, and represents an array of M + 2 elements of type char, thus the notation as used in your code represents...

Does strlen() always correctly report the number of char's in a pointer initialized string?

c,strlen

What strlen does is basically count all bytes until it hits a zero-byte, the so-called null-terminator, character '\0'. So as long as the string contains a terminator within the bounds of the memory allocated for the string, strlen will correctly return the number of char in the string. Note that...

Counting bytes received by posix read()

c,function,serial-port,posix

Yes, temp_uart_count will contain the actual number of bytes read, and obviously that number will be smaller or equal to the number of elements of temp_uart_data. If you get 0, it means that the end of file (or an equivalent condition) has been reached and there is nothing else to...

scanf get multiple values at once

c,char,segmentation-fault,user-input,scanf

I'm not saying that it cannot be done using scanf(), but IMHO, that's not the best way to do it. Instead, use fgets() to read the whole like, use strtok() to tokenize the input and then, based on the first token value, iterate over the input string as required. A...

How to read string until two consecutive spaces?

c,format,sscanf,c-strings

The scanf family of functions are good for simple parsing, but not for more complicated things like you seem to do. You could probably solve it by using e.g. strstr to find the comment starter "//", terminate the string there, and then remove trailing space....

VS2012 Identifer not found when part of static lib

c,visual-studio-2012,linker,static-libraries

C++ uses something called name mangling when it creates symbol names. It's needed because the symbol names must contain the complete function signature. When you use extern "C" the names will not be mangled, and can be used from other programming languages, like C. You clearly make the shunt library...

Tesla k20m interoperability with Direct3D 11

cuda,direct3d,tesla

No, this won't be possible. K20m can be used (with some effort) with OpenGL graphics on Linux, but at least up through windows 8.x, you won't be able to use K20m as a D3D device in Windows. The K20m does not publish a VGA classcode in PCI configuration space, which...

CGO converting Xlib XEvent struct to byte array?

c,go,xlib,cgo

As mentioned in the cgo documentation: As Go doesn't have support for C's union type in the general case, C's union types are represented as a Go byte array with the same length. Another SO question: Golang CGo: converting union field to Go type or a go-nuts mailing list post...

What all local variables goto Data/BSS segment?

c++,c,nm

"local" in this context means file scope. That is: static int local_data = 1; /* initialised local data */ static int local_bss; /* uninitialised local bss */ int global_data = 1; /* initialised global data */ int global_bss; /* uninitialised global bss */ void main (void) { // Some code...

Multiple definition and file management

c,arrays,compilation,compiler-errors,include

include is a preprocessor directive that includes the contents of the file named at compile time. The code that conditionally includes stuff is executed at run time...not compile time. So both files are being compiled in. ( You're also including each file twice, once in the main function and once...

Passing int using char pointer in C

c,exec,ipc

Programs simply do not take integers as arguments, they take strings. Those strings can be decimal representations of integers, but they are still strings. So you are asking how to do something that simply doesn't make any sense. Twenty is an integer. It's the number of things you have if...

execl() works on one of my code, but doesn't work on another

c,execl

My C is a bit rusty but your code made many rookie mistakes. execl will replace the current process if it succeeds. So the last line ("i have no idea why") won't print if the child can launch successfully. Which means... execl failed and you didn't check for it! Hint:...

Is there Predefined-Macros define about byte order in armcc

c,armcc,predefined-macro

Well according to this page: http://www.keil.com/support/man/docs/armccref/armccref_BABJFEFG.htm You have __BIG_ENDIAN which is defined when compiling for a big endian target....

CUDA cuBlasGetmatrix / cublasSetMatrix fails | Explanation of arguments

cuda,gpgpu,gpu-programming,cublas

The only actual problem in your code is here: cudaMalloc( &d_x,sizeof(d_x) ); sizeof(d_x) is just the size of a pointer. You can fix it like this: cudaMalloc( &d_x,sizeof(x) ); If you want to find out if a CUBLAS API call is failing, then you should check the return code of...

Loop through database table and compare user input

mysql,c

If you are only looking for fields that match the input, you'll want to search the database using the input string. In other words, write your query string so that it only gives you results that match the user input. This will be much faster than searching through every returned...

free causing different results from malloc

c,string,malloc,free

Every time you are creating your string, you are not appending a null terminator, which causes the error. So change this: for(j=0; j<rem_len; j++) { if(j != i) { remaining_for_next[index_4_next] = remaining[j]; index_4_next++; } } to this: for(j=0; j<rem_len; j++) { if(j != i) { remaining_for_next[index_4_next] = remaining[j]; index_4_next++; }...

Segmentation Fault if I don't say int i=0

c,arrays,segmentation-fault,initialization,int

In your code, int i is an automatic local variable. If not initialized explicitly, the value held by that variable in indeterministic. So, without explicit initialization, using (reading the value of ) i in any form, like array[i] invokes undefined behaviour, the side-effect being a segmentation fault. Isn't it automatically...

fread(), solaris to unix portability and use of uninitialised values

c,linux,memory,stack,portability

Q 1. why is ch empty even after fread() assignment? (Most probably) because fread() failed. See the detailed answer below. Q 2.Is this a portability issue between Solaris and Linux? No, there is a possible issue with your code itself, which is correctly reported by valgrind. I cannot quite...

How can I pass a struct to a kernel in JCuda

java,struct,cuda,jni,jcuda

(The author of JCuda here (not "JCUDA", please)) As mentioned in the forum post linked from the comment: It is not impossible to use structs in CUDA kernels and fill them from JCuda side. It is just very complicated, and rarely beneficial. For the reason of why it is rarely...

Segmentation fault with generating an RSA and saving in ASN.1/DER?

c,openssl,cryptography,rsa

pub_l = malloc(sizeof(pub_l)); is simply not needed. Nor is priv_l = malloc(sizeof(priv_l));. Remove them both from your function. You should be populating your out-parameters; instead you're throwing out the caller's provided addresses to populate and (a) populating your own, then (b) leaking the memory you just allocated. The result is...

Set precision dynamically using sprintf

c,printf,format-string

Yes, you can do that. You need to use an asterisk * as the field width and .* as the precision. Then, you need to supply the arguments carrying the values. Something like sprintf(myNumber,"%*.*lf",A,B,a); Note: A and B need to be type int. From the C11 standard, chapter ยง7.21.6.1, fprintf()...

How does ((a++,b)) work? [duplicate]

c,function,recursion,comma

In your first code, Case 1: return reverse(i++); will cause stack overflow as the value of unchanged i will be used as the function argument (as the effect of post increment will be sequenced after the function call), and then i will be increased. So, it is basically calling the...

Infinite loop with fread

c,arrays,loops,malloc,fread

If you're "trying to allocate an array 64 bytes in size", you may consider uint8_t Buffer[64]; instead of uint8_t *Buffer[64]; (the latter is an array of 64 pointers to byte) After doing this, you will have no need in malloc as your structure with a 64 bytes array inside is...

Is there any way of protecting a variable for being modified at runtime in C?

c,variables,constants

You can make the result of the input be const like this: int func() { int op = 0; scanf( "%d", &op ); if( op == 0 ) return 1; else return 2; } int main() { const int v = func(); // ... } NB. Of course, there is...

CallXXXMethod undefined using JNI in C

java,c,jni

There are few fixes required in the code: CallIntMethod should be (*env)->CallIntMethod class Test should be public Invocation should be jint age = (*env)->CallIntMethod(env, mod_obj, mid, NULL); Note that you need class name to call a static function but an object to call a method. (cls2 -> person) mid =...

Is it safe to read and write on an array of 32 bit data byte by byte?

c,memory,memory-alignment

Yes, this is correct. The only danger would be generating a bit pattern that does not correspond to any int, but on modern systems there are no such patterns. Also, if the data type was uint32_t specifically, those are prohibited from having any such patterns anyway. Note that the inverse...

How can I align stack to the end of SRAM?

c,embedded,stm32,gnu-arm,coocox

I've found the reason: that's because stack size is actually fixed and it is located in heap (if I could call it heap). In file startup_stm32f10x*.c there is a section: /*----------Stack Configuration----------*/ #define STACK_SIZE 0x00000100 /*!< The Stack size suggest using even number */ And at then very next line:...

On entry to NIT parameter number 9 had an illegal value

c,mpi,intel-mkl,mpich,scalapack

This answer is courtesy of Ying from Intel, all the credits go to him! The int in C are supposed to be 32bit, you may try lp64 mode. mpicc -o test_lp64 ex1.c -I/opt/intel/mkl/include /opt/intel/mkl/lib/intel64/libmkl_scalapack_lp64.a -L/opt/intel/mkl/lib/intel64 -Wl,--start-group /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.a /opt/intel/mkl/lib/intel64/libmkl_core.a /opt/intel/mkl/lib/intel64/libmkl_sequential.a -Wl,--end-group /opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.a -lpthread -lm -ldl [[email protected] scalapack]$ mpirun -n 4...

How to control C Macro Precedence

c,macros

You can redirect the JOIN operation to another macro, which then does the actual pasting, in order to enforce expansion of its arguments: #define VAL1CHK 20 #define NUM 1 #define JOIN1(A, B, C) A##B##C #define JOIN(A, B, C) JOIN1(A, B, C) int x = JOIN(VAL,NUM,CHK); This technique is often used...

C++ / C #define macro calculation

c++,c,macros

Are DETUNE1 and DETUNE2 calculated every time it is called? Very unlikely. Because you are calling sqrt with constants, most compilers would optimize the call to the sqrt functions and replace it with a constant value. GCC does that at -O1. So does clang. (See live). In the general...

Efficient comparison of small integer vectors

c,integer,compare,bit-manipulation,string-comparison

It's possible to do this using bit-manipulation. Space your values out so that each takes up 5 bits, with 4 bits for the value and an empty 0 in the most significant position as a kind of spacing bit. Placing a spacing bit between each value stops borrows/carries from propagating...

Reverse ^ operator for decryption

c,algorithm,security,math,encryption

This is not a power operator. It is the XOR operator. The thing that you notice for the XOR operator is that x ^ k ^ k == x. That means that your encryption function is already the decryption function when called with the same key and the ciphertext instead...

OpenGL glTexImage2D memory issue

c,opengl

Which man page are you quoting? There are multiple man pages available, not all mapping to the same OpenGL version. Anyways, the idea behind the + 2 (border) is to have 2 multiplied by the value of border, which is in your case 0. So your code is just fine....

Is i=i+1 an undefined behaviour?

c,increment,undefined-behavior

There is no undefined behavior in this code. i=i+1; is well-defined behavior, not to be confused with i=i++; which gives undefined behavior. The only thing that could cause different outputs here would be floating point inaccuracy. Try value += 4 * (int)nearbyint(pow(10,i)); and see if it makes any difference....

How to increment the value of an unsigned char * (C)

c++,c,openssl,byte,sha1

I am assuming your pointer refers to 20 bytes, for the 160 bit value. (An alternative may be text characters representing hex values for the same 160 bit meaning, but occupying more characters) You can declare a class for the data, and implement a method to increment the low order...

Program to reverse a string in C without declaring a char[]

c,string,pointers,char

Important: scanf(" %s", name); has no bounds checking on the input. If someone enters more than 255 characters into your program, it may give undefined behaviour. Now, you have the char array you have the count (number of char in the array), why do you need to bother doing stuffs...

Disadvantages of calling realloc in a loop

c,memory-management,out-of-memory,realloc

When you allocate/deallocate memory many times, it may create fragmentation in the memory and you may not get big contiguous chunk of the memory. When you do a realloc, some extra memory may be needed for a short period of time to move the data. If your algorithm does...

C binary tree sort - extending it

c,binary-tree,binary-search-tree

a sample to modify like as void inorder ( struct btreenode *, int ** ) ; int* sort(int *array, int arr_size) { struct btreenode *bt = NULL; int i, *p = array; for ( i = 0 ; i < arr_size ; i++ ) insert ( &bt, array[i] ) ;...