Menu
  • HOME
  • TAGS

GCC emits vastly different code using “-march=native” on similar architectures

Tag: c,gcc,assembly,sse,avx

I'm working on writing an OpenCL benchmark in C. Currently, it measures the fused multiply-accumulate performance of both a CL device, and the system's processor using C code. The results are then cross checked for accuracy.

I wrote the native code to take advantage of GCC's auto vectorizer, and it works. However, I've noticed that GCC has some odd behavior with the "-march=native" flag.

This is my loop:

#define BUFFER_SIZE_SQRT 4096
#define SQUARE(n) (n * n)

#define ROUNDS_PER_ITERATION 48

static float* cpu_result_matrix(const float* a, const float* b, const float* c)
{
    float* res = aligned_alloc(16, SQUARE(BUFFER_SIZE_SQRT) * sizeof(float));

    const unsigned buff_size = SQUARE(BUFFER_SIZE_SQRT);
    const unsigned round_cnt = ROUNDS_PER_ITERATION;

    float lres;
    for(unsigned i = 0; i < buff_size; i++)
    {
        lres = 0;
        for(unsigned j = 0; j < round_cnt; j++)
        {
            lres += a[i] * ((b[i] * c[i]) + b[i]);
            lres += b[i] * ((c[i] * a[i]) + c[i]);
            lres += c[i] * ((a[i] * b[i]) + a[i]);
        }

        res[i] = lres;
    }

    return res;
}

When I compile with "-march=native -Ofast" on a Broadwell system, I get nice vectorized AVX code.

.L19:
        vmovups ymm0, YMMWORD PTR [rcx+rdx]
        mov     eax, 48
        vmovups ymm2, YMMWORD PTR [rdi+rdx]
        vaddps  ymm1, ymm0, ymm5
        vmovups ymm3, YMMWORD PTR [rsi+rdx]
        vaddps  ymm4, ymm2, ymm5
        vmulps  ymm1, ymm1, ymm2
        vfmadd132ps     ymm4, ymm1, ymm0
        vaddps  ymm1, ymm3, ymm5
        vmulps  ymm0, ymm2, ymm0
        vmulps  ymm0, ymm0, ymm1
        vfmadd132ps     ymm4, ymm0, ymm3
        vmovaps ymm1, ymm4
        vxorps  xmm0, xmm0, xmm0
        .p2align 4,,10
        .p2align 3

Compiling with the same flags on a Piledriver system emits SSE2 instructions, but no AVX instructions, even though the architecture supports it. (I'll clarify my title here by saying that Broadwell and Piledriver are nothing alike, but they both support similar vector instruction set extensions, so the emitted code should be similar.)

.L19:
        mov     eax, 48
        movups  xmm0, XMMWORD PTR [rcx+rdx]
        movups  xmm2, XMMWORD PTR [r13+0+rdx]
        movaps  xmm4, xmm0
        movaps  xmm1, xmm2
        movups  xmm3, XMMWORD PTR [rsi+rdx]
        addps   xmm4, xmm5
        addps   xmm1, xmm5
        mulps   xmm4, xmm2
        mulps   xmm1, xmm0
        mulps   xmm0, xmm2
        addps   xmm1, xmm4
        movaps  xmm4, xmm1
        mulps   xmm4, xmm3
        addps   xmm3, xmm5
        mulps   xmm0, xmm3
        addps   xmm4, xmm0
        pxor    xmm0, xmm0
        movaps  xmm1, xmm4
        .p2align 4,,10
        .p2align 3

I can even compile the whole project with -march=broadwell, and run it on the Piledriver system, and it works, with a ~100% performance gain.

I'm compiling with GCC 5.1.0, and "-ftree-vectorizer-verbose" doesn't seem to work anymore, so the compiler's behavior is quite opaque. I haven't found any information about the flag being deprecated, so I'm not sure why it doesn't work anymore, and I'd really like to figure out what GCC is doing.

The whole project is here: https://github.com/jakogut/clperf/tree/v0.1

Best How To :

AVX is disabled because the entire AMD Bulldozer family does not handle 256-bit AVX instructions efficiently. Internally, the execution units are only 128-bit wide. So 256-bit operations are split up thereby providing no benefit over 128-bit.

To add insult to injury, on Piledriver, there's a bug in the 256-bit store that reduces the throughput to about 1 every 17 cycles.


Your test case seems to be an anomaly. You don't have 256-bit stores in that critical loop - which avoids the bug. This (theoretically) leaves SSE on par with AVX for Piledriver.

The tie-breaker comes from the FMA3 instructions which Piledriver supports. This is probably why the AVX loop does become faster on Piledriver.

One thing you can try is -mfma4 -mtune=bdver2 and see what happens.

How to control C Macro Precedence

c,macros

You can redirect the JOIN operation to another macro, which then does the actual pasting, in order to enforce expansion of its arguments: #define VAL1CHK 20 #define NUM 1 #define JOIN1(A, B, C) A##B##C #define JOIN(A, B, C) JOIN1(A, B, C) int x = JOIN(VAL,NUM,CHK); This technique is often used...

Segmentation Fault if I don't say int i=0

c,arrays,segmentation-fault,initialization,int

In your code, int i is an automatic local variable. If not initialized explicitly, the value held by that variable in indeterministic. So, without explicit initialization, using (reading the value of ) i in any form, like array[i] invokes undefined behaviour, the side-effect being a segmentation fault. Isn't it automatically...

How to read string until two consecutive spaces?

c,format,sscanf,c-strings

The scanf family of functions are good for simple parsing, but not for more complicated things like you seem to do. You could probably solve it by using e.g. strstr to find the comment starter "//", terminate the string there, and then remove trailing space....

How can I align stack to the end of SRAM?

c,embedded,stm32,gnu-arm,coocox

I've found the reason: that's because stack size is actually fixed and it is located in heap (if I could call it heap). In file startup_stm32f10x*.c there is a section: /*----------Stack Configuration----------*/ #define STACK_SIZE 0x00000100 /*!< The Stack size suggest using even number */ And at then very next line:...

Is i=i+1 an undefined behaviour?

c,increment,undefined-behavior

There is no undefined behavior in this code. i=i+1; is well-defined behavior, not to be confused with i=i++; which gives undefined behavior. The only thing that could cause different outputs here would be floating point inaccuracy. Try value += 4 * (int)nearbyint(pow(10,i)); and see if it makes any difference....

Is there any way of protecting a variable for being modified at runtime in C?

c,variables,constants

You can make the result of the input be const like this: int func() { int op = 0; scanf( "%d", &op ); if( op == 0 ) return 1; else return 2; } int main() { const int v = func(); // ... } NB. Of course, there is...

Infinite loop with fread

c,arrays,loops,malloc,fread

If you're "trying to allocate an array 64 bytes in size", you may consider uint8_t Buffer[64]; instead of uint8_t *Buffer[64]; (the latter is an array of 64 pointers to byte) After doing this, you will have no need in malloc as your structure with a 64 bytes array inside is...

VS2012 Identifer not found when part of static lib

c,visual-studio-2012,linker,static-libraries

C++ uses something called name mangling when it creates symbol names. It's needed because the symbol names must contain the complete function signature. When you use extern "C" the names will not be mangled, and can be used from other programming languages, like C. You clearly make the shunt library...

How to increment the value of an unsigned char * (C)

c++,c,openssl,byte,sha1

I am assuming your pointer refers to 20 bytes, for the 160 bit value. (An alternative may be text characters representing hex values for the same 160 bit meaning, but occupying more characters) You can declare a class for the data, and implement a method to increment the low order...

C binary tree sort - extending it

c,binary-tree,binary-search-tree

a sample to modify like as void inorder ( struct btreenode *, int ** ) ; int* sort(int *array, int arr_size) { struct btreenode *bt = NULL; int i, *p = array; for ( i = 0 ; i < arr_size ; i++ ) insert ( &bt, array[i] ) ;...

free causing different results from malloc

c,string,malloc,free

Every time you are creating your string, you are not appending a null terminator, which causes the error. So change this: for(j=0; j<rem_len; j++) { if(j != i) { remaining_for_next[index_4_next] = remaining[j]; index_4_next++; } } to this: for(j=0; j<rem_len; j++) { if(j != i) { remaining_for_next[index_4_next] = remaining[j]; index_4_next++; }...

Disadvantages of calling realloc in a loop

c,memory-management,out-of-memory,realloc

When you allocate/deallocate memory many times, it may create fragmentation in the memory and you may not get big contiguous chunk of the memory. When you do a realloc, some extra memory may be needed for a short period of time to move the data. If your algorithm does...

Is post-increment operator guaranteed to run instantly?

c,c89,post-increment,ansi-c

This code is broken for two reasons: Accessing a variable twice between sequence points, for other purposes than to determine which value to store, is undefined behavior. There are no sequence points between the evaluation of function parameters. Meaning anything could happen, your program might crash & burn (or more...

execl() works on one of my code, but doesn't work on another

c,execl

My C is a bit rusty but your code made many rookie mistakes. execl will replace the current process if it succeeds. So the last line ("i have no idea why") won't print if the child can launch successfully. Which means... execl failed and you didn't check for it! Hint:...

Efficient comparison of small integer vectors

c,integer,compare,bit-manipulation,string-comparison

It's possible to do this using bit-manipulation. Space your values out so that each takes up 5 bits, with 4 bits for the value and an empty 0 in the most significant position as a kind of spacing bit. Placing a spacing bit between each value stops borrows/carries from propagating...

Segmentation fault with generating an RSA and saving in ASN.1/DER?

c,openssl,cryptography,rsa

pub_l = malloc(sizeof(pub_l)); is simply not needed. Nor is priv_l = malloc(sizeof(priv_l));. Remove them both from your function. You should be populating your out-parameters; instead you're throwing out the caller's provided addresses to populate and (a) populating your own, then (b) leaking the memory you just allocated. The result is...

Set precision dynamically using sprintf

c,printf,format-string

Yes, you can do that. You need to use an asterisk * as the field width and .* as the precision. Then, you need to supply the arguments carrying the values. Something like sprintf(myNumber,"%*.*lf",A,B,a); Note: A and B need to be type int. From the C11 standard, chapter §7.21.6.1, fprintf()...

Passing int using char pointer in C

c,exec,ipc

Programs simply do not take integers as arguments, they take strings. Those strings can be decimal representations of integers, but they are still strings. So you are asking how to do something that simply doesn't make any sense. Twenty is an integer. It's the number of things you have if...

What does `strcpy(x+1, SEQX)` do?

c,strcpy

The pointer + offset notation is used as a convenient means to reference memory locations. In your case, the pointer is provided by malloc() after allocating sufficient heap memory, and represents an array of M + 2 elements of type char, thus the notation as used in your code represents...

How convert unsigned int to unsigned char array

c++,c

#include <stdio.h> int main() { unsigned int i = 0x557e89f3; unsigned char c[4]; c[0] = i & 0xFF; c[1] = (i>>8) & 0xFF; c[2] = (i>>16) & 0xFF; c[3] = (i>>24) & 0xFF; printf("c[0] = %x \n", c[0]); printf("c[1] = %x \n", c[1]); printf("c[2] = %x \n", c[2]); printf("c[3] =...

C++ / C #define macro calculation

c++,c,macros

Are DETUNE1 and DETUNE2 calculated every time it is called? Very unlikely. Because you are calling sqrt with constants, most compilers would optimize the call to the sqrt functions and replace it with a constant value. GCC does that at -O1. So does clang. (See live). In the general...

Is it safe to read and write on an array of 32 bit data byte by byte?

c,memory,memory-alignment

Yes, this is correct. The only danger would be generating a bit pattern that does not correspond to any int, but on modern systems there are no such patterns. Also, if the data type was uint32_t specifically, those are prohibited from having any such patterns anyway. Note that the inverse...

How does this code print odd and even?

c,if-statement,macros,logic

In binary any numbers LSB (Least Significant Bit) is set or 1 means the number is odd, and LSB 0 means the number is even. Lets take a look: Decimal binary 1 001 (odd) 2 010 (even) 3 011 (odd) 4 100 (even) 5 101 (odd) SO, the following line...

OpenGL glTexImage2D memory issue

c,opengl

Which man page are you quoting? There are multiple man pages available, not all mapping to the same OpenGL version. Anyways, the idea behind the + 2 (border) is to have 2 multiplied by the value of border, which is in your case 0. So your code is just fine....

C language, vector of struct, miss something?

c,vector,struct

What is happening is that tPeca pecaJogo[tam]; is a local variable, and as such the whole array is allocated in the stack frame of the function, which means that it will be deallocated along with the stack frame where the function it self is loaded. The reason it's working is...

fread(), solaris to unix portability and use of uninitialised values

c,linux,memory,stack,portability

Q 1. why is ch empty even after fread() assignment? (Most probably) because fread() failed. See the detailed answer below. Q 2.Is this a portability issue between Solaris and Linux? No, there is a possible issue with your code itself, which is correctly reported by valgrind. I cannot quite...

Loop through database table and compare user input

mysql,c

If you are only looking for fields that match the input, you'll want to search the database using the input string. In other words, write your query string so that it only gives you results that match the user input. This will be much faster than searching through every returned...

On entry to NIT parameter number 9 had an illegal value

c,mpi,intel-mkl,mpich,scalapack

This answer is courtesy of Ying from Intel, all the credits go to him! The int in C are supposed to be 32bit, you may try lp64 mode. mpicc -o test_lp64 ex1.c -I/opt/intel/mkl/include /opt/intel/mkl/lib/intel64/libmkl_scalapack_lp64.a -L/opt/intel/mkl/lib/intel64 -Wl,--start-group /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.a /opt/intel/mkl/lib/intel64/libmkl_core.a /opt/intel/mkl/lib/intel64/libmkl_sequential.a -Wl,--end-group /opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.a -lpthread -lm -ldl [[email protected] scalapack]$ mpirun -n 4...

Galois LFSR - how to specify the output bit number

c,prng,shift-register

If you need bit k (k = 0 ..15), you can do the following: return (lfsr >> k) & 1; This shifts the register kbit positions to the right and masks the least significant bit....

C programming - Confusion regarding curly braces

c,scope

The only difference between the two is the scope of the else. Without the braces, it spans until the end of the full statement, which is the next ;, i.e the next line: else putchar(ch); /* end of else */ lastch = ch; /* outside of if-else */ With the...

ARM assembly cannot use immediate values and ADDS/ADCS together

gcc,assembly,arm,instructions

Somewhat ironically, by using UAL syntax to solve the first problem you've now hit pretty much the same thing, but the other way round and with a rather more cryptic symptom. The only Thumb encodings for (non-flag-setting) mov with an immediate operand are 32-bit ones, however Cortex-M0 doesn't support those,...

Counting bytes received by posix read()

c,function,serial-port,posix

Yes, temp_uart_count will contain the actual number of bytes read, and obviously that number will be smaller or equal to the number of elements of temp_uart_data. If you get 0, it means that the end of file (or an equivalent condition) has been reached and there is nothing else to...

Does strlen() always correctly report the number of char's in a pointer initialized string?

c,strlen

What strlen does is basically count all bytes until it hits a zero-byte, the so-called null-terminator, character '\0'. So as long as the string contains a terminator within the bounds of the memory allocated for the string, strlen will correctly return the number of char in the string. Note that...

What all local variables goto Data/BSS segment?

c++,c,nm

"local" in this context means file scope. That is: static int local_data = 1; /* initialised local data */ static int local_bss; /* uninitialised local bss */ int global_data = 1; /* initialised global data */ int global_bss; /* uninitialised global bss */ void main (void) { // Some code...

Django MySQLClient pip compile failure on Linux

python,linux,django,gcc,pip

It looks like you're missing zlib; you'll want to install it: apt-get install zlib1g-dev I also suggest reading over the README and confirming you have all other dependencies met: https://github.com/dccmx/mysqldb/blob/master/README Also, I suggest using mysqlclient over MySQLdb as its a fork of MySQLdb and what Django recommends....

Array breaking in Pebble C

c,arrays,pebble-watch,cloudpebble

The problem is this line static char *die_label = "D"; That points die_label to a region of memory that a) should not be written to, and b) only has space for two characters, the D and the \0 terminator. So the strcat is writing into memory that it shouldn't be....

CallXXXMethod undefined using JNI in C

java,c,jni

There are few fixes required in the code: CallIntMethod should be (*env)->CallIntMethod class Test should be public Invocation should be jint age = (*env)->CallIntMethod(env, mod_obj, mid, NULL); Note that you need class name to call a static function but an object to call a method. (cls2 -> person) mid =...

Does realloc() invalidate all pointers?

c,pointers,dynamic-memory-allocation,behavior,realloc

Yes, ptr2 is unaffected by realloc(), it has no connection to realloc() call whatsoever(as per the current code). However, FWIW, as per the man page of realloc(), (emphasis mine) The realloc() function returns a pointer to the newly allocated memory, which is suitably aligned for any kind of variable and...

Unexpected result when calculating a percentage - even when factoring in integer division rules

c,percentage,integer-overflow,integer-division

Diagnosis The value you expect is, presumably, 91. The problem appears to be that your compiler is using 16-bit int values. You should identify the platform on which you're working and include information about unusual situations such as 16-bit int types. It is reasonable for us to assume 32-bit or...

How can I convert an int to a string in C++11 without using to_string or stoi?

c++,string,c++11,gcc

Its not the fastest method but you can do this: #include <string> #include <sstream> #include <iostream> template<typename ValueType> std::string stringulate(ValueType v) { std::ostringstream oss; oss << v; return oss.str(); } int main() { std::cout << ("string value: " + stringulate(5.98)) << '\n'; } ...

scanf get multiple values at once

c,char,segmentation-fault,user-input,scanf

I'm not saying that it cannot be done using scanf(), but IMHO, that's not the best way to do it. Instead, use fgets() to read the whole like, use strtok() to tokenize the input and then, based on the first token value, iterate over the input string as required. A...

Text justification C language

c,text,alignment

From printf's manual: The field width An optional decimal digit string (with nonzero first digit) specifying a minimum field width. If the converted value has fewer characters than the field width, it will be padded with spaces on the left (or right, if the left-adjustment flag has been given). Instead...

__cplusplus < 201402L return true in gcc even when I specified -std=c++14

c++,gcc,g++,c++14,predefined-macro

According to the GCC CPP manual (version 4.9.2 and 5.1.0): __cplusplus This macro is defined when the C++ compiler is in use. You can use __cplusplus to test whether a header is compiled by a C compiler or a C++ compiler. This macro is similar to __STDC_VERSION__, in that it...

Is there Predefined-Macros define about byte order in armcc

c,armcc,predefined-macro

Well according to this page: http://www.keil.com/support/man/docs/armccref/armccref_BABJFEFG.htm You have __BIG_ENDIAN which is defined when compiling for a big endian target....

How does ((a++,b)) work? [duplicate]

c,function,recursion,comma

In your first code, Case 1: return reverse(i++); will cause stack overflow as the value of unchanged i will be used as the function argument (as the effect of post increment will be sequenced after the function call), and then i will be increased. So, it is basically calling the...

Multiple definition and file management

c,arrays,compilation,compiler-errors,include

include is a preprocessor directive that includes the contents of the file named at compile time. The code that conditionally includes stuff is executed at run time...not compile time. So both files are being compiled in. ( You're also including each file twice, once in the main function and once...

getchar() not working in c

c,while-loop,char,scanf,getchar

That's because scanf() left the trailing newline in input. I suggest replacing this: ch = getchar(); With: scanf(" %c", &ch); Note the leading space in the format string. It is needed to force scanf() to ignore every whitespace character until a non-whitespace is read. This is generally more robust than...

Reverse ^ operator for decryption

c,algorithm,security,math,encryption

This is not a power operator. It is the XOR operator. The thing that you notice for the XOR operator is that x ^ k ^ k == x. That means that your encryption function is already the decryption function when called with the same key and the ciphertext instead...

CGO converting Xlib XEvent struct to byte array?

c,go,xlib,cgo

As mentioned in the cgo documentation: As Go doesn't have support for C's union type in the general case, C's union types are represented as a Go byte array with the same length. Another SO question: Golang CGo: converting union field to Go type or a go-nuts mailing list post...

Program to reverse a string in C without declaring a char[]

c,string,pointers,char

Important: scanf(" %s", name); has no bounds checking on the input. If someone enters more than 255 characters into your program, it may give undefined behaviour. Now, you have the char array you have the count (number of char in the array), why do you need to bother doing stuffs...