
comparing a xmmX vector

assembly,sse,cmp,xmm,eflags

You can use cmpeqps, you just have to extract the four flags. For example (not tested) cmpeqps xmm2, xmm1 movmskps eax, xmm2 cmp eax, 15 je somewhere ...
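For reference, a minimal intrinsics sketch of the same idea (names are illustrative, not from the answer): compare, collect the four sign bits with movmskps, and check that all of them are set.

```cpp
#include <xmmintrin.h>  // SSE

// Returns true if all four float lanes of a and b compare equal.
bool all_lanes_equal(__m128 a, __m128 b) {
    __m128 eq = _mm_cmpeq_ps(a, b);       // 0xFFFFFFFF in each lane that is equal
    return _mm_movemask_ps(eq) == 0xF;    // all four sign bits set
}
```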

Is it possible to get multiple sines in AVX/SSE?

windows,x86-64,sse,avx

Yes, there is a vector version using SSE/AVX! But the catch is that the Intel C++ compiler must be used. These come from the Intel Short Vector Math Library (SVML) intrinsics: for 128-bit SSE use _mm_sin_pd (double precision); for 256-bit AVX use _mm256_sin_pd (double precision). The two intrinsics are actually very...
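A minimal sketch of how the AVX variant might be called, assuming a compiler that ships SVML (these are library-provided intrinsics, not single hardware instructions):

```cpp
#include <immintrin.h>

// Compute four sines at once with the SVML intrinsic _mm256_sin_pd.
void sin4(const double in[4], double out[4]) {
    __m256d v = _mm256_loadu_pd(in);
    __m256d s = _mm256_sin_pd(v);    // SVML: four double-precision sines
    _mm256_storeu_pd(out, s);
}
```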

mingw-w64 is incapable of 32 byte stack alignment, easy work around or switch compilers?

c++,windows,gcc,alignment,sse

You can solve this problem by switching to Microsoft's 64-bit C/C++ compiler. The problem is not intrinsic to 64-bit Windows. Despite what Kai Tietz said in the bug report you linked, Microsoft's x64 ABI does allow a compiler to give variables a greater than 16-byte alignment on the stack. Also...

SSE intrinsic over int16[8] to extract the sign of each element

c,x86,sse,simd,sign

You can use min/max operations to get the desired result, e.g. inline __m128i _mm_sgn_epi16(__m128i v) { v = _mm_min_epi16(v, _mm_set1_epi16(1)); v = _mm_max_epi16(v, _mm_set1_epi16(-1)); return v; } This is probably a little more efficient than explicitly comparing with zero + shifting + combining results. Note that there is already an...

Program crashes when using intrinsics

c++,visual-studio,visual-studio-2013,sse,intrinsics

The problem is that your "source" (the array x) is not aligned to the boundary that the SSE instructions require. You can fix this by using the "unaligned" load instruction, or you can fix it by using __declspec(align(n)), e.g.: float __declspec(align(16)) x[N]; float __declspec(align(16)) y[N]; Now your x and...
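A small sketch of the two fixes, assuming MSVC and an illustrative array x (the align-the-data route and the unaligned-load route):

```cpp
#include <xmmintrin.h>

const int N = 1024;
__declspec(align(16)) float x[N];   // fix 1: give the array 16-byte alignment

float first_four_sum() {
    __m128 a = _mm_load_ps(&x[0]);       // aligned load is now safe
    // __m128 b = _mm_loadu_ps(&x[0]);   // fix 2: unaligned load works for any address
    float tmp[4];
    _mm_storeu_ps(tmp, a);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```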

Auto vectorization not working

c++,optimization,vectorization,sse,simd

The error 1305 happens because the optimizer did not vectorize the loop, since the value sum is not used. Simply adding printf("%d\n", sum) fixes that. But then you get a new error code, 1105: "Loop includes a non-recognized reduction operation". To fix this you need to set /fp:fast...

Matrix operations using code vectorization

c,matrix,x86,sse,simd

I'm not sure how to do an in-place transpose for arbitrary matrices efficiently using SIMD, but I do know how to do it out-of-place. Let me describe how to do both. For an in-place transpose you should see Agner Fog's Optimizing software in C++ manual. See section...
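For the 4x4 building block of an out-of-place transpose, a sketch (not from the answer) using the standard _MM_TRANSPOSE4_PS macro; larger matrices are then handled block by block:

```cpp
#include <xmmintrin.h>

// Transpose a 4x4 block of floats from in (row-major) to out.
void transpose4x4(const float *in, float *out) {
    __m128 r0 = _mm_loadu_ps(in + 0);
    __m128 r1 = _mm_loadu_ps(in + 4);
    __m128 r2 = _mm_loadu_ps(in + 8);
    __m128 r3 = _mm_loadu_ps(in + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   // rows become columns in-register
    _mm_storeu_ps(out + 0,  r0);
    _mm_storeu_ps(out + 4,  r1);
    _mm_storeu_ps(out + 8,  r2);
    _mm_storeu_ps(out + 12, r3);
}
```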

Is possible to address the output SIMD register by using an input register

c++,c,sse,simd

It seems that _mm_shuffle_epi8 is indeed the key to a solution. The idea is to set individual bits according to the values of the input vector a. These bits are distributed over (horizontal OR) the bytes of the 128 bits wide register. #include <stdio.h> #include <immintrin.h> /* gcc -O3 -Wall...

C/C++: -msse and -msse2 Flags do not have any effect on the binaries?

c++,gcc,sse,sse2

Your code involves only very basic floating-point math. And I bet that if you turn optimizations on (even -O1) it gets optimized out, because those values are constant expressions and so are calculable at compile time. SSE instructions (movss, mulss) are used because SSE is the default for scalar floating-point calculations here; if we...

Integer dot product using SSE/AVX?

c++,vectorization,sse,simd,avx

Every time someone does this: temp_1 = _mm_set_epi32(x[j], x[j+1], x[j+2], x[j+3]); .. a puppy dies. Use one of these: temp_1 = _mm_load_si128(x); // if aligned temp_1 = _mm_loadu_si128(x); // if not aligned Cast x as necessary. There is no integer version of _mm_dp_ps. But you can do what you were...
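A sketch of the whole pattern with proper loads and a multiply-accumulate, assuming SSE4.1 (for _mm_mullo_epi32), n a multiple of 4, and illustrative function/array names:

```cpp
#include <smmintrin.h>  // SSE4.1

int dot_epi32(const int *x, const int *y, int n) {
    __m128i acc = _mm_setzero_si128();
    for (int j = 0; j < n; j += 4) {
        __m128i vx = _mm_loadu_si128((const __m128i*)(x + j));
        __m128i vy = _mm_loadu_si128((const __m128i*)(y + j));
        acc = _mm_add_epi32(acc, _mm_mullo_epi32(vx, vy));  // per-lane multiply, accumulate
    }
    // horizontal sum of the four 32-bit lanes
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return _mm_cvtsi128_si32(acc);
}
```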

0xFFFF flags in SSE

c,vectorization,sse

You only set the first element of array ones to 1 (the rest of the array is initialised to 0). I suggest you get rid of the array ones altogether and then change this line: vOnes = _mm_load_si128((__m128i *)&(ones)[0] ); to: vOnes = _mm_set1_epi16(1); Probably a better solution though, if...

sse segfault on _mm_load_si128

c,sse

Yes, you are right: your data is unaligned and the intrinsic you are using is for aligned data access. Use _mm_loadu_si128 instead of _mm_load_si128, or align the array to 16 bytes using the align attribute. P.S.: You should be careful while using aligned loads/stores. A union won't align the data. You need to...

SSE: conditionally replace pixel

c,gcc,gnu,sse

You need to compute horizontal OR. There is no horizontal OR instruction in SSE, but such an operation can be simulated with 2x UNPACK + vertical OR. const __m128 pixel = _mm_load_ps(in); /* (p3, p2, p1, p0 ) */ __m128 isoe = _mm_cmpge_ps(pixel, upper); /* (p3|p1, p2|p0, p3|p1, p2|p0) */ isoe...

Fast inverse norm function

math,optimization,assembly,sse,micro-optimization

If you can arrange your vectors in AoSoA form you can do this very efficiently with SSE (xxyyzzxxyyzz...) or AVX (xxxxyyyyzzzz...). In the code below I assumed SSE2, which has vec_size=2, but it's easy to change this to AVX. However, your code is likely memory bound and not compute bound...
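For flavour, a simplified sketch with plain SoA float arrays and the approximate RSQRTPS path (the answer's code uses AoSoA doubles; the arrays x, y, z, out and the ~12-bit rsqrt precision are assumptions here):

```cpp
#include <xmmintrin.h>

// out[i] = 1 / sqrt(x[i]^2 + y[i]^2 + z[i]^2), four at a time; n must be a multiple of 4.
void inv_norm(const float *x, const float *y, const float *z, float *out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 vz = _mm_loadu_ps(z + i);
        __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(vx, vx), _mm_mul_ps(vy, vy)),
                                 _mm_mul_ps(vz, vz));
        _mm_storeu_ps(out + i, _mm_rsqrt_ps(len2));  // approximate reciprocal square root
    }
}
```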

Using SSE to round in Delphi

delphi,delphi-7,sse,rounding,inline-assembly

On modern Delphi, to set MXCSR you can call SetMXCSR from the System unit. To read the current value use GetMXCSR. Do beware that SetMXCSR, just like Set8087CW is not thread-safe. Despite my efforts to persuade Embarcadero to change this, it seems that this particular design flaw will remain with...

Square root of a OpenCV's grey image using SSE

c++,opencv,sse,simd

The SSE code looks OK, except that you're not processing the last 16 pixels: for (x = 0; x < (pixels - 16); x += 16) should be: for (x = 0; x <= (pixels - 16); x += 16) Note that if your image width is not a multiple...

SIMD latency throughput

c++,sse,simd

The "latency" for an instruction is how many clock cycles it takes the perform one instruction (how long does it take for the instruction to complete. Normally throughput is the number of instructions per clock cycle, but here throughput is the number the number of clock cycles per independent instruction...

OpenMP tasking - way of preventing a specific thread from executing tasks?

parallel-processing,task,openmp,sse

Isn't that the same question as in your other post (Task scheduling points of OpenMP tasks)? The answer to this one here is: no, you cannot control how the OpenMP implementation assigns the tasks to the OpenMP threads of the parallel team. May I suggest that you post a piece...

Using SSE to mimic the standard Math.pow function

c,assembly,x86,sse,simd

MOVSS moves single-precision floats (32-bit). I assume that n is an integer, so you can't load it into an XMM register with MOVSS; use CVTSI2SS instead. printf cannot process single-precision floats, which would be converted to doubles by the compiler, so it's convenient to use CVTSS2SD at this point. So...

Are older SIMD-versions available when using newer ones?

c++,c,sse,simd,avx

In general, these have been additive but keep in mind that there are differences between Intel and AMD support for these over the years. If you have AVX, then you can assume SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE 4.2 as well. Remember that to use AVX you also need...

32-bit Hamming String formation from 32 8-bit comparisons

c++,c,sse,simd,avx

That's almost exactly what _mm256_movemask_epi8 is for, except it takes the top bits of the bytes instead of the least significant bits. So just shift left by 7 first. Or, change how you produce those bytes, because you probably made them as 0x00 or 0xFF for false and true respectively,...
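A sketch of the shift-then-movemask route, assuming the bytes really do hold 0 or 1 (there is no 8-bit shift, but a 16-bit shift by 7 is fine here because movemask only reads the top bit of each byte):

```cpp
#include <immintrin.h>  // AVX2
#include <cstdint>

// Pack the least significant bit of each of the 32 bytes into a 32-bit mask.
uint32_t pack_lsb_bits(__m256i bytes01) {
    __m256i msb = _mm256_slli_epi16(bytes01, 7);    // move each byte's bit 0 to bit 7
    return (uint32_t)_mm256_movemask_epi8(msb);     // collect the 32 sign bits
}
```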

Am I breaking strict aliasing rules?

c++,c++11,sse,strict-aliasing

There is only one intrinsic that "extracts" the lower order double value from xmm register: double _mm_cvtsd_f64 (__m128d a) You could use it this way: return _mm_cvtsd_f64(x); There is some contradiction between different references. MSDN says: This intrinsic does not map to any specific machine instruction. While Intel intrinsic guide...

Memory alignment for SSE in C++, _aligned_malloc equivalent?

c++,g++,malloc,sse,memory-alignment

with C++11, you may use something like: struct aligned_float { alignas(16) float f[4]; }; static_assert(sizeof(aligned_float) == 4 * sizeof(float), "padding issue"); int main() { const int length = 64000; std::vector<aligned_float> pResult(length / sizeof(aligned_float)); return 0; } ...

sse precision error with Matrix multiplication

c,sse,precision,matrix-multiplication,rounding-error

I am summarizing the discussion in order to close this question as answered. So, according to the article in the link (What Every Computer Scientist Should Know About Floating-Point Arithmetic), floating point always results in a rounding error, which is a direct consequence of the approximate representation of floating...

How to detect SSE/AVX/AVX2 availability at compile-time ?

gcc,clang,sse,avx,avx2

Most compilers will automatically define: __SSE__ __SSE2__ __SSE3__ __AVX__ __AVX2__ etc, according to whatever command line switches you are passing. You can easily check this with gcc (or gcc-compatible compilers such as clang), like this: $ gcc -msse3 -dM -E - < /dev/null | egrep "SSE|AVX" | sort #define __SSE__...

Why do MSVC optimizations break SSE code when function arguments are const refs to temporaries or temporaries copied by value?

c++,c++11,visual-c++,sse,msvc12

You can add (non-virtual) member functions to a struct without really affecting the layout. So add a destructor that prints "I'm here %p" when the structure is destroyed, and print "I'm there" in your function. (Include the this address so you can make sense of other temporary copies being used.) Then...

GCC emits vastly different code using “-march=native” on similar architectures

c,gcc,assembly,sse,avx

AVX is disabled because the entire AMD Bulldozer family does not handle 256-bit AVX instructions efficiently. Internally, the execution units are only 128-bit wide. So 256-bit operations are split up thereby providing no benefit over 128-bit. To add insult to injury, on Piledriver, there's a bug in the 256-bit store...

x64 floating point blends

assembly,x86,sse

As your quote says, the relevant bits are [3:0], that is, the low 4 bits. Each of those bits controls the operation for the corresponding word. Since you have 4 words (floats) in an SSE register, you have 4 control bits. The top 4 bits are ignored. Note that the operation section...

SSE2 Saturated Arithmetic

c,sse,simd,intrinsics,sse2

To clip double precision values to a range of -1.0 to +1.0 you can use max/min operations. E.g. if you have a buffer, buff, of N double values: const __m128d kMax = _mm_set1_pd(1.0); const __m128d kMin = _mm_set1_pd(-1.0); for (int i = 0; i < N; i += 2) {...
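A possible completion of that loop, under the stated assumptions (N even, buff 16-byte aligned; otherwise use the unaligned load/store forms):

```cpp
#include <emmintrin.h>  // SSE2

// Clamp N doubles in buff to the range [-1.0, +1.0].
void clip_buffer(double *buff, int N) {
    const __m128d kMax = _mm_set1_pd(1.0);
    const __m128d kMin = _mm_set1_pd(-1.0);
    for (int i = 0; i < N; i += 2) {
        __m128d v = _mm_load_pd(&buff[i]);
        v = _mm_min_pd(v, kMax);   // clamp the upper bound
        v = _mm_max_pd(v, kMin);   // clamp the lower bound
        _mm_store_pd(&buff[i], v);
    }
}
```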

Storing a constant in SSE register (GCC, C++)

c++,c,assembly,sse,inline-assembly

One note: you need to be more specific about how you compile things, and probably provide a minimal example. I know this might not be the best answer because of that, but I think it's good enough. It got long, but that's because of the code. The bottom line of the work below is...

How to calculate mod/remainder using SSE?

assembly,sse,division

If M is either a compile-time constant or is constant within a loop, then instead of using division you can calculate a reciprocal and then do a multiplication and a shift. We can write x/M = (x*(2^n/M))>>n. The factor 2^n/M (aka the magic number) should be calculated before the loop or...
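As a concrete (if narrow) illustration of the idea, a sketch for unsigned 16-bit lanes with M = 10, using the well-known magic pair q = (x * 0xCCCD) >> 19; PMULHUW already returns the high 16 bits of the product, so only a further shift by 3 is needed:

```cpp
#include <emmintrin.h>  // SSE2

// x % 10 for each unsigned 16-bit lane of x.
__m128i rem10_epu16(__m128i x) {
    const __m128i magic = _mm_set1_epi16((short)0xCCCD);
    const __m128i ten   = _mm_set1_epi16(10);
    __m128i q = _mm_srli_epi16(_mm_mulhi_epu16(x, magic), 3);  // q = x / 10
    return _mm_sub_epi16(x, _mm_mullo_epi16(q, ten));          // r = x - q*10
}
```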

Demonstrator code failing to show 4 times faster speed

c,gcc,x86,sse,simd

That must be the instruction latency (a RAW dependency). While the ALU instructions have little to no latency, i.e. the results can be the operands for the next instruction without any delay, SIMD instructions tend to have long latencies before the results are available, even for such simple ones as add....

AVX2 slower than SSE on Haswell

c++,x86,sse,simd,avx2

I converted your code to more vanilla C++ (plain arrays, no vectors, etc), cleaned it up and tested it with auto-vectorization disabled and got reasonable results: #include <iostream> using namespace std; #include <sys/time.h> #include <cstdlib> #include <cstdint> #include <immintrin.h> inline double timestamp() { struct timeval tp; gettimeofday(&tp, NULL); return double(tp.tv_sec)...

Pass v4sf by value or reference

gcc,optimization,sse

Typically SIMD functions which take vector parameters are relatively small and performance-critical, which usually means they should be inlined. Once inlined it doesn't really matter whether you pass by value, pointer or reference, as the compiler will optimise away unnecessary copies or dereferences. One further point: if you think you...

SSE intrinsics: Convert 32-bit floats to UNSIGNED 8-bit integers

x86,sse,mmx

There is no direct conversion from float to byte, _mm_cvtps_pi8 is a composite. _mm_cvtps_pi16 is also a composite, and in this case it's just doing some pointless stuff that you undo with the shuffle. They also return annoying __m64's. Anyway, we can convert to dwords (signed, but that doesn't matter),...
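A sketch of the dword route the answer describes (convert, then narrow with the saturating pack instructions; all SSE2):

```cpp
#include <emmintrin.h>  // SSE2

// Convert 8 floats to 8 unsigned bytes (returned in the low half of the result).
__m128i floats_to_u8(__m128 f0, __m128 f1) {
    __m128i d0 = _mm_cvtps_epi32(f0);       // 4 floats -> 4 int32 (rounded)
    __m128i d1 = _mm_cvtps_epi32(f1);
    __m128i w  = _mm_packs_epi32(d0, d1);   // -> 8 int16 with signed saturation
    return _mm_packus_epi16(w, w);          // -> uint8 with unsigned saturation
}
```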

Initializing an __m128 type from a 64-bit unsigned int

c++,sse,intrinsics

To answer your question about how to load a 64-bit value into the lower 64 bits of an XMM register while zeroing the upper 64 bits: _mm_loadl_epi64(&x) will do exactly what you want. In regards to _mm_set_epi64, I said once that looking at the source code of Agner Fog's Vector Class Library...
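For completeness, a tiny usage sketch (the cast is needed because the intrinsic formally takes a __m128i pointer):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Load a 64-bit value into the low half of an XMM register; the upper half is zeroed.
__m128i load_low64(const uint64_t &x) {
    return _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&x));  // MOVQ xmm, m64
}
```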

SIMD signed with unsigned multiplication for 64-bit * 64-bit to 128-bit

c,x86,integer,bit-manipulation,sse

The right way to think about the throughput limits of integer multiplication using various instructions is in terms of how many "product bits" you can compute per cycle. mulx produces one 64x64 -> 128 result every cycle; that's 64x64 = 4096 "product bits per cycle" If you piece together a...

How to check inf for AVX intrinsic __m256

c++,c,sse,intrinsics,avx

If you want to check if a vector has any infinities: #include <limits> bool has_infinity(__m256 x){ const __m256 SIGN_MASK = _mm256_set1_ps(-0.0); const __m256 INF = _mm256_set1_ps(std::numeric_limits<float>::infinity()); x = _mm256_andnot_ps(SIGN_MASK, x); x = _mm256_cmp_ps(x, INF, _CMP_EQ_OQ); return _mm256_movemask_ps(x) != 0; } If you want a vector mask of the values that...

Is this a data alignment crash? (potentially involving stack misalignment, XNAMath, Visual Studio 2103)

c++,visual-studio-2013,sse,memory-alignment,xna-math-library

See this answer for more details. Keep in mind that VS 2013 for x86 defaults to using /arch:SSE2 so even with _XM_NO_INTRINSICS_ defined, the compiler is going to use SSE/SSE2. For that reason, you should probably stop using _XM_NO_INTRINSICS_ and just get your code to use DirectXMath or XNAMath correctly....

Effective way to extract from SSE vector on AMD processors

sse,simd,amd-processor

On x86-64 you can use _mm_cvtsi128_si64, which translates to a single MOVQ r64, xmm instruction

Why does MSVC use SSE2 instruction for such trivial thing?

visual-c++,assembly,x86,sse,fpu

The Intel Optimization Reference Manual section 3.8.1 (Guidelines for Optimizing Floating-point Code) says - Enable the compiler’s use of SSE, SSE2 and more advanced SIMD instruction sets (e.g. AVX) with appropriate switches. Favor scalar SIMD code generation to replace x87 code generation. Section 3.8.5 goes on to explain: Use Streaming...

Semantics of mov widths in x64 and SSE

assembly,64bit,sse,freepascal

With MASM(32 bit, but however) these two lines are rejected as an error. movdqu qword ptr [ecx], xmm0 movdqu [ecx], xmm0 ; standard prefix DWORD ... test.asm(121) : error A2022:instruction operands must be the same size These two are accepted: movdqu oword ptr [ecx], xmm0 ; explicit 128bit movdqu xmmword...

SSE intrinsics to copy bytes within a register

c++,c,sse,simd,intrinsics

You can do this with _mm_shuffle_ps (SHUFPS): #include "xmmintrin.h" // SSE xmm2 = _mm_shuffle_ps(xmm1, xmm1, _MM_SHUFFLE(0, 0, 0, 0)); Note: depending on how you've ordered the elements in your example above it might instead need to be: xmm2 = _mm_shuffle_ps(xmm1, xmm1, _MM_SHUFFLE(3, 3, 3, 3)); ...

SSE intrinsics: masking a float and using bitwise and?

c++,sse,intrinsics

You need to use a bitwise (integer) mask when you AND, so to e.g. clear alternate values in a vector you might do something like this: __m128 v1 = _mm_set_ps(1.0f, 127.0f, 99.0f, 1.0f); __m128 v2 = _mm_castsi128_ps(_mm_set_epi32(0, -1, 0, -1)); __m128 v = _mm_and_ps(v1, v2); // => v = {...

is it possible/efficient to put fpu exception or inf into work?

c++,optimization,x86,sse,fpu

You can, but it's probably not that useful. Masking won't be useful either under the circumstances. Exceptions are extremely slow when they happen, first a lot of microcoded complex stuff has to happen before the CPU even enters the kernel level exception handler, and then it has to hand it...

Equal zero instruction in SSE [duplicate]

c++,sse,avx

If it is SSE 4.1, you can use _mm_testz_si128, e.g. _mm_testz_si128(idata, idata), which returns 1 only when idata is all zeros. Probably look also into Check XMM register for all zeroes for an SSE2-compatible solution....

Modifying a function to use SSE intrinsics

c++,c++11,floating-point,sse,simd

SSE intrinsics can be pretty tedious sometimes... But not here. You just screwed up your loop : for( long long i = iMultipleOf4; i > 0LL; i -= 4LL ) I doubt it's doing what you expected. If iMultipleOf4 is 4, then your function will compute with 4,3,2,1 but not...

SIMD minmag and maxmag

assembly,floating-point,x86,sse,avx

Here's an alternate implementation which uses fewer instructions: static inline void maxminmag_test(__m128d & a, __m128d & b) { __m128d cmp = _mm_add_pd(a, b); // test for mean(a, b) >= 0 __m128d amin = _mm_min_pd(a, b); __m128d amax = _mm_max_pd(a, b); __m128d minmag = _mm_blendv_pd(amin, amax, cmp); __m128d maxmag = _mm_blendv_pd(amax,...

Packed masking in SSE

c,assembly,x86,nasm,sse

For this C-code with gcc-vector extensions, typedef float v4sf __attribute__((vector_size(16))); v4sf foo(float const *restrict a, float const *restrict b) { float const *restrict aa = __builtin_assume_aligned(a, 16); float const *restrict ba = __builtin_assume_aligned(b, 16); v4sf av = *(v4sf*)aa; v4sf bv = *(v4sf*)ba; v4sf sv = av+bv; float temp = sv[0]+sv[1]+sv[2];...

Determine what intrinsic flag is activated

c++,gcc,sse,intrinsics

GCC, ICC (on Linux), and Clang have the following compile options with corresponding defines:

  option     define
  -mfma      __FMA__
  -mavx2     __AVX2__
  -mavx      __AVX__
  -msse4.2   __SSE4_2__
  -msse4.1   __SSE4_1__
  -mssse3    __SSSE3__
  -msse3     __SSE3__
  -msse2     __SSE2__
  -m64       __SSE2__
  -msse      __SSE__

Options and defines in GCC and Clang but not in ICC: -msse4a __SSE4A__ -mfma4...

AVX2 Winner-Take-All Disparity Search

c++,sse,avx,disparity-mapping,avx2

I still haven't found the problem, but I did see some things you might want to change. You're not checking the return value of _mm_malloc, though. If it's failing, that would explain it. (Maybe it doesn't like allocating 32-byte aligned memory?) If you're running your code under a memory checker...

assembly function with C segfault

c,assembly,x86,sse,fpu

You have forgotten to cleanup the stack. In the prologue you have: pushl %eax pushl %ecx pushl %edx pushl %ebp movl %esp, %ebp You obviously need to undo that before you ret, such as: movl %ebp, %esp popl %ebp popl %edx popl %ecx popl %eax ret PS: I have already...

What is the difference between non-packed and packed instruction in the context of SIMD-operations?

sse,simd

To my understanding, packed means that conceptually more than one value is transferred or used as an operand, whereas non-packed (scalar) means that only one value is processed, i.e. no parallel processing takes place.
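A small illustration of the difference using the scalar and packed add intrinsics (values chosen arbitrarily):

```cpp
#include <xmmintrin.h>

// _mm_add_ss (scalar ADDSS) updates only lane 0; _mm_add_ps (packed ADDPS) updates all four.
void scalar_vs_packed() {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);    // lanes (low..high): {1, 2, 3, 4}
    __m128 b = _mm_set_ps(40.f, 30.f, 20.f, 10.f);    // lanes (low..high): {10, 20, 30, 40}
    __m128 s = _mm_add_ss(a, b);   // {11, 2, 3, 4}   : only the low lane is added
    __m128 p = _mm_add_ps(a, b);   // {11, 22, 33, 44}: all lanes added in parallel
    (void)s; (void)p;
}
```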

Clang: Proper way to enable SSE4 on a per-function / per-block of code basis?

xcode,clang,llvm,sse

There is currently no way to target different ISA extensions at block / function granularity in clang. You can only do it at file granularity (put your SSE4.1 code into a separate file and specify that file to use -msse4.1). If this is an important feature for you, please file...

Counting the number of leading zeros in a 128-bit integer

c++,gcc,bit-manipulation,sse

inline int clz_u128 (uint128_t u) { uint64_t hi = u>>64; uint64_t lo = u; int retval[3]={ __builtin_clzll(hi), __builtin_clzll(lo)+64, 128 }; int idx = !hi + ((!lo)&(!hi)); return retval[idx]; } This is a branch-free variant. Note that more work is done than in the branchy solution, and in practice the...

Converting from __m128 to __m128i results in wrong value

c++,type-conversion,clang,sse,intrinsics

Debugger, in this example, interprets the __m128i value as two 64-bit integers, as opposed to four 32-bit ones expected by you. The actual conversion is correct. In your code you need to explicitly specify how to interpret the SIMD value, for example: test_i.m128i_i32[0]...

Load two 64-bit integers into lower & upper xmm, respectively

assembly,sse,cpu-registers

Being limited to SSSE3 means no pinsrq, but you can do this: movq xmm1, r8 pslldq xmm1, 8 movq xmm0, rdx por xmm0, xmm1 There are many other ways, but I can't think of anything faster right now. Maybe this, if it doesn't have bypass delays: movq xmm1, r8 movq...
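The same idea written with intrinsics, in case that is easier to drop into C code (SSE2 only, x86-64; the function and parameter names are illustrative):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Place lo in the low qword and hi in the high qword of one XMM register.
__m128i combine64(uint64_t lo, uint64_t hi) {
    return _mm_unpacklo_epi64(_mm_cvtsi64_si128((long long)lo),
                              _mm_cvtsi64_si128((long long)hi));
}
```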

Extracting ints and shorts from a struct using AVX?

c++,x86,sse,simd,avx

You can extract 16-bit elements from an __m128i using _mm_extract_epi16 (requires SSE2): int16_t x = _mm_extract_epi16(v, 4); // extract element 4 For 32-bit elements use _mm_extract_epi32 (requires SSE4.1): int32_t x = _mm_extract_epi32(v, 0); // extract element 0 See: Intel Intrinsics Guide Assuming your struct is declared...

How to compile one specific class with SSE

c++,gcc,sse

So I tried out GCC 4.9 as Marc Glisse mentioned it, and I get it to work! The working code now looks like this: #include "MyClassWithSSE42.h" __attribute__((target("sse4.2"))) uint32_t MyClassWithSSE42::CRC32byte(const uint32_t *p, const uint32_t startValue) { uint32_t c = _mm_crc32_u32(startValue, p[0]); c = _mm_crc32_u32(c, p[1]); c = _mm_crc32_u32(c, p[2]); c =...

Using intrinsics to find next non-zero in an array

c++,performance,vectorization,sse,avx

It's fairly simple to do this, but throughput improvement may not be great, since you will probably be limited by memory bandwidth (unless your array is already cached): int index = -1; for (i = 0; i < n; i += 4) { __m128i v = _mm_load_si128(&A[i]); __m128i vcmp =...
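A possible completion of that loop (assuming A is 16-byte aligned, n is a multiple of 4, and a scalar scan of the matching group is acceptable):

```cpp
#include <emmintrin.h>  // SSE2

// Return the index of the first non-zero int in A, or -1 if all n elements are zero.
int find_next_nonzero(const int *A, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i v    = _mm_load_si128((const __m128i*)&A[i]);
        __m128i vcmp = _mm_cmpeq_epi32(v, _mm_setzero_si128());
        if (_mm_movemask_epi8(vcmp) != 0xFFFF) {     // 0xFFFF means all four were zero
            for (int j = 0; j < 4; ++j)
                if (A[i + j] != 0) return i + j;
        }
    }
    return -1;
}
```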

VC++ SSE code generation - is this a compiler bug?

visual-c++,assembly,x86,sse,visual-studio-debugging

Since no one else has stepped up, I'm going to take a shot. 1) If the address is relative to a stack frame alignment cannot be forced. Is this really a compiler bug? I'm not sure it is true that you cannot force alignment for stack variables. Consider this code:...

SSE2: Multiplying signed integers from a 2d array with doubles and summing the results in C

c,x86,sse,simd,sse2

Firstly, as per the comments above, I'm going to assume that it's OK to transpose LATTICEVELOCITIES: static const int32_t LATTICEVELOCITIES[3][PARAMQ] = { { 0, -1, 0, 1, 0, -1, 0, 1, -1, 0, 1, -1, 0, 1, 0, -1, 0, 1, 0 }, { -1, 0, 0, 0, 1, -1,...

How to get instruction sets info in Android code?

android,cpu,sse,neon,instruction-set

You can fetch that info from /proc/cpuinfo. Use this code: BufferedReader br = new BufferedReader(new FileReader(new File("/proc/cpuinfo"))); String line; StringBuilder cpuInfo = new StringBuilder(); while ((line = br.readLine()) != null) { cpuInfo.append(line + "\n"); } Log.i(TAG, "CPU info: " + cpuInfo.toString()); The output of this command will contain all those...

SSE division by integer

assembly,floating-point,x86-64,sse

You should take some time to study the instruction set reference, so you at least get a rough idea of what kind of possibilities you have. Also, you should read the appropriate ABI docs for the calling convention. That said, the answer to your first question is that float return values should...

Segmentation fault with __m128 in C

c,sse

In threshold, short int i, N=16; should be: short int i, N=8; This is because there are 8 x short int elements per vector, and pointer arithmetic takes the size of the elements into account (my guess is that you were assuming you needed to work with 16 bytes as...

Choosing SSE instruction execution domains in mixed contexts

assembly,vector,sse,sse-execution-domain

Prefer integer-domain instructions for things like xor. On Intel CPUs, only one execution port can handle FP-domain logicals (XORPS, etc.), but most of the execution units (On SnB to Haswell: p015, but not Haswell's port 6) can handle vector integer logical instructions (PAND/POR/PXOR). Sometimes it costs an extra 1 cycle...

Vector-matrix & matrix-matrix multiplication using SSE for any size of input matrix and vector

c,sse,multicore

It seems you are loading and storing exclusively with _mm_load_ps and _mm_store_ps, which load and store 4 floats in a single instruction. Since your containers (matrices and vectors) do not necessarily have a size which is a multiple of 4 floats (16 bytes), this is incorrect. memalign ensures that the...

Segmentation fault in openMP program with SSE instructions with threads > 4

c++,multithreading,segmentation-fault,openmp,sse

As already mentioned in my comment above, your problem is not related with the use of SSE instructions (at least not for the code you've posted). The reason is that if you use more than 4 threads, the loop for(i=0; i<(4/nThreads); i++) /* (4/nThreads) == 0 */ is never entered...

SSE 64 bit registers

c++,sse

The whole point of SSE is indeed to process a lot of numbers quickly. And the ability to process two numbers at a time helps a lot with that. For instance, you can indeed add a step {dx, dy} to a coordinate {x, y} in a single instruction (ADDPS). It...

Intrinsic code optimisation hints

c++,sse,intrinsics,avx

The first thing to try is auto-vectorization. To do this you need to enable auto-vectorization and AVX e.g. with GCC gcc -O3 -mavx. But if you really want to do this with intrinsics you could try something like this: __m256 min_value8 = _mm256_set1_ps(FLT_MAX); __m256 result_p8 = _mm256_setzero_ps(); __m256 one =...

Load 2 contiguous doubles into low-half of 2 sse registers

assembly,sse,intrinsics

I would simply use the _mm_set_pd or _mm_set1_pd intrinsics and see what your compiler generates - it should be reasonably efficient, and if not then the generated code may give you an idea of how to improve on it with more explicit intrinsics, e.g.: double d[2]; __m128d v0 = _mm_set_pd(d[0],...

AVX2 — multiply two __m256i integers

vectorization,sse,intrinsics,avx,avx2

You want the _mm256_mullo_epi32() intrinsic. From Intel's excellent online intrinsics guide: Synopsis __m256i _mm256_mullo_epi32 (__m256i a, __m256i b) #include "immintrin.h" Instruction: vpmulld ymm, ymm, ymm CPUID Flags: AVX2 Description Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store the low 32 bits of the...

How do I enable SSE for my freestanding bootable code?

x86,sse,instruction-set

If you're running an ancient or custom OS that doesn't support saving XMM regs on context switches, it won't have set the SSE-enabling bits in the machine control registers. In that case all instructions that touch xmm regs will fault. Took me a sec to find, but http://wiki.osdev.org/SSE explains how...

SSE intrinsics bit shifting to the right

c++,sse,bit-shift,intrinsics

When you use the _mm_set_epi* functions, they accept their parameters as the most significant item first. For example, the first statement, __m128i _16 = _mm_set_epi8( 128, 64, 32, 16, 8, 4, 2, 1, 128, 64, 32, 16, 8, 4, 2, 1); will load the variable with this value: 0x80402010080402018040201008040201 (128,64,32...

sorting component-wise multi value (SIMD) array

algorithm,sorting,time-complexity,sse,simd

It sounds as though a sorting network is the answer to the question that you asked, since the position of the comparators is not data dependent. Batcher's bitonic mergesort is O(n log² n).

How to rewrite this code to sse intrinsics

c++,c,x86,mingw,sse

My sse is a bit rusty, but what you should do is: xmm0: [k, k+1, k+2, k+3] //xc0, xc1,.... xmm1: [k, k+1, k+2, k+3] //yc0, yc1,.... //initialize before the loop xmm2: [512, 512, 512, 512] xmm3: [idx, idx, idx, idx] xmm4: [iddx, iddx, iddx, iddx] xmm5: [idy, idy, idy, idy]...

OpenCV FAST corner detection SSE implementation walkthrough

c,performance,opencv,optimization,sse

As harold said, delta is used to make an unsigned comparison. Let's describe this implementation step by step: __m128i x0 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[0])), delta); __m128i x1 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[4])), delta); __m128i x2 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[8])), delta); __m128i x3 = _mm_sub_epi8(_mm_loadu_si128((const __m128i*)(ptr + pixel[12])), delta); m0 =...

SSE/AVX floating point convert exceptions

floating-point,sse,avx,floating-point-exceptions

The answers to these questions can mostly be found in the Intel® 64 and IA-32 Architectures Software Developer’s Manual: CVTPD2DQ ... If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H)...

SSE Code runs 30% faster, yet when in use show over 20% CPU increase

c,sse

OK, I've found out what was going on. While the new routine is much faster with traditional memory, when it comes to working on frames generated by the hardware decode method it is actually slower. This Intel white paper explains things: https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers All my benchmarks and tests were done using traditionally...

Tell C++ that pointer data is 16 byte aligned

c++,gcc,sse,memory-alignment

The compiler isn't vectorizing the loop because it can't determine that the dynamically allocated pointers don't alias each other. A simple way to allow your sample code to be vectorized is to pass the --param vect-max-version-for-alias-checks=1000 option. This will allow the compiler to emit all the checks necessary to see...
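Another common GCC-specific way to make the same promise directly in the source (a sketch, not necessarily what this answer goes on to recommend) is __restrict__ plus __builtin_assume_aligned:

```cpp
#include <cstddef>

// Promise the compiler that a and b do not alias and are 16-byte aligned.
void add_arrays(float *__restrict__ a, const float *__restrict__ b, size_t n) {
    float *pa       = static_cast<float*>(__builtin_assume_aligned(a, 16));
    const float *pb = static_cast<const float*>(__builtin_assume_aligned(b, 16));
    for (size_t i = 0; i < n; ++i)
        pa[i] += pb[i];   // now eligible for aligned, check-free vectorization
}
```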

How can I add together two SSE registers

c++,c,intel,sse,avx2

To add two 128-bit numbers x and y to give z with SSE you can do it like this: z = _mm_add_epi64(x,y); c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x)); z = _mm_sub_epi64(z,c); This is based on this link: how-can-i-add-and-subtract-128-bit-integers-in-c-or-c. The function unsigned_lessthan is defined below. It's complicated without AMD XOP (actually I found...

xmm, cmp two 32-bit float

assembly,floating-point,sse

You want to use the JB and JA (jump below/above) instructions instead of JL/JG. The COMISS instruction sets the flags as if it were two unsigned integers being compared. This makes the effect on the flags simpler. The COMISS instruction's effect on flags is documented in the Intel 64 and...

Why is the generated assembly reordered when using intrinsics?

c,gcc,x86,sse,intrinsics

TL;DR: From a compiler's point of view, the input code is different and might go through different places and hit different tests on the way through, which would make the output be different. You won't see this in (a current) clang, since the intrinsics disappear when you get to IR...

Intel SSE Intrinsics _mm_load_si128 segmentation fault,

c,sse,simd,memory-alignment,intrinsics

For the second load you need to use _mm_loadu_si128 because the source data is misaligned. Explanation: an offset of +5 ints from a base address which is 16 byte aligned will no longer be 16 byte aligned.

GCC -msse2 does not generate SIMD code

c++,gcc,x86,sse,simd

-march=core2 means that gcc can assume (along with 64 bit ISA) up to SSSE3 (e.g., MMX, SSE, SSE2, SSE3) is available. -mfpmath=sse can then force the use of SSE for floating-point arithmetic (the default in 64-bit mode), rather than 387 (the default in 32-bit -m32 mode). See: "Intel 386 and...

practical BigNum AVX/SSE possible?

sse,simd,avx

I think it may be possible to implement BigNum with SIMD efficiently but not in the way you suggest. Instead of implementing a single BigNum using a SIMD register (or with an array of SIMD registers) you should process multiple BigNums at once. Let's consider 128-bit addition. Let 128-bit integers...

How floating point conversion was handled before the invention of FPU and SSE?

c,assembly,x86,sse,fpu

It depends on the processor, and there have been a huge number of different processors over the years. FPU stands for "floating-point unit". It's a more or less generic term that can refer to a floating-point hardware unit for any computer system. Some systems might have floating-point operations built into...

why does _mm_mulhrs_epi16() always do biased rounding to positive infinity?

rounding,sse

A most serious mistake. I asked the same question on the Intel developer forums and andysem corrected me, pointing out the behavior is to round to the nearest integer. I was mistaken into thinking it was biased because the formula from MSDN, https://msdn.microsoft.com/en-us/library/bb513995.aspx was (x * y + 16384) >>...