performance,optimization,assembly,simd,avx

OK came to the conclusion that it was no idea of trying to be 'smart', I benched: the built in intrinsic popcount: _mm_popcnt_u64 bmi2:__tzcnt_u64(~_pext_u64(data[i],data[i])); against three assembler functions popcnt, bmi2 and avx2. They all run at the speed you can move memory in and out of my: cat /proc/cpuinfo -Intel(R)...

On x86-64 you can use _mm_cvtsi128_si64, which translates to a single MOVQ r64, xmm instruction

I converted your code to more vanilla C++ (plain arrays, no vectors, etc), cleaned it up and tested it with auto-vectorization disabled and got reasonable results: #include <iostream> using namespace std; #include <sys/time.h> #include <cstdlib> #include <cstdint> #include <immintrin.h> inline double timestamp() { struct timeval tp; gettimeofday(&tp, NULL); return double(tp.tv_sec)...

By design, for IEEE754 data types, you can simply treat the value as an integer and increment the value. Or decrement it if the value is negative. function NextDoubleGreater(const D: Double): Double; var SpecialType: TFloatSpecial; I: Int64; begin SpecialType := D.SpecialType; case SpecialType of fsZero,fsNZero: // special handling needed around...

Assuming you want to convert a vector of 16 x 8 bit ints to four vectors of 4 x 32 bit ints, you can do this by first unpacking to 16 bits and then again to 32 bits: // load 8 bit vector uint8x16_t v = vld1q_u8(p); // load vector...

This is an address alignment qualifier: 9.4.2.5 NEON Alignment Specifiers Some NEON load/store instructions allow an optional address alignment qualifier. The ARM documentation specifies that this is indicated by `@ align'. However GAS already interprets the `@' character as a "line comment" start, so `: align' is used instead. For...

The "latency" for an instruction is how many clock cycles it takes the perform one instruction (how long does it take for the instruction to complete. Normally throughput is the number of instructions per clock cycle, but here throughput is the number the number of clock cycles per independent instruction...

The SSE code looks OK, except that you're not processing the last 16 pixels: for (x = 0; x < (pixels - 16); x += 16) should be: for (x = 0; x <= (pixels - 16); x += 16) Note that if your image width is not a multiple...

fortran,vectorization,simd,intel-fortran

The loop cost is an estimate of the duration of one loop iteration and it takes somewhat longer in the vectorized case, but you can process more array items in one vectorized iteration. In your case the speedup is roughly 12 / 20 * 4 = 2.4 because you can...

_mm256_testz_ps just tests the sign bits - in order to test the values you'll need to compare against 0 and then extract the resulting mask, e.g. __m256 vcmp = _mm256_cmp_ps(*pSrc1, _mm256_set1_ps(0.0f), _CMP_EQ_OQ); int mask = _mm256_movemask_ps(vcmp); bool any_nz = mask != 0xff; ...

You can use min/max operations to get the desired result, e.g. inline __m128i _mm_sgn_epi16(__m128i v) { v = _mm_min_epi16(v, _mm_set1_epi16(1)); v = _mm_max_epi16(v, _mm_set1_epi16(-1)); return v; } This is probably a little more efficient than explicitly comparing with zero + shifting + combining results. Note that there is already an...

The documentation on using ARMv8 in Android is not very good, but for your specific questions, they're answered quite well in this document: ARMv8 Instruction Set Overview To answer your specific questions: mov R0, #42 vdup.8 D0, R0 becomes mov w0,#42 dup v0.8b,w0 and vld4.8 {d0-d3}, [r0]! becomes ld4 {v0.8b,v1.8b,v2.8b,v3.8b},[x0],#32...

That's almost exactly what _mm256_movemask_epi8 is for, except it takes the top bits of the bytes instead of the least significant bits. So just shift left by 7 first. Or, change how you produce those bytes, because you probably made them as 0x00 or 0xFF for false and true respectively,...

In general, these have been additive but keep in mind that there are differences between Intel and AMD support for these over the years. If you have AVX, then you can assume SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE 4.2 as well. Remember that to use AVX you also need...

You can extract 16 bit elements from an __m128i using _mm_extract_epi16 (requires SSE2): int16_t v = _mm_extract_epi16 (v, 4); // extract element 4 For 32 bit elements use _mm_extract_epi32 (requires SSE4.1) int32_t v = _mm_extract_epi32 (v, 0); // extract element 0 See: Intel Intrinsics Guide Assuming your struct is declared...

You can do table lookups in NEON using the VTBL and VTBX instructions, but they are only useful for look up tables with few entries. When optimising for NEON it is often best to look for a way to calculate values at run time instead of using a table. In...

That must be the instruction latency. (RAW dependency) While the ALU instructions have little to no latency, ie the results can be the operands for the next instruction without any delay, SIMD instructions tend to have long latencies until the results are available even for such simple ones like add....

parallel-processing,gpu,cpu,simd

It's a similar idea, it goes kind of like this (very informally speaking): The CPU has a set amount of functions that can run on packed values. Depending on your brand and version of your CPU, you might have access to SSE2, 3, 4, 3dnow, etc, and each of them...

c++,performance,parallel-processing,simd,avx2

No. On modern architectures the crossover at which Karatsuba beats schoolbook multiplication is usually somewhere between 8 and 24 machine words (e.g. between 512 and 1536 bits on x86_64). For fixed sizes, the threshold is at the smaller end of that range, and the new ADCX/ADOX instructions likely bring it...

c,sse,simd,memory-alignment,intrinsics

For the second load you need to use _mm_loadu_si128 because the source data is misaligned. Explanation: an offset of +5 ints from a base address which is 16 byte aligned will no longer be 16 byte aligned.

c++,visual-studio-2010,optimization,simd

You could do something like: #include <valarray> #define sz 1000 struct pixel { int r, g, b; }; int main() { std::valarray<int> src(3 * sz); std::valarray<int> ker(3 * sz); std::valarray<int> t = src * ker; pixel px; px.r = t[t % 3 == 0].sum(); px.g = t[t % 3 ==...

Integer operations are only added since AVX2. So you'll have to use SSE2. If the int values fit in 23 bits then you can use float instead https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2 Advanced Vector Extensions 2 (AVX2), also known as Haswell New Instructions,[2] is an expansion of the AVX instruction set introduced in Intel's...

You're displaying the values as if they are signed, because you use %d as the printf format specifier. If you use %u you'll see the equivalent unsigned values (0 or UINT_MAX). Note that signed-ness is not particularly important in this context - you can treat the comparison results as unsigned...

I think it's just a failure of GCC to catch the problem. It's certainly UB (aliasing violation). Solving it with __attribute__((__may_alias__)) is easy though: typedef uint32_t __attribute__((__may_alias__)) u32ma; then use u32ma instead of uint32_t in the pointer cast....

At a first glance, this looks like expected behavior. When you specify let test = int4(1,2,3,4) the integer literals there are implicitly initialized as Int32 types. When you just do a let x = 1 x by default has a type of Int. As a safety measure, Swift doesn't do...

I think it may be possible to implement BigNum with SIMD efficiently but not in the way you suggest. Instead of implementing a single BigNum using a SIMD register (or with an array of SIMD registers) you should process multiple BigNums at once. Let's consider 128-bit addition. Let 128-bit integers...

I'm afraid there's nothing at the CIL level. As you noted, Microsoft is moving forward with SIMD support for a future version of .NET. If you can use Mono, it has had some SIMD support for a while (currently incompatible with the Microsoft.Bcl.Simd API. Both .NET and Mono share a...

In my matrix multiplication code I only have to use the broadcast once per kernel code but if you really want to load four doubles in one instruction and then broadcast them to four registers you can do it like this #include <stdio.h> #include <immintrin.h> int main() { double in[]...

I'm not sure how to do a in-place transpose for arbitrary matrices using SIMD efficiently but I do know how to do it for out-of-place. Let me describe how to do both In place transpose For in-place transpose you should see Agner Fog's Optimizing software in C++ manual. See section...

there is absolutely no auto vectorization wrt SIMD in ghc at the moment. none the current simd primops will trigger a GHC panic when used with the native code gen, though they will work with the -fllvm backend. those simd primops crucially lack a good data model for data shuffling,...

Use the _mm256_load_si256 intrinsic. Quoting the Intel Intrinsics Guide: __m256i _mm256_load_si256 (__m256i const * mem_addr) #include "immintrin.h" [...] Description Load 256-bits of integer data from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated. If the alignment requirement is a problem,...

I'm not sure I completely understand what you're trying to do, but if you want to convert e.g. 16 doubles to 16 chars per iteration using AVX/SSE then here is some code that works: #include <iostream> #include <immintrin.h> __m128i proc(const __m256d in0, const __m256d in1, const __m256d in2, const __m256d...

@dbaupp put the nail in the coffin with the suggestion to use round: #![allow(unstable)] use std::simd::{f32x4}; use std::num::Float; fn clamp(val: f32) -> u8 { if val < 0.0 { 0 } else if val > 255.0 { 255 } else { val.round() as u8 } } fn sum_f32x4(v: f32x4) ->...

opencv,assembly,optimization,histogram,simd

Like Jester I'm surprised that your SIMD code had any significant improvement. Did you compile the C code with optimization turned on? The one additional suggestion I can make is to unroll your Packetloop loop. This is a fairly simple optimization and reduces the number of instructions per "iteration" to...

Technically two 64-bit values could result in a 128-bit result. That's why there are the following int64*int32+int32 functions, but not one that takes two 64-bit input values. int64x2_t vmlal_s32 (int64x2_t, int32x2_t, int32x2_t); int64x2_t vqdmlal_s32 (int64x2_t, int32x2_t, int32x2_t); If those don't work for you, then you'll need to use a scalar...

Your inner loop: for (int i = 0; i < 80 * x; i++){ f += 2; } is optimized away by compiler. Compiling on VC++ for x86 the whole loop folds into one single assembly instruction: lea esi, DWORD PTR [esi+ecx*2] Where ecx is value of 80*x, and esi...

If the elements which are <= 0 are relatively sparse then one simple approach is to test 8 at a time using AVX and then drop into scalar code when you identify a vector which contains one or more such elements, e.g.: #include <immintrin.h> // AVX intrinsics const __m256 vk0...

c++,vectorization,sse,simd,avx

Every time someone does this: temp_1 = _mm_set_epi32(x[j], x[j+1], x[j+2], x[j+3]); .. a puppy dies. Use one of these: temp_1 = _mm_load_si128(x); // if aligned temp_1 = _mm_loadu_si128(x); // if not aligned Cast x as necessary. There is no integer version of _mm_dp_ps. But you can do what you were...

openmp,vectorization,simd,powerpc

The XL compiler on POWER Linux currently only supports a subset of the OpenMP 4.0 features. The SIMD construct feature is not supported at the moment, so the compiler will not recognize the construct in your source code. However, if vectorization is what you're looking for then the good news...

something like this (you'll probably want to use maximum optimiser settings): #include <iostream> template<class Intrinsic> struct optimised { using type = Intrinsic; optimised(type v) : _v (v) {} operator type&() { return _v; } operator const type&() const { return _v; } type _v; }; // naiive implementation of madd...

You can do this with _mm_shuffle_ps (SHUFPS): #include "xmmintrin.h" // SSE xmm2 = _mm_shuffle_ps(xmm1, xmm1, _MM_SHUFFLE(0, 0, 0, 0)); Note: depending on how you've ordered the elements in your example above it might instead need to be: xmm2 = _mm_shuffle_ps(xmm1, xmm1, _MM_SHUFFLE(3, 3, 3, 3)); ...

That's a great question. Intel's answer (my bold) is here: This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction sequences to reflect special-number...

c++,optimization,simd,intrinsics,avx

Improving performance for the codes like yours is "well explored" and still popular area. Take a look at dot-product (perfect link provided by Z Boson already) or at some (D)AXPY optimization discussions (http://scicomp.stackexchange.com/questions/1932/are-daxpy-dcopy-dscal-overkills) In general , key topics to explore and consider applying are: AVX2 advantage over AVX due to...

Instruction predication means that an instruction is conditionally executed by a thread depending on a predicate. Threads for which the predicate is true execute the instruction, the rest do nothing. For example: var = 0; // Not taken by all threads if (condition) { var = 1; } else {...

c++,c++11,floating-point,sse,simd

SSE intrinsics can be pretty tedious sometimes... But not here. You just screwed up your loop : for( long long i = iMultipleOf4; i > 0LL; i -= 4LL ) I doubt it's doing what you expected. If iMultipleOf4 is 4, then your function will compute with 4,3,2,1 but not...

algorithm,sorting,time-complexity,sse,simd

It sounds as though a sorting network is the answer to the question that you asked, since the position of the comparators is not data dependent. Batcher's bitonic mergesort is O(n log2 n).

It seems that _mm_shuffle_epi8 is indeed the key to a solution. The idea is to set individual bits according to the values of the input vector a. These bits are distributed over (horizontal OR) the bytes of the 128 bits wide register. #include <stdio.h> #include <immintrin.h> /* gcc -O3 -Wall...

Each and every kernel instruction is always executed for all of the threads within a warp. Therefore it is logically not possible to carry out different instructions on different threads within the same warp at the same time. This would be against the SIMT execution model upon which GPUs are...

A x64 native (AMD64 or Intel 64) processor is only mandated to support SSE and SSE2. SSE3 is supported by Intel Pentium 4 processors (“Prescott”), AMD Athlon 64 (“revision E”), AMD Phenom, and later processors. This means most, but not quite all, x64 capable CPUs should support SSE3. Supplemental SSE3...

To my understanding, packed means that conceptually more than one value is transferred or used as an operand, whereas non-packed means that only one value is is processed; non-packed means that no parallel processing takes place.

Finally, i found this update is working properly int t; int s; int16_t *array; __m128i vector; posix_memalign ((void **) &array, BYTE_ALIGNMENT, n * m * sizeof(int16_t) ); int l=0; for (int i=0; i<n; i++) { for (int j=0; j<m; j++) { array[l] = (condition) ? t : s; // fill...

-march=core2 means that gcc can assume (along with 64 bit ISA) up to SSSE3 (e.g., MMX, SSE, SSE2, SSE3) is available. -mfpmath=sse can then force the use of SSE for floating-point arithmetic (the default in 64-bit mode), rather than 387 (the default in 32-bit -m32 mode). See: "Intel 386 and...

c,multithreading,performance,matrix-multiplication,simd

Here are the times I get building on your algorithm on my four core i7 IVB processor. sequential: 3.42 s 4 threads: 0.97 s 4 threads + SSE: 0.86 s Here are the times on a 2 core P9600 @2.53 GHz which is similar to the OP's E2200 @2.2 GHz...

There is a way to emulate this operation, but it is not very beautiful: const __m256i K0 = _mm256_setr_epi8( 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0,...

c++,algorithm,bit-manipulation,simd,intrinsics

As announced in the chat, I add a refined answer. It contains three parts, each of them followed by a description of that part. The 1st part, get.h, is my solution, but generalized and with one correction. The 2nd part, got.h, is the original algorithm as posted in the question,...

Here's some OpenMP code I put together to find the intersection of two sorted sets. It's possible to do this without a critical section when merging the results (as well as merge them sorted in parallel) but I did not do that here. It's probably possible to do this with...

Is this possible to improve more the performances/speed up ? It should be x4 (maximum) for SSE and x8 for AVX. Yes, I explained this in detail at efficient-4x4-matrix-vector-multiplication-with-sse-horizontal-add-and-dot-product. The efficient method for multiplying a 4x4 matrix M with a column vector u giving v = M u is:...

MOVSS moves single precision floats (32-bit). I assume that n is an integer so you can't load it into a XMM register with MOVSS. Use CVTSI2SS instead. printf cannot process single precision floats, which would converted to doubles by the compiler. It's convenient to use CVTSS2SI at this point. So...

Here is a solution (PaulR improved my solution, see the end of my answer or his answer) based on a variation of this question fastest-way-to-broadcast-32-bits-in-32-bytes. __m256i t1 = _mm256_set1_epi8(x); __m256i t2 = _mm256_and_si256(t1, mask); __m256i t4 = _mm256_cmpeq_epi32(t2, _mm256_setzero_si256()); t4 = _mm256_xor_si256(t4, _mm256_set1_epi32(-1)); I don't have AVX2 hardware to test...

A couple of optimisations for your existing code: If your data is sparse then it would probably be a good idea to add an additional test of each 8 bit mask value prior to testing the additional bits, i.e. int mask = packed_destinations[i].bitMask[j]; if (mask != 0) { if (mask...

The theoretical maximum is 25 32 bit integer ops per cycle: Port 0: 1 scalar op or 1 vector multiply op Port 1: 1 scalar op or 1 vector ALU op Port 5: 1 scalar op or 1 vector ALU op Port 6: 1 scalar op Since vector ops can...

I think you just have a trivial bug - your function should be: int check2(__m256i vector1, __m256i vector2) { __m256i vcmp = _mm256_cmpgt_epi16(vector1, vector2); int cmp = _mm256_movemask_epi8(vcmp); return cmp != 0; } The problem is that _mm256_movemask_epi8 returns 32 bit flags as a signed int, and you were testing...

Firstly, as per the comments above, I'm going to assume that it's OK to transpose LATTICEVELOCITIES: static const int32_t LATTICEVELOCITIES[3][PARAMQ] = { { 0, -1, 0, 1, 0, -1, 0, 1, -1, 0, 1, -1, 0, 1, 0, -1, 0, 1, 0 }, { -1, 0, 0, 0, 1, -1,...

Class constructor = Static constructor Instance constructor = Normal constructor For example, class MyClass { // Static/Class constructor. // Note: Static constructors cannot have visibility modifier (eg. public/private), // and cannot have any arguments. static MyClass() { ... // This will only execute once - when this class is first...

c,visual-c++,simd,avx,dot-product

There are two big inefficiencies in your loop that are immediately apparent: (1) these two chunks of scalar code: __declspec(align(32)) double ar[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x }; ... __m256d y = _mm256_load_pd(ar); and __declspec(align(32)) double arr[4] = { xb[i].x, xb[i + 1].x,...

To clip double precision values to a range of -1.0 to +1.0 you can use max/min operations. E.g. if you have a buffer, buff, of N double values: const __m128d kMax = _mm_set1_pd(1.0); const __m128d kMin = _mm_set1_pd(-1.0); for (int i = 0; i < N; i += 2) {...

That data structure has multiple names such as a Hybrid Structure of Arrays (see Extending a C-like Language for Portable SIMD Programming) or an array of struct of arrays (AoSoA). AoS is not suitable for SIMD. SoA is an improvement but in some cases is still not sufficient. The solution...

c++,optimization,vectorization,sse,simd

The error 1305 happens because the optimizer did not vectorize the loop since the value sum is not used. Simply adding printf("%d\n", sum) fixes that. But then you get a new error code 1105 "Loop includes a non-recognized reduction operation". To fix this you need you need to set /fp:fast...