Firstly, as per the comments above, I'm going to assume that it's OK to transpose LATTICEVELOCITIES: static const int32_t LATTICEVELOCITIES[3][PARAMQ] = { { 0, -1, 0, 1, 0, -1, 0, 1, -1, 0, 1, -1, 0, 1, 0, -1, 0, 1, 0 }, { -1, 0, 0, 0, 1, -1,...

I'm guessing the source of your confusion is that the double precision intrinsics (_mm_load_pd et al) each process a vector of two double precision values. lda appears to be the stride. So for example: c1 = _mm_loadu_pd( C+0*lda ); c2 = _mm_loadu_pd( C+1*lda ); loads a 2x2 block of doubles...
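To make the 2x2 block idea concrete, here is a minimal sketch. It assumes C points to doubles laid out so that consecutive lines of the matrix are lda elements apart; the helper name load_block_2x2 and the out buffer are mine, not from the question. Each load picks up two adjacent doubles, and the second load starts lda elements further on.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_loadu_pd, _mm_storeu_pd */

/* Hypothetical helper: copy the 2x2 block at the start of C into out[4].
 * lda is assumed to be the stride (in doubles) between consecutive lines. */
void load_block_2x2(const double *C, int lda, double out[4])
{
    __m128d c1 = _mm_loadu_pd(C + 0 * lda);  /* C[0],   C[1]       */
    __m128d c2 = _mm_loadu_pd(C + 1 * lda);  /* C[lda], C[lda + 1] */

    _mm_storeu_pd(out + 0, c1);
    _mm_storeu_pd(out + 2, c2);
}
```

The unaligned load/store variants are used so the sketch makes no assumption about the alignment of C or out.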

To clip double precision values to a range of -1.0 to +1.0 you can use max/min operations. E.g. if you have a buffer, buff, of N double values: const __m128d kMax = _mm_set1_pd(1.0); const __m128d kMin = _mm_set1_pd(-1.0); for (int i = 0; i < N; i += 2) {...
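As a self-contained sketch of the max/min approach: the function name clip_unit, the unaligned loads and the scalar tail for odd N are my assumptions, not necessarily what the loop above does.

```c
#include <emmintrin.h>  /* SSE2: _mm_min_pd, _mm_max_pd, ... */
#include <stddef.h>

/* Hypothetical helper: clamp buff[0..n-1] to [-1.0, +1.0] in place. */
void clip_unit(double *buff, size_t n)
{
    const __m128d kMax = _mm_set1_pd(1.0);
    const __m128d kMin = _mm_set1_pd(-1.0);
    size_t i = 0;

    /* Two doubles per iteration. */
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(&buff[i]);
        v = _mm_min_pd(v, kMax);   /* cap at +1.0 */
        v = _mm_max_pd(v, kMin);   /* floor at -1.0 */
        _mm_storeu_pd(&buff[i], v);
    }

    /* Scalar tail: handles the last element when n is odd. */
    for (; i < n; ++i) {
        if (buff[i] > 1.0)
            buff[i] = 1.0;
        else if (buff[i] < -1.0)
            buff[i] = -1.0;
    }
}
```

If the buffer is known to be 16-byte aligned and N is even, the aligned _mm_load_pd/_mm_store_pd variants can be used and the tail loop dropped.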

Your code involves only very basic floating point maths, and I bet that if you turn optimizations on (even -O1) it gets optimized out, because those values are constant expressions and so can be computed at compile time. SSE is used (movss, mulss) because it's the baseline for floating point arithmetic, if we...

I think you just have a trivial bug - your function should be: int check2(__m256i vector1, __m256i vector2) { __m256i vcmp = _mm256_cmpgt_epi16(vector1, vector2); int cmp = _mm256_movemask_epi8(vcmp); return cmp != 0; } The problem is that _mm256_movemask_epi8 returns 32 flag bits packed into a signed int, and you were testing...
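The sign issue can be shown without any AVX2 code at all. In this plain C sketch both function names are hypothetical, and the signed "> 0" test is only my assumption about the kind of comparison that goes wrong, not a quote of the original code. When the flag for the topmost byte is set, bit 31 of the mask is set and the int is negative, so a "greater than zero" test misses it while "!= 0" does not.

```c
#include <limits.h>  /* INT_MIN: only bit 31 set on two's complement targets */

/* Correct: any set bit, including bit 31, counts as a hit. */
int any_flag_set(int mask)
{
    return mask != 0;
}

/* Assumed buggy variant: a signed comparison ignores masks with bit 31 set. */
int any_flag_set_signed_test(int mask)
{
    return mask > 0;
}
```

For a mask of INT_MIN (only the top flag set) the first function reports a hit and the second does not.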

I'm not sure I completely understand what you're trying to do, but if you want to convert e.g. 16 doubles to 16 chars per iteration using AVX/SSE then here is some code that works: #include <iostream> #include <immintrin.h> __m128i proc(const __m256d in0, const __m256d in1, const __m256d in2, const __m256d...