Is it possible to improve the performance further? The speedup should be at most 4x for SSE and 8x for AVX.

Yes, I explained this in detail at efficient-4x4-matrix-vector-multiplication-with-sse-horizontal-add-and-dot-product. The efficient method for multiplying a 4x4 matrix M with a column vector u, giving v = M u, is:...
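The method itself is cut off above, so the following is only a sketch of one common SSE approach and not necessarily the one from the linked answer: broadcast each component of u and accumulate the scaled columns of M. The helper name `mat4_mul_vec4` and the column-major storage of M are assumptions made for illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: computes v = M * u for a 4x4 float matrix.
   Assumption: M is stored column-major, so M + 4*j points at column j. */
static void mat4_mul_vec4(const float *M, const float *u, float *v)
{
    /* Load the four columns of M (unaligned loads keep the sketch simple). */
    __m128 c0 = _mm_loadu_ps(M + 0);
    __m128 c1 = _mm_loadu_ps(M + 4);
    __m128 c2 = _mm_loadu_ps(M + 8);
    __m128 c3 = _mm_loadu_ps(M + 12);

    /* Broadcast each component of u across a full register. */
    __m128 u0 = _mm_set1_ps(u[0]);
    __m128 u1 = _mm_set1_ps(u[1]);
    __m128 u2 = _mm_set1_ps(u[2]);
    __m128 u3 = _mm_set1_ps(u[3]);

    /* v = u0*c0 + u1*c1 + u2*c2 + u3*c3, four lanes at a time. */
    __m128 r = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(c0, u0), _mm_mul_ps(c1, u1)),
        _mm_add_ps(_mm_mul_ps(c2, u2), _mm_mul_ps(c3, u3)));

    _mm_storeu_ps(v, r);
}
```

With this layout no horizontal add is needed: the whole product takes four multiplies and three adds on full registers. If M were stored row-major instead, the same code would compute the product with the transpose of M.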

Finally, I found that this update works properly:

    int t;
    int s;
    int16_t *array;
    __m128i vector;
    posix_memalign((void **) &array, BYTE_ALIGNMENT, n * m * sizeof(int16_t));
    int l = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            array[l] = (condition) ? t : s; // fill...
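For reference, here is a self-contained sketch of the allocate-and-fill step above, followed by an aligned load into a `__m128i`. Several parts are assumptions because the original is cut off: `BYTE_ALIGNMENT` is taken to be 16, the unspecified `condition` is replaced by a checkerboard test `(i + j) % 2 == 0`, and the helper names `make_filled_array` and `sum_epi16` are hypothetical. The summation is only there to show that the aligned loads work on the filled buffer.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign in strict modes */
#endif
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BYTE_ALIGNMENT 16  /* assumed value: 16 bytes, the size of __m128i */

/* Hypothetical helper: allocates an aligned n*m int16_t array and fills it
   with t where (i + j) is even and s otherwise (a stand-in for the
   unspecified condition). Returns NULL if the allocation fails. */
static int16_t *make_filled_array(int n, int m, int16_t t, int16_t s)
{
    void *p = NULL;
    if (posix_memalign(&p, BYTE_ALIGNMENT,
                       (size_t)n * m * sizeof(int16_t)) != 0)
        return NULL;

    int16_t *array = p;
    int l = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            array[l++] = ((i + j) % 2 == 0) ? t : s;
        }
    }
    return array;
}

/* Hypothetical helper: sums count elements with 16-bit wraparound, eight
   at a time. The array must be 16-byte aligned for _mm_load_si128. */
static int16_t sum_epi16(const int16_t *array, size_t count)
{
    __m128i acc = _mm_setzero_si128();
    size_t k = 0;
    for (; k + 8 <= count; k += 8) {
        __m128i vector = _mm_load_si128((const __m128i *)(array + k));
        acc = _mm_add_epi16(acc, vector);
    }

    /* Reduce the eight lanes, then add any leftover tail elements. */
    int16_t lanes[8];
    _mm_storeu_si128((__m128i *)lanes, acc);
    int16_t total = 0;
    for (int i = 0; i < 8; i++)
        total = (int16_t)(total + lanes[i]);
    for (; k < count; k++)
        total = (int16_t)(total + array[k]);
    return total;
}
```

Checking the return value of `posix_memalign` matters because it leaves the pointer unspecified on failure; the buffer is released with plain `free`.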