I have an i5-4250U, which has AVX2 and FMA3. I am testing some dense matrix multiplication code that I wrote, using GCC 4.8.1 on Linux. Below are the three different ways I compile it.
SSE2: gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp
AVX: gcc matrix.cpp -o matrix_gcc -O3 -mavx -fopenmp
AVX2+FMA: gcc matrix.cpp -o matrix_gcc -O3 -march=native -fopenmp -ffast-math
The SSE2 and AVX versions are clearly different in performance. However, the AVX2+FMA version is no better than the AVX version. I don't understand this. I get over 80% of the peak flops of the CPU assuming there is no FMA, but I think I should be able to do a lot better with FMA. Matrix multiplication should benefit directly from FMA; I'm essentially doing eight dot products at once in AVX. Checking what -march=native enables:
cc -march=native -E -v - </dev/null 2>&1 | grep cc1 | grep fma
...-march=core-avx2 -mavx -mavx2 -mfma -mno-fma4 -msse4.2 -msse4.1 ...
So I can see that FMA is enabled (just to be sure I also added -mfma, but it makes no difference). -ffast-math should allow a relaxed floating-point model, so I expected the compiler to generate FMA on its own. So how do I use Fused Multiply-Add (FMA) instructions with SSE/AVX?
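For reference, this is the kind of loop I expected GCC to turn into FMA on its own (a minimal sketch, not my actual kernel; the names are illustrative). With -O3 -mfma (or -march=native) and floating-point contraction allowed, e.g. via -ffast-math or -ffp-contract=fast, the compiler may emit vfmadd instructions here; with -ffp-contract=off it has to keep the multiply and the add separate:

#include <stddef.h>

// c[i] += a[i] * b[i] over n elements: a multiply feeding an add,
// which is exactly the pattern FMA contraction targets.
void muladd(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        c[i] += a[i] * b[i];
}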
Edit:
Based on Mysticial's comments I went ahead and used _mm256_fmadd_ps, and now the AVX2+FMA version is faster (a sketch of the change is shown after the peak-flops calculation below). I'm not sure why the compiler won't do this for me. I'm now getting about 80 GFLOPS (110% of the peak flops without FMA) for matrices larger than 1000x1000. In case anyone does not trust my peak-flops calculation, here is what I did.
peak flops (no FMA) = frequency * simd_width * ILP * cores
                    = 2.3 GHz * 8 * 2 * 2 = 73.6 GFLOPS
peak flops (with FMA) = 2 * peak flops (no FMA) = 147.2 GFLOPS
My CPU runs at 2.3 GHz in turbo mode when both cores are active. I get 2 for ILP because Ivy Bridge (and likewise Haswell) can do one AVX multiplication and one AVX addition in the same cycle (and I have unrolled the loop several times to ensure this).
I'm only getting about 55% of the peak flops (with FMA), i.e. 80 out of 147.2 GFLOPS. I'm not sure why, but at least I'm seeing something now.
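For reference, the change mentioned above amounts to something like the following (a sketch only; the kernel layout and names are illustrative, not my actual code). Each call updates eight partial dot products held in one YMM register:

#include <immintrin.h>

// AVX only: one rounded multiply followed by one rounded add (two instructions).
static inline __m256 step_avx(__m256 acc, __m256 a_bcast, __m256 b_row) {
    return _mm256_add_ps(_mm256_mul_ps(a_bcast, b_row), acc);
}

// AVX2+FMA3 (needs -mfma or -march=core-avx2): the same update as one fused
// multiply-add instruction with a single rounding.
static inline __m256 step_fma(__m256 acc, __m256 a_bcast, __m256 b_row) {
    return _mm256_fmadd_ps(a_bcast, b_row, acc);
}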
One side effect is that I now get a small difference when I compare the result to a simple matrix multiplication algorithm I know I trust. I think that's because FMA performs only a single rounding for the combined multiply and add instead of the two roundings a separate multiply and add would do (which, ironically, breaks bit-exact agreement with the usual multiply-then-add sequence even though it's probably more accurate).
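Here is a tiny self-contained demonstration of that single rounding (the values are chosen purely for illustration; compile with contraction disabled, e.g. -ffp-contract=off, so the first expression really is a separate multiply and add):

#include <cmath>
#include <cstdio>

int main() {
    float a = 1.0f + 0.000244140625f;   // 1 + 2^-12, exactly representable
    float c = -(1.0f + 0.00048828125f); // -(1 + 2^-11), exactly representable
    // a*a = 1 + 2^-11 + 2^-24. Rounded to float the product is 1 + 2^-11,
    // so multiply-then-add gives exactly 0. The fused version keeps the
    // unrounded product and returns 2^-24 (about 5.96e-08).
    std::printf("mul+add: %g\n", a * a + c);
    std::printf("fma    : %g\n", std::fma(a, a, c));
    return 0;
}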
Edit:
Somebody needs to redo How do I achieve the theoretical maximum of 4 FLOPs per cycle? but hitting 16 double-precision FLOPs per cycle per core with FMA on Haswell.
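In that spirit, here is a minimal sketch of the kind of inner kernel such a test needs (illustrative only, not Mysticial's code): with a 5-cycle FMA latency and two FMA ports on Haswell, roughly ten independent accumulator chains are needed to stay at the 16 DP FLOPs/cycle peak.

#include <immintrin.h>

// Pure-register FMA burn loop: each outer iteration does 10 fused
// multiply-adds on 4 doubles each, i.e. 80 FLOPs, with no memory traffic.
double fma_burn(long iters) {
    __m256d acc[10];
    for (int i = 0; i < 10; ++i)
        acc[i] = _mm256_set1_pd(1.0 + i * 1e-9);
    const __m256d x = _mm256_set1_pd(1.0000001); // keeps values near 1
    const __m256d y = _mm256_set1_pd(-1e-9);
    for (long n = 0; n < iters; ++n)
        for (int i = 0; i < 10; ++i)
            acc[i] = _mm256_fmadd_pd(acc[i], x, y); // acc = acc*x + y
    // Reduce the accumulators so the compiler cannot discard the work.
    __m256d s = acc[0];
    for (int i = 1; i < 10; ++i)
        s = _mm256_add_pd(s, acc[i]);
    double out[4];
    _mm256_storeu_pd(out, s);
    return out[0] + out[1] + out[2] + out[3];
}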
Edit:
Actually, Mysticial has updated his project to support FMA3 (see his answer in the link above). I ran his code on Windows 8 with MSVC 2012 (because the Linux version did not compile with FMA support). Here are the results.
Testing AVX Mul + Add:
Seconds = 22.7417
FP Ops = 768000000000
FLOPs = 3.37705e+010
sum = 17.8122
Testing FMA3 FMA:
Seconds = 22.1389
FP Ops = 1536000000000
FLOPs = 6.938e+010
sum = 333.309
That's 69.38 GFLOPS for FMA3 with double-precision floating point. For single precision I need to double it, so that's 138.76 SP GFLOPS. I calculate my peak to be 147.2 SP GFLOPS. That's about 94% of the peak! In other words, I should be able to improve my GEMM code quite a bit (although it's already quite a bit faster than Eigen).