When I first got a Haswell processor I tried implementing FMA to determine the Mandelbrot set. The main algorithm is this:
intn = 0;
for(int32_t i=0; i<maxiter; i++) {
floatn x2 = square(x), y2 = square(y); //square(x) = x*x
floatn r2 = x2 + y2;
booln mask = r2<cut; //booln is in the float domain non integer domain
if(!horizontal_or(mask)) break; //_mm256_testz_pd(mask)
n -= mask
floatn t = x*y; mul2(t); //mul2(t): t*=2
x = x2 - y2 + cx;
y = t + cy;
}
This determines if n
pixels are in the Mandelbrot set. So for double floating point it runs over 4 pixels (floatn = __m256d
, intn = __m256i
). This requires 4 SIMD floating point multiplication and four SIMD floating point additions.
Then I modified this to work with FMA like this
intn n = 0;
for(int32_t i=0; i<maxiter; i++) {
floatn r2 = mul_add(x,x,y*y);
booln mask = r2<cut;
if(!horizontal_or(mask)) break;
add_mask(n,mask);
floatn t = x*y;
x = mul_sub(x,x, mul_sub(y,y,cx));
y = mul_add(2.0f,t,cy);
}
where mul_add calls _mm256_fmad_pd
and mul_sub calls _mm256_fmsub_pd
. This method uses 4 FMA SIMD operations, and two SIMD multiplications which is two less arithmetic operations then without FMA. Additionally, FMA and multiplication can use two ports and addition only one.
To make my tests less biased I zoomed into a region which is entirely in the Mandelbrot set so all the values are maxiter
. In this case the method using FMA is about 27% faster. That's certainly an improvement but going from SSE to AVX doubled my performance so I was hoping for maybe another factor of two with FMA.
But then I found this answer in regards to FMA where it says
The important aspect of the fused-multiply-add instruction is the (virtually) infinite precision of the intermediate result. This helps with performance, but not so much because two operations are encoded in a single instruction — It helps with performance because the virtually infinite precision of the intermediate result is sometimes important, and very expensive to recover with ordinary multiplication and addition when this level of precision is really what the programmer is after.
and later gives an example of double*double to double-double multiplication
high = a * b; /* double-precision approximation of the real product */
low = fma(a, b, -high); /* remainder of the real product */
From this, I concluded that I was implementing FMA non-optimally and so I decided to implement SIMD double-double. I implemented double-double based on the paper Extended-Precision Floating-Point Numbers for GPU Computation. The paper is for double-float so I modified it for double-double. Additionally, instead of packing one double-double value in a SIMD registers I pack 4 double-double values into one AVX high register and one AVX low register.
For the Mandelbrot set what I really need is double-double multiplication and addition. In that paper these are the df64_add
and df64_mult
functions.
The image below shows the assembly for my df64_mult
function for software FMA (left) and hardware FMA (right). This clearly shows that hardware FMA is a big improvement for double-double multiplication.
So how does hardware FMA perform in the double-double Mandelbrot set calculation? The answer is that's only about 15% faster than with software FMA. That's much less than I hoped for. The double-double Mandelbrot calculation needs 4 double-double additions, and four double-double multiplications (x*x
, y*y
, x*y
, and 2*(x*y)
). However, the 2*(x*y)
multiplication is trivial for double-double so this multiplication can be ignored in the cost. Therefore, the reason I think the improvement using hardware FMA is so small is that the calculation is dominated by the slow double-double addition (see the assembly below).
It used to be that multiplication was slower than addition (and programers used several tricks to avoid multiplication) but with Haswell it seems that it's the other way around. Not only due to FMA but also because multiplication can use two ports but addition only one.
So my questions (finally) are:
- How does one optimize when addition is slow compared to multiplication?
- Is there an algebraic way to change my algorithm to use more multiplications
and less additions? I know there are method to do the reverse, e.g.
(x+y)*(x+y) - (x*x+y*y) = 2*x*y
which use two more additions for one less multiplication. - Is there a way to simply the df64_add function (e.g. using FMA)?
In case anyone is wondering the double-double method is about ten times slower than double. That's not so bad I think as if there was a hardware quad-precision type it would likely be at least twice as slow as double so my software method is about five times slower than what I would expect for hardware if it existed.
df64_add
assembly
vmovapd 8(%rsp), %ymm0
movq %rdi, %rax
vmovapd 72(%rsp), %ymm1
vmovapd 40(%rsp), %ymm3
vaddpd %ymm1, %ymm0, %ymm4
vmovapd 104(%rsp), %ymm5
vsubpd %ymm0, %ymm4, %ymm2
vsubpd %ymm2, %ymm1, %ymm1
vsubpd %ymm2, %ymm4, %ymm2
vsubpd %ymm2, %ymm0, %ymm0
vaddpd %ymm1, %ymm0, %ymm2
vaddpd %ymm5, %ymm3, %ymm1
vsubpd %ymm3, %ymm1, %ymm6
vsubpd %ymm6, %ymm5, %ymm5
vsubpd %ymm6, %ymm1, %ymm6
vaddpd %ymm1, %ymm2, %ymm1
vsubpd %ymm6, %ymm3, %ymm3
vaddpd %ymm1, %ymm4, %ymm2
vaddpd %ymm5, %ymm3, %ymm3
vsubpd %ymm4, %ymm2, %ymm4
vsubpd %ymm4, %ymm1, %ymm1
vaddpd %ymm3, %ymm1, %ymm0
vaddpd %ymm0, %ymm2, %ymm1
vsubpd %ymm2, %ymm1, %ymm2
vmovapd %ymm1, (%rdi)
vsubpd %ymm2, %ymm0, %ymm0
vmovapd %ymm0, 32(%rdi)
vzeroupper
ret
See Question&Answers more detail:os