Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I've been running into some issues that only occurred during Release x86 mode and not during Release x64 or any Debug mode. I managed to reproduce the bug using the following code:

#include <stdio.h>
#include <iostream>
using namespace std;

struct WMatrix {
    float _11, _12, _13, _14;
    float _21, _22, _23, _24;
    float _31, _32, _33, _34;
    float _41, _42, _43, _44;

    WMatrix(float f11, float f12, float f13, float f14,
            float f21, float f22, float f23, float f24,
            float f31, float f32, float f33, float f34,
            float f41, float f42, float f43, float f44) :
        _11(f11), _12(f12), _13(f13), _14(f14),
        _21(f21), _22(f22), _23(f23), _24(f24),
        _31(f31), _32(f32), _33(f33), _34(f34),
        _41(f41), _42(f42), _43(f43), _44(f44) {
    }
};

void printmtx(WMatrix m1) {
    char str[256];
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._11, m1._12, m1._13, m1._14);
    cout << str << "
";
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._21, m1._22, m1._23, m1._24);
    cout << str << "
";
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._31, m1._32, m1._33, m1._34);
    cout << str << "
";
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._41, m1._42, m1._43, m1._44);
    cout << str << "
";
}

WMatrix mul1(WMatrix m, float f) {
    WMatrix out = m;
    for (unsigned int i = 0; i < 4; i++) {
        for (unsigned int j = 0; j < 4; j++) {
            unsigned int idx = i * 4 + j; // critical code
            *(&out._11 + idx) *= f; // critical code
        }
    }
    return out;
}

WMatrix mul2(WMatrix m, float f) {
    WMatrix out = m;
    unsigned int idx2 = 0;
    for (unsigned int i = 0; i < 4; i++) {
        for (unsigned int j = 0; j < 4; j++) {
            unsigned int idx = i * 4 + j; // critical code
            bool b = idx == idx2; // critical code
            *(&out._11 + idx) *= f; // critical code
            idx2++;
        }
    }
    return out;
}


int main() {
    WMatrix m1(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
    WMatrix m2 = mul1(m1, 0.5f);
    WMatrix m3 = mul2(m1, 0.5f);

    printmtx(m1);
    cout << "
";
    printmtx(m2);
    cout << "
";
    printmtx(m3);

    int x;
    cin >> x;
}

In the above code, mul2 works, but mul1 does not. mul1 and mul2 are simply trying to iterate over the floats in the WMatrix and multiply them by f, but the way mul1 indexes (i*4+j) somehow evaluates to incorrect results. All mul2 does different is it checks the index before using it and then it works (there are many other ways of tinkering with the index to make it work). Notice if you remove the line "bool b = idx == idx2" then mul2 also breaks...

Here is the output:

1.000, 2.000, 3.000, 4.000
5.000, 6.000, 7.000, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000

0.500, 0.500, 0.375, 0.250
0.625, 1.500, 3.500, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000

0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000

Correct output should be...

1.000, 2.000, 3.000, 4.000
5.000, 6.000, 7.000, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000

0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000

0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000

Am I missing something? Or is it actually a bug in the compiler?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
158 views
Welcome To Ask or Share your Answers For Others

1 Answer

This afflicts only the 32-bit compiler; x86-64 builds are not affected, regardless of optimization settings. However, you see the problem manifest in 32-bit builds whether optimizing for speed (/O2) or size (/O1). As you mentioned, it works as expected in debugging builds with optimization disabled.

Wimmel's suggestion of changing the packing, accurate though it is, does not change the behavior. (The code below assumes the packing is correctly set to 1 for WMatrix.)

I can't reproduce it in VS 2010, but I can in VS 2013 and 2015. I don't have 2012 installed. That's good enough, though, to allow us to analyze the difference between the object code produced by the two compilers.

Here is the code for mul1 from VS 2010 (the "working" code):
(Actually, in many cases, the compiler inlined the code from this function at the call site. But the compiler will still output disassembly files containing the code it generated for the individual functions prior to inlining. That's what we're looking at here, because it is more cluttered. The behavior of the code is entirely equivalent whether it's been inlined or not.)

PUBLIC  mul1
_TEXT   SEGMENT
_m$ = 8                     ; size = 64
_f$ = 72                        ; size = 4
mul1 PROC
 ___$ReturnUdt$ = eax

    push    esi
    push    edi

    ; WMatrix out = m;

    mov ecx, 16                 ; 00000010H
    lea esi, DWORD PTR _m$[esp+4]
    mov edi, eax
    rep movsd

    ; for (unsigned int i = 0; i < 4; i++)
    ; {
    ;    for (unsigned int j = 0; j < 4; j++)
    ;    {
    ;       unsigned int idx = i * 4 + j; // critical code
    ;       *(&out._11 + idx) *= f; // critical code

    movss   xmm0, DWORD PTR [eax]
    cvtps2pd xmm1, xmm0
    movss   xmm0, DWORD PTR _f$[esp+4]
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax], xmm1
    movss   xmm1, DWORD PTR [eax+4]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+4], xmm1
    movss   xmm1, DWORD PTR [eax+8]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+8], xmm1
    movss   xmm1, DWORD PTR [eax+12]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+12], xmm1
    movss   xmm2, DWORD PTR [eax+16]
    cvtps2pd xmm2, xmm2
    cvtps2pd xmm1, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+16], xmm1
    movss   xmm1, DWORD PTR [eax+20]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+20], xmm1
    movss   xmm1, DWORD PTR [eax+24]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+24], xmm1
    movss   xmm1, DWORD PTR [eax+28]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+28], xmm1
    movss   xmm1, DWORD PTR [eax+32]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+32], xmm1
    movss   xmm1, DWORD PTR [eax+36]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+36], xmm1
    movss   xmm2, DWORD PTR [eax+40]
    cvtps2pd xmm2, xmm2
    cvtps2pd xmm1, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+40], xmm1
    movss   xmm1, DWORD PTR [eax+44]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+44], xmm1
    movss   xmm2, DWORD PTR [eax+48]
    cvtps2pd xmm1, xmm0
    cvtps2pd xmm2, xmm2
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+48], xmm1
    movss   xmm1, DWORD PTR [eax+52]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    movss   DWORD PTR [eax+52], xmm1
    movss   xmm1, DWORD PTR [eax+56]
    cvtps2pd xmm1, xmm1
    cvtps2pd xmm2, xmm0
    mulsd   xmm1, xmm2
    cvtpd2ps xmm1, xmm1
    cvtps2pd xmm0, xmm0
    movss   DWORD PTR [eax+56], xmm1
    movss   xmm1, DWORD PTR [eax+60]
    cvtps2pd xmm1, xmm1
    mulsd   xmm1, xmm0
    pop edi
    cvtpd2ps xmm0, xmm1
    movss   DWORD PTR [eax+60], xmm0
    pop esi

    ; return out;
    ret 0
mul1 ENDP

Compare that to the code for mul1 generated by VS 2015:

mul1 PROC
_m$ = 8                         ; size = 64
; ___$ReturnUdt$ = ecx
; _f$ = xmm2s

    ; WMatrix out = m;

    movups  xmm0, XMMWORD PTR _m$[esp-4]

    ; for (unsigned int i = 0; i < 4; i++)

    xor eax, eax
    movaps  xmm1, xmm2
    movups  XMMWORD PTR [ecx], xmm0
    movups  xmm0, XMMWORD PTR _m$[esp+12]
    shufps  xmm1, xmm1, 0
    movups  XMMWORD PTR [ecx+16], xmm0
    movups  xmm0, XMMWORD PTR _m$[esp+28]
    movups  XMMWORD PTR [ecx+32], xmm0
    movups  xmm0, XMMWORD PTR _m$[esp+44]
    movups  XMMWORD PTR [ecx+48], xmm0
    npad    4
$LL4@mul1:

    ; for (unsigned int j = 0; j < 4; j++)
    ; {
    ;    unsigned int idx = i * 4 + j; // critical code
    ;    *(&out._11 + idx) *= f; // critical code

    movups  xmm0, XMMWORD PTR [ecx+eax*4]
    mulps   xmm0, xmm1
    movups  XMMWORD PTR [ecx+eax*4], xmm0
    inc eax
    cmp eax, 4
    jb  SHORT $LL4@mul1

    ; return out;
    mov eax, ecx
    ret 0
?mul1@@YA?AUWMatrix@@U1@M@Z ENDP            ; mul1
_TEXT   ENDS

It is immediately obvious how much shorter the code is. Apparently the optimizer got a lot smarter between VS 2010 and VS 2015. Unfortunately, sometimes the source of the optimizer's "smarts" is the exploitation of bugs in your code.

Looking at the code that matches up with the loops, you can see that VS 2010 is unrolling the loops. All of the computations are done inline so that there are no branches. This is kind of what you'd expect for loops with upper and lower bounds that are known at compile time and, as in this case, reasonably small.

What happened in VS 2015? Well, it didn't unroll anything. There are 5 lines of code, and then a conditional jump JB back to the top of the loop sequence. That alone doesn't tell you much. What does look highly suspicious is that it only loops 4 times (see the cmp eax, 4 statement that sets flags right before doing the jb, effectively continuing the loop as long as the counter is less than 4). Well, that might be okay if it had merged the two loops into one. Let's see what it's doing inside of the loop:

$LL4@mul1:
  movups  xmm0, XMMWORD PTR [ecx+eax*4]   ; load a packed unaligned value into XMM0
  mulps   xmm0, xmm1                      ; do a packed multiplication of XMM0 by XMM1,
                                          ;  storing the result in XMM0
  movups  XMMWORD PTR [ecx+eax*4], xmm0   ; store the result of the previous multiplication
                                          ;  back into the memory location that we
                                          ;  initially loaded from

  inc      eax                            ; one iteration done, increment loop counter
  cmp      eax, 4                         ; see how many loops we've done
  jb       $LL4@mul1                      ; keep looping if < 4 iterations

The code reads a value from memory (an XMM-sized value from the location determined by ecx + eax * 4) into XMM0, multiplies it by a value in XMM1 (which was set outside the loop, based on the f parameter), and then stores the result back into the original memory location.

Compare that to the code for the corresponding loop in mul2:

$LL4@mul2:
  lea     eax, DWORD PTR [eax+16]
  movups  xmm0, XMMWORD PTR [eax-24]
  mulps   xmm0, xmm2
  movups  XMMWORD PTR [eax-24], xmm0
  sub     ecx, 1
  jne     $LL4@mul2

Aside from a different loop control sequence (this sets ECX to 4 outside of the loop, subtracts 1 each time through, and keeps looping as long as ECX != 0), the big difference here is the actual XMM values that it manipulates in memory. Instead of loading from [ecx+eax*4], it loads from [eax-24] (after having previously added 16 to EAX).

What's different about mul2? You had added code to track a separate index in idx2, incrementing it each time through the loop. Now, this alone would not be enough. If you comment out the assignment to the bool variable b, mul1 and mul2 result in identical object code. Clearly without the comparison of idx to idx2, the compiler is able to deduce that idx2 is completely unused, and therefore eliminate it, turning mul2 into mul1. But with that comparison, the compiler apparently becomes unable to eliminate idx2, and its presence ever so slightly changes what optimizations are deemed possible for the function, resulting in the output discrepancy.

Now the question turns to why is this happening. Is it an optimizer bug, as you first suspected? Well, no—and as some of the commenters have mentioned, it should never be your first instinct to blame the compiler/optimizer. Always assume that there are bugs in your code unless you can prove otherwise. That proof would always involve looking at the disassembly, and preferably referencing the relevant portions of the language standard if you really want to be taken seriously.

In this case, Mystical has already nailed the problem. Your code exhibits undefined behavior when it does *(&out._11 + idx). This makes certain assumptions about the layout of the WMatrix struct in memory, which you cannot legally make, even after explicitly setting the packing.

This is why undefined behavior is evil—it results in code that seems to work sometimes, but other times it doesn't. It is very sensitive to compiler flags, especially optimizations, but also target platforms (as we saw at the top of this answer). mul2 only works by accident. Both mul1 and mul2 are wrong. Unfortunately, the bug is in your code. Worse, the compiler didn't issue a warning that might have alerted you to your use of undefined behavior.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...