C 64-bit loop performance on x86

Question

Welcome To Ask or Share your Answers For Others

C 64-bit loop performance on x86

asked Oct 6, 2021 in Technique[技术] by 深蓝 (71.8m points)

I needed an Internet Checksum function (one's complement checksum) for some IPv4 ICMP processing code using raw sockets and I've stumbled on behaviour I can't explain on a 64-bit Intel processor (using gcc 4.8.2). I was wondering if anybody could shed some light on it.

I implemented a first checksum function using a 32-bit accumulator and performing 16-bit sums. Then I implemented the same using a 64-bit accumulator and 32-bit sums, thinking that less sums would result in faster execution. The result is that the first implementation runs twice as fast as the second (with O3 optimisation). I simply can't figure out why...

The code below doesn't actually perform accurate checksums (I've simplified it) but illustrates the problem. Both compiled as 64-bit running on 64-bit native platform (LP64: short 16-bit, int 32-bit, long 64-bit, pointers 64-bit).

32-bit accumulator and 16-bit sums

unsigned short
cksum_16_le(unsigned char* data, size_t size)
{
    unsigned short word;
    unsigned int sum = 0;
    unsigned int i;

    for(i = 0; i < size - 1; i += 2)
        sum += *((unsigned short*) (data + i));

    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);

    return ~sum;
}

50,000 function calls over the same 10k data: ~1.1 seconds.

64-bit accumulator and 32-bit sums

unsigned short
cksum_32_le(unsigned char* data, size_t size)
{
    unsigned long word;
    unsigned long sum = 0;
    unsigned int i;

    for(i = 0; i < size - 3; i += 4)
        sum += *((unsigned int*) (data + i));

    sum = (sum & 0xffffffff) + (sum >> 32);
    sum = (sum & 0xffffffff) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);

    return ~sum;
}

50,000 function calls over the same 10k data: ~2.2 seconds.

Update:

It seems that the issue is due to hardware. Running memory diags revealed occasional bus parity errors (not sure why this would affect the 32-bit version more than the 16-bit version, but there you go). The code runs as expected on other servers. Will delete the question within the next few hours (being hardware related, it is not particularly useful anymore).

Final update:

Interestingly, replacing the for loops with while loops and compiling with O3 optimisation (shown below for the 64-bit accumulator case) gets both the 32-bit and 64-bit accumulator cases to run at identical speeds. This is because compiler performs some loop unrolling (oddly, it does not unroll the for loop) and performs the summing using mmx registers...

uint64_t sum = 0;
const uint32_t* dptr = (const uint32_t*) data;

while (size > 3)
{
    sum += (uint32_t) *dptr++;
    size -= 4;
}

question from:https://stackoverflow.com/questions/21654207/c-64-bit-loop-performance-on-x86

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

144 views

1 Answer

深蓝 · Answer 1 · 2021-10-06T03:33:32+0000

I had a similar issue like this before; I couldn't find any issues in either of our code. BUT what worked for me was changing the compiler.

My guess is that GCC is writing deprecated Assembly.

If you can decompile your application we could shed more light on the issue but there just isn't enough information to go on here.

When I decompiled my code what I found is that it was rewriting an entire method several times. but that might just be for me.

Hope this helped you, there is little to no information about this anywhere.

If I had to guess I would agree with Learner, I am pretty certain that the decompiled code would point to the for loop. I am very interested with this issue so please comment back.

Categories

C 64-bit loop performance on x86

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags