I needed an Internet Checksum function (one's complement checksum) for some IPv4 ICMP processing code using raw sockets and I've stumbled on behaviour I can't explain on a 64-bit Intel processor (using gcc 4.8.2). I was wondering if anybody could shed some light on it.
I implemented a first checksum function using a 32-bit accumulator and performing 16-bit sums. Then I implemented the same using a 64-bit accumulator and 32-bit sums, thinking that less sums would result in faster execution. The result is that the first implementation runs twice as fast as the second (with O3 optimisation). I simply can't figure out why...
The code below doesn't actually perform accurate checksums (I've simplified it) but illustrates the problem. Both compiled as 64-bit running on 64-bit native platform (LP64: short 16-bit, int 32-bit, long 64-bit, pointers 64-bit).
32-bit accumulator and 16-bit sums
unsigned short cksum_16_le(unsigned char* data, size_t size) { unsigned short word; unsigned int sum = 0; unsigned int i; for(i = 0; i < size - 1; i += 2) sum += *((unsigned short*) (data + i)); sum = (sum & 0xffff) + (sum >> 16); sum = (sum & 0xffff) + (sum >> 16); return ~sum; }
50,000 function calls over the same 10k data: ~1.1 seconds.
64-bit accumulator and 32-bit sums
unsigned short cksum_32_le(unsigned char* data, size_t size) { unsigned long word; unsigned long sum = 0; unsigned int i; for(i = 0; i < size - 3; i += 4) sum += *((unsigned int*) (data + i)); sum = (sum & 0xffffffff) + (sum >> 32); sum = (sum & 0xffffffff) + (sum >> 32); sum = (sum & 0xffff) + (sum >> 16); sum = (sum & 0xffff) + (sum >> 16); return ~sum; }
50,000 function calls over the same 10k data: ~2.2 seconds.
Update:
It seems that the issue is due to hardware. Running memory diags revealed occasional bus parity errors (not sure why this would affect the 32-bit version more than the 16-bit version, but there you go). The code runs as expected on other servers. Will delete the question within the next few hours (being hardware related, it is not particularly useful anymore).
Final update:
Interestingly, replacing the for
loops with while
loops and compiling with O3 optimisation (shown below for the 64-bit accumulator case) gets both the 32-bit and 64-bit accumulator cases to run at identical speeds. This is because compiler performs some loop unrolling (oddly, it does not unroll the for
loop) and performs the summing using mmx
registers...
uint64_t sum = 0;
const uint32_t* dptr = (const uint32_t*) data;
while (size > 3)
{
sum += (uint32_t) *dptr++;
size -= 4;
}
question from:https://stackoverflow.com/questions/21654207/c-64-bit-loop-performance-on-x86