Don’t try to outsmart the compiler
The other day at work there was a need to flip an image vertically, in a way that did not bring large portions of other code that deals with images. Flipping vertically is easy:
for( int y = 0; y < height/2; ++y ) {
memswap( img+y*width, img+(height-y-1)*width, width*img(arr[0]) );
}
memswap function was done this way:
// why isnt this in the std lib?
// using XOR to avoid tmp var
void memswap( void* m1, void* m2, size_t n )
{
char *p = (char*)m1; char *q = (char*)m2;
while ( n-- ) {
*p ^= *q; *q ^= *p; *p ^= *q;
p++; q++;
}
}
The comment above the function was what triggered my interest. I just added:
// because it can be slower (local variable is likely in register;
// whereas using XOR involves reads/writes to memory)
But then I got interested in this, I just had to check what happens in one or another case.
Using Apple's gcc 4.0.1 on Core 2 Duo, the above memory swapping code takes about 12.5 clock cycles per swapped image pixel (pixel = 4 bytes). The inner loop is this:
movzx eax,BYTE PTR [edx-0x1] xor al,BYTE PTR [ecx-0x1] mov BYTE PTR [edx-0x1],al xor al,BYTE PTR [ecx-0x1] mov BYTE PTR [ecx-0x1],al xor BYTE PTR [edx-0x1],al dec ebx inc edx inc ecx cmp ebx,0xffffffff jne loopstart
So the loop is three memory reads, three writes and some increments of the pointers / loop counter. Visual C++ 2008 compiles it very similarly, just uses more complex addressing mode to save one loop counter:
movzx edx,byte ptr [ecx+eax] xor byte ptr [eax],dl mov dl,byte ptr [eax] xor byte ptr [ecx+eax],dl mov dl,byte ptr [ecx+eax] xor byte ptr [eax],dl dec esi inc eax test esi,esi jne loopstart
What if we don't do this "XOR trick", and just swap the contents using a temporary variable?
// ... char t = *p; *p = *q; *q = t; // ...
Lo and behold, now it runs at 7 cycles / pixel (almost twice as fast), and the inner loop is two memory reads and two writes:
movzx edx,BYTE PTR [ebx-0x1] movzx eax,BYTE PTR [ecx-0x1] mov BYTE PTR [ebx-0x1],al mov BYTE PTR [ecx-0x1],dl // ... incrementing pointers / counter here, like in previous case
So yeah. The XOR trick is pretty much useless here - it's twice as slow. Hey, it can even be slower as images get larger - if tested on a 2048x2048 image, regular swap still takes 7 cycles/pixel, but XOR trick takes 55 cycles/pixel!
I guess XOR trick is useful only in quite rare situations, for example when you're inside of some inner loop and want to swap register values without spilling them to memory or using an additional register. Heh, Wikipedia has info on this, so I'm not saying anything new :)
Now of course, if we happen to know that our pixels are 32 bits in size, there's no good reason to keep the loop in bytes. We can operate on integers instead:
void memswapI( void* m1, void* m2, size_t n )
{
size_t nn = n/sizeof(int);
int *p = (int*)m1; int *q = (int*)m2;
while ( nn-- ) {
int t = *p; *p = *q; *q = t;
p++; q++;
}
}
This runs at 1.5 cycles/pixel (XOR variant at 2.5 cycles/pixel). The assembly is pretty much the same, just with 32 bit registers.
Another option? If you use STL, just use:
std::swap_ranges(p, p+n, q);
on the pixel datatype. On 32 bit pixels, this also runs at 1.5 cycles/pixel.
So yeah. Don't try to outsmart the compiler without measuring it.

This seems the typical scenario to use restrict. Did you try that? Just out of curiosity, because it’s never going to be faster.
XOR version is really evil. “*p ^= *q; *q ^= *p; *p ^= *q;” has 3 dependencies on previous results. That should be a nightmare for in-order CPUs.
@hcpizzi: well, I don’t see how __restrict would improve the loop in any significant way. To swap two bytes in memory, one still needs to read them and write them back (two reads and two writes).
@ReJ: yeah, I was thinking the same as well. XOR swap is slow on “regular” desktop platforms, and on some others it could be even worse.
I was talking about the xor version. It should end up being just two reads and two writes as well, plus the xors, that’s why I said that it could never be faster. Just curiosity as I said before.
@hcpizzi: well, in that case the compiler is not that smart. Tried adding __restrict in VS2008, the XOR test still does three memory reads and three XORs into memory. The linker even collapsed both versions (with __restrict and without) into a single function.
The xor trick is nifty, in an abstract sort of way, but it has basically never been useful. Not even in the 1970s. The use of it today is 99% an indication of cluelessness more than anything else!
@christer: yeah. I can think of perhaps one situation where it could be used. Like you’re in some inner loop, and for some reason you need to swap contents of two registers, and you don’t have any spare register left, and the CPU does not have fast exchange instruction. Yeah, probably you have to wait for centuries for this situation to appear… in other words, no, not useful in practice.
At the same time, I’d advise not to trust STL blindly without measuring, either. Some STL containers (multi_map being my favorite) either generate atrocius code, or, if you crank up optimizations sufficiently high, generate decent code but take forever to compile and bloat the namespace for linking…
My money’s on memswapI in the sense that it is more predictable performance-wise.