Here’s a cute way of reversing a 64-bit word (there is a similar, slightly faster, but slightly more obscure method in TAOCP 4a, so if you really need to save the cycles, use that one).

We use ternary swaps: for example, to bit-reverse a 3-bit sequence, we can just swap the high and low bits. To bit-reverse a 9-bit segment, break into [3,3,3] and swap the top and bottom 3-bit sections, and then bit-reverse each 3-bit section separately.

More generally, we can reverse any bit sequence by breaking it up into [M,N,M], swapping top and bottom sections (by shifting up and down by M+N), and recursing.

So, to reverse a 63 bit number, since 63 = 3*3*7, we can nest two 3-swaps and a 7-swap. We can do the actual swapping of sections with another classy Knuth function, nicely expressible as a C++ template:

template <typename T, T m, int k> static inline T swapbits(T p) { T q = ((p>>k)^p)&m; return p^q^(q<<k); }

`m`

is a mask, `k`

a shift amount, the function swaps the bits at positions given by the set bits of `m`

and `m<<k`

(clearly these two should be disjoint, ie. `m & m<<k == 0`

)

All we need to do now is calculate the masks and the shift amounts.

First 3-swap; the binary mask is `...001001001`

and we need to shift the mask up by 2. We can see that `m1 + m1<<1 + m1<<2 = 2^63-1`

, so we can calculate the mask with a (compile-time) division.

Second 3-swap: the binary mask is `...000000111000000111`

and we need to shift the mask up by 6. Once again, we can easily compute the correct mask.

The 7-swap (ie. reverse 7 sections of 9 bits), we do in two stages, do a 3-swap for the top and bottom sections, so the mask is `111111111 + 111111111<<36`

, finally we swap the top and bottom 27 bits, so the mask is just 2^27-1, and the shift is 36.

This reverses the bottom 63 bits, a simple rotate by one then puts everything into the right place.

uint64_t bitreverse (uint64_t n) { static const uint64_t m1 = ((uint64_t(1)<<63)-1)/(1+(1<<1)+(1<<2)); static const uint64_t m2 = ((uint64_t(1)<<63)-1)/(1+(1<<3)+(1<<6)); static const uint64_t m3 = ((uint64_t(1)<<9)-1)+(((uint64_t(1)<<9)-1)<<36); static const uint64_t m4 = (uint64_t(1)<<27)-1; n = swapbits<uint64_t, m1, 2>(n); n = swapbits<uint64_t, m2, 6>(n); n = swapbits<uint64_t, m3, 18>(n); n = swapbits<uint64_t, m4, 36>(n); n = (n >> 63) | (n << 1); return n; }

Here is what gcc makes of that:

_Z10bitreversey: movq %rdi, %rdx movabsq $1317624576693539401, %rax movabsq $126347562148695559, %rcx shrq $2, %rdx xorq %rdi, %rdx andq %rax, %rdx movq %rdx, %rax salq $2, %rdx xorq %rdi, %rax xorq %rdx, %rax movq %rax, %rdx shrq $6, %rdx xorq %rax, %rdx andq %rcx, %rdx movabsq $35115652612607, %rcx xorq %rdx, %rax salq $6, %rdx xorq %rdx, %rax movq %rax, %rdx shrq $18, %rdx xorq %rax, %rdx andq %rcx, %rdx xorq %rdx, %rax salq $18, %rdx xorq %rdx, %rax movq %rax, %rdx shrq $36, %rdx xorq %rax, %rdx andl $134217727, %edx xorq %rdx, %rax salq $36, %rdx xorq %rdx, %rax rorq $63, %rax ret

For comparison, here is Knuth’s 64-bit reverse (I’ve just hard-coded the constants this time). It’s based on a 32-bit reverse that breaks one 17-bit segment into [7,3,7] and the remaining 15-bit segment into [3,7,3] – we can do both swaps with the same shift of 10. First step is to swap adjacent bits which can be done slightly faster than a general swap. Very cunning:

uint64_t kbitreverse (uint64_t n) { static const uint64_t m0 = 0x5555555555555555LLU; static const uint64_t m1 = 0x0300c0303030c303LLU; static const uint64_t m2 = 0x00c0300c03f0003fLLU; static const uint64_t m3 = 0x00000ffc00003fffLLU; n = ((n>>1)&m0) | (n&m0)<<1; n = swapbits<uint64_t, m1, 4>(n); n = swapbits<uint64_t, m2, 8>(n); n = swapbits<uint64_t, m3, 20>(n); n = (n >> 34) | (n << 30); return n; }

and the corresponding compiler output:

_Z11kbitreversey: movq %rdi, %rdx movabsq $6148914691236517205, %rax movabsq $216384095313249027, %rcx shrq %rdx andq %rax, %rdx andq %rdi, %rax addq %rax, %rax orq %rdx, %rax movq %rax, %rdx shrq $4, %rdx xorq %rax, %rdx andq %rcx, %rdx movabsq $54096023692247103, %rcx xorq %rdx, %rax salq $4, %rdx xorq %rdx, %rax movq %rax, %rdx shrq $8, %rdx xorq %rax, %rdx andq %rcx, %rdx movabsq $17575006191615, %rcx xorq %rdx, %rax salq $8, %rdx xorq %rdx, %rax movq %rax, %rdx shrq $20, %rdx xorq %rax, %rdx andq %rcx, %rdx xorq %rdx, %rax salq $20, %rdx xorq %rdx, %rax rorq $34, %rax ret

Knuth wins here by 1 instruction! Oh well, maybe I’ll have better luck next time…