Reversing a 64-bit word

Here’s a cute way of reversing a 64-bit word (there is a similar, slightly faster, but slightly more obscure method in TAOCP 4a, so if you really need to save the cycles, use that one).

We use ternary swaps: for example, to bit-reverse a 3-bit sequence, we can just swap the high and low bits. To bit-reverse a 9-bit segment, break into [3,3,3] and swap the top and bottom 3-bit sections, and then bit-reverse each 3-bit section separately.

More generally, we can reverse any bit sequence by breaking it up into [M,N,M], swapping top and bottom sections (by shifting up and down by M+N), and recursing.

So, to reverse a 63 bit number, since 63 = 3*3*7, we can nest two 3-swaps and a 7-swap. We can do the actual swapping of sections with another classy Knuth function, nicely expressible as a C++ template:

template <typename T, T m, int k>
static inline T swapbits(T p) {
  T q = ((p>>k)^p)&m;
  return p^q^(q<<k);
}

m is a mask, k a shift amount, the function swaps the bits at positions given by the set bits of m and m<<k (clearly these two should be disjoint, ie. m & m<<k == 0)

All we need to do now is calculate the masks and the shift amounts.

First 3-swap; the binary mask is ...001001001 and we need to shift the mask up by 2. We can see that m1 + m1<<1 + m1<<2 = 2^63-1, so we can calculate the mask with a (compile-time) division.

Second 3-swap: the binary mask is ...000000111000000111 and we need to shift the mask up by 6. Once again, we can easily compute the correct mask.

The 7-swap (ie. reverse 7 sections of 9 bits), we do in two stages, do a 3-swap for the top and bottom sections, so the mask is 111111111 + 111111111<<36, finally we swap the top and bottom 27 bits, so the mask is just 2^27-1, and the shift is 36.

This reverses the bottom 63 bits, a simple rotate by one then puts everything into the right place.

uint64_t bitreverse (uint64_t n)
{
  static const uint64_t m1 = ((uint64_t(1)<<63)-1)/(1+(1<<1)+(1<<2));
  static const uint64_t m2 = ((uint64_t(1)<<63)-1)/(1+(1<<3)+(1<<6));
  static const uint64_t m3 = ((uint64_t(1)<<9)-1)+(((uint64_t(1)<<9)-1)<<36);
  static const uint64_t m4 = (uint64_t(1)<<27)-1;
  n = swapbits<uint64_t, m1, 2>(n);
  n = swapbits<uint64_t, m2, 6>(n);
  n = swapbits<uint64_t, m3, 18>(n);
  n = swapbits<uint64_t, m4, 36>(n);
  n = (n >> 63) | (n << 1);
  return n;
}

Here is what gcc makes of that:

_Z10bitreversey:
	movq	%rdi, %rdx
	movabsq	$1317624576693539401, %rax
	movabsq	$126347562148695559, %rcx
	shrq	$2, %rdx
	xorq	%rdi, %rdx
	andq	%rax, %rdx
	movq	%rdx, %rax
	salq	$2, %rdx
	xorq	%rdi, %rax
	xorq	%rdx, %rax
	movq	%rax, %rdx
	shrq	$6, %rdx
	xorq	%rax, %rdx
	andq	%rcx, %rdx
	movabsq	$35115652612607, %rcx
	xorq	%rdx, %rax
	salq	$6, %rdx
	xorq	%rdx, %rax
	movq	%rax, %rdx
	shrq	$18, %rdx
	xorq	%rax, %rdx
	andq	%rcx, %rdx
	xorq	%rdx, %rax
	salq	$18, %rdx
	xorq	%rdx, %rax
	movq	%rax, %rdx
	shrq	$36, %rdx
	xorq	%rax, %rdx
	andl	$134217727, %edx
	xorq	%rdx, %rax
	salq	$36, %rdx
	xorq	%rdx, %rax
	rorq	$63, %rax
	ret

For comparison, here is Knuth’s 64-bit reverse (I’ve just hard-coded the constants this time). It’s based on a 32-bit reverse that breaks one 17-bit segment into [7,3,7] and the remaining 15-bit segment into [3,7,3] – we can do both swaps with the same shift of 10. First step is to swap adjacent bits which can be done slightly faster than a general swap. Very cunning:

uint64_t kbitreverse (uint64_t n)
{
  static const uint64_t m0 = 0x5555555555555555LLU;
  static const uint64_t m1 = 0x0300c0303030c303LLU;
  static const uint64_t m2 = 0x00c0300c03f0003fLLU;
  static const uint64_t m3 = 0x00000ffc00003fffLLU;
  n = ((n>>1)&m0) | (n&m0)<<1;
  n = swapbits<uint64_t, m1, 4>(n);
  n = swapbits<uint64_t, m2, 8>(n);
  n = swapbits<uint64_t, m3, 20>(n);
  n = (n >> 34) | (n << 30);
  return n;
}

and the corresponding compiler output:

_Z11kbitreversey:
	movq	%rdi, %rdx
	movabsq	$6148914691236517205, %rax
	movabsq	$216384095313249027, %rcx
	shrq	%rdx
	andq	%rax, %rdx
	andq	%rdi, %rax
	addq	%rax, %rax
	orq	%rdx, %rax
	movq	%rax, %rdx
	shrq	$4, %rdx
	xorq	%rax, %rdx
	andq	%rcx, %rdx
	movabsq	$54096023692247103, %rcx
	xorq	%rdx, %rax
	salq	$4, %rdx
	xorq	%rdx, %rax
	movq	%rax, %rdx
	shrq	$8, %rdx
	xorq	%rax, %rdx
	andq	%rcx, %rdx
	movabsq	$17575006191615, %rcx
	xorq	%rdx, %rax
	salq	$8, %rdx
	xorq	%rdx, %rax
	movq	%rax, %rdx
	shrq	$20, %rdx
	xorq	%rax, %rdx
	andq	%rcx, %rdx
	xorq	%rdx, %rax
	salq	$20, %rdx
	xorq	%rdx, %rax
	rorq	$34, %rax
	ret

Knuth wins here by 1 instruction! Oh well, maybe I’ll have better luck next time…

Advertisements


Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s