Templated bit twiddling

Sometimes it’s nice to combine C++ template metaprogramming with some bit twiddling, here’s a classic parallelized bit counting algorithm with all the magic numbers constructed as part of template instantiation:

```#include <stdio.h>
#include <stdlib.h>

template <typename T, int k, T m>
struct A
{
static T f(T n)
{
static const int k1 = k/2;
n = A<T, k1, m^(m<<k1)>::f(n);
return ((n>>k)&m) + (n&m);
}
};

template <typename T, T m>
struct A<T, 0, m>
{
static T f(T n) { return n; }
};

template<typename T>
static inline int bitcount (T n)
{
static const int k = 4 * sizeof(n);
return A<T, k, (T(1)<<k)-1>::f(n);
}

int main(int argc, char *argv[])
{
unsigned long n = strtoul(argv[1], NULL, 10);
printf("bitcount(%lu) = %d\n", n, bitcount(n));
}
```

We could extract the magic number construction out as a separate template, and there are some extra optimizations that this code doesn’t do, for example, the level 1 code can be simplified to:
``` i = i - ((i >> 1) & 0x55555555); ```
rather than what we get from the above:
``` i = ((i >> 1)&0x55555555) + (i&0x55555555); ```
Adding an appropriate partial template instantiation for `A<T,1,m>` is left as an exercise.

In fact, there are lots of shortcuts that can be made here, we aren’t really being competitive with Bit Twiddling Hacks:

```unsigned int v; // count bits set in this (32-bit value)
unsigned int c; // store the total here
static const int S[] = {1, 2, 4, 8, 16}; // Magic Binary Numbers
static const int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};

c = v - ((v >> 1) & B[0]);
c = ((c >> S[1]) & B[1]) + (c & B[1]);
c = ((c >> S[2]) + c) & B[2];
c = ((c >> S[3]) + c) & B[3];
c = ((c >> S[4]) + c) & B[4];
```

which would need 3 template specializations and we are starting to get a bit complicated…

In the same style then, here’s a bit reverse function, with a separate magic number calculating template:

```#include <stdio.h>
#include <stdlib.h>

template <typename T, int k>
struct Magic {
static const T m = ~T(0)/(T(1)+(T(1)<<k));
};

template <typename T, int k>
struct RevAux {
static inline T f(T n) {
T n1 = RevAux<T,k/2>::f(n);
return ((n1>>k) & Magic<T,k>::m) | ((n1 & Magic<T,k>::m)<<k);
}
};

template <typename T>
struct RevAux<T,0> {
static inline T f(T n) { return n; }
};

template <typename T>
T rev(T n) { return RevAux<T,4*sizeof(T)>::f(n); }

int main(int argc, char *argv[])
{
unsigned long n = strtoul(argv[1],NULL,16);
printf("%lx %lx\n", n, rev(n));
}
```

Now, how do we check that the template parameter T is unsigned (otherwise the right shifts are probably going to go wrong)…

A simple optimization is to replace the top level function with:

```template <typename T>
static inline T rev(T n) {
static const int k = 4*sizeof(T);
n = RevAux<T,k/2>::f(n);
return (n>>k)|(n<<k);
}
```

but GCC seems to have worked that out for itself (compiling gcc 4.4.3 , -O3 -S):

```.globl _Z7reversej
pushl   %ebp
movl    %esp, %ebp
movl    8(%ebp), %eax
popl    %ebp
movl    %eax, %edx
andl    \$1431655765, %edx
shrl    %eax
andl    \$1431655765, %eax
orl     %eax, %edx
movl    %edx, %eax
andl    \$858993459, %eax
shrl    \$2, %edx
andl    \$858993459, %edx
sall    \$2, %eax
orl     %edx, %eax
movl    %eax, %edx
andl    \$252645135, %edx
shrl    \$4, %eax
andl    \$252645135, %eax
sall    \$4, %edx
orl     %eax, %edx
movl    %edx, %eax
andl    \$16711935, %eax
shrl    \$8, %edx
sall    \$8, %eax
andl    \$16711935, %edx
orl     %edx, %eax
rorl    \$16, %eax
ret
```

This sort of thing isn’t going to work very well if the compiler isn’t up to scratch, but GCC seems to have done it right here.

There are slightly faster ways to reverse bits using ternary swap (swap top and bottom thirds, recurse), but they need to be more ad hoc unless you have a ternary computer with word length a power of 3, so less amenable to this sort of generic treatment (or at least, it becomes a lot harder).