Sometimes it’s nice to combine C++ template metaprogramming with some bit twiddling, here’s a classic parallelized bit counting algorithm with all the magic numbers constructed as part of template instantiation:

#include <stdio.h> #include <stdlib.h> template <typename T, int k, T m> struct A { static T f(T n) { static const int k1 = k/2; n = A<T, k1, m^(m<<k1)>::f(n); return ((n>>k)&m) + (n&m); } }; template <typename T, T m> struct A<T, 0, m> { static T f(T n) { return n; } }; template<typename T> static inline int bitcount (T n) { static const int k = 4 * sizeof(n); return A<T, k, (T(1)<<k)-1>::f(n); } int main(int argc, char *argv[]) { unsigned long n = strtoul(argv[1], NULL, 10); printf("bitcount(%lu) = %d\n", n, bitcount(n)); }

We could extract the magic number construction out as a separate template, and there are some extra optimizations that this code doesn’t do, for example, the level 1 code can be simplified to:

i = i - ((i >> 1) & 0x55555555);

rather than what we get from the above:

i = ((i >> 1)&0x55555555) + (i&0x55555555);

Adding an appropriate partial template instantiation for `A<T,1,m>`

is left as an exercise.

In fact, there are lots of shortcuts that can be made here, we aren’t really being competitive with Bit Twiddling Hacks:

unsigned int v; // count bits set in this (32-bit value) unsigned int c; // store the total here static const int S[] = {1, 2, 4, 8, 16}; // Magic Binary Numbers static const int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF}; c = v - ((v >> 1) & B[0]); c = ((c >> S[1]) & B[1]) + (c & B[1]); c = ((c >> S[2]) + c) & B[2]; c = ((c >> S[3]) + c) & B[3]; c = ((c >> S[4]) + c) & B[4];

which would need 3 template specializations and we are starting to get a bit complicated…

In the same style then, here’s a bit reverse function, with a separate magic number calculating template:

#include <stdio.h> #include <stdlib.h> template <typename T, int k> struct Magic { static const T m = ~T(0)/(T(1)+(T(1)<<k)); }; template <typename T, int k> struct RevAux { static inline T f(T n) { T n1 = RevAux<T,k/2>::f(n); return ((n1>>k) & Magic<T,k>::m) | ((n1 & Magic<T,k>::m)<<k); } }; template <typename T> struct RevAux<T,0> { static inline T f(T n) { return n; } }; template <typename T> T rev(T n) { return RevAux<T,4*sizeof(T)>::f(n); } int main(int argc, char *argv[]) { unsigned long n = strtoul(argv[1],NULL,16); printf("%lx %lx\n", n, rev(n)); }

Now, how do we check that the template parameter T is unsigned (otherwise the right shifts are probably going to go wrong)…

A simple optimization is to replace the top level function with:

template <typename T> static inline T rev(T n) { static const int k = 4*sizeof(T); n = RevAux<T,k/2>::f(n); return (n>>k)|(n<<k); }

but GCC seems to have worked that out for itself (compiling gcc 4.4.3 , -O3 -S):

.globl _Z7reversej pushl %ebp movl %esp, %ebp movl 8(%ebp), %eax popl %ebp movl %eax, %edx andl $1431655765, %edx shrl %eax addl %edx, %edx andl $1431655765, %eax orl %eax, %edx movl %edx, %eax andl $858993459, %eax shrl $2, %edx andl $858993459, %edx sall $2, %eax orl %edx, %eax movl %eax, %edx andl $252645135, %edx shrl $4, %eax andl $252645135, %eax sall $4, %edx orl %eax, %edx movl %edx, %eax andl $16711935, %eax shrl $8, %edx sall $8, %eax andl $16711935, %edx orl %edx, %eax rorl $16, %eax ret

This sort of thing isn’t going to work very well if the compiler isn’t up to scratch, but GCC seems to have done it right here.

There are slightly faster ways to reverse bits using ternary swap (swap top and bottom thirds, recurse), but they need to be more ad hoc unless you have a ternary computer with word length a power of 3, so less amenable to this sort of generic treatment (or at least, it becomes a lot harder).