Using the Cycle Counter Registers on the Raspberry Pi 3

ARM processors support various performance monitoring registers, the most basic being a cycle count register. This is how to make use of it on the Raspberry Pi 3 with its ARM Cortex-A53 processor. The A53 implements the ARMv8 architecture which can operate in both 64- and 32-bit modes, the Pi 3 uses the 32-bit AArch32 mode, which is more or less backwards compatible with the ARMv7-A architecture, as implemented for example by the Cortex-A7 (used in the early Pi 2’s) and Cortex-A8. I hope I’ve got that right, all these names are confusing

The performance counters are made available through coprocessor registers and the mrc and mcr instructions, the precise registers used depending on the particular architecture.

By default, use of these instructions is only possible in “privileged” mode, ie. from the kernel, so the first thing we need to do is to enable register access from userspace. This can be done through a simple kernel module that can also set up the cycle counter parameters needed (we could do this from userspace after the kernel module has enabled access, but it’s simpler to do everything at once).

To compile a kernel module, you need a set of header files compatible with the kernel you are running. Fortunately, if you have installed a kernel with the raspberrypi-kernel package, the corresponding headers should be in raspberrypi-kernel-headers – if you have used rpi-update, you may need to do something else to get the right headers, and of course if you have built your own kernel, you should use the headers from there. So:

$ sudo apt-get install raspberrypi-kernel
$ sudo apt-get install raspberrypi-kernel-headers

Our Makefile is just:

obj-m += enable_ccr.o

all:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

and the kernel module source is:

#include <linux/module.h>
#include <linux/kernel.h>

void enable_ccr(void *info) {
  // Set the User Enable register, bit 0
  asm volatile ("mcr p15, 0, %0, c9, c14, 0" :: "r" (1));
  // Enable all counters in the PNMC control-register
  asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r"(1));
  // Enable cycle counter specifically
  // bit 31: enable cycle counter
  // bits 0-3: enable performance counters 0-3
  asm volatile ("MCR p15, 0, %0, c9, c12, 1\t\n" :: "r"(0x80000000));
}

int init_module(void) {
  // Each cpu has its own set of registers
  on_each_cpu(enable_ccr,NULL,0);
  printk (KERN_INFO "Userspace access to CCR enabled\n");
  return 0;
}

void cleanup_module(void) {
}

To build the module, just use make:

$ make

and if all goes well, the module itself should be built as enable_ccr.ko

Install it:

$ sudo insmod enable_ccr.ko

$ dmesg | tail

should show something like:

...
[ 430.244803] enable_ccr: loading out-of-tree module taints kernel.
[ 430.244820] enable_ccr: module license 'unspecified' taints kernel.
[ 430.244824] Disabling lock debugging due to kernel taint
[ 430.245300] User-level access to CCR has been turned on
...

It should go without saying that making your own kernel modules & allowing normally forbidden access from userspace may result in all sorts of potential vulnerabilities that you should be wary of).

Now we can use the cycle counters in user code:

#include <stdio.h>
#include <stdint.h>

static inline uint32_t ccnt_read (void)
{
  uint32_t cc = 0;
  __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0":"=r" (cc));
  return cc;
}

int main() {
  uint32_t t0 = ccnt_read();
  uint32_t t1 = ccnt_read();       
  printf("%u\n", t1-t0);
  volatile uint64_t n = 100000000;
  while(n > 0) n--;
  t1 = ccnt_read();
  printf("%u\n", t1-t0);
}

We use a volatile loop counter so the loop isn’t optimized away completely.

Using taskset to keep the process on one CPU:

$ gcc -Wall -O3 cycles.c -o cycles
pi@pi3:~/raspbian-ccr$ time taskset 0x1 ./cycles
1
805314304

real 0m0.712s
user 0m0.700s
sys 0m0.010s

Looks like we can count a single cycle and since the Pi 3 has a 1.2GHz clock the loop time looks about right (the clock seems to be scaled if the processor is idle so we don’t necessarily get a full 1.2 billion cycles per second – for example, if we replace the loop above with a sleep).

References:

ARMv8 coprocessor registers:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Bgbjjhaj.html

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/Bgbdeggf.html

A useful forum discussion (which includes some details on accessing other performance counters):

https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=181490

Cycle counters on the Pi 1:

High-Resolution Timing on the Raspberry Pi


At The Bakery

After the topical excitement of the last couple of posts, let’s look at an all-time great – Leslie Lamport’s Bakery Algorithm (and of course this is still topical; Lamport is the most recent winner of the Turing Award).

The problem is mutual exclusion without mutual exclusion primitives. Usually, it’s described in the context of a shared memory system (and that is what we will implement here), but will work equally well in a message-passing system with only local state (each thread or process only needs to write to its own part of the store).

For further details, and Lamport’s later thoughts see http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#bakery: “For a couple of years after my discovery of the bakery algorithm, everything I learned about concurrency came from studying it.” – and since Lamport understands more about concurrency than just about anyone on the planet, it’s maybe worth spending some time looking at it ourselves.

I’m not going to attempt to prove the algorithm correct, I’ll leave that to Lamport, but the crucial idea seems to me to be that a thread reading a particular value from another thread is a synchronization signal from that thread – here, reading a false value for the entering variable is a signal that the other thread isn’t in the process of deciding on it’s own number, therefore it is safe for the reading process to proceed.

Implementing on a real multiprocessor system, we find that use of memory barriers or synchronization primitives is essential – the algorithm requires that reads and writes are serialized in the sense that once a value is written, other processes won’t see an earlier value (or earlier values of other variables). This doesn’t conflict with what Lamport says about not requiring low-level atomicity – we can allow reads and writes to happen simultaneously, with the possibility of a read returning a bogus value – and in fact we can simulate this in the program by writing a random value just before a process selects its real ticket number, but once a write has completed, all processes should see the new value.

Another essential feature is the volatile flag – as many have pointed out, this isn’t enough by itself for correct thread synchronization, but for shared memory systems, prevents the compiler from making invalid assumptions about consistency of reads from shared variables.

A final point – correctness requires that ticket numbers can increase without bound, this is hard to arrange in practice, so we just assert if they grow too large (this rarely happens in reality, unless we get carried away with our randomization).

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>

// Compile with: g++ -Wall -O3 bakery.cpp -pthread -o bakery

static const int NTHREADS = 4;

// Some features to play with
//#define NCHECK     // Disable crucial check
//#define NSYNC      // Disable memory barrier
//#define NVOLATILE  // Disable volatile

#if defined NVOLATILE
#define VOLATILE
#else
#define VOLATILE volatile
#endif

VOLATILE bool entering[NTHREADS];
VOLATILE unsigned number[NTHREADS];
VOLATILE int count = 0;
VOLATILE int total = 0;

unsigned getmax(int n)
{
  unsigned max = 0;
  for (int i = 0; i < n; i++) {
    if (number[i] > max) max = number[i];
  }
  return max;
}

bool check(int i, int j)
{
  return number[j] < number[i] || 
         (number[j] == number[i] && j < i);
}

inline void synchronize()
{
#if !defined NSYNC
    // gcc builtin full memory barrier
  __sync_synchronize();
#endif
}

void lock(int i) {
  entering[i] = true;
  synchronize();
  // Simulate non-atomic write
  number[i] = rand();
  synchronize();
  number[i] = 1 + getmax(NTHREADS);
  assert(number[i] > 0);
  entering[i] = false;
  synchronize();
  for (int j = 0; j < NTHREADS; j++) {
    // Wait until thread j receives its number:
#if !defined NCHECK
    while (entering[j]) { /* nothing */ }
#endif
    // At this point, we have read a false value for 
    // "entering[j]", therefore any number picked by j
    // later will takes our choice into account, any value
    // chosen earlier (and so might be less than ours) 
    // will be visible to us in the following test.

    // Wait until all threads with smaller numbers or with 
    // the same number, but with higher priority, finish 
    // their work:
    while ((number[j] != 0) && check(i,j)) { /* nothing */ }
  }
}

void unlock(int i)
{
  number[i] = 0;
}

void *threadfun(void *arg) 
{
  int i = *(int*)(arg);
  while (true) {
    lock(i);
    total++;
    if (total % 1000000 == 0) fprintf(stderr,"%c", 'a'+i);
    assert(count==0); // Check we have exclusive access
    count++;
    // It's not clear that these synchs are unnecessary,
    // but nothing seems to break if I remove them.
    //synchronize();
    count--;
    //synchronize();
    unlock(i);
    // non-critical section...
  }
  return NULL;
}

int main()
{
  pthread_t t[NTHREADS];
  int n[NTHREADS];
  for (int i = 0; i < NTHREADS; i++) {
    n[i] = i;
    pthread_create(&t[i], NULL, threadfun, (void*)&n[i]);
  }
  for (int i = 0; i < NTHREADS; i++) {
    pthread_join(t[i], NULL);
  }
}

Loopless Ruler

This function (based on Algorithm L for loopless Gray code generation in TAOCP 7.2.1.1):

void ruler(int n)
{
  int f[n+1];
  for (int j = 0; j <= n; j++) {
    f[j] = j;
  }
  while(true) {
    visit(f,n+1);
    int j = f[0];
    f[0] = 0;
    if (j == n) break;
    f[j] = f[j+1];
    f[j+1] = j+1;
  }
}

enables us to compute the “ruler function”, ρ for 1, 2, 3…:

0 1 0 2 0 1 0 3 0 1 0 2 0 1 0...

It’s a “loopless” algorithm and uses f, an array of “focus pointers” to keep track of what’s going on.

The ruler function ρ(n) is essentially the number of trailing 0’s in the binary representation of n.

This is what’s printed by ruler(4):

0 1 2 3 4 
1 1 2 3 4 
0 2 2 3 4 
2 1 2 3 4 
0 1 3 3 4 
1 1 3 3 4 
0 3 2 3 4 
3 1 2 3 4 
0 1 2 4 4 
1 1 2 4 4 
0 2 2 4 4 
2 1 2 4 4 
0 1 4 3 4 
1 1 4 3 4 
0 4 2 3 4 
4 1 2 3 4 

The left hand side is the ruler function of course, what are the other values?

It’s clearer if we print f in reverse and match up with binary representations:

00000
43210
00001
43211
00010
43220
00011
43212
00100
43310
00101
43311
00110
43230
00111
43213
01000
44210
01001
44211
01010
44220
01011
44212
01100
43410
01101
43411
01110
43240
01111
43214

If n[j] is the jth binary digit of n, then f[j] is just j, unless n[j] is the first of a block of 1’s, in which case f[j] is the index of the first 0 after that block.

For example, 01101, has block starting at 0 and ending at 1, so f[0] = 1, and another block starting at 2, ending at 4, so f[2] = 4; for other j, f[j] = j, so the focus pointers are 43411.

If n is of the form:

j   k
01110111

with f[0] = k, f[k] = k, f[k+1] = j

0n incrementing we get:

j   k
01111000

Now f[0] = 0, f[k] = j and f[k+1] = k+1, other values of f are the same

If k is 0, then we go from:

j   k
01110

with f[0] = 0, f[1] = j

to:

j   k
01111

with f[0] = 0, f[1] = 1

Either way, we can see that the inner loop of ruler is correct.


Ringing the Changes

Apart from a brief visit many years ago to Macclesfield church with my friend Jo, I have never taken part in the ancient English art of bell-ringing, but permutation generation has always seemed a fascinating topic.

In TAOCP 7.2.1.2, Knuth has Algorithm P (“plain changes”) for generating all permutations using adjacent interchanges:

void taocp1(int *a, int n)
{
  int c[n+1];
  int o[n+1];
  for (int j = 1; j <= n; j++) {
    c[j] = 0; o[j] = 1;
  }
 P2:
  visit(a,n);
 P3:
  int j = n; int s = 0;
 P4:
  int q = c[j] + o[j];
  if (q < 0) goto P7;
  if (q == j) goto P6;
 P5:
  swap(a[j-c[j]+s],a[j-q+s]);
  c[j] = q;
  goto P2;
 P6:
  if (j == 1) return;
  s = s+1;
 P7:
  o[j] = -o[j];
  j = j-1;
  goto P4;
}

It works just fine, but I find it a little hard to follow and let’s face it, we just don’t write code like that anymore, with all those gotos, 1-based arrays, etc.

0-basing the arrays isn’t too much of a problem, for eliminating the gotos we could take the functional approach:

void taocp2(int *a, int n)
{
  int c[n];
  int o[n];
  for (int j = 0; j < n; j++) {
    c[j] = 0; o[j] = 1;
  }
  P2(a,c,o,n);
}

void P2(int *a, int *c, int *o, int n) {
  visit(a,n);
  P3(a,c,o,n);
}

void P3(int *a, int *c, int *o, int n) {
  P4(a,c,o,n,n-1,0);
}

void P4(int *a, int *c, int *o, int n, int j, int s) {
  int q = c[j] + o[j];
  if (q < 0) P7(a,c,o,n,j,s);
  else if (q == j+1) P6(a,c,o,n,j,s);
  else P5(a,c,o,n,j,s,q);
}

void P5(int *a, int *c, int *o, int n, int j, int s, int q) {
  swap(a[j-c[j]+s],a[j-q+s]);
  c[j] = q;
  P2(a,c,o,n);
}

void P6(int *a, int *c, int *o, int n, int j, int s) {
  if (j != 1) {
    P7(a,c,o,n,j,s+1);
  }
}

void P7(int *a, int *c, int *o, int n, int j, int s) {
  o[j] = -o[j];
  P4(a,c,o,n,j-1,s);
}

but somehow that seems missing the point (I don’t think that the worst thing about that code is that it still uses modifiable state).

There are actually two component parts to the algorithm: generating a “mixed radix reflected Gray code” which defines the progressive inversions of the array elements, and going from the inversion changes to the actual indexes of the swapped objects.

For generating the Gray codes, with a bit of rewriting, 0-basing our arrays etc. we get:

void gray(int n)
{
  int j, c[n], o[n];
  for (int j = 0; j < n; j++) {
    c[j] = 0; o[j] = 1;
  }
  do {
    visit(c,n);
    for (j = n-1; j > 0; j--) {
      int q = c[j] + o[j];
      if (q < 0) {
        o[j] = -o[j];
      } else if (q == j+1) {
        o[j] = -o[j];
      } else {
        c[j] = q;
        break;
      }
    } 
  } while (j != 0);
}

Each time around, find the highest element whose inversion count can be changed in the desired direction. For elements whose inversion count can’t be changed, change direction. If no element can be changed we are done.

The next step is to calculate the position of element j – but this is just j less the number of elements less than j that appear after j (ie. the number of inversions, c[j]), plus the number of elements greater than j that appear before j – but this is just the number of elements we have been passed over in the “if (q == j+1)” step above, so we can now add in the rest of algorithm P:

void taocp3(int *a, int n)
{
  int c[n];
  int o[n];
  for (int j = 0; j < n; j++) {
    c[j] = 0; o[j] = 1;
  }
  int j;
  do {
    visit(a,n);
    int s = 0;
    for (j = n-1; j > 0; j--) {
      int q = c[j] + o[j];
      if (q < 0) {
        o[j] = -o[j];
      } else if (q == j+1) {
        o[j] = -o[j];
        s++;          // This element will be before element j
      } else {
        swap(a[j-c[j]+s],a[j-q+s]);
        c[j] = q;
        break;
      }
    } 
  } while (j != 0);
}

Knuth’s algorithm essentially Dijkstra’s:

Click to access EWD553.PDF

which explains it very lucidly: there is a 1-1 correspondence between inversion counts & permutations – and a Gray enumeration of inversion gives us a sequence of permutations where only 1 element at a time changes its inversion count, and only by 1 or -1, which can only be if we exchange adjacent elements.

Knuth also gives a “loopless” algorithm for generating reflected Gray sequences, and we could use a table of element positions to construct the permutations from this:

void gray2(int *m, int n)
{
  int a[n],f[n+1],o[n];
  for (int j = 0; j < n; j++) {
    a[j] = 0; f[j] = j; o[j] = 1;
  }
  f[n] = n;
  while (true) {
    visit(a,n);
    int j = f[0];
    f[0] = 0;
    if (j == n) break;
    a[j] += o[j];
    if (a[j] == 0 || a[j] == m[j]-1) {
      o[j] = -o[j];
      f[j] = f[j+1];
      f[j+1] = j+1;
    }
  }
}

The “classic” Johnson-Trotter algorithm for plain changes involves consideration of “mobile” elements: each element has a current direction and it is “mobile” if if it greater than the next adjacent element (if there is one) in that direction. The algorithm proceeds by finding the highest mobile element and moving it accordingly:

void jt(int *a, int n)
{
  int o[n]; // Direction of element i, 1 = left, 0 = right
  int c[n]; // Position of element i
  for (int j = 0; j < n; j++) {
    o[j] = 1; c[j] = j;
  }
  int j; // Will be set to the highest mobile element
  do {
    visit(a,n);
    // Search high to low. 0 is never mobile
    for (j = n-1; j > 0; j--) {
      int p = c[j]; // Position of element j
      if (o[j]) { // Going left
        int k = a[p-1];
        if (p > 0 && k < j) {
          // Swap and adjust positions
          a[p] = k; c[k] = p;
          a[p-1] = j; c[j] = p-1; 
          break;
        }
      } else { // Going right
        int k = a[p+1];
        if (p < n-1 && k < j) {
          a[p] = k; c[k] = p;
          a[p+1] = j; c[j] = p+1; 
          break;
        }
      }
      o[j] = !o[j]; // Change direction & continue
    }
  } while (j > 0);
}

Note that we can check elements from high to low directly & stop as soon as a mobile element is found – there is no need to check every element.

Johnson-Trotter and Algorithm P are essentially doing the same thing: when moving the mobile element, we are changing its inversion count in the same way as we do directly in P.

Performance-wise, the two algorithms are very similar though (somewhat surprisingly to me) the Johnson-Trotter version seems a little faster.

Either way, my laptop generates the 3628800 permutations of 10 elements in 0.03 seconds, outputting them (even to /dev/null) takes 4.5 seconds. It takes about 3 seconds to run through the 479001600 permutations of 12 elements (I haven’t tried outputting them). We can streamline the computation further by taking advantage of the fact that most of the time we are just moving the highest element in the same direction, but that can wait for another day.

Finally, here is a function that gives the next permutation in plain changes without maintaining any state:

bool nextperm(int *a, int n)
{
  int c[n], o[n];
  for (int i = 0; i < n; i++) {
    c[i] = o[i] = 0;
  }
  // Count inversions
  for (int i = 0; i < n; i++) {
    for (int j = i+1; j < n; j++) {
      if (a[j] < a[i]) c[a[i]]++;
    }
  }
  // Find directions
  for (int i = 1, k = 0; i < n-1; i++) {
    k += c[i];
    o[i+1] = k%2;
  }
  // Find mobile element and move it.
  for (int j = n-1, s = 0; j > 0; j--) {
    if (!o[j]) {
      if (c[j] < j) {
	swap(a[j-c[j]+s],a[j-c[j]+s-1]);
	return true;
      }
      s++;
    } else {
      if (c[j] > 0) {
	swap(a[j-c[j]+s],a[j-c[j]+s+1]);
	return true;
      }
    }
  }
  return false;
}

void stateless(int *a, int n)
{
  do {
    visit(a,n);
  } while (nextperm(a,n));
}

We use the Knuth/Dijkstra trick to avoid building a position table. I suspect that n-squared inversion counting can be improved, but we still come in at under second for all permutations of 10, though that’s about 30 times slower than the optimized version.


Krazy Kompilation

Let’s take a simple factorial program:

int fact(int n)
{
  if (n == 0) return 1;
  else return n * fact(n-1);
}

and look at what gcc compiles it to (using Matthew Godbolt’s excellent GCC Explorer):

With -O1, a nice clean version of the recursive function:

fact(int):
	pushq	%rbx
	movl	%edi, %ebx
	movl	$1, %eax
	testl	%edi, %edi
	je	.L2
	leal	-1(%rdi), %edi
	call	fact(int)
	imull	%ebx, %eax
.L2:
	popq	%rbx
	ret

With -O2, that’s impressive, it’s done some program transformation and now we have a loop:

fact(int):
	testl	%edi, %edi
	movl	$1, %eax
	je	.L4
.L3:
	imull	%edi, %eax
	subl	$1, %edi
	jne	.L3
	rep
	ret
.L4:
	rep
	ret

With -O3, whaaah!:

fact(int):
	testl	%edi, %edi
	je	.L7
	movl	%edi, %edx
	movl	%edi, %esi
	shrl	$2, %edx
	leal	0(,%rdx,4), %ecx
	testl	%ecx, %ecx
	je	.L8
	cmpl	$6, %edi
	jbe	.L8
	leal	-1(%rdi), %eax
	movl	%edi, -24(%rsp)
	movdqa	.LC1(%rip), %xmm4
	movl	%eax, -20(%rsp)
	leal	-2(%rdi), %eax
	movd	-20(%rsp), %xmm2
	movl	%eax, -16(%rsp)
	leal	-3(%rdi), %eax
	movd	-16(%rsp), %xmm1
	movl	%eax, -12(%rsp)
	xorl	%eax, %eax
	movd	-12(%rsp), %xmm0
	punpckldq	%xmm0, %xmm1
	movd	-24(%rsp), %xmm0
	punpckldq	%xmm2, %xmm0
	punpcklqdq	%xmm1, %xmm0
	movdqa	.LC0(%rip), %xmm1
.L4:
	movdqa	%xmm1, %xmm3
	psrldq	$4, %xmm1
	addl	$1, %eax
	movdqa	%xmm0, %xmm2
	cmpl	%eax, %edx
	pmuludq	%xmm0, %xmm3
	paddd	%xmm4, %xmm0
	psrldq	$4, %xmm2
	pmuludq	%xmm1, %xmm2
	pshufd	$8, %xmm3, %xmm1
	pshufd	$8, %xmm2, %xmm2
	punpckldq	%xmm2, %xmm1
	ja	.L4
	movdqa	%xmm1, %xmm2
	subl	%ecx, %edi
	cmpl	%ecx, %esi
	psrldq	$8, %xmm2
	movdqa	%xmm2, %xmm0
	psrldq	$4, %xmm2
	pmuludq	%xmm1, %xmm0
	psrldq	$4, %xmm1
	pshufd	$8, %xmm0, %xmm0
	pmuludq	%xmm2, %xmm1
	pshufd	$8, %xmm1, %xmm1
	punpckldq	%xmm1, %xmm0
	movdqa	%xmm0, %xmm1
	psrldq	$4, %xmm1
	movdqa	%xmm1, %xmm2
	psrldq	$4, %xmm1
	pmuludq	%xmm0, %xmm2
	psrldq	$4, %xmm0
	pmuludq	%xmm0, %xmm1
	pshufd	$8, %xmm2, %xmm0
	pshufd	$8, %xmm1, %xmm1
	punpckldq	%xmm1, %xmm0
	movd	%xmm0, -24(%rsp)
	movl	-24(%rsp), %eax
	je	.L13
.L9:
	imull	%edi, %eax
	subl	$1, %edi
	jne	.L9
	rep
	ret
.L7:
	movl	$1, %eax
	ret
.L8:
	movl	$1, %eax
	jmp	.L9
.L13:
	ret

Well, OK, it’s unwound the loop and vectorized with SSE instructions though a more precise analysis will have to wait for a future post. Impressive optimization from GCC though and pretty groovy considering the biggest factorial we can fit in a 32-bit int is fact(15)!


Redirection

Like, I suspect, many programmers, there are many software tools I have used on a regular basis for many years but remain woefully ignorant of their inner workings and true potential. One of these is the Unix command line.

It’s a common, for example, to want to make the error output of a program to appear as the normal output and trial and error, or Googling or looking at Stack Overflow leads to:

strace ls 2>&1 >/dev/null

which works fine but seems puzzling – we’ve told the program to send error output to normal output, then normal output to /dev/null so why doesn’t that discard everything, similar to:

strace ls >/dev/null 2>&1

This is because we don’t understand what is going on.

An indirection is actually a call to the dup2 system call. From the man page:

dup2() makes newfd be the copy of oldfd, closing newfd first if necessary'

So: n>&m does a dup2(m,n): close fd n if necessary, then make n be a copy of fd m, and n>file means: close n if necessary, open file as fd m, then do dup2(m,n).

Now it all makes sense:

strace ls 2>&1 1>/dev/null 

first of all makes 2 be a copy of 1, then changes 1 to point to /dev/null – the copying is done ‘by value’ as it were (despite the confusing, for C++ programmers anyway, use of ‘&’).

Using strace here is not an accident, but used like this doesn’t tell us much: indirection is handled by the shell, not by the program, so we need to do something like this for further insight:

$ strace -f -etrace=clone,execve,open,dup2 bash -c 'ls >/dev/null 2>&1'
execve("/bin/bash", ["bash", "-c", "ls >/dev/null 2>&1"], [/* 46 vars */]) = 0
clone(Process 20454 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f643f4c09d0) = 20454
Process 20453 suspended
[pid 20454] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 20454] dup2(3, 1)                  = 1
[pid 20454] dup2(1, 2)                  = 2
[pid 20454] execve("/bin/ls", ["ls"], [/* 45 vars */]) = 0
Process 20453 resumed
Process 20454 detached
--- SIGCHLD (Child exited) @ 0 (0) ---

bash forks off a subprocess (a clone syscall these days rather than fork), which then sets up the input and output before calling execve to actually run the command.

We don’t have to limit ourselves to fds 0,1 and 2 that we get to start with:

$ runit () { echo stderr 1>&2; echo stdout; }
$ runit 1>/dev/null
stderr
$ runit 2>/dev/null
stdout
$ runit 3>&2 2>&1 1>&3
stderr
stdout
$ (runit 3>&2 2>&1 1>&3) 1> /dev/null
stdout
$ (runit 3>&2 2>&1 1>&3) 2> /dev/null
stderr

We duplicate 2 to 3, then 1 to 2, then 3 to 1, and we have swapped stdin and stderr.

We can also pass non-standard file descriptors in to programs, though this doesn’t seem to be a technique used much:

#include <unistd.h>
int main()
{
  char buffer[256];
  ssize_t n;
  while ((n = read(3,buffer,sizeof(buffer))) > 0) {
    write(4,buffer,n);
  }
}

and do:

$ g++ -Wall cat34.cpp -o cat34
$ echo Hello World | ./cat34 3<&0 4>&1
Hello World

It’s interesting that this also works:

$ echo Hello World | ./cat34 3>&0 4<&1
Hello World

While we are this part of town, let’s talk briefly about named pipes, another feature that has been around for ever, but doesn’t seem to get used as much as it deserves. We can run in to problems though:

Suppose I want to capture an HTTP response from a web server, I can do this:

$ mkfifo f1
$ mkfifo f2
$ mkfifo f3
$ netcat -l 8888 <f1 >f2 &
$ netcat www.bbc.co.uk 80 <f2 >f3 &
$ tee foo.txt <f3 >f1 &

and try a download, but alas:

$ GET http://localhost:8888/ >/dev/null
Can't connect to localhost:8888 (Connection refused)

This doesn’t seem right, netcat should be listening on port 8888, I told it so myself! And checking with netstat shows no listener on 8888 and finally ps -aux shows no sign of any netcat processes – what is going on?

Once again, strace helps us see the true reality of things:

$ kill %1
$ strace netcat -l 8888 < f1 > f2

But strace tells us nothing – there is no output! Like the dog that didn’t bark in the night though, this is an important clue, and widening our area of investigation, we find:

$ strace -f -etrace=open,execve bash -c 'netcat -l 8888 < f1 > f2'
execve("/bin/bash", ["bash", "-c", "netcat -l 8888 < f1 > f2"], [/* 46 vars */]) = 0
Process 20621 attached
Process 20620 suspended
[pid 20621] open("f1", O_RDONLY ...

The shell is stalled trying to open the “f1” fifo, before it even gets around to starting the netcat program, which is why the first strace didn’t show anything. What we have forgotten is that opening a pipe blocks if there is no process with the other end open (it doesn’t have to actively reading or writing, it just has to be there). The shell handles redirections in the order they appear, so since our 3 processes are all opening their read fifo first, none have got around to opening their write fifo – we have deadlock, in fact, the classic dining philosophers problem, and a simple solution is for one philosopher to pick up the forks in a different order:

$ netcat -l 8888 >f2 <f1 &
$ netcat www.bbc.co.uk 80 <f2 >f3 &
$ tee foo.txt <f3 >f1 &
$ GET http://localhost:8888/ >/dev/null
$ cat foo.txt
HTTP/1.1 301 Moved Permanently
Server: Apache
...

We can, it should be noted, do this more easily with a normal pipeline and a single fifo, and avoid all these problems:

$ netcat www.bbc.co.uk 80 <f1 | tee foo.txt | netcat -l 8888 >f1

but that would be less fun and possibly less instructive.


tolower

In an Age of Highly Pipelined RISC Architectures, we might be tempted to write a basic ASCII lower-case function as:

int tolower1(int c)
{
  return c|-(c-0101U<032)&040;
}

and avoid any branching. We subtract ‘A’ from c and do an unsigned comparison with ‘Z’-‘A’, convert to a bitmask and use that to ‘or’ in the lower case bit where necessary.

A more quiche-eating approach would be the clear and obvious:

int tolower2(int c)
{
  if (c >='A' && c <= 'Z') return c+'a'-'A';
  else return c;
}

at which hard-nosed hacker types will sneer, however, an Age of Highly Pipelined RISC Architectures is also an Age of Conditional Move Instructions, and GCC generates the pleasantly concise:

tolower2:
	leal	-65(%rdi), %ecx
	leal	32(%rdi), %edx
	movl	%edi, %eax
	cmpl	$25, %ecx
	cmovbe	%edx, %eax
	ret

for our simple approach rather than the distinctly bloated:

tolower1:
	leal	-65(%rdi), %eax
	cmpl	$25, %eax
	setbe	%al
	movzbl	%al, %eax
	negl	%eax
	andl	$32, %eax
	orl	%edi, %eax
	ret

All is not lost though for the hard-nosed hacker, for this is also an Age of Vectorization and we can do something similar using SSE2 instructions to lower-case not one but 16 characters at a time:

#include <x86intrin.h>
void tolower128(char *p)
{
  __m128i v1 = _mm_loadu_si128((__m128i*)p);
  __m128i v2 = _mm_cmplt_epi8(v1,_mm_set1_epi8('A'));
  __m128i v3 = _mm_cmpgt_epi8(v1,_mm_set1_epi8('Z'));
  __m128i v4 = _mm_or_si128(v2,v3);
  __m128i v5 = _mm_andnot_si128(v4,_mm_set1_epi8(0x20));
  _mm_storeu_si128((__m128i*)p, _mm_or_si128(v1,v5));
}

Now everyone should be happy!


Fun with TUN

TUN devices are much used for virtualization, VPNs, network testing programs, etc. A TUN device essentially is a network interface that also exists as a user space file descriptor, data sent to the interface can be read from the file descriptor, and data written to the file descriptor emerges from the network interface.

Here’s a simple example of their use. We create a TUN device that simulates an entire network, with traffic to each network address just routed back to the original host.

For a complete program, see:

https://github.com/matthewarcus/stuff/blob/master/tun/reflect.cpp

First create your TUN device, this is fairly standard, most public code seems to be derived from Maxim Krasnyansky’s:

https://www.kernel.org/doc/Documentation/networking/tuntap.txt

and our code is no different:

int tun_alloc(char *dev) 
{
  assert(dev != NULL);
  int fd = open("/dev/net/tun", O_RDWR);
  CHECKFD(fd);

  struct ifreq ifr; 
  memset(&ifr, 0, sizeof(ifr)); 
  ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
  strncpy(ifr.ifr_name, dev, IFNAMSIZ); 
  CHECKSYS(ioctl(fd, TUNSETIFF, (void *) &ifr));
  strncpy(dev, ifr.ifr_name, IFNAMSIZ); 
  return fd;
}

We want a TUN device (rather than TAP, essentially the same thing but at the ethernet level) and we don’t want packet information at the moment. We copy the name of the allocated device to the char array given as a parameter.

Now all our program needs to do is create the TUN device and sit in a loop copying packets:

int main(int argc, char *argv[])
{
  char dev[IFNAMSIZ+1];
  memset(dev,0,sizeof(dev));
  if (argc > 1) strncpy(dev,argv[1],sizeof(dev)-1);

  // Allocate the tun device
  int fd = tun_alloc(dev);
  if (fd < 0) exit(0);

  uint8_t buf[2048];
  while(true) {
    // Sit in a loop, read a packet from fd, reflect
    // addresses and write back to fd.
    ssize_t nread = read(fd,buf,sizeof(buf));
    CHECK(nread >= 0);
    if (nread == 0) break;
    reflect(buf,nread);
    ssize_t nwrite = write(fd,buf,nread);
    CHECK(nwrite == nread);
  }
}

The TUN mechanism ensures that we get exactly one packet for each read, we don’t need to worry about fragmentation, and we just send each packet back with the source and destination IPs swapped:

static inline void put32(uint8_t *p, size_t offset, uint32_t n)
{
  memcpy(p+offset,&n,sizeof(n));
}

static inline uint32_t get32(uint8_t *p, size_t offset)
{
  uint32_t n;
  memcpy(&n,p+offset,sizeof(n));
  return n;
}

void reflect(uint8_t *p, size_t nbytes)
{
  (void)nbytes;
  uint8_t version = p[0] >> 4;
  switch (version) {
  case 4:
    break;
  case 6:
    fprintf(stderr, "IPv6 not implemented yet\n");
    exit(0);
  default:
    fprintf(stderr, "Unknown protocol %u\n", version);
    exit(0);
  }
  uint32_t src = get32(p,12);
  uint32_t dst = get32(p,16);
  put32(p,12,dst);
  put32(p,16,src);
}

We don’t need to recalculate the header checksum as it doesn’t get changed by just swapping two 32 bit segments.

Handling IPV6 is left as an exercise for the reader (we just need to use a different offset and address size I think).

In this day and age, security should be prominent in our minds, particularly for long-running programs like our TUN server, so for extra points, let’s add in some capability processing.

(You might need to install a libcap-dev package for this to work, for example, with “sudo apt-get install libcap-dev” and link with -lcap).

Once we have started up, we should check if we have the required capability, we just require CAP_NET_ADMIN to be permitted:

  cap_t caps = cap_get_proc();
  CHECK(caps != NULL);

  cap_value_t cap = CAP_NET_ADMIN;
  const char *capname = STRING(CAP_NET_ADMIN);

  cap_flag_value_t cap_permitted;
  CHECKSYS(cap_get_flag(caps, cap,
                        CAP_PERMITTED, &cap_permitted));
  if (!cap_permitted) {
    fprintf(stderr, "%s not permitted, exiting\n", capname);
    exit(0);
  }

and then make effective what we require:

  CHECKSYS(cap_clear(caps));
  CHECKSYS(cap_set_flag(caps, CAP_PERMITTED, 1, &cap, CAP_SET));
  CHECKSYS(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap, CAP_SET));
  CHECKSYS(cap_set_proc(caps));

Finally, after creating our TUN object, before entering our main loop, we can relinquish our extra privileges altogether:

  CHECKSYS(cap_clear(caps));
  CHECKSYS(cap_set_proc(caps));
  CHECKSYS(cap_free(caps));

For completeness, here are the error checking macros used above:

#define CHECKAUX(e,s)                            \
 ((e)? \
  (void)0: \
  (fprintf(stderr, "'%s' failed at %s:%d - %s\n", \
           s, __FILE__, __LINE__,strerror(errno)), \
   exit(0)))
#define CHECK(e) (CHECKAUX(e,#e))
#define CHECKSYS(e) (CHECKAUX((e)==0,#e))
#define CHECKFD(e) (CHECKAUX((e)>=0,#e))
#define STRING(e) #e

Of course, production code will want to do something more sophisticated than calling exit(0) when an error occurs…

To use, compile for example with:

g++ -W -Wall -O3 reflect.cpp -lcap -o reflect

We can set permissions for our new executable to include the relevant capability, so we don’t need to start it as root:

$ sudo setcap cap_net_admin+ep ./reflect

Actually start it:

$ ./reflect&
Capability CAP_NET_ADMIN: 1 0 1
Created tun device tun0

We now have an interface, but it isn’t configured:

$ ifconfig tun0
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
POINTOPOINT NOARP MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

With the interface running, set up networking:

$ sudo ip link set tun0 up
$ sudo ip addr add 10.0.0.1/8 dev tun0

Check all is well:

$ ifconfig tun0
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.0.0.1 P-t-P:10.0.0.1 Mask:255.0.0.0
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

And try it out:

$ ping -c 1 10.0.0.41
PING 10.0.0.41 (10.0.0.41) 56(84) bytes of data.
64 bytes from 10.0.0.41: icmp_req=1 ttl=64 time=0.052 ms

--- 10.0.0.41 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.052/0.052/0.052/0.000 ms

Let’s check performance, firstly, a flood ping on the loopback device:

$ sudo ping -f -c10000 -s1500 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 1500(1528) bytes of data.

--- 127.0.0.1 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 778ms
rtt min/avg/max/mdev = 0.003/0.006/0.044/0.002 ms, pipe 2, ipg/ewma 0.077/0.006 ms

compared to one through the TUN connection:

$ sudo ping -f -c10000 -s1500 10.0.0.100
PING 10.0.0.100 (10.0.0.100) 1500(1528) bytes of data.

--- 10.0.0.100 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 945ms
rtt min/avg/max/mdev = 0.022/0.032/3.775/0.038 ms, pipe 2, ipg/ewma 0.094/0.032 ms

Respectable. We have got ourselves a network!


Tail recursive string recognizers

A recent discussion at work about the best way to write a simple string checking function set me thinking…

Essentially, the problem is to recognize strings conforming to some regular expression, but with some extra conditions, that either can’t be expressed as a regexp or that are inconvenient to do so, so we can’t just plug it in to PCRE or whatever RE library is flavour of the month.

In this case, we wanted strings conforming to:

[0-9a-zA-Z*][-0-9a-zA-Z*]*(.[0-9a-zA-Z*][-0-9a-zA-Z*]*)*

ie. a path containing “.” separated components, possibly with “*” wildcards, but also with the condition that each component has at most one wildcard and with a limit on the length of each component.

Ignoring the extra conditions, we can define a suitable regular grammar:

S = S0
S0 = ALNUM + S1 | "*" + S1
S1 = ALNUM + S1 | "-" + S1 | "*" + S1 | "." + S0 | ""

and this naturally suggests a recursive descent recognizer, though in this case, there is no actual descent as the grammar is regular and all the recursions are tail calls.

Happily, these days compilers like GCC will compile tail calls to jumps (at least on higher optimization settings), just like ML compilers, for example, have always had to, and we can hope for reasonably good code from such an approach. Of course, once the tail calls have been optimized, we have a finite state recognizer with the state transitions implemented as gotos!

The extra conditions can now be easily added in to the state transition functions, with any variables needed just appearing as extra parameters (and because we need to specify the parameter value for every recursive call/state transition, we are less likely to forget to set something) so we end up with something like:

#include <ctype.h>

static inline bool check(const char *p);

static bool check1(const char *p, int n, bool seen) {
  if (n > 4) return false;
  if (isalnum(*p)) return check1(p+1, n+1, seen);
  if (*p == '-') return check1(p+1, n+1, seen);
  if (*p == '*') return !seen && check1(p+1,n+1,true);
  if (*p == '.') return check(p+1);
  if (*p == '\0') return true;
  return false;
}

static bool check(const char *p) {
  if (isalnum(*p)) return check1(p+1,1,false);
  if (*p == '*') return check1(p+1,1,true);
  return false;
}

Compiling with 64 bit gcc 4.7.2, -O2 (gcc only optimizes tail calls at -O2 and above), we get:

_Z5checkPKc:
    movq    %rbp, -8(%rsp)
    movq    %rbx, -16(%rsp)
    subq    $24, %rsp
    movzbl  (%rdi), %ebx
    movq    %rdi, %rbp
    call    __ctype_b_loc
    movq    (%rax), %rax
    movsbq  %bl, %rdx
    testb   $8, (%rax,%rdx,2)
    jne .L7
    cmpb    $42, %bl
    je  .L8
    xorl    %eax, %eax
    movq    8(%rsp), %rbx
    movq    16(%rsp), %rbp
    addq    $24, %rsp
    ret
.L7:
    leaq    1(%rbp), %rdi
    xorl    %edx, %edx
.L5:
    movq    8(%rsp), %rbx
    movq    16(%rsp), %rbp
    movl    $1, %esi
    addq    $24, %rsp
    jmp _ZL6check1PKcib
.L8:
    leaq    1(%rbp), %rdi
    movl    $1, %edx
    jmp .L5
_ZL6check1PKcib:
    movq    %rbp, -24(%rsp)
    movq    %r12, -16(%rsp)
    movq    %rdi, %rbp
    movq    %r13, -8(%rsp)
    movq    %rbx, -32(%rsp)
    subq    $40, %rsp
    movzbl  (%rdi), %ebx
    movl    %esi, %r13d
    movl    %edx, %r12d
    call    __ctype_b_loc
    movq    (%rax), %rax
    movsbq  %bl, %rcx
    testb   $8, (%rax,%rcx,2)
    jne .L23
    cmpb    $45, %bl
    je  .L23
    cmpb    $42, %bl
    je  .L24
    testb   %bl, %bl
    sete    %al
    cmpb    $46, %bl
    je  .L25
.L12:
    movq    8(%rsp), %rbx
    movq    16(%rsp), %rbp
    movq    24(%rsp), %r12
    movq    32(%rsp), %r13
    addq    $40, %rsp
    ret
.L23:
    leal    1(%r13), %esi
    cmpl    $4, %esi
    jg  .L14
    leaq    1(%rbp), %rdi
    movzbl  %r12b, %edx
.L21:
    movq    8(%rsp), %rbx
    movq    16(%rsp), %rbp
    movq    24(%rsp), %r12
    movq    32(%rsp), %r13
    addq    $40, %rsp
    jmp _ZL6check1PKcib
.L24:
    xorl    %eax, %eax
    testb   %r12b, %r12b
    jne .L12
    leal    1(%r13), %esi
    cmpl    $4, %esi
    jg  .L12
    leaq    1(%rbp), %rdi
    movl    $1, %edx
    jmp .L21
.L14:
    xorl    %eax, %eax
    jmp .L12
.L25:
    leaq    1(%rbp), %rdi
    movq    8(%rsp), %rbx
    movq    16(%rsp), %rbp
    movq    24(%rsp), %r12
    movq    32(%rsp), %r13
    addq    $40, %rsp
    jmp _Z5checkPKc

It looks like the tail calls have indeed been compiled as jumps, so the stack won’t blow up. There is what looks like some unnecessary movement of function arguments between the stack and the function argument registers, but we’re not doing too badly. Probably not quite as fast as a hand-coded goto-based version, but only a little slower, almost certainly a price worth paying for the gain in clarity (in fact, most of the time seems to be spent in the calls to isalnum).