Krazy Kompilation

Let’s take a simple factorial program:

int fact(int n)
  if (n == 0) return 1;
  else return n * fact(n-1);

and look at what gcc compiles it to (using Matthew Godbolt’s excellent GCC Explorer):

With -O1, a nice clean version of the recursive function:

	pushq	%rbx
	movl	%edi, %ebx
	movl	$1, %eax
	testl	%edi, %edi
	je	.L2
	leal	-1(%rdi), %edi
	call	fact(int)
	imull	%ebx, %eax
	popq	%rbx

With -O2, that’s impressive, it’s done some program transformation and now we have a loop:

	testl	%edi, %edi
	movl	$1, %eax
	je	.L4
	imull	%edi, %eax
	subl	$1, %edi
	jne	.L3

With -O3, whaaah!:

	testl	%edi, %edi
	je	.L7
	movl	%edi, %edx
	movl	%edi, %esi
	shrl	$2, %edx
	leal	0(,%rdx,4), %ecx
	testl	%ecx, %ecx
	je	.L8
	cmpl	$6, %edi
	jbe	.L8
	leal	-1(%rdi), %eax
	movl	%edi, -24(%rsp)
	movdqa	.LC1(%rip), %xmm4
	movl	%eax, -20(%rsp)
	leal	-2(%rdi), %eax
	movd	-20(%rsp), %xmm2
	movl	%eax, -16(%rsp)
	leal	-3(%rdi), %eax
	movd	-16(%rsp), %xmm1
	movl	%eax, -12(%rsp)
	xorl	%eax, %eax
	movd	-12(%rsp), %xmm0
	punpckldq	%xmm0, %xmm1
	movd	-24(%rsp), %xmm0
	punpckldq	%xmm2, %xmm0
	punpcklqdq	%xmm1, %xmm0
	movdqa	.LC0(%rip), %xmm1
	movdqa	%xmm1, %xmm3
	psrldq	$4, %xmm1
	addl	$1, %eax
	movdqa	%xmm0, %xmm2
	cmpl	%eax, %edx
	pmuludq	%xmm0, %xmm3
	paddd	%xmm4, %xmm0
	psrldq	$4, %xmm2
	pmuludq	%xmm1, %xmm2
	pshufd	$8, %xmm3, %xmm1
	pshufd	$8, %xmm2, %xmm2
	punpckldq	%xmm2, %xmm1
	ja	.L4
	movdqa	%xmm1, %xmm2
	subl	%ecx, %edi
	cmpl	%ecx, %esi
	psrldq	$8, %xmm2
	movdqa	%xmm2, %xmm0
	psrldq	$4, %xmm2
	pmuludq	%xmm1, %xmm0
	psrldq	$4, %xmm1
	pshufd	$8, %xmm0, %xmm0
	pmuludq	%xmm2, %xmm1
	pshufd	$8, %xmm1, %xmm1
	punpckldq	%xmm1, %xmm0
	movdqa	%xmm0, %xmm1
	psrldq	$4, %xmm1
	movdqa	%xmm1, %xmm2
	psrldq	$4, %xmm1
	pmuludq	%xmm0, %xmm2
	psrldq	$4, %xmm0
	pmuludq	%xmm0, %xmm1
	pshufd	$8, %xmm2, %xmm0
	pshufd	$8, %xmm1, %xmm1
	punpckldq	%xmm1, %xmm0
	movd	%xmm0, -24(%rsp)
	movl	-24(%rsp), %eax
	je	.L13
	imull	%edi, %eax
	subl	$1, %edi
	jne	.L9
	movl	$1, %eax
	movl	$1, %eax
	jmp	.L9

Well, OK, it’s unwound the loop and vectorized with SSE instructions though a more precise analysis will have to wait for a future post. Impressive optimization from GCC though and pretty groovy considering the biggest factorial we can fit in a 32-bit int is fact(15)!


Like, I suspect, many programmers, there are many software tools I have used on a regular basis for many years but remain woefully ignorant of their inner workings and true potential. One of these is the Unix command line.

It’s a common, for example, to want to make the error output of a program to appear as the normal output and trial and error, or Googling or looking at Stack Overflow leads to:

strace ls 2>&1 >/dev/null

which works fine but seems puzzling – we’ve told the program to send error output to normal output, then normal output to /dev/null so why doesn’t that discard everything, similar to:

strace ls >/dev/null 2>&1

This is because we don’t understand what is going on.

An indirection is actually a call to the dup2 system call. From the man page:

dup2() makes newfd be the copy of oldfd, closing newfd first if necessary'

So: n>&m does a dup2(m,n): close fd n if necessary, then make n be a copy of fd m, and n>file means: close n if necessary, open file as fd m, then do dup2(m,n).

Now it all makes sense:

strace ls 2>&1 1>/dev/null 

first of all makes 2 be a copy of 1, then changes 1 to point to /dev/null – the copying is done ‘by value’ as it were (despite the confusing, for C++ programmers anyway, use of ‘&’).

Using strace here is not an accident, but used like this doesn’t tell us much: indirection is handled by the shell, not by the program, so we need to do something like this for further insight:

$ strace -f -etrace=clone,execve,open,dup2 bash -c 'ls >/dev/null 2>&1'
execve("/bin/bash", ["bash", "-c", "ls >/dev/null 2>&1"], [/* 46 vars */]) = 0
clone(Process 20454 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f643f4c09d0) = 20454
Process 20453 suspended
[pid 20454] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 20454] dup2(3, 1)                  = 1
[pid 20454] dup2(1, 2)                  = 2
[pid 20454] execve("/bin/ls", ["ls"], [/* 45 vars */]) = 0
Process 20453 resumed
Process 20454 detached
--- SIGCHLD (Child exited) @ 0 (0) ---

bash forks off a subprocess (a clone syscall these days rather than fork), which then sets up the input and output before calling execve to actually run the command.

We don’t have to limit ourselves to fds 0,1 and 2 that we get to start with:

$ runit () { echo stderr 1>&2; echo stdout; }
$ runit 1>/dev/null
$ runit 2>/dev/null
$ runit 3>&2 2>&1 1>&3
$ (runit 3>&2 2>&1 1>&3) 1> /dev/null
$ (runit 3>&2 2>&1 1>&3) 2> /dev/null

We duplicate 2 to 3, then 1 to 2, then 3 to 1, and we have swapped stdin and stderr.

We can also pass non-standard file descriptors in to programs, though this doesn’t seem to be a technique used much:

#include <unistd.h>
int main()
  char buffer[256];
  ssize_t n;
  while ((n = read(3,buffer,sizeof(buffer))) > 0) {

and do:

$ g++ -Wall cat34.cpp -o cat34
$ echo Hello World | ./cat34 3<&0 4>&1
Hello World

It’s interesting that this also works:

$ echo Hello World | ./cat34 3>&0 4<&1
Hello World

While we are this part of town, let’s talk briefly about named pipes, another feature that has been around for ever, but doesn’t seem to get used as much as it deserves. We can run in to problems though:

Suppose I want to capture an HTTP response from a web server, I can do this:

$ mkfifo f1
$ mkfifo f2
$ mkfifo f3
$ netcat -l 8888 <f1 >f2 &
$ netcat 80 <f2 >f3 &
$ tee foo.txt <f3 >f1 &

and try a download, but alas:

$ GET http://localhost:8888/ >/dev/null
Can't connect to localhost:8888 (Connection refused)

This doesn’t seem right, netcat should be listening on port 8888, I told it so myself! And checking with netstat shows no listener on 8888 and finally ps -aux shows no sign of any netcat processes – what is going on?

Once again, strace helps us see the true reality of things:

$ kill %1
$ strace netcat -l 8888 < f1 > f2

But strace tells us nothing – there is no output! Like the dog that didn’t bark in the night though, this is an important clue, and widening our area of investigation, we find:

$ strace -f -etrace=open,execve bash -c 'netcat -l 8888 < f1 > f2'
execve("/bin/bash", ["bash", "-c", "netcat -l 8888 < f1 > f2"], [/* 46 vars */]) = 0
Process 20621 attached
Process 20620 suspended
[pid 20621] open("f1", O_RDONLY ...

The shell is stalled trying to open the “f1” fifo, before it even gets around to starting the netcat program, which is why the first strace didn’t show anything. What we have forgotten is that opening a pipe blocks if there is no process with the other end open (it doesn’t have to actively reading or writing, it just has to be there). The shell handles redirections in the order they appear, so since our 3 processes are all opening their read fifo first, none have got around to opening their write fifo – we have deadlock, in fact, the classic dining philosophers problem, and a simple solution is for one philosopher to pick up the forks in a different order:

$ netcat -l 8888 >f2 <f1 &
$ netcat 80 <f2 >f3 &
$ tee foo.txt <f3 >f1 &
$ GET http://localhost:8888/ >/dev/null
$ cat foo.txt
HTTP/1.1 301 Moved Permanently
Server: Apache

We can, it should be noted, do this more easily with a normal pipeline and a single fifo, and avoid all these problems:

$ netcat 80 <f1 | tee foo.txt | netcat -l 8888 >f1

but that would be less fun and possibly less instructive.