# Krazy Kompilation

Let’s take a simple factorial program:

```int fact(int n)
{
if (n == 0) return 1;
else return n * fact(n-1);
}
```

and look at what gcc compiles it to (using Matthew Godbolt’s excellent GCC Explorer):

With -O1, a nice clean version of the recursive function:

```fact(int):
pushq	%rbx
movl	%edi, %ebx
movl	\$1, %eax
testl	%edi, %edi
je	.L2
leal	-1(%rdi), %edi
call	fact(int)
imull	%ebx, %eax
.L2:
popq	%rbx
ret
```

With -O2, that’s impressive, it’s done some program transformation and now we have a loop:

```fact(int):
testl	%edi, %edi
movl	\$1, %eax
je	.L4
.L3:
imull	%edi, %eax
subl	\$1, %edi
jne	.L3
rep
ret
.L4:
rep
ret
```

With -O3, whaaah!:

```fact(int):
testl	%edi, %edi
je	.L7
movl	%edi, %edx
movl	%edi, %esi
shrl	\$2, %edx
leal	0(,%rdx,4), %ecx
testl	%ecx, %ecx
je	.L8
cmpl	\$6, %edi
jbe	.L8
leal	-1(%rdi), %eax
movl	%edi, -24(%rsp)
movdqa	.LC1(%rip), %xmm4
movl	%eax, -20(%rsp)
leal	-2(%rdi), %eax
movd	-20(%rsp), %xmm2
movl	%eax, -16(%rsp)
leal	-3(%rdi), %eax
movd	-16(%rsp), %xmm1
movl	%eax, -12(%rsp)
xorl	%eax, %eax
movd	-12(%rsp), %xmm0
punpckldq	%xmm0, %xmm1
movd	-24(%rsp), %xmm0
punpckldq	%xmm2, %xmm0
punpcklqdq	%xmm1, %xmm0
movdqa	.LC0(%rip), %xmm1
.L4:
movdqa	%xmm1, %xmm3
psrldq	\$4, %xmm1
movdqa	%xmm0, %xmm2
cmpl	%eax, %edx
pmuludq	%xmm0, %xmm3
psrldq	\$4, %xmm2
pmuludq	%xmm1, %xmm2
pshufd	\$8, %xmm3, %xmm1
pshufd	\$8, %xmm2, %xmm2
punpckldq	%xmm2, %xmm1
ja	.L4
movdqa	%xmm1, %xmm2
subl	%ecx, %edi
cmpl	%ecx, %esi
psrldq	\$8, %xmm2
movdqa	%xmm2, %xmm0
psrldq	\$4, %xmm2
pmuludq	%xmm1, %xmm0
psrldq	\$4, %xmm1
pshufd	\$8, %xmm0, %xmm0
pmuludq	%xmm2, %xmm1
pshufd	\$8, %xmm1, %xmm1
punpckldq	%xmm1, %xmm0
movdqa	%xmm0, %xmm1
psrldq	\$4, %xmm1
movdqa	%xmm1, %xmm2
psrldq	\$4, %xmm1
pmuludq	%xmm0, %xmm2
psrldq	\$4, %xmm0
pmuludq	%xmm0, %xmm1
pshufd	\$8, %xmm2, %xmm0
pshufd	\$8, %xmm1, %xmm1
punpckldq	%xmm1, %xmm0
movd	%xmm0, -24(%rsp)
movl	-24(%rsp), %eax
je	.L13
.L9:
imull	%edi, %eax
subl	\$1, %edi
jne	.L9
rep
ret
.L7:
movl	\$1, %eax
ret
.L8:
movl	\$1, %eax
jmp	.L9
.L13:
ret
```

Well, OK, it’s unwound the loop and vectorized with SSE instructions though a more precise analysis will have to wait for a future post. Impressive optimization from GCC though and pretty groovy considering the biggest factorial we can fit in a 32-bit int is fact(15)!

# Redirection

Like, I suspect, many programmers, there are many software tools I have used on a regular basis for many years but remain woefully ignorant of their inner workings and true potential. One of these is the Unix command line.

It’s a common, for example, to want to make the error output of a program to appear as the normal output and trial and error, or Googling or looking at Stack Overflow leads to:

```strace ls 2>&1 >/dev/null
```

which works fine but seems puzzling – we’ve told the program to send error output to normal output, then normal output to /dev/null so why doesn’t that discard everything, similar to:

```strace ls >/dev/null 2>&1
```

This is because we don’t understand what is going on.

An indirection is actually a call to the `dup2` system call. From the man page:

`dup2() makes newfd be the copy of oldfd, closing newfd first if necessary'`

So: `n>&m` does a `dup2(m,n)`: close fd `n` if necessary, then make `n` be a copy of fd `m`, and `n>file` means: close `n` if necessary, open file as fd `m`, then do `dup2(m,n)`.

Now it all makes sense:

```strace ls 2>&1 1>/dev/null
```

first of all makes 2 be a copy of 1, then changes 1 to point to `/dev/null` – the copying is done ‘by value’ as it were (despite the confusing, for C++ programmers anyway, use of ‘&’).

Using strace here is not an accident, but used like this doesn’t tell us much: indirection is handled by the shell, not by the program, so we need to do something like this for further insight:

```\$ strace -f -etrace=clone,execve,open,dup2 bash -c 'ls >/dev/null 2>&1'
execve("/bin/bash", ["bash", "-c", "ls >/dev/null 2>&1"], [/* 46 vars */]) = 0
clone(Process 20454 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f643f4c09d0) = 20454
Process 20453 suspended
[pid 20454] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 20454] dup2(3, 1)                  = 1
[pid 20454] dup2(1, 2)                  = 2
[pid 20454] execve("/bin/ls", ["ls"], [/* 45 vars */]) = 0
Process 20453 resumed
Process 20454 detached
--- SIGCHLD (Child exited) @ 0 (0) ---
```

bash forks off a subprocess (a `clone` syscall these days rather than fork), which then sets up the input and output before calling `execve` to actually run the command.

We don’t have to limit ourselves to fds 0,1 and 2 that we get to start with:

```\$ runit () { echo stderr 1>&2; echo stdout; }
\$ runit 1>/dev/null
stderr
\$ runit 2>/dev/null
stdout
\$ runit 3>&2 2>&1 1>&3
stderr
stdout
\$ (runit 3>&2 2>&1 1>&3) 1> /dev/null
stdout
\$ (runit 3>&2 2>&1 1>&3) 2> /dev/null
stderr
```

We duplicate 2 to 3, then 1 to 2, then 3 to 1, and we have swapped `stdin` and `stderr`.

We can also pass non-standard file descriptors in to programs, though this doesn’t seem to be a technique used much:

```#include <unistd.h>
int main()
{
char buffer[256];
ssize_t n;
while ((n = read(3,buffer,sizeof(buffer))) > 0) {
write(4,buffer,n);
}
}
```

and do:

```\$ g++ -Wall cat34.cpp -o cat34
\$ echo Hello World | ./cat34 3<&0 4>&1
Hello World
```

It’s interesting that this also works:

```\$ echo Hello World | ./cat34 3>&0 4<&1
Hello World
```

While we are this part of town, let’s talk briefly about named pipes, another feature that has been around for ever, but doesn’t seem to get used as much as it deserves. We can run in to problems though:

Suppose I want to capture an HTTP response from a web server, I can do this:

```\$ mkfifo f1
\$ mkfifo f2
\$ mkfifo f3
\$ netcat -l 8888 <f1 >f2 &
\$ netcat www.bbc.co.uk 80 <f2 >f3 &
\$ tee foo.txt <f3 >f1 &
```

```\$ GET http://localhost:8888/ >/dev/null
Can't connect to localhost:8888 (Connection refused)
```

This doesn’t seem right, netcat should be listening on port 8888, I told it so myself! And checking with netstat shows no listener on 8888 and finally ps -aux shows no sign of any netcat processes – what is going on?

Once again, strace helps us see the true reality of things:

```\$ kill %1
\$ strace netcat -l 8888 < f1 > f2
```

But strace tells us nothing – there is no output! Like the dog that didn’t bark in the night though, this is an important clue, and widening our area of investigation, we find:

```\$ strace -f -etrace=open,execve bash -c 'netcat -l 8888 < f1 > f2'
execve("/bin/bash", ["bash", "-c", "netcat -l 8888 < f1 > f2"], [/* 46 vars */]) = 0
Process 20621 attached
Process 20620 suspended
[pid 20621] open("f1", O_RDONLY ...
```

The shell is stalled trying to open the “f1” fifo, before it even gets around to starting the netcat program, which is why the first strace didn’t show anything. What we have forgotten is that opening a pipe blocks if there is no process with the other end open (it doesn’t have to actively reading or writing, it just has to be there). The shell handles redirections in the order they appear, so since our 3 processes are all opening their read fifo first, none have got around to opening their write fifo – we have deadlock, in fact, the classic dining philosophers problem, and a simple solution is for one philosopher to pick up the forks in a different order:

```\$ netcat -l 8888 >f2 <f1 &
\$ netcat www.bbc.co.uk 80 <f2 >f3 &
\$ tee foo.txt <f3 >f1 &
\$ GET http://localhost:8888/ >/dev/null
\$ cat foo.txt
HTTP/1.1 301 Moved Permanently
Server: Apache
...
```

We can, it should be noted, do this more easily with a normal pipeline and a single fifo, and avoid all these problems:

```\$ netcat www.bbc.co.uk 80 <f1 | tee foo.txt | netcat -l 8888 >f1
```

but that would be less fun and possibly less instructive.