# Krazy Kompilation

Let’s take a simple factorial program:

```int fact(int n)
{
if (n == 0) return 1;
else return n * fact(n-1);
}
```

and look at what gcc compiles it to (using Matthew Godbolt’s excellent GCC Explorer):

With -O1, a nice clean version of the recursive function:

```fact(int):
pushq	%rbx
movl	%edi, %ebx
movl	\$1, %eax
testl	%edi, %edi
je	.L2
leal	-1(%rdi), %edi
call	fact(int)
imull	%ebx, %eax
.L2:
popq	%rbx
ret
```

With -O2, that’s impressive, it’s done some program transformation and now we have a loop:

```fact(int):
testl	%edi, %edi
movl	\$1, %eax
je	.L4
.L3:
imull	%edi, %eax
subl	\$1, %edi
jne	.L3
rep
ret
.L4:
rep
ret
```

With -O3, whaaah!:

```fact(int):
testl	%edi, %edi
je	.L7
movl	%edi, %edx
movl	%edi, %esi
shrl	\$2, %edx
leal	0(,%rdx,4), %ecx
testl	%ecx, %ecx
je	.L8
cmpl	\$6, %edi
jbe	.L8
leal	-1(%rdi), %eax
movl	%edi, -24(%rsp)
movdqa	.LC1(%rip), %xmm4
movl	%eax, -20(%rsp)
leal	-2(%rdi), %eax
movd	-20(%rsp), %xmm2
movl	%eax, -16(%rsp)
leal	-3(%rdi), %eax
movd	-16(%rsp), %xmm1
movl	%eax, -12(%rsp)
xorl	%eax, %eax
movd	-12(%rsp), %xmm0
punpckldq	%xmm0, %xmm1
movd	-24(%rsp), %xmm0
punpckldq	%xmm2, %xmm0
punpcklqdq	%xmm1, %xmm0
movdqa	.LC0(%rip), %xmm1
.L4:
movdqa	%xmm1, %xmm3
psrldq	\$4, %xmm1
movdqa	%xmm0, %xmm2
cmpl	%eax, %edx
pmuludq	%xmm0, %xmm3
psrldq	\$4, %xmm2
pmuludq	%xmm1, %xmm2
pshufd	\$8, %xmm3, %xmm1
pshufd	\$8, %xmm2, %xmm2
punpckldq	%xmm2, %xmm1
ja	.L4
movdqa	%xmm1, %xmm2
subl	%ecx, %edi
cmpl	%ecx, %esi
psrldq	\$8, %xmm2
movdqa	%xmm2, %xmm0
psrldq	\$4, %xmm2
pmuludq	%xmm1, %xmm0
psrldq	\$4, %xmm1
pshufd	\$8, %xmm0, %xmm0
pmuludq	%xmm2, %xmm1
pshufd	\$8, %xmm1, %xmm1
punpckldq	%xmm1, %xmm0
movdqa	%xmm0, %xmm1
psrldq	\$4, %xmm1
movdqa	%xmm1, %xmm2
psrldq	\$4, %xmm1
pmuludq	%xmm0, %xmm2
psrldq	\$4, %xmm0
pmuludq	%xmm0, %xmm1
pshufd	\$8, %xmm2, %xmm0
pshufd	\$8, %xmm1, %xmm1
punpckldq	%xmm1, %xmm0
movd	%xmm0, -24(%rsp)
movl	-24(%rsp), %eax
je	.L13
.L9:
imull	%edi, %eax
subl	\$1, %edi
jne	.L9
rep
ret
.L7:
movl	\$1, %eax
ret
.L8:
movl	\$1, %eax
jmp	.L9
.L13:
ret
```

Well, OK, it’s unwound the loop and vectorized with SSE instructions though a more precise analysis will have to wait for a future post. Impressive optimization from GCC though and pretty groovy considering the biggest factorial we can fit in a 32-bit int is fact(15)!