Krazy Kompilation

Let’s take a simple factorial program:

int fact(int n)
{
  if (n == 0) return 1;
  else return n * fact(n-1);
}

and look at what gcc compiles it to (using Matthew Godbolt’s excellent GCC Explorer):

With -O1, a nice clean version of the recursive function:

fact(int):
	pushq	%rbx
	movl	%edi, %ebx
	movl	$1, %eax
	testl	%edi, %edi
	je	.L2
	leal	-1(%rdi), %edi
	call	fact(int)
	imull	%ebx, %eax
.L2:
	popq	%rbx
	ret

With -O2, that’s impressive, it’s done some program transformation and now we have a loop:

fact(int):
	testl	%edi, %edi
	movl	$1, %eax
	je	.L4
.L3:
	imull	%edi, %eax
	subl	$1, %edi
	jne	.L3
	rep
	ret
.L4:
	rep
	ret

With -O3, whaaah!:

fact(int):
	testl	%edi, %edi
	je	.L7
	movl	%edi, %edx
	movl	%edi, %esi
	shrl	$2, %edx
	leal	0(,%rdx,4), %ecx
	testl	%ecx, %ecx
	je	.L8
	cmpl	$6, %edi
	jbe	.L8
	leal	-1(%rdi), %eax
	movl	%edi, -24(%rsp)
	movdqa	.LC1(%rip), %xmm4
	movl	%eax, -20(%rsp)
	leal	-2(%rdi), %eax
	movd	-20(%rsp), %xmm2
	movl	%eax, -16(%rsp)
	leal	-3(%rdi), %eax
	movd	-16(%rsp), %xmm1
	movl	%eax, -12(%rsp)
	xorl	%eax, %eax
	movd	-12(%rsp), %xmm0
	punpckldq	%xmm0, %xmm1
	movd	-24(%rsp), %xmm0
	punpckldq	%xmm2, %xmm0
	punpcklqdq	%xmm1, %xmm0
	movdqa	.LC0(%rip), %xmm1
.L4:
	movdqa	%xmm1, %xmm3
	psrldq	$4, %xmm1
	addl	$1, %eax
	movdqa	%xmm0, %xmm2
	cmpl	%eax, %edx
	pmuludq	%xmm0, %xmm3
	paddd	%xmm4, %xmm0
	psrldq	$4, %xmm2
	pmuludq	%xmm1, %xmm2
	pshufd	$8, %xmm3, %xmm1
	pshufd	$8, %xmm2, %xmm2
	punpckldq	%xmm2, %xmm1
	ja	.L4
	movdqa	%xmm1, %xmm2
	subl	%ecx, %edi
	cmpl	%ecx, %esi
	psrldq	$8, %xmm2
	movdqa	%xmm2, %xmm0
	psrldq	$4, %xmm2
	pmuludq	%xmm1, %xmm0
	psrldq	$4, %xmm1
	pshufd	$8, %xmm0, %xmm0
	pmuludq	%xmm2, %xmm1
	pshufd	$8, %xmm1, %xmm1
	punpckldq	%xmm1, %xmm0
	movdqa	%xmm0, %xmm1
	psrldq	$4, %xmm1
	movdqa	%xmm1, %xmm2
	psrldq	$4, %xmm1
	pmuludq	%xmm0, %xmm2
	psrldq	$4, %xmm0
	pmuludq	%xmm0, %xmm1
	pshufd	$8, %xmm2, %xmm0
	pshufd	$8, %xmm1, %xmm1
	punpckldq	%xmm1, %xmm0
	movd	%xmm0, -24(%rsp)
	movl	-24(%rsp), %eax
	je	.L13
.L9:
	imull	%edi, %eax
	subl	$1, %edi
	jne	.L9
	rep
	ret
.L7:
	movl	$1, %eax
	ret
.L8:
	movl	$1, %eax
	jmp	.L9
.L13:
	ret

Well, OK, it’s unwound the loop and vectorized with SSE instructions though a more precise analysis will have to wait for a future post. Impressive optimization from GCC though and pretty groovy considering the biggest factorial we can fit in a 32-bit int is fact(15)!

Advertisements


Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s