# Floating point numbers are made of jelly.

**Posted:** January 9, 2013

**Filed under:** Floating Point

Floating point equality can be quite interesting, leading some people to think that they are actually made of some sort of jelly, and don’t in fact have precise values. This isn’t true of course but sometimes things can get quite bizarre. I was surprised to find this assertion:

```cpp
printf("%.20g %.20g\n", f(), g());
assert(f() == g());
```

failing with this output:

```
$ ./foo
0.9979466119096509491 0.9979466119096509491
foo: foo.cpp:15: int main(): Assertion `f() == g()' failed.
Aborted
```

(All code compiled with gcc 4.4.3, on Ubuntu 10.04, 32 bit Celeron).

Now 20 significant figures are more than enough to tell the difference between any two distinct doubles (17 suffice), so this looks a little strange.

More investigation showed that:

```cpp
double x = f();
assert(x == f());
```

failed as well and I started to get worried. It’s true that floating point numbers aren’t equal to themselves if they are NaNs, but that didn’t seem to be the case here. Also, the assertion didn’t fail for code that had been optimized.

Some years ago I did some work on code generation for the 387 FPU, and back then we had trouble with rounding – test programs for the 387 would produce slightly different results from other architectures like MIPS and SPARC. IEEE 754 is quite precise about how rounding should be done, so this seemed strange until we realized that the 387 uses 80-bit precision internally, with rounding to 64-bit doubles only happening when data is transferred from the FPU to memory (FSTP and friends). There is a flag in the FPU control word to set the internal rounding mode to double precision, and once we had set that we got consistent results on all architectures.

Looks like something similar is happening here and a look at the disassembly confirms this:

```
call    _Z1av
fstpl   24(%esp)
call    _Z1av
fldl    24(%esp)
fxch    %st(1)
fucompp
fnstsw  %ax
```

The result of the first call to a() is written out to memory as a double, and hence rounded; it is then reloaded and compared against the unrounded result of the second call.

Investigating further exposes more interesting behaviour:

```cpp
#include <stdio.h>
#include <math.h>
#include <assert.h>

int main()
{
    double x, y, z;
    int a;

    x = 1.0;
    while (x != 0.0) { y = x; x /= 2; }
    a = y > 0;
    printf("%.40Lg\n", (long double)y);
    printf("%d\n", a);
    printf("%.20g\n", y/y);
    printf("%.40Lg\n", (long double)y);

    x = 1.0;
    y = nextafter(x, 2.0);
    z = (x+y)/2.0;
    printf("%d\n", z > x && z < y);

    x = 0.5;
    while ((z = sqrt(x)) != x) x = z;
    y = x;
    while ((z = x*x) != x) x = z;
    printf("%.40Lg\n%.40Lg\n", (long double)x, (long double)y);
}
```

Now we have completely different behaviour between optimized and unoptimized code:

```
$ g++ -Wall fp.cpp
$ ./a.out
4.940656458412465441765687928682213723651e-324
1
1
4.940656458412465441765687928682213723651e-324
0
1
1
```

Unoptimized, this is pretty much as expected: we iterate down to the smallest representable double (a denormal), which divided by itself gives 1. If we average 1 and the next representable number, we don't get a number in between; and finally, if we iterate square roots, we eventually arrive at 1, and squaring that just gives 1 again.

```
$ g++ -O3 -Wall fp.cpp
$ ./a.out
3.645199531882474602528405933619419816399e-4951
0
-nan
0
1
0
0.9999999999999999999457898913757247782996
```
Bah, we seem to have iterated right down to the smallest extended value there, but when we divide it by itself we get a NaN – it seems to have been magically converted to zero, as confirmed by printing it out again (presumably the 387 register was stored temporarily in memory while we did the printf, and truncated to 64 bits on the way). Now we are able to find a value between 1 and 1 + ε; and finally, our repeated square root seems to converge on something slightly less than 1, and when we go back the other way, we end up at 0 instead.

Fortunately, we don’t need to worry about what to do about this, or whether it’s the chip designers, the language designers, the FP standard designers, or the compiler writers who are getting it wrong (or maybe FP numbers really are made of jelly). Newer x86 processors have SSE registers for floating point, which hold 64-bit doubles throughout (as do most other common architectures), so these problems are less likely to occur:

```
...
call    _Z3av
movsd   %xmm0, -8(%rbp)
call    _Z3av
ucomisd -8(%rbp), %xmm0
jne     .L8
...
```

and if we run our little test program (this time with gcc 4.7.2) on a 64-bit processor we get:

```
$ g++ -Wall fp.cpp
$ ./a.out
4.940656458412465441765687928682213723651e-324
1
1
4.940656458412465441765687928682213723651e-324
0
0
0.9999999999999998889776975374843459576368
```

for all levels of optimization.

We can force use of the 387 instructions with `-mfpmath=387`, and then we get:

```
$ g++ -O3 -mfpmath=387 -Wall fp.cpp
$ ./a.out
3.645199531882474602528405933619419816399e-4951
0
-nan
0
1
0
0.9999999999999999999457898913757247782996
```

again, so it doesn’t look as if our problems are related to the particular gcc version.

With the new code, we lose the benefits of extra internal precision, but on the whole I prefer having half a chance of understanding what’s going on.