# Testing for Triangularity

**Posted:**March 19, 2013

**Filed under:**Floating Point Leave a comment

A company I used to work for had a standard interview problem of writing a short function that would take 3 numbers (usually assumed to be integers) and determine if they formed a valid triangle. There was much discussion about the best/fastest/clearest/most elegant way of doing this, but I don’t remember the topic of what to do if the numbers were floating point coming up.

Curiously, being interviewed myself at another company not long ago I was asked this very question and wasn’t quite sure what the best thing to do was – for integers, my preferred solution is to sort the 3 numbers as a >= b >= c, then we have a triangle just when c > 0 and a < b+c (or

a-b < c if we are worried about overflow). What about if they are floating point numbers? Clearly if b >> c, then b+c might lose precision, so, for example, we might have c > 0 and b = a, so it’s valid, but if b+c = b, then our test fails.

We could try the same trick as for integers, testing for a-b < c – subtracting the two largest should minimize the rounding error; this seems plausible, but can we be sure we won’t get it wrong?

An earlier post looked at the sequence of integers:

` 0 1 2 3 4 5 6 7`

8 9 10 11 12 13 14 15

16 18 20 22 24 26 28 30

32 36 40 44 48 52 56 60

64 72 80 88 96 104 112 120

...

which are really just thinly disguised (non-negative) floating point numbers. An important feature, obvious here and easy to prove of “real” FP numbers, is that, apart from the first two rows, each row is the double of the row above it, therefore, 2n is representable exactly iff n is. Furthermore, it’s pretty obvious that if a and b are representable, with a/2 <= b <= a, then a-b is representable (and if a and b are numbers in the first two rows, then all differences are representable, so we don’t need to worry about them).

Now let’s return to our triangle: we have a >= b >= c > 0 and clearly a,b,c must be representable. If all the numbers are in the first two rows, then all operations are exact and we get the right answer. Otherwise, a is even: first, consider the case b >= a/2, as above, this implies that a-b is representable and the comparison with c must be exact, so our test is correct. Otherwise, if b < a/2, then since c <= b, c + b <= 2b < a, so we don’t have a triangle, and, assuming that rounding is monotonic (ie. x <= y => round(x) <= round(y), or equivalently, x < y => round(x) <= round(y)), our check works out correct too:

b < a/2, so b+c < a (exact) so c < a-b (exact) so c = round(c) <= round(a-b)

QED

As usual for this blog, we are merely paddling at the shore of a much deeper sea. For much more on triangles, floating point approximations etc, see Kahan’s paper:

(or indeed anything else by the great man:

http://www.cs.berkeley.edu/~wkahan/

http://owpdb.mfo.de/detail?photo_id=5350).

# Floating point numbers are made of jelly.

**Posted:**January 9, 2013

**Filed under:**Floating Point Leave a comment

Floating point equality can be quite interesting, leading some people to think that they are actually made of some sort of jelly, and don’t in fact have precise values. This isn’t true of course but sometimes things can get quite bizarre. I was surprised to find this assertion:

printf("%.20g %.20g\n", f(), g()); assert(f() == g());

failing with this output:

`$ ./foo`

0.9979466119096509491 0.9979466119096509491

foo: foo.cpp:15: int main(): Assertion `f() == g()' fails

Aborted

(All code compiled with gcc 4.4.3, on Ubuntu 10.04, 32 bit Celeron).

Now 20 significant figures are enough to tell the difference between two doubles, so this looks a little strange.

More investigation showed that:

double x = f(); assert(x == f());

failed as well and I started to get worried. It’s true that floating point numbers aren’t equal to themselves if they are NaNs, but that didn’t seem to be the case here. Also, the assertion didn’t fail for code that had been optimized.

Some years ago I did some work on code generation for the 387 FPU and then we had trouble with rounding – test programs for the 387 would produce slightly different results than for other architectures like MIPS and Sparc. IEEE754 is pretty precise about how rounding should be done so this seemed a bit strange until we realized that the 387 uses 80-bit precision internally, with rounding to 64 bit doubles only happening when transferring data from the FPU to memory (FISTP and friends). There is a flag to set the internal rounding mode to double precision and once we had set that we got consistent results on all architectures.

Looks like something similar is happening here and a look at the disassembly confirms this:

call _Z1av fstpl 24(%esp) call _Z1av fldl 24(%esp) fxch %st(1) fucompp fnstsw %ax

The first result of calling a() gets written out to memory as a double, hence is rounded, but then it’s reloaded but compared to the unrounded result from the second call.

Investigating further exposes more interesting behaviour:

#include <stdio.h> #include <math.h> #include <assert.h> int main() { double x,y,z; int a; x = 1.0; while (x != 0.0) { y = x; x /= 2; } a = y > 0; printf("%.40Lg\n", (long double)y); printf("%d\n", a); printf("%.20g\n", y/y); printf("%.40Lg\n", (long double)y); x = 1.0; y = nextafter(x,2.0); z = (x+y)/2.0; printf("%d\n", z > x && z < y); x = 0.5; while ((z = sqrt(x)) != x) x = z; y = x; while ((z = x*x) != x) x = z; printf("%.40Lg\n%.40Lg\n", (long double)x, (long double)y); }

Now we have completely different behaviour between optimized and unoptimized code:

`$ g++ -Wall fp.cpp`

$ ./a.out

4.940656458412465441765687928682213723651e-324

1

1

4.940656458412465441765687928682213723651e-324

0

1

1

Unoptimized, pretty much as expected, we iterate down to the smallest representable float, which we divide by itself and get 1. If we average one and the next number, we don’t get a number in between; and finally, if we iterate square roots, we eventually arrive at 1, and squaring that, just gives 1 again.

`$ g++ -O3 -Wall fp.cpp`

$ ./a.out

3.645199531882474602528405933619419816399e-4951

0

-nan

0

1

0

0.9999999999999999999457898913757247782996

Bah, we seem to have iterated right down to the smallest extended value there, but when we divide it by itself, we get a NaN – it seems to have been magically converted to zero, as confirmed by printing it out again (presumably the 387 register has been stored temporarily in memory, while we did the printf, and truncated to 64 bits on the way. Now, we are able to find a value between 1 and 1 + ε, and finally, our repeated square root function seems to converge on something slightly less that 1 and when we go back the other way, we end up at 0 instead.

Fortunately, we don’t need to worry about what to do about this or whether it’s the chip designers or the language designers or the FP standard designers or the compiler writers who are getting it wrong (or maybe it really is that FP numbers are made of jelly). Newer x86 processors have SSE registers for FP, which are 64 bit doubles throughout (as do most other common architectures), so these problems are less likely to occur:

... call _Z3av movsd %xmm0, -8(%rbp) call _Z3av ucomisd -8(%rbp), %xmm0 jne .L8 ...

and if we run our little test program (this time with gcc 4.7.2) on a 64-bit processor we get:

`$ g++ -Wall fp.cpp`

$ ./a.out

4.940656458412465441765687928682213723651e-324

1

1

4.940656458412465441765687928682213723651e-324

0

0

0.9999999999999998889776975374843459576368

for all levels of optimization.

We can force use of the 387 instructions with `-mfpmath=387`

and then we get:

`$ g++ -O3 -mfpmath=387 -Wall fp.cpp`

$ ./a.out

3.645199531882474602528405933619419816399e-4951

0

-nan

0

1

0

0.9999999999999999999457898913757247782996

again, so it doesn’t look like our problems are related to the particular gcc version.

With the new code, we lose the benefits of extra internal precision, but on the whole I prefer having half a chance of understanding what’s going on.

# Complex Fibonacci

**Posted:**December 22, 2012

**Filed under:**Floating Point, Number Theory Leave a comment

There are a couple of ways to extend the Fibonacci series to a continuous function on real or complex numbers. Here’s a nice one: the grid shows the mapping of the unit strip, `0 <= re(z)`

, `0 <= im(z) <= 1`

, grid points are 0.1 apart. The real axis is picked out in red, notice that it crosses itself exactly at the points of the Fibonacci series – twice at 1 after executing a loop-the-loop (clicking on the image should give you a larger version).

The function is just the standard Binet formula, interpreted over the complex plane:

`(φ`

^{z} - -φ^{-z})/√5

or equivalently:

`(φ`

^{z} - e^{iπz}φ^{-z})/√5

For negative values, the function turns into a nice spiral (the `exp`

component dominates). Here is the `0 <= im(z) <= 0.1`

strip:

The red parallelogram illustrates `F(z+2) = F(z+1)+F(z)`

still holds, even for our complex quantities.

Most of the action takes place around the origin, so let’s zoom in:

The image of the real axis passing through the points -1,1,0,1,1,2,… is clearly visible (the real axis is on the outside, the origin is the leftmost point on the red parallelogram).

Sometimes the series is extended as `(φ`

which produces a mapping that takes the real line to itself, ie. taking the real part of the exponential, which is also nice, here we see the mapping of the strip ^{z} - cos(πz)φ^{-z})/√5`0.1 <= im(z) <= 0.2`

:

Again, the red parallelogram illustrates the recurrence.

# Inverse Square Root II

**Posted:**November 25, 2012

**Filed under:**Bit twiddling, Floating Point Leave a comment

Hmmm, after writing that rather long and laborious post about the bit-level interpretation of fast inverse square root, I came across a couple of excellent blog posts by Christian Plesner Hansen:

http://blog.quenta.org/2012/09/0x5f3759df.html

http://blog.quenta.org/2012/09/0x5f3759df-appendix.html

which got me thinking about all this again (it’s a real mind-worm this one)…

So, now the scales have dropped from my eyes, let’s try again: what we are really doing is using a rough approximation to log2(n), represented as a fixed point number, that happens to be easily computed from the floating point representation. Once we have our approximate log (and exp) function, we can do the the usual computations of roots etc. using normal arithmetic operations on our fixed point representation, before converting back to a float using our exp function.

So, take a number, 2^{n}(1+m) where 0<=m<1, a reasonable approximation to log_{2}(x) is n+m, and we can improve the approximation by adding a small constant offset, σ and, because we are doing this in the fixed point realm, everything works out nicely when we convert back to the floating point realm. Here is a picture for n=0, choosing by eye a value of σ = 0.045:

Now, if interpret a non-negative IEEE-754 float 2^{n}(1+m) as a 9.23 fixed point value (ie. with 9 bits to the left of the binary point, 23 bits to the right), then this fixed point number is (e+m), where e = n+127, and, as above, this is approximates log_{2}(2^{e}(1+m)), so e-127+m = n+m is an approximation to log_{2}(2^{e-127}(1+m)) = log_{2}(2^{n}(1+m)), ie. log_{2} of our original number. Note that e+m is always positive, but e+m-127 may not be, so might need to be represented as a signed value.

As far as actual code goes, first we need to be able to get at the bitwise representation of floats. It’s nice to avoid aliasing issues etc. by using `memcpy`

; my compiler (gcc 4.4.3) at least generates sensible code for this:

static inline float inttofloat(uint32_t n) { float f; memcpy(&f,&n,sizeof(f)); return f; } static inline uint32_t floattoint(float f) { uint32_t n; memcpy(&n,&f,sizeof(n)); return n; }

We are working in our fixed point representation, so adding 127 is actually done by adding 127<<23 = 0x3f800000 and we convert our σ offset in the same way:

uint32_t sigma = 45*(uint32_t(1)<<23)/1000; uint32_t c = (uint32_t(127)<<23) - sigma; int32_t alog(float x) { return floattoint(x)-c; }

We also want an inverse of our approximate log, ie. an approximate exp, but this is just a reverse of the previous step:

float aexp(int32_t n) { return inttofloat(n+c); }

Provided we haven’t left the realm of representable floats, the sign bit of the result should alway be zero.

We can use our log approximation directly as well of course, to convert from fixed point we need:

float unfix(int32_t n) { return float(n)/(1<<23); } float unfix(uint32_t n) { return float(n)/(1<<23); }

Another nice picture, showing our `log2`

approximation over a wider range:

Now we can define:

float sqrt(float x) { return aexp(alog(x)/2); } float invsqrt(float x) { return aexp(alog(x)/-2); } float cubert(float x) { return aexp(alog(x)/3); }

and so on.

To relate this to the original magic function, for an inverse square root, the integer manipulations are:

invsqrt(n) = -((n-c)/2) + c = -(n/2 - c/2) + c // Not exact, but close enough = c/2 - n/2 + c = -(n>>1) + ((c>>1) + c)

Calculating (c>>1) + c we get 0x5f375c29, satisfyingly close to the original magic constant 0x5f3759df…

We can use signed or unsigned fixed point numbers for our logs, the arithmetic will be the same, providing we can avoid overflow, so for example, if to get an approximate value for `log(factorial(50))`

, we can do:

uint32_t fact = 0; for (int i = 1; i <= 50; i++) { fact += alog(i); } printf("%g\n", unfix(fact));

giving the result 213.844, comparing nicely to the true result of 214.208. Note that if I were to have used a signed value for fact, the result would have overflowed.

Be warned, this is only a very rough approximation to the log2 function, only use it if a very crude estimate is good enough, or if you are going do some further refinement of the value. Alternatively, your FPU almost certainly uses some variation of this technique to calculate logs (or at least an initial approximation) so you could just leave it to get on with it.

Like many things, this isn’t particularly new, the standard reference is:

J. N. Mitchell, “Computer multiplication and division using binary logarithms,” IRE Trans. Electron. Computers, vol. 11, pp. 512–517, Aug. 1962

# Fast inverse square root

**Posted:**November 19, 2012

**Filed under:**Bit twiddling, C++, Floating Point 2 Comments

What could be cuter than the infamous fast inverse square root function used in the Quake 3 engine:

http://en.wikipedia.org/wiki/Fast_inverse_square_root

The interesting part is the calculation of the initial approximation, splitting this down into the basic steps, we have:

float isqrt(float f) { uint32_t n = floattoint(f); n >>= 1; n = -n; n += 0x5f000000; n += 0x3759df; float x = inttofloat(n); return x; }

To get some insight into what’s going on here, we need to look at the representation of floating point numbers. An IEEE-754 float consists of a sign bit, an exponent value, and a mantissa. The exponent is an 8-bit value, the mantissa has 23 bits, both unsigned. As usual, a suitable notation is key: simplifying a little (specifically, ignoring NaNs, infinities and denormalized numbers), we shall write a float of the form `{sign:1;exponent:8;mantissa:23}`

as `(s,e,m)`

, with `m`

a real in the range `[0,1)`

, and this represents `2`

, negated if the sign bit is 1.^{e-127}(1+m)

To warm up, it’s helpful to look at a simpler example:

float fsqrt(float f) { unsigned n = floattoint(f); n >>=1; n += 0x1f800000; n += 0x400000; return inttofloat(n); }

Here we are computing an approximation to a normal square root.

Taking it a step at a time: first the shift right `n >>= 1`

, there are two cases for odd and even exponent:

(0,2e,m) => (0,e,m/2) // No carry from exponent (0,2e+1,m) => (0,e,1/2 + m/2) // Carry into mantissa

For `n += 0x1f800000`

: we are adding 63 (0x1f800000 is 63 << 23) onto the exponent:

(s,e,m) => (s,e+63,m)

And finally, `n += 0x400000`

: Generally, if we add p onto the mantissa, where 0 <= p < 2^{23}, and writing m’ = m + p/2^{23}, we have:

(s,e,m) => (s,e,m') if m' < 1 (s,e,m) => (s,e+1,m'-1) if m' >= 1

For `p = 0x400000 = 2`

, we have ^{22}`m' = m+1/2`

. ie:

(s,e,m) => (s,e+1,m'-1) if m' >= 1 => (s,e,m') otherwise

Putting this together, for even powers of two:

2^{2k}(1+m) => 2^{k}(1+m/2): (0,2k+127,m) => (0,k+63,0.5+m/2) => (0,k+126,0.5+m/2) => (0,k+127,m/2)

And for odd powers:

2^{(2k+1)}(1+m) => 2^{k}(1+0.5+m/2): (0,2k+1+127,m) => (0,k+64,m/2) => (0,k+127,m/2) => (0,k+127,0.5+m/2)

Let’s check this is sensible by setting m = 0:

2^{2k}=> 2^{k}2^{(2k+1)}=> 2^{k}(1+0.5)

and putting m = 1 we get:

2^{2k+1}=> 2^{k}(1+1/2) 2^{(2k+2)}=> 2^{k+1}

Our approximation is linear in between powers of two, and continuous at those points too. Also at even powers the graph is tangent to `sqrt(x)`

.

This is all nicely illustrated in a picture:

Returning to the original inverse function, we have an additional negation step, `n = -n`

: to negate a twos-complement number, we flip the bits and add one. There are two main cases, depending on whether the mantissa is zero. If it is zero, there is a carry into the exponent, otherwise we just flip the bits of the exponent. The sign bit will always end up set (I’m ignoring the case when the exponent is zero). The general effect is:

(0,e,0) => (1,256-e,0) (0,e,m) => (1,255-e,1-m)

This time, we are adding 190 onto the exponent (0x5f000000 = 190<<23) – this has the dual effect of resetting the sign bit to 0 and subtracting 66 from the exponent (190 = -66 mod 256).

Let’s see what happens with odd and even powers of two; writing the magic offset added onto the mantissa as `c`

:

(0,2k+127,m) => (0,k+63,0.5+m/2) // n >>= 1 => (1,255-k-63,0.5-m/2) // n = -n => (0,126-k,0.5-m/2) // n += 0x5f000000 => (0,126-k,0.5-m/2+c) // if 0.5-m/2+c < 1 => (0,127-k,-0.5-m/2+c) // if 0.5-m/2+c >= 1

(0,2k+128,m) => (0,k+63,m/2) // n >>= 1 => (1,255-k-63,1-m/2) // n = -n => (0,126-k,1-m/2) // n += 0x5f000000 => (0,126-k,1-m/2+c) // if 1-m/2+c < 1 => (0,127-k,-m/2+c) // if 1-m/2+c >= 1

If we use 0x400000 as the offset, ie. `c`

above is 0.5, and put m=0 in the two cases, we have:

2^{2k}=> 1/2^{k}2^{2k+1}=> 1.5/2^{k+1}

Once again, our approximation coincides exactly at even powers of two, and as before it’s useful to have a picture:

We don’t have nice tangents this time, but the end result isn’t too bad.

We probably could have saved ourselves some work here by noting that the mapping between 32-bit integers (as signed magnitude numbers) and the corresponding floats is monotonic and continuous (for some sense of continuous), so composing with other (anti)monotonic operations gives an (anti)monotonic result, so having checked our function is correct at powers of two, it can’t go too badly adrift in between.

We can improve our approximation by using a smaller mantissa offset, `0x3759df`

, and we end up with the function we came in with:

Not especially pretty, but a good starting point for some Newton-Raphson. Notice that as well as kinks at exact powers of two, this approximation has kinks in between as well (when adding the constant to the mantissa overflows into the exponent).

# A Familiar Series?

**Posted:**October 4, 2012

**Filed under:**Floating Point Leave a comment

```
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 18 20 22 24 26 28 30
32 36 40 44 48 52 56 60
64 72 80 88 96 104 112 120
128 144 160 176 192 208 224 240
256 288 320 352 384 416 448 480
512 576 640 704 768 832 896 960
1024 1152 1280 1408 1536 1664 1792 1920
2048 2304 2560 2816 3072 3328 3584 3840
4096 4608 5120 5632 6144 6656 7168 7680
8192 9216 10240 11264 12288 13312 14336 15360
16384 18432 20480 22528 24576 26624 28672 30720
32768 36864 40960 45056 49152 53248 57344 61440
65536 73728 81920 90112 98304 106496 114688 122880
131072 147456 163840 180224 196608 212992 229376 245760
262144 294912 327680 360448 393216 425984 458752 491520
524288 589824 655360 720896 786432 851968 917504 983040
1048576 1179648 1310720 1441792 1572864 1703936 1835008 1966080
2097152 2359296 2621440 2883584 3145728 3407872 3670016 3932160
4194304 4718592 5242880 5767168 6291456 6815744 7340032 7864320
```

In fact, they are floating point numbers, or rather, integers with the same distribution as floating point numbers, and just need some appropriate scaling, for example by dividing by 8388608, to give a nice set of numbers in the interval [0,1).

Notice that for all but the first row, every entry in a row has the same number of digits in its binary representation – they are normalized, with the first row corresponding to the denormalized numbers of IEEE-754. Without the denormalized numbers, there would be a great gap around zero.

Also, the numbers in each row are the same distance apart, with the gap at row n+1 being twice the gap at row n. Again, the first row is the exception, the gap here is the same as the second row – so the first 16 numbers form a uniformly spaced sequence from 0 – there isn’t anything very peculiar about the denormalized numbers, they just represent a continuation of the first row of normalized numbers to fill the gap down to zero.

We can see how a scaled comparison of floats might work too – within each row, two elements are close if they are within n places of each other, or equivalently, if they are within some constant multiple of the row gap. For elements in different rows, we can either just carry on counting the number of intervening numbers, or continue using the row gap to define a neighbourhood of each number. For example, if “close” means “twice the row gap”, then 22 is close to 18, 20, 24 and 26; 16 is close to 12, 13, 14. 15, 18 and 20; and 15 is close to just 13, 14 and 16. This notion of closeness is discussed by Knuth in TAOCP.

We don’t have to insist on base 2 of course, here’s a similar sequence based on powers of 3, with 10 elements in each segment. This gives us 5 denormalized values:

```
0 1 2 3 4
5 6 7 8 9 10 11 12 13 14
15 18 21 24 27 30 33 36 39 42
45 54 63 72 81 90 99 108 117 126
135 162 189 216 243 270 297 324 351 378
405 486 567 648 729 810 891 972 1053 1134
1215 1458 1701 1944 2187 2430 2673 2916 3159 3402
3645 4374 5103 5832 6561 7290 8019 8748 9477 10206
10935 13122 15309 17496 19683 21870 24057 26244 28431 30618
32805 39366 45927 52488 59049 65610 72171 78732 85293 91854
98415 118098 137781 157464 177147 196830 216513 236196 255879 275562
295245 354294 413343 472392 531441 590490 649539 708588 767637 826686
885735 1062882 1240029 1417176 1594323 1771470 1948617 2125764 2302911 2480058
```

Once again, each row, except the first, every number has the same number of digits in its base 3 representation.