Testing for Triangularity

A company I used to work for had a standard interview problem of writing a short function that would take 3 numbers (usually assumed to be integers) and determine if they formed a valid triangle. There was much discussion about the best/fastest/clearest/most elegant way of doing this, but I don’t remember the topic of what to do if the numbers were floating point coming up.

Curiously, being interviewed myself at another company not long ago I was asked this very question and wasn’t quite sure what the best thing to do was – for integers, my preferred solution is to sort the 3 numbers as a >= b >= c, then we have a triangle just when c > 0 and a < b+c (or
a-b < c if we are worried about overflow). What about if they are floating point numbers? Clearly if b >> c, then b+c might lose precision, so, for example, we might have c > 0 and b = a, so it’s valid, but if b+c = b, then our test fails.

We could try the same trick as for integers, testing for a-b < c – subtracting the two largest should minimize the rounding error; this seems plausible, but can we be sure we won’t get it wrong?

An earlier post looked at the sequence of integers:

0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 18 20 22 24 26 28 30
32 36 40 44 48 52 56 60
64 72 80 88 96 104 112 120

which are really just thinly disguised (non-negative) floating point numbers. An important feature, obvious here and easy to prove of “real” FP numbers, is that, apart from the first two rows, each row is the double of the row above it, therefore, 2n is representable exactly iff n is. Furthermore, it’s pretty obvious that if a and b are representable, with a/2 <= b <= a, then a-b is representable (and if a and b are numbers in the first two rows, then all differences are representable, so we don’t need to worry about them).

Now let’s return to our triangle: we have a >= b >= c > 0 and clearly a,b,c must be representable. If all the numbers are in the first two rows, then all operations are exact and we get the right answer. Otherwise, a is even: first, consider the case b >= a/2, as above, this implies that a-b is representable and the comparison with c must be exact, so our test is correct. Otherwise, if b < a/2, then since c <= b, c + b <= 2b < a, so we don’t have a triangle, and, assuming that rounding is monotonic (ie. x <= y => round(x) <= round(y), or equivalently, x < y => round(x) <= round(y)), our check works out correct too:

b < a/2, so b+c < a (exact) so c < a-b (exact) so c = round(c) <= round(a-b)


As usual for this blog, we are merely paddling at the shore of a much deeper sea. For much more on triangles, floating point approximations etc, see Kahan’s paper:

Click to access Triangle.pdf

(or indeed anything else by the great man:



Floating point numbers are made of jelly.

Floating point equality can be quite interesting, leading some people to think that they are actually made of some sort of jelly, and don’t in fact have precise values. This isn’t true of course but sometimes things can get quite bizarre. I was surprised to find this assertion:

printf("%.20g %.20g\n", f(), g());
assert(f() == g());

failing with this output:

$ ./foo
0.9979466119096509491 0.9979466119096509491
foo: foo.cpp:15: int main(): Assertion `f() == g()' fails

(All code compiled with gcc 4.4.3, on Ubuntu 10.04, 32 bit Celeron).

Now 20 significant figures are enough to tell the difference between two doubles, so this looks a little strange.

More investigation showed that:

double x = f();
assert(x == f());

failed as well and I started to get worried. It’s true that floating point numbers aren’t equal to themselves if they are NaNs, but that didn’t seem to be the case here. Also, the assertion didn’t fail for code that had been optimized.

Some years ago I did some work on code generation for the 387 FPU and then we had trouble with rounding – test programs for the 387 would produce slightly different results than for other architectures like MIPS and Sparc. IEEE754 is pretty precise about how rounding should be done so this seemed a bit strange until we realized that the 387 uses 80-bit precision internally, with rounding to 64 bit doubles only happening when transferring data from the FPU to memory (FISTP and friends). There is a flag to set the internal rounding mode to double precision and once we had set that we got consistent results on all architectures.

Looks like something similar is happening here and a look at the disassembly confirms this:

	call	_Z1av
	fstpl	24(%esp)
	call	_Z1av
	fldl	24(%esp)
	fxch	%st(1)
	fnstsw	%ax

The first result of calling a() gets written out to memory as a double, hence is rounded, but then it’s reloaded but compared to the unrounded result from the second call.

Investigating further exposes more interesting behaviour:

#include <stdio.h>
#include <math.h>
#include <assert.h>
int main()
  double x,y,z;
  int a;
  x = 1.0;
  while (x != 0.0) {
    y = x; x /= 2;
  a = y > 0;
  printf("%.40Lg\n", (long double)y);
  printf("%d\n", a);
  printf("%.20g\n", y/y);
  printf("%.40Lg\n", (long double)y);

  x = 1.0;
  y = nextafter(x,2.0);
  z = (x+y)/2.0;
  printf("%d\n", z > x && z < y);

  x = 0.5;
  while ((z = sqrt(x)) != x) x = z;
  y = x;
  while ((z = x*x) != x) x = z;
  printf("%.40Lg\n%.40Lg\n", (long double)x, (long double)y);

Now we have completely different behaviour between optimized and unoptimized code:

$ g++ -Wall fp.cpp
$ ./a.out

Unoptimized, pretty much as expected, we iterate down to the smallest representable float, which we divide by itself and get 1. If we average one and the next number, we don’t get a number in between; and finally, if we iterate square roots, we eventually arrive at 1, and squaring that, just gives 1 again.

$ g++ -O3 -Wall fp.cpp
$ ./a.out

Bah, we seem to have iterated right down to the smallest extended value there, but when we divide it by itself, we get a NaN – it seems to have been magically converted to zero, as confirmed by printing it out again (presumably the 387 register has been stored temporarily in memory, while we did the printf, and truncated to 64 bits on the way. Now, we are able to find a value between 1 and 1 + ε, and finally, our repeated square root function seems to converge on something slightly less that 1 and when we go back the other way, we end up at 0 instead.

Fortunately, we don’t need to worry about what to do about this or whether it’s the chip designers or the language designers or the FP standard designers or the compiler writers who are getting it wrong (or maybe it really is that FP numbers are made of jelly). Newer x86 processors have SSE registers for FP, which are 64 bit doubles throughout (as do most other common architectures), so these problems are less likely to occur:

    call    _Z3av
    movsd    %xmm0, -8(%rbp)
    call    _Z3av
    ucomisd    -8(%rbp), %xmm0
    jne    .L8

and if we run our little test program (this time with gcc 4.7.2) on a 64-bit processor we get:

$ g++ -Wall fp.cpp
$ ./a.out

for all levels of optimization.

We can force use of the 387 instructions with -mfpmath=387 and then we get:

$ g++ -O3 -mfpmath=387 -Wall fp.cpp
$ ./a.out

again, so it doesn’t look like our problems are related to the particular gcc version.

With the new code, we lose the benefits of extra internal precision, but on the whole I prefer having half a chance of understanding what’s going on.

Complex Fibonacci

There are a couple of ways to extend the Fibonacci series to a continuous function on real or complex numbers. Here’s a nice one: the grid shows the mapping of the unit strip, 0 <= re(z), 0 <= im(z) <= 1, grid points are 0.1 apart. The real axis is picked out in red, notice that it crosses itself exactly at the points of the Fibonacci series – twice at 1 after executing a loop-the-loop (clicking on the image should give you a larger version).


The function is just the standard Binet formula, interpreted over the complex plane:
z - -φ-z)/√5
or equivalently:
z - eiπzφ-z)/√5

For negative values, the function turns into a nice spiral (the exp component dominates). Here is the 0 <= im(z) <= 0.1 strip:

The red parallelogram illustrates F(z+2) = F(z+1)+F(z) still holds, even for our complex quantities.

Most of the action takes place around the origin, so let’s zoom in:


The image of the real axis passing through the points -1,1,0,1,1,2,… is clearly visible (the real axis is on the outside, the origin is the leftmost point on the red parallelogram).

Sometimes the series is extended as z - cos(πz)φ-z)/√5 which produces a mapping that takes the real line to itself, ie. taking the real part of the exponential, which is also nice, here we see the mapping of the strip 0.1 <= im(z) <= 0.2:


Again, the red parallelogram illustrates the recurrence.

Inverse Square Root II

Hmmm, after writing that rather long and laborious post about the bit-level interpretation of fast inverse square root, I came across a couple of excellent blog posts by Christian Plesner Hansen:



which got me thinking about all this again (it’s a real mind-worm this one)…

So, now the scales have dropped from my eyes, let’s try again: what we are really doing is using a rough approximation to log2(n), represented as a fixed point number, that happens to be easily computed from the floating point representation. Once we have our approximate log (and exp) function, we can do the the usual computations of roots etc. using normal arithmetic operations on our fixed point representation, before converting back to a float using our exp function.

So, take a number, 2n(1+m) where 0<=m<1, a reasonable approximation to log2(x) is n+m, and we can improve the approximation by adding a small constant offset, σ and, because we are doing this in the fixed point realm, everything works out nicely when we convert back to the floating point realm. Here is a picture for n=0, choosing by eye a value of σ = 0.045:

Now, if interpret a non-negative IEEE-754 float 2n(1+m) as a 9.23 fixed point value (ie. with 9 bits to the left of the binary point, 23 bits to the right), then this fixed point number is (e+m), where e = n+127, and, as above, this is approximates log2(2e(1+m)), so e-127+m = n+m is an approximation to log2(2e-127(1+m)) = log2(2n(1+m)), ie. log2 of our original number. Note that e+m is always positive, but e+m-127 may not be, so might need to be represented as a signed value.

As far as actual code goes, first we need to be able to get at the bitwise representation of floats. It’s nice to avoid aliasing issues etc. by using memcpy; my compiler (gcc 4.4.3) at least generates sensible code for this:

static inline float inttofloat(uint32_t n)
  float f;
  return f;

static inline uint32_t floattoint(float f)
  uint32_t n;
  return n;

We are working in our fixed point representation, so adding 127 is actually done by adding 127<<23 = 0x3f800000 and we convert our σ offset in the same way:

uint32_t sigma = 45*(uint32_t(1)<<23)/1000;
uint32_t c = (uint32_t(127)<<23) - sigma;
int32_t alog(float x) { 
  return floattoint(x)-c;

We also want an inverse of our approximate log, ie. an approximate exp, but this is just a reverse of the previous step:

float aexp(int32_t n) { 
  return inttofloat(n+c); 

Provided we haven’t left the realm of representable floats, the sign bit of the result should alway be zero.

We can use our log approximation directly as well of course, to convert from fixed point we need:

float unfix(int32_t n) {
  return float(n)/(1<<23);
float unfix(uint32_t n) {
  return float(n)/(1<<23);

Another nice picture, showing our log2 approximation over a wider range:


Now we can define:

float sqrt(float x) { return aexp(alog(x)/2); }
float invsqrt(float x) { return aexp(alog(x)/-2); }
float cubert(float x) { return aexp(alog(x)/3); }

and so on.

To relate this to the original magic function, for an inverse square root, the integer manipulations are:

invsqrt(n) = -((n-c)/2) + c
           = -(n/2 - c/2) + c     // Not exact, but close enough
           =  c/2 - n/2 + c
           = -(n>>1) + ((c>>1) + c)

Calculating (c>>1) + c we get 0x5f375c29, satisfyingly close to the original magic constant 0x5f3759df…

We can use signed or unsigned fixed point numbers for our logs, the arithmetic will be the same, providing we can avoid overflow, so for example, if to get an approximate value for log(factorial(50)), we can do:

  uint32_t fact = 0;
  for (int i = 1; i <= 50; i++) {
    fact += alog(i);
  printf("%g\n", unfix(fact));

giving the result 213.844, comparing nicely to the true result of 214.208. Note that if I were to have used a signed value for fact, the result would have overflowed.

Be warned, this is only a very rough approximation to the log2 function, only use it if a very crude estimate is good enough, or if you are going do some further refinement of the value. Alternatively, your FPU almost certainly uses some variation of this technique to calculate logs (or at least an initial approximation) so you could just leave it to get on with it.

Like many things, this isn’t particularly new, the standard reference is:

J. N. Mitchell, “Computer multiplication and division using binary logarithms,” IRE Trans. Electron. Computers, vol. 11, pp. 512–517, Aug. 1962

Fast inverse square root

What could be cuter than the infamous fast inverse square root function used in the Quake 3 engine:


The interesting part is the calculation of the initial approximation, splitting this down into the basic steps, we have:

float isqrt(float f)
  uint32_t n = floattoint(f);
  n >>= 1;
  n = -n;
  n += 0x5f000000;
  n += 0x3759df;
  float x = inttofloat(n);
  return x;

To get some insight into what’s going on here, we need to look at the representation of floating point numbers. An IEEE-754 float consists of a sign bit, an exponent value, and a mantissa. The exponent is an 8-bit value, the mantissa has 23 bits, both unsigned. As usual, a suitable notation is key: simplifying a little (specifically, ignoring NaNs, infinities and denormalized numbers), we shall write a float of the form {sign:1;exponent:8;mantissa:23} as (s,e,m), with m a real in the range [0,1), and this represents 2e-127(1+m), negated if the sign bit is 1.

To warm up, it’s helpful to look at a simpler example:

float fsqrt(float f)
  unsigned n = floattoint(f);
  n >>=1; 
  n += 0x1f800000;
  n += 0x400000;
  return inttofloat(n);

Here we are computing an approximation to a normal square root.

Taking it a step at a time: first the shift right n >>= 1, there are two cases for odd and even exponent:

(0,2e,m)   => (0,e,m/2)        // No carry from exponent
(0,2e+1,m) => (0,e,1/2 + m/2)  // Carry into mantissa

For n += 0x1f800000: we are adding 63 (0x1f800000 is 63 << 23) onto the exponent:

(s,e,m) => (s,e+63,m)

And finally, n += 0x400000: Generally, if we add p onto the mantissa, where 0 <= p < 223, and writing m’ = m + p/223, we have:

(s,e,m) => (s,e,m') if m' < 1
(s,e,m) => (s,e+1,m'-1) if m' >= 1

For p = 0x400000 = 222, we have m' = m+1/2. ie:

(s,e,m) => (s,e+1,m'-1) if m' >= 1 
        => (s,e,m') otherwise 

Putting this together, for even powers of two:

22k(1+m) => 2k(1+m/2):

(0,2k+127,m)   => (0,k+63,0.5+m/2) 
               => (0,k+126,0.5+m/2)
               => (0,k+127,m/2)

And for odd powers:

2(2k+1)(1+m) => 2k(1+0.5+m/2):

(0,2k+1+127,m) => (0,k+64,m/2)
               => (0,k+127,m/2)
               => (0,k+127,0.5+m/2)

Let’s check this is sensible by setting m = 0:

22k     => 2k
2(2k+1) => 2k(1+0.5)

and putting m = 1 we get:

22k+1   => 2k(1+1/2)
2(2k+2) => 2k+1

Our approximation is linear in between powers of two, and continuous at those points too. Also at even powers the graph is tangent to sqrt(x).

This is all nicely illustrated in a picture:

Returning to the original inverse function, we have an additional negation step, n = -n: to negate a twos-complement number, we flip the bits and add one. There are two main cases, depending on whether the mantissa is zero. If it is zero, there is a carry into the exponent, otherwise we just flip the bits of the exponent. The sign bit will always end up set (I’m ignoring the case when the exponent is zero). The general effect is:

(0,e,0) => (1,256-e,0)
(0,e,m) => (1,255-e,1-m)

This time, we are adding 190 onto the exponent (0x5f000000 = 190<<23) – this has the dual effect of resetting the sign bit to 0 and subtracting 66 from the exponent (190 = -66 mod 256).

Let’s see what happens with odd and even powers of two; writing the magic offset added onto the mantissa as c:

(0,2k+127,m) => (0,k+63,0.5+m/2)     // n >>= 1
             => (1,255-k-63,0.5-m/2) // n = -n
             => (0,126-k,0.5-m/2)    // n += 0x5f000000
             => (0,126-k,0.5-m/2+c)  // if 0.5-m/2+c < 1
             => (0,127-k,-0.5-m/2+c) // if 0.5-m/2+c >= 1
(0,2k+128,m) => (0,k+63,m/2)       // n >>= 1
             => (1,255-k-63,1-m/2) // n = -n
             => (0,126-k,1-m/2)    // n += 0x5f000000
             => (0,126-k,1-m/2+c)  // if 1-m/2+c < 1
             => (0,127-k,-m/2+c)   // if 1-m/2+c >= 1

If we use 0x400000 as the offset, ie. c above is 0.5, and put m=0 in the two cases, we have:

22k => 1/2k
22k+1 => 1.5/2k+1

Once again, our approximation coincides exactly at even powers of two, and as before it’s useful to have a picture:

We don’t have nice tangents this time, but the end result isn’t too bad.

We probably could have saved ourselves some work here by noting that the mapping between 32-bit integers (as signed magnitude numbers) and the corresponding floats is monotonic and continuous (for some sense of continuous), so composing with other (anti)monotonic operations gives an (anti)monotonic result, so having checked our function is correct at powers of two, it can’t go too badly adrift in between.

We can improve our approximation by using a smaller mantissa offset, 0x3759df, and we end up with the function we came in with:

Not especially pretty, but a good starting point for some Newton-Raphson. Notice that as well as kinks at exact powers of two, this approximation has kinks in between as well (when adding the constant to the mantissa overflows into the exponent).

A Familiar Series?

      0        1        2        3        4        5        6        7
      8        9       10       11       12       13       14       15
     16       18       20       22       24       26       28       30
     32       36       40       44       48       52       56       60
     64       72       80       88       96      104      112      120
    128      144      160      176      192      208      224      240
    256      288      320      352      384      416      448      480
    512      576      640      704      768      832      896      960
   1024     1152     1280     1408     1536     1664     1792     1920
   2048     2304     2560     2816     3072     3328     3584     3840
   4096     4608     5120     5632     6144     6656     7168     7680
   8192     9216    10240    11264    12288    13312    14336    15360
  16384    18432    20480    22528    24576    26624    28672    30720
  32768    36864    40960    45056    49152    53248    57344    61440
  65536    73728    81920    90112    98304   106496   114688   122880
 131072   147456   163840   180224   196608   212992   229376   245760
 262144   294912   327680   360448   393216   425984   458752   491520
 524288   589824   655360   720896   786432   851968   917504   983040
1048576  1179648  1310720  1441792  1572864  1703936  1835008  1966080
2097152  2359296  2621440  2883584  3145728  3407872  3670016  3932160
4194304  4718592  5242880  5767168  6291456  6815744  7340032  7864320

In fact, they are floating point numbers, or rather, integers with the same distribution as floating point numbers, and just need some appropriate scaling, for example by dividing by 8388608, to give a nice set of numbers in the interval [0,1).

Notice that for all but the first row, every entry in a row has the same number of digits in its binary representation – they are normalized, with the first row corresponding to the denormalized numbers of IEEE-754. Without the denormalized numbers, there would be a great gap around zero.

Also, the numbers in each row are the same distance apart, with the gap at row n+1 being twice the gap at row n. Again, the first row is the exception, the gap here is the same as the second row – so the first 16 numbers form a uniformly spaced sequence from 0 – there isn’t anything very peculiar about the denormalized numbers, they just represent a continuation of the first row of normalized numbers to fill the gap down to zero.

We can see how a scaled comparison of floats might work too – within each row, two elements are close if they are within n places of each other, or equivalently, if they are within some constant multiple of the row gap. For elements in different rows, we can either just carry on counting the number of intervening numbers, or continue using the row gap to define a neighbourhood of each number. For example, if “close” means “twice the row gap”, then 22 is close to 18, 20, 24 and 26; 16 is close to 12, 13, 14. 15, 18 and 20; and 15 is close to just 13, 14 and 16. This notion of closeness is discussed by Knuth in TAOCP.

We don’t have to insist on base 2 of course, here’s a similar sequence based on powers of 3, with 10 elements in each segment. This gives us 5 denormalized values:

     0       1       2       3       4
     5       6       7       8       9      10      11      12      13      14
    15      18      21      24      27      30      33      36      39      42
    45      54      63      72      81      90      99     108     117     126
   135     162     189     216     243     270     297     324     351     378
   405     486     567     648     729     810     891     972    1053    1134
  1215    1458    1701    1944    2187    2430    2673    2916    3159    3402
  3645    4374    5103    5832    6561    7290    8019    8748    9477   10206
 10935   13122   15309   17496   19683   21870   24057   26244   28431   30618
 32805   39366   45927   52488   59049   65610   72171   78732   85293   91854
 98415  118098  137781  157464  177147  196830  216513  236196  255879  275562
295245  354294  413343  472392  531441  590490  649539  708588  767637  826686
885735 1062882 1240029 1417176 1594323 1771470 1948617 2125764 2302911 2480058

Once again, each row, except the first, every number has the same number of digits in its base 3 representation.