How to Leak a Key

I’m sure everyone has been following the great Heartbleed crisis over the last week. The “Heartbleed Challenge” set up by Cloudflare established that you could get the server private key, specifically, the primes used to generate the key would show up in the Heartbleed data occasionally. Doing a memory dump of a simple OpenSSL server application (https://github.com/matthewarcus/ssl-demo) showed that after reading the keys at startup, it later copies the prime data to higher locations in the heap where they become more likely to be included in a Heartbeat response.

An extract from a hex dump of the heap after making a single request shows:

000273f0  00 00 00 00 01 00 00 00  51 00 00 00 00 00 00 00  # Prime1
00027400  85 52 42 a0 71 54 76 bd  98 49 4f 25 fc d3 e1 89
...
00027460  00 00 00 00 01 00 00 00  51 00 00 00 00 00 00 00  # Prime2
00027470  3b c8 8b 7d 52 74 db e0  af 9e 13 d7 a2 0a fc ea
...
0002d020  60 55 00 00 00 00 00 00  50 45 00 00 00 00 00 00  # Input buffer
0002d030  16 03 01 16 03 03 00 28  e8 d6 b3 68 87 08 d4 7f
...
00031570  00 00 00 00 00 00 00 00  c1 44 00 00 00 00 00 00  # Output buffer
00031580  00 00 00 16 03 03 00 28  82 c9 a8 9a 87 b5 cf e7
...
00038990  90 00 00 00 00 00 00 00  50 00 00 00 00 00 00 00  # Prime 2 again
000389a0  3b c8 8b 7d 52 74 db e0  af 9e 13 d7 a2 0a fc ea
...
00038b50  30 00 00 00 00 00 00 00  50 00 00 00 00 00 00 00  # Prime 1 again
00038b60  85 52 42 a0 71 54 76 bd  98 49 4f 25 fc d3 e1 89  

The first line in each section is the malloc header, the second line is part of the data; also, all of these malloced blocks are in use (otherwise some of the data would be overwritten with free list data).

The first two primes are from the initial processing of the private key, the next two items are the input and output buffers for the connection, then we have two more copies of the primes, which only appear after a connection has been made.

So where are these persistent copies of the primes coming from? Some investigation shows that the first time OpenSSL actually uses an RSA key for a private operation, it calculates various auxiliary values for use in Montgomery arithmetic and stores them away in the key structure. Unfortunately, one of the values that gets stored away is the modulus used, which in this case is one of the primes:

rsa_eay.c:

static int RSA_eay_mod_exp(BIGNUM *r0,
                           const BIGNUM *I,
                           RSA *rsa,
                           BN_CTX *ctx)
{
  ...
  if (rsa->flags & RSA_FLAG_CACHE_PRIVATE)
  {
    if (!BN_MONT_CTX_set_locked(&rsa->_method_mod_p,
                                CRYPTO_LOCK_RSA,
                                p, ctx))
      goto err;
    if (!BN_MONT_CTX_set_locked(&rsa->_method_mod_q,
                                CRYPTO_LOCK_RSA,
                                q, ctx))
      goto err;
    ...
   }
}

bn_mont.c:


BN_MONT_CTX *BN_MONT_CTX_set_locked(BN_MONT_CTX **pmont,
                                    int lock,
                                    const BIGNUM *mod,
                                    BN_CTX *ctx)
{
  ...
  if (!*pmont) {
    ret = BN_MONT_CTX_new();
    if (ret && !BN_MONT_CTX_set(ret, mod, ctx))
    ...
  }
}
...
int BN_MONT_CTX_set(BN_MONT_CTX *mont,
                    const BIGNUM *mod,
                    BN_CTX *ctx)
{
  int ret = 0;
  BIGNUM *Ri,*R;
  BN_CTX_start(ctx);
  if((Ri = BN_CTX_get(ctx)) == NULL) goto err;
  R= &(mont->RR); /* grab RR as a temp */
  if (!BN_copy(&(mont->N),mod)) goto err; /* Set N */
  ...
}

That BN_copy in the last line is the culprit. Once allocated, these values stay in memory until the RSA key is deleted. So even if the original key data is stored in protected or hard to get at memory, a Heartbleed attack may still be able to get to the primes.

This also explains why, as far as I know, no one has managed to capture the CRT parameters or the private exponent itself – they stay safely tucked away in the low part of the heap. Also, depending on the application startup behaviour, it might be that the primes end up below the input buffers (by default, OpenSSL caches up to 32 previously allocated buffers of each type) in which case they won’t be visible to Heartbleed – this might explain why in the Cloudflare challenge, only one of the primes seemed to be turning up.

Update 20/04/14: actually, it’s worse than this – bignum resizing can also leak private data onto the heap in unpredictable ways, see John Graham-Cumming’s post at https://blog.cloudflare.com/searching-for-the-prime-suspect-how-heartbleed-leaked-private-keys/


Loopless Ruler

This function (based on Algorithm L for loopless Gray code generation in TAOCP 7.2.1.1):

void ruler(int n)
{
  int f[n+1];
  for (int j = 0; j <= n; j++) {
    f[j] = j;
  }
  while(true) {
    visit(f,n+1);
    int j = f[0];
    f[0] = 0;
    if (j == n) break;
    f[j] = f[j+1];
    f[j+1] = j+1;
  }
}

enables us to compute the “ruler function”, ρ for 1, 2, 3…:

0 1 0 2 0 1 0 3 0 1 0 2 0 1 0...

It’s a “loopless” algorithm and uses f, an array of “focus pointers” to keep track of what’s going on.

The ruler function ρ(n) is essentially the number of trailing 0’s in the binary representation of n.

This is what’s printed by ruler(4):

0 1 2 3 4 
1 1 2 3 4 
0 2 2 3 4 
2 1 2 3 4 
0 1 3 3 4 
1 1 3 3 4 
0 3 2 3 4 
3 1 2 3 4 
0 1 2 4 4 
1 1 2 4 4 
0 2 2 4 4 
2 1 2 4 4 
0 1 4 3 4 
1 1 4 3 4 
0 4 2 3 4 
4 1 2 3 4 

The left hand side is the ruler function of course, what are the other values?

It’s clearer if we print f in reverse and match up with binary representations:

00000
43210
00001
43211
00010
43220
00011
43212
00100
43310
00101
43311
00110
43230
00111
43213
01000
44210
01001
44211
01010
44220
01011
44212
01100
43410
01101
43411
01110
43240
01111
43214

If n[j] is the jth binary digit of n, then f[j] is just j, unless n[j] is the first of a block of 1’s, in which case f[j] is the index of the first 0 after that block.

For example, 01101, has block starting at 0 and ending at 1, so f[0] = 1, and another block starting at 2, ending at 4, so f[2] = 4; for other j, f[j] = j, so the focus pointers are 43411.

If n is of the form:

j   k
01110111

with f[0] = k, f[k] = k, f[k+1] = j

0n incrementing we get:

j   k
01111000

Now f[0] = 0, f[k] = j and f[k+1] = k+1, other values of f are the same

If k is 0, then we go from:

j   k
01110

with f[0] = 0, f[1] = j

to:

j   k
01111

with f[0] = 0, f[1] = 1

Either way, we can see that the inner loop of ruler is correct.


Ringing the Changes

Apart from a brief visit many years ago to Macclesfield church with my friend Jo, I have never taken part in the ancient English art of bell-ringing, but permutation generation has always seemed a fascinating topic.

In TAOCP 7.2.1.2, Knuth has Algorithm P (“plain changes”) for generating all permutations using adjacent interchanges:

void taocp1(int *a, int n)
{
  int c[n+1];
  int o[n+1];
  for (int j = 1; j <= n; j++) {
    c[j] = 0; o[j] = 1;
  }
 P2:
  visit(a,n);
 P3:
  int j = n; int s = 0;
 P4:
  int q = c[j] + o[j];
  if (q < 0) goto P7;
  if (q == j) goto P6;
 P5:
  swap(a[j-c[j]+s],a[j-q+s]);
  c[j] = q;
  goto P2;
 P6:
  if (j == 1) return;
  s = s+1;
 P7:
  o[j] = -o[j];
  j = j-1;
  goto P4;
}

It works just fine, but I find it a little hard to follow and let’s face it, we just don’t write code like that anymore, with all those gotos, 1-based arrays, etc.

0-basing the arrays isn’t too much of a problem, for eliminating the gotos we could take the functional approach:

void taocp2(int *a, int n)
{
  int c[n];
  int o[n];
  for (int j = 0; j < n; j++) {
    c[j] = 0; o[j] = 1;
  }
  P2(a,c,o,n);
}

void P2(int *a, int *c, int *o, int n) {
  visit(a,n);
  P3(a,c,o,n);
}

void P3(int *a, int *c, int *o, int n) {
  P4(a,c,o,n,n-1,0);
}

void P4(int *a, int *c, int *o, int n, int j, int s) {
  int q = c[j] + o[j];
  if (q < 0) P7(a,c,o,n,j,s);
  else if (q == j+1) P6(a,c,o,n,j,s);
  else P5(a,c,o,n,j,s,q);
}

void P5(int *a, int *c, int *o, int n, int j, int s, int q) {
  swap(a[j-c[j]+s],a[j-q+s]);
  c[j] = q;
  P2(a,c,o,n);
}

void P6(int *a, int *c, int *o, int n, int j, int s) {
  if (j != 1) {
    P7(a,c,o,n,j,s+1);
  }
}

void P7(int *a, int *c, int *o, int n, int j, int s) {
  o[j] = -o[j];
  P4(a,c,o,n,j-1,s);
}

but somehow that seems missing the point (I don’t think that the worst thing about that code is that it still uses modifiable state).

There are actually two component parts to the algorithm: generating a “mixed radix reflected Gray code” which defines the progressive inversions of the array elements, and going from the inversion changes to the actual indexes of the swapped objects.

For generating the Gray codes, with a bit of rewriting, 0-basing our arrays etc. we get:

void gray(int n)
{
  int j, c[n], o[n];
  for (int j = 0; j < n; j++) {
    c[j] = 0; o[j] = 1;
  }
  do {
    visit(c,n);
    for (j = n-1; j > 0; j--) {
      int q = c[j] + o[j];
      if (q < 0) {
        o[j] = -o[j];
      } else if (q == j+1) {
        o[j] = -o[j];
      } else {
        c[j] = q;
        break;
      }
    } 
  } while (j != 0);
}

Each time around, find the highest element whose inversion count can be changed in the desired direction. For elements whose inversion count can’t be changed, change direction. If no element can be changed we are done.

The next step is to calculate the position of element j – but this is just j less the number of elements less than j that appear after j (ie. the number of inversions, c[j]), plus the number of elements greater than j that appear before j – but this is just the number of elements we have been passed over in the “if (q == j+1)” step above, so we can now add in the rest of algorithm P:

void taocp3(int *a, int n)
{
  int c[n];
  int o[n];
  for (int j = 0; j < n; j++) {
    c[j] = 0; o[j] = 1;
  }
  int j;
  do {
    visit(a,n);
    int s = 0;
    for (j = n-1; j > 0; j--) {
      int q = c[j] + o[j];
      if (q < 0) {
        o[j] = -o[j];
      } else if (q == j+1) {
        o[j] = -o[j];
        s++;          // This element will be before element j
      } else {
        swap(a[j-c[j]+s],a[j-q+s]);
        c[j] = q;
        break;
      }
    } 
  } while (j != 0);
}

Knuth’s algorithm essentially Dijkstra’s:

Click to access EWD553.PDF

which explains it very lucidly: there is a 1-1 correspondence between inversion counts & permutations – and a Gray enumeration of inversion gives us a sequence of permutations where only 1 element at a time changes its inversion count, and only by 1 or -1, which can only be if we exchange adjacent elements.

Knuth also gives a “loopless” algorithm for generating reflected Gray sequences, and we could use a table of element positions to construct the permutations from this:

void gray2(int *m, int n)
{
  int a[n],f[n+1],o[n];
  for (int j = 0; j < n; j++) {
    a[j] = 0; f[j] = j; o[j] = 1;
  }
  f[n] = n;
  while (true) {
    visit(a,n);
    int j = f[0];
    f[0] = 0;
    if (j == n) break;
    a[j] += o[j];
    if (a[j] == 0 || a[j] == m[j]-1) {
      o[j] = -o[j];
      f[j] = f[j+1];
      f[j+1] = j+1;
    }
  }
}

The “classic” Johnson-Trotter algorithm for plain changes involves consideration of “mobile” elements: each element has a current direction and it is “mobile” if if it greater than the next adjacent element (if there is one) in that direction. The algorithm proceeds by finding the highest mobile element and moving it accordingly:

void jt(int *a, int n)
{
  int o[n]; // Direction of element i, 1 = left, 0 = right
  int c[n]; // Position of element i
  for (int j = 0; j < n; j++) {
    o[j] = 1; c[j] = j;
  }
  int j; // Will be set to the highest mobile element
  do {
    visit(a,n);
    // Search high to low. 0 is never mobile
    for (j = n-1; j > 0; j--) {
      int p = c[j]; // Position of element j
      if (o[j]) { // Going left
        int k = a[p-1];
        if (p > 0 && k < j) {
          // Swap and adjust positions
          a[p] = k; c[k] = p;
          a[p-1] = j; c[j] = p-1; 
          break;
        }
      } else { // Going right
        int k = a[p+1];
        if (p < n-1 && k < j) {
          a[p] = k; c[k] = p;
          a[p+1] = j; c[j] = p+1; 
          break;
        }
      }
      o[j] = !o[j]; // Change direction & continue
    }
  } while (j > 0);
}

Note that we can check elements from high to low directly & stop as soon as a mobile element is found – there is no need to check every element.

Johnson-Trotter and Algorithm P are essentially doing the same thing: when moving the mobile element, we are changing its inversion count in the same way as we do directly in P.

Performance-wise, the two algorithms are very similar though (somewhat surprisingly to me) the Johnson-Trotter version seems a little faster.

Either way, my laptop generates the 3628800 permutations of 10 elements in 0.03 seconds, outputting them (even to /dev/null) takes 4.5 seconds. It takes about 3 seconds to run through the 479001600 permutations of 12 elements (I haven’t tried outputting them). We can streamline the computation further by taking advantage of the fact that most of the time we are just moving the highest element in the same direction, but that can wait for another day.

Finally, here is a function that gives the next permutation in plain changes without maintaining any state:

bool nextperm(int *a, int n)
{
  int c[n], o[n];
  for (int i = 0; i < n; i++) {
    c[i] = o[i] = 0;
  }
  // Count inversions
  for (int i = 0; i < n; i++) {
    for (int j = i+1; j < n; j++) {
      if (a[j] < a[i]) c[a[i]]++;
    }
  }
  // Find directions
  for (int i = 1, k = 0; i < n-1; i++) {
    k += c[i];
    o[i+1] = k%2;
  }
  // Find mobile element and move it.
  for (int j = n-1, s = 0; j > 0; j--) {
    if (!o[j]) {
      if (c[j] < j) {
	swap(a[j-c[j]+s],a[j-c[j]+s-1]);
	return true;
      }
      s++;
    } else {
      if (c[j] > 0) {
	swap(a[j-c[j]+s],a[j-c[j]+s+1]);
	return true;
      }
    }
  }
  return false;
}

void stateless(int *a, int n)
{
  do {
    visit(a,n);
  } while (nextperm(a,n));
}

We use the Knuth/Dijkstra trick to avoid building a position table. I suspect that n-squared inversion counting can be improved, but we still come in at under second for all permutations of 10, though that’s about 30 times slower than the optimized version.


valof

I’ve never written much BCPL, but it always seemed like a nice language. Many features ended up in C of course, but one that didn’t was VALOF, allowing the inclusion of a statement block in an expression, and using RESULTIS to return a value. Using slightly invented syntax:

LET B = VALOF $(
        FOR I = 1 TO J DO $(
                IF I*I = J THEN RESULTIS 1
        $)
        RESULTIS 0
$)

GCC defines the block-expression extension, allowing us to write:

int main(int argc, char *argv[])
{
  int j = atoi(argv[1]);
  printf("%d\n";,
         ({
           bool found = false;
           for (int i = 1; i &lt; j; i++) {
             if (i*i == j) {
               found = true;
               goto end;
             }
           }
         end:
           found;
         }));
}

but having to use the last expression in the block as the return value can get a bit convoluted.

With C++11, we can do better with immediately applied lambda expressions, with return taking the role of RESULTIS:

int main(int argc, char *argv[])
{
  int j = atoi(argv[1]);
  printf("%d\n";,
         [=](){
           for (int i = 1; i &lt; j; i++) {
             if (i*i == j) return true;
           }
           return false;
         }());
}

If we don’t like the smiley syntax, we can even define a VALOF macro:

#define VALOF(e) [=](){e}()
int main(int argc, char *argv[])
{
  int j = atoi(argv[1]);
  printf("%d\n";,
         VALOF(
             for (int i = 1; i &lt; j; i++) {
               if (i*i == j) return true;
             }
             return false;
           ));
}

though we’ll have to be a little bit careful with commas not enclosed in parentheses (and we might end up needing different macros for different variable capture methods).

Curiously, this idiom seems to be popular in Javascript.