Every programming blog should have a monad tutorial, so here we go: we can define a simple denotational semantics for a typed functional language as:

ℰ ⟦v:A⟧ρ = ρ(v): A
ℰ ⟦λx:A.e⟧ρ = λa:A.ℰ ⟦e⟧ρ[x→a]: A → B
ℰ ⟦e1 e2⟧ρ =
  let f: A → B = ℰ ⟦e1⟧ρ
      a: A = ℰ ⟦e2⟧ρ
  in f a: B

Here ρ is an environment, mapping variables to values. Note that in the rule for a λ expression, the λ on the right hand side defines a function in the domain of values, whereas the λ on the left hand side is just a linguistic construct. We could decorate every expression with a type, but that would get untidy. There will be other rules for specific operations on whatever actual datatypes are around, but this gives the underlying functional basis on which everything else depends.

We can see that ℰ⟦e⟧ρ is just a value in some semantic domain, which presumably contains some basic types and functions between values, and the type of ℰ is something like:

ℰ: Exp[A] → Env → A

where Exp[A] is the set of expressions of type A (I’m not going to be rigorous about any of this, I’m assuming we have some type system where this sort of thing makes sense, and also I’m not going to worry about the difference between a syntactic type and a semantic type) and Env is the type of environments.

Just for fun, let’s make a distinction (not that there really is one here) between “ordinary” values and “semantic” values, with M[A] being the semantic value with underlying value type A (imagine an ML or Haskell style type constructor M, with a value constructor, also called M, though often we’ll ignore the distinction between the underlying type and the constructed type).

Now ℰ has type:

ℰ: Exp[A] → Env → M[A]

and the underlying value of a function of type A → B is now A → M[B].

We can also rewrite our semantic equations and take a little time persuading ourselves this is equivalent to the definitions above:

ℰ ⟦v:A⟧ρ = inj(ρ(v)): M[A]
ℰ ⟦λx:A.e⟧ρ = inj(λa:A.ℰ ⟦e⟧ρ[x→a]): M[A → M[B]]
ℰ ⟦e1 e2⟧ρ =
  let a1: M[A → M[B]] = ℰ ⟦e1⟧ρ
      a2: M[A] = ℰ ⟦e2⟧ρ
  in apply (λf.apply f a2) a1: M[B]

inj and apply are:

inj: A → M[A]
inj(a:A) = M(a) : M[A]
apply: (A → M[B]) → M[A] → M[B]
apply f (M a) = f a

These functions should look familiar; they are the standard monad operations & using a different monad will give us a different semantics for our basic functional operations.

Let’s introduce state. The basic denotational semantics is something like:

ℰ: Exp[A] → Env → State → (A, State)

ℰ ⟦v:A⟧ρ σ = (ρ(v), σ)
ℰ ⟦λx.e⟧ρ σ = (λa.ℰ ⟦e⟧ρ[x→a], σ)
ℰ ⟦e1 e2⟧ρ σ =
  let (f, σ') = ℰ ⟦e1⟧ρ σ
      (a, σ'') = ℰ ⟦e2⟧ρ σ'
  in f a σ''

(I’ve omitted type decorations here for clarity).

Let’s do the same trick with a special semantic domain (though this time we’ll leave the type constructors implicit) and we have:

M[A] = State → (A, State)

inj a σ = (a, σ)
apply f g σ =
  let (a, σ') = g σ
  in f a σ'

and we can see that we can just plug these definitions into our generic semantic equations above and get something equivalent to the specific state semantics.

So, a monad is just a typed semantic domain together with the operations necessary to specify the standard functional constructions over that domain. Which sort of makes sense, but it’s nice to see it just drop out of the equations (and of course it’s nice to see that a standard denotational semantics for something like state does actually correspond quite closely with the monadic semantics).

None of this is new, in fact the use of monads to provide a uniform framework for program semantics goes right back to Eugenio Moggi’s original work in the 1980s (which was then taken up in functional programming where elements of the semantic domain itself are modelled as normal data objects).

# At The Bakery

After the topical excitement of the last couple of posts, let’s look at an all-time great – Leslie Lamport’s Bakery Algorithm (and of course this is still topical; Lamport is the most recent winner of the Turing Award).

The problem is mutual exclusion without mutual exclusion primitives. Usually, it’s described in the context of a shared memory system (and that is what we will implement here), but it will work equally well in a message-passing system with only local state (each thread or process only needs to write to its own part of the store).

For further details, and Lamport’s later thoughts see http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#bakery: “For a couple of years after my discovery of the bakery algorithm, everything I learned about concurrency came from studying it.” – and since Lamport understands more about concurrency than just about anyone on the planet, it’s maybe worth spending some time looking at it ourselves.

I’m not going to attempt to prove the algorithm correct, I’ll leave that to Lamport, but the crucial idea seems to me to be that a thread reading a particular value from another thread is a synchronization signal from that thread – here, reading a false value for the entering variable is a signal that the other thread isn’t in the process of deciding on its own number, therefore it is safe for the reading thread to proceed.

Implementing on a real multiprocessor system, we find that use of memory barriers or synchronization primitives is essential – the algorithm requires that reads and writes are serialized in the sense that once a value is written, other processes won’t see an earlier value (or earlier values of other variables). This doesn’t conflict with what Lamport says about not requiring low-level atomicity – we can allow reads and writes to happen simultaneously, with the possibility of a read returning a bogus value – and in fact we can simulate this in the program by writing a random value just before a process selects its real ticket number, but once a write has completed, all processes should see the new value.

Another essential feature is the volatile flag – as many have pointed out, this isn’t enough by itself for correct thread synchronization, but for shared memory systems it prevents the compiler from making invalid assumptions about the consistency of reads from shared variables.

A final point – correctness requires that ticket numbers can increase without bound. This is hard to arrange in practice, so we just assert if they grow too large (this rarely happens in reality, unless we get carried away with our randomization).

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>

// Compile with: g++ -Wall -O3 bakery.cpp -pthread -o bakery

static const int NTHREADS = 4;

// Some features to play with
//#define NCHECK     // Disable crucial check
//#define NSYNC      // Disable memory barrier
//#define NVOLATILE  // Disable volatile

#if defined NVOLATILE
#define VOLATILE
#else
#define VOLATILE volatile
#endif

VOLATILE int count = 0;
VOLATILE int total = 0;

// Per-thread bakery state
VOLATILE bool entering[NTHREADS];
VOLATILE unsigned number[NTHREADS];

unsigned getmax(int n)
{
    unsigned max = 0;
    for (int i = 0; i < n; i++) {
        if (number[i] > max) max = number[i];
    }
    return max;
}

bool check(int i, int j)
{
    return number[j] < number[i] ||
           (number[j] == number[i] && j < i);
}

inline void synchronize()
{
#if !defined NSYNC
    // gcc builtin full memory barrier
    __sync_synchronize();
#endif
}

void lock(int i)
{
    entering[i] = true;
    synchronize();
    // Simulate non-atomic write with a bogus intermediate value
    number[i] = rand();
    synchronize();
    // Now pick the real ticket number
    number[i] = 1 + getmax(NTHREADS);
    assert(number[i] > 0); // Die if the ticket numbers wrap around
    entering[i] = false;
    synchronize();
    for (int j = 0; j < NTHREADS; j++) {
#if !defined NCHECK
        while (entering[j]) { /* nothing */ }
#endif
        // At this point, we have read a false value for
        // "entering[j]", therefore any number picked by j
        // later will take our choice into account; any value
        // chosen earlier (and so possibly less than ours)
        // will be visible to us in the following test.

        // Wait until all threads with smaller numbers, or with
        // the same number but higher priority, finish
        // their work:
        while ((number[j] != 0) && check(i,j)) { /* nothing */ }
    }
}

void unlock(int i)
{
    number[i] = 0;
}

void *worker(void *arg)
{
    int i = *(int*)(arg);
    while (true) {
        lock(i);
        total++;
        if (total % 1000000 == 0) fprintf(stderr, "%c", 'a'+i);
        assert(count == 0); // Check we have exclusive access
        count++;
        // It's not clear whether these synchs are necessary,
        // but nothing seems to break if I remove them.
        //synchronize();
        count--;
        //synchronize();
        unlock(i);
        // non-critical section...
    }
    return NULL;
}

int main()
{
    pthread_t threads[NTHREADS];
    int n[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        n[i] = i;
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_create(&threads[i], NULL, worker, &n[i]);
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(threads[i], NULL);
    }
}

# How to use SRP in OpenSSL

[Update 10/8/14: As Vakharia points out in the comments, there have been a couple of DoS-type problems found with the OpenSSL SRP code, fixed in OpenSSL 1.0.1i. There seems to be a problem with that version with uncertificated SRP connections, see: http://marc.info/?t=140745609300002&r=1&w=2, so you might need to patch for that to work]

SRP seems to be a much neglected protocol & there have even been noises about removing TLS-SRP from some versions of OpenSSL. This is a shame as SRP has many nice properties that are only more attractive after the Heartbleed scare, including forward secrecy and mutual authentication without PKI.

There are various reasons for that neglect including possible concerns over patents, though it’s not very clear what the issue here is, or even if there is one. One issue with using SRP in OpenSSL in particular is that the C API isn’t very well documented, so this is an attempt to improve that situation. This is what I have been able to glean from experimentation, reading various bits of source as well as miscellaneous bits and pieces from the web. It all works for me & seems sensible, but I’m not a crypto expert so caveat emptor applies even more than usual.

For further details on SRP and TLS-SRP see:

The relevant RFCs are:

Also some useful discussions at:

On to the implementation details. All code fragments taken from full program at https://github.com/matthewarcus/ssl-demo.

We will start with the server side. The crucial piece of server data for SRP is the password verifier. We can make a verifier file using the openssl srp command line tool:

\$ openssl srp
Exactly one of the options -add, -delete, -modify -list must be specified.
usage: srp [args] [user]

-verbose        Talk alot while doing things
-config file    A config file
-name arg       The particular srp definition to use
-srpvfile arg   The srp verifier file name
-modify         modify the srp verifier of an existing user
-delete         delete user from verifier file
-list           list user
-gn arg         g and N values to be used for new verifier
-userinfo arg   additional info to be set for user
-passin arg     input file pass phrase source
-passout arg    output file pass phrase source
-engine e         - use engine e, possibly a hardware device.
-rand file:file:...
load the file (or the files in the directory) into
the random number generator

The -gn parameter requires explanation: the security of SRP (like Diffie-Hellman key exchange) is based on the hardness of the discrete logarithm problem: given a large prime N, a generator g, and g^a mod N, it’s hard to determine a. g and N are fixed in advance and don’t have to be secret, so it’s normal to use standard values that don’t allow any of the known shortcuts for discrete logarithms – as defined, for example, in the appendix to RFC 5054 for particular bit sizes (1024 and 1536 are mandatory for TLS-SRP) & it’s the bit size that is given as the -gn argument:

\$ openssl srp -srpvfile password.srpv -gn 1536 -add user
Enter pass phrase for user:
Verifying - Enter pass phrase for user:
V 2qYg0YhL6s9OsjpPt4eD4iIDB/SF.7pEGPLHVIsLvVD9wUU5tqngvzQUA7Uf6nAQtP.K
U4G.9yra1Ia4fkOrUbx2QjGVizyc.QcaCr83nIewI/ry57Vrgg6QQv2U6Z7ClC0Wig5yKH
BDu2Lfny1aEZy3i7oi3dTywvIxDeFCcS0UPIhUgpRnINZ5K2HJiz6TuofvIfYC2EMpD5Q8
PuZ8/fB62TvfFK7TN67cOCCJSroOukrrr/KScmoDZ/odfKUM
FRzonhUh8ApuhBf45xMzsX1Olm1
user
1536

Note that the verifier file has to exist beforehand, even if empty.

Now we have a verifier file, we can load it in our server code. OpenSSL defines a handy SRP_VBASE type that can be used to store verifiers and we can use SRP_VBASE_init to load in the verifier file we made earlier:

#include <openssl/srp.h>

static SRP_VBASE *srpData = NULL;
static const char *srpvfile = "password.srpv";

void setupSRPData(SSL_CTX *ctx)
{
    srpData = SRP_VBASE_new(NULL);
    CHECK(srpData != NULL);
    if (SRP_VBASE_init(srpData, (char *)srpvfile) != 0) {
        fprintf(stderr, "Failed to load SRP verifier file\n");
        exit(1);
    }
}

We can also make a verifier directly and store it ourselves in an SRP_VBASE structure (or elsewhere):

static const char *srpgroup = "1536";
static const char *username = "user";

void setupSRPData(SSL_CTX *ctx)
{
    ...
    // The structure to put the verifier data
    SRP_user_pwd *p =
        (SRP_user_pwd *)OPENSSL_malloc(sizeof(SRP_user_pwd));
    SRP_gN *gN = SRP_get_default_gN(srpgroup);
    CHECK(gN != NULL);
    // This check seems a bit pointless, but doesn't do harm.
    char *srpCheck = SRP_check_known_gN_param(gN->g, gN->N);
    CHECK(srpCheck != NULL);

    // Now create the verifier for the password.
    // We could get the password from the user at this point.
    BIGNUM *salt = NULL, *verifier = NULL;
    CHECK(SRP_create_verifier_BN(username, password, &salt,
                                 &verifier, gN->N, gN->g));
    // Copy into the SRP_user_pwd structure
    p->g = gN->g; p->N = gN->N;
    p->s = salt; p->v = verifier;
    p->info = NULL;
    // And add in to VBASE stack of user data
    sk_SRP_user_pwd_push(srpData->users_pwd, p);
}

Once we are done with the srpData structure, we should free it:

if (srpData != NULL) SRP_VBASE_free(srpData);

(I’ve checked this code with valgrind and all seems well – when compiled with -DPURIFY, OpenSSL is quite well-behaved under valgrind, though there always seems to be a few hundred bytes of still-reachable data left at the end).

Now we have our verifier data loaded, we can define a suitable callback function:

int srpServerCallback(SSL *s, int *ad, void *arg)
{
    (void)arg;
    // Get data for user
    char *srpusername = SSL_get_srp_username(s);
    SRP_user_pwd *p = SRP_VBASE_get_by_user(srpData, srpusername);
    if (p == NULL) {
        fprintf(stderr, "User %s doesn't exist\n", srpusername);
        return SSL3_AL_FATAL;
    }
    // Set verifier data
    CHECK(SSL_set_srp_server_param(s, p->N, p->g,
                                   p->s, p->v, NULL) == SSL_OK);
    return SSL_ERROR_NONE;
}

And tell OpenSSL to use it for SRP connections:

CHECK(SSL_CTX_set_srp_username_callback(ctx,
          srpServerCallback) == SSL_OK);
...

For a simple demo, using a global variable for the srpData is adequate, but we could pass it in as the callback argument (or in another context object):

CHECK(SSL_CTX_set_srp_cb_arg(ctx, srpData) == SSL_OK);
...

You’ll have to read the code to find out what the ad parameter is for.

Note that for a non-existent user, we return a fatal error, which results in the connection terminating with a PSK identity not known alert. More secure behaviour (and suggested in the RFC) is probably to simulate authentication with a dummy user, with failure happening in the same way as if the password was wrong (there are some extra fields in the SRP_VBASE structure to help with this but I haven’t tried to do that yet).

In fact, in the full program, the first time the SRP callback function is called, we don’t have the srpData loaded, and we return -1 from the callback to indicate this:

int srpServerCallback(SSL *s, int *ad, void *arg)
{
    ...
    if (srpData == NULL) {
        doSRPData = true;
        return -1; // Not ready yet
    }
    ...
}

When the callback fails in this way, the handshake returns a WANT_X509_LOOKUP error and we handle that by loading the srpData at this point (the idea presumably is that we may want to load the SRP data asynchronously & we can do that before calling SSL_accept again):

int res = sslAccept(ssl);
if (res == SSL_OK) {
    break;
} else if (SSL_get_error(ssl,res) == SSL_ERROR_WANT_X509_LOOKUP
           && doSRPData) {
    setupSRPData(ctx);
    doSRPData = false;
}
...

A couple of small details: we don’t want to do client certificate verification for SRP connections, so turn it off:

bool isSRP = SSL_get_srp_username(ssl) != NULL;
bool verify = doVerify && !isSRP && !peerVerified(ssl);
...

Testing the SRP username seemed to be the neatest way of finding out if the connection is actually using SRP (there doesn’t seem to be a standard interface for this).

Also the non-certificate verified SRP ciphersuites aren’t included in the default list of ciphers (since they officially offer no authentication [Update 10/8/15: this should be fixed in OpenSSL 1.0.1i]):

\$ openssl ciphers -v | grep SRP
SRP-DSS-AES-256-CBC-SHA SSLv3 Kx=SRP      Au=DSS  Enc=AES(256)  Mac=SHA1
SRP-RSA-AES-256-CBC-SHA SSLv3 Kx=SRP      Au=RSA  Enc=AES(256)  Mac=SHA1
SRP-DSS-3DES-EDE-CBC-SHA SSLv3 Kx=SRP      Au=DSS  Enc=3DES(168) Mac=SHA1
SRP-RSA-3DES-EDE-CBC-SHA SSLv3 Kx=SRP      Au=RSA  Enc=3DES(168) Mac=SHA1
SRP-DSS-AES-128-CBC-SHA SSLv3 Kx=SRP      Au=DSS  Enc=AES(128)  Mac=SHA1
SRP-RSA-AES-128-CBC-SHA SSLv3 Kx=SRP      Au=RSA  Enc=AES(128)  Mac=SHA1
\$ openssl ciphers -v ALL | grep SRP
SRP-DSS-AES-256-CBC-SHA SSLv3 Kx=SRP      Au=DSS  Enc=AES(256)  Mac=SHA1
SRP-RSA-AES-256-CBC-SHA SSLv3 Kx=SRP      Au=RSA  Enc=AES(256)  Mac=SHA1
SRP-AES-256-CBC-SHA     SSLv3 Kx=SRP      Au=None Enc=AES(256)  Mac=SHA1
SRP-DSS-3DES-EDE-CBC-SHA SSLv3 Kx=SRP      Au=DSS  Enc=3DES(168) Mac=SHA1
SRP-RSA-3DES-EDE-CBC-SHA SSLv3 Kx=SRP      Au=RSA  Enc=3DES(168) Mac=SHA1
SRP-3DES-EDE-CBC-SHA    SSLv3 Kx=SRP      Au=None Enc=3DES(168) Mac=SHA1
SRP-DSS-AES-128-CBC-SHA SSLv3 Kx=SRP      Au=DSS  Enc=AES(128)  Mac=SHA1
SRP-RSA-AES-128-CBC-SHA SSLv3 Kx=SRP      Au=RSA  Enc=AES(128)  Mac=SHA1
SRP-AES-128-CBC-SHA     SSLv3 Kx=SRP      Au=None Enc=AES(128)  Mac=SHA1

so we will set the available ciphers appropriately (NULL is also useful for testing, though I don’t recommend it for actual use):

const char *cipherlist = "ALL:NULL";
...
CHECK(SSL_CTX_set_cipher_list(ctx,cipherlist) == SSL_OK);
...

That’s about it for the server, now on to the client, which fortunately is a good deal simpler.

Once again, we need to define a callback function:

char *srpCallback(SSL *ssl, void *arg)
{
    (void)ssl;
    char *user = (char*)arg;
    // srpPassword: a global, optionally set from the command line
    if (srpPassword != NULL) {
        // Password can be passed in on command line
        return OPENSSL_strdup(srpPassword);
    } else {
        ssize_t promptsize = 256;
        char prompt[promptsize];
        CHECK(snprintf(prompt, promptsize,
                       "Password for %s: ", user) < promptsize);
        char *pass = getpass(prompt);
        char *result = OPENSSL_strdup(pass);
        // getpass uses a static buffer, so clear it out after use.
        memset(pass,0,strlen(pass));
        return result;
    }
}

getpass is officially obsolete, but does the job here (and we will make a concession to real security and null out the password after it’s been returned – hopefully OpenSSL doesn’t hold on to the duplicated password any longer than needed).

Now finish the set up:

if (doSRP) {
    CHECK(SSL_CTX_set_srp_username(ctx, (char *)user) == SSL_OK);
    SSL_CTX_set_srp_client_pwd_callback(ctx, srpCallback);
    SSL_CTX_set_srp_cb_arg(ctx, (void *)user);
}
...

We need to set the username before the handshake as it’s included in the initial hello message. The password is only required if an SRP ciphersuite is actually negotiated.

Finally, we need to get the right ciphersuites. If we include SRP ciphers in the client hello, but no user name, we will get a fatal alert if the server wishes to use SRP (which it may well do if it hasn’t been configured with ECDHE):

\$ openssl ciphers -v
ECDHE-RSA-AES256-GCM-SHA384 TLSv1.2 Kx=ECDH     Au=RSA  Enc=AESGCM(256) Mac=AEAD
ECDHE-ECDSA-AES256-GCM-SHA384 TLSv1.2 Kx=ECDH     Au=ECDSA Enc=AESGCM(256) Mac=AEAD
ECDHE-RSA-AES256-SHA384 TLSv1.2 Kx=ECDH     Au=RSA  Enc=AES(256)  Mac=SHA384
ECDHE-ECDSA-AES256-SHA384 TLSv1.2 Kx=ECDH     Au=ECDSA Enc=AES(256)  Mac=SHA384
ECDHE-RSA-AES256-SHA    SSLv3 Kx=ECDH     Au=RSA  Enc=AES(256)  Mac=SHA1
ECDHE-ECDSA-AES256-SHA  SSLv3 Kx=ECDH     Au=ECDSA Enc=AES(256)  Mac=SHA1
SRP-DSS-AES-256-CBC-SHA SSLv3 Kx=SRP      Au=DSS  Enc=AES(256)  Mac=SHA1
SRP-RSA-AES-256-CBC-SHA SSLv3 Kx=SRP      Au=RSA  Enc=AES(256)  Mac=SHA1
...

so it’s best not to request SRP unless we really want it:

if (doSRP) {
    cipherlist = "SRP";
} else {
    cipherlist = "DEFAULT:!SRP";
}

Let’s see all this in action:

On the server side:

\$ ./ssl_server -v --srp 5999
renegotiation: allowed
TLSv1.2: SRP-RSA-AES-256-CBC-SHA SSLv3 Kx=SRP      Au=RSA  Enc=AES(256)  Mac=SHA1
Session ID: (null)
Session ID CTX: 49:4E:49:54
No peer certificates.
...

Normal client connection:

\$ ./ssl_client -v --srp localhost:5999
Cipher suites:
SRP-DSS-AES-256-CBC-SHA
SRP-RSA-AES-256-CBC-SHA
SRP-AES-256-CBC-SHA
SRP-DSS-3DES-EDE-CBC-SHA
SRP-RSA-3DES-EDE-CBC-SHA
SRP-3DES-EDE-CBC-SHA
SRP-DSS-AES-128-CBC-SHA
SRP-RSA-AES-128-CBC-SHA
SRP-AES-128-CBC-SHA
TLSv1.2: SRP-RSA-AES-256-CBC-SHA SSLv3 Kx=SRP      Au=RSA  Enc=AES(256)  Mac=SHA1
Session ID CTX: (null)
Peer certificates:
0: Subject: C = GB, O = TEST SERVER
Issuer:  C = GB, O = TEST CA
1: Subject: C = GB, O = TEST CA
Issuer:  C = GB, O = TEST ROOT
Certificate OK
...

Invalid user:

\$ ./ssl_client -v --srp --user invalid localhost:5999
Cipher suites:
SRP-DSS-AES-256-CBC-SHA
SRP-RSA-AES-256-CBC-SHA
SRP-AES-256-CBC-SHA
SRP-DSS-3DES-EDE-CBC-SHA
SRP-RSA-3DES-EDE-CBC-SHA
SRP-3DES-EDE-CBC-SHA
SRP-DSS-AES-128-CBC-SHA
SRP-RSA-AES-128-CBC-SHA
SRP-AES-128-CBC-SHA
'sslConnect(ssl) == SSL_OK' failed: ssl_client.cpp:288
140613521352384:error:1407745B:SSL routines:SSL23_GET_SERVER_HELLO:reason(1115):s23_clnt.c:779:

Finally, as mentioned above, the standard SRP ciphersuites also do certificate-based server authentication. This seems sensible: if the verifier hasn’t been compromised, then SRP guarantees the authenticity of the server, but the incentives for the server to protect its private key are much greater than for it to protect the verifier for a particular user. In the case that we trust the server not to leak the verifier (perhaps we manage it ourselves and are accessing it remotely), we can use the non-PKI SRP ciphersuites by explicitly requesting them from the client:

\$ ./ssl_client -v --srp --user user --cipherlist SRP-AES-256-CBC-SHA localhost:5999
Cipher suites:
SRP-AES-256-CBC-SHA
TLSv1.2: SRP-AES-256-CBC-SHA     SSLv3 Kx=SRP      Au=None Enc=AES(256)  Mac=SHA1
Session ID: E9:F3:09:F9:7E:F8:43:60:0D:43:43:93:11:63:AE:F2:51:9F:4C:A9:4F:19:F8:89:DF:4F:07:02:66:42:F2:E6
Session ID CTX: (null)
No peer certificates.
...

# How to Leak a Key

I’m sure everyone has been following the great Heartbleed crisis over the last week. The “Heartbleed Challenge” set up by Cloudflare established that you could get the server private key – specifically, the primes used to generate the key would show up in the Heartbleed data occasionally. Doing a memory dump of a simple OpenSSL server application (https://github.com/matthewarcus/ssl-demo) showed that after reading the keys at startup, OpenSSL later copies the prime data to higher locations in the heap, where it becomes more likely to be included in a Heartbeat response.

An extract from a hex dump of the heap after making a single request shows:

000273f0  00 00 00 00 01 00 00 00  51 00 00 00 00 00 00 00  # Prime1
00027400  85 52 42 a0 71 54 76 bd  98 49 4f 25 fc d3 e1 89
...
00027460  00 00 00 00 01 00 00 00  51 00 00 00 00 00 00 00  # Prime2
00027470  3b c8 8b 7d 52 74 db e0  af 9e 13 d7 a2 0a fc ea
...
0002d020  60 55 00 00 00 00 00 00  50 45 00 00 00 00 00 00  # Input buffer
0002d030  16 03 01 16 03 03 00 28  e8 d6 b3 68 87 08 d4 7f
...
00031570  00 00 00 00 00 00 00 00  c1 44 00 00 00 00 00 00  # Output buffer
00031580  00 00 00 16 03 03 00 28  82 c9 a8 9a 87 b5 cf e7
...
00038990  90 00 00 00 00 00 00 00  50 00 00 00 00 00 00 00  # Prime 2 again
000389a0  3b c8 8b 7d 52 74 db e0  af 9e 13 d7 a2 0a fc ea
...
00038b50  30 00 00 00 00 00 00 00  50 00 00 00 00 00 00 00  # Prime 1 again
00038b60  85 52 42 a0 71 54 76 bd  98 49 4f 25 fc d3 e1 89

The first line in each section is the malloc header, the second line is part of the data; also, all of these malloced blocks are in use (otherwise some of the data would be overwritten with free list data).

The first two primes are from the initial processing of the private key, the next two items are the input and output buffers for the connection, then we have two more copies of the primes, which only appear after a connection has been made.

So where are these persistent copies of the primes coming from? Some investigation shows that the first time OpenSSL actually uses an RSA key for a private operation, it calculates various auxiliary values for use in Montgomery arithmetic and stores them away in the key structure. Unfortunately, one of the values that gets stored away is the modulus used, which in this case is one of the primes:

rsa_eay.c:

static int RSA_eay_mod_exp(BIGNUM *r0,
                           const BIGNUM *I,
                           RSA *rsa,
                           BN_CTX *ctx)
{
    ...
    if (rsa->flags & RSA_FLAG_CACHE_PRIVATE)
        {
        if (!BN_MONT_CTX_set_locked(&rsa->_method_mod_p,
                                    CRYPTO_LOCK_RSA,
                                    p, ctx))
            goto err;
        if (!BN_MONT_CTX_set_locked(&rsa->_method_mod_q,
                                    CRYPTO_LOCK_RSA,
                                    q, ctx))
            goto err;
        ...
        }
}

bn_mont.c:

BN_MONT_CTX *BN_MONT_CTX_set_locked(BN_MONT_CTX **pmont,
                                    int lock,
                                    const BIGNUM *mod,
                                    BN_CTX *ctx)
{
    ...
    if (!*pmont) {
        ret = BN_MONT_CTX_new();
        if (ret && !BN_MONT_CTX_set(ret, mod, ctx))
        ...
    }
}
...
int BN_MONT_CTX_set(BN_MONT_CTX *mont,
                    const BIGNUM *mod,
                    BN_CTX *ctx)
{
    int ret = 0;
    BIGNUM *Ri,*R;
    BN_CTX_start(ctx);
    if ((Ri = BN_CTX_get(ctx)) == NULL) goto err;
    R = &(mont->RR); /* grab RR as a temp */
    if (!BN_copy(&(mont->N), mod)) goto err; /* Set N */
    ...
}

That BN_copy in the last line is the culprit. Once allocated, these values stay in memory until the RSA key is deleted. So even if the original key data is stored in protected or hard to get at memory, a Heartbleed attack may still be able to get to the primes.

This also explains why, as far as I know, no one has managed to capture the CRT parameters or the private exponent itself – they stay safely tucked away in the low part of the heap. Also, depending on the application startup behaviour, it might be that the primes end up below the input buffers (by default, OpenSSL caches up to 32 previously allocated buffers of each type) in which case they won’t be visible to Heartbleed – this might explain why in the Cloudflare challenge, only one of the primes seemed to be turning up.

Update 20/04/14: actually, it’s worse than this – bignum resizing can also leak private data onto the heap in unpredictable ways, see John Graham-Cumming’s post at https://blog.cloudflare.com/searching-for-the-prime-suspect-how-heartbleed-leaked-private-keys/

# Loopless Ruler

This function (based on Algorithm L for loopless Gray code generation in TAOCP 7.2.1.1):

void ruler(int n)
{
    int f[n+1];
    for (int j = 0; j <= n; j++) {
        f[j] = j;
    }
    while (true) {
        visit(f,n+1);
        int j = f[0];
        f[0] = 0;
        if (j == n) break;
        f[j] = f[j+1];
        f[j+1] = j+1;
    }
}

enables us to compute the “ruler function”, ρ for 1, 2, 3…:

0 1 0 2 0 1 0 3 0 1 0 2 0 1 0...

It’s a “loopless” algorithm and uses f, an array of “focus pointers” to keep track of what’s going on.

The ruler function ρ(n) is essentially the number of trailing 0’s in the binary representation of n.

This is what’s printed by ruler(4):

0 1 2 3 4
1 1 2 3 4
0 2 2 3 4
2 1 2 3 4
0 1 3 3 4
1 1 3 3 4
0 3 2 3 4
3 1 2 3 4
0 1 2 4 4
1 1 2 4 4
0 2 2 4 4
2 1 2 4 4
0 1 4 3 4
1 1 4 3 4
0 4 2 3 4
4 1 2 3 4

The left hand column is the ruler function of course, but what are the other values?

It’s clearer if we print f in reverse and match up with binary representations:

00000
43210
00001
43211
00010
43220
00011
43212
00100
43310
00101
43311
00110
43230
00111
43213
01000
44210
01001
44211
01010
44220
01011
44212
01100
43410
01101
43411
01110
43240
01111
43214

If n[j] is the jth binary digit of n, then f[j] is just j, unless n[j] is the first of a block of 1’s, in which case f[j] is the index of the first 0 after that block.

For example, 01101 has a block starting at 0 and ending at 1, so f[0] = 1, and another block starting at 2 and ending at 4, so f[2] = 4; for other j, f[j] = j, so the focus pointers are 43411.

If n is of the form:

j   k
01110111

with f[0] = k, f[k] = k, f[k+1] = j

On incrementing we get:

j   k
01111000

Now f[0] = 0, f[k] = j and f[k+1] = k+1; the other values of f are the same.

If k is 0, then we go from:

j   k
01110

with f[0] = 0, f[1] = j

to:

j   k
01111

with f[0] = j, f[1] = 1

Either way, we can see that the inner loop of ruler is correct.

# Ringing the Changes

Apart from a brief visit many years ago to Macclesfield church with my friend Jo, I have never taken part in the ancient English art of bell-ringing, but permutation generation has always seemed a fascinating topic.

In TAOCP 7.2.1.2, Knuth has Algorithm P (“plain changes”) for generating all permutations using adjacent interchanges:

void taocp1(int *a, int n)
{
    int c[n+1];
    int o[n+1];
    for (int j = 1; j <= n; j++) {
        c[j] = 0; o[j] = 1;
    }
P2:
    visit(a,n);
P3:
    int j = n; int s = 0;
P4:
    int q = c[j] + o[j];
    if (q < 0) goto P7;
    if (q == j) goto P6;
P5:
    swap(a[j-c[j]+s],a[j-q+s]);
    c[j] = q;
    goto P2;
P6:
    if (j == 1) return;
    s = s+1;
P7:
    o[j] = -o[j];
    j = j-1;
    goto P4;
}

It works just fine, but I find it a little hard to follow and let’s face it, we just don’t write code like that anymore, with all those gotos, 1-based arrays, etc.

0-basing the arrays isn’t too much of a problem; for eliminating the gotos we could take the functional approach:

void P2(int *a, int *c, int *o, int n);
void P3(int *a, int *c, int *o, int n);
void P4(int *a, int *c, int *o, int n, int j, int s);
void P5(int *a, int *c, int *o, int n, int j, int s, int q);
void P6(int *a, int *c, int *o, int n, int j, int s);
void P7(int *a, int *c, int *o, int n, int j, int s);

void taocp2(int *a, int n)
{
    int c[n];
    int o[n];
    for (int j = 0; j < n; j++) {
        c[j] = 0; o[j] = 1;
    }
    P2(a,c,o,n);
}

void P2(int *a, int *c, int *o, int n) {
    visit(a,n);
    P3(a,c,o,n);
}

void P3(int *a, int *c, int *o, int n) {
    P4(a,c,o,n,n-1,0);
}

void P4(int *a, int *c, int *o, int n, int j, int s) {
    int q = c[j] + o[j];
    if (q < 0) P7(a,c,o,n,j,s);
    else if (q == j+1) P6(a,c,o,n,j,s);
    else P5(a,c,o,n,j,s,q);
}

void P5(int *a, int *c, int *o, int n, int j, int s, int q) {
    swap(a[j-c[j]+s],a[j-q+s]);
    c[j] = q;
    P2(a,c,o,n);
}

void P6(int *a, int *c, int *o, int n, int j, int s) {
    if (j != 1) {
        P7(a,c,o,n,j,s+1);
    }
}

void P7(int *a, int *c, int *o, int n, int j, int s) {
    o[j] = -o[j];
    P4(a,c,o,n,j-1,s);
}

but somehow that seems missing the point (I don’t think that the worst thing about that code is that it still uses modifiable state).

There are actually two component parts to the algorithm: generating a “mixed radix reflected Gray code” which defines the progressive inversions of the array elements, and going from the inversion changes to the actual indexes of the swapped objects.

For generating the Gray codes, with a bit of rewriting, 0-basing our arrays etc. we get:

void gray(int n)
{
    int j, c[n], o[n];
    for (int j = 0; j < n; j++) {
        c[j] = 0; o[j] = 1;
    }
    do {
        visit(c,n);
        for (j = n-1; j > 0; j--) {
            int q = c[j] + o[j];
            if (q < 0) {
                o[j] = -o[j];
            } else if (q == j+1) {
                o[j] = -o[j];
            } else {
                c[j] = q;
                break;
            }
        }
    } while (j != 0);
}

Each time around, find the highest element whose inversion count can be changed in the desired direction. For elements whose inversion count can’t be changed, change direction. If no element can be changed we are done.

The next step is to calculate the position of element j – this is just j, less the number of elements less than j that appear after j (ie. the number of inversions, c[j]), plus the number of elements greater than j that appear before j – and this last is just the number of elements we have passed over in the “if (q == j+1)” step above, so we can now add in the rest of algorithm P:

void taocp3(int *a, int n)
{
    int c[n];
    int o[n];
    for (int j = 0; j < n; j++) {
        c[j] = 0; o[j] = 1;
    }
    int j;
    do {
        visit(a,n);
        int s = 0;
        for (j = n-1; j > 0; j--) {
            int q = c[j] + o[j];
            if (q < 0) {
                o[j] = -o[j];
            } else if (q == j+1) {
                o[j] = -o[j];
                s++;          // This element will be before element j
            } else {
                swap(a[j-c[j]+s],a[j-q+s]);
                c[j] = q;
                break;
            }
        }
    } while (j != 0);
}

Knuth’s algorithm is essentially Dijkstra’s, in a note that explains it very lucidly: there is a 1-1 correspondence between inversion counts and permutations, and a Gray enumeration of inversions gives us a sequence of permutations where only one element at a time changes its inversion count, and only by 1 or -1, which can only happen if we exchange adjacent elements.

Knuth also gives a “loopless” algorithm for generating reflected Gray sequences, and we could use a table of element positions to construct the permutations from this:

void gray2(int *m, int n)
{
    int a[n],f[n+1],o[n];
    for (int j = 0; j < n; j++) {
        a[j] = 0; f[j] = j; o[j] = 1;
    }
    f[n] = n;
    while (true) {
        visit(a,n);
        int j = f[0];
        f[0] = 0;
        if (j == n) break;
        a[j] += o[j];
        if (a[j] == 0 || a[j] == m[j]-1) {
            o[j] = -o[j];
            f[j] = f[j+1];
            f[j+1] = j+1;
        }
    }
}

The “classic” Johnson-Trotter algorithm for plain changes involves consideration of “mobile” elements: each element has a current direction and it is “mobile” if it is greater than the adjacent element (if there is one) in that direction. The algorithm proceeds by finding the highest mobile element and moving it accordingly:

void jt(int *a, int n)
{
    int o[n]; // Direction of element j, 1 = left, 0 = right
    int c[n]; // Position of element j
    for (int j = 0; j < n; j++) {
        o[j] = 1; c[j] = j;
    }
    int j; // Will be set to the highest mobile element
    do {
        visit(a,n);
        // Search high to low. 0 is never mobile
        for (j = n-1; j > 0; j--) {
            int p = c[j]; // Position of element j
            if (o[j]) { // Going left
                if (p > 0 && a[p-1] < j) { // Check bounds before reading
                    int k = a[p-1];
                    a[p] = k; c[k] = p;
                    a[p-1] = j; c[j] = p-1;
                    break;
                }
            } else { // Going right
                if (p < n-1 && a[p+1] < j) {
                    int k = a[p+1];
                    a[p] = k; c[k] = p;
                    a[p+1] = j; c[j] = p+1;
                    break;
                }
            }
            o[j] = !o[j]; // Change direction & continue
        }
    } while (j > 0);
}

Note that we can check elements from high to low directly & stop as soon as a mobile element is found – there is no need to check every element.

Johnson-Trotter and Algorithm P are essentially doing the same thing: when moving the mobile element, we are changing its inversion count in the same way as we do directly in P.

Performance-wise, the two algorithms are very similar though (somewhat surprisingly to me) the Johnson-Trotter version seems a little faster.

Either way, my laptop generates the 3628800 permutations of 10 elements in 0.03 seconds, outputting them (even to /dev/null) takes 4.5 seconds. It takes about 3 seconds to run through the 479001600 permutations of 12 elements (I haven’t tried outputting them). We can streamline the computation further by taking advantage of the fact that most of the time we are just moving the highest element in the same direction, but that can wait for another day.

Finally, here is a function that gives the next permutation in plain changes without maintaining any state:

bool nextperm(int *a, int n)
{
    int c[n], o[n];
    for (int i = 0; i < n; i++) {
        c[i] = o[i] = 0;
    }
    // Count inversions
    for (int i = 0; i < n; i++) {
        for (int j = i+1; j < n; j++) {
            if (a[j] < a[i]) c[a[i]]++;
        }
    }
    // Find directions
    for (int i = 1, k = 0; i < n-1; i++) {
        k += c[i];
        o[i+1] = k%2;
    }
    // Find mobile element and move it.
    for (int j = n-1, s = 0; j > 0; j--) {
        if (!o[j]) {
            if (c[j] < j) {
                swap(a[j-c[j]+s],a[j-c[j]+s-1]);
                return true;
            }
            s++;
        } else {
            if (c[j] > 0) {
                swap(a[j-c[j]+s],a[j-c[j]+s+1]);
                return true;
            }
        }
    }
    return false;
}

void stateless(int *a, int n)
{
    do {
        visit(a,n);
    } while (nextperm(a,n));
}

We use the Knuth/Dijkstra trick to avoid building a position table. I suspect that the n-squared inversion counting can be improved, but we still come in at under a second for all permutations of 10, though that’s about 30 times slower than the optimized version.

# valof

I’ve never written much BCPL, but it always seemed like a nice language. Many features ended up in C of course, but one that didn’t was VALOF, allowing the inclusion of a statement block in an expression, and using RESULTIS to return a value. Using slightly invented syntax:

LET B = VALOF $(
    FOR I = 1 TO J DO $(
        IF I*I = J THEN RESULTIS 1
    $)
    RESULTIS 0
$)

GCC defines a block-expression extension (“statement expressions”), allowing us to write:

int main(int argc, char *argv[])
{
    int j = atoi(argv[1]);
    printf("%d\n",
           ({
               bool found = false;
               for (int i = 1; i < j; i++) {
                   if (i*i == j) {
                       found = true;
                       goto end;
                   }
               }
           end:
               found;
           }));
}

but having to use the last expression in the block as the return value can get a bit convoluted.

With C++11, we can do better with immediately applied lambda expressions, with return taking the role of RESULTIS:

int main(int argc, char *argv[])
{
    int j = atoi(argv[1]);
    printf("%d\n",
           [=](){
               for (int i = 1; i < j; i++) {
                   if (i*i == j) return true;
               }
               return false;
           }());
}

If we don’t like the smiley syntax, we can even define a VALOF macro:

#define VALOF(e) [=](){e}()

int main(int argc, char *argv[])
{
    int j = atoi(argv[1]);
    printf("%d\n",
           VALOF(
               for (int i = 1; i < j; i++) {
                   if (i*i == j) return true;
               }
               return false;
           ));
}

though we’ll have to be a little bit careful with commas not enclosed in parentheses (and we might end up needing different macros for different variable capture methods).

Curiously, this idiom seems to be popular in Javascript.

# Krazy Kompilation

Let’s take a simple factorial program:

int fact(int n)
{
    if (n == 0) return 1;
    else return n * fact(n-1);
}

and look at what gcc compiles it to (using Matthew Godbolt’s excellent GCC Explorer):

With -O1, a nice clean version of the recursive function:

fact(int):
pushq	%rbx
movl	%edi, %ebx
movl	$1, %eax
testl	%edi, %edi
je	.L2
leal	-1(%rdi), %edi
call	fact(int)
imull	%ebx, %eax
.L2:
popq	%rbx
ret

With -O2, that’s impressive, it’s done some program transformation and now we have a loop:

fact(int):
testl	%edi, %edi
movl	$1, %eax
je	.L4
.L3:
imull	%edi, %eax
subl	$1, %edi
jne	.L3
rep
ret
.L4:
rep
ret

With -O3, whaaah!:

fact(int):
testl	%edi, %edi
je	.L7
movl	%edi, %edx
movl	%edi, %esi
shrl	$2, %edx
leal	0(,%rdx,4), %ecx
testl	%ecx, %ecx
je	.L8
cmpl	$6, %edi
jbe	.L8
leal	-1(%rdi), %eax
movl	%edi, -24(%rsp)
movdqa	.LC1(%rip), %xmm4
movl	%eax, -20(%rsp)
leal	-2(%rdi), %eax
movd	-20(%rsp), %xmm2
movl	%eax, -16(%rsp)
leal	-3(%rdi), %eax
movd	-16(%rsp), %xmm1
movl	%eax, -12(%rsp)
xorl	%eax, %eax
movd	-12(%rsp), %xmm0
punpckldq	%xmm0, %xmm1
movd	-24(%rsp), %xmm0
punpckldq	%xmm2, %xmm0
punpcklqdq	%xmm1, %xmm0
movdqa	.LC0(%rip), %xmm1
.L4:
movdqa	%xmm1, %xmm3
psrldq	$4, %xmm1
movdqa	%xmm0, %xmm2
cmpl	%eax, %edx
pmuludq	%xmm0, %xmm3
psrldq	$4, %xmm2
pmuludq	%xmm1, %xmm2
pshufd	$8, %xmm3, %xmm1
pshufd	$8, %xmm2, %xmm2
punpckldq	%xmm2, %xmm1
ja	.L4
movdqa	%xmm1, %xmm2
subl	%ecx, %edi
cmpl	%ecx, %esi
psrldq	$8, %xmm2
movdqa	%xmm2, %xmm0
psrldq	$4, %xmm2
pmuludq	%xmm1, %xmm0
psrldq	$4, %xmm1
pshufd	$8, %xmm0, %xmm0
pmuludq	%xmm2, %xmm1
pshufd	$8, %xmm1, %xmm1
punpckldq	%xmm1, %xmm0
movdqa	%xmm0, %xmm1
psrldq	$4, %xmm1
movdqa	%xmm1, %xmm2
psrldq	$4, %xmm1
pmuludq	%xmm0, %xmm2
psrldq	$4, %xmm0
pmuludq	%xmm0, %xmm1
pshufd	$8, %xmm2, %xmm0
pshufd	$8, %xmm1, %xmm1
punpckldq	%xmm1, %xmm0
movd	%xmm0, -24(%rsp)
movl	-24(%rsp), %eax
je	.L13
.L9:
imull	%edi, %eax
subl	$1, %edi
jne	.L9
rep
ret
.L7:
movl	$1, %eax
ret
.L8:
movl	$1, %eax
jmp	.L9
.L13:
ret

Well, OK: it’s unrolled the loop and vectorized with SSE instructions, though a more precise analysis will have to wait for a future post. Impressive optimization from GCC though, and pretty groovy considering the biggest factorial we can fit in a 32-bit int is fact(12)!

# Redirection

Like many programmers, I suspect, there are software tools I have used on a regular basis for many years while remaining woefully ignorant of their inner workings and true potential. One of these is the Unix command line.

It’s common, for example, to want the error output of a program to appear as its normal output, and trial and error, Googling, or a look at Stack Overflow leads to:

strace ls 2>&1 >/dev/null

which works fine but seems puzzling – we’ve told the program to send error output to normal output, then normal output to /dev/null so why doesn’t that discard everything, similar to:

strace ls >/dev/null 2>&1

This is because we don’t understand what is going on.

A redirection is actually a call to the dup2 system call. From the man page:

dup2() makes newfd be the copy of oldfd, closing newfd first if necessary

So: n>&m does a dup2(m,n): close fd n if necessary, then make n a copy of fd m; and n>file means: open file on some fresh fd m, do a dup2(m,n), then close m.

Now it all makes sense:

strace ls 2>&1 1>/dev/null

first of all makes 2 be a copy of 1, then changes 1 to point to /dev/null – the copying is done ‘by value’ as it were (despite the confusing, for C++ programmers anyway, use of ‘&’).

Using strace here is not an accident, but used like this it doesn’t tell us much: redirection is handled by the shell, not by the program, so we need to do something like this for further insight:

$ strace -f -etrace=clone,execve,open,dup2 bash -c 'ls >/dev/null 2>&1'
execve("/bin/bash", ["bash", "-c", "ls >/dev/null 2>&1"], [/* 46 vars */]) = 0
clone(Process 20454 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f643f4c09d0) = 20454
Process 20453 suspended
[pid 20454] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 20454] dup2(3, 1)                  = 1
[pid 20454] dup2(1, 2)                  = 2
[pid 20454] execve("/bin/ls", ["ls"], [/* 45 vars */]) = 0
Process 20453 resumed
Process 20454 detached
--- SIGCHLD (Child exited) @ 0 (0) ---

bash forks off a subprocess (a clone syscall these days rather than fork), which then sets up the input and output before calling execve to actually run the command.

We don’t have to limit ourselves to fds 0,1 and 2 that we get to start with:

$ runit () { echo stderr 1>&2; echo stdout; }
$ runit 1>/dev/null
stderr
$ runit 2>/dev/null
stdout
$ runit 3>&2 2>&1 1>&3
stderr
stdout
$ (runit 3>&2 2>&1 1>&3) 1> /dev/null
stdout
$ (runit 3>&2 2>&1 1>&3) 2> /dev/null
stderr

We duplicate 2 to 3, then 1 to 2, then 3 to 1, and we have swapped stdout and stderr.

We can also pass non-standard file descriptors in to programs, though this doesn’t seem to be a technique used much:

#include <unistd.h>

int main()
{
    char buffer[256];
    ssize_t n;
    while ((n = read(3,buffer,sizeof(buffer))) > 0) {
        write(4,buffer,n);
    }
}

and do:

$ g++ -Wall cat34.cpp -o cat34
$ echo Hello World | ./cat34 3<&0 4>&1
Hello World

It’s interesting that this also works:

$ echo Hello World | ./cat34 3>&0 4<&1
Hello World

While we are in this part of town, let’s talk briefly about named pipes, another feature that has been around forever but doesn’t seem to get used as much as it deserves. We can run into problems though:

Suppose I want to capture an HTTP response from a web server, I can do this:

$ mkfifo f1
$ mkfifo f2
$ mkfifo f3
$ netcat -l 8888 <f1 >f2 &
$ netcat www.bbc.co.uk 80 <f2 >f3 &
$ tee foo.txt <f3 >f1 &

$ GET http://localhost:8888/ >/dev/null
Can't connect to localhost:8888 (Connection refused)

This doesn’t seem right: netcat should be listening on port 8888, I told it so myself! Checking with netstat shows no listener on 8888, and ps -aux shows no sign of any netcat processes – what is going on?

Once again, strace helps us see the true reality of things:

$ kill %1
$ strace netcat -l 8888 < f1 > f2

But strace tells us nothing – there is no output! Like the dog that didn’t bark in the night though, this is an important clue, and widening our area of investigation, we find:

$ strace -f -etrace=open,execve bash -c 'netcat -l 8888 < f1 > f2'
execve("/bin/bash", ["bash", "-c", "netcat -l 8888 < f1 > f2"], [/* 46 vars */]) = 0
Process 20621 attached
Process 20620 suspended
[pid 20621] open("f1", O_RDONLY ...

The shell is stalled trying to open the “f1” fifo, before it even gets around to starting the netcat program, which is why the first strace didn’t show anything. What we have forgotten is that opening a pipe blocks until some process has the other end open (it doesn’t have to be actively reading or writing, it just has to be there). The shell handles redirections in the order they appear, and since our three processes all open their read fifo first, none has got around to opening its write fifo – we have deadlock: in fact, the classic dining philosophers problem, and a simple solution is for one philosopher to pick up the forks in a different order:

$ netcat -l 8888 >f2 <f1 &
$ netcat www.bbc.co.uk 80 <f2 >f3 &
$ tee foo.txt <f3 >f1 &
$ GET http://localhost:8888/ >/dev/null
$ cat foo.txt
HTTP/1.1 301 Moved Permanently
Server: Apache
...

We can, it should be noted, do this more easily with a normal pipeline and a single fifo, and avoid all these problems:

$ netcat www.bbc.co.uk 80 <f1 | tee foo.txt | netcat -l 8888 >f1

but that would be less fun and possibly less instructive.

# tolower

In an Age of Highly Pipelined RISC Architectures, we might be tempted to write a basic ASCII lower-case function as:

int tolower1(int c)
{
    return c|-(c-0101U<032)&040;
}

and avoid any branching. We subtract ‘A’ from c and do an unsigned comparison with ‘Z’-‘A’+1 (in octal: ‘A’ is 0101, 26 is 032 and the case bit is 040), convert the result to a bitmask and use that to ‘or’ in the lower case bit where necessary.

A more quiche-eating approach would be the clear and obvious:

int tolower2(int c)
{
    if (c >= 'A' && c <= 'Z') return c+'a'-'A';
    else return c;
}

at which hard-nosed hacker types will sneer; however, an Age of Highly Pipelined RISC Architectures is also an Age of Conditional Move Instructions, and GCC generates the pleasantly concise:

tolower2:
leal	-65(%rdi), %ecx
leal	32(%rdi), %edx
movl	%edi, %eax
cmpl	$25, %ecx
cmovbe	%edx, %eax
ret

for our simple approach rather than the distinctly bloated:

tolower1:
leal	-65(%rdi), %eax
cmpl	$25, %eax
setbe	%al
movzbl	%al, %eax
negl	%eax
andl	$32, %eax
orl	%edi, %eax
ret

All is not lost though for the hard-nosed hacker, for this is also an Age of Vectorization and we can do something similar using SSE2 instructions to lower-case not one but 16 characters at a time:

#include <x86intrin.h>
void tolower128(char *p)
{