# Yin Yang

**Posted:** December 20, 2012

**Filed under:** Uncategorized

An old puzzle about call/cc in Scheme. Don’t know the original source, possibly that place with the dome near the Charles River:

```scheme
(let* ((yin ((lambda (cc) (display #\@) cc)
             (call-with-current-continuation (lambda (c) c))))
       (yang ((lambda (cc) (display #\*) cc)
              (call-with-current-continuation (lambda (c) c)))))
  (yin yang))
```

First, let's simplify somewhat; we don’t seem to lose anything essential by getting rid of the outer lambdas:

```scheme
(let ((yin (call-with-current-continuation (lambda (c) c))))
  (display #\@)
  (let ((yang (call-with-current-continuation (lambda (c) c))))
    (display #\*)
    (yin yang)))
```

OK, still seems to work. Let’s try to work this out informally – yin is bound to a continuation that sets yin to its argument and prints something, then yang is bound to a continuation that prints something else, and then whatever the first continuation was applied to is applied to the argument of the second continuation. Hmmm, so far, as clear as mud – it sounds like that Marx Brothers sketch.

Let’s try to write down what the continuations are. To do that properly we need to convert to continuation passing style, CPS: every function gets an extra argument, its continuation, and instead of returning a value, it calls its continuation on the value in a tail call. So, without further ado:

```scheme
(define (id x) x)
(define ((id2 x) k) (k x))
(define (((throw k) x) k2) (k x))
(define ((ccc f) k) ((f (throw k)) k))

((ccc id2)
 (lambda (k)
   (display #\@)
   ((ccc id2)
    (lambda (x)
      (display #\*)
      ((k x) id)))))
```

Here, `id2` is a CPS-style identity function – it just applies the continuation `k` to its main argument `x`. `throw` can be used to replace the current continuation; `ccc` is our CPS-style `call-with-current-continuation`, and uses `throw` to replace the current continuation. `id` is our ‘top-level’ continuation; it doesn’t really matter what it is – it doesn’t ever get called.

Still as clear as mud, simplifying a bit more might help. Let’s give names to the continuation functions and pull them out to the top level as combinators:

```scheme
(define ((g k) x) (display #\*) ((k x) id))
(define (f k) (display #\@) ((ccc id2) (g k)))

((ccc id2) f)
```

Now, running through our definition of `ccc`, we get:

```scheme
((ccc id2) k) => ((id2 (throw k)) k) => (k (throw k))
```

Sort of what we expect: we call the current continuation with the current continuation (wrapped in `throw`, which essentially converts a continuation into a normal function that itself takes a continuation as a parameter). So now we have:

```scheme
(define ((g k) x) (display #\*) ((k x) id))
(define (f k) (display #\@) ((g k) (throw (g k))))

(f (throw f))
```

Maybe some more inlining would help:

```scheme
(define ((g k) x) (display #\*) ((k x) id))
(define (f k) (display #\@) (display #\*) ((k (throw (g k))) id))

(f (throw f))
```

Or maybe we can get some insight by omitting the IO and just writing:

```scheme
(define ((g k) x) ((k x) id))
(define (f k) ((g k) (throw (g k))))

(f (throw f))
```

Now:

```scheme
(f (throw f))
=> ((g (throw f)) (throw (g (throw f))))
=> (((throw f) (throw (g (throw f)))) id)
=> (f (throw (g (throw f))))
=> ((g (throw (g (throw f)))) (throw (g (throw (g (throw f))))))
=> (((throw (g (throw f))) (throw (g (throw (g (throw f)))))) id)
=> ((g (throw f)) (throw (g (throw (g (throw f))))))
=> (((throw f) (throw (g (throw (g (throw f)))))) id)
=> (f (throw (g (throw (g (throw f))))))
=> ...
```

I’m sure you get the picture.

We can do a similar evaluation more succinctly by defining:

```
Tkxy = kx
Gkx  = kxI
Fk   = Gk(T(Gk))
```

And now:

```
F(TF) => G(TF)(T(G(TF)))
      => TF(T(G(TF)))I
      => F(T(G(TF)))
      => G(T(G(TF)))(T(G(T(G(TF)))))
      => T(G(TF))(T(G(T(G(TF)))))I
      => G(TF)(T(G(T(G(TF)))))
      => TF(T(G(T(G(TF)))))I
      => F(T(G(T(G(TF)))))
      => ...
      => F(T(G(T(G(T(G(TF)))))))
      => ...
```

All the Ts and Is do here is cancel one another out, so we can define, even more succinctly (and inlining G a little):

```
Fk  = k(Gk)
Gkx = kx
```

```
FF => F(GF)
   => GF(G(GF))
   => F(G(GF))
   => G(GF)(G(G(GF)))
   => GF(G(G(GF)))
   => F(G(G(GF)))
   => ...
```

It’s curious that here, `F = λk.k(Gk) = λk.k(λx.kx)`, and `λx.kx` is eta-convertible to just `k`, so `FF = (λk.k(λx.kx))(λk.k(λx.kx))` is eta-convertible to `Ω = YI = (λx.xx)(λx.xx)`.

Returning to Scheme, we can do a similar trick and get:

```scheme
(define ((g k) x) (display #\*) (k x))
(define (f k) (display #\@) ((g k) (g k)))

(f f)
```

or inlining the first call to g in f:

```scheme
(define (f k) (display #\@) (display #\*) (k (g k)))
```

Now our evaluation is:

```scheme
(f f) => (f (g f)) => ((g f) (g (g f))) => (f (g (g f))) => ...
```

Simplifying further, we end up with the equivalent program:

```scheme
((lambda (k) (display #\@) (display #\*) (k (lambda (x) (display #\*) (k x))))
 (lambda (k) (display #\@) (display #\*) (k (lambda (x) (display #\*) (k x)))))
```

and if that isn’t cute, I don’t know what is…
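For the curious, that final program can be transcribed into Python (a hypothetical sketch, not part of the original post): Python has no tail calls, so we cap the output length and unwind with an exception instead of running forever.

```python
import sys

def yin_yang(limit):
    """Transcription of the final Scheme program; collect output up to `limit` chars."""
    out = []
    def g(k):                 # (define ((g k) x) (display #\*) (k x))
        def gk(x):
            out.append('*')
            return k(x)
        return gk
    def f(k):                 # (define (f k) (display #\@) (display #\*) (k (g k)))
        out.append('@')
        out.append('*')
        if len(out) >= limit: # cap the infinite regress
            raise StopIteration
        return k(g(k))
    sys.setrecursionlimit(100000)
    try:
        f(f)                  # (f f)
    except StopIteration:
        pass
    return ''.join(out)[:limit]

print(yin_yang(14))  # @*@**@***@****
```

The familiar `@*@**@***@****…` pattern appears: each trip back through `f` threads one more `g` wrapper, hence one more star.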

# Inverse Square Root II

**Posted:** November 25, 2012

**Filed under:** Bit twiddling, Floating Point

Hmmm, after writing that rather long and laborious post about the bit-level interpretation of fast inverse square root, I came across a couple of excellent blog posts by Christian Plesner Hansen:

http://blog.quenta.org/2012/09/0x5f3759df.html

http://blog.quenta.org/2012/09/0x5f3759df-appendix.html

which got me thinking about all this again (it’s a real mind-worm this one)…

So, now the scales have dropped from my eyes, let’s try again: what we are really doing is using a rough approximation to log2(n), represented as a fixed-point number, that happens to be easily computed from the floating-point representation. Once we have our approximate log (and exp) function, we can do the usual computations of roots etc. using normal arithmetic operations on our fixed-point representation, before converting back to a float using our exp function.

So, take a number x = 2^{n}(1+m), where 0 <= m < 1. A reasonable approximation to log_{2}(x) is n+m, and we can improve the approximation by adding a small constant offset σ; because we are doing this in the fixed-point realm, everything works out nicely when we convert back to the floating-point realm. Here is a picture for n=0, choosing by eye a value of σ = 0.045:

Now, if we interpret a non-negative IEEE-754 float 2^{n}(1+m) as a 9.23 fixed-point value (ie. with 9 bits to the left of the binary point, 23 bits to the right), then this fixed-point number is (e+m), where e = n+127, and, as above, this approximates log_{2}(2^{e}(1+m)), so e-127+m = n+m is an approximation to log_{2}(2^{e-127}(1+m)) = log_{2}(2^{n}(1+m)), ie. log_{2} of our original number. Note that e+m is always positive, but e+m-127 may not be, so it might need to be represented as a signed value.

As far as actual code goes, first we need to be able to get at the bitwise representation of floats. It’s nice to avoid aliasing issues etc. by using `memcpy`; my compiler (gcc 4.4.3) at least generates sensible code for this:

```c
static inline float inttofloat(uint32_t n) {
  float f;
  memcpy(&f, &n, sizeof(f));
  return f;
}

static inline uint32_t floattoint(float f) {
  uint32_t n;
  memcpy(&n, &f, sizeof(n));
  return n;
}
```

We are working in our fixed point representation, so adding 127 is actually done by adding 127<<23 = 0x3f800000 and we convert our σ offset in the same way:

```c
uint32_t sigma = 45*(uint32_t(1)<<23)/1000;
uint32_t c = (uint32_t(127)<<23) - sigma;

int32_t alog(float x) {
  return floattoint(x) - c;
}
```

We also want an inverse of our approximate log, ie. an approximate exp, but this is just a reverse of the previous step:

```c
float aexp(int32_t n) {
  return inttofloat(n + c);
}
```

Provided we haven’t left the realm of representable floats, the sign bit of the result should always be zero.

We can use our log approximation directly as well, of course; to convert from fixed point we need:

```c
float unfix(int32_t n)  { return float(n)/(1<<23); }
float unfix(uint32_t n) { return float(n)/(1<<23); }
```

Another nice picture, showing our `log2` approximation over a wider range:

Now we can define:

```c
float sqrt(float x)    { return aexp(alog(x)/2); }
float invsqrt(float x) { return aexp(alog(x)/-2); }
float cubert(float x)  { return aexp(alog(x)/3); }
```

and so on.
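As a sanity check, here is a hypothetical Python transcription of `alog`/`aexp` (using `struct` for the bit-level conversions, and `int()` to truncate toward zero like C division). The results are deliberately rough, as the post warns below.

```python
import struct

SIGMA = 45 * (1 << 23) // 1000   # sigma = 0.045 in 9.23 fixed point
C = (127 << 23) - SIGMA          # exponent bias, less the offset

def floattoint(f):
    return struct.unpack('<I', struct.pack('<f', f))[0]

def inttofloat(n):
    return struct.unpack('<f', struct.pack('<I', n))[0]

def alog(x):                     # approximate log2(x), as 9.23 fixed point
    return floattoint(x) - C

def aexp(n):                     # approximate inverse of alog
    return inttofloat(n + C)

def asqrt(x):                    # exp(log(x)/2)
    return aexp(int(alog(x) / 2))

def ainvsqrt(x):                 # exp(log(x)/-2)
    return aexp(int(alog(x) / -2))

print(asqrt(16.0), ainvsqrt(4.0))  # roughly 4 and 0.5, to within a few percent
```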

To relate this to the original magic function, for an inverse square root, the integer manipulations are:

```
invsqrt(n) = -((n-c)/2) + c
           = -(n/2 - c/2) + c    // Not exact, but close enough
           = c/2 - n/2 + c
           = -(n>>1) + ((c>>1) + c)
```

Calculating (c>>1) + c we get 0x5f375c29, satisfyingly close to the original magic constant 0x5f3759df…
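That arithmetic is easy to check directly; a few lines of Python (hypothetical, just mirroring the constants above) reproduce the constant:

```python
sigma = 45 * (1 << 23) // 1000   # the 0.045 offset in 9.23 fixed point
c = (127 << 23) - sigma          # biased-exponent constant
magic = (c >> 1) + c             # the invsqrt integer expression
print(hex(magic))                # 0x5f375c29
```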

We can use signed or unsigned fixed-point numbers for our logs; the arithmetic will be the same, provided we can avoid overflow. So, for example, to get an approximate value for `log(factorial(50))`, we can do:

```c
uint32_t fact = 0;
for (int i = 1; i <= 50; i++) {
  fact += alog(i);
}
printf("%g\n", unfix(fact));
```

giving the result 213.844, comparing nicely to the true result of 214.208. Note that had I used a signed value for fact, the result would have overflowed.

Be warned, this is only a very rough approximation to the log2 function; only use it if a very crude estimate is good enough, or if you are going to do some further refinement of the value. Alternatively, your FPU almost certainly uses some variation of this technique to calculate logs (or at least an initial approximation), so you could just leave it to get on with it.

Like many things, this isn’t particularly new, the standard reference is:

J. N. Mitchell, “Computer multiplication and division using binary logarithms,” IRE Trans. Electron. Computers, vol. 11, pp. 512–517, Aug. 1962

# Fast inverse square root

**Posted:** November 19, 2012

**Filed under:** Bit twiddling, C++, Floating Point

What could be cuter than the infamous fast inverse square root function used in the Quake 3 engine:

http://en.wikipedia.org/wiki/Fast_inverse_square_root

The interesting part is the calculation of the initial approximation. Splitting this down into basic steps, we have:

```c
float isqrt(float f) {
  uint32_t n = floattoint(f);
  n >>= 1;
  n = -n;
  n += 0x5f000000;
  n += 0x3759df;
  float x = inttofloat(n);
  return x;
}
```

To get some insight into what’s going on here, we need to look at the representation of floating point numbers. An IEEE-754 float consists of a sign bit, an exponent value, and a mantissa. The exponent is an 8-bit value, the mantissa has 23 bits, both unsigned. As usual, a suitable notation is key: simplifying a little (specifically, ignoring NaNs, infinities and denormalized numbers), we shall write a float of the form `{sign:1;exponent:8;mantissa:23}` as `(s,e,m)`, with `m` a real in the range `[0,1)`; this represents 2^{e-127}(1+m), negated if the sign bit is 1.

To warm up, it’s helpful to look at a simpler example:

```c
float fsqrt(float f) {
  unsigned n = floattoint(f);
  n >>= 1;
  n += 0x1f800000;
  n += 0x400000;
  return inttofloat(n);
}
```

Here we are computing an approximation to a normal square root.

Taking it a step at a time: first the shift right `n >>= 1`; there are two cases, for odd and even exponents:

```
(0,2e,m)   => (0,e,m/2)       // No carry from exponent
(0,2e+1,m) => (0,e,1/2 + m/2) // Carry into mantissa
```

For `n += 0x1f800000`: we are adding 63 (0x1f800000 is 63<<23) onto the exponent:

```
(s,e,m) => (s,e+63,m)
```

And finally, `n += 0x400000`: generally, if we add p onto the mantissa, where 0 <= p < 2^{23}, and writing m' = m + p/2^{23}, we have:

```
(s,e,m) => (s,e,m')      if m' < 1
(s,e,m) => (s,e+1,m'-1)  if m' >= 1
```

For `p = 0x400000 = 2^{22}`, we have `m' = m + 1/2`, ie:

```
(s,e,m) => (s,e+1,m'-1)  if m' >= 1
        => (s,e,m')      otherwise
```

Putting this together, for even powers of two:

2^{2k}(1+m) => 2^{k}(1+m/2):

```
(0,2k+127,m) => (0,k+63,0.5+m/2)
             => (0,k+126,0.5+m/2)
             => (0,k+127,m/2)
```

And for odd powers:

2^{2k+1}(1+m) => 2^{k}(1+0.5+m/2):

```
(0,2k+1+127,m) => (0,k+64,m/2)
               => (0,k+127,m/2)
               => (0,k+127,0.5+m/2)
```

Let’s check this is sensible by setting m = 0:

```
2^{2k}   => 2^{k}
2^{2k+1} => 2^{k}(1+0.5)
```

and putting m = 1 we get:

```
2^{2k+1} => 2^{k}(1+1/2)
2^{2k+2} => 2^{k+1}
```

Our approximation is linear in between powers of two, and continuous at those points too. Also, at even powers the graph is tangent to `sqrt(x)`.
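We can check the exact agreement at powers of two with a quick Python transcription of `fsqrt` (a hypothetical sketch, mirroring the C above):

```python
import struct

def fsqrt(f):
    # bit-level approximate square root, as in the C version above
    n = struct.unpack('<I', struct.pack('<f', f))[0]
    n >>= 1
    n += 0x1f800000
    n += 0x400000
    return struct.unpack('<f', struct.pack('<I', n))[0]

# Exact at even powers of two; 2^k * 1.5 at odd powers
print(fsqrt(4.0), fsqrt(16.0), fsqrt(8.0))  # 2.0 4.0 3.0
```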

This is all nicely illustrated in a picture:

Returning to the original inverse function, we have an additional negation step, `n = -n`: to negate a twos-complement number, we flip the bits and add one. There are two main cases, depending on whether the mantissa is zero. If it is zero, there is a carry into the exponent; otherwise we just flip the bits of the exponent. The sign bit will always end up set (I’m ignoring the case when the exponent is zero). The general effect is:

```
(0,e,0) => (1,256-e,0)
(0,e,m) => (1,255-e,1-m)
```

This time, we are adding 190 onto the exponent (0x5f000000 = 190<<23) – this has the dual effect of resetting the sign bit to 0 and subtracting 66 from the exponent (190 = -66 mod 256).

Let’s see what happens with odd and even powers of two; writing the magic offset added onto the mantissa as `c`:

```
(0,2k+127,m) => (0,k+63,0.5+m/2)      // n >>= 1
             => (1,255-k-63,0.5-m/2)  // n = -n
             => (0,126-k,0.5-m/2)     // n += 0x5f000000
             => (0,126-k,0.5-m/2+c)   // if 0.5-m/2+c < 1
             => (0,127-k,-0.5-m/2+c)  // if 0.5-m/2+c >= 1
```

```
(0,2k+128,m) => (0,k+64,m/2)          // n >>= 1
             => (1,255-k-64,1-m/2)    // n = -n
             => (0,125-k,1-m/2)       // n += 0x5f000000
             => (0,125-k,1-m/2+c)     // if 1-m/2+c < 1
             => (0,126-k,-m/2+c)      // if 1-m/2+c >= 1
```

If we use 0x400000 as the offset, ie. `c` above is 0.5, and put m=0 in the two cases, we have:

```
2^{2k}   => 1/2^{k}
2^{2k+1} => 1.5/2^{k+1}
```

Once again, our approximation coincides exactly at even powers of two, and as before it’s useful to have a picture:

We don’t have nice tangents this time, but the end result isn’t too bad.

We probably could have saved ourselves some work here by noting that the mapping between 32-bit integers (as signed magnitude numbers) and the corresponding floats is monotonic and continuous (for some sense of continuous), so composing with other (anti)monotonic operations gives an (anti)monotonic result, so having checked our function is correct at powers of two, it can’t go too badly adrift in between.

We can improve our approximation by using a smaller mantissa offset, `0x3759df`, and we end up with the function we came in with:

Not especially pretty, but a good starting point for some Newton-Raphson. Notice that as well as kinks at exact powers of two, this approximation has kinks in between as well (when adding the constant to the mantissa overflows into the exponent).
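A quick empirical check of that initial approximation (a hypothetical Python transcription of `isqrt` above; `-n` is taken mod 2^32 to mimic unsigned wraparound):

```python
import struct

def isqrt(f):
    # initial approximation only, no Newton-Raphson step
    n = struct.unpack('<I', struct.pack('<f', f))[0]
    n >>= 1
    n = (-n) & 0xffffffff            # unsigned negate, mod 2^32
    n = (n + 0x5f000000) & 0xffffffff
    n = (n + 0x3759df) & 0xffffffff
    return struct.unpack('<f', struct.pack('<I', n))[0]

for x in [0.25, 1.0, 2.0, 4.0, 100.0]:
    approx, exact = isqrt(x), x ** -0.5
    print(x, approx, abs(approx - exact) / exact)
```

The relative error stays within a few percent, which is why a single Newton-Raphson iteration is enough for most uses.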

# Reversing a 64-bit word

**Posted:** November 18, 2012

**Filed under:** Bit twiddling

Here’s a cute way of reversing a 64-bit word (there is a similar, slightly faster, but slightly more obscure method in TAOCP 4a, so if you really need to save the cycles, use that one).

We use ternary swaps: for example, to bit-reverse a 3-bit sequence, we can just swap the high and low bits. To bit-reverse a 9-bit segment, break into [3,3,3] and swap the top and bottom 3-bit sections, and then bit-reverse each 3-bit section separately.

More generally, we can reverse any bit sequence by breaking it up into [M,N,M], swapping top and bottom sections (by shifting up and down by M+N), and recursing.

So, to reverse a 63 bit number, since 63 = 3*3*7, we can nest two 3-swaps and a 7-swap. We can do the actual swapping of sections with another classy Knuth function, nicely expressible as a C++ template:

```cpp
template <typename T, T m, int k>
static inline T swapbits(T p) {
  T q = ((p>>k)^p)&m;
  return p^q^(q<<k);
}
```

`m` is a mask, `k` a shift amount; the function swaps the bits at positions given by the set bits of `m` and `m<<k` (clearly these two should be disjoint, ie. `m & m<<k == 0`).

All we need to do now is calculate the masks and the shift amounts.

First 3-swap: the binary mask is `...001001001` and we need to shift the mask up by 2. We can see that `m1 + m1<<1 + m1<<2 = 2^63-1`, so we can calculate the mask with a (compile-time) division.
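The division trick is easy to verify (a hypothetical Python check):

```python
# ...001001001 repeated across 63 bits, computed by division as in the C++ below
m1 = ((1 << 63) - 1) // (1 + (1 << 1) + (1 << 2))
assert m1 + (m1 << 1) + (m1 << 2) == (1 << 63) - 1
print(hex(m1))  # 0x1249249249249249
```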

Second 3-swap: the binary mask is `...000000111000000111` and we need to shift the mask up by 6. Once again, we can easily compute the correct mask.

The 7-swap (ie. reverse 7 sections of 9 bits) we do in two stages: do a 3-swap for the top and bottom sections, so the mask is `111111111 + 111111111<<36`; finally we swap the top and bottom 27 bits, so the mask is just 2^27-1, and the shift is 36.

This reverses the bottom 63 bits, a simple rotate by one then puts everything into the right place.

```cpp
uint64_t bitreverse(uint64_t n) {
  static const uint64_t m1 = ((uint64_t(1)<<63)-1)/(1+(1<<1)+(1<<2));
  static const uint64_t m2 = ((uint64_t(1)<<63)-1)/(1+(1<<3)+(1<<6));
  static const uint64_t m3 = ((uint64_t(1)<<9)-1)+(((uint64_t(1)<<9)-1)<<36);
  static const uint64_t m4 = (uint64_t(1)<<27)-1;
  n = swapbits<uint64_t, m1, 2>(n);
  n = swapbits<uint64_t, m2, 6>(n);
  n = swapbits<uint64_t, m3, 18>(n);
  n = swapbits<uint64_t, m4, 36>(n);
  n = (n >> 63) | (n << 1);
  return n;
}
```

Here is what gcc makes of that:

```
_Z10bitreversey:
        movq    %rdi, %rdx
        movabsq $1317624576693539401, %rax
        movabsq $126347562148695559, %rcx
        shrq    $2, %rdx
        xorq    %rdi, %rdx
        andq    %rax, %rdx
        movq    %rdx, %rax
        salq    $2, %rdx
        xorq    %rdi, %rax
        xorq    %rdx, %rax
        movq    %rax, %rdx
        shrq    $6, %rdx
        xorq    %rax, %rdx
        andq    %rcx, %rdx
        movabsq $35115652612607, %rcx
        xorq    %rdx, %rax
        salq    $6, %rdx
        xorq    %rdx, %rax
        movq    %rax, %rdx
        shrq    $18, %rdx
        xorq    %rax, %rdx
        andq    %rcx, %rdx
        xorq    %rdx, %rax
        salq    $18, %rdx
        xorq    %rdx, %rax
        movq    %rax, %rdx
        shrq    $36, %rdx
        xorq    %rax, %rdx
        andl    $134217727, %edx
        xorq    %rdx, %rax
        salq    $36, %rdx
        xorq    %rdx, %rax
        rorq    $63, %rax
        ret
```

For comparison, here is Knuth’s 64-bit reverse (I’ve just hard-coded the constants this time). It’s based on a 32-bit reverse that breaks one 17-bit segment into [7,3,7] and the remaining 15-bit segment into [3,7,3] – we can do both swaps with the same shift of 10. First step is to swap adjacent bits which can be done slightly faster than a general swap. Very cunning:

```cpp
uint64_t kbitreverse(uint64_t n) {
  static const uint64_t m0 = 0x5555555555555555LLU;
  static const uint64_t m1 = 0x0300c0303030c303LLU;
  static const uint64_t m2 = 0x00c0300c03f0003fLLU;
  static const uint64_t m3 = 0x00000ffc00003fffLLU;
  n = ((n>>1)&m0) | (n&m0)<<1;
  n = swapbits<uint64_t, m1, 4>(n);
  n = swapbits<uint64_t, m2, 8>(n);
  n = swapbits<uint64_t, m3, 20>(n);
  n = (n >> 34) | (n << 30);
  return n;
}
```

and the corresponding compiler output:

```
_Z11kbitreversey:
        movq    %rdi, %rdx
        movabsq $6148914691236517205, %rax
        movabsq $216384095313249027, %rcx
        shrq    %rdx
        andq    %rax, %rdx
        andq    %rdi, %rax
        addq    %rax, %rax
        orq     %rdx, %rax
        movq    %rax, %rdx
        shrq    $4, %rdx
        xorq    %rax, %rdx
        andq    %rcx, %rdx
        movabsq $54096023692247103, %rcx
        xorq    %rdx, %rax
        salq    $4, %rdx
        xorq    %rdx, %rax
        movq    %rax, %rdx
        shrq    $8, %rdx
        xorq    %rax, %rdx
        andq    %rcx, %rdx
        movabsq $17575006191615, %rcx
        xorq    %rdx, %rax
        salq    $8, %rdx
        xorq    %rdx, %rax
        movq    %rax, %rdx
        shrq    $20, %rdx
        xorq    %rax, %rdx
        andq    %rcx, %rdx
        xorq    %rdx, %rax
        salq    $20, %rdx
        xorq    %rdx, %rax
        rorq    $34, %rax
        ret
```

Knuth wins here by 1 instruction! Oh well, maybe I’ll have better luck next time…
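As a sanity check on the masks and shifts, the same construction can be transcribed into Python (a hypothetical sketch, with explicit masking to 64 bits) and compared against a naive string-based reversal:

```python
M64 = (1 << 64) - 1

def swapbits(p, m, k):
    # swap the bit groups selected by m with those selected by m << k
    q = ((p >> k) ^ p) & m
    return p ^ q ^ ((q << k) & M64)

def bitreverse(n):
    m1 = ((1 << 63) - 1) // (1 + (1 << 1) + (1 << 2))
    m2 = ((1 << 63) - 1) // (1 + (1 << 3) + (1 << 6))
    m3 = ((1 << 9) - 1) + (((1 << 9) - 1) << 36)
    m4 = (1 << 27) - 1
    n = swapbits(n, m1, 2)
    n = swapbits(n, m2, 6)
    n = swapbits(n, m3, 18)
    n = swapbits(n, m4, 36)
    return ((n >> 63) | (n << 1)) & M64  # rotate the 63-bit reversal into place

def naive(n):
    return int(format(n, '064b')[::-1], 2)

n = 0x0123456789abcdef
print(hex(bitreverse(n)), bitreverse(n) == naive(n))
```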

# Embedded Python Interpreter

**Posted:** November 4, 2012

**Filed under:** C++, Python, Unix

And now for something completely different…

Often, I’d like to embed a reasonably capable command interpreter in a C++ application. Python seems a likely candidate, so here’s some investigative code using separate processes (the next step will be to use threads, if that’s possible, so the interpreter can live in the same memory space as our application, that can wait for part II though). As well as the mechanics of embedding Python, we have a pleasant excursion through the sometimes murky worlds of signal handling and pseudo-terminals.

The server structure is conventional (though not necessarily suitable for a serious production server): on each incoming connection we `fork` a handler process; this in turn splits into two processes, which form their own process group under the control of a pseudo-terminal (pty). One forwarding process copies data between the socket and the master side of the pty; the other process runs the interpreter itself on the slave side. Simple enough, with a few subtleties. To get signal handling right, we have to ignore SIGINT in the forwarding process (otherwise it will terminate on interrupt, taking the interpreter with it), but leave the default handler in the interpreter process – Python sets up its own signal handler, but it only seems to do this if the handler hasn’t been redefined already. Also, Python seems to insist that it uses fds 0, 1 and 2, so we need to rebind them, and, finally, to get Python to do line editing, we need to `import readline` in the interpreter.

My main interest here is in getting external access to the interpreter, rather than the mechanics of calling between C and Python, so we just have a couple of simple functions `init()` and `func()` defined in the embedded interpreter as examples. At this simple level I don’t think we need to worry about reference counts etc.

```c
#include <Python.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
#include <time.h>
#include <errno.h>
#include <netinet/ip.h>
#include <sys/epoll.h>

// Some handy macros to help with error checking
// When prototyping, it's a good idea to check every
// system call for errors, these macros help to keep
// the code uncluttered.
#define CHECK(e) \
  ((e)? \
   (void)0: \
   (fprintf(stderr, "'%s' failed at %s:%d\n - %s\n", \
            #e, __FILE__, __LINE__, strerror(errno)), \
    exit(0)))
#define CHECKSYS(e) (CHECK((e)==0))
#define CHECKFD(e) (CHECK((e)>=0))

// We are told not to use signal, due to portability problems,
// so we will define a similar function ourselves with sigaction
void setsignal(int signal, sighandler_t handler) {
  struct sigaction sa;
  memset(&sa,0,sizeof(sa));
  sa.sa_handler = handler;
  CHECKSYS(sigaction(signal,&sa,NULL));
}

// Make a suitable server socket; as a small concession to
// security, we will hardwire the loopback address as the
// bind address. People elsewhere can come in through an SSH
// tunnel.
int makeserversock(int port) {
  int serversock = socket(AF_INET,SOCK_STREAM,0);
  CHECKFD(serversock);
  sockaddr_in saddr;
  saddr.sin_family = PF_INET;
  saddr.sin_port = htons(port);
  saddr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  int optval = 1;
  CHECKSYS(setsockopt(serversock, SOL_SOCKET, SO_REUSEADDR,
                      &optval, sizeof optval));
  CHECKSYS(bind(serversock,(sockaddr*)&saddr,sizeof(saddr)));
  CHECKSYS(listen(serversock,10));
  return serversock;
}

// Copy data between our socket fd and the master
// side of the pty. A simple epoll loop.
int runforwarder(int mpty, int sockfd) {
  static const int MAX_EVENTS = 10;
  int epollfd = epoll_create(MAX_EVENTS);
  CHECKFD(epollfd);
  epoll_event event;
  memset(&event, 0, sizeof(event));
  event.events = EPOLLIN;
  event.data.fd = sockfd;
  CHECKSYS(epoll_ctl(epollfd, EPOLL_CTL_ADD, sockfd, &event));
  event.data.fd = mpty;
  CHECKSYS(epoll_ctl(epollfd, EPOLL_CTL_ADD, mpty, &event));
  char ibuff[256];
  while (true) {
    struct epoll_event events[MAX_EVENTS];
    int nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
    // Maybe treat EINTR specially here.
    CHECK(nfds >= 0);
    for (int i = 0; i < nfds; ++i) {
      int fd = events[i].data.fd;
      if (events[i].events & EPOLLIN) {
        ssize_t nread = read(fd,ibuff,sizeof(ibuff));
        CHECK(nread >= 0);
        if (nread == 0) {
          goto finish;
        } else {
          write(mpty+sockfd-fd,ibuff,nread);
        }
      } else if (events[i].events & (EPOLLERR|EPOLLHUP)) {
        goto finish;
      } else {
        fprintf(stderr, "Unexpected event for %d: 0x%x\n",
                fd, events[i].events);
        goto finish;
      }
    }
  }
 finish:
  CHECKSYS(close(mpty));
  CHECKSYS(close(sockfd));
  CHECKSYS(close(epollfd));
  return 0;
}

// The "application" functions to be accessible from
// the embedded interpreter
int myinit() {
  srand(time(NULL));
  return 0;
}

int myfunc() {
  return rand();
}

// Python wrappers around our application functions
static PyObject* emb_init(PyObject *self, PyObject *args) {
  if (!PyArg_ParseTuple(args, ":init")) return NULL;
  return Py_BuildValue("i", myinit());
}

static PyObject* emb_func(PyObject *self, PyObject *args) {
  if (!PyArg_ParseTuple(args, ":func")) return NULL;
  return Py_BuildValue("i", myfunc());
}

static PyMethodDef EmbMethods[] = {
  {"init", emb_init, METH_VARARGS, "(Re)initialize the application."},
  {"func", emb_func, METH_VARARGS, "Run the application"},
  {NULL, NULL, 0, NULL}
};

int runinterpreter(char *argname, int fd) {
  CHECKFD(dup2(fd,0));
  CHECKFD(dup2(fd,1));
  CHECKFD(dup2(fd,2));
  CHECKSYS(close(fd));
  Py_SetProgramName(argname);
  Py_Initialize();
  Py_InitModule("emb", EmbMethods);
  PyRun_SimpleString("from time import time,ctime\n");
  PyRun_SimpleString("from emb import init,func\n");
  PyRun_SimpleString("print('Today is',ctime(time()))\n");
  PyRun_SimpleString("import readline\n");
  PyRun_InteractiveLoop(stdin, "-");
  Py_Finalize();
  return 0;
}

int main(int argc, char *argv[]) {
  int port = -1;
  if (argc > 1) {
    port = atoi(argv[1]);
  } else {
    fprintf(stderr, "Usage: %s <port>\n", argv[0]);
    exit(0);
  }
  setsignal(SIGCHLD, SIG_IGN);
  int serversock = makeserversock(port);
  while (true) {
    int sockfd = accept(serversock,NULL,NULL);
    CHECKFD(sockfd);
    if (fork() != 0) {
      // Server side, close new connection and continue
      CHECKSYS(close(sockfd));
    } else {
      // Client side, close server socket
      CHECKSYS(close(serversock));
      serversock = -1;
      // Create a pseudo-terminal
      int mpty = posix_openpt(O_RDWR);
      CHECKFD(mpty);
      CHECKSYS(grantpt(mpty));  // pty magic
      CHECKSYS(unlockpt(mpty));
      // Start our own session
      CHECK(setsid() > 0);
      int spty = open(ptsname(mpty),O_RDWR);
      // spty is now our controlling terminal
      CHECKFD(spty);
      // Now split into two processes, one copying data
      // between socket and pty; the other running the
      // actual interpreter.
      if (fork() != 0) {
        CHECKSYS(close(spty));
        // Ignore sigint here
        setsignal(SIGINT, SIG_IGN);
        return runforwarder(sockfd,mpty);
      } else {
        CHECKSYS(close(sockfd));
        CHECKSYS(close(mpty));
        // Default sigint here - will be replaced by interpreter
        setsignal(SIGINT, SIG_DFL);
        return runinterpreter(argv[0],spty);
      }
    }
  }
}
```

Compilation needs something like:

```
g++ -g -L/usr/lib/python2.6/config -lpython2.6 -I/usr/include/python2.6 -Wall embed.cpp -o embed
```

Suitable flags can be obtained by doing:

```
/usr/bin/python2.6-config --cflags
/usr/bin/python2.6-config --ldflags
```

Of course, all this will depend on your exact Python version and where it is installed. Embedding has changed somewhat in Python 3, but most of this will still apply.

To connect to the interpreter, we can use our good friend `netcat`, with some extra tty mangling (we want eg. control-C to be handled by the pty defined above in the server code, not the user terminal, so we put that into raw mode).

```sh
#!/bin/sh
ttystate=`stty --save`
stty raw -echo
netcat $*
stty $ttystate
```

We set up the server socket to only listen on the loopback interface, so in order to have secure remote access, we can set up an SSH tunnel by running something like:

```
$ ssh -N -L 9998:localhost:9999 <serverhost>
```

on the client host.

Finally, we can run some Python:

```
$ connect localhost 9998
('Today is', 'Sun Nov  4 21:09:09 2012')
>>> print 1
1
>>> init()
0
>>> func()
191482566
>>> ^C
KeyboardInterrupt
>>> ^D
$
```

# Lambda Interpreter, Part II, Semantics

**Posted:** October 24, 2012

**Filed under:** Lambda Calculus, ML

First thing, we need to be able to define identifiers, at least at the top level, for `S`, `K`, `I`, etc. A simple list of string-value pairs will do; we will fill in our default environment later:

```sml
fun getenv [] v = NONE
  | getenv ((v',e)::s) v = if (v = v') then SOME e else getenv s v;

(* Make an environment from a list of strings *)
fun mkenv [] k = k
  | mkenv (v::e::s) k = mkenv s ((v,parse e)::k)
```

We will define two evaluation mechanisms: one will reduce a single redex at a time, with the index of the redex passed in to the function; the other will just reduce until a normal form is reached, if there is one.

It’s convenient to use option types to indicate if a subexpression has been reduced, and it’s even more convenient to define some monadic-style helper functions:

```sml
fun lift f NONE = NONE
  | lift f (SOME x) = SOME (f x);

fun bind f NONE = NONE
  | bind f (SOME a) = f a;

(* Apply f to a, if no result, apply g to b *)
fun try f a g b =
    case f a of
        r as SOME _ => r
      | NONE => g b;

fun get (SOME x) = x;
```

Here’s the single redex reducer: try to reduce each subexpression, keeping track of redex indexes. Return `SOME(e')` if a sub-evaluation succeeds, else return `NONE`, with some monadic trickery to handle the mechanics. We avoid expanding global identifiers unless they are in functional position.

```sml
fun reduce env n e =
    let fun aux n (Var _) = NONE
          | aux 0 (App(Lambda(v,e1),e2)) = SOME (subst v e2 e1)
          | aux n (App(Lambda(v,e1),e2)) =
            try (lift (fn e1' => App(Lambda(v,e1'),e2)) o aux (n-1)) e1
                (lift (fn e2' => App(Lambda(v,e1),e2')) o aux (n-1)) e2
          | aux n (App(e1 as Var v,e2)) =
            try (bind (fn e1' => aux n (App(e1',e2))) o getenv env) v
                (lift (fn e2' => App(e1,e2')) o aux n) e2
          | aux n (App(e1,e2)) =
            try (lift (fn e1' => App(e1',e2)) o aux n) e1
                (lift (fn e2' => App(e1,e2')) o aux n) e2
          | aux n (Lambda(v,e1)) = (lift (fn e1' => Lambda(v,e1')) o aux n) e1
    in aux n e end;
```

That’s all very well for reducing individual redexes; let’s define something that will just let rip on an expression. We’d like to do a proper normal order evaluation, so we’ll use an evaluation stack: go down the left branch of the expression, reducing beta redexes as we go. When we have got to the end, there are no more top-level redexes, so recursively evaluate the items on the stack, and finally fold everything back up into a single expression:

```sml
fun eval env e =
    let fun foldapp (e1::e2::s) = foldapp (App(e1,e2)::s)
          | foldapp [e1] = e1;
        fun aux (Lambda(v,e1)) (e2::s) = aux (subst v e2 e1) s
          | aux (Lambda(v,e1)) [] = Lambda(v, eval env e1)
          | aux (App(e1,e2)) s = aux e1 (e2::s)
          | aux (e as Var _) [] = e
          | aux (e as Var v) s =
            (case getenv env v of
                 SOME e' => aux e' s
               | _ => foldapp (map (eval env) (e::s)));
    in aux e [] end;
```

All we need now is a suitable environment, and we can start reducing.

```sml
val stdenv = mkenv
  ["S", "λxyz.(xz)(yz)",
   "K", "λxy.x",
   "I", "λx.x",
   "Y", "λf.(λx.f(xx))(λx.f(xx))",
   "M", "λxy.y(xy)",
   "T", "λxy.x",
   "F", "λxy.y",
   "Z", "λn.n(λx.F)T",
   "0", "λf.λn.n",
   "N", "λn.λf.λx.f(nfx)",
   "P", "λnfx.n(λgh.h(gf))(λu.x)(λu.u)",
   "*", "λmnfx.m(nf)x",
   "+", "λmnfx.mf(nfx)",
   "1", "N0", "2", "N1", "3", "N2", "4", "N3", "5", "N4",
   "6", "N5", "7", "N6", "8", "N7", "9", "N8",
   "H", "Y(λgn.(Zn)1(*n(g(Pn))))"
  ] [];
```

As well as the usual suspects, `M` is my favourite combinator. `T` and `F` help define conditionals, and there is the usual definition of Church numerals. `H`, obviously, is the factorial function.
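The same construction works in any language with first-class functions; here is a hypothetical Python rendering of `H` over Church numerals (Python is strict, so `Y` becomes the Z combinator, and the conditional takes thunks to delay the branches):

```python
# Z combinator: a fixpoint that works in a strict language
fix = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

ZERO = lambda f: lambda x: x
SUCC = lambda n: lambda f: lambda x: f(n(f)(x))   # N
MULT = lambda m: lambda n: lambda f: m(n(f))      # *
PRED = lambda n: lambda f: lambda x: \
    n(lambda g: lambda h: h(g(f)))(lambda u: x)(lambda u: u)  # P
# Zero test returning a chooser of thunks (T/F with delayed branches)
ISZERO = lambda n: n(lambda _: lambda a: lambda b: b())(lambda a: lambda b: a())

ONE  = SUCC(ZERO)
FOUR = SUCC(SUCC(SUCC(ONE)))

# H = Y(λgn.(Zn)1(*n(g(Pn)))), with thunked conditional branches
H = fix(lambda g: lambda n:
        ISZERO(n)(lambda: ONE)(lambda: MULT(n)(g(PRED(n)))))

to_int = lambda n: n(lambda k: k + 1)(0)
print(to_int(H(FOUR)))  # 24
```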

Finally, we define some convenience functions:

```sml
fun run e =
    case reduce stdenv 0 (pp e) of
        SOME e' => run e'
      | NONE => e;
(* Evaluate first redex *)
fun e0 e = pp (get (reduce stdenv 0 e));
(* Evaluate second redex *)
fun e1 e = pp (get (reduce stdenv 1 e));
(* Evaluate a symbol *)
fun esym s = pp (get (getenv stdenv s));
(* Evaluate to normal form *)
fun e s = pp (eval stdenv (parse s));
```

And now at last we can do some reductions:

First we probably should check that variable capture is handled properly:

```
- e "(\\abcd.abcd)xyzw";
xyzw
val it = App (App (App #,Var #),Var "w") : Exp
- e "(\\vxx'x''.vxx'x'')xyzw";
xyzw
val it = App (App (App #,Var #),Var "w") : Exp
```

It’s interesting that in the first reduction of the latter, there is a sort of cascade of primes being added:

```
- e0 (parse "(\\vxx'x''.vxx'x'')xyzw");
(λx'x''x'''.xx'x''x''')yzw
val it = App (App (App #,Var #),Var "w") : Exp
```

Anyway, that seems to check out, so we can move on to something more interesting. Let’s derive the Turing fixpoint combinator from `M` and `Y`:

```
- e1 (e1 (e0 (parse "YM")));
(λx.M(xx))(λx.M(xx))
(λxy.y(xxy))(λx.M(xx))
(λxy.y(xxy))(λxy.y(xxy))
val it = App (Lambda ("x",Lambda #),Lambda ("x",Lambda #)) : Exp
```

Nice. And we can do factorials:

    - e "H4";
    λfx.f(f(f(f(f(f(f(f(f(f(f(f(f(f(f(f(f(f(f(f(f(f(f(fx)))))))))))))))))))))))
    val it = Lambda ("f",Lambda ("x",App #)) : Exp

The output is a little big to put inline, but the enthusiastic can try:

    run (parse "H2");

See [[here]] for output.

# Lambda Interpreter, Part I, Syntax.

**Posted:**October 9, 2012

**Filed under:**Lambda Calculus, ML Leave a comment

Here’s a simple lambda calculus interpreter I wrote a little while ago. It’s in ML, a wonderful language for its polymorphic type inference, pattern matching, and effortless higher order functions.

We start off nice and simple with some abstract syntax:

    type Variable = string;
    datatype Exp = Var of Variable
                 | App of Exp * Exp
                 | Lambda of Variable * Exp;

We could use de Bruijn indices for variables, making variable substitution easier, but I’m a bit of a traditionalist and it’s not too difficult to do it the hard way.

First job is printing expressions; a nice little recursive scanner will do, and we can sort out bracketing, eliding `λx.λy.e` to `λxy.e` and so on with some nifty pattern matching:

    fun pexp (Var x) = x
      | pexp (e as Lambda _) = "λ" ^ plambda e
      | pexp (App(e1 as App _,e2)) = pexp e1 ^ pexp1 e2
      | pexp (App(e1,e2)) = pexp1 e1 ^ pexp1 e2
    and pexp1 (Var x) = x
      | pexp1 e = "(" ^ pexp e ^ ")"
    and plambda (Lambda(v,e)) = v ^ plambda e
      | plambda e = "." ^ pexp e;
    fun pp e = (print (pexp e ^ "\n"); e)

Next up, variable substitution, a little tedious, but has to be done. First, we need to know if a variable occurs free in an expression. If it does, we need to find a variable that doesn’t, which we do by decorating with primes. Having got ourselves a variable that isn’t free, we can use it to substitute for the one that is, and that’s all there is to it. The code is probably clearer:

    fun isfree c (Var c') = c = c'
      | isfree c (Lambda(c',e)) = if (c = c') then false else isfree c e
      | isfree c (App(e1,e2)) = if isfree c e1 then true else isfree c e2;
    fun occurs c (Var c') = c = c'
      | occurs c (Lambda(c',e)) = if (c = c') then true else occurs c e
      | occurs c (App(e1,e2)) = if occurs c e1 then true else occurs c e2;
    (* Add primes to variable until we find one not occurring in e *)
    fun findnew v e =
        let val v' = v ^ "'"
        in if not (occurs v' e) then v' else findnew v' e end;
    fun subst v e1 (e2 as (Var v')) = if v = v' then e1 else e2
      | subst v e1 (App(e2,e3)) = App(subst v e1 e2, subst v e1 e3)
      | subst v e1 (e2 as Lambda(v',e3)) =
        if not (isfree v e2) then e2 (* Includes case v = v' *)
        else if isfree v' e1 then
            (* Find a variable not in e1 or e3 to use *)
            let val v'' = findnew v' (App(e1,e3))
            in subst v e1 (Lambda(v'', subst v' (Var v'') e3)) end
        else Lambda(v', subst v e1 e3);
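For readers who don’t speak ML, here is the same capture-avoiding substitution as a Python sketch of mine (not part of the original post); expressions are tuples `('var', v)`, `('app', e1, e2)` and `('lam', v, e)`:

```python
# Capture-avoiding substitution, mirroring the ML version above.
def isfree(v, e):
    """Does v occur free in e?"""
    tag = e[0]
    if tag == 'var': return e[1] == v
    if tag == 'lam': return e[1] != v and isfree(v, e[2])
    return isfree(v, e[1]) or isfree(v, e[2])

def occurs(v, e):
    """Does v occur anywhere in e, free or bound?"""
    tag = e[0]
    if tag == 'var': return e[1] == v
    if tag == 'lam': return e[1] == v or occurs(v, e[2])
    return occurs(v, e[1]) or occurs(v, e[2])

def findnew(v, e):
    # add primes until the variable occurs nowhere in e
    v += "'"
    while occurs(v, e):
        v += "'"
    return v

def subst(v, e1, e2):
    """Replace free occurrences of v in e2 by e1."""
    tag = e2[0]
    if tag == 'var':
        return e1 if e2[1] == v else e2
    if tag == 'app':
        return ('app', subst(v, e1, e2[1]), subst(v, e1, e2[2]))
    v2, body = e2[1], e2[2]
    if not isfree(v, e2):              # includes the case v == v2
        return e2
    if isfree(v2, e1):                 # rename the bound variable first
        v3 = findnew(v2, ('app', e1, body))
        return subst(v, e1, ('lam', v3, subst(v2, ('var', v3), body)))
    return ('lam', v2, subst(v, e1, body))
```

Substituting `y` for `x` in `λy.x` renames the binder to `y'`, just as the ML code does with its cascade of primes.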

Phew, glad that’s over. Next, we need to lex and parse expressions. Lexing is straightforward: variables are single letters, and we allow either `\` or `λ` for lambda; it seems that we get the UTF-8 bytes for `λ` one at a time, so that’s just about our only multi-character token, apart from primed identifiers (since we use primes to avoid variable capture in substitutions, we want to be able to read them in as well). Lex input is just a list of characters from an exploded string.

    datatype LexItem = LAM | BRA | KET | DOT | VAR of string;
    fun lex [] t = rev t
      | lex (#" "::s) t = lex s t
      | lex (#"\n"::s) t = lex s t
      | lex (#"\t"::s) t = lex s t
      | lex (#"\\"::s) t = lex s (LAM::t)
      | lex (#"("::s) t = lex s (BRA::t)
      | lex (#")"::s) t = lex s (KET::t)
      | lex (#"."::s) t = lex s (DOT::t)
      | lex (#"\206" :: #"\187" ::s) t = lex s (LAM::t)
      | lex (c::s) t = lexvar s [c] t
    and lexvar (#"'"::s) v t = lexvar s (#"'"::v) t
      | lexvar s v t = lex s (VAR (implode (rev v))::t);

Parsing is even more fun. This is a hand-built, table-driven LR parser; table-driven parsers are good for tricks like semi-intelligent error recovery or doing parse conflict resolution on the fly (useful eg. for languages with redefinable operator precedences, like ML). We don’t do anything like that though; we don’t even explicitly detect errors, and instead rely on ML’s pattern matching – if the input is ungrammatical, we get an inexhaustive match error:

    fun parse s =
        let (* Items appearing on the parse stack *)
            datatype ParseItem = B | E of Exp | V of Variable;
            fun aux [E e] [] = e
              | aux (E e1::E e2::s) t = aux (E(App(e2,e1))::s) t
              | aux s (LAM::VAR c::DOT::t) = aux (V c::s) t
              | aux s (LAM::VAR c::VAR c'::t) = aux (V c::s) (LAM::VAR c'::t)
              | aux s (BRA::t) = aux (B::s) t
              | aux ((a as E _)::B::s) (KET::t) = aux (a::s) t
              | aux s (VAR c::t) = aux (E(Var c)::s) t
              | aux (E e::V v::s) t = aux (E(Lambda(v,e))::s) t;
        in aux [] (lex (explode s) []) end

Well, that’s it for the moment. Next installment – evaluating expressions.

# Templated bit twiddling

**Posted:**October 6, 2012

**Filed under:**Bit twiddling Leave a comment

Sometimes it’s nice to combine C++ template metaprogramming with some bit twiddling; here’s a classic parallelized bit-counting algorithm with all the magic numbers constructed as part of template instantiation:

    #include <stdio.h>
    #include <stdlib.h>

    template <typename T, int k, T m> struct A {
        static T f(T n) {
            static const int k1 = k/2;
            n = A<T, k1, m^(m<<k1)>::f(n);
            return ((n>>k)&m) + (n&m);
        }
    };
    template <typename T, T m> struct A<T, 0, m> {
        static T f(T n) { return n; }
    };
    template <typename T> static inline int bitcount(T n) {
        static const int k = 4 * sizeof(n);
        return A<T, k, (T(1)<<k)-1>::f(n);
    }
    int main(int argc, char *argv[]) {
        unsigned long n = strtoul(argv[1], NULL, 10);
        printf("bitcount(%lu) = %d\n", n, bitcount(n));
    }
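If the template machinery obscures what the masks actually are, the same divide-and-conquer count can be written out in Python (my transcription, not part of the original post); the mask chain is computed exactly as the nested instantiations compute it, via `m ^ (m << k)`:

```python
# Divide-and-conquer bit count: each pass sums adjacent k-bit fields.
def bitcount(n, bits=32):
    k = bits // 2
    m = (1 << k) - 1              # top-level mask: low half all ones
    masks = []
    while k:
        masks.append((k, m))
        k //= 2
        m ^= m << k               # 0xFFFF -> 0x00FF00FF -> 0x0F0F0F0F -> ...
    # apply bottom-up, as the recursive template does
    for k, m in reversed(masks):
        n = ((n >> k) & m) + (n & m)
    return n
```

After the last pass the whole word holds the population count; Python’s unbounded integers mean we never have to worry about the intermediate sums overflowing.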

We could extract the magic number construction out as a separate template, and there are some extra optimizations that this code doesn’t do, for example, the level 1 code can be simplified to:

    i = i - ((i >> 1) & 0x55555555);

rather than what we get from the above:

    i = ((i >> 1) & 0x55555555) + (i & 0x55555555);

Adding an appropriate partial template specialization for `A<T,1,m>` is left as an exercise.
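The shortcut works because a 2-bit field holding bits a and b has value 2a+b, and its bit count a+b equals (2a+b) − a, so subtracting the shifted-and-masked high bits gives the same result as adding the two masked halves. A quick exhaustive check (mine) over the low 16 bits:

```python
# Level-1 shortcut: i - ((i >> 1) & 0x5555...) == ((i >> 1) & m) + (i & m).
# In each 2-bit field ab the value is 2a+b and the bit count is (2a+b) - a.
M = 0x55555555
for i in range(1 << 16):
    sub_form = i - ((i >> 1) & M)
    add_form = ((i >> 1) & M) + (i & M)
    assert sub_form == add_form
```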

In fact, there are lots of shortcuts that can be made here; we aren’t really being competitive with Bit Twiddling Hacks:

    unsigned int v; // count bits set in this (32-bit value)
    unsigned int c; // store the total here
    static const int S[] = {1, 2, 4, 8, 16}; // Magic Binary Numbers
    static const int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};

    c = v - ((v >> 1) & B[0]);
    c = ((c >> S[1]) & B[1]) + (c & B[1]);
    c = ((c >> S[2]) + c) & B[2];
    c = ((c >> S[3]) + c) & B[3];
    c = ((c >> S[4]) + c) & B[4];

which would need 3 template specializations and we are starting to get a bit complicated…

In the same style then, here’s a bit reverse function, with a separate magic number calculating template:

    #include <stdio.h>
    #include <stdlib.h>

    template <typename T, int k> struct Magic {
        static const T m = ~T(0)/(T(1)+(T(1)<<k));
    };
    template <typename T, int k> struct RevAux {
        static inline T f(T n) {
            T n1 = RevAux<T,k/2>::f(n);
            return ((n1>>k) & Magic<T,k>::m) | ((n1 & Magic<T,k>::m)<<k);
        }
    };
    template <typename T> struct RevAux<T,0> {
        static inline T f(T n) { return n; }
    };
    template <typename T> T rev(T n) {
        return RevAux<T,4*sizeof(T)>::f(n);
    }
    int main(int argc, char *argv[]) {
        unsigned long n = strtoul(argv[1], NULL, 16);
        printf("%lx %lx\n", n, rev(n));
    }
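As a cross-check, here is the same mask-and-swap reversal written out in Python (a sketch of mine; `bits` plays the role of 8*sizeof(T)). The level-k magic number ~0/(1 + 2^k) has the low k bits set in every 2k-bit group:

```python
# Reverse the low `bits` bits of n by swapping adjacent fields of
# width 1, 2, 4, ... bits/2, using the same masks as Magic<T,k>.
def rev(n, bits=32):
    k = 1
    while k < bits:
        m = ((1 << bits) - 1) // (1 + (1 << k))   # 0x55555555 for k=1, etc.
        n = ((n >> k) & m) | ((n & m) << k)
        k *= 2
    return n
```

Like the template version, it does the narrow swaps first; since each level is an involution, applying `rev` twice gives back the original word.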

Now, how do we check that the template parameter T is unsigned (otherwise the right shifts are probably going to go wrong)…

A simple optimization is to replace the top level function with:

    template <typename T> static inline T rev(T n) {
        static const int k = 4*sizeof(T);
        n = RevAux<T,k/2>::f(n);
        return (n>>k)|(n<<k);
    }

but GCC seems to have worked that out for itself (compiling with gcc 4.4.3, -O3 -S):

    .globl _Z7reversej
        pushl %ebp
        movl %esp, %ebp
        movl 8(%ebp), %eax
        popl %ebp
        movl %eax, %edx
        andl $1431655765, %edx
        shrl %eax
        addl %edx, %edx
        andl $1431655765, %eax
        orl %eax, %edx
        movl %edx, %eax
        andl $858993459, %eax
        shrl $2, %edx
        andl $858993459, %edx
        sall $2, %eax
        orl %edx, %eax
        movl %eax, %edx
        andl $252645135, %edx
        shrl $4, %eax
        andl $252645135, %eax
        sall $4, %edx
        orl %eax, %edx
        movl %edx, %eax
        andl $16711935, %eax
        shrl $8, %edx
        sall $8, %eax
        andl $16711935, %edx
        orl %edx, %eax
        rorl $16, %eax
        ret

This sort of thing isn’t going to work very well if the compiler isn’t up to scratch, but GCC seems to have done it right here.

There are slightly faster ways to reverse bits using ternary swap (swap top and bottom thirds, recurse), but they need to be more ad hoc unless you have a ternary computer with word length a power of 3, so less amenable to this sort of generic treatment (or at least, it becomes a lot harder).

# My Favourite Combinator

**Posted:**October 5, 2012

**Filed under:**Lambda Calculus Leave a comment

A fixpoint operator `F` is such that:

    Ff = f(Ff)

ie. `F = λf.f(Ff)`, ie. `F` is a fixpoint of `λF.λf.f(Ff)`, ie. fixpoint operators are fixpoints of `M = λx.λy.y(xy)`.

`M` is my favourite combinator.

[As an aside, let’s do that constructive type theory thing and prove it as a theorem:

    x : (A->B)->A
    y : A->B
    xy : A
    y(xy) : B
    M : ((A->B)->A)->((A->B)->B) (MP applied twice)

Phew, glad that checks out]

Church’s `Y` operator is:

    Y = λf.(λx.f(xx))(λx.f(xx))

    [Yf = (λx.f(xx))(λx.f(xx))      [beta]
        = f((λx.f(xx))(λx.f(xx)))   [beta]
        = f(Yf)                     [subst of equals]]

Apply `Y` to `M`:

    YM = (λx.M(xx))(λx.M(xx))
       = (λx.λy.y(xxy))(λx.λy.y(xxy))  [λx.M(xx) = λx.λy.y(xxy)]
       = BB = T                        [writing B = λx.λy.y(xxy)]

`T` is the Turing fixpoint combinator. Now isn’t that weird? No doubt there is some deep reason behind this, but what it might be I have no idea. Perhaps it’s down to the well-known fact that `Yf` needs substitution of equals to show equivalence with `f(Yf)`, whereas `Tf` directly reduces to `f(Tf)`:

    Tf = (λx.λy.y(xxy))Bf [def]
       = f(BBf)           [beta]
       = f(Tf)            [def]

One up for Turing there!
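Because `Tf` genuinely reduces to `f(Tf)`, Turing’s combinator can even be run in a strict language. Here is a Python transcription of mine (not from the original post), with one eta-expansion of the self-application to stop the strict evaluator from looping:

```python
# Turing's fixpoint combinator T = BB, with B = λx.λy.y(xxy).
# Python evaluates arguments eagerly, so x(x)(y) is eta-expanded to
# lambda z: x(x)(y)(z); otherwise y(x(x)(y)) would recurse forever.
B = lambda x: lambda y: y(lambda z: x(x)(y)(z))
T = B(B)

# Tf = f(Tf), so T builds recursive functions without naming them:
fact = T(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))
```

Here `self` is literally `λz.Tf(z)`, so `self(n-1)` is `fact(n-1)` and the factorial unwinds as expected.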

# Nice series—look familiar?

**Posted:**October 4, 2012

**Filed under:**Floating Point Leave a comment

    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    16 18 20 22 24 26 28 30
    32 36 40 44 48 52 56 60
    64 72 80 88 96 104 112 120
    128 144 160 176 192 208 224 240
    256 288 320 352 384 416 448 480
    512 576 640 704 768 832 896 960
    1024 1152 1280 1408 1536 1664 1792 1920
    2048 2304 2560 2816 3072 3328 3584 3840
    4096 4608 5120 5632 6144 6656 7168 7680
    8192 9216 10240 11264 12288 13312 14336 15360
    16384 18432 20480 22528 24576 26624 28672 30720
    32768 36864 40960 45056 49152 53248 57344 61440
    65536 73728 81920 90112 98304 106496 114688 122880
    131072 147456 163840 180224 196608 212992 229376 245760
    262144 294912 327680 360448 393216 425984 458752 491520
    524288 589824 655360 720896 786432 851968 917504 983040
    1048576 1179648 1310720 1441792 1572864 1703936 1835008 1966080
    2097152 2359296 2621440 2883584 3145728 3407872 3670016 3932160
    4194304 4718592 5242880 5767168 6291456 6815744 7340032 7864320

The answer, of course, is that they are floating point numbers. Well, clearly that’s not true, they are integers, but the distribution is the same as that of floating point numbers; they just need some appropriate scaling, for example, dividing by 8388608 to get a nice set of numbers in the interval [0,1).

Notice that for all but the first row, every entry in a row has the same number of digits in its binary representation – they are normalized, with the first row corresponding to the denormalized numbers of IEEE-754. Without the denormalized numbers, zero would be sitting in splendid isolation, surrounded by a great hole, ready to catch the unwary.

Also, the numbers in each row are the same distance apart, with the gap at row n+1 being twice the gap at row n. Again, the first row is the exception, the gap here is the same as the second row – so the first 16 numbers form a uniformly spaced sequence from 0 – there isn’t anything very peculiar about the denormalized numbers, they just represent a continuation of the first row of normalized numbers to fill the gap down to zero.

We can see how a scaled comparison of floats might work too – within each row, two elements are close if they are within n places of each other, or equivalently, if they are within some constant multiple of the row gap. For elements in different rows, we can either just carry on counting the number of intervening numbers, or continue using the row gap to define a neighbourhood of each number. For example, if “close” means “twice the row gap”, then 22 is close to 18, 20, 24 and 26; 16 is close to 12, 13, 14, 15, 18 and 20; and 15 is close to just 13, 14 and 16. This notion of closeness is discussed by Knuth in TAOCP.
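The table above is easy to regenerate; here is a Python sketch (function and parameter names are mine) for a toy format with a 4-bit significand: the first row 0–15 is uniformly spaced, and each later row scales significands 8–15 by the next power of 2. It also checks the “twice the row gap” neighbourhood of 16:

```python
# Toy binary float distribution: denormals and first normals in row 0,
# then rows of normalized values m * 2**e for significands m in 8..15.
def toy_floats(sig_bits=4, max_exp=20):
    rows = [list(range(1 << sig_bits))]   # row 0: unit spacing, 0..15
    for e in range(1, max_exp):
        rows.append([m << e for m in range(1 << (sig_bits - 1), 1 << sig_bits)])
    return rows

rows = toy_floats()
flat = [x for r in rows for x in r]
gap = {x: r[1] - r[0] for r in rows for x in r}   # row gap of each number

# "close to 16" = within twice the gap of 16's row (gap 2, so within 4)
close_to_16 = [x for x in flat if x != 16 and abs(x - 16) <= 2 * gap[16]]
```

Running this, `close_to_16` comes out as `[12, 13, 14, 15, 18, 20]`, matching the discussion above, and each row’s gap is double the previous row’s (except between the first two rows, where it is the same).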

We don’t have to be twoist about this, here’s a similar sequence based on powers of 3, with 10 elements in each segment. This gives us 5 denormalized values:

    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
    15 18 21 24 27 30 33 36 39 42
    45 54 63 72 81 90 99 108 117 126
    135 162 189 216 243 270 297 324 351 378
    405 486 567 648 729 810 891 972 1053 1134
    1215 1458 1701 1944 2187 2430 2673 2916 3159 3402
    3645 4374 5103 5832 6561 7290 8019 8748 9477 10206
    10935 13122 15309 17496 19683 21870 24057 26244 28431 30618
    32805 39366 45927 52488 59049 65610 72171 78732 85293 91854
    98415 118098 137781 157464 177147 196830 216513 236196 255879 275562
    295245 354294 413343 472392 531441 590490 649539 708588 767637 826686
    885735 1062882 1240029 1417176 1594323 1771470 1948617 2125764 2302911 2480058

Once again, in each row except the first, every number has the same number of digits in its base 3 representation.
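The same construction works in base 3; here is a sketch of mine: 5 denormals 0–4 open the first row, and each later row scales significands 5–14 by the next power of 3, giving 10 entries per segment:

```python
# Toy base-3 float distribution: row 0 is 0..14 (denormals 0..4 plus the
# first normalized segment 5..14), then rows of m * 3**e for m in 5..14.
def toy_floats3(max_exp=12):
    rows = [list(range(15))]
    for e in range(1, max_exp):
        rows.append([m * 3**e for m in range(5, 15)])
    return rows

rows3 = toy_floats3()
```

Every entry in a row past the first has the same number of base-3 digits, since 5·3^e and 14·3^e lie between 3^(e+1) and 3^(e+2).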