valof

I’ve never written much BCPL, but it always seemed like a nice language. Many features ended up in C of course, but one that didn’t was VALOF, allowing the inclusion of a statement block in an expression, and using RESULTIS to return a value. Using slightly invented syntax:

LET B = VALOF $(
        FOR I = 1 TO J DO $(
                IF I*I = J THEN RESULTIS 1
        $)
        RESULTIS 0
$)

GCC defines the block-expression extension, allowing us to write:

int main(int argc, char *argv[])
{
  int j = atoi(argv[1]);
  printf("%d\n";,
         ({
           bool found = false;
           for (int i = 1; i < j; i++) {
             if (i*i == j) {
               found = true;
               goto end;
             }
           }
         end:
           found;
         }));
}

but having to use the last expression in the block as the return value can get a bit convoluted.

With C++11, we can do better with immediately applied lambda expressions, with return taking the role of RESULTIS:

int main(int argc, char *argv[])
{
  int j = atoi(argv[1]);
  printf("%d\n";,
         [=](){
           for (int i = 1; i < j; i++) {
             if (i*i == j) return true;
           }
           return false;
         }());
}

If we don’t like the smiley syntax, we can even define a VALOF macro:

#define VALOF(e) [=](){e}()
int main(int argc, char *argv[])
{
  int j = atoi(argv[1]);
  printf("%d\n";,
         VALOF(
             for (int i = 1; i < j; i++) {
               if (i*i == j) return true;
             }
             return false;
           ));
}

though we’ll have to be a little bit careful with commas not enclosed in parentheses (and we might end up needing different macros for different variable capture methods).

Curiously, this idiom seems to be popular in Javascript.

Advertisements

Krazy Kompilation

Let’s take a simple factorial program:

int fact(int n)
{
  if (n == 0) return 1;
  else return n * fact(n-1);
}

and look at what gcc compiles it to (using Matthew Godbolt’s excellent GCC Explorer):

With -O1, a nice clean version of the recursive function:

fact(int):
	pushq	%rbx
	movl	%edi, %ebx
	movl	$1, %eax
	testl	%edi, %edi
	je	.L2
	leal	-1(%rdi), %edi
	call	fact(int)
	imull	%ebx, %eax
.L2:
	popq	%rbx
	ret

With -O2, that’s impressive, it’s done some program transformation and now we have a loop:

fact(int):
	testl	%edi, %edi
	movl	$1, %eax
	je	.L4
.L3:
	imull	%edi, %eax
	subl	$1, %edi
	jne	.L3
	rep
	ret
.L4:
	rep
	ret

With -O3, whaaah!:

fact(int):
	testl	%edi, %edi
	je	.L7
	movl	%edi, %edx
	movl	%edi, %esi
	shrl	$2, %edx
	leal	0(,%rdx,4), %ecx
	testl	%ecx, %ecx
	je	.L8
	cmpl	$6, %edi
	jbe	.L8
	leal	-1(%rdi), %eax
	movl	%edi, -24(%rsp)
	movdqa	.LC1(%rip), %xmm4
	movl	%eax, -20(%rsp)
	leal	-2(%rdi), %eax
	movd	-20(%rsp), %xmm2
	movl	%eax, -16(%rsp)
	leal	-3(%rdi), %eax
	movd	-16(%rsp), %xmm1
	movl	%eax, -12(%rsp)
	xorl	%eax, %eax
	movd	-12(%rsp), %xmm0
	punpckldq	%xmm0, %xmm1
	movd	-24(%rsp), %xmm0
	punpckldq	%xmm2, %xmm0
	punpcklqdq	%xmm1, %xmm0
	movdqa	.LC0(%rip), %xmm1
.L4:
	movdqa	%xmm1, %xmm3
	psrldq	$4, %xmm1
	addl	$1, %eax
	movdqa	%xmm0, %xmm2
	cmpl	%eax, %edx
	pmuludq	%xmm0, %xmm3
	paddd	%xmm4, %xmm0
	psrldq	$4, %xmm2
	pmuludq	%xmm1, %xmm2
	pshufd	$8, %xmm3, %xmm1
	pshufd	$8, %xmm2, %xmm2
	punpckldq	%xmm2, %xmm1
	ja	.L4
	movdqa	%xmm1, %xmm2
	subl	%ecx, %edi
	cmpl	%ecx, %esi
	psrldq	$8, %xmm2
	movdqa	%xmm2, %xmm0
	psrldq	$4, %xmm2
	pmuludq	%xmm1, %xmm0
	psrldq	$4, %xmm1
	pshufd	$8, %xmm0, %xmm0
	pmuludq	%xmm2, %xmm1
	pshufd	$8, %xmm1, %xmm1
	punpckldq	%xmm1, %xmm0
	movdqa	%xmm0, %xmm1
	psrldq	$4, %xmm1
	movdqa	%xmm1, %xmm2
	psrldq	$4, %xmm1
	pmuludq	%xmm0, %xmm2
	psrldq	$4, %xmm0
	pmuludq	%xmm0, %xmm1
	pshufd	$8, %xmm2, %xmm0
	pshufd	$8, %xmm1, %xmm1
	punpckldq	%xmm1, %xmm0
	movd	%xmm0, -24(%rsp)
	movl	-24(%rsp), %eax
	je	.L13
.L9:
	imull	%edi, %eax
	subl	$1, %edi
	jne	.L9
	rep
	ret
.L7:
	movl	$1, %eax
	ret
.L8:
	movl	$1, %eax
	jmp	.L9
.L13:
	ret

Well, OK, it’s unwound the loop and vectorized with SSE instructions though a more precise analysis will have to wait for a future post. Impressive optimization from GCC though and pretty groovy considering the biggest factorial we can fit in a 32-bit int is fact(15)!


Redirection

Like, I suspect, many programmers, there are many software tools I have used on a regular basis for many years but remain woefully ignorant of their inner workings and true potential. One of these is the Unix command line.

It’s a common, for example, to want to make the error output of a program to appear as the normal output and trial and error, or Googling or looking at Stack Overflow leads to:

strace ls 2>&1 >/dev/null

which works fine but seems puzzling – we’ve told the program to send error output to normal output, then normal output to /dev/null so why doesn’t that discard everything, similar to:

strace ls >/dev/null 2>&1

This is because we don’t understand what is going on.

An indirection is actually a call to the dup2 system call. From the man page:

dup2() makes newfd be the copy of oldfd, closing newfd first if necessary'

So: n>&m does a dup2(m,n): close fd n if necessary, then make n be a copy of fd m, and n>file means: close n if necessary, open file as fd m, then do dup2(m,n).

Now it all makes sense:

strace ls 2>&1 1>/dev/null 

first of all makes 2 be a copy of 1, then changes 1 to point to /dev/null – the copying is done ‘by value’ as it were (despite the confusing, for C++ programmers anyway, use of ‘&’).

Using strace here is not an accident, but used like this doesn’t tell us much: indirection is handled by the shell, not by the program, so we need to do something like this for further insight:

$ strace -f -etrace=clone,execve,open,dup2 bash -c 'ls >/dev/null 2>&1'
execve("/bin/bash", ["bash", "-c", "ls >/dev/null 2>&1"], [/* 46 vars */]) = 0
clone(Process 20454 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f643f4c09d0) = 20454
Process 20453 suspended
[pid 20454] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 20454] dup2(3, 1)                  = 1
[pid 20454] dup2(1, 2)                  = 2
[pid 20454] execve("/bin/ls", ["ls"], [/* 45 vars */]) = 0
Process 20453 resumed
Process 20454 detached
--- SIGCHLD (Child exited) @ 0 (0) ---

bash forks off a subprocess (a clone syscall these days rather than fork), which then sets up the input and output before calling execve to actually run the command.

We don’t have to limit ourselves to fds 0,1 and 2 that we get to start with:

$ runit () { echo stderr 1>&2; echo stdout; }
$ runit 1>/dev/null
stderr
$ runit 2>/dev/null
stdout
$ runit 3>&2 2>&1 1>&3
stderr
stdout
$ (runit 3>&2 2>&1 1>&3) 1> /dev/null
stdout
$ (runit 3>&2 2>&1 1>&3) 2> /dev/null
stderr

We duplicate 2 to 3, then 1 to 2, then 3 to 1, and we have swapped stdin and stderr.

We can also pass non-standard file descriptors in to programs, though this doesn’t seem to be a technique used much:

#include <unistd.h>
int main()
{
  char buffer[256];
  ssize_t n;
  while ((n = read(3,buffer,sizeof(buffer))) > 0) {
    write(4,buffer,n);
  }
}

and do:

$ g++ -Wall cat34.cpp -o cat34
$ echo Hello World | ./cat34 3<&0 4>&1
Hello World

It’s interesting that this also works:

$ echo Hello World | ./cat34 3>&0 4<&1
Hello World

While we are this part of town, let’s talk briefly about named pipes, another feature that has been around for ever, but doesn’t seem to get used as much as it deserves. We can run in to problems though:

Suppose I want to capture an HTTP response from a web server, I can do this:

$ mkfifo f1
$ mkfifo f2
$ mkfifo f3
$ netcat -l 8888 <f1 >f2 &
$ netcat www.bbc.co.uk 80 <f2 >f3 &
$ tee foo.txt <f3 >f1 &

and try a download, but alas:

$ GET http://localhost:8888/ >/dev/null
Can't connect to localhost:8888 (Connection refused)

This doesn’t seem right, netcat should be listening on port 8888, I told it so myself! And checking with netstat shows no listener on 8888 and finally ps -aux shows no sign of any netcat processes – what is going on?

Once again, strace helps us see the true reality of things:

$ kill %1
$ strace netcat -l 8888 < f1 > f2

But strace tells us nothing – there is no output! Like the dog that didn’t bark in the night though, this is an important clue, and widening our area of investigation, we find:

$ strace -f -etrace=open,execve bash -c 'netcat -l 8888 < f1 > f2'
execve("/bin/bash", ["bash", "-c", "netcat -l 8888 < f1 > f2"], [/* 46 vars */]) = 0
Process 20621 attached
Process 20620 suspended
[pid 20621] open("f1", O_RDONLY ...

The shell is stalled trying to open the “f1” fifo, before it even gets around to starting the netcat program, which is why the first strace didn’t show anything. What we have forgotten is that opening a pipe blocks if there is no process with the other end open (it doesn’t have to actively reading or writing, it just has to be there). The shell handles redirections in the order they appear, so since our 3 processes are all opening their read fifo first, none have got around to opening their write fifo – we have deadlock, in fact, the classic dining philosophers problem, and a simple solution is for one philosopher to pick up the forks in a different order:

$ netcat -l 8888 >f2 <f1 &
$ netcat www.bbc.co.uk 80 <f2 >f3 &
$ tee foo.txt <f3 >f1 &
$ GET http://localhost:8888/ >/dev/null
$ cat foo.txt
HTTP/1.1 301 Moved Permanently
Server: Apache
...

We can, it should be noted, do this more easily with a normal pipeline and a single fifo, and avoid all these problems:

$ netcat www.bbc.co.uk 80 <f1 | tee foo.txt | netcat -l 8888 >f1

but that would be less fun and possibly less instructive.


tolower

In an Age of Highly Pipelined RISC Architectures, we might be tempted to write a basic ASCII lower-case function as:

int tolower1(int c)
{
  return c|-(c-0101U<032)&040;
}

and avoid any branching. We subtract ‘A’ from c and do an unsigned comparison with ‘Z’-‘A’, convert to a bitmask and use that to ‘or’ in the lower case bit where necessary.

A more quiche-eating approach would be the clear and obvious:

int tolower2(int c)
{
  if (c >='A' && c <= 'Z') return c+'a'-'A';
  else return c;
}

at which hard-nosed hacker types will sneer, however, an Age of Highly Pipelined RISC Architectures is also an Age of Conditional Move Instructions, and GCC generates the pleasantly concise:

tolower2:
	leal	-65(%rdi), %ecx
	leal	32(%rdi), %edx
	movl	%edi, %eax
	cmpl	$25, %ecx
	cmovbe	%edx, %eax
	ret

for our simple approach rather than the distinctly bloated:

tolower1:
	leal	-65(%rdi), %eax
	cmpl	$25, %eax
	setbe	%al
	movzbl	%al, %eax
	negl	%eax
	andl	$32, %eax
	orl	%edi, %eax
	ret

All is not lost though for the hard-nosed hacker, for this is also an Age of Vectorization and we can do something similar using SSE2 instructions to lower-case not one but 16 characters at a time:

#include <x86intrin.h>
void tolower128(char *p)
{
  __m128i v1 = _mm_loadu_si128((__m128i*)p);
  __m128i v2 = _mm_cmplt_epi8(v1,_mm_set1_epi8('A'));
  __m128i v3 = _mm_cmpgt_epi8(v1,_mm_set1_epi8('Z'));
  __m128i v4 = _mm_or_si128(v2,v3);
  __m128i v5 = _mm_andnot_si128(v4,_mm_set1_epi8(0x20));
  _mm_storeu_si128((__m128i*)p, _mm_or_si128(v1,v5));
}

Now everyone should be happy!


Markov Computation

Markov computation is an attractively simple definition of universal computability.

A program is a list of rewrite rules:

xxx => yyy

or a terminal rule:

xxx =>. yyy

where xxx and yyy are arbitrary strings over some alphabet.

A computation takes a string over that alphabet and repeatedly rewrites it until termination: each rewrite uses the first rule with a LHS occuring anywhere in the string, which is then replaced by the RHS of the rule. If the rule is terminal the computation stops, it also stops if no rule matches.

Since I had a need to brush up on my Perl recently, I came up with this:

#!/usr/bin/perl -w
use strict;

my $line = shift @ARGV;
my @rules;

while (<>) {
    if (/^\s*(\S*)\s*=>(\.?)\s*(\S*)/) {
        push @rules,[$1,length $1,$3,$2 ne ""];
    }
}

print "$line\n";
for (my $i = 0; $i < @rules; ) {
     my ($lhs,$len,$rhs,$term) = @{$rules[$i++]};
     my $ix = index $line,$lhs;
     if ($ix >= 0) {
        substr($line,$ix,$len) = $rhs;
        print "$line\n";
        $i = $term?@rules:0;
     }
}

The line to rewrite is given as a command line argument and the program is read from input (either stdin or from files given on the command line). We use some regexps to match the program lines and, while it’s tempting to use regexp rewriting for running the program, we want to avoid any string characters being interpreted specially – we can use quotemeta for this but here we just use the non-regexp string operators (note the slightly wacky Perl use of substr as an l-value).

Now we have our evaluator, let’s write some programs:

Here is a multiplication program:

$ cat mult.mkv
x1 => 1x
x| => |1
x => |1
+| => |
+1 => 1+x
1* => *+
*1 => *
*| =>

$ ./markov '111*11' mult.mkv
111*11
11*+11
11*1+x1
11*1+1x
11*1+1|1
11*11+x|1
11*11+|11
11*11|11
1*+11|11
1*1+x1|11
1*1+1x|11
1*1+1|111
1*11+x|111
1*11+|1111
1*11|1111
*+11|1111
*1+x1|1111
*1+1x|1111
*1+1|11111
*11+x|11111
*11+|111111
*11|111111
*1|111111
*|111111
111111

Add small Roman numerals:

$ cat roman.mkv
# Final change to subtractive form
@X => X@
@IIII =>. IV
@VIIII =>. IX
@ =>.
# Initial expansion of subtractive form
IV => IIII
IX => VIIII
# Simplification
IIIII => V
VV => X
# Rules for *
I*X => XI*
I*V => VI*
V*X => XV*
* =>
# Move from LHS to RHS
I+I => II
V+I => VI
V+V => X
X+ => X
I+ => +I*
V+ => +V*
# Go to finish
+ => @

$ ./markov I+II+III+IV+V+VI+VII+VIII+IX+X < roman.mkv
I+II+III+IV+V+VI+VII+VIII+IX+X
I+II+III+IIII+V+VI+VII+VIII+IX+X
I+II+III+IIII+V+VI+VII+VIII+VIIII+X
III+III+IIII+V+VI+VII+VIII+VIIII+X
IIIIII+IIII+V+VI+VII+VIII+VIIII+X
VI+IIII+V+VI+VII+VIII+VIIII+X
VIIIII+V+VI+VII+VIII+VIIII+X
VV+V+VI+VII+VIII+VIIII+X
X+V+VI+VII+VIII+VIIII+X
X+XI+VII+VIII+VIIII+X
XXI+VII+VIII+VIIII+X
XX+I*VII+VIII+VIIII+X
XX+VI*II+VIII+VIIII+X
XX+VIII+VIII+VIIII+X
XXVIII+VIII+VIIII+X
XXVII+I*VIII+VIIII+X
XXVII+VI*III+VIIII+X
XXVII+VIIII+VIIII+X
XXVI+I*VIIII+VIIII+X
XXVI+VI*IIII+VIIII+X
XXVI+VIIIII+VIIII+X
XXVI+VV+VIIII+X
XXVI+X+VIIII+X
XXVI+XVIIII+X
XXV+I*XVIIII+X
XXV+XI*VIIII+X
XXV+XVI*IIII+X
XXV+XVIIIII+X
XXV+XVV+X
XXV+XX+X
XXV+XXX
XX+V*XXX
XX+XV*XX
XX+XXV*X
XX+XXXV*
XX+XXXV
XXXXXV

And here is one for checking the Collatz conjecture:

$ cat collatz.mkv
>11 => 1>
>1 => <111
> =>
1< => <111111
< => 1
11 => >11

$ ./markov 11111 collatz.mkv
11111
>11111
1>111
11>1
11<111
1<111111111
1111111111111111
1>11111111111111
11>111111111111
111>1111111111
1111>11111111
11111>111111
111111>1111
1111111>11
11111111>
11111111
>11111111
1>111111
11>1111
111>11
1111>
1111
>1111
1>11
11>
11
>11
1>
1

See the 1st edition of Mendelson’s excellent book on Mathematical Logic for a full treatment, including a demonstration of equivalence with other notions of computability.


IPv6 TUN reflector

We had a look as a simple network simulation using TUN a couple of posts ago: https://matthewarcus.wordpress.com/2013/05/18/fun-with-tun/.

Let’s now have a look at getting it all working with IPv6. Changing the address swapping code is fairly straightforward, and for some extra points we’ll add a facility for printing out the source and destination addresses of each packet forwarded, the correct function for doing this is now inet_ntop, which works for both v4 and v6 addresses.

Here are the main changes [see https://github.com/matthewarcus/stuff/tree/master/tun for full code]:

It’s convenient to define a 32-bit swap function:

void swap32(uint8_t *p, uint8_t *q)
{
  uint32_t t = get32(p);
  put32(p,get32(q));
  put32(q,t);
}

and our main function is now:

#define SRC_OFFSET4 12
#define DST_OFFSET4 16
#define SRC_OFFSET6 8
#define DST_OFFSET6 24

void reflect(uint8_t *p, size_t nbytes)
{
  uint8_t version = p[0] >> 4;
  switch (version) {
  case 4:
    if (verbosity > 0) {
      char fromaddr[INET_ADDRSTRLEN];
      char toaddr[INET_ADDRSTRLEN];
      inet_ntop(AF_INET, p+SRC_OFFSET4, fromaddr, sizeof(fromaddr));
      inet_ntop(AF_INET, p+DST_OFFSET4, toaddr, sizeof(toaddr));
      printf("%zu: %s->%s\n", nbytes, fromaddr, toaddr);
    }
    // Swap source and dest of an IPv4 packet
    // No checksum recalculation is necessary
    swap32(p+SRC_OFFSET4,p+DST_OFFSET4);
    break;
  case 6:
    if (verbosity > 0) {
      char fromaddr[INET6_ADDRSTRLEN];
      char toaddr[INET6_ADDRSTRLEN];
      inet_ntop(AF_INET6, p+SRC_OFFSET6, fromaddr, sizeof(fromaddr));
      inet_ntop(AF_INET6, p+DST_OFFSET6, toaddr, sizeof(toaddr));
      printf("%zu: %s->%s\n", nbytes, fromaddr, toaddr);
    }
    // Swap source and dest of an IPv6 packet
    // No checksum recalculation is necessary
    for (int i = 0; i < 4; i++) {
      swap32(p+SRC_OFFSET6+4*i,p+DST_OFFSET6+4*i);
    }
    break;
  default:
    fprintf(stderr, "Unknown protocol %u\n", version);
    exit(0);
  }
}

Setting up the the v6 addresses for our new interface is a little different. As before, we bring the interface up:

$ ip link set tun0 up

Now, we need to add a link-local address, mandatory for all IPv6 interfaces:

$ ip -6 addr add fe80::1/64 dev tun0

We can use ping6 to try this out, localizing the request to the tun0 interface:

$ ping6 -I tun0 fe80::a617:31ff:fe5a:334f
PING fe80::a617:31ff:fe5a:334f(fe80::a617:31ff:fe5a:334f) from fe80::1 tun0: 56 data bytes
64 bytes from fe80::a617:31ff:fe5a:334f: icmp_seq=1 ttl=64 time=0.164 ms
...

This address is also the link-local address of my Wifi interface, but there is no ambiguity as we must specify which interface to use.

We can also add a private network address. IPv6 does not have the same concept of a private network as IPv4, instead we define Unique Local Addresses: append 0xfd to a random 10 digit hex global id and add an arbitrary 4 digit subnet identifier. Any random global id is fine – the idea is to ensure that any given network will have a different id from any other private network it is likely to come in contact with – we don’t need to worry about true global uniqueness though the Birthday Paradox tells us that we are likely to have a potential conflict with only about a million private networks (there might be lots of people out there with the same name and birthday as you, but you are unlikely to meet one of them at random).

We can generate our own random address, for example, using the method described in RFC4193, or use /dev/random:

$ hexdump -v -e '/1 "%02x"' -n 5 /dev/urandom; echo
2acd2c8bc4

or just copy a random sequence from somewhere on the Internet, for example, the one used here:

$ ip -6 route add fd2a:cd2c:8bc4:0::/64 dev tun0

This adds a local network with a global id of 2acd2c8bc4 and a subnet id of 0.

We can also define a larger subnet:

$ ip -6 route add fd2a:cd2c:8bc4:1100::/56 dev tun0

Now traffic to any IPv6 address of form fd2a:cd2c:8bc4:11xx:… will be sent to our TUN device:

$ ping6 fd2a:cd2c:8bc4:11ff::23
PING fd2a:cd2c:8bc4:11ff::23(fd2a:cd2c:8bc4:11ff::23) 56 data bytes
64 bytes from fd2a:cd2c:8bc4:11ff::23: icmp_seq=1 ttl=64 time=0.110 ms
...

Indeed, we can define all subnets for another global id:

$ hexdump -e '/1 "%02x"' -n 5 /dev/urandom; echo
40bd2f7ba0
$ sudo ip -6 route add fd40:bd2f:7ba0::/48 dev tun0

Just for interest, here’s our entire IPv6 routing table:

$ route -A inet6
Kernel IPv6 routing table
Destination Next Hop Flag Met Ref Use If
fd2a:cd2c:8bc4::/64 :: U 1024 0 0 tun0
fd2a:cd2c:8bc4:1100::/56 :: U 1024 0 0 tun0
fd40:bd2f:7ba0::/48 :: U 1024 0 0 tun0
fe80::/64 :: U 256 0 0 wlan0
fe80::/64 :: U 256 0 0 tun0
::/0 :: !n -1 1 524 lo
::1/128 :: Un 0 1 35 lo
fe80::1/128 :: Un 0 1 10 lo
fe80::a617:31ff:fe5a:334f/128 :: Un 0 1 7 lo
ff00::/8 :: U 256 0 0 wlan0
ff00::/8 :: U 256 0 0 tun0
::/0 :: !n -1 1 524 lo

Finally, to set up a simple service to use IPv6:

In one terminal:

$ nc -l -6 9901
...

In another:

$ nc -6 fd2a:cd2c:8bc4:11ff::23 9901
...

and our logging now looks like this:

$ ./reflect -v
Capability CAP_NET_ADMIN: 1 0 1
Created tun device tun0
48: fe80::1->ff02::2
72: fe80::1->fd2a:cd2c:8bc4:11ff::23
72: fe80::1->fd2a:cd2c:8bc4:11ff::23
72: fe80::1->fd2a:cd2c:8bc4:11ff::23
72: fe80::1->fd2a:cd2c:8bc4:11ff::23
48: fe80::1->ff02::2
48: fe80::1->ff02::2
80: fe80::1->fd2a:cd2c:8bc4:11ff::23
80: fe80::1->fd2a:cd2c:8bc4:11ff::23
72: fe80::1->fd2a:cd2c:8bc4:11ff::23
79: fe80::1->fd2a:cd2c:8bc4:11ff::23
72: fe80::1->fd2a:cd2c:8bc4:11ff::23
72: fe80::1->fd2a:cd2c:8bc4:11ff::23
72: fe80::1->fd2a:cd2c:8bc4:11ff::23
72: fe80::1->fd2a:cd2c:8bc4:11ff::23

Those ff02::2 addresses are for IPv6 router discovery. The rest are two TCP flows, one in each direction (we can get more detail from Wireshark, in particular, the relevant port numbers, but this gives the general idea).


Fun with Flora

It’s been a while since I’ve played with AVR microcontrollers:

http://www.matthewarcus.org.uk/avr-ntp/

but I couldn’t resist getting myself an Adafruit Flora:

http://www.adafruit.com/products/659

The tutorials etc. on the Adafruit website mainly cover using Windows and Apple development platforms, though there is much useful information in their forums on using Linux – I suppose we are meant to be able to work these things out for ourselves (actually, I suspect the real reason is that there is too much variation across different distributions to be able to give a single consistent set of instructions). Anyway, this is what worked for me to get everything working nicely on Ubuntu 12.10.

Firstly, set up your IDE. The standard 1.0.4 installation is what is recommended by Adafruit, so we’ll start there (see later for 1.5.2). We need a Flora entry in the boards.txt file, and a board specific header file, to go in variants/flora, telling the compiler where the pins etc. are. These files can be extracted from the “official” Adafruit Arduino IDE for the Flora, or the excellent kkolbo has put them in a handy zip file, referenced here:

http://adafruit.com/forums/viewtopic.php?f=22&t=39012

Now we have a Flora entry on the Tools/Board menu and we can try downloading a “sketch”, as Arduino programs apparently are called. My Flora already has a blinkenlights program on it, so we’ll be wanting to change the timing or something so we can see that we really are running a new program. So, here we have the “Hello, world” of the embedded world (sketch taken directly from Adafruit tutorial):

// Pin D7 has an LED connected on FLORA.
// give it a name:
int led = 7;

// the setup routine runs once when you press reset:
void setup() {                
  // initialize the digital pin as an output.
  pinMode(led, OUTPUT);     
}

// the loop routine runs over and over again forever:
void loop() {
  digitalWrite(led, HIGH);   // turn the LED on
  delay(1500);
  digitalWrite(led, LOW);    // turn the LED off
  delay(500);
}

Nice, a little way from a Stratum-1 NTP server, but it’s a start.

To get our sketch running on the Flora, we must bear some things in mind. First thing (this held me up for a while), you upload with “Upload” and not with “Upload Using Programmer” – I suppose you only do that if you have an actual Programmer, pretty obvious really, secondly there is a “modem-manager” program, part of the Linux network manager that sometimes does strange things to the USB-pretending-to-be-a-serial-port that we are attempting to program over, and thirdly, you must, of course, have the right permissions to access said USB/serial device – for a serial over USB link, you will be looking for a device called /dev/ttyACMx and this should appear in the “Tools/Serial Ports” menu item – if no port is shown there but a suitable device exists in /dev then you probably have a permissions problem. Make sure you have the port selected too, with a tick next it. You can run the IDE as root or fiddle with group membership, but the proper way of dealing with this is with the udev device manager (which also allows us to tell modem-manager to keep its hands off).

Like the Leonardo, the Flora has one USB port, used both by the program, and for programming and it seems that the device reports different USB product ids depending on what mode it’s in – just after a reset, it reports product id 0004, then after going to “normal” mode, the product id becomes 8004 (and some of the USB properties change), and we need to set udev rules for both. (“lsusb” and “udevadm info -a -n /dev/ttyACMx” are useful here to find out what is going on).

The upshot is that we need two udev lines, for the different states of the USB connection and the different idProducts that are reported:

SUBSYSTEMS=="usb", ATTRS{idVendor}=="239a", ATTRS{idProduct}=="0004", MODE="0660", GROUP="plugdev", ENV{ID_MM_DEVICE_IGNORE}="1"
SUBSYSTEMS=="usb", ATTRS{idVendor}=="239a", ATTRS{idProduct}=="8004", MODE="0660", GROUP="plugdev", ENV{ID_MM_DEVICE_IGNORE}="1"

The group setting ensures normal users (which Ubuntu adds to the plugdev group) can access the port, and the ID_MM_DEVICE_IGNORE tells modem-manager to ignore this device, so we no longer need to disable modem-manager while programming. There are some suggestions on the internet of using a SYMLINK+=”ttyACM%n” entry too, but that doesn’t seem to be necessary for our purposes (actually, adding “SYMLINK+=ttyUSB0” gives us a means of adding a consistent device name that is recognized by the Arduino IDE, but doesn’t clash with the kernel assigned ACM names). Add these lines into, say, /etc/udev/rules/90-flora.rules, restart udev with service udev restart and reconnect or reset your Flora. You should see something like:

$ ls -l /dev/ttyACM*
crw-rw---- 1 root plugdev 166, 1 May 18 15:59 /dev/ttyACM1

and programming should now work just fine as a normal user (assuming they are in the plugdev group). Sometimes the soft reset that the programmer does to initiate upload doesn’t work and you have to do a hard reset, but that seems to be a standard problem with Leonardo/Flora. Most of the time it seems to work just fine with the software reset.

A couple of observations: when programming, the software reset of the serial port fails if the port isn’t selected in the Serial Port menu, though programming after a hard reset will still suceed. Also sometimes if the upload fails, the avrdude process isn’t shut down properly and keeps the serial port device open, so when the USB connection is reestablished, the device minor number goes up (ie. we now have /dev/ttyACM1 rather than /dev/ttyACM0). If this happens, “killall avrdude” will restore the status quo.

Finally, I had originally tried using an Arduino 1.5.2 installation I set up when I was thinking of having a play with the ARM-based Arduino Due. Initially this didn’t work too well for the Flora, but after sorting out the USB device problems as above and copying the relevant seeming lines from the Leonardo section of the boards.txt file to the Flora section (and copying the variants/flora directory), it was just fine:

flora.name=Adafruit Flora
flora.upload.tool=avrdude
flora.upload.protocol=avr109
flora.upload.maximum_size=28672
flora.upload.speed=57600
flora.upload.disable_flushing=true
flora.upload.use_1200bps_touch=true
flora.upload.wait_for_upload_port=true
flora.bootloader.tool=avrdude
flora.bootloader.low_fuses=0xff
flora.bootloader.high_fuses=0xd8
flora.bootloader.extended_fuses=0xcb
flora.bootloader.path=caterina
flora.bootloader.file=Caterina-Flora.hex
flora.bootloader.unlock_bits=0x3F
flora.bootloader.lock_bits=0x2F
flora.build.mcu=atmega32u4
flora.build.f_cpu=8000000L
flora.build.vid=0x239A
flora.build.pid=0x8004
flora.build.core=arduino
flora.build.variant=flora
flora.build.extra_flags=-DUSB_VID={build.vid} -DUSB_PID={build.pid}

(I’m not sure that I’ve got that bootloader, but I don’t need it at the moment).

Anyway, a nice bit of kit, good work from Limor Fried and all at Adafruit, and now I can make the LED flash just as I like, time to think of some more interesting applications. Watch this space…


Fun with TUN

TUN devices are much used for virtualization, VPNs, network testing programs, etc. A TUN device essentially is a network interface that also exists as a user space file descriptor, data sent to the interface can be read from the file descriptor, and data written to the file descriptor emerges from the network interface.

Here’s a simple example of their use. We create a TUN device that simulates an entire network, with traffic to each network address just routed back to the original host.

For a complete program, see:

https://github.com/matthewarcus/stuff/blob/master/tun/reflect.cpp

First create your TUN device, this is fairly standard, most public code seems to be derived from Maxim Krasnyansky’s:

https://www.kernel.org/doc/Documentation/networking/tuntap.txt

and our code is no different:

int tun_alloc(char *dev) 
{
  assert(dev != NULL);
  int fd = open("/dev/net/tun", O_RDWR);
  CHECKFD(fd);

  struct ifreq ifr; 
  memset(&ifr, 0, sizeof(ifr)); 
  ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
  strncpy(ifr.ifr_name, dev, IFNAMSIZ); 
  CHECKSYS(ioctl(fd, TUNSETIFF, (void *) &ifr));
  strncpy(dev, ifr.ifr_name, IFNAMSIZ); 
  return fd;
}

We want a TUN device (rather than TAP, essentially the same thing but at the ethernet level) and we don’t want packet information at the moment. We copy the name of the allocated device to the char array given as a parameter.

Now all our program needs to do is create the TUN device and sit in a loop copying packets:

int main(int argc, char *argv[])
{
  char dev[IFNAMSIZ+1];
  memset(dev,0,sizeof(dev));
  if (argc > 1) strncpy(dev,argv[1],sizeof(dev)-1);

  // Allocate the tun device
  int fd = tun_alloc(dev);
  if (fd < 0) exit(0);

  uint8_t buf[2048];
  while(true) {
    // Sit in a loop, read a packet from fd, reflect
    // addresses and write back to fd.
    ssize_t nread = read(fd,buf,sizeof(buf));
    CHECK(nread >= 0);
    if (nread == 0) break;
    reflect(buf,nread);
    ssize_t nwrite = write(fd,buf,nread);
    CHECK(nwrite == nread);
  }
}

The TUN mechanism ensures that we get exactly one packet for each read, we don’t need to worry about fragmentation, and we just send each packet back with the source and destination IPs swapped:

static inline void put32(uint8_t *p, size_t offset, uint32_t n)
{
  memcpy(p+offset,&n,sizeof(n));
}

static inline uint32_t get32(uint8_t *p, size_t offset)
{
  uint32_t n;
  memcpy(&n,p+offset,sizeof(n));
  return n;
}

void reflect(uint8_t *p, size_t nbytes)
{
  (void)nbytes;
  uint8_t version = p[0] >> 4;
  switch (version) {
  case 4:
    break;
  case 6:
    fprintf(stderr, "IPv6 not implemented yet\n");
    exit(0);
  default:
    fprintf(stderr, "Unknown protocol %u\n", version);
    exit(0);
  }
  uint32_t src = get32(p,12);
  uint32_t dst = get32(p,16);
  put32(p,12,dst);
  put32(p,16,src);
}

We don’t need to recalculate the header checksum as it doesn’t get changed by just swapping two 32 bit segments.

Handling IPV6 is left as an exercise for the reader (we just need to use a different offset and address size I think).

In this day and age, security should be prominent in our minds, particularly for long-running programs like our TUN server, so for extra points, let’s add in some capability processing.

(You might need to install a libcap-dev package for this to work, for example, with “sudo apt-get install libcap-dev” and link with -lcap).

Once we have started up, we should check if we have the required capability, we just require CAP_NET_ADMIN to be permitted:

  cap_t caps = cap_get_proc();
  CHECK(caps != NULL);

  cap_value_t cap = CAP_NET_ADMIN;
  const char *capname = STRING(CAP_NET_ADMIN);

  cap_flag_value_t cap_permitted;
  CHECKSYS(cap_get_flag(caps, cap,
                        CAP_PERMITTED, &cap_permitted));
  if (!cap_permitted) {
    fprintf(stderr, "%s not permitted, exiting\n", capname);
    exit(0);
  }

and then make effective what we require:

  CHECKSYS(cap_clear(caps));
  CHECKSYS(cap_set_flag(caps, CAP_PERMITTED, 1, &cap, CAP_SET));
  CHECKSYS(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap, CAP_SET));
  CHECKSYS(cap_set_proc(caps));

Finally, after creating our TUN object, before entering our main loop, we can relinquish our extra privileges altogether:

  CHECKSYS(cap_clear(caps));
  CHECKSYS(cap_set_proc(caps));
  CHECKSYS(cap_free(caps));

For completeness, here are the error checking macros used above:

#define CHECKAUX(e,s)                            \
 ((e)? \
  (void)0: \
  (fprintf(stderr, "'%s' failed at %s:%d - %s\n", \
           s, __FILE__, __LINE__,strerror(errno)), \
   exit(0)))
#define CHECK(e) (CHECKAUX(e,#e))
#define CHECKSYS(e) (CHECKAUX((e)==0,#e))
#define CHECKFD(e) (CHECKAUX((e)>=0,#e))
#define STRING(e) #e

Of course, production code will want to do something more sophisticated than calling exit(0) when an error occurs…

To use, compile for example with:

g++ -W -Wall -O3 reflect.cpp -lcap -o reflect

We can set permissions for our new executable to include the relevant capability, so we don’t need to start it as root:

$ sudo setcap cap_net_admin+ep ./reflect

Actually start it:

$ ./reflect&
Capability CAP_NET_ADMIN: 1 0 1
Created tun device tun0

We now have an interface, but it isn’t configured:

$ ifconfig tun0
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
POINTOPOINT NOARP MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

With the interface running, set up networking:

$ sudo ip link set tun0 up
$ sudo ip addr add 10.0.0.1/8 dev tun0

Check all is well:

$ ifconfig tun0
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.0.0.1 P-t-P:10.0.0.1 Mask:255.0.0.0
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:500
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

And try it out:

$ ping -c 1 10.0.0.41
PING 10.0.0.41 (10.0.0.41) 56(84) bytes of data.
64 bytes from 10.0.0.41: icmp_req=1 ttl=64 time=0.052 ms

--- 10.0.0.41 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.052/0.052/0.052/0.000 ms

Let’s check performance, firstly, a flood ping on the loopback device:

$ sudo ping -f -c10000 -s1500 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 1500(1528) bytes of data.

--- 127.0.0.1 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 778ms
rtt min/avg/max/mdev = 0.003/0.006/0.044/0.002 ms, pipe 2, ipg/ewma 0.077/0.006 ms

compared to one through the TUN connection:

$ sudo ping -f -c10000 -s1500 10.0.0.100
PING 10.0.0.100 (10.0.0.100) 1500(1528) bytes of data.

--- 10.0.0.100 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 945ms
rtt min/avg/max/mdev = 0.022/0.032/3.775/0.038 ms, pipe 2, ipg/ewma 0.094/0.032 ms

Respectable. We have got ourselves a network!


Testing for Triangularity

A company I used to work for had a standard interview problem of writing a short function that would take 3 numbers (usually assumed to be integers) and determine if they formed a valid triangle. There was much discussion about the best/fastest/clearest/most elegant way of doing this, but I don’t remember the topic of what to do if the numbers were floating point coming up.

Curiously, being interviewed myself at another company not long ago I was asked this very question and wasn’t quite sure what the best thing to do was – for integers, my preferred solution is to sort the 3 numbers as a >= b >= c, then we have a triangle just when c > 0 and a < b+c (or
a-b < c if we are worried about overflow). What about if they are floating point numbers? Clearly if b >> c, then b+c might lose precision, so, for example, we might have c > 0 and b = a, so it’s valid, but if b+c = b, then our test fails.

We could try the same trick as for integers, testing for a-b < c – subtracting the two largest should minimize the rounding error; this seems plausible, but can we be sure we won’t get it wrong?

An earlier post looked at the sequence of integers:

0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 18 20 22 24 26 28 30
32 36 40 44 48 52 56 60
64 72 80 88 96 104 112 120
...

which are really just thinly disguised (non-negative) floating point numbers. An important feature, obvious here and easy to prove of “real” FP numbers, is that, apart from the first two rows, each row is the double of the row above it, therefore, 2n is representable exactly iff n is. Furthermore, it’s pretty obvious that if a and b are representable, with a/2 <= b <= a, then a-b is representable (and if a and b are numbers in the first two rows, then all differences are representable, so we don’t need to worry about them).

Now let’s return to our triangle: we have a >= b >= c > 0 and clearly a,b,c must be representable. If all the numbers are in the first two rows, then all operations are exact and we get the right answer. Otherwise, a is even: first, consider the case b >= a/2, as above, this implies that a-b is representable and the comparison with c must be exact, so our test is correct. Otherwise, if b < a/2, then since c <= b, c + b <= 2b < a, so we don’t have a triangle, and, assuming that rounding is monotonic (ie. x <= y => round(x) <= round(y), or equivalently, x < y => round(x) <= round(y)), our check works out correct too:

b < a/2, so b+c < a (exact) so c < a-b (exact) so c = round(c) <= round(a-b)

QED

As usual for this blog, we are merely paddling at the shore of a much deeper sea. For much more on triangles, floating point approximations etc, see Kahan’s paper:

http://wwwp.cs.unc.edu/~snoeyink/c/c205/Triangle.pdf

(or indeed anything else by the great man:

http://www.cs.berkeley.edu/~wkahan/

http://owpdb.mfo.de/detail?photo_id=5350).


Subbag Enumeration

A bag or multiset is just a map from some base set X to the natural numbers, so we can represent such a bag by a sequence of (non-negative) integers. Subbags are defined in the obvious way (and if s is subbag of t, t is a superbag of s). Given a subbag s of m, we can find the next subbag (in reverse lexicographic order) with this function:

bool nextbag(vector<int> &s, 
             const vector<int> &m,
             size_t start = 0)
{
  for (size_t i = start; i < s.size(); i++) {
    if (s[i] == m[i]) {
      s[i] = 0;
    } else {
      s[i]++;
      return true;
    }
  }
  return false;
}

This sets the subsequence [start..] of s to the (reverse) lexicographically next sequence. m gives maximum values for each field of s and we return false if we have cycled back round to the zero sequence. Note that if m[i] < 0 then effectively there is no maximum value for each element.

We might also want to skip from s to the next subbag that is not a superbag of s. This is accomplished by:

bool nextnonsuper(vector<int> &s, const vector<int> &m)
{
  for (size_t i = 0; i < s.size(); i++) {
    if (s[i] != 0) {
      s[i] = 0;
      return nextbag(s,m,i+1);
    }
  }
  return false;
}

Here we see why we wanted the start parameter in nextbag. For example, we go from:

[0,0,0,3,4,...] => [0,0,0,0,5,...]

Given f: bag -> A where A is an ordered set and where f is monotonic (if s is a subbag of t, then f(s) <= f(t)), if we want to find all (sub)bags where f(s) <= x, we can use nextnonsuper to skip some of the enumeration:

while (true) {
  if (f(s) >= t) s = nextnonsuper(s)
  else s = nextbag(s)
  ...
}

For an example, we can use a weighted sum over the bag elements (we will assume that the weights are positive).

int weights(const vector<int> &s, const vector<int> &w)
{
  int total = 0;
  for (size_t i = 0; i < s.size(); i++) {
    total += s[i]*w[i];
  }
  return total;
}

and we can add some boiler plate to make the whole thing into a nice standalone application:

// Print out our vector, include weights if required
// Optionally print in reverse order
void print(const vector<int> &s, 
           const vector<int> *w = NULL, 
           bool printreverse = false)
{
  int sz = s.size();
  for (int i = 0; i < sz; i++) {
    int index = printreverse?sz-i-1:i;
    cout << s[index];
    if (w != NULL) cout << ":" << (*w)[index];
    cout << (i < sz-1?" ":"\n");
  }
}

int main(int argc, char *argv[])
{
  int target;    // The sum we are aiming for
  vector<int> s; // Where we build the sequence
  vector<int> m; // The number of each items available
  vector<int> w; // The weight of each item
  bool printreverse = false; // Print sequences in reverse
  bool printweight  = false; // Print the weights as well
  bool printless    = false; // Print sequences less than target
  bool printgreater = false; // Print those greater than target

  // Crude but effective argument processing
  while (argv[1][0] == '-') {
    if (strcmp(argv[1],"-r") == 0) {
      printreverse = true;
      argv++; argc--;
    } else if (strcmp(argv[1],"-l") == 0) {
      printless = true;
      argv++; argc--;
    } else if (strcmp(argv[1],"-g") == 0) {
      printgreater = true;
      argv++; argc--;
    } else if (strcmp(argv[1],"-w") == 0) {
      printweight = true;
      argv++; argc--;
    } else {
      cerr << "Invalid flag: " << argv[1] << endl;
      exit(0);
    }
  }
  // Read target
  // target = 0 means no target
  if (sscanf(argv[1],"%d",&target) != 1) {
    cerr << "Invalid target: " << argv[1] << endl;
    exit(0);
  }
  // Read count/weight pairs
  for (int i = 2; i < argc; i++) {
    int a,b;
    if (sscanf(argv[i],"%d:%d",&a,&b) != 2) {
      cerr << "Invalid arg: " << argv[i] << endl;
      exit(0);
    }
    m.push_back(a);
    w.push_back(b);
  }
  // Initialize result sequence.
  s.resize(m.size());
  while(true) {
    int total = weights(s,w);
    // Print result if desired
    if (target == 0 || 
        total == target || 
        (total < target && printless) ||
        (total > target && printgreater)) {
      print(s, printweight?&w:NULL, printreverse);
    }
    if (target > 0 && total >= target) {
      // Have hit or exceeded target, so skip superbags
      if (!nextnonsuper(s,m)) break;
    } else {
      if (!nextbag(s,m)) break;
    }
  }
}

Compiling all this as bagenum.cpp, we can now do some enumerations. All the ways to make 5 from the bag {1,1,1,2,2,3}:

$ ./bagenum 5 3:1 2:2 1:3
3 1 0
1 2 0
2 0 1
0 1 1

We can specify a target of 0 to get all the subbags of the input, and the -w flag prints the weight out as well:

$ ./bagenum -w 0 3:1 2:2
0:1 0:2
1:1 0:2
2:1 0:2
3:1 0:2
0:1 1:2
1:1 1:2
2:1 1:2
3:1 1:2
0:1 2:2
1:1 2:2
2:1 2:2
3:1 2:2

We can enumerate bitstrings with a given number of bits set

$ ./bagenum 2 1:1 1:1 1:1 1:1
1 1 0 0
1 0 1 0
0 1 1 0
1 0 0 1
0 1 0 1
0 0 1 1

Or with at most a given number set:

$ ./bagenum -l 2 1:1 1:1 1:1 1:1
0 0 0 0
1 0 0 0
0 1 0 0
1 1 0 0
0 0 1 0
1 0 1 0
0 1 1 0
0 0 0 1
1 0 0 1
0 1 0 1
0 0 1 1

Or the ways of rolling 4 with 3 dice numbered 0-5 – much the same as rolling 7 with 3 normal dice (the -r parameter prints the results in reverse order, ie. normal lexicographic order):

$ ./bagenum -r 4 5:1 5:1 5:1
0 0 4
0 1 3
0 2 2
0 3 1
0 4 0
1 0 3
1 1 2
1 2 1
1 3 0
2 0 2
2 1 1
2 2 0
3 0 1
3 1 0
4 0 0

All this works quite nicely with unlimited multiplicities, provided we also set an upper bound, so we can enumerate the solutions to the classic Polya problem of the ways to make 50c with usual US coins (50 as it happens):

$ ./bagenum -w 50 -1:1 -1:5 -1:10 -1:25 -1:50
50:1 0:5 0:10 0:25 0:50
45:1 1:5 0:10 0:25 0:50
40:1 2:5 0:10 0:25 0:50
35:1 3:5 0:10 0:25 0:50
30:1 4:5 0:10 0:25 0:50
25:1 5:5 0:10 0:25 0:50
20:1 6:5 0:10 0:25 0:50
15:1 7:5 0:10 0:25 0:50
10:1 8:5 0:10 0:25 0:50
5:1 9:5 0:10 0:25 0:50
0:1 10:5 0:10 0:25 0:50
40:1 0:5 1:10 0:25 0:50
35:1 1:5 1:10 0:25 0:50
...

Finally, the -g flag prints the subbags found that are greater than the target – but only the ones all of whose subbags are less than the target (if you see what I mean):

./bagenum -g -w 8 1:2 1:3 1:4 1:5
1:2 1:3 1:4 0:5
0:2 1:3 0:4 1:5
0:2 0:3 1:4 1:5

(So the complement of subbags are the ones whose sum is less than or equal to a target but every superbag is greater, as in the classic knapsack problem).

Once again, this works fine with unlimited multiplicities:

$ ./bagenum -g -w 8 -1:2 -1:3 -1:4 -1:5
4:2 0:3 0:4 0:5
3:2 1:3 0:4 0:5
1:2 2:3 0:4 0:5
0:2 3:3 0:4 0:5
2:2 0:3 1:4 0:5
1:2 1:3 1:4 0:5
0:2 2:3 1:4 0:5
0:2 0:3 2:4 0:5
2:2 0:3 0:4 1:5
0:2 1:3 0:4 1:5
0:2 0:3 1:4 1:5
0:2 0:3 0:4 2:5

All in all, a nice little combinatorial Swiss Army knife and useful when trying to understand the stuff in:

http://www.amazon.co.uk/The-Computer-Programming-Volume-Fascicle/dp/0201853949/

for example (I like the TAOCP fascicles, they are much more convenient to read in the bar, in the bath or on the bus). This one has lots more about this sort of problem: special cases with more efficient solutions, as well as more general algorithms for more general combinatorial problems.