Posted on by Chris Warburton

Here' an old blog post I found floating about unpublished:

For the past few years I've been fascinated by Algorithmic Information Theory, which is a generalisation of Shannon's Information Theory with rather profound results.

Information Theory

Claude Shannon is famous for two things. In his Masters thesis, generally regarded as the most influential Masters thesis ever written, he showed that binary arithmetic, and the true/false logic of Boole, can be represented as electronic circuits containing switches. This could be seen as the beginning of digital electronics.

His other great work kick started digital communications. Shannon was interested in how to send messages, for example over a telegraph wire. He defined the mathematical quantity of information, along with its fundamental unit the bit. His key insight was that the content of a message doesn't matter; only the information of the content needs to be sent. One way to think about information is as the answers to yes/no questions, like in the game 'twenty questions'. The distinction between data (the content of a message) and information (what needs to be transmitted) is that data can be answers to any questions at all, whereas information answers the minimum number of questions needed.

For example, let's say I tell you that I can't remember my bank balance, but I know it's either £1000 or £1500. I then go off to an ATM and find out that it's £1500; delighted, I want to tell you. How can we phrase the message in binary zeros and ones? Well we can write the number 1500 in binary, which is 10111011100; at 11 bits long, this seems a bit long. We can see why if we think of the bits as answers to yes/no questions: if you sum up some numbers to get the result, we're sending answers to the questions: - Do I need to include 1024 in the sum? - Do I need to include 512 in the sum? - Do I need to include 256 in the sum? - Do I need to include 128 in the sum? - Do I need to include 64 in the sum? - Do I need to include 32 in the sum? - Do I need to include 16 in the sum? - Do I need to include 8 in the sum? - Do I need to include 4 in the sum? - Do I need to include 2 in the sum? - Do I need to include 1 in the sum?

We get 11 bits because we're answering 11 questions! These 11 questions are a pretty good way to send any number between 0 and 2047, but that's overkill for what we need. How about we just use the following question: - Is it £1000?

We can answer this with 1 bit, because all we need to send is a 0.

Notice that we didn't actually need to include the number 1500 in our message; we just needed to agree on the question, known as the protocol, which could equally well have been "Is it £1500" (making our message a 1). Shannon showed that if both ends agree on a protocol of possible messages, the information that is sent just needs to tell the true message apart from the other possibilities. The definition of a bit is the amount of information required to tell apart two equally likely possibilities. In this example we had two possibilities and thus we sent one bit. However, there may be some redundancy, even here. For example, let's say that I have a tendency to make conservative estimates, in fact I'm three times as likely to underestimate than overestimate (75%/25%). In this case £1500 is more likely than £1000, I may just be erring on the side of caution by thinking it's lower.

Whilst we can't send less than one bit in a message, this bias does allow us to make more ingenious question schemes if we do it multiple times. If I tell you that I checked my balance once a week for a month, and it started at either £100 or £200, then went up to either £600 or £900, went up more to either £1000 or £1500, then finished at either £2000 or £2500. I then go to get a bank statement to check. I can send the real values as 4 bits: one to determine each measurement (let's say 0 for the lower, 1 for the higher). Instead, since we know I'll probably be underestimating 3 and overestimating 1, I can tell you which one I was overestimating. I can do this in 2 bits, for example "Is it an odd week?" and "Is it the earlier of the two?"; thus week 3 becomes the bits 10. Using 2 bits is much more optimal, since each week is equally likely and there are 2 pairs of weeks.

The requirement for a common protocol may seem restrictive, but it could easily contain, for example, all characters used in the English language; or a high-definition video signal. Shannon then went on to show that not only can we remove redundant bits quite easily if we know the probabilities, but we can also add redundant extra bits in a systematic way to overcome unreliable, noisy communication channels. For example if we want to send a 100 bit message, and there's a 1% chance of each bit coming out wrong at the other end, then we can send a 101 bit message to take into account that one bit will probably go wrong and will need resending. Shannon showed how the probability of an error can be made arbitrarily small in a systematic way; messages take more bits, but they're more robust to failure.

Algorithmic Information Theory

Shannon's theory is all about averages and probabilities. Each message is independent and identically distributed (IID); the best way to compress a message into a few bits is to tune your protocol to match the situation as well as possible.

Algorithmic Information Theory takes a different approach. It chooses a single, fixed protocol with equally-likely zeros and ones (ie. optimum for evenly distributed choices, like the results of tossing a coin). When a message is received, the bits are fed directly into a computer, using a machine language which allows any combination of zeros and ones, and is "self-delimiting" (ie. it contains an "end of file" sequence which stops the machine reading any more bits). The computer's output is taken to be the original message.

With the protocol fixed, the question of optimal, short messages (ie. information, rather than data) becomes the question of how we can make the programs we send as small as possible. In other words, how much can we compress our message, if we don't care about decompression time?

This has been tackled by people like Solomonoff, Kolmogorov, Levin, Chaitin, Martin-Lof and others, and there are some very interesting results.

We can swap the computer we're using and it will change the length of every optimal program by at most a constant amount, and the amount depends only on the computers, not on the messages. How? Well both computers are universal, so we can make the new computer emulate the old one, and include the same emulator with all of our messages.

We can also never know whether particular programs are optimal (or "elegant" as Chaitin calls them), since we would have to know the output of every smaller program and show that they're different from this one. However, we can't do this due to the halting problem.

Solomonoff has shown that, if we're part-way through receiving/decoding a message (the prefix), we can make excellent predictions about the as-yet-undecoded content (the suffix) of the message by ranking all possibilities in order of their compressed size. Specifically, for each suffix we find all of the programs which output the (known) prefix followed by that particular suffix, give them a probability of 1/2^program size, then sum these individual probabilities to get the probability of that suffix. This dominates any other prediction strategy, and works even when the prefix is empty (ie. we haven't received anything yet).

Hutter has shown that Solomonoff's prediction results can be inserted into decision theory to produce a totally rational, super-intelligent, ultimate decision maker, known as AIXI.

Chaitin has shown that any formal system (for example, set theory) can only prove theorems which can be compressed to a size smaller or equal to what the system can be compressed to. This is actually quite intuitive: given any formal system, we can enumerate it's theorems until the cows come home. Thus, for anything the system can prove, we can use the system itself (plus a brute-force enumeration loop and stop condition) as the compressed version of the theorem. Anything which requires more bits than the system cannot be proved in the system (otherwise it would appear in the list of theorems and wouldn't need more bits).

This latter result is interesting, since it's a rigourous proof that proofs contain no information, or equivalently, that all theorems are present in the axioms/rules. So what is the point of proving stuff? Well, the obvious reason is that it's better to publish massively redundant proofs of useful, non-obvious theorems than it is to have everyone start from theorem 1 and work up whenever they need a result. This is effectively a space/time tradeoff: we require more storage space, but reduce the amount of computation which needs to be done. Essentially, every published proof is cached; every known theorem is memoised. Maths then becomes a joint venture into finding new, independent axioms so that more theorems become accessible, and striking a balance between what to publish and what to calculate on demand.