Reducing Information Theory to Probability Theory

Note: I’m not an expert probability theorist, so this could be wrong. But I feel quite confident in it.

Taboo Information

I never got a handle on the concept of information. I could see no good reason why it had to be defined the way it was:

\[ \text{Information content of a message} = \log_2\!\left(\frac{1}{p}\right) \]

where \(p\) is the probability assigned to the message.

I got the feeling you shouldn’t need anything more than probability theory to solve problems, even those that apparently need “information theory”.

Why did I feel this way? Well, in real life, we want more information; it’s better than less information. That seems reasonable. You can take better actions with more information. But, then we come across this term in math called “information” too. And, somehow, without realizing it, we act as if it’s like our old information and as if more of it is better too.

That was the source of my confusion. We have these two different concepts but we treat them the same way without justifying it. (Yes, I know that it’s going to end up being the same concept as our commonsense “information”, but you have to prove it. You can’t just pull a definition out of nowhere and treat it like the old concept.)

To make it more explicit, rename the term “information” to something that doesn’t clash with our existing concepts. That shouldn’t change anything. A rose by any other name would smell as sweet and whatnot. For example, let’s call it “bazfoo”.

So, a stranger comes up to you and says, “hey, I have this new definition: bazfoo \(= \log_2(1/p)\). Now, we should all try to get more of this value.” Would you accept it at face value and drop everything else to focus on this? I hope not. You would ask what is so special about bazfoo that you should aim to get it above all else when seeking the truth. You would ask him to talk in terms of probability theory, since that seems to be the solid mathematical base-layer in which any mind must think.

Therefore, let’s try to reduce Bazfoo Theory (formerly known as Information Theory) to basic probability theory, so that we do away with the concepts of information and entropy. We can still use them afterwards, but only as a convenience, just like we talk about sentences and paragraphs, when in reality there are only the individual English letters underneath.

In short, let’s taboo the word “information”.

Bayesian Inference

What is the general work of thinking? Making correct predictions - putting more probability mass on the actual outcome and thus less on the rest. That’s all that matters. Knowing exactly which outcome will happen lets you exploit that knowledge. And you want to do this quickly.

How do you do that? You can imagine a whole lot of things that make predictions for any given situation. Call them hypotheses. So, should you just pick one at random and accept whatever predictions it makes? Or maybe take an average over them all? Well, the laws of probability theory dictate that you specify your probability distribution over those hypotheses - how much confidence you have in each of them - and then use their predictions weighted by their probabilities. You have exactly 1.0 of probability mass and you must distribute it among those hypotheses.

So, if you put too much probability mass on the wrong hypotheses, then their wrong predictions will get more weight and thus your final probability distribution - your prediction - will be poor.

Therefore, our aim is to put maximum probability mass on the correct hypothesis (call it \(H_0\)) and thus less on the others. Of course, we don’t know which one is the correct hypothesis. How then do we get to high probability mass on the correct hypothesis?

We get evidence from the world and update the posterior probabilities of our hypotheses using it. Say you see an outcome for which you assigned probability P(E). You now use Bayes Theorem:

\[ P(H \mid E) = P(H) \times \frac{P(E \mid H)}{P(E)} \]

where

P(H): prior probability of hypothesis H
P(H|E): posterior probability of hypothesis H
P(E|H): likelihood of E judged by this hypothesis
P(E): probability of outcome E

You thus rearrange your confidence in your hypotheses after each piece of evidence. If a hypothesis makes a correct prediction, i.e., assigns a high likelihood to the actual outcome, it will find its confidence level rising. If it makes a wrong or imprecise prediction, it will find its confidence level falling. The correct hypothesis will therefore find its confidence level rising inexorably, because it always makes correct predictions. In the fullness of time, with enough evidence, all the others will end up with posterior 0, and \(H_0\) will have posterior 1, which is exactly what we want.

So, we use Bayes Theorem to succeed in our quest to get full confidence in the correct hypothesis. (For more, look at Eliezer’s excellent explanation of Bayes Theorem.)
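Here’s a minimal sketch of that updating loop in Python (the coin-bias hypotheses, priors, and observations are all made up for illustration; unlike the ideal \(H_0\), these hypotheses make probabilistic rather than perfect predictions):

```python
# Minimal sketch of Bayesian updating over a handful of hypotheses.
# Three hypotheses about a coin's bias towards heads; the data happen to
# favour the 0.8-bias one.

hypotheses = {"bias 0.8": 0.8, "bias 0.5": 0.5, "bias 0.2": 0.2}
confidence = {name: 1 / 3 for name in hypotheses}  # equal priors

observations = ["H", "H", "T", "H", "H", "H", "T", "H"]

for outcome in observations:
    # P(E|H): the likelihood each hypothesis assigns to this outcome.
    likelihood = {name: (bias if outcome == "H" else 1 - bias)
                  for name, bias in hypotheses.items()}
    # P(E): total probability of the outcome under your current mixture.
    p_e = sum(likelihood[name] * confidence[name] for name in hypotheses)
    # Bayes Theorem: P(H|E) = P(H) * P(E|H) / P(E).
    confidence = {name: confidence[name] * likelihood[name] / p_e
                  for name in hypotheses}

print(confidence)  # most of the probability mass has moved to "bias 0.8"
```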

Multiplying your Confidence

Now, where does “information” fit in here? What is the “information content” of an outcome? What do you get from one particular outcome or message?

Remember, you just want to find the correct hypothesis \(H_0\). You couldn’t give less of a damn about the rest. For everything, ask how it affects your confidence in \(H_0\).

Now, you see an outcome E of probability P(E). How does that affect your confidence in \(H_0\)?

As per Bayes Theorem, \(P(H \mid E) = P(H) \times \frac{P(E \mid H)}{P(E)}\).

But this is \(H_0\) we’re talking about. It makes perfectly accurate predictions every time. That means \(P(E \mid H_0)\) is always 1.

Therefore, \(P(H_0 \mid E) = P(H_0) \times \frac{1}{P(E)}\).

So, what an outcome E really does is multiply our confidence in the correct hypothesis by \(\frac{1}{P(E)}\).

(Note that increasing our confidence in \(H_0\) automatically decreases our confidence in the rest since the sum of all P(H) must be 1.)

And this is why, I think, we define information as \(\log_2\!\left(\frac{1}{p}\right)\). It’s an indirect measure of how much an outcome multiplies our confidence in the correct hypothesis. There’s nothing new or special about “information”. In the past, I kept getting confused by the fact that information seemed like a completely different beast from probability. But, all you’re doing is taking the logarithm. That shouldn’t do anything. That doesn’t add any information, if you will pardon the phrase. It’s just a convenient representation. That’s all.
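A quick numerical sketch of that last point (the outcome probabilities are made up): the confidence-multipliers multiply together, while their logarithms, the “information contents” in bits, simply add, and you can convert back and forth freely.

```python
import math

# Made-up probabilities you had assigned to three outcomes that then occurred.
outcome_probs = [0.5, 0.25, 0.125]

multiplier = 1.0  # total factor applied to P(H0)
bits = 0.0        # total "information content" of the same outcomes

for p in outcome_probs:
    multiplier *= 1 / p        # confidence in H0 gets multiplied by 1/P(E)
    bits += math.log2(1 / p)   # the log of the very same factor

print(multiplier)   # 64.0
print(bits)         # 6.0
print(2 ** bits)    # 64.0: the same number, in a different representation
```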

Asking for the Improbable

So, why is more information better?

Well, consider what happens if P(E) is small. Then, \(\frac{1}{P(E)}\) will be large and thus the smug, ever-correct hypothesis \(H_0\) will get its confidence multiplied a lot. So, when low-probability outcomes occur, \(H_0\) has a bonanza. And since we want to discover \(H_0\) as soon as possible (we don’t know which one it is), we want P(E) to be as small as possible each time so that \(H_0\) rises to probability 1 very quickly.

That’s why more “information” is better than less: it multiplies your confidence in the correct hypothesis by a larger amount. Put differently, it tells you more about the subject, which is exactly in line with our intuitions. This is how the seemingly alien mathematical concept of information matches our everyday notion. The more information you have, the fewer mistakes you will make and the more accurate your predictions will be (since you will listen more to the correct hypothesis).


However, there’s a tiny problem. You can’t ask for low-probability outcomes to happen. You have assigned low probability to them! You don’t expect them to happen.

So, we want our confidence in the correct hypothesis to rise very quickly, for which we need low-probability outcomes to happen, but you can’t ask for those outcomes to happen. What do we do then? Are we doomed to taking a long time for the correct hypothesis to beat the others and get to probability 1?

Entropy is your Friend

It’s essentially a question of choice. You have two different experiments \(X_1\) and \(X_2\), and you can choose only one. Which one will you pick? Remember, our aim is to get high confidence in \(H_0\) as quickly as possible.

Well, you want the one that will multiply P(\(H_0\)) the most. But you don’t know which outcome will happen. If you knew somehow that \(X_1\) would give an outcome of probability 0.1 and \(X_2\) would give an outcome of probability 0.2, then you could tell that P(\(H_0\)) would get multiplied by 1/0.1 = 10 and 1/0.2 = 5 respectively. And you would choose \(X_1\) to get the bigger boost.

However, the same problem plagues you: you don’t know for sure which outcomes will occur. But you do have probabilistic knowledge.

Well, take the first experiment \(X_1\). You assign probabilities \(p_i\) to its various outcomes \(m_i\).

Now assume you run experiments with the same probability distribution as \(X_1\) several times. If there are \(n\) experiments, then you expect \(n \times p_i\) messages of type \(m_i\). For example, if you have a biased coin where you expect heads with probability 0.8, then in 100 tosses, you expect to see around 80 heads and 20 tails.

\(H_0\), of course, will predict the correct outcome every time. So, the outcomes will all multiply your confidence in \(H_0\) by the inverse of their probability. So,

\[ P(H_0 \mid o_1, o_2, \ldots) = P(H_0) \times \frac{1}{p_{o_1}} \times \frac{1}{p_{o_2}} \times \cdots \]

where \(o_i\) is the outcome of the \(i\)th experiment and \(p_{o_i}\) is its probability.

Since there are \(n \times p_i\) messages of type \(m_i\), you get

\[ P(H_0 \mid o_1, o_2, \ldots) = P(H_0) \times \prod_i \left(\frac{1}{p_i}\right)^{n \times p_i} \]

For example, if you expected heads with probability 0.8, then with 100 trials you would get 80 heads and 20 tails. So, your confidence in \(H_0\) would be multiplied by 1/0.8 on 80 occasions and multiplied by 1/0.2 on 20 occasions, for a total of \((1/0.8)^{80} \times (1/0.2)^{20}\).
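That works out to roughly \(5.7 \times 10^{7} \times 9.5 \times 10^{13} \approx 5.4 \times 10^{21}\), so even this mildly biased coin multiplies your confidence in \(H_0\) by an astronomical factor over 100 tosses.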

Pulling the common exponent \(n\) outside the product:

\[ P(H_0 \mid o_1, o_2, \ldots) = P(H_0) \times \left( \prod_i \left(\frac{1}{p_i}\right)^{p_i} \right)^{n} \]

Let’s denote \(\prod_i \left(\frac{1}{p_i}\right)^{p_i}\) by A. This “A” is a function of just your probability distribution \(p_i\) over the possible messages.

\[ P(H_0 \mid o_1, o_2, \ldots) = P(H_0) \times A^{n} \]

So, if you get n messages from that probability distribution, for a large enough n, your confidence in the correct hypothesis will be multiplied by \(A^n\).

This is exactly what we wanted to know! To compare two experiments \(X_1\) and \(X_2\), just look at their values for A. If A for \(X_1\) is greater than A for \(X_2\), then we can expect to get our confidence multiplied more with \(X_1\). So, to get the maximum confidence-multiplier, look for the experiment for which you have the highest A.
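Here’s a small Python sketch of that comparison, using two made-up experiments: a fair coin and the 0.8-biased coin from the example above.

```python
# Compare two experiments by their per-experiment confidence-multiplier A,
# where A = product over outcomes of (1/p_i)^(p_i).

def confidence_multiplier(probs):
    a = 1.0
    for p in probs:
        a *= (1 / p) ** p
    return a

x1 = [0.5, 0.5]   # fair coin
x2 = [0.8, 0.2]   # biased coin from the worked example

a1 = confidence_multiplier(x1)   # 2.0
a2 = confidence_multiplier(x2)   # about 1.649

print(a1, a2)
print(a2 ** 100)   # about 5.4e21, matching the 80-heads / 20-tails calculation

# a1 > a2, so per toss the fair coin multiplies P(H0) more: choose X1.
```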

Entropy in Disguise

What is this mysterious A? Well, it’s nothing but \(2^{\text{entropy}}\)!

Entropy is defined in information theory as

\[ \text{entropy} = \sum_i p_i \log_2\!\left(\frac{1}{p_i}\right) \]

Multiplying a logarithm by \(p_i\) is the same as raising the value inside to the power \(p_i\). So:

\[ \text{entropy} = \sum_i \log_2\!\left(\left(\frac{1}{p_i}\right)^{p_i}\right) \]

Also, a sum of logarithms is the same as the logarithm of the product of the values inside:

\[ \text{entropy} = \log_2\!\left(\prod_i \left(\frac{1}{p_i}\right)^{p_i}\right) \]

That inner part is nothing but our A. So,

\[ \text{entropy} = \log_2(A) \]

Equivalently, \(A = 2^{\text{entropy}}\).
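A quick numerical check of that identity, on a few made-up distributions:

```python
import math

def confidence_multiplier(probs):
    # A = product over outcomes of (1/p)^p
    a = 1.0
    for p in probs:
        a *= (1 / p) ** p
    return a

def entropy(probs):
    # sum over outcomes of p * log2(1/p)
    return sum(p * math.log2(1 / p) for p in probs)

for dist in ([0.5, 0.5], [0.8, 0.2], [0.25, 0.25, 0.25, 0.25]):
    a = confidence_multiplier(dist)
    h = entropy(dist)
    print(dist, a, h, 2 ** h)   # 2 ** h comes out equal to A every time
```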

Just like information was the logarithm of the confidence-multiplier for a single outcome, entropy is the logarithm of the confidence-multiplier you can expect per experiment in the long run.

So, the way to quickly gain confidence in the correct hypothesis is to pick the experiment with maximum entropy. In other words, the fastest way to learn is by asking questions with the largest entropy. More on this coming soon.

Not Indispensable

Previously, information and entropy were black boxes to me. I never really got why we chose such a weird definition for “information”. And the entropy formula was even more incomprehensible (\(\sum_i p_i \log_2(1/p_i)\), seriously?).

Now I realize there’s nothing essential about the concepts of information and entropy. It simply boils down to multiplying our confidence in the correct hypothesis, which has always been our goal. The more an outcome multiplies our confidence, the more informative we consider it. And the more an experiment multiplies our confidence on average, the more entropy we say it has.

Notes

There exists one correct hypothesis that predicts all the past evidence with perfect accuracy and will predict all future evidence too. Of course, you may have several candidates at present that have perfect scores and would need more evidence to eliminate the rest. However, if you don’t have any hypothesis at all that has a perfect score so far - assigning a likelihood of 1 to the correct outcome each time - then that means your hypothesis set is lacking.

Lesson: When you come across important terms, rename them with nonsense words so that you don’t confuse them with existing words. Like how we confuse the information from information theory with the information from common sense. This way you can truly reduce it to its core.

Information-processing - a buzzword I always avoided - is nothing but Bayesian updating. In Bayesian terms, you take an outcome (aka message) and update on it to change your probability distribution over your hypotheses. In information theory terms, you take a message and then use its information content to multiply your confidence in the correct hypothesis.

Created: January 8, 2016
Last modified: September 18, 2016
Status: in-progress
Tags: reduction, information theory
