Probability Theory Notes

Information Theory

Amount of information gained when you receive a message of probability p = log2(1/p)

Bits can be added because probabilities can be multiplied. Getting a message of probability 1/2 removes half of the probability mass, whatever it was. If you had 1/3 probability mass, you will now have 1/3 x 1/2 = 1/6 probability mass left. If you had 1 probability mass, you will now have 1 x 1/2 = 1/2 probability mass left. Your space has shrunk by half. It is independent of the initial amount of probability mass. So, we call it one bit for convenience. Similarly, a message of probability 1/16 removes 15/16 of the probability mass, whatever it was initially. So, we say that it gives you 4 bits of information. 4 bits of information means reducing your uncertainty by a factor of 4.

Uncertainty is the same as entropy.

I still don’t get it. What does “information” really mean?

You have a bunch of messages. The sum of their probabilities is 1. You get a message of probability p. What now? What changes?

I think it’s all about the set of hypotheses you have. Take the submarine example. You have 64 hypotheses each saying that the submarine is in one of the 64 cells. Initially, you believe in each hypothesis equally with probability 1/64. Your probability distribution over the hypotheses is what determines your probability distribution over the messages at each point.

Initially, say you ask whether the submarine is in cell 28. You will get a message saying yes or no. You weight the predictions from each of your hypotheses with your confidence in them. So, initially, all hypotheses except one predict that you will get no. Therefore, you expect yes with probability 1/64 and no with probability 63/64. If the answer is yes, then, as per Bayes Theorem, all the other hypotheses get a posterior probability of 0 and the lone hypothesis predicting that the submarine is in cell 28 gets a posterior probability of 1.

Each message you receive is strong evidence (really?). It falsifies all hypotheses that predicted the other possible messages. Wait. I don’t think this is right. What actually happens is that your hypotheses together form a probability distribution over the messages. I suspect that information captures how your initial probability distribution changes to your final distribution. But how?

Taboo “information”

What does one bit mean? In the case where we have 64 equi-probable hypotheses, we can see that one bit means eliminating half of them. But what happens when we have different levels of confidence in the hypotheses? What “half” does one bit eliminate here? This is my single biggest question about information theory.

You accumulate information. What does that mean in probability theoretic terms, reductionistically? How could you say that 6 bits would be exactly enough to identify the submarine no matter how you did it?

In other words, taboo “information” and “bits” and “entropy”, and explain it all in terms of basic probability theory.

Taking the log of probability should not do anything by itself. You can now add things (bits) that you could multiply earlier (probabilities). Big deal. Doing log(1/p) does not add any “information”, if you will pardon my phrase.

Let’s work through two examples without using “bits”.

Example 4.3. The game ‘sixty-three’. What’s the smallest number of yes/no questions needed to identify an integer x between 0 and 63?

At worst, we can ask 63 yes/no questions (since if it isn’t the first 63 of the 64 numbers, it must be the 64th number). The constraint is that we must ask the smallest number of binary questions.

Looking at it from a Bayesian point of view, we have 64 hypotheses, each saying that the integer is a particular number, like 28. And since we don’t know anything else, we assume that each of them has equal prior probability 1/64. Now, we want to reach the state where one hypothesis, which points to the correct number, has posterior probability 1 and every other hypothesis has posterior probability 0. This is what it means to “identify” the integer x.

So, the initial probability distribution over hypotheses (call it D1) is flat at 1/64, whereas the final distribution (D2) is zero everywhere except for a single peak. Our strategy should let us expect to get from D1 to D2 in as few pieces of evidence as possible, for any given integer.

What if we ask a question for which “yes” has a probability 1/16 and “no” has a probability 15/16? Like the question “Is x+1 divisible by 16?” - only 15, 31, 47, and 63 will match; the rest won’t. What do we expect will happen?

First, note that we say “yes” has a probability 1/16 here because of our prior probability distribution. We come up with the probability by taking each hypothesis’ prediction and weighting it by the hypothesis’ prior probability. Here, 1/16 comes from 4 x 1 x 1/64 + 60 x 0 x 1/64. The 4 hypotheses above predict “yes” as the answer to the question with likelihood 1 but their vote is weighted by their prior probability 1/64; similarly, the other 60 hypotheses predict “yes” with likelihood 0 and their vote too is weighted by their prior probability 1/64.

With probability 1/16, we will get a “yes” message. Then, the hypotheses pointing to 15, 31, 47, and 63 respectively will get a posterior probability = P(H) x P(E|H) / P(E) = 1/64 x 1 / (1/16) = 1/4. The other hypotheses all get a posterior probability = 1/64 x 0 / (1/16) = 0.

With probability 15/16, we will get a “no” message. Then, the 4 hypotheses get a posterior probability of 1/64 x 0 / (15/16) = 0, and the others all get a posterior probability = 1/64 x 1 / (15/16) = 1/60.

We could have used a different question, like “Is this number even?”. Half of the hypotheses predict “yes” and half predict “no”, and since we believe all of them equally, our final weighted prediction is 1/2 and 1/2 for each. What would the posterior probability distributions look like in this case?

With probability 1/2, we will see a “yes” message. Then, the even hypotheses will get a posterior probability = 1/64 x 1 / (1/2) = 1/32. The odd hypotheses get a posterior probability = 0. With probability 1/2, we will see a “no” message, and get opposite results.

Now, which question is better? Remember, we want to identify the correct integer as soon as possible, which means getting the final spiked distribution D2. So, which question gets us there faster?

I think that every probability distribution over hypotheses has some sort of “distance” away from a fully certain distribution (where you have full confidence in one hypothesis; or maybe one where you get all the probability mass in one outcome. I have to figure this out.). Maybe this is what the “uncertainty” measure in information theory captures.

Types

Information is defined for an outcome of an ensemble. Entropy is defined for an ensemble.

The thing is we can conceive of all the possible hypotheses given a set of variables. Now, it’s like reality knows the correct hypothesis and we’re getting messages from it to reduce our uncertainty and identify the right one. So, the ensemble is the set of possible hypotheses. At each point, we have a posterior probability distribution over the hypotheses. We want to get to certainty, where we believe one hypothesis fully.

Look at the Wenglish example. There you get partial messages. One type of message is “what is the first letter of the word?”. If it is z, then you have eliminated a large number of hypotheses about the word. (Here, a hypothesis is a rule accepting or rejecting each word, thus a subset of all the words.)

In this case, the first letter of the word was the ensemble. You can find its “entropy”. The answer you got (‘z’) was the outcome. You can find its “information content”.

Burning Question

Question: Why is more “information” better?

What does it actually do? Why was the “is x even” question better than the “is x+1 divisible by 16” question?

Lesson

Check for type errors when you taboo.

Aside: How much do we forget each day? How much information do we lose? Can we avert that fate by writing a lot? Does writing help us save or maybe even get more information than we would otherwise have? Why focus on writing alone? Locate whatever good hypotheses you can with your information.

TODO: Decomposability of entropy; When can you decompose? (Talk about this in terms of probability theory, not “messages”.)

TODO: Is information received = change in entropy of hypotheses? What is the significance of the entropy of a set of hypotheses? Is it the expected number of bits to get to certainty? Say it in probability-theoretic terms. I don’t think it is the change in entropy. Take a prior probability distribution of 1/4 on heads and 3/4 on tails. You get a heads. That was log4 bits of information. However, the entropy change was 0 - (1/4log4 + 3/4log(4/3)) which is not equal to log4.

Information Theory Exercise

Take two words and find their differing predictions. Like “remorseful” and “contrite” - on what variables do they differ?

Choosing vs Sampling

Here’s an example (from David Mackay’s book Information Theory, Learning, and Inference):

Say that the frequency of O blood type in a population is around 60% and that of AB blood type is 1%. What is the probability of picking two random people from the population such that, when you take their blood samples, you find the blood types O and AB?

I thought it would be P(O) x P(AB) = 60% x 1% = 0.06

But that’s wrong!

Here’s the right way. We look at the problem as sampling with replacement. We pick a person X and then we independently pick a person Y. The possibilities are that they have blood types O and O, O and AB, AB and O, and AB and AB (and, of course, the cases involving other blood types). Thus, the chances of finding blood types O and AB are 2 x P(O) x P(AB) because X could be O and Y could be AB or vice versa.

I went wrong earlier because I thought of it in terms of choosing. I looked at it as choosing two people so that I would end up getting O and AB blood types.

Take a concrete example. Say you have a population of 1000 people. As per the problem statement, 600 of them have blood type O, 100 have blood type AB, and the remaining 300 have other blood types. The total number of pairs you can have, independent of blood type, is 1000 x 999 = 999,000. Now, if you want to choose particularly such that you get one person with blood type O and one person with blood type AB, then there are 600 x 100 = 60,000 possible pairings. But if you ask how many of the total pairs are such that you have one person with blood type O and one person with blood type AB, you find that there are 2 x 600 x 100 = 120,000 such pairs.

You get twice as many pairs in the second case because, in sampling, the order matters. When you choose a pair with one O and one AB, you don’t care about whether you got a O-AB pair or an AB-O pair, all that matters is that you get one of each type. In sampling, for every O-AB pair, you’re likely to see an AB-O pair too, thus doubling your results.

The lesson is: don’t ask how you would choose the desired outcome; look at how likely the desired outcome is from the sampling.

Testing for Independence

Let’s say A has two values a1 and a2 and B has two values b1 and b2.

We say A and B are independent if P(a1|b1) = P(a1|b2) or equivalently P(a1|b1) = P(a1).

Quote

Apparently from John von Neumann to Shannon:

You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.

(temporary link: http://www.science4all.org/article/shannons-information-theory/)

Resources

Check out Everyday Life Information Seeking.

Created: December 4, 2015

Last modified: January 16, 2016

Status: in-progress notes

Tags: notes, probability theory