Introductory Machine Learning: Trees, Bagging, and Boosting, Part 1: The Information Function and Entropy.


My attempt here is to motivate and derive some basic notions of information theory in a way as to let the reader slowly get a picture of what's happening. It started off as some notes trying to get at the intuition behind some information theoretical concepts, but I felt that some of the worked-out examples may be useful to others.

Information and the Information Function.

There's a number of good ways to think about information. Before we even define what information is, we should already have one requirement for how it works; namely, the knowledge of a rare event happening is somehow more informative than the knowledge of a usual event happening.

For example, it is probably more information to tell someone that the sun did not come up this morning than to tell someone that the sun did come up this morning; the former is much rarer, and so the information contained in that message is greater in some vague sense. We'd like to make that vague sense a bit more precise at the cost of having to use more sophisticated mathematics than just a gut feeling about what things are probable. Of course, this requires probability.

We'd like, in other words, to have an information function $I(x)$ which takes in an event $x$ and spits out some number which gives us an idea of the information contained in the event. In order to build our function, let's look at three properties that we'd like our information function to have. Before you continue, think about if any of these are unnecessary or if there are any others that you might want to add &mash; and see if you can derive a similar (or maybe significantly different!) information function.

  1. Common events should contain less information than uncommon events.
  2. An event which must happen (the probability is 1) should have no information content.
  3. If two events are independent of each other, their information contents should add together. That is, if we know that $A$ happens and $B$ happens and they don't have anything to do with each other, their information content together should just be the sum of their individual information contents.

These should all seem reasonable. Let's go through them one-by-one to make sure there's nothing strange or hidden going on.

The first one is just a slightly more formal version of what we wrote above. Strictly speaking, if $x$ is more probable than $y$, then we want $I(x) \lt I(y)$. At this point, we could have these numbers be anything and the reader should attempt a few. One that seems pretty nice is using something like the probability but in reverse. Maybe we could do something like \[I_{0}(x) \overset{?}{:=} 1 - P(x).\] In this way, we'd have that an event which was common (say, $P(x) = 0.8$) and an event that was uncommon (say, $P(y) = 0.2$) satisfied $I(x) = 0.2 \lt 0.8 = I(y)$. We could also do something like \[I_{1}(x) \overset{?}{:=} \frac{1}{P(x)}.\] With this one, we'd get kind of crazy large values for things which were very probable. But it still works so far.

The second thing is more of a bounding condition. The mathematical interpretation of this is that if $P(x) = 1$ then $I(x) = 0$. This kills off $I_{1}$ above, but our first interpretation $I_{0}$ is still fine for now. We still have an (uncountably) infinite number of possible values for the information function $I$, though. Let's see if the third one limits it a bit.

This one is a little stranger than the others, but it should make sense. If we have some number of "information units" from event $A$ and it has nothing to do with event $B$, then their information units, together, should just add up. This seems intuitively realistic but note that this isn't a mathematical requirement so much as a requirement which seems like something, to our guts, that information should follow. In mathematics, this is written as $I(x \cap y) = I(x) + I(y)$. This is a form of linearity of $I$, and it restricts quite a bit the type of function we can make $I$. Let's see if our original guess $I_{0}$ works: \[I_{0}(x \cap y) = 1 - P(x\cap y) = 1 - P(x)P(y)\] but this is not equal to \[I_{0}(x) + I_{0}(y) = 1 - P(x) + 1 - P(y) = 2 - P(x) - P(y)\] Well, dang. What do we do? We're out of guesses! From above, we see that we really need a function which will take things like multiplication and turn them into addition. This statement, while subtle, is probably the most glossed over statement in many information tutorials I've seen: why do we need a function which takes multiplication and turns it into addition?

The strict answer is that we don't, but here's one potential reason why this makes a lot of things easier. We're dealing with independent events here, and in this case the probability function takes input which "looks like" addition and turns it into multiplication: if we have the events $A$ and $B$, we can think of this as "adding" the events $A$ and $B$ together, so this gives $P(A \cap B) = P(A)P(B)$. Strictly speaking, there is nothing "addition-like" about the "and" operation except the fact that it is telling us that two events are both happening, but it may appeal to your intuition and make the following seem a bit less arbitrary. Since we have this property of probability, if we want to use the probability function in our definition of the information function, then we should at least try to make the products that we're getting into sums that we want. In mathematics, we want some function $F$ such that, for independent events $A,B$ we have \[F(P(A\cap B)) = F(P(A)P(B)) = F(P(A)) + F(P(B))\] or something similar.

There's an (uncountable) infinite number of functions like $F$. Luckily, there's only a few well-known and well-documented functions. The one that should instantly come to mind is $\log$ since one of $\log$'s selling-points is that it turns multiplication into addition: $\log(a+b) = \log(a) + \log(b)$. Great. Let's try out $\log$.

So. How do we use $\log$ to make what we want? Let's just do the easiest possible thing and see if it works. \[I_{2} \overset{?}{:=} \log(P(x))\] Let's see if it works with each of the three criteria above. First, if $P(x) = 0.2$ and $P(y) = 0.8$ then we should have $I(x) > I(y)$. Plugging things in: $\log(0.2) \approx -1.609 \lt -0.223 \approx \log(0.8)$. Well, crap. This is the first test and it failed. But not to worry, if we just multiply by $(-1)$ then everything will work! This feels like grasping at straws to me, but the idea here is similar to one of our examples above: multiply the log by $-1$ is the same as taking the reciprocal of the probability, which is something that seems reasonable to try. We now have the suggestively named \[I \overset{?}{:=} -\log(P(x))\] which seems to pass the first test. Second test, we need $I(x) = 0$ if $P(x) = 1$. Does this happen? $I = -\log(1) = 0$. Fantastic! Awesome! Great. Okay, let's check the last one. If $A,B$ are independent then \[\begin{align}I(A\cap B) &= -\log(P(A)P(B))\\ &=-(\log(P(A)) + \log(P(B))\\ &= -\log(P(A)) + -\log(P(B))\\ &= I(A) + I(B)\end{align}\] Awesome. This seems like exactly what we want. Intuitively, $I$ is one particular function of many which satisfies our three requirements above. It's not obvious that it satisfies other nice properties (differentiability, increasing, etc.) but at least it's a nice start.

Okay, but what base should we put on our $\log$?

One question that comes up immediately is: what base should we use? Every $\log$ has a base, with the most common bases being $e$ and $10$. For right this second, there is no obvious choice.

When we do information theoretical stuff, the easiest possible question we can ask about something is: does event $A$ happen? In this case, the answer is binary; we can either respond 0 or 1. In this case, we'd like to be able to talk about probabilities of sequences of binary digits. Maybe $0110$ codes something that says, "Is it raining? Are you outside? Is it nice out? Are you underwater?" where the answers are "No, Yes, Yes, No" respectively. If each of these had a probability of $\frac{1}{2}$ of happening (which should be the default probability if you've no idea how probable something is), then we'd be looking at probabilities that look like $\frac{1}{2^{n}}$. Our information function for this kind of thing looks like: \[\begin{align}I(A) &= -\log(P(A))\\ &= -\log(\frac{1}{2^{n}})\\ &= \log(2^{n})\end{align}\] It would be so nice if this just returned the power of 2 that we put in. If we make the base of the $\log$ equal to 2, this is exactly what happens. So, let's just make one minor, final modification to our information function and state the formal definition. In many texts, the function $I$ is called the self-information function, so we'll follow this lead.

Definition: Self-Information Function

Given $x$ some event with a corresponding probability $P(x)$, the self-information of the event $x$ is given by \[I(x) = -\log_{2}(P(x)).\]

and there we have it. A function that takes in an event and spits out a number corresponding to the informational content (with respect to other events) of that event.

The reader should note that the choice of base is not entirely standardized: many texts use 2, but some texts will use $e$. Make sure you know what base the text is using before attempting to do calculations.

Onto Entropy for Discrete Random Variables.

It's nice to measure the value of a single event, but what if we want to know the expected value of information content for something like a discrete random variable? For example, if we have that $X$ is a discrete random variable denoting what side comes up on a die roll, what is the expected amount of information we will get?

For the discrete (non-continuous) case, we have a set of possible values $\{x_{1}, x_{2}, \dots, x_{n}\}$ for $X$. In this case, we can derive the expected value in a straight-forward way. We'll call this (Shannon) entropy and denote it by $H(X)$. \[\begin{align}H(X) &= E[I(X)]\\ &= \sum_{x_{i}}P(x_{i})I(x_{i})\\ &= -\sum_{x_{i}}P(x_{i})\log_{2}P(x_{i})\end{align}\] Nothing too tricky here, just using the definition of expected value. Let's just state this formally now that we've derived it.

Definition: (Shannon) Entropy

Given $X$ a discrete random variable with potential values $\{x_{1}, x_{2}, \dots, x_{n}\}$, the (Shannon) Entropy is given by \[H(X) = -\sum_{x_{i}}P(x_{i})\log_{2}P(x_{i}).\]

Let's just do a few examples before going on.

Example: Rolling a fair die.

As a somewhat silly example, taking our discrete random variable to represent a fair die roll, we will have that \[H(X_{\text{die roll}}) = -6\left(\frac{1}{6}\right)\log_{2}\left(\frac{1}{6})\right) = \log_{2}(6)\] which is about $2.58$. Right now, this number is meaningless to us, but let's compare it to a coin flip.

Example: Coin Flip.

A similar calculation gives us a fair coin toss. \[H(X_{\text{coin}}) = -2\left(\frac{1}{2}\right)\log_{2}\left(\frac{1}{2}\right) = log_{2}(2) = 1\]

These examples imply we get more information knowing the die roll than we do this coin flip. Intuitively, this makes sense: we have around a $16.67%$ chance of guessing a correct die roll value, but we have a 50% chance of guessing the coin flip. Let's do one more example that's a bit more extreme, just for kicks. This one will be similar to playing the lottery.

Example: Extreme.

Suppose that I flip an unfair coin where the probability of heads is $0.001$ and the probability of tails is $0.999$.

\[\begin{align}H(X_{\text{coin}}) &= -(0.001)\log_{2}(0.001) - (0.999)\log_{2}(0.999)\\&=0.0114\end{align}\]

Why is the entropy so low here? Why are we getting virtually no information? Recall that if we have a certain event $x$ then $P(x) = 1$ and so $H(X) = -\log_{2}(1) = 0$; this example gives us a value which is fairly close to zero. The reason for this is that most of the time we're going to be getting tails. Hence, on average, knowing the value of this experiment won't give us any new information if we're already assuming tails will come up. The reader should try to calculate this when the probability of heads gets smaller and smaller. Speaking of which...

One small point here: what happens if $P(x_{i}) = 0$ for some $x_{i}$? We note that the term for this in entropy would be $0\cdot\log_{2}(0)$, but $\log_{2}(0)$ is negative infinity. In this case, because the limit from the right of this term is 0, we can consider any term where $P(x_{i}) = 0$ to be zero. This makes sense intuitively as any event which is not possible should not contribute to the calculation of entropy.

Entropy for Continuous Random Variables.

If you remember your calculus, this part will not be so bad: the punchline is that we're going to change the sum from above into an integral.

Let's suppose we have a continuous random variable $X$. What's the expected value of the self-information of $X$? Above, we used potential values $x_{i}$ of $X$ and summed $P(x_{i})\log_{2}(P(x_{i}))$ together. Unfortunately, we don't have a finite number of $x_{i}$ values for continuous distributions. Because the probability of getting a single value in a continuous distribution is 0, we use what's called a probability density function or PDF. Intuitively, the PDF gives us the probability of getting around $x_{i}$ for each value $x_{i}$.

Definition: Continous (or Differential) Entropy

Given a continuous random variable $X$ from some distribution with associated probability density function $f$, the continuous (or differential) entropy is given by \[h(X) = -\int f(x)\log_{2}(f(x))\,dx.\]

The Conditional Entropy.

One cool thing about entropy containing probability functions is that we can talk about concepts in probability and translate them into concepts in entropy fairly easily. For example, given $X, Y$ random variables, we may want to know the value of $H(X|Y)$. This can be interpreted as the entropy of $X$ given that we know the value of $Y$ already, similar to how the conditional is interpreted in probability. Let's just plug some conditionals into the formula and see what pops out. We will be liberally applying Bayes' Theorem here. We'll describe the steps after.

\[\begin{align}H(X|Y) &= \sum_{j}P(Y = y_{j})H(X | Y = y_{j})\\ &=-\sum_{j}P(Y = y_{j})\sum_{i}P(X = x_{i} | Y = y_{j})\log_{2}(P(X = x_{i} | Y = y_{j}))\\ &=-\sum_{j}\sum_{i}P(X = x_{i}, Y = y_{j})\log_{2}(P(X = x_{i} | Y = y_{j}))\\ &=-\sum_{i,j}P(X = x_{i}, Y = y_{i})\log_{2}\left(\frac{P(X = x_{i}, Y = y_{j})}{P(X = x_{i})}\right)\\ &=\sum_{i,j}P(X = x_{i}, Y = y_{i})\log_{2}\left(\frac{P(X = x_{i}, Y = y_{j})}{P(X = x_{i})}\right)\\ &=\sum_{i,j}P(x_{i}, y_{j})\log_{2}\left(\frac{P(x_{i}, y_{j})}{P(x_{i})}\right)\\ \end{align}\]

Here we use a comma between terms to denote AND and the last equality is just giving the usual short-hand version of the equality.

This scary-looking series of equalities isn't as bad as it is long. Let's just briefly look at what happened. On the first line, we cut up the conditional by summing up all of the possible values for $Y$. The next line cuts things up again by summing across all values of $X$. Because $P(x_{i})\cdot P(x_{i} | y_{i}) = P(x_{i}, y_{i})$ we can pull in the left-hand factor and reduce. At this point we have a double sum and both indicies are independent of each other; for short-hand, we make this one sum with two indices. On the same line, we expand the probability in the log using Bayes' theorem. The second-to-last line just inverts this fraction and allows us to cancel a negative sign. The last line is a restatement with more concise notation and is usually the form that conditional entropy is given in.

Let's just formally state this.

Definition: Conditional Entropy

Given random variables $X, Y$, the conditional entropy is given by \[H(X|Y) = \sum_{i,j}P(x_{i}, y_{j})\log_{2}\left(\frac{P(x_{i}, y_{j})}{P(x_{i})}\right).\]

What does it mean when $H(X|Y) = 0$? This means we get no additional information from $X$ when we already have $Y$; that is, if $X$ is completely determined by $Y$, then we will have $H(X|Y) = 0$.

As one would expect, $H(X|Y) = H(X)$ if $X$ and $Y$ are independent. One might guess that there's an analog for conditional entropy like Bayes' rule — and, indeed, there is! Because we're working with logs, it's slightly different than you might expect: \[H(X|Y) = H(X,Y) - H(Y)\] The proof of this is left to the reader; it's essentially just reducing the log from the equality above. What is the interpretation of this?

In the Next Part...

In the next part, we're going to go over Kullback-Leibler Divergence, which is a measure of the difference between two probability distributions. This is, in particular, a useful metric for evaluating things like statistical models; we're going to use it to help split up things into categories with our decision trees.