In the last post we talked about the general ideas behind information and entropy. This time, we're going to introduce a measure of difference between two probability distributions called the *Kullback-Leibler divergence* that will seem somewhat similar to the work we've just done.

The Kullback-Leibler divergence will be significantly easier to motivate if we first investigate the notion of *cross entropy*.

Given two distributions $f, g$ over the same underlying set $X$, the *cross entropy* for the distribution $f$ and $g$ over $X$ is given by
\[H(f,g) = E_{f}[-\log(g)].\]
where $E_{f}$ denotes the expected value with respect to $f$.

(Note, it seems that $H(f,g)$ is also used for the joint entropy, which is another concept in information theory; we shall be extra-clear when talking about things with this notation so as to not cause too much confusion.)

Note that for discrete distributions, this means $H(f,g) = -\sum_{x}f(x)\log(g(x))$; a similar formula (but with an integral) for the continuous form. This looks familiar, no? It's almost the same as the self-informational content of $f$ or $g$, but it's somehow mixed, somehow &emdash; crossed.

The interpretation that will be helpful to us goes something like this: suppose we have a probability distribution $f$ of some event (the test scores in a class, the amount of rain per day, etc.) and we want to create a model of it; this model is, itself, a probability distribution that we'll call $m$. Then we look at $H(f,m)$ and $H(f)$; if they're close, it means the model $m$ is a fairly good approximation of $f$. If they're not all that close, then $m$ may not be the best representation for $f$.

An example might be good here. Let's do an unfair coin flip first, just to have the ease-of-calculation.

Suppose we have the following distribution on $X$ and then three models $m_{1}, m_{2},$ and $m_{3}$ attempting to describe it.

\[\begin{array}{|l|c|c|}\hline X & \text{heads} & \text{tails} \\\hline f & 0.2 & 0.8\\\hline m_{1} & 0.1 & 0.9\\ m_{2} & 0.5 & 0.5\\ m_{3} & 0.2 & 0.8\\\hline \end{array}\]Clearly, the best model is going to be $m_{3}$ since it matches $f$ exactly, but let's check each of these out one-by-one.

First, we compute the (true) entropy of $X$, for comparison purposes.

\[H(f) = -(0.2\log(0.2) + 0.8\log(0.8)) \approx 0.7219\]Great. Now, let's do the cross-entropies.

\[\begin{align} H(f,m_{1}) &= -(0.2\log(0.1) + 0.8\log(0.9)) \approx 0.786\\ H(f,m_{2}) &= -(0.2\log(0.5) + 0.8\log(0.5) \approx 1\\ H(f,m_{3}) &= -(0.2\log(0.2) + 0.8\log(0.8) = H(f) \end{align}\]It might be easier to look at if, instead of going back and forth between the entropy of $f$ and the cross entropies, we were to just subtract $H(f)$ from each. Let's just do that real quick.

\[\begin{align} H(f,m_{1}) - H(f) &= \approx 0.064\\ H(f,m_{2}) - H(f) &= \approx 0.2781\\ H(f,m_{3}) - H(f) &= 0 \end{align}\]At this point we note that the worst model was $m_{2}$ (as we might expect), and $m_{1}$ was fairly good but $m_{3}$ was right-on. Note that it is not always the case that we want an exact fit (and, in particular, this could potentially point to over-fitting).

In the example above, we looked at $H(f,g) - H(f)$ for $g$ is some model. This might be a useful measure, so let's look at it a bit closer. What is this really?

\[\begin{align} H(f,g) - H(f)& = -\left(\sum_{x}f(x)\log(g(x)) - \sum_{x}f(x)\log(f(x))\right)\\ &= -\left(\sum_{x}f(x)\log(g(x)) - f(x)\log(f(x))\right)\\ &= -\left(\sum_{x}f(x)(\log(g(x)) - log(f(x)))\right)\\ &= -\sum_{x}f(x)\log\left(\frac{g(x)}{f(x)}\right)\\ &= \sum_{x}f(x)\log\left(\frac{f(x)}{g(x)}\right) \end{align}\]Note the last line here wasn't necessary; it's just flipping the fraction inside the $\log$, but it's going to tie in well with the next section. In fact, let's not delay this any longer; with this equality in mind, let's just jump right in to the next topic.

Don't let the notation scare you.

For two probability distributions $f$ and $g$ for a discrete random variable $X$, the *Kullback-Leibler Divergence* (also called *relative entropy*) is given by
\[D(f||g) = \sum_{x\in X}f(x)\log_{2}\left(\frac{f(x)}{g(x)}\right).\]

This should look somewhat familiar. In fact, this is just $H(f,g) - H(f)$ from above! The Kullback-Leibler Divergence is exactly this measure that we used before to see how nicely different models fit our distribution.

In the next post, we'll talk about decision trees and how to use the KL-divergence to judge which ones are good and which ones aren't as good.