Introductory Machine Learning: Trees, Bagging, and Boosting, Part 3: Decision Trees.



Introduction.

In the last post we talked about the ways that we can use entropy to measure the difference between probability distributions. This time, we're going to apply what we know to construct a model for making decisions. This is one of the more useful tools in a data analyst's toolkit, so we're going to spend a bit of time on it.

What is a Decision Tree?

A decision tree is similar to a flow-chart: you start at some node and follow edges which represent different possible decisions. For example, below is a decision tree that looks at the features of three different kinds of flowers (this is the popular Iris dataset). If a reader has a new flower and looks at those features, they could guess what type of flower they have by using the decision tree.

The reader would start at the top and measure the width of the petal of the flower; if the width is less than or equal to 0.6, then we go down the left-hand route and note that our flower is most likely an Iris Setosa. On the other hand, if our flower has a petal width of more than 0.6, then we go down the right-hand side and on to the next step. At this step, we split again on petal width, but at a different value. If we go left from there, we split on a different variable: petal length. We continue until we're able to classify our flower. The decision tree has helped us decide what flower we have.

The decision tree can also be written in ASCII in the following form (from a tool called Weka):

petalwidth <= 0.6: Iris-setosa
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica
|   |   |   petalwidth > 1.5: Iris-versicolor
|   petalwidth > 1.7: Iris-virginica

We'll be using something that looks like the ASCII version in this post, just because it's easier for me to create. The reader should try to trace along and see what's going on here: if a decision is made (as in the first line), the criterion is followed by a colon and then the name of the flower (or whatever you're deciding); if not, we indent and create a new branch which follows the same rules as before.
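If it helps to see the same structure as code, here's a minimal Python sketch of the iris tree above written as nested if/else statements (the function and argument names are my own; the thresholds and labels are read straight off the tree):

```python
def classify_iris(petalwidth, petallength):
    """Trace the decision tree above: each if/else is one branch."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    else:
        if petalwidth <= 1.7:
            if petallength <= 4.9:
                return "Iris-versicolor"
            else:
                if petalwidth <= 1.5:
                    return "Iris-virginica"
                else:
                    return "Iris-versicolor"
        else:
            return "Iris-virginica"

# For example, a flower with a small petal width lands on the first leaf:
print(classify_iris(petalwidth=0.2, petallength=1.4))  # Iris-setosa
```

Tracing a flower through the function is exactly the same as tracing it down the ASCII tree.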

We can also make somewhat silly examples of decision trees, like the following one:

raining? == True: don't play baseball
raining? == False
|	have any friends? == True
|	|	have a bat? == True: play baseball
|	|	have a bat? == False: don't play baseball
|	have any friends? == False: don't play baseball

Here, we can ask: is it raining? If so, don't play baseball; if not, we go to the next branch. In this branch, it asks if you have any friends. If you don't, you're not going to play baseball; if you do, we go to the next branch. The last, innermost branch asks if you have a bat: if you do, play baseball; if you don't, don't play baseball. This is a quick and easy way to read the decision tree out.

The question is then something like this: given a bunch of data (like petalwidth and petallength) and a bunch of labels associated to them (like the type of flower), can we make a decision tree like the ones above? How?

Creating Decision Trees with ID3.

There are a number of methods for making a decision tree, but we're going to focus on one particularly popular algorithm: ID3. Note, we will call the variables that we split on "nodes" (for example, above, "petalwidth" and "petallength" are both nodes) as is standard in the literature.

As usual, we're going to try to motivate why this algorithm is natural to use by attempting to derive it from certain properties that we'd like it to have. Luckily, there's only one thing that we need to figure out: what variable should we start splitting on, and where should we split it?

For example, the above decision tree split on petalwidth first, and used less than or equal to 0.6 and greater than 0.6 to split. Why did it start with petalwidth? Why did it break at 0.6? We need to figure these things out! We'll use the following solution to this question:

At each node, we should use a concept like entropy to analyze what the most informative split would be.

Entropy is great at telling us what kind of information we're getting from things, and we'd like to get the most information possible from each split. For example, if we used petalwidth < 100, it would be fairly silly since no flower has a petalwidth of 100 or more; every flower would just go down the left-hand side of the diagram, which doesn't tell us a whole lot.

As an example of how we're going to use entropy, suppose we have the following (commonly used) test data:

OUTLOOK  | WINDY | PLAY
=====================================================
sunny    | false | Don't Play
sunny    | true  | Don't Play
overcast | false | Play
rain     | false | Play
rain     | false | Play
rain     | true  | Don't Play
overcast | true  | Play
sunny    | false | Don't Play
sunny    | false | Play
rain     | false | Play
sunny    | true  | Play
overcast | true  | Play
overcast | false | Play
rain     | true  | Don't Play

This is similar to our baseball data. For Outlook we have three values: sunny, overcast, rain. For Windy we have two values: true or false. We'd like to be able to predict (given a new day) if we play or don't play, based on this data.
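Before we start computing things, it may help to have the table in a form we can play with. Here's one possible encoding in Python (the `data` name and the use of booleans for WINDY are my own choices); counting the labels confirms the 9 Play and 5 Don't Play rows we'll use in a moment:

```python
from collections import Counter

# (outlook, windy, play) -- one tuple per row of the table above
data = [
    ("sunny",    False, "Don't Play"),
    ("sunny",    True,  "Don't Play"),
    ("overcast", False, "Play"),
    ("rain",     False, "Play"),
    ("rain",     False, "Play"),
    ("rain",     True,  "Don't Play"),
    ("overcast", True,  "Play"),
    ("sunny",    False, "Don't Play"),
    ("sunny",    False, "Play"),
    ("rain",     False, "Play"),
    ("sunny",    True,  "Play"),
    ("overcast", True,  "Play"),
    ("overcast", False, "Play"),
    ("rain",     True,  "Don't Play"),
]

print(Counter(row[2] for row in data))
# Counter({'Play': 9, "Don't Play": 5})
```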

The first thing we need is a top node. What should we split on first?

Information Gain.

Recall that KL-divergence is defined as \[\begin{align} D(f\,||\,g) &= \sum_{x\in X}f(x)\log_{2}\left(\frac{f(x)}{g(x)}\right)\\ &= -H(f) - \sum_{x\in X}f(x)\log_{2}(g(x)) \end{align}\] for $f,g$ both probability distributions over the same set $X$. We're going to do something slightly shifty here: we're going to measure a quantity of this flavor for a pair $(S, A)$, where $S$ is going to be our entire training set (so, for example, it's every row in the dataset above) and $A$ is going to be a single attribute. This will make more sense if we give an example here.

Also, it's exhausting to keep writing $D(f\,||\,g)$, so we're going to call this quantity the information gain and write it as $IG(S, A)$. This is a bit confusing, but it seems to be standard notation. Note that $IG$ isn't a genuinely new quantity: it's built out of the same entropies that show up in $D(f\,||\,g)$.

A Long Example.

Example.

Step 1.

Using the data above, we start with the entire dataset as $S$ and look at each of the attributes, looking specifically at the information gain from $S$ (the entire dataset, for now) and $A$ (the attribute). Because this can get a bit confusing (especially with $A$), we're going to go slow for the first one.

\[IG(S,\text{Wind}) = H(S) - \sum_{a}P(A=a)H(S\,|\,A = a)\]

Okay, what is this madness? First, note that we're taking the entropy of $S$ on the left-hand side, and that should seem reasonable. Let's do a quick calculation to see what this entropy is. Remember, we just take the probability of Playing or Not Playing and plug that into the entropy formula: \[H(S) = - \left(\frac{9}{14}\right)\log_{2}\left(\frac{9}{14}\right) - \left(\frac{5}{14}\right)\log_{2}\left(\frac{5}{14}\right) \approx 0.94.\] The right-hand side is a little strange only because we're breaking apart the different classes of $A$: we want to sum over all possible values of $A$, and in this case it could be windy or not windy. Hence, the possible values here are $A = \text{true}$ and $A = \text{false}$.
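As a sanity check on that 0.94, here's a tiny entropy helper (my own, not from any particular library) fed with the raw class counts:

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a distribution given by raw class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# 9 days of "Play", 5 days of "Don't Play"
print(entropy([9, 5]))  # ~0.940
```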

\[\begin{align}IG(S, \text{Wind}) &= H(S)\\ &-P(A = \text{true})H(S\,|\,A = \text{true})\\ &-P(A = \text{false})H(S\,|\,A = \text{false})\end{align}\]

This is just the expanded-out version of what we said above. The two options are right there on the second and third lines. Let's look at them individually: \[P(A = \text{true})H(S\,|\,A = \text{true})= \left(\frac{6}{14}\right)H(S\,|\,A = \text{true})\] Let's stop right here and just ask ourselves: what is $H(S\,|\,A = \text{true})$? This is the entropy of $S$ given that $A = \text{true}$. This is the important part. We can cast out everything except the rows where $A$ is true. To calculate the entropy of this set, it's enough to look at only the data points where $\text{wind}$ is $\text{true}$, and see if we play or not. That is, \[H(S\,|\,A = \text{true}) = -\left(\frac{3}{6}\right)\log_{2}\left(\frac{3}{6}\right) - \left(\frac{3}{6}\right)\log_{2}\left(\frac{3}{6}\right)\] It's slightly unfortunate that the two terms are the same, but this is because there are exactly 3 times when we don't play with wind as true, and exactly 3 times we do play with wind as true. Hence, \[H(S\,|\,A = \text{true}) = 1\] Great. But what about when wind is false? We have the same kind of deal: \[P(A = \text{false})H(S\,|\,A = \text{false}) = \left(\frac{8}{14}\right)H(S\,|\,A = \text{false})\] and \[H(S\,|\,A = \text{false}) = -\left(\frac{6}{8}\right)\log_{2}\left(\frac{6}{8}\right) - \left(\frac{2}{8}\right)\log_{2}\left(\frac{2}{8}\right)\] which is due to us playing all but two of the times wind is false. This gives us \[H(S\,|\,A = \text{false}) \approx 0.811.\] Let's go back up now to our original equality and plug everything in: \[\begin{align}IG(S, \text{Wind}) &= H(S)\\ &-P(A = \text{true})H(S\,|\,A = \text{true})\\&-P(A = \text{false})H(S\,|\,A = \text{false})\\ &\\ &= H(S) - \left(\left(\frac{6}{14}\right)(1) + \left(\frac{8}{14}\right)(0.811)\right)\\ &= H(S) - 0.892\\ &= 0.94 - 0.892\\ &= 0.048\\ \end{align}\] Whew! Exhausting. This is the information gain that we get for the wind.
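If you'd like to double-check this without redoing the logs by hand, the same kind of entropy helper reproduces the number; the counts below (3 Play / 3 Don't Play when windy, 6 Play / 2 Don't Play when not) are read straight off the table:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

h_s = entropy([9, 5])        # entropy of the whole dataset, ~0.940
h_true = entropy([3, 3])     # windy days: 3 Play, 3 Don't Play -> 1.0
h_false = entropy([6, 2])    # calm days: 6 Play, 2 Don't Play -> ~0.811

ig_wind = h_s - (6 / 14) * h_true - (8 / 14) * h_false
print(round(ig_wind, 3))     # ~0.048
```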

This was just the wind, though. We also need to do the outlook. Exhausting. What is this going to look like? We're going to use a slightly condensed notation here just to save space.

\[\begin{align}IG(S, \text{Outlook}) &= H(S)\\ &-P(\text{sunny})H(S\,|\,\text{sunny})\\ &-P(\text{overcast})H(S\,|\,\text{overcast})\\&-P(\text{rain})H(S\,|\,\text{rain}) \end{align}\] As before, let's do each of these terms. \[\begin{align}P(\text{sunny})&H(S\,|\,\text{sunny}) = \left(\frac{5}{14}\right)H(S\,|\,\text{sunny})\\ &= \left(\frac{5}{14}\right)\left(-\left(\frac{3}{5}\right)\log_{2}\left(\frac{3}{5}\right) - \left(\frac{2}{5}\right)\log_{2}\left(\frac{2}{5}\right)\right)\\ &\approx 0.3468 &\\ &\\ P(\text{overcast})&H(S\,|\,\text{overcast}) = \left(\frac{4}{14}\right)H(S\,|\,\text{overcast})\\ &= \left(\frac{4}{14}\right)\left(-1\log_{2}(1) - 0\right)\\ &=0 &\\ &\\ P(\text{rain})&H(S\,|\,\text{rain}) = \left(\frac{5}{14}\right)H(S\,|\,\text{rain})\\ &= \left(\frac{5}{14}\right)\left(-\left(\frac{3}{5}\right)\log_{2}\left(\frac{3}{5}\right) - \left(\frac{2}{5}\right)\log_{2}\left(\frac{2}{5}\right)\right)\\ &\approx 0.3468 \end{align}\] Which means that \[IG(S, \text{Outlook}) = 0.94 - (0.3468 + 0 + 0.3468) \approx 0.2467.\] To sum all of this up, we have: \[\begin{align}IG(S, \text{Outlook}) &\approx 0.2467\\ IG(S, \text{Wind}) & \approx 0.048 \end{align}\] If we want to split using the attribute which maximizes the information that we'd get by splitting on it, then we want to maximize $IG$, which means we ought to split on $\text{Outlook}$. Pretty reasonable.
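The whole comparison also fits in a few lines of code. Below is a self-contained sketch (function and variable names are my own) that computes the information gain of each attribute over the table and confirms that Outlook is the better first split:

```python
from collections import Counter
from math import log2

# (outlook, windy, play) -- the table from above
data = [
    ("sunny", False, "Don't Play"), ("sunny", True, "Don't Play"),
    ("overcast", False, "Play"),    ("rain", False, "Play"),
    ("rain", False, "Play"),        ("rain", True, "Don't Play"),
    ("overcast", True, "Play"),     ("sunny", False, "Don't Play"),
    ("sunny", False, "Play"),       ("rain", False, "Play"),
    ("sunny", True, "Play"),        ("overcast", True, "Play"),
    ("overcast", False, "Play"),    ("rain", True, "Don't Play"),
]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attr):
    """IG(S, A): entropy of the labels minus the weighted entropy of each split."""
    labels = [r[-1] for r in rows]
    ig = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [r[-1] for r in rows if r[attr] == value]
        ig -= (len(subset) / len(rows)) * entropy(subset)
    return ig

print(round(information_gain(data, 0), 3))  # Outlook: ~0.247 (the 0.2467 above)
print(round(information_gain(data, 1), 3))  # Wind:    ~0.048
```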

Step 2.

Okay. We've split once. In ASCII, our diagram looks like this:

outlook == Sunny
|	???
outlook == Overcast
|	???
outlook == Rain
|	???

Note that the only other attribute is Wind (though, in general, we'd have many more attributes), so we will be splitting on that. If we had more than one attribute, we'd go down to each branch and do exactly the same thing as step 1, but restricting our set $S$ to the branch we're on. This step is important. Let's just note what this means in a bit more detail:

Suppose you have a bunch more attributes after splitting on outlook. Then we'd start at the "outlook == Sunny" branch and restrict our dataset so that the only things we're looking at are those data-points where outlook is Sunny. In other words, $S$ is now every data-point where outlook is sunny. Once you find where to split by repeating the process in step 1, you go down to "outlook == Overcast" and repeat.
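In code, "restricting $S$" is nothing more than filtering rows. Here's a tiny sketch of the idea, with a hypothetical `restrict` helper that isn't from any library:

```python
def restrict(rows, attr, value):
    """Keep only the rows whose attribute number `attr` takes the given value."""
    return [row for row in rows if row[attr] == value]

# e.g. with (outlook, windy, play) tuples, the "outlook == Sunny" branch is:
rows = [("sunny", False, "Don't Play"), ("overcast", False, "Play"), ("sunny", True, "Play")]
print(restrict(rows, 0, "sunny"))
# [('sunny', False, "Don't Play"), ('sunny', True, 'Play')]
```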

There's some art involved in deciding when to stop, but in this post we'll just stop when we're out of attributes. We'll talk a bit later about the proper way to build a decision tree so as to not over-fit or make it too complex.

Back to the example, we now split on Wind. It looks like this:

outlook == Sunny
|	wind == True:
|	wind == False:
|
outlook == Overcast
|	wind == True:
|	wind == False:
|
outlook == Rain
|	wind == True:
|	wind == False:

But note we've left off which thing to pick when the wind is true or false on each of these branches. At this point, we'll just count and use a majority; if we get a tie then we'll just pick one at random.

For example, for all of the "outlook == sunny" data-points where wind is true we get one Play and one Don't Play; we'll just choose Play here (at random). If "outlook == sunny" and wind is false, we get two Don't Plays and one Play, so the majority says Don't Play. For overcast, if wind is true we get 2 Plays and 0 Don't Plays, so we pick Play. We continue like this until we've filled the entire thing out:

outlook == Sunny
|	wind == True: Play
|	wind == False: Don't Play
|
outlook == Overcast
|	wind == True: Play
|	wind == False: Play
|
outlook == Rain
|	wind == True: Don't Play
|	wind == False: Play
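If you'd rather not tally the branches by hand, a few lines over the same encoding of the table reproduce the counts at each leaf (this is just a verification sketch, not part of ID3 itself):

```python
from collections import Counter, defaultdict

# (outlook, windy, play) -- the same encoding of the table as before
data = [
    ("sunny", False, "Don't Play"), ("sunny", True, "Don't Play"),
    ("overcast", False, "Play"),    ("rain", False, "Play"),
    ("rain", False, "Play"),        ("rain", True, "Don't Play"),
    ("overcast", True, "Play"),     ("sunny", False, "Don't Play"),
    ("sunny", False, "Play"),       ("rain", False, "Play"),
    ("sunny", True, "Play"),        ("overcast", True, "Play"),
    ("overcast", False, "Play"),    ("rain", True, "Don't Play"),
]

# Tally the labels under each (outlook, windy) branch.
leaves = defaultdict(Counter)
for outlook, windy, play in data:
    leaves[(outlook, windy)][play] += 1

for branch, tally in sorted(leaves.items()):
    print(branch, dict(tally))
# ('sunny', True)  -> {"Don't Play": 1, 'Play': 1}   (a tie)
# ('sunny', False) -> {"Don't Play": 2, 'Play': 1}   (majority Don't Play)
# overcast and ('rain', False) are all Play; ('rain', True) is all Don't Play
```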

Note, at this point, that our tree is pretty nice, but it's a little redundant, no? No matter which option we take for the wind, if the outlook is overcast then we'll play. We may want to prune the tree (that is, take off some leaves and branches) and make it a bit simpler. For example:

outlook == Sunny
|	wind == True: Play
|	wind == False: Don't Play
|
outlook == Overcast: Play
|
outlook == Rain
|	wind == True: Don't Play
|	wind == False: Play

Much nicer!

What Just Happened?

In the example above, we used the ID3 algorithm to figure out which attributes to branch on so that we could make a nice decision tree. In brief, the algorithm goes like this (a rough code sketch follows the list):

Algorithm: ID3

  1. Start with $S$ as your entire dataset.
  2. For each attribute $A$, calculate $IG(S,A)$.
  3. Pick the attribute $A$ with the greatest information gain, split on that (i.e., make that the root node and for each possible value that it attains make branches).
  4. For each branch, restrict $S$ to only those elements that satisfy the branch's criteria (i.e., they have that particular value for that particular attribute which is represented by the branch).
  5. Repeat steps 2 through 4 on each branch until either you run out of attributes or you are content with the tree.
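To make the steps concrete, here's a rough, self-contained Python sketch of ID3 as described above. The function names, the dictionary representation of the tree, and the tie-breaking (first label seen rather than a random pick) are my own choices, not part of the algorithm's specification:

```python
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(rows, attr):
    labels = [r[-1] for r in rows]
    ig = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [r[-1] for r in rows if r[attr] == value]
        ig -= (len(subset) / len(rows)) * entropy(subset)
    return ig

def id3(rows, attrs):
    """Return a nested dict of splits, or a label string at the leaves."""
    labels = [r[-1] for r in rows]
    # Stop if the node is pure or we've run out of attributes; take the
    # majority label (ties go to whichever label was seen first, not random).
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Steps 2-3: pick the attribute with the greatest information gain.
    best = max(attrs, key=lambda a: information_gain(rows, a))
    tree = {}
    # Steps 4-5: restrict the rows to each branch and recurse on what's left.
    for value in set(r[best] for r in rows):
        branch_rows = [r for r in rows if r[best] == value]
        tree[(best, value)] = id3(branch_rows, [a for a in attrs if a != best])
    return tree

# (outlook, windy, play) -- the same encoding of the table as before
data = [
    ("sunny", False, "Don't Play"), ("sunny", True, "Don't Play"),
    ("overcast", False, "Play"),    ("rain", False, "Play"),
    ("rain", False, "Play"),        ("rain", True, "Don't Play"),
    ("overcast", True, "Play"),     ("sunny", False, "Don't Play"),
    ("sunny", False, "Play"),       ("rain", False, "Play"),
    ("sunny", True, "Play"),        ("overcast", True, "Play"),
    ("overcast", False, "Play"),    ("rain", True, "Don't Play"),
]

print(id3(data, attrs=[0, 1]))
# splits on outlook (attribute 0) at the top, then on wind where it still helps
```

Running it on our table splits on Outlook at the top, exactly as we found by hand.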

What's next?

The next post is a similar example to the one we have here, but with another dataset. It's meant for readers who want one more quick example of how ID3 works. Feel free to skip it if that isn't your cup of tea.