Introductory Machine Learning: Trees, Bagging, and Boosting, Part 3a: Another Decision Tree Example.



Introduction.

In the last post we talked about constructing a decision tree (by hand!) using KL-divergence. We gave a fairly long example there, so I'd like to take some time to do one more quick one. We'll build the first two levels of a decision tree, check that they match up with the "real" answer (given by Python's sklearn library), and end with some code that makes it easy to build decision trees in Python. Since this post is optional, we'll also go over the code for constructing decision trees again in later posts.

Another Example.

For this, we're going to use a fairly nice dataset (one of the more structured datasets I've seen), the lenses dataset. The information is here, but I've made a csv with headers here. Download and check it out.

The numbers look scary, but this is actually a categorical dataset: each number stands for a category (see the page linked above for the details), but we won't need the category names to build our decision tree.

Note that the last column (the one marked "class_fitted") is our classification column.
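
If you'd like to poke at the data in Python as you read, here's a minimal sketch, assuming you've saved the csv locally as fitted_contact_lenses.csv (the same file the script at the end of this post uses):

import pandas as pd

# Load the csv and take a peek; each column is a categorical attribute
# coded as an integer, and class_fitted is the classification column.
df = pd.read_csv("fitted_contact_lenses.csv")
print(df.head())

# How many data-points fall into each class? (Should be 4, 5, and 15.)
print(df.class_fitted.value_counts())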

Let's jump right in.

Let's get the Entropy of the whole thing first.

We just calculate entropy as usual using the last column and the entire dataset.

\[\begin{align}H(S) &= -\left(\frac{4}{24}\right)\log_{2}\left(\frac{4}{24}\right)\\ &-\left(\frac{5}{24}\right)\log_{2}\left(\frac{5}{24}\right)\\ &-\left(\frac{15}{24}\right)\log_{2}\left(\frac{15}{24}\right)\\ &\\ &\approx 1.326 \end{align}\]
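
If you want to check this on a computer, here's a quick sketch of the same calculation, reusing the dataframe df from the snippet above:

import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a column of class labels."""
    probs = labels.value_counts(normalize=True)
    return -np.sum(probs * np.log2(probs))

# Entropy of the entire dataset, using the class column.
print(entropy(df.class_fitted))  # roughly 1.326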

Let's get the Information Gains next.

For this first series of information gains, we'd like to get the following values: $IG(S, \text{age})$, $IG(S, \text{spectacle_prescription})$, $IG(S, \text{astigmatic})$, and $IG(S, \text{tear_production_rate})$. We're only going to do age in detail, but we'll give values for the others.

Looking at $IG(S, \text{age})$, we only need to calculate three new things: $H(S\,|\,\text{age} = 1)$, $H(S\,|\,\text{age} = 2)$, and $H(S\,|\,\text{age} = 3)$.

To calculate these, recall that we just restrict $S$ to the data-points which have $\text{age} = 1$, look at the class values, and compute entropy like we normally would. \[\begin{align} H(S\,|\,\text{age} = 1) &= -\left(\frac{2}{8}\right)\log_{2}\left(\frac{2}{8}\right)\\ &-\left(\frac{2}{8}\right)\log_{2}\left(\frac{2}{8}\right)\\ &-\left(\frac{4}{8}\right)\log_{2}\left(\frac{4}{8}\right)\\ &\\ &= 1.5 \end{align}\] \[\begin{align} H(S\,|\,\text{age} = 2) &= -\left(\frac{1}{8}\right)\log_{2}\left(\frac{1}{8}\right)\\ &-\left(\frac{2}{8}\right)\log_{2}\left(\frac{2}{8}\right)\\ &-\left(\frac{5}{8}\right)\log_{2}\left(\frac{5}{8}\right)\\ &\\ &\approx 1.299 \end{align}\] \[\begin{align} H(S\,|\,\text{age} = 3) &= -\left(\frac{1}{8}\right)\log_{2}\left(\frac{1}{8}\right)\\ &-\left(\frac{1}{8}\right)\log_{2}\left(\frac{1}{8}\right)\\ &-\left(\frac{6}{8}\right)\log_{2}\left(\frac{6}{8}\right)\\ &\\ &\approx 1.061 \end{align}\] We then weight these values and subtract them from $H(S)$: \[\begin{align} IG(S, \text{age}) &= H(S)- \left(\frac{8}{24}\right)(1.5)\\ &- \left(\frac{8}{24}\right)(1.299) - \left(\frac{8}{24}\right)(1.061)\\ &\\ & \approx 0.0393. \end{align}\] This gives us the information gain for the age attribute. Neato. The other three, which we won't do in detail, have the values: \[\begin{align} IG(S, \text{spectacle_prescription}) &= 0.0395\\ IG(S, \text{astigmatic}) &= 0.3769\\ IG(S, \text{tear_production_rate}) &= 0.5487\\ \end{align}\] Note that the information gain from the tear production rate is huge! The reader may have already stumbled onto the reason: when the tear production rate is reduced (that is, equal to 1), the only class value is 3 (which is "don't prescribe contact lenses", and makes sense if you have a reduced tear rate). This lets us classify a large chunk of our dataset all at once!
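
If you'd like to verify these numbers, here's a minimal sketch of the information gain computation, reusing the entropy helper and the dataframe df from the snippets above (info_gain is just a helper name I've made up for this post):

def info_gain(data, attribute, target="class_fitted"):
    """IG(S, attribute): entropy of S minus the weighted entropies of each split."""
    weighted = sum(
        (len(subset) / len(data)) * entropy(subset[target])
        for _, subset in data.groupby(attribute)
    )
    return entropy(data[target]) - weighted

for attr in ["age", "spectacle_prescription", "astigmatic", "tear_production_rate"]:
    print(attr, round(info_gain(df, attr), 4))
# tear_production_rate should come out on top, at roughly 0.549.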

The first attribute we will split on is tear_production_rate.

We have the following split so far:

tear_production_rate == 1: Class 3.
tear_production_rate == 2
|   ????

We can now proceed down the tree.

The Second Branch.

We go down the branch where $\text{tear_production_rate} = 2$. This means that we cut out every data-point where the tear production rate is not equal to 2. For notation's sake, let's call this new dataset $S_{2}$. You should have 12 elements in $S_{2}$.
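
In pandas, this restriction is a one-liner (again, a sketch reusing the dataframe df from before):

# Keep only the data-points where the tear production rate is 2.
S2 = df[df.tear_production_rate == 2]
print(len(S2))  # should be 12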

We've got to do the same thing as above now (this is why we usually use computers for this) except we replace $S$ with $S_{2}$. This time we have three attributes left to calculate the gain for: age, spectacle_prescription, astigmatic.

First, we note that \[\begin{align}H(S_{2}) &= -\left(\frac{4}{12}\right)\log_{2}\left(\frac{4}{12}\right)\\ &-\left(\frac{5}{12}\right)\log_{2}\left(\frac{5}{12}\right)\\ &-\left(\frac{3}{12}\right)\log_{2}\left(\frac{3}{12}\right)\\ &\\ &\approx 1.555 \end{align}\]

Looking at $IG(S_{2}, \text{age})$, we only need to calculate three new things: $H(S_{2}\,|\,\text{age} = 1)$, $H(S_{2}\,|\,\text{age} = 2)$, and $H(S_{2}\,|\,\text{age} = 3)$.

As before, we'll calculate age, but give the values for the rest.

\[\begin{align} H(S_{2}\,|\,\text{age} = 1) &= -\left(\frac{2}{4}\right)\log_{2}\left(\frac{2}{4}\right)\\ &-\left(\frac{2}{4}\right)\log_{2}\left(\frac{2}{4}\right) - 0\\ &\\ &= 1 \end{align}\] \[\begin{align} H(S_{2}\,|\,\text{age} = 2) &= -\left(\frac{1}{4}\right)\log_{2}\left(\frac{1}{4}\right)\\ &-\left(\frac{2}{4}\right)\log_{2}\left(\frac{2}{4}\right)\\ &-\left(\frac{1}{4}\right)\log_{2}\left(\frac{1}{4}\right)\\ &\\ &= 1.5 \end{align}\] \[\begin{align} H(S_{2}\,|\,\text{age} = 3) &= -\left(\frac{1}{4}\right)\log_{2}\left(\frac{1}{4}\right)\\ &-\left(\frac{1}{4}\right)\log_{2}\left(\frac{1}{4}\right)\\ &-\left(\frac{2}{4}\right)\log_{2}\left(\frac{2}{4}\right)\\ &\\ &= 1.5 \end{align}\] Weighting these and subtracting from $H(S_{2})$ gives us \[\begin{align} IG(S_{2}, \text{age}) &= 1.555 - \left(\frac{4}{12}\right)(1)\\ &- \left(\frac{4}{12}\right)(1.5) - \left(\frac{4}{12}\right)(1.5)\\ &\\ &\approx 0.222 \end{align}\] The other attributes have the following information gains: \[\begin{align} IG(S_{2}, \text{spectacle_prescription}) &\approx 0.095\\ IG(S_{2}, \text{astigmatic}) &\approx 0.770\\ \end{align}\]
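
If you'd like to check these, the info_gain helper sketched earlier works just as well on the restricted dataframe S2:

# Rank the attributes that are left, now restricted to S_2.
for attr in ["age", "spectacle_prescription", "astigmatic"]:
    print(attr, round(info_gain(S2, attr), 4))
# astigmatic should win by a wide margin.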

We see that astigmatic is the clear winner here, so we split on astigmatic next. Our decision tree looks like the following:

tear_production_rate == 1: Class 3.
tear_production_rate == 2
|   astigmatic == 1
|   |   ????
|   astigmatic == 2
|   |   ????

But now, in each of those two lowest branches, we have two attributes left: age and spectacle_prescription.

What's the whole thing look like?

Let's stop; the remaining branches are created in the same way as what we've done here. Just so that you know I'm not fibbing, here's the printout from the Python library scikit-learn which makes a nice decision tree (using a similar algorithm) for us:

Here, the labels look strange: scikit-learn refers to columns by their position in the dataframe, so the first column is $X[0]$, the second is $X[1]$, and so on. A quick check will tell you that the top of this tree matches up almost exactly with our tree. There's some strangeness with how the branches are defined (with less-than and greater-than signs on these integer categories), but these are easy enough to interpret. Ultimately, we've made the same tree.

How did I make this in Python?

Because this is not a post about Python, I won't go through all of the code in great detail, but let's put it down and talk about it. This code will make a decision tree similar to the ones we've been talking about.

Even if you don't know Python, look at the figure so that you can follow along in the next few paragraphs.


import pandas as pd
from sklearn import tree

# import the data, drop the unnecessary id number column.
df_raw = pd.read_csv("/location of this file/fitted_contact_lenses.csv")
df_raw.drop('id', axis=1, inplace=True)

# make the dataframe for the features and the one for the classes.
df_attrs = df_raw[["age", "spectacle_prescription", "astigmatic", "tear_production_rate"]]
df_class_fitted = df_raw.class_fitted

# make the decision tree, fit it with our data.
dtree = tree.DecisionTreeClassifier(random_state=0)
dtree.fit(df_attrs, df_class_fitted)

# Optional: export it as a dot file, for visualization.
with open("lens_dtree.dot", "w") as f:
    tree.export_graphviz(dtree, out_file=f)

After running this, the script makes a file in the current directory called "lens_dtree.dot". The dot format is a bit strange, but it's a standard way to describe graphs; we can use graphviz on the command line to turn this dot file into something easier to look at. In a terminal or command prompt, go to the folder that contains the dot file and run

dot -Tpng lens_dtree.dot -o lens_dtree.png

which outputs a png. It should have the same content as the one we posted above.
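
As an aside, if you'd rather not install graphviz, newer versions of scikit-learn (0.21 and later) can draw the tree directly with matplotlib. Here's a rough sketch, reusing dtree and df_attrs from the script above:

import matplotlib.pyplot as plt
from sklearn import tree

# Draw the fitted tree directly; no dot file or graphviz needed.
plt.figure(figsize=(12, 8))
tree.plot_tree(dtree, feature_names=list(df_attrs.columns), filled=True)
plt.savefig("lens_dtree.png")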

There's a lot of stuff on this tree. What is all this?

At first glance, the tree is a bit intimidating. The top line of each node is the easiest: it's just the attribute (given as something that looks like $X[i]$) that we're splitting on. Remember that Python begins counting at 0, so $X[3]$ is the fourth column (tear_production_rate, in our dataframe). Similarly, the "samples = ..." label just tells us how many data-points are represented at this particular node. The two that may not be so obvious are "value" and "gini".

"Value" is a bit easier to describe. In this case, it is the number of elements in that leaf by category. For example, in the upper-left branch there is a label "value = [0,0,12]". This tells us that all of the data-points that fell into this leaf were the third category (hence the 12 in the third spot) and there are no other data-points from other categories (hence the 0s in the first and second spot).

The most mysterious of these labels is "gini". This number, called the Gini impurity (not the Gini coefficient), measures how often a randomly chosen element of the set would be labeled incorrectly if it were labeled at random according to the distribution of labels in the set. That is, we pick an element uniformly at random, but when we guess the category for that element the guess isn't uniform: it follows the distribution of categories in the set. For example, if our set had 99 dogs and 1 cat, then we choose an animal at random but we will guess "dog" much more frequently (99 times out of 100, in fact).

Before we formally define this, let's look at a quick example.

Example: Gini Impurity Example

Let's say we have 10 fruits in our set: 2 apples, 3 oranges, 5 bananas.

The probability of picking an apple is $\frac{2}{10}$, and the probability of then (incorrectly) guessing some other fruit, using the distribution of labels above, is $\frac{8}{10}$. Because these events are independent, the probability that we pick an apple and misclassify it is $\frac{2}{10}\cdot\frac{8}{10}$.

For each fruit, the probability will be \[P(\text{that_fruit})\cdot (1 - P(\text{that_fruit})).\] Hence, the Gini impurity is simply the sum of these over the kinds of fruit: \[I_{G}(f) = \sum_{\text{fruit}\in \text{fruits}}P(\text{fruit})\cdot (1 - P(\text{fruit}))\] In this case, the reader should check that \[I_{G}(f) = 0.62.\]
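
If you want to check the arithmetic, here's a tiny sketch in plain Python:

def gini(counts):
    """Gini impurity from a list of category counts."""
    total = sum(counts)
    return sum((c / total) * (1 - c / total) for c in counts)

print(gini([2, 3, 5]))   # the fruit example: 0.62
print(gini([4, 5, 15]))  # the class counts for our whole lenses dataset: about 0.538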

We can now formally define the Gini Impurity:

Definition: Gini Impurity

Given a set $S$ with categories $\{x_{1}, x_{2}, \dots, x_{n}\}$ with distribution $f$, the Gini Impurity of $S$ is given by \[I_{G}(f) = \sum_{i=1}^{n}P(x_{i})(1-P(x_{i})).\] Note that by doing some mild manipulations (expanding the product and using the fact that the $P(x_{i})$ sum to 1), we can get the equivalent, and slightly more useful, formula: \[I_{G}(f) = 1 - \sum_{i=1}^{n}P(x_{i})^{2}.\]

Prove to yourself that if there is only one category in a set, the Gini impurity is exactly 0; this explains why the Gini impurity on any of the leaf nodes of our decision tree is zero.

The top node of our decision tree has 24 elements with three classifications: we have four 1's, five 2's, and fifteen 3's. This means that the gini impurity is given by \[\begin{align}I_{G}(S) &= 1 - \left(\left(\frac{4}{24}\right)^{2} + \left(\frac{5}{24}\right)^{2} + \left(\frac{15}{24}\right)^{2}\right)\\ &\\ &\approx 0.5382 \end{align}\] which agrees with what our decision tree says.

Let's do one more, just for kicks. Look at the node on the second-to-bottom row, all the way to the right. It notes $X[0] \leq 1.5$ and tells us we have 3 samples. These three samples (given by the leaves below it) are one 1 and two 3's. If we call this set $S_{1}$ just for the sake of notation, we can compute its Gini impurity as well: \[I_{G}(S_{1}) = 1 - \left(\frac{1}{9} + \frac{4}{9}\right) \approx 0.444,\] which agrees with what our graph says.

It's instructive to look at the Gini impurity on these plots. Ideally, we'd like it to always go down as we move down the tree, but this isn't always possible (the very first split of our decision tree has the Gini impurity going up on one of its branches).

At this point we're not going to worry too much about the Gini score. One potential use: the lower the Gini score of a node, the more homogeneous that node is, and the better a candidate it may be for becoming a leaf during a pruning process. We can safely ignore it for now, but keep in mind what it means. For example, you should be able to look at the third row of nodes from the top in our picture and say which node has a less diverse set of elements, label-wise.

Why is the Python Plot using Gini and not Information Gain?

At this point, I should make a confession: this decision tree wasn't made in the same way as the ones we made above. Instead of splitting using the information gain, it was split using the Gini impurity. Does this change the decision tree? Maybe; it's good to know that different users use different measures in different situations. There is a way to use information gain instead, by passing the criterion="entropy" option in our code like this:

# make the decision tree, fit it with our data.
dtree = tree.DecisionTreeClassifier(random_state=0, criterion="entropy")
dtree.fit(df_attrs, df_class_fitted)

Using the dot process again (as above) we generate a png that looks like this:

The documentation (linked here) lists some of the options one can use with these decision tree models.
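
One more optional trick: newer versions of scikit-learn (0.21 and later) can also dump a fitted tree as indented text, which is handy for comparing against the tree we built by hand. A rough sketch, reusing dtree and df_attrs from the script above:

from sklearn import tree

# Print the fitted tree as plain, indented text instead of a picture.
print(tree.export_text(dtree, feature_names=list(df_attrs.columns)))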

Next Time.

In the next post, we're going to continue our adventure with information theory and talk about bagging and boosting. These methods will help us improve our accuracy on certain datasets and will introduce the reader to ways of "averaging" different decision trees and other models together; this is commonly called an ensemble, and it's an important concept to know about!