Introductory Machine Learning: Trees, Bagging, and Boosting, Part 4: Bagging and Boosting.



Introduction.

Decision trees are fairly good at classifying things, but we can do a bit to improve their accuracy and stability. One disadvantage of a decision tree is that small changes in the data may drastically change the tree (if, for example, the change flips the split at a node near the top, everything below that node changes). Moreover, and perhaps as a corollary to this, if one has a large amount of data and builds a decision tree from a sample of it, one may find drastically different trees depending on the sample used.

Two popular methods are available to us: bagging and boosting. The gist is that in bagging we will be making a number of simple trees and "averaging" them together in a nice way. Boosting is similar, but we will be iterating on each tree that we make and focus on misclassifications to make our model more accurate. We'll go over these in detail below.

Bagging.

Bagging, or Bootstrap aggregation (the -ing is just because, you know, it's an act), is done in the following way.

Suppose we have some classifier $C(S, x)$ which takes in some training set $S$, builds a model, then spits out the classification of a new point $x$. For simplicity, for this post we can think of $C$ as a decision tree classifier. If we fed in training data $S$ that stood for symptoms of particular illnesses, then if $x$ was a new point that stood for "shortness of breath, coughing, wheezing, chest tightness," then $C(S,x)$ might be equal to "asthma".

But sometimes $S$ is big and noisy. One way in statistics that one deals with such a thing is to take samples and then look at a sample distribution. We do a similar thing here. Let's say that $S_{1}, S_{2}, \dots, S_{m}$ are $m$ samples of size $N$ drawn from $S$ with replacement. This part is important: to make $S_{1}$, for example, we pull an element out of $S$, copy it into $S_{1}$, put it back into $S$, and then repeat this $N$ times (so that $S_{1}$ winds up with $N$ elements). We make $m$ sample sets this way.

At this point, we have a bunch of sample sets we can train on, so let's do that. Suppose we want to classify some new element $x$, as above. Then we can classify $x$ using each of our sample sets. These values are given by: \[C(S_{1}, x), C(S_{2}, x), \dots, C(S_{m}, x)\] These are probably going to be different values. For example, with the $x$ above, we might get something like \[\text{asthma}, \text{asthma}, \text{asthma}, \text{heart attack}, \text{asthma}\] and we'd have to deal with this somehow. The easiest way is to take a majority vote and just return the value which comes up most frequently (for ties, returning one of the tied values at random). Let's call this function $\text{Maj}$, for majority; we avoid calling it the mode, since the mode typically does not break ties at random. Then, \[C_{bag}(x) = \text{Maj}(\{C(S_{i},x)\}_{i=1}^{m}).\]

The math might look intimidating, but this just means: make a bunch of classifiers from samples (with replacement), then take the value that comes up most frequently.
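To make this concrete, here's a minimal sketch of bagging done by hand (the function name and structure are my own; a library version appears in the next section):

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagged_classify(S_X, S_y, x, m=25):
    # S_X, S_y: the training set S; x: a single new point to classify.
    N = len(S_y)
    rng = np.random.default_rng()
    votes = []
    for _ in range(m):
        idx = rng.integers(0, N, size=N)  # N draws with replacement: one bootstrap sample S_i
        tree = DecisionTreeClassifier().fit(S_X[idx], S_y[idx])
        votes.append(tree.predict([x])[0])  # C(S_i, x)
    # Maj: return the most frequent vote (ties broken arbitrarily here, not at random).
    return Counter(votes).most_common(1)[0][0]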

A Comparison: One Tree vs. Bagging.

I've written a minimal program in Python to compare a single tree vs. bagging trees. Feel free to skim this if you don't know Python; I'll explain the results below it.

import random
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation
from sklearn.datasets import load_digits

# Load the digit dataset and make a random train/test split.
i = random.randint(1, 100000)
digits = load_digits(n_class=10)
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.4, random_state=i)

# Single decision tree classifier.
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# Bagging classifier (bagged decision trees by default).
bclf = BaggingClassifier()
bclf = bclf.fit(X_train, y_train)
print(bclf.score(X_test, y_test))

The bagging classifier can be read about here, but running this a few times gives me the following: \[\begin{array}{lc} \text{Decision Tree Accuracy Mean:} & 0.838859527121\\ \text{Decision Tree Accuracy StDev:} & 0.0139637555478 \end{array}\] \[\begin{array}{lc} \text{Bagging Accuracy Mean:} & 0.923838664812\\ \text{Bagging Accuracy StDev:} & 0.0143262891852 \end{array}\] As we see, the accuracy for bagging is significantly higher than the accuracy for the single decision tree: 83.9% vs. 92.4% is a fairly large jump. As always, we should keep the danger of over-fitting in the back of our minds: if this model turned out to be 99% correct, we would want to do a bit of further investigation to see why we were doing so well.
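For what it's worth, the means and standard deviations above come from repeating the run with fresh random splits; a sketch of such a loop (my own wrapper, reusing the imports and the digits data from the script above) might look like:

import statistics

def repeated_scores(make_clf, runs=30):
    # Re-split and re-fit several times, collecting test accuracies.
    scores = []
    for _ in range(runs):
        Xtr, Xte, ytr, yte = train_test_split(
            digits.data, digits.target, test_size=0.4)
        scores.append(make_clf().fit(Xtr, ytr).score(Xte, yte))
    return statistics.mean(scores), statistics.stdev(scores)

print(repeated_scores(DecisionTreeClassifier))
print(repeated_scores(BaggingClassifier))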

Wait, what does the 'Bootstrap' in Bootstrap Aggregation mean?

The bootstrap part is because we're taking bootstrap samples, which is a method of sampling in statistics where we sample with replacement.

Wait, what about Random Forests? Isn't that the same thing as this?

Random forests are similar to bagged decision trees, but not exactly equivalent.

In bagged trees, we built each tree exactly as we would the original decision tree, just using bootstrap samples of the original data. We're using all of the same attributes each time.

For random forests we change one thing: instead of using the standard decision tree algorithm we modify it slightly by choosing a random sample of the features at each node.

For example, if we have features $1,2,3,\dots,64$, which could stand for pieces on a chess board at some point in a game, then our modified tree algorithm would look at the first node, take a random subset of these features (say, $1, 2, 6, 9, 13, 53$), and split based only on these. At each of the next two nodes (the ones that split off the first), we take a fresh random sample of the features and keep splitting.

That's it. The rest is the same. We are just creating trees in a slightly different way than we're used to. If you've ever built a tree by hand, you'll realize the importance of taking subsets of the feature set, especially if the set of features is extremely large. As a rule of thumb, if a dataset has $m$ features, one may sample $\sqrt{m}$ features at each split in the random forest classifier.
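In scikit-learn, this is a one-line change from the bagging example; a minimal sketch, reusing the train/test split from the script above, might look like:

from sklearn.ensemble import RandomForestClassifier

# max_features="sqrt" implements the sqrt(m) rule of thumb above.
rfclf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
rfclf = rfclf.fit(X_train, y_train)
print(rfclf.score(X_test, y_test))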

On the same dataset as above, I get the following accuracy from the classification: \[\begin{array}{lc} \text{Random Forest Accuracy Mean:} & 0.939443671766\\ \text{Random Forest Accuracy StDev:} & 0.00847469791998 \end{array}\] which is a slight improvement over bagging alone. In general, it's considered the case that \[\text{decision tree} \lt \text{bagging} \lt \text{random forests}\] though there is a trade-off in time and memory as we get better and better classifications.

Though random forests are important, we're not going to spend a whole lot of time on them here.

Boosting.

Similar to bagging, we're going to create a bunch of trees and average them together in some nice way. But unlike bagging, the way that we create trees is going to be a bit different each time: after creating one tree, we're going to put more weight on the elements that particular tree misclassified.

The most popular algorithm for boosting a classifier is AdaBoost (Adaptive Boosting). The details are a bit tedious, but the gist is the following:

Algorithm: AdaBoost

For AdaBoost, we require binary classifiers; that is, we require the class of each data point to be either $-1$ or $1$. Several algorithms exist to extend this to multi-class classification.

Given a family of classifiers $C_{m}$ and $N$ training points:

  1. Every element starts with the same weight. Normalize these weights (so each begins at $1/N$).
  2. For $m = 1$ to $M$,
    1. Fit $C_{m}(x)$ to the training data utilizing the current weights.
    2. Compute the weighted error of the classifier, which is given by the sum of the weights of misclassified items divided by the sum of all of the weights. We call the weighted error $e_{m}$.
    3. Compute $\alpha_{m} = \log\left(\frac{1 - e_{m}}{e_{m}}\right)$. We'll use this at the end.
    4. Update each weight: multiply the weight of each misclassified element by $e^{\alpha_{m}}$, so things which were misclassified get a larger weight and things which were classified correctly get, after normalizing, a smaller one. We then normalize the weights.
  3. Output \[C_{ada}(x) = \text{Sign}\left(\sum_{m=1}^{M}\alpha_{m}C_{m}(x)\right).\]

The output might seem strange here, but remember that each classifier returns either $+1$ or $-1$. This basically says to sum up each of the decision trees with a weight attached, and see whether we get a more positive or negative response overall. The weights attached to this model were the $\alpha_{m}$ values above: loosely speaking, $\alpha_{m}$ is large and positive when $C_{m}$ misclassifies few points (weighted error below $\frac{1}{2}$) and negative when it misclassifies many (weighted error above $\frac{1}{2}$), so more accurate classifiers count for more in the final vote.

We're being vague on purpose: there's quite a bit of calculation here that's being done behind the scenes that we can easily get lost in if we're not careful. The interested reader can learn much more by reading this paper by one of the original authors of the AdaBoost algorithm.
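If the weight-juggling feels too abstract, here's a minimal from-scratch sketch of the algorithm above using decision stumps (depth-1 trees) as the classifiers $C_{m}$. The function names are my own; scikit-learn's AdaBoostClassifier is what you'd use in practice.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    # y must contain the labels -1 and +1, per the algorithm above.
    N = len(y)
    w = np.full(N, 1.0 / N)  # step 1: equal, normalized weights
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)  # step 2.1: fit using current weights
        miss = stump.predict(X) != y
        e = w[miss].sum() / w.sum()  # step 2.2: weighted error
        e = np.clip(e, 1e-10, 1 - 1e-10)  # guard against log(0)
        alpha = np.log((1 - e) / e)  # step 2.3
        w = w * np.exp(alpha * miss)  # step 2.4: upweight misclassified points
        w = w / w.sum()  # normalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Step 3: weighted vote of the stumps, then take the sign.
    total = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(total)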

Is AdaBoost the only boosting algorithm?

No way. Currently, one of the most popular boosting methods is called Gradient Boosting. We won't go through the algorithm here, but we note that the idea is fairly similar: we're going to take a sum of classifiers with weights, and we'd like the "best" weights possible.
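We can at least try it out: a minimal sketch with scikit-learn's GradientBoostingClassifier, reusing the train/test split from the earlier script, might look like:

from sklearn.ensemble import GradientBoostingClassifier

gbclf = GradientBoostingClassifier()
gbclf = gbclf.fit(X_train, y_train)
print(gbclf.score(X_test, y_test))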

Does AdaBoost always do better than non-boosted classifiers?

Glad you asked.

No.

For the digits dataset in Sklearn, I've reproduced my results below. Here's a link to the Python notebook I used for this, so that you can plug in your own dataset.

* SKLEARN DIGIT DATASET *

Decision Tree accuracy mean: 0.8406397774687067
Decision Tree accuracy median: 0.8414464534075105
Decision Tree accuracy stdev: 0.014129231244981898

Bagging accuracy mean: 0.9208623087621696
Bagging accuracy median: 0.9235048678720446
Bagging accuracy stdev: 0.012656467315716274

Random Forest accuracy mean: 0.9376077885952712
Random Forest accuracy median: 0.9381084840055632
Random Forest accuracy stdev: 0.009350659629664393

AdaBoost accuracy mean: 0.8442837273991654
AdaBoost accuracy median: 0.844923504867872
AdaBoost accuracy stdev: 0.024737784096652803

Gradient Boosting accuracy mean: 0.9573574408901251
Gradient Boosting accuracy median: 0.9568845618915159
Gradient Boosting accuracy stdev: 0.006924051898632598

Interestingly, the AdaBoost algorithm did relatively poorly on the datasets I was working with (digit classification and the iris dataset) compared to the more basic algorithms we've just seen; this is why it's important to try different methods and see how each does. Note that despite the poor performance of AdaBoost, the Gradient Boosting algorithm did extremely well, though it did take quite a bit longer to compute even on this relatively small dataset.

Of course, the digit dataset is not representative of every machine learning problem. It is a good exercise to go through a number of different datasets and ask yourself, "Why is this classifier doing so poorly/well on this problem?"

What now?

There are a ton of boosting algorithms out there that are nice to look at. We may investigate some in a future post here. Here are a few things that you should try at this point:

  1. Use your favorite programming language to recreate a Decision Tree algorithm.
  2. Find a bagging and boosting library for your favorite programming language (for Python, this will most likely be scikit-learn) and compare the different methods for different datasets.
  3. Find a dataset for which decision trees fail miserably: what caused this to happen? If you understand what kinds of data do not work well with decision trees, you may learn something profound about decision trees. You may become a Decision Tree Zen Master. You never know.