Title: Boosting and Additive Trees (Part 1)
1. Boosting and Additive Trees (Part 1)
- Ch. 10
- Presented by Tal Blum
2. Overview
- Ensemble methods and motivations
- Description of the AdaBoost.M1 algorithm
- Showing that AdaBoost minimizes the exponential loss
- Other loss functions for classification and regression
3. Ensemble Learning and Additive Models
- Intuition: combining the predictions of an ensemble is more accurate than a single classifier.
- Justification (several reasons):
  - It is easy to find fairly accurate rules of thumb, but hard to find a single highly accurate prediction rule.
  - If the training examples are few and the hypothesis space is large, there are several equally accurate classifiers (model uncertainty).
  - The hypothesis space may not contain the true function, but a linear combination of hypotheses might.
  - Exhaustive global search in the hypothesis space is expensive, so we can combine the predictions of several locally accurate classifiers.
- Examples: Bagging, HME, Splines
4. Boosting (explanation)
5. Example
- Learning curve for the simulated example: $Y = 1$ if $\sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5)$, and $Y = -1$ otherwise (a data-generation sketch follows).
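A small sketch of how this simulated dataset can be generated, assuming ten independent standard Gaussian features as in ESL Ch. 10; the function name is illustrative, not from the slides:

```python
# Sketch: the chi-square threshold example used for the learning curve.
# Label is +1 when the squared norm of the ten Gaussian features exceeds
# the median of a chi-square distribution with 10 degrees of freedom, else -1.
import numpy as np
from scipy.stats import chi2

def make_chi2_example(n_samples=2000, n_features=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    threshold = chi2.ppf(0.5, df=n_features)        # chi^2_10(0.5) ~= 9.34
    y = np.where((X ** 2).sum(axis=1) > threshold, 1, -1)
    return X, y
```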
6. AdaBoost.M1 Algorithm
- $W(x)$ is the distribution of weights over the $N$ training points, with $\sum_i W(x_i) = 1$.
- Initially assign uniform weights $W_1(x_i) = 1/N$ for all $i$.
- At each iteration $k$:
  - Find the best weak classifier $C_k(x)$ using the weights $W_k(x)$.
  - Compute the weighted error rate $e_k = \sum_i W_k(x_i)\, I(y_i \ne C_k(x_i)) \,/\, \sum_i W_k(x_i)$.
  - Set $a_k = \log\big((1 - e_k)/e_k\big)$, the weight of classifier $C_k$ in the final hypothesis.
  - For each $x_i$, update $W_{k+1}(x_i) = W_k(x_i)\, \exp\big(a_k\, I(y_i \ne C_k(x_i))\big)$.
- $C_{\mathrm{FINAL}}(x) = \operatorname{sign}\big(\sum_k a_k\, C_k(x)\big)$ (a code sketch follows below).
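A minimal Python sketch of the algorithm above, assuming scikit-learn decision stumps as the weak learner and labels coded as +1/-1; the function names are illustrative, not from the slides:

```python
# Sketch of AdaBoost.M1 with decision stumps; labels y must be +1 / -1.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, n_rounds=50):
    """Return (classifiers, alphas); the final rule is sign(sum_k a_k C_k(x))."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # W_1(x_i) = 1/N
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # best weak classifier under W_k
        miss = stump.predict(X) != y             # I(y_i != C_k(x_i))
        err = np.sum(w * miss) / np.sum(w)       # weighted error rate e_k
        if err <= 0 or err >= 0.5:               # stop: perfect fit or no better than chance
            break
        alpha = np.log((1 - err) / err)          # a_k = log((1 - e_k) / e_k)
        w = w * np.exp(alpha * miss)             # up-weight misclassified points
        w = w / w.sum()                          # renormalize so the weights sum to 1
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """C_FINAL(x) = sign(sum_k a_k C_k(x))."""
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.sign(scores)
```

For example, running adaboost_m1 on the simulated chi-square data sketched above would trace a learning curve of the kind shown on slide 5.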
7. Boosting as an Additive Model
- The final prediction in boosting, $f(x)$, can be expressed as an additive expansion of individual classifiers.
- The process is iterative and can be expressed as follows.
- Typically we would try to minimize a loss function on the training examples, as written out below.
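The expansion and training criterion these bullets refer to, written in the usual ESL Ch. 10 notation (a reconstruction; the slide's own formulas were not transcribed):

$$ f(x) = \sum_{m=1}^{M} \beta_m\, b(x; \gamma_m), \qquad \min_{\{\beta_m, \gamma_m\}_{1}^{M}} \; \sum_{i=1}^{N} L\!\left(y_i,\; \sum_{m=1}^{M} \beta_m\, b(x_i; \gamma_m)\right), $$

where $b(x; \gamma)$ is a basis function (here a weak classifier) parameterized by $\gamma$.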
8. Forward Stagewise Additive Modeling: Algorithm
- Initialize $f_0(x) = 0$.
- For $m = 1$ to $M$:
  - Compute $(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big)$.
  - Set $f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m)$.
9. Forward Stagewise Additive Modeling
- Sequentially adds new basis functions without adjusting the parameters of the previously chosen functions.
- Simple case: squared-error loss.
- Forward stagewise modeling then amounts to fitting the residuals from the previous iteration (see the identity below).
- Squared-error loss is not robust for classification.
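Why squared-error loss reduces to residual fitting (a reconstruction of the step the bullet summarizes):

$$ L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big) = \big(y_i - f_{m-1}(x_i) - \beta\, b(x_i; \gamma)\big)^2 = \big(r_{im} - \beta\, b(x_i; \gamma)\big)^2, $$

where $r_{im} = y_i - f_{m-1}(x_i)$ is the residual of the current model on the $i$-th training example.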
10. Exponential Loss and AdaBoost
- AdaBoost for classification
- $L(y, f(x)) = \exp(-y\, f(x))$: the exponential loss function
11. Exponential Loss and AdaBoost
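The derivation this slide sketches, reconstructed from the standard treatment (ESL Section 10.4): forward stagewise fitting with exponential loss solves

$$ (\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\!\big(-y_i\,[f_{m-1}(x_i) + \beta\, G(x_i)]\big) = \arg\min_{\beta, G} \sum_{i=1}^{N} w_i^{(m)} \exp\!\big(-\beta\, y_i\, G(x_i)\big), $$

with weights $w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$ that depend on neither $\beta$ nor $G$, only on the fit so far.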
12. Finding the best $\beta$
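The solution for $\beta$ (again a reconstruction of the standard result): for any $\beta > 0$, the best $G_m$ is the classifier minimizing the weighted error $\sum_i w_i^{(m)} I(y_i \ne G(x_i))$. Plugging it in and setting the derivative with respect to $\beta$ to zero gives

$$ \beta_m = \frac{1}{2} \log \frac{1 - \operatorname{err}_m}{\operatorname{err}_m}, \qquad \operatorname{err}_m = \frac{\sum_{i} w_i^{(m)} I\big(y_i \ne G_m(x_i)\big)}{\sum_{i} w_i^{(m)}}, $$

so the stagewise update matches the AdaBoost.M1 classifier weight $a_k$ up to a factor of $\tfrac{1}{2}$.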
14. Historical Notes
- AdaBoost was first presented in ML theory as a way to boost a weak classifier.
- At first people thought it defied the no-free-lunch theorem and did not overfit.
- The connection between AdaBoost and stagewise additive modeling was only discovered recently.
15. Why Exponential Loss?
- Mainly computational:
  - Derivatives are easy to compute.
  - The optimal classifier at each step minimizes a weighted sample error.
  - Under mild assumptions the instance weights decrease exponentially fast.
- Statistical:
  - Exponential loss is not necessary for the success of boosting ("On Boosting and the Exponential Loss", Wyner).
  - We will see this in the next slides.
16. Why Exponential Loss?
- Population minimizer (Friedman 2000), shown below.
- This justifies using its sign as a classification rule.
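The population minimizer the bullet refers to (reconstructed; the formula itself was not transcribed from the slide):

$$ f^{*}(x) = \arg\min_{f(x)} \mathrm{E}_{Y \mid x}\big[e^{-Y f(x)}\big] = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)}, $$

so $f^{*}(x) > 0$ exactly when $\Pr(Y = 1 \mid x) > 1/2$.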
17. Why Exponential Loss?
- For the exponential loss, interpreting $f$ as one half of a logit transform of the class probability, the population minimizers of the exponential loss and of the binomial deviance are the same (formulas below).
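The correspondence this bullet describes, in formulas (a reconstruction in ESL's notation):

$$ p(x) = \Pr(Y = 1 \mid x) = \frac{1}{1 + e^{-2 f(x)}}, \qquad \text{i.e. } f(x) = \tfrac{1}{2} \operatorname{logit} p(x), $$

and the binomial deviance (negative log-likelihood) becomes $\log\big(1 + e^{-2 y f(x)}\big)$, whose population minimizer is the same $f^{*}(x)$ as for the exponential loss.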
18. Loss Functions and Robustness
- On a finite dataset, exponential loss and binomial deviance are not the same criterion.
- Both criteria are monotone decreasing functions of the margin $y f(x)$ (the losses are listed below).
- Examples with negative margin $y f(x) < 0$ are classified incorrectly.
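For reference, the margin-based losses being compared, written as functions of the margin $y f$ (a standard list; the slide's own figure was not transcribed):

$$ \begin{aligned} \text{misclassification:} \quad & I\big(y f < 0\big) \\ \text{exponential:} \quad & e^{-y f} \\ \text{binomial deviance:} \quad & \log\big(1 + e^{-2 y f}\big) \\ \text{squared error:} \quad & (y - f)^2 = (1 - y f)^2 \end{aligned} $$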
19. Loss Functions and Robustness
- The problem: misclassification loss is not differentiable, and its derivative is 0 wherever it is differentiable.
- We want a criterion that is efficient to optimize and as close as possible to the true classification loss.
- Any loss criterion used for classification should give higher weight to misclassified examples.
- Therefore the squared-error loss is not appropriate for classification: written in terms of the margin it is $(1 - y f)^2$, which grows again for margins $y f > 1$ and so penalizes points that are classified correctly with increasing certainty.
20. Loss Functions and Robustness
- Both functions can be thought of as continuous approximations to the misclassification loss.
- Exponential loss grows exponentially fast for instances with a large negative margin.
- The weight of such instances therefore increases exponentially.
- This makes AdaBoost very sensitive to mislabeled examples.
- Binomial deviance generalizes to K classes; exponential loss does not.
21. Robust Loss Functions for Regression
- The relationship between squared-error loss and absolute loss is analogous to that between exponential loss and binomial deviance.
- Their population solutions are the mean and the median, respectively.
- Absolute loss is more robust.
- For regression, squared-error loss leads to the analogue of AdaBoost for regression.
- For efficiency under Gaussian errors combined with robustness to outliers: the Huber loss (given below).
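The Huber loss named in the last bullet (reconstructed; ESL gives it in this form):

$$ L\big(y, f(x)\big) = \begin{cases} \big(y - f(x)\big)^2, & \big|y - f(x)\big| \le \delta \\ 2\delta\,\big|y - f(x)\big| - \delta^2, & \text{otherwise} \end{cases} $$

It is quadratic for small residuals (efficient under Gaussian errors) and linear for large ones (robust to outliers).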
22. Sample of UCI Datasets: Comparison (accuracy, %)

| Dataset | J48 | J48 bagging (10) | AdaBoost w/ decision stumps | SVM (SMO) | B Net | NB | NN 1 | LBMA | LBMA deviance |
|---|---|---|---|---|---|---|---|---|---|
| colic | 85.1 | 82.8 | 81.08 | 78.38 | 78.37 | 79.73 | 77 | 85.1 | 82.43 |
| anneal(70) | 96.6 | 97.4 | 84.07 | 97.04 | 92.22 | 91.8 | 94.07 | 97 | 97.04 |
| credit-a(x10) | 84.49 | 86.67 | 85.94 | 85.65 | 85.07 | 85.36 | 80 | 86.22 | 84.06 |
| iris-(disc5)x10 | 93.3 | 94 | 87.3 | 94 | 93.3 | 93.3 | 93.3 | 94.67 | 94 |
| soybean-9x2 | 84.87 | 79.83 | 27.73 | 86.83 | 83.19 | 84.59 | 80.11 | 87.36 | 87.68 |
| soybean-37 | 90.51 | 85.4 | 24.09 | 93.43 | 90.51 | 88.32 | 82.48 | 92.7 | 94.16 |
| labor-(disc5) | 70.18 | 78.95 | 87.82 | 87.72 | 94.74 | 91.23 | 85.96 | 94.74 | 94.74 |
| autos-(disc5)x2 | 70.73 | 64.39 | 44.88 | 73.17 | 61.95 | 61.46 | 77.07 | 65.35 | 76.1 |
| credit-g(70) | 74.33 | 73.67 | 74.33 | 74.67 | 77 | 76.67 | 67.67 | 74.33 | 76.67 |
| glassx5 | 57.94 | 56.54 | 42.06 | 57.94 | 56.54 | 54.67 | 55.14 | 58.41 | 57.48 |
| diabetes | 68.36 | 68.49 | 71.61 | 70.18 | 70.31 | 69.92 | 64.45 | 68 | 69.4 |
| audiology | 76.55 | 76.55 | 46.46 | 80.97 | 75.22 | 71.24 | 73.45 | 79.6 | 80.09 |
| breast-cancer | 74.13 | 68.18 | 72.38 | 69.93 | 72.03 | 72.73 | 68.18 | 75.52 | 76.22 |
| heart-c-disc | 77.56 | 81.19 | 84.49 | 83.17 | 84.16 | 83.83 | 76.57 | 80.21 | 84.16 |
| vowel x 5 | 71.92 | 71.92 | 17.97 | 86.46 | 63.94 | 63.94 | 90.7 | 94.04 | 93.84 |
| Average | 78.44 | 77.732 | 62.1473 | 81.3 | 79 | 77.9 | 77.74 | 82.22 | 83.205 |
23. Next Presentation