Title: Boosting and Additive Trees (Part 1)
1. Boosting and Additive Trees (Part 1)
- Ch. 10
- Presented by Tal Blum
2. Overview
- Ensemble methods and motivations
- Description of the AdaBoost.M1 algorithm
- Showing that AdaBoost minimizes the exponential loss
- Other loss functions for classification and regression
3. Ensemble Learning and Additive Models
- Intuition: combining the predictions of an ensemble is more accurate than a single classifier.
- Justification (several reasons):
  - It is easy to find fairly accurate rules of thumb, but hard to find a single highly accurate prediction rule.
  - If the training examples are few and the hypothesis space is large, there are several equally accurate classifiers (model uncertainty).
  - The hypothesis space may not contain the true function, but a linear combination of hypotheses might.
  - Exhaustive global search in the hypothesis space is expensive, so we can combine the predictions of several locally accurate classifiers.
- Examples: Bagging, HME, Splines
4. Boosting (explanation)
5. Example
- Learning curve for the simulated example: $Y = 1$ if $\sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5)$, and $Y = -1$ otherwise (a data-generation sketch follows).
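A small sketch of how this simulated dataset can be generated, assuming ten independent standard Gaussian features as in ESL Ch. 10; the function name is illustrative, not from the slides:

```python
# Sketch: the chi-square threshold example used for the learning curve.
# Label is +1 when the squared norm of the ten Gaussian features exceeds
# the median of a chi-square distribution with 10 degrees of freedom, else -1.
import numpy as np
from scipy.stats import chi2

def make_chi2_example(n_samples=2000, n_features=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    threshold = chi2.ppf(0.5, df=n_features)        # chi^2_10(0.5) ~= 9.34
    y = np.where((X ** 2).sum(axis=1) > threshold, 1, -1)
    return X, y
```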
6. AdaBoost.M1 Algorithm
- $W(x)$ is the distribution of weights over the $N$ training points, with $\sum_i W(x_i) = 1$.
- Initially assign uniform weights $W_1(x_i) = 1/N$ for all $i$.
- At each iteration $k$:
  - Find the best weak classifier $C_k(x)$ using the weights $W_k(x)$.
  - Compute the weighted error rate $e_k = \sum_i W_k(x_i)\, I(y_i \ne C_k(x_i)) \,/\, \sum_i W_k(x_i)$.
  - Set $a_k = \log\big((1 - e_k)/e_k\big)$, the weight of classifier $C_k$ in the final hypothesis.
  - For each $x_i$, update $W_{k+1}(x_i) = W_k(x_i)\, \exp\big(a_k\, I(y_i \ne C_k(x_i))\big)$.
- $C_{\mathrm{FINAL}}(x) = \operatorname{sign}\big(\sum_k a_k\, C_k(x)\big)$ (a code sketch follows below).
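A minimal Python sketch of the algorithm above, assuming scikit-learn decision stumps as the weak learner and labels coded as +1/-1; the function names are illustrative, not from the slides:

```python
# Sketch of AdaBoost.M1 with decision stumps; labels y must be +1 / -1.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, n_rounds=50):
    """Return (classifiers, alphas); the final rule is sign(sum_k a_k C_k(x))."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # W_1(x_i) = 1/N
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # best weak classifier under W_k
        miss = stump.predict(X) != y             # I(y_i != C_k(x_i))
        err = np.sum(w * miss) / np.sum(w)       # weighted error rate e_k
        if err <= 0 or err >= 0.5:               # stop: perfect fit or no better than chance
            break
        alpha = np.log((1 - err) / err)          # a_k = log((1 - e_k) / e_k)
        w = w * np.exp(alpha * miss)             # up-weight misclassified points
        w = w / w.sum()                          # renormalize so the weights sum to 1
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """C_FINAL(x) = sign(sum_k a_k C_k(x))."""
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.sign(scores)
```

For example, running adaboost_m1 on the simulated chi-square data sketched above would trace a learning curve of the kind shown on slide 5.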
7. Boosting as an Additive Model
- The final prediction in boosting, $f(x)$, can be expressed as an additive expansion of individual classifiers.
- The process is iterative and can be expressed as follows.
- Typically we would try to minimize a loss function on the training examples, as written out below.
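The expansion and training criterion these bullets refer to, written in the usual ESL Ch. 10 notation (a reconstruction; the slide's own formulas were not transcribed):

$$ f(x) = \sum_{m=1}^{M} \beta_m\, b(x; \gamma_m), \qquad \min_{\{\beta_m, \gamma_m\}_{1}^{M}} \; \sum_{i=1}^{N} L\!\left(y_i,\; \sum_{m=1}^{M} \beta_m\, b(x_i; \gamma_m)\right), $$

where $b(x; \gamma)$ is a basis function (here a weak classifier) parameterized by $\gamma$.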
8. Forward Stagewise Additive Modeling: Algorithm
- Initialize $f_0(x) = 0$.
- For $m = 1$ to $M$:
  - Compute $(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big)$.
  - Set $f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m)$.
9. Forward Stagewise Additive Modeling
- Sequentially adds new basis functions without adjusting the parameters of the previously chosen functions.
- Simple case: squared-error loss.
- Forward stagewise modeling then amounts to fitting the residuals from the previous iteration (see the identity below).
- Squared-error loss is not robust for classification.
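Why squared-error loss reduces to residual fitting (a reconstruction of the step the bullet summarizes):

$$ L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big) = \big(y_i - f_{m-1}(x_i) - \beta\, b(x_i; \gamma)\big)^2 = \big(r_{im} - \beta\, b(x_i; \gamma)\big)^2, $$

where $r_{im} = y_i - f_{m-1}(x_i)$ is the residual of the current model on the $i$-th training example.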
10. Exponential Loss and AdaBoost
- AdaBoost for classification
- $L(y, f(x)) = \exp(-y\, f(x))$: the exponential loss function
11. Exponential Loss and AdaBoost
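The derivation this slide sketches, reconstructed from the standard treatment (ESL Section 10.4): forward stagewise fitting with exponential loss solves

$$ (\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\!\big(-y_i\,[f_{m-1}(x_i) + \beta\, G(x_i)]\big) = \arg\min_{\beta, G} \sum_{i=1}^{N} w_i^{(m)} \exp\!\big(-\beta\, y_i\, G(x_i)\big), $$

with weights $w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$ that depend on neither $\beta$ nor $G$, only on the fit so far.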
12. Finding the best $\beta$
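The solution for $\beta$ (again a reconstruction of the standard result): for any $\beta > 0$, the best $G_m$ is the classifier minimizing the weighted error $\sum_i w_i^{(m)} I(y_i \ne G(x_i))$. Plugging it in and setting the derivative with respect to $\beta$ to zero gives

$$ \beta_m = \frac{1}{2} \log \frac{1 - \operatorname{err}_m}{\operatorname{err}_m}, \qquad \operatorname{err}_m = \frac{\sum_{i} w_i^{(m)} I\big(y_i \ne G_m(x_i)\big)}{\sum_{i} w_i^{(m)}}, $$

so the stagewise update matches the AdaBoost.M1 classifier weight $a_k$ up to a factor of $\tfrac{1}{2}$.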
14. Historical Notes
- AdaBoost was first presented in ML theory as a way to boost a weak classifier.
- At first people thought it defied the no-free-lunch theorem and did not overfit.
- The connection between AdaBoost and stagewise additive modeling was only discovered recently.
15. Why Exponential Loss?
- Mainly computational:
  - Derivatives are easy to compute.
  - The optimal classifier at each step minimizes a weighted sample error.
  - Under mild assumptions the instance weights decrease exponentially fast.
- Statistical:
  - Exponential loss is not necessary for the success of boosting ("On Boosting and the Exponential Loss", Wyner).
  - We will see this in the next slides.
16. Why Exponential Loss?
- Population minimizer (Friedman 2000), shown below.
- This justifies using its sign as a classification rule.
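The population minimizer the bullet refers to (reconstructed; the formula itself was not transcribed from the slide):

$$ f^{*}(x) = \arg\min_{f(x)} \mathrm{E}_{Y \mid x}\big[e^{-Y f(x)}\big] = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)}, $$

so $f^{*}(x) > 0$ exactly when $\Pr(Y = 1 \mid x) > 1/2$.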
17. Why Exponential Loss?
- For the exponential loss, interpreting $f$ as one half of a logit transform of the class probability, the population minimizers of the exponential loss and of the binomial deviance are the same (formulas below).
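The correspondence this bullet describes, in formulas (a reconstruction in ESL's notation):

$$ p(x) = \Pr(Y = 1 \mid x) = \frac{1}{1 + e^{-2 f(x)}}, \qquad \text{i.e. } f(x) = \tfrac{1}{2} \operatorname{logit} p(x), $$

and the binomial deviance (negative log-likelihood) becomes $\log\big(1 + e^{-2 y f(x)}\big)$, whose population minimizer is the same $f^{*}(x)$ as for the exponential loss.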
18. Loss Functions and Robustness
- On a finite dataset, exponential loss and binomial deviance are not the same criterion.
- Both criteria are monotone decreasing functions of the margin $y f(x)$ (the losses are listed below).
- Examples with negative margin $y f(x) < 0$ are classified incorrectly.
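For reference, the margin-based losses being compared, written as functions of the margin $y f$ (a standard list; the slide's own figure was not transcribed):

$$ \begin{aligned} \text{misclassification:} \quad & I\big(y f < 0\big) \\ \text{exponential:} \quad & e^{-y f} \\ \text{binomial deviance:} \quad & \log\big(1 + e^{-2 y f}\big) \\ \text{squared error:} \quad & (y - f)^2 = (1 - y f)^2 \end{aligned} $$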
19. Loss Functions and Robustness
- The problem: misclassification loss is not differentiable, and its derivative is 0 wherever it is differentiable.
- We want a criterion that is efficient to optimize and as close as possible to the true classification loss.
- Any loss criterion used for classification should give higher weight to misclassified examples.
- Therefore the squared-error loss is not appropriate for classification: written in terms of the margin it is $(1 - y f)^2$, which grows again for margins $y f > 1$ and so penalizes points that are classified correctly with increasing certainty.
20. Loss Functions and Robustness
- Both functions can be thought of as continuous approximations to the misclassification loss.
- Exponential loss grows exponentially fast for instances with a large negative margin.
- The weight of such instances therefore increases exponentially.
- This makes AdaBoost very sensitive to mislabeled examples.
- Binomial deviance generalizes to K classes; exponential loss does not.
21. Robust Loss Functions for Regression
- The relationship between squared-error loss and absolute loss is analogous to that between exponential loss and binomial deviance.
- Their population solutions are the mean and the median, respectively.
- Absolute loss is more robust.
- For regression, squared-error loss leads to the analogue of AdaBoost for regression.
- For efficiency under Gaussian errors combined with robustness to outliers: the Huber loss (given below).
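The Huber loss named in the last bullet (reconstructed; ESL gives it in this form):

$$ L\big(y, f(x)\big) = \begin{cases} \big(y - f(x)\big)^2, & \big|y - f(x)\big| \le \delta \\ 2\delta\,\big|y - f(x)\big| - \delta^2, & \text{otherwise} \end{cases} $$

It is quadratic for small residuals (efficient under Gaussian errors) and linear for large ones (robust to outliers).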
22. Sample of UCI Datasets: Comparison (accuracy, %)

| Dataset | J48 | J48 bagging (10) | AdaBoost w/ decision stumps | SVM (SMO) | B Net | NB | NN 1 | LBMA | LBMA deviance |
|---|---|---|---|---|---|---|---|---|---|
| colic | 85.1 | 82.8 | 81.08 | 78.38 | 78.37 | 79.73 | 77 | 85.1 | 82.43 |
| anneal(70) | 96.6 | 97.4 | 84.07 | 97.04 | 92.22 | 91.8 | 94.07 | 97 | 97.04 |
| credit-a(x10) | 84.49 | 86.67 | 85.94 | 85.65 | 85.07 | 85.36 | 80 | 86.22 | 84.06 |
| iris-(disc5)x10 | 93.3 | 94 | 87.3 | 94 | 93.3 | 93.3 | 93.3 | 94.67 | 94 |
| soybean-9x2 | 84.87 | 79.83 | 27.73 | 86.83 | 83.19 | 84.59 | 80.11 | 87.36 | 87.68 |
| soybean-37 | 90.51 | 85.4 | 24.09 | 93.43 | 90.51 | 88.32 | 82.48 | 92.7 | 94.16 |
| labor-(disc5) | 70.18 | 78.95 | 87.82 | 87.72 | 94.74 | 91.23 | 85.96 | 94.74 | 94.74 |
| autos-(disc5)x2 | 70.73 | 64.39 | 44.88 | 73.17 | 61.95 | 61.46 | 77.07 | 65.35 | 76.1 |
| credit-g(70) | 74.33 | 73.67 | 74.33 | 74.67 | 77 | 76.67 | 67.67 | 74.33 | 76.67 |
| glassx5 | 57.94 | 56.54 | 42.06 | 57.94 | 56.54 | 54.67 | 55.14 | 58.41 | 57.48 |
| diabetes | 68.36 | 68.49 | 71.61 | 70.18 | 70.31 | 69.92 | 64.45 | 68 | 69.4 |
| audiology | 76.55 | 76.55 | 46.46 | 80.97 | 75.22 | 71.24 | 73.45 | 79.6 | 80.09 |
| breast-cancer | 74.13 | 68.18 | 72.38 | 69.93 | 72.03 | 72.73 | 68.18 | 75.52 | 76.22 |
| heart-c-disc | 77.56 | 81.19 | 84.49 | 83.17 | 84.16 | 83.83 | 76.57 | 80.21 | 84.16 |
| vowel x 5 | 71.92 | 71.92 | 17.97 | 86.46 | 63.94 | 63.94 | 90.7 | 94.04 | 93.84 |
| Average | 78.44 | 77.732 | 62.1473 | 81.3 | 79 | 77.9 | 77.74 | 82.22 | 83.205 |
23. Next Presentation