Title: Dynamics of AdaBoost
1. Dynamics of AdaBoost
- Cynthia Rudin, PhD
- NSF Postdoc, BIO Division
- Center for Neural Science, NYU
Joint work with Ingrid Daubechies and Robert E. Schapire, Princeton University
May 2005, TTI
2. A Story about AdaBoost
- AdaBoost was introduced in 1997 by Yoav Freund and Robert E. Schapire. It is a classification algorithm.
- AdaBoost often tends not to overfit (Breiman 96, Cortes and Drucker 97, etc.).
- As a result, the margin theory (Schapire, Freund, Bartlett, and Lee 98) developed, which is based on loose generalization bounds.
- Note: the margin for boosting is not the same as the margin for SVMs.
- Remember, AdaBoost was invented before the margin theory.
The question remained (until recently): Does AdaBoost maximize the margin?
(The margin is between -1 and 1.)
3. The question remained (until recently): Does AdaBoost maximize the margin?
- Empirical results on the convergence of AdaBoost:
- AdaBoost seemed to maximize the margin in the limit (Grove and Schuurmans 98, and others).
Seems very much like "yes."
4. The question remained (until recently): Does AdaBoost maximize the margin?
- Theoretical results on the convergence of AdaBoost:
- 1) AdaBoost generates a margin that is at least ρ/2, where ρ is the maximum margin (Schapire, Freund, Bartlett, and Lee 98).
- Seems like "yes."
[Figure: a number line marking the true (maximum) margin ρ and the guarantee ρ/2 (Schapire et al. 98); AdaBoost's margin is at least ρ/2.]
5. The question remained (until recently): Does AdaBoost maximize the margin?
- Theoretical results on the convergence of AdaBoost:
- 2) AdaBoost generates a margin that is at least Υ(ρ) ≥ ρ/2 (Rätsch and Warmuth 02).
- Even closer to "yes."
[Figure: a number line marking the true margin ρ, the improved bound Υ(ρ) (Rätsch and Warmuth 02), and ρ/2 (Schapire et al. 98); AdaBoost's margin is at least Υ(ρ).]
6. The question remained (until recently): Does AdaBoost maximize the margin?
2) AdaBoost generates a margin that is at least Υ(ρ) ≥ ρ/2 (Rätsch and Warmuth 02).
- Two cases of interest:
- Optimal case: the weak learning algorithm chooses the best weak classifier at each iteration (e.g., BoosTexter).
- Non-optimal case: the weak learning algorithm is only required to choose a sufficiently good weak classifier at each iteration, not necessarily the best one (e.g., the weak learning algorithm is a decision tree or a neural network).
7. The question remained (until recently): Does AdaBoost maximize the margin?
2) AdaBoost generates a margin that is at least Υ(ρ) ≥ ρ/2 (Rätsch and Warmuth 02).
This bound was conjectured to be tight for the non-optimal case, based on numerical evidence (Rätsch and Warmuth 02).
Perhaps "yes" for the optimal case, but "no" for the non-optimal case.
8. The question remained (until recently): Does AdaBoost maximize the margin?
- Hundreds of papers were published using AdaBoost between 1997 and 2004, even though its fundamental convergence properties were not understood! Even after 7 years, this problem was still open!
- AdaBoost is difficult to analyze because the margin does not increase at every iteration; the usual tricks don't work!
- A new approach was needed in order to understand the convergence of this algorithm.
9. The question remained (until recently): Does AdaBoost maximize the margin?
The answer is...
Theorem (R, Daubechies, Schapire 04): AdaBoost may converge to a margin that is significantly below maximum.
The answer is no. It's the opposite of what everyone thought!
Theorem (R, Daubechies, Schapire 04): The bound of (Rätsch and Warmuth 02) is tight, i.e., non-optimal AdaBoost will converge to a margin of Υ(ρ) whenever lim_{t→∞} r_t = ρ. (Note: this is a specific case of a more general theorem.)
10. Overview of Talk
- History of margin theory for boosting (done)
- Introduction to AdaBoost
- Proof of the theorem: reduce AdaBoost to a dynamical system to understand its convergence!
11. A Sample Problem
12. Say you have a database of news articles, where an article is labeled +1 if its category is "entertainment," and -1 otherwise.
Your goal: given a new article, find its label.
13. Examples of Classification Tasks
- Optical Character Recognition (OCR) (post offices, banks), object recognition in images
- Webpage classification (search engines), email filtering, document retrieval
- Bioinformatics (analysis of gene array data, protein classification, etc.)
- Speech recognition, automatic .mp3 sorting
Huge number of applications, but all have high-dimensional data.
14. Examples of classification algorithms
- SVMs (Support Vector Machines; large-margin classifiers)
- Neural Networks
- Decision Trees / Decision Stumps (CART)
- RBF Networks
- Nearest Neighbors
- Bayes Net
- Boosting
- used by itself via stumps (e.g., BoosTexter), or
- as a wrapper for another algorithm (e.g., boosted Decision Trees, boosted Neural Networks)
15. Training Data
{(x_i, y_i)}, i = 1..m, where each (x_i, y_i) is chosen i.i.d. from an unknown probability distribution on X × {-1, 1}.
[Figure: the space X of all possible articles, with training points marked + and - according to their labels.]
16. How do we construct a classifier?
- Divide the space X into two regions, based on the sign of a function f: X → R.
- The decision boundary is the zero-level set of f.
[Figure: the space X split by the curve f(x) = 0, with + points on one side and - points on the other.]
17. Say we have a weak learning algorithm
- A weak learning algorithm produces weak classifiers.
- (Think of a weak classifier as a "rule of thumb.")
Examples of weak classifiers for the entertainment application (e.g., rules based on whether the article contains words like "movie," "actor," or "drama").
Wouldn't it be nice to combine the weak classifiers?
18. Boosting algorithms combine weak classifiers in a meaningful way (Schapire 89).
A boosting algorithm takes as input:
- the weak learning algorithm, which produces the weak classifiers
- a large training database
and outputs:
- the coefficients of the weak classifiers, used to make the combined classifier.
Example: if the article contains the term "movie" and the word "drama," but not the word "actor," then the value of f is sign(0.4 - 0.3 + 0.3) = sign(0.4) = +1, so we label it +1. (A small sketch in code follows.)
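The figure showing the combined classifier did not survive the transcript, so here is a minimal sketch of the vote the example describes. The three word-based rules and the coefficients (0.4, 0.3, 0.3) are inferred from the slide's arithmetic, and all names are hypothetical:

```python
# Minimal sketch of the slide's combined classifier (coefficients inferred
# from the arithmetic sign(0.4 - 0.3 + 0.3); treat them as illustrative).

def contains(word):
    """Weak classifier: +1 if the article contains the word, else -1."""
    return lambda article: 1 if word in article else -1

weak_classifiers = [contains("movie"), contains("actor"), contains("drama")]
coefficients = [0.4, 0.3, 0.3]  # chosen by the boosting algorithm

def combined_label(article):
    score = sum(c * h(article) for c, h in zip(coefficients, weak_classifiers))
    return 1 if score >= 0 else -1

# The article contains "movie" and "drama" but not "actor":
print(combined_label({"movie", "drama"}))  # 0.4 - 0.3 + 0.3 = 0.4 > 0 -> +1
```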
19. AdaBoost (Freund and Schapire 96)
- Start with a uniform distribution (weights) over the training examples.
- (The weights tell the weak learning algorithm which examples are important.)
- Obtain a weak classifier from the weak learning algorithm, h_{j_t}: X → {-1, 1}.
- Increase the weights on the training examples that were misclassified.
- (Repeat.)
At the end, make (carefully!) a linear combination of the weak classifiers obtained at all iterations. (A sketch in code follows.)
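As a concrete reference, here is a minimal sketch of the loop just described, written against the matrix M introduced on the next slides. The convention M[i, j] = +1 when weak classifier j is correct on example i, and the optimal-case choice of the best column, are assumptions taken from the paper's setup, not code from the talk:

```python
import numpy as np

def adaboost(M, T):
    """Sketch of AdaBoost in matrix form.

    M: m x n array with M[i, j] = +1 if weak classifier j is correct on
       training example i, and -1 otherwise (assumed convention).
    T: number of boosting iterations.
    Returns lam, the coefficients on the weak classifiers.
    """
    m, n = M.shape
    d = np.ones(m) / m               # uniform weights over training examples
    lam = np.zeros(n)
    for _ in range(T):
        edges = M.T @ d              # edge of each weak classifier under d
        j = int(np.argmax(edges))    # optimal case: pick the best classifier
        r = edges[j]                 # the edge r_t
        alpha = 0.5 * np.log((1 + r) / (1 - r))
        lam[j] += alpha
        d = d * np.exp(-alpha * M[:, j])   # upweight misclassified examples
        d /= d.sum()                 # renormalize to a distribution
    return lam

def margin(M, lam):
    """Margin of the combined classifier: min_i (M lam)_i / ||lam||_1."""
    return float(np.min(M @ lam) / np.sum(np.abs(lam)))
```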
20. AdaBoost
Define the matrix of weak classifiers and data, M.
Enumerate every possible weak classifier that can be produced by the weak learning algorithm.
[Figure: the m x n matrix M, with one row per training example (i = 1..m) and one column per weak classifier (h_1, ..., h_j, ..., h_n; e.g., "movie," "actor," "drama"); the entry M_ij records whether classifier j is correct on example i.]
The matrix M has too many columns to actually be enumerated. M acts as the only input to AdaBoost.
21. AdaBoost
Define d_t: the distribution (weights) over the training examples at time t.
22. AdaBoost
Define λ_t: the coefficients of the weak classifiers for the linear combination.
23. AdaBoost
[Figure: the AdaBoost loop. M is the matrix of weak classifiers and training instances; d_t are the weights on the training instances; λ_t are the coefficients on the weak classifiers that form the combined classifier, and λ_final gives the coefficients for the final combined classifier.]
24. AdaBoost
[Figure: the same loop, annotated with r_t, the edge. M is the matrix of weak classifiers and training examples; d_t are the weights on the training examples; λ_t are the coefficients on the weak classifiers that form the combined classifier, with λ_final the coefficients for the final combined classifier.]
The d_t's cycle; the λ_t's converge. (The update equations are sketched below.)
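For reference, the quantities in this picture can be written out explicitly. This is my reconstruction in the paper's matrix notation (M_ij = y_i h_j(x_i), with e_{j_t} the standard basis vector), so treat the exact normalization as an assumption rather than a transcript of the slide:

```latex
\begin{align*}
d_{1,i} &= 1/m \\
j_t &\in \operatorname*{argmax}_j \,(d_t^{\top} M)_j
      \qquad\text{(optimal case: best weak classifier)}\\
r_t &= (d_t^{\top} M)_{j_t}
      \qquad\text{(the edge at iteration } t)\\
\alpha_t &= \tfrac{1}{2}\ln\frac{1+r_t}{1-r_t},
      \qquad \lambda_{t+1} = \lambda_t + \alpha_t e_{j_t}\\
d_{t+1,i} &= \frac{d_{t,i}\,e^{-\alpha_t M_{i j_t}}}{Z_t},
      \qquad Z_t = \sum_{i=1}^{m} d_{t,i}\,e^{-\alpha_t M_{i j_t}}\\
\mu(\lambda) &= \min_i \frac{(M\lambda)_i}{\lVert\lambda\rVert_1}
      \qquad\text{(the margin of the combined classifier)}
\end{align*}
```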
25. Does AdaBoost choose λ_final so that the margin µ(f) is maximized? That is, does AdaBoost maximize the margin? No!
[Figure: the space X with labeled training points and a separating decision boundary.]
26. The question remained (until recently): Does AdaBoost maximize the margin?
The answer is...
Theorem (R, Daubechies, Schapire 04): AdaBoost may converge to a margin that is significantly below maximum.
The answer is no. It's the opposite of what everyone thought!
27. About the proof
- AdaBoost is difficult to analyze.
- We use a dynamical systems approach to study this problem:
- Reduce AdaBoost to a dynamical system.
- Analyze the dynamical system in simple cases; remarkably, we find stable cycles!
- Convergence properties can be completely understood in these cases.
28. The key to answering this open question:
A set of examples where AdaBoost's convergence properties can be completely understood.
29. Analyzing AdaBoost using Dynamical Systems
Compare to AdaBoost: an iterated map for directly updating d_t. The reduction uses the fact that M is binary.
The existence of this map enables the study of low-dimensional cases. (A sketch of the map follows.)
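The map itself did not survive the transcript; the following is a reconstruction of the reduced update, using the binary (+1/-1) entries of M mentioned above and the optimal-case choice of j_t, so treat it as a sketch rather than the slide's exact formula:

```latex
\begin{align*}
j_t &\in \operatorname*{argmax}_j \,(d_t^{\top} M)_j, \qquad
r_t = (d_t^{\top} M)_{j_t},\\[1ex]
d_{t+1,i} &=
\begin{cases}
\dfrac{d_{t,i}}{1 + r_t}, & M_{i j_t} = +1 \ \text{(example } i \text{ classified correctly)},\\[2ex]
\dfrac{d_{t,i}}{1 - r_t}, & M_{i j_t} = -1 \ \text{(example } i \text{ misclassified)}.
\end{cases}
\end{align*}
```

The weights stay normalized automatically: the correctly classified examples carry total weight (1 + r_t)/2 and the misclassified ones (1 - r_t)/2, so the two cases divide out to sum to one.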
30. Smallest Non-Trivial Case
[Figure: AdaBoost's iterates shown from t = 1 through t = 50.]
31. [Image-only slide; no transcript.]
32. Smallest Non-Trivial Case
To solve: simply assume a 3-cycle exists.
Convergence to the 3-cycle is really strong.
33-38. Smallest Non-Trivial Case (continued)
[Animation build repeating the same text: to solve, simply assume a 3-cycle exists; convergence to the 3-cycle is really strong.]
39. Two possible stable cycles!
[Figure: iterates for t = 1 through t = 50, with the two stable cycles marked; the maximum margin solution is attained.]
To solve: simply assume a 3-cycle exists. AdaBoost achieves the maximum margin here, so the conjecture is true in at least one case. The edge, r_t, is the golden ratio minus 1. (A quick check follows.)
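A quick sanity check, using the bound function Υ(r) = ln(1 - r²) / ln((1 - r)/(1 + r)) from the Rätsch-Warmuth analysis; this closed form is my reconstruction from the papers, so treat it as an assumption. With r equal to the golden ratio minus one, r satisfies r² = 1 - r and 1 + r = 1/r, so:

```latex
% With r = (\sqrt{5}-1)/2 \approx 0.618:  r^2 = 1 - r  and  1 + r = 1/r, hence
\Upsilon(r) \;=\; \frac{\ln(1 - r^{2})}{\ln\dfrac{1-r}{1+r}}
            \;=\; \frac{\ln r}{\ln r^{3}}
            \;=\; \frac{1}{3}.
```

A margin of 1/3 is consistent with the maximum margin in this smallest case (three weak classifiers, each misclassifying exactly one point).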
40. Generalization of the smallest non-trivial case
- Case of m weak classifiers, each of which misclassifies one point.
- Existence of at least (m-1)! stable cycles, each of which yields a maximum margin solution.
We can't solve for the cycles exactly, but we can prove our equation has a unique solution for each cycle.
41. Generalization of the smallest non-trivial case
- Stable manifolds of the 3-cycles.
42. Empirically Observed Cycles
43. Empirically Observed Cycles
[Figure: iterates from t = 1 through t = 300.]
44. Empirically Observed Cycles
[Figure: iterates from t = 1 through t = 400.]
45. Empirically Observed Cycles
[Figure: iterates from t = 1 through t = 400.]
46. Empirically Observed Cycles
[Figure: iterates from t = 1 through t = 300.]
47. Empirically Observed Cycles
[Figure: iterates from t = 1 through t = 5500, plotting only every 20th iterate.]
48. Empirically Observed Cycles
[Figure: iterates from t = 1 through t = 400.]
49.
- If AdaBoost cycles, we can calculate the margin it will asymptotically converge to in terms of the edge values. (A reconstruction of the formula follows.)
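The formula itself did not survive the transcript; the following is my reconstruction of the cycle-margin expression, so treat the notation as an assumption. For a cycle whose iterations have edge values r_1, ..., r_T, the asymptotic margin is

```latex
\mu_{\text{cycle}}
  \;=\; \frac{\displaystyle\sum_{t=1}^{T} \ln\!\left(1 - r_t^{2}\right)}
             {\displaystyle\sum_{t=1}^{T} \ln\!\left(\frac{1 - r_t}{1 + r_t}\right)},
```

which reduces to Υ(r) when every edge in the cycle equals the same value r.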
50. The question remained (until recently): Does AdaBoost maximize the margin?
- AdaBoost does not always produce a maximum margin classifier!
- Proof: there exists an 8x8 matrix M where AdaBoost provably converges to a non-maximum-margin solution.
- Convergence to a manifold of strongly attracting stable 3-cycles.
- The margin produced by AdaBoost is 1/3, but the maximum margin is 3/8!
51. Approximate Coordinate Ascent Boosting (R, Schapire, Daubechies 04)
[Figure: comparison with AdaBoost.]
52. Recap of main result
- AdaBoost does not always produce a maximum margin classifier! (This contradicts what everyone thought!)
- A new algorithm, Approximate Coordinate Ascent Boosting, provably always does.
- (And it has a fast convergence rate!)
- I'm not going to talk about this much today, sorry!
- AdaBoost is still interesting to study, though!
53. A couple more interesting results
54. Theorem (R, Schapire, Daubechies 04): The bound of (Rätsch and Warmuth 02) is tight, i.e., AdaBoost may converge to a margin within [Υ(ρ), Υ(ρ+ε)] in the non-optimal case. Namely, if the edge stays within [ρ, ρ+ε], AdaBoost's margin will converge to the interval [Υ(ρ), Υ(ρ+ε)].
- We can coerce AdaBoost to converge to any margin we'd like, in the non-optimal case!
Theorem (R, Schapire, Daubechies 04): For any given ε, it is possible to construct a case in which the edge stays within [ρ, ρ+ε].
55. Summary for Dynamics of AdaBoost
- Analyzed the AdaBoost algorithm using an unusual technique, namely dynamical systems.
- Found remarkable stable cycles.
- Answered the question of whether AdaBoost converges to a maximum margin solution.
- The key is a set of examples in which AdaBoost's convergence could be completely understood.
56. A rather bizarre result
57. Consider the following weak classifier: the constant hypothesis, which predicts the same label on every example.
A pretty silly way to classify, huh?
Add this to our set of weak classifiers, and AdaBoost turns into a ranking algorithm.
58. AdaBoost and Ranking
- Cortes & Mohri, and Caruana & Niculescu-Mizil, have observed that AdaBoost is good for ranking (in addition to RankBoost).
- (It tends to achieve a high AUC value.)
- Define the F-skew: it measures the imbalance of the loss between the positive and negative examples.
AdaBoost and RankBoost. Theorem (R, Cortes, Mohri, Schapire 05): Whenever the F-skew vanishes, AdaBoost converges to the minimum of RankBoost's objective function.
Theorem: The F-skew vanishes whenever the constant hypothesis is included in the set of weak classifiers. So...
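The definitions behind these theorems are not spelled out in the transcript; here is a hedged sketch of the mechanism (my reconstruction, writing F_+ and F_- for the exponential loss summed over the positive and negative examples; the exact normalization in the paper may differ):

```latex
% Sketch of the AdaBoost / RankBoost relationship (reconstruction)
\begin{align*}
F_+ &= \sum_{i:\,y_i=+1} e^{-f(x_i)}, \qquad
F_- = \sum_{k:\,y_k=-1} e^{+f(x_k)},\\
\text{AdaBoost loss} &= F_+ + F_-, \qquad
\text{F-skew} = F_+ - F_-,\\
\text{RankBoost loss} &= \sum_{i:\,y_i=+1}\ \sum_{k:\,y_k=-1} e^{-(f(x_i)-f(x_k))}
 = F_+ \cdot F_-
 = \tfrac{1}{4}\Big[(F_+ + F_-)^2 - (F_+ - F_-)^2\Big].
\end{align*}
```

So when the F-skew vanishes, driving down the AdaBoost loss also drives down the RankBoost loss, which is the intuition behind the theorem.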
59. AdaBoost is also useful for designing pretty pictures of cyclic dynamics and chaos.
60. [Image-only slide; no transcript.]
61. [Figure: plot of the edge r_t versus iteration t.]
62. Thank You
Thanks to: NSF Postdoctoral Fellowship, BIO Division; current supervisor Eero Simoncelli, CNS at NYU.
My webpage: www.cns.nyu.edu/rudin