Transcript and Presenter's Notes

Title: Bumping and Stacking


1
Bumping and Stacking
  • Graduate Seminar on Communication and Signal
    Processing -- Statistical Learning, ENEE 698A
    Oct-29-2003
  • Arunkumar M.

2
Combining models
  • Aim of combining models is to achieve better
    accuracy and robustness.
  • Bumping, bagging, model averaging, stacking etc.
    can be viewed as various ways of combining models.

3
Bumping
  • Bumping stands for Bootstrap Umbrella of Model
    Parameters.
  • It is a technique for combining classifiers/estimators
    for better performance.
  • It is a convenient method for finding better local
    minima.
  • It finds use in optimization under constraints.

4
The bumping procedure
  • We have a training sample Z = {z_1, z_2, ..., z_N}.
  • The model for the data depends on parameters θ, fitted
    by minimizing R(θ; Z), where R is a target criterion.
  • Let us estimate θ by drawing bootstrap samples
    Z*1, ..., Z*B and estimating θ*b from each sample by
    minimizing a working criterion (which may differ from R).

5
The bumping procedure
  • θ* is chosen as the value among the θ*b producing the
    smallest value of R(θ*b; Z) on the original data.
  • There are three distinct scenarios for choosing R and
    the working criterion:
  • 1. R is smooth but possesses many local minima. Choose
    the working criterion equal to R and hope that bumping
    produces a better local minimum.
  • 2. R is not smooth and hence difficult to minimize
    numerically. Then minimization of a smoother working
    criterion is convenient.




6
The bumping procedure
  • 3. Constrained optimization problems have an R that is
    difficult to minimize numerically. The working criterion
    can be chosen to be simply the unconstrained version of R.
  • Notes
  • By convention the original data set is always included
    in the set of bootstrap samples.
  • Suppose our minimization of R finds the global minimum.
    Then bumping also gives that global minimum, by choosing
    the estimate fitted to the original sample.
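  • A minimal sketch of this procedure in Python is given
    below; fit and risk are hypothetical user-supplied
    callables standing in for the working criterion and the
    target criterion R, not part of any library.

    import numpy as np

    def bump(X, y, fit, risk, B=25, seed=0):
        """Bumping: fit on bootstrap samples, score on the original data."""
        rng = np.random.default_rng(seed)
        n = len(y)
        best_theta, best_risk = None, np.inf
        for b in range(B + 1):
            # by convention the original data set is one of the candidates
            idx = np.arange(n) if b == 0 else rng.integers(0, n, n)
            theta = fit(X[idx], y[idx])      # minimize the working criterion on the sample
            r = risk(theta, X, y)            # evaluate the target criterion on the ORIGINAL data
            if r < best_risk:
                best_theta, best_risk = theta, r
        return best_theta, best_risk

  • Note that every candidate is scored on the original
    data, never on the bootstrap sample it was fitted to.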

7
Model Complexity
  • For the comparison of R values to make sense, the fitted
    models θ*b should be of the same complexity, or R must
    include a penalty for the complexity of the model.
  • For tree models, the cost-complexity criterion used is
    R_α(T) = R(T) + α|T|, where α is a penalty parameter and
    |T| is the number of terminal nodes.

8
Example classification problem
  • Consider the classification problem known as the XOR
    problem.
  • There are two input dimensions, and the two classes are
    arranged in an XOR pattern, so no single split on either
    input separates them.
  • The greedy CART algorithm is too short-sighted to give
    the correct result.
  • By bootstrap sampling, bumping breaks the exact balance
    between the classes and will by chance produce at least
    one tree with an initial split near the middle.
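  • A sketch of this experiment, assuming scikit-learn's
    CART implementation and a synthetic XOR data set (the
    exact data used in the original example are not given in
    the slides):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)    # XOR class labels

    best_tree, best_err = None, np.inf
    for b in range(26):
        # original data included by convention; otherwise a bootstrap sample
        idx = np.arange(len(y)) if b == 0 else rng.integers(0, len(y), len(y))
        tree = DecisionTreeClassifier(max_depth=2).fit(X[idx], y[idx])
        err = np.mean(tree.predict(X) != y)            # score on the original data
        if err < best_err:
            best_tree, best_err = tree, err

    print("training error of the bumped tree:", best_err)

  • Because each bootstrap sample unbalances the two classes
    slightly, at least one candidate tree usually places its
    first split near the middle and wins when scored on the
    original data.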

9
Example classification problem contd
  • [Figure: regular 4-node tree (left) vs. bumped 4-node
    tree (right).]

10
Bumping for resistant fitting
  • Suppose we have data points z_i = (x_i, y_i),
    i = 1, ..., n, where x_i is a vector of predictors.
  • A robust estimate of the coefficient vector β is
    obtained by choosing β to minimize
    median_i (y_i - x_i'β)^2   (Least Median of Squares).
  • Choosing least squares as the working criterion on the
    bootstrap samples, we get an approximate version of the
    LMS estimator by bumping.
  • There is no guarantee that the chosen estimate will
    minimize R.
  • But since any outlier in the sample will by chance be
    left out of some bootstrap samples, the estimate from
    bumping will be close to the true minimizer of R.
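  • A sketch of this idea, assuming ordinary least squares
    as the working criterion on each bootstrap sample and
    the LMS criterion on the original data to pick the
    winner; the simulated data here are illustrative, not
    those of the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 6
    X = rng.standard_normal((n, p))
    beta_true = np.ones(p)
    y = X @ beta_true + rng.standard_normal(n)
    y[:5] += 20.0                                      # a few gross outliers

    def lms_risk(beta):                                # target criterion R: median of squared residuals
        return np.median((y - X @ beta) ** 2)

    best_beta, best_risk = None, np.inf
    for b in range(150):
        idx = np.arange(n) if b == 0 else rng.integers(0, n, n)
        beta_b = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]   # smooth working criterion (OLS)
        if lms_risk(beta_b) < best_risk:
            best_beta, best_risk = beta_b, lms_risk(beta_b)

    print("LMS risk of plain OLS fit:", lms_risk(np.linalg.lstsq(X, y, rcond=None)[0]))
    print("LMS risk of bumped fit:  ", best_risk)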

11
Bumping for resistant fitting contd
  • Each observation appears in roughly 63% of the bootstrap
    samples, so the number of bootstrap samples required so
    that at least one will not contain any of k given points
    is about e^k. For k = 1, 2, ..., 7 this equals
    approximately 3, 7, 20, 55, 148, 403, 1096.
  • Hence for a reasonable number of bootstrap samples one
    would only achieve protection against a few outliers.
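  • A quick check of the counts quoted above: a size-n
    bootstrap sample leaves out one given point with
    probability (1 - 1/n)^n ≈ 1/e, hence all of k given
    points with probability ≈ e^(-k), so on the order of
    e^k samples are needed before one such sample appears.

    import math

    n = 50                                   # sample size used in the example
    print("P(a given point is left out) =", (1 - 1 / n) ** n)   # ~ 0.364 ~ 1/e
    for k in range(1, 8):
        print(k, round(math.exp(k)))         # 3, 7, 20, 55, 148, 403, ~1097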

12
Example for resistant fitting contd
  • Generated 20 datasets of size 50 from the model
    Y = β'X + ε, where X is a vector of 6 standard normal
    variates and the error distribution is contaminated so
    that gross outliers occur occasionally.
  • On average there are about 2.5 outliers per sample.
  • MSE for a test sample with no outliers is shown in the
    table.
  • MSE is reported relative to σ², the error variance in
    the model.

13
Example for resistant fitting contd
  • MSE for the example: each entry is the average, with the
    standard deviation in parentheses, over the 20
    simulations.

14
Conclusion
  • Bumping often improves the performance of the
    model estimate
  • It improves the robustness against outliers
  • It can simplify the optimization procedure,
    especially in the case of constrained
    optimization
  • It can improve the chances of finding a global minimum
    rather than settling for a local one

15
Stacking
  • Also known as stacked generalization
  • It is another way of combining a number of models
    (possibly of different kinds) to achieve a better
    model.

16
Stacked Generalization
  • An estimator or a classifier can be viewed as a
    generalizer which guesses a parent function based
    on a set of its sample mappings, known as the
    learning set.
  • In stacked generalization, generalizers are
    stacked, in the sense that outputs of
    generalizers at one level feed the generalizers
    at another level.

17
Stacked Generalization contd
  • A learning set L of m points:
    L = {(x_i, y_i), i = 1, ..., m}.
18
Stacked Generalization contd
  • L is the learning set for the level 0 generalizers.
  • A new set, derived from the level 0 outputs, is a
    learning set for the level 1 generalizer.
  • An example
  • Consider the parent function output = sum of the three
    input components.
  • Level 0 learning set
    L = {(0,0,0; 0), (1,0,0; 1), (1,2,0; 3), (1,1,1; 3),
    (1,-2,4; 3)}.
  • Leaving out the first point gives
    {(1,0,0; 1), (1,2,0; 3), (1,1,1; 3), (1,-2,4; 3)},
  • with left-out input (0,0,0).

19
Stacked Generalization contd
  • And similarly the other leave-one-out sets and left-out
    points are formed.
  • Assume that M = 2, i.e. there are 2 level 0 generalizers.
  • To get a learning input of level 1 for point i, train
    the generalizers on the set with point i left out and
    find their outputs for the left-out input x_i.
  • These M outputs, coupled with the i-th true output y_i,
    constitute one learning sample of the first level.
  • A set of generalizers at level 1 can be trained
    using this learning set. This process can be
    continued.

20
Decision process
  • A question for decision at level 0 is passed
    through the M generalizers at level 0 to get a
    question at the level 1. This can be continued to
    the next higher level and so on.
  • The final decision is obtained from the decision
    of the generalizer at the highest level.
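  • A compact sketch of the two steps just described:
    building the level 1 learning set by leaving out one
    point at a time, then passing a new question through the
    stack. The parent function is the sum-of-components
    example from the earlier slide; the particular
    generalizers (M = 2: nearest-neighbour and linear
    regression) are placeholders chosen only to make the
    sketch runnable.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(100, 3))
    y = X.sum(axis=1)                        # parent function: sum of the components

    def make_level0():                       # M = 2 fresh level 0 generalizers
        return [KNeighborsRegressor(3), LinearRegression()]

    # Level 1 learning set: for each point i, train the level 0 generalizers
    # without it and record their guesses at x_i (level 1 input) together
    # with the true output y_i (level 1 output).
    X1, y1 = [], []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        guesses = [g.fit(X[mask], y[mask]).predict(X[i:i + 1])[0]
                   for g in make_level0()]
        X1.append(guesses)
        y1.append(y[i])
    gen1 = LinearRegression().fit(np.array(X1), np.array(y1))   # level 1 generalizer

    # Decision process: retrain level 0 on the full learning set, then pass
    # a level 0 question through them to get the level 1 question and answer.
    gen0 = [g.fit(X, y) for g in make_level0()]

    def decide(x_new):
        q1 = [[g.predict(x_new.reshape(1, -1))[0] for g in gen0]]
        return gen1.predict(q1)[0]

    print(decide(np.array([0.2, -0.1, 0.4])))                   # parent value would be 0.5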

21
Another example
  • Level 0 input is 1-dimensional
  • Problem is one of surface-fitting of simple math
    functions. Level 0 generalizer linearly connects
    the dots of the learning set to make an
    input-output function.
  • There are only two levels. The level 1
    generalizer returns a normalized weighted sum of
    the outputs of the 3 nearest neighbors of the
    point under question amongst the learning set.
    Weighting factor is the reciprocal of the
    distance between that neighbor and the point
    under question.
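  • A rough sketch of this setup, assuming a stand-in parent
    function (the functions used in the paper are not
    reproduced in the slides); level 0 linearly connects the
    dots, and level 1 estimates the level 0 guessing error
    from the 3 nearest level 1 learning points weighted by
    inverse distance, as described on the following slides.

    import numpy as np

    rng = np.random.default_rng(0)
    parent = np.sin                                    # stand-in parent function
    x = np.sort(rng.uniform(-10, 10, 100))             # 100-point learning set
    y = parent(x)
    x_test = rng.uniform(-10, 10, 100)                 # 100-point testing set

    def level0(xs, ys, q):                             # connect-the-dots generalizer
        return np.interp(q, xs, ys)

    # Level 1 learning set: input = left-out x, output = guessing error there.
    x1 = x.copy()
    err1 = np.array([y[i] - level0(np.delete(x, i), np.delete(y, i), x[i])
                     for i in range(len(x))])

    def level1(q):                                     # inverse-distance weighted 3-NN
        d = np.abs(x1 - q)
        nn = np.argsort(d)[:3]
        w = 1.0 / (d[nn] + 1e-12)
        return np.sum(w * err1[nn]) / np.sum(w)

    guess0 = level0(x, y, x_test)                      # level 0 guess alone
    stacked = guess0 + np.array([level1(q) for q in x_test])   # corrected guess
    print("MSE level 0:", np.mean((guess0 - parent(x_test)) ** 2))
    print("MSE stacked:", np.mean((stacked - parent(x_test)) ** 2))

  • The final guess here adds the estimated guessing error
    back to the level 0 guess, following the error-correcting
    interpretation given in the conclusion slide.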

22
Example contd
  • A 100-point learning set was chosen randomly, and then a
    separate 100-point testing set of input values was chosen
    randomly. Both sets belonged to [-10, 10]. The error
    obtained on the testing set was recorded for the stacked
    generalizer and compared to the error of the level 0
    generalizer run by itself.
  • Average[(squared error of the level 0 generalizer alone)
    - (squared error of the stacked generalizer)] = 81.49.
  • The standard deviation of the error was 10.34.

23
Example contd
[Figure: the level 0 generalizer's guessing curve (black line)
against the parent function (red line), with the elements of
the level 0 learning set marked; a level 0 question q0 on the
input axis gives the output error shown.]
24
Example contd
[Figure: with one element of the level 0 learning set left
out, the level 0 generalizer's guessing curve (black line)
against the parent function (red line). The input component
of the left-out point is a level 1 question q1; the guessing
error at q1 forms the corresponding level 1 output.]
25
Example contd
[Figure: the level 1 learning set, plotting guessing error
(the level 1 output) against the level 1 input, with a level
1 question marked.]
26
Conclusion
  • Stacked generalization is a scheme for feeding
    information from one generalizer to another before
    forming the final guess.
  • When used with a single generalizer, it is a scheme for
    estimating and correcting the errors of that generalizer.
  • The generalizers at subsequent levels get to learn the
    biases of the generalizers at lower levels and hence can
    correct for them.
  • For many generalization problems stacked
    generalization can be expected to reduce the
    generalization error rate.

27
References
  • The Elements of Statistical Learning: Data Mining,
    Inference, and Prediction. Trevor Hastie, Robert
    Tibshirani and Jerome Friedman.
  • Model Search and Inference by Bootstrap Bumping. Robert
    Tibshirani and Keith Knight. Technical Report, University
    of Toronto, November 1995.
  • Stacked Generalization. David H. Wolpert. Technical
    Report, Complex Systems Group, Center for Non-linear
    Studies.