Title: Bumping and Stacking
1. Bumping and Stacking
- Graduate Seminar on Communication and Signal Processing -- Statistical Learning, ENEE 698A
- Oct-29-2003 - Arunkumar M.
2. Combining models
- The aim of combining models is to achieve better accuracy and robustness.
- Bumping, bagging, model averaging, stacking, etc. can be viewed as various ways of combining models.
3. Bumping
- Bumping stands for Bootstrap Umbrella of Model Parameters.
- It is a technique for combining classifiers/estimators for better performance.
- It is a convenient method for finding better local minima.
- It finds use in optimization under constraints.
4. The bumping procedure
- We have a training sample z = (z_1, ..., z_n).
- The model for the data depends on a parameter θ, where R(θ, z) is a target criterion.
- We estimate θ by drawing bootstrap samples z*1, ..., z*B and estimating θ̂*b by minimizing a working criterion R̃(θ, z*b) on each bootstrap sample.
5. The bumping procedure
- θ̂ is chosen as the value among the θ̂*b producing the smallest value of R(θ, z) on the original data.
- Three distinct scenarios for choosing R and R̃:
- R is smooth but possesses many local minima. Choose R̃ = R and hope that bumping produces a better local minimum.
- R is not smooth and hence difficult to minimize numerically. Then minimization of a smoother working criterion R̃ is convenient.
6. The bumping procedure
- Constrained optimization problems have an R that is difficult to minimize numerically. R̃ can be chosen to be simply the unconstrained version of R.
- Notes
- By convention the original data set is always included in the set of bootstrap samples.
- Suppose our minimization of R finds the global minimum. Then bumping also gives that global minimum, by choosing the fit to the original sample. (A small code sketch of the procedure follows below.)
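The following is a minimal Python sketch of the procedure, not taken from the slides: the data and the choice of criteria (squared error as the working criterion R̃, whose per-sample minimizer is the mean, and absolute error as the non-smooth target R, as in the second scenario above) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def bump(z, fit, target_risk, B=20):
    """Generic bumping: fit on bootstrap samples, keep the candidate
    with the smallest target criterion R on the ORIGINAL data."""
    n = len(z)
    candidates = [fit(z)]                      # original sample included by convention
    for _ in range(B):
        zb = z[rng.integers(0, n, size=n)]     # bootstrap sample z*b
        candidates.append(fit(zb))             # minimize the working criterion R~
    risks = [target_risk(theta, z) for theta in candidates]
    return candidates[int(np.argmin(risks))]

# Illustrative choices (assumptions, not from the slides): the working R~ is
# squared error, so each bootstrap fit is simply the sample mean; the target R
# is mean absolute error, whose true minimizer is the median.
fit = lambda sample: sample.mean()
target = lambda theta, sample: np.mean(np.abs(sample - theta))

z = rng.standard_t(df=2, size=200)             # heavy-tailed data
print("plain mean               :", z.mean())
print("bumped estimate          :", bump(z, fit, target, B=100))
print("target minimizer (median):", np.median(z))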
7. Model Complexity
- For the comparison of R values to make sense, the fitted models should be of the same complexity, or R must include a factor for the complexity of the model.
- For tree models, the cost-complexity criterion used is R_α(T) = Σ_m Σ_{x_i ∈ R_m} (y_i − ŷ_m)² + α|T|, where α is a penalty parameter and |T| is the number of terminal nodes. (A short pruning illustration follows below.)
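As a side note, cost-complexity pruning of this form is exposed in scikit-learn through the ccp_alpha parameter; the data and the penalty values below are illustrative assumptions only.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=200)

# A larger alpha penalizes the number of terminal nodes |T| more heavily,
# so the cost-complexity optimal tree gets smaller.
for alpha in [0.0, 0.005, 0.05]:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(f"alpha = {alpha}: {tree.get_n_leaves()} leaves")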
8. Example classification problem
- Consider the classification problem known as the XOR problem.
- There are two classes and two input dimensions; the class is determined jointly by both dimensions, so no single initial split improves class purity.
- The greedy CART algorithm is too short-sighted to give the correct result.
- By bootstrap sampling, bumping breaks the balance in the classes and will by chance produce at least one tree with an initial split near the middle, as the sketch below illustrates.
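A minimal sketch of this effect, assuming a uniform XOR-style data layout and scikit-learn's CART implementation (both assumptions, not details from the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# XOR-style data: class is 1 when the two features have the same sign.
n = 200
X = rng.uniform(-1, 1, size=(n, 2))
y = ((X[:, 0] > 0) == (X[:, 1] > 0)).astype(int)

def misclassification(tree, X, y):
    return np.mean(tree.predict(X) != y)

# Greedy CART on the original data: the first split gains almost nothing,
# so a small tree often fails on XOR.
plain = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)

# Bumping: grow the same-size tree on bootstrap samples and keep the one
# with the smallest error on the ORIGINAL data (original sample included).
candidates = [plain]
for _ in range(25):
    idx = rng.integers(0, n, size=n)
    candidates.append(
        DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X[idx], y[idx])
    )
errors = [misclassification(t, X, y) for t in candidates]
bumped = candidates[int(np.argmin(errors))]

print("plain 4-node tree error :", misclassification(plain, X, y))
print("bumped 4-node tree error:", misclassification(bumped, X, y))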
9. Example classification problem contd
- [Figure: decision boundaries of the regular 4-node tree and of the bumped 4-node tree.]
10. Bumping for resistant fitting
- Suppose we have data points (x_i, y_i), i = 1, ..., n, where x_i is a p-vector.
- A robust estimate of the coefficient vector β is obtained by choosing β̂ to minimize R(β) = median_i (y_i − x_i'β)² (Least Median of Squares).
- Choosing the working criterion R̃(β) = Σ_i (y_i − x_i'β)² (least squares), we get an approximate version of the LMS estimator.
- There is no guarantee that the bumped estimate will minimize R.
- Since any outlier in the sample will by chance be left out of some bootstrap samples, the estimate from bumping will be close to the true minimizer of R (see the sketch below).
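A minimal sketch of bumping for resistant fitting; the data-generating details (coefficients, noise level, number of outliers) are assumptions chosen only for illustration.

import numpy as np

rng = np.random.default_rng(1)

# Illustrative data with a few gross outliers (an assumption for this sketch).
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)
y[:3] += 20.0                                  # three outliers

def ols(X, y):
    # Working criterion R~: ordinary least squares, easy to minimize.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def lms_risk(beta, X, y):
    # Target criterion R: median of squared residuals (LMS).
    return np.median((y - X @ beta) ** 2)

candidates = [ols(X, y)]                       # original sample included
for _ in range(100):
    idx = rng.integers(0, n, size=n)           # bootstrap sample
    candidates.append(ols(X[idx], y[idx]))
bumped = min(candidates, key=lambda b: lms_risk(b, X, y))

print("OLS on full data   :", ols(X, y).round(2))
print("bumped (approx LMS):", bumped.round(2))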
11. Bumping for resistant fitting contd
- Since each observation appears in roughly 63% of the bootstrap samples, the number of bootstrap samples required so that at least one will not contain any of k given points is approximately e^k. For k = 1, 2, 3, 4, 5, 6, 7 this equals approximately 3, 7, 20, 55, 148, 403, 1096 (a short derivation is given below).
- Hence for a reasonable number of bootstrap samples one would only achieve protection against a few outliers.
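These counts follow from the usual bootstrap omission argument; a short derivation, added here for clarity:

\[
P(\text{a given observation is absent from one bootstrap sample})
  = \left(1 - \tfrac{1}{n}\right)^{n} \approx e^{-1} \approx 0.37,
\qquad\text{so each observation appears in about } 1 - e^{-1} \approx 63\% \text{ of the samples.}
\]
\[
P(\text{all } k \text{ given points are absent from one sample}) \approx e^{-k},
\qquad\text{so the expected number of samples needed is} \approx e^{k}.
\]
\[
e^{1}, e^{2}, \ldots, e^{7} \approx 2.7,\ 7.4,\ 20.1,\ 54.6,\ 148.4,\ 403.4,\ 1096.6 .
\]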
12. Example for resistant fitting contd
- Generated 20 datasets of size 50 from the model Y = X'β + ε, where X is a vector of 6 standard normal variates and the error distribution is contaminated.
- On average there are 2.5 outliers per sample.
- MSE for a test sample with no outliers is shown in the table; σ² denotes the error variance in the model.
13. Example for resistant fitting contd
- [Table: MSE for the example; each entry is the average (standard deviation) over 20 simulations.]
14. Conclusion
- Bumping often improves the performance of the model estimate.
- It improves robustness against outliers.
- It can simplify the optimization procedure, especially in the case of constrained optimization.
- It can improve the chances of finding a global minimum rather than settling for a local minimum.
15. Stacking
- Also known as stacked generalization.
- It is another way of combining a number of models (possibly of different kinds) to achieve a better model.
16. Stacked Generalization
- An estimator or a classifier can be viewed as a generalizer which guesses a parent function based on a set of its sample mappings, known as the learning set.
- In stacked generalization, generalizers are stacked, in the sense that the outputs of generalizers at one level feed the generalizers at the next level.
17. Stacked Generalization contd
- A learning set L of m points: L = {(x_i, y_i), i = 1, ..., m}.
18. Stacked Generalization contd
- L is the learning set for level 0; from it a learning set for level 1 is constructed.
- An example
- Consider the parent function: output = sum of the three input components.
- Level 0 learning set: L = {((0,0,0), 0), ((1,0,0), 1), ((1,2,0), 3), ((1,1,1), 3), ((1,-2,4), 3)}.
- Leave the first point out: train on {((1,0,0), 1), ((1,2,0), 3), ((1,1,1), 3), ((1,-2,4), 3)} and pose the left-out input (0,0,0) as a question.
19. Stacked Generalization contd
- The other leave-one-out training sets and questions are formed similarly.
- Assume that M = 2, i.e. 2 level 0 generalizers.
- To get a learning input of level 1, say for point i, train the generalizers with the set that leaves point i out and find their outputs for the input x_i. These M outputs, coupled with the i-th output y_i, constitute one learning sample of the first level (see the sketch below).
- A set of generalizers at level 1 can be trained using this learning set. This process can be continued.
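A minimal Python sketch of this construction with M = 2 level 0 generalizers (a linear fit and a 1-nearest-neighbor rule, both illustrative assumptions) and a linear level 1 generalizer; the parent function is the sum of the three input components, as in the toy example above.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Level 0 learning set: parent function = sum of the three input components.
X = rng.uniform(-2, 2, size=(30, 3))
y = X.sum(axis=1)

level0 = [LinearRegression(), KNeighborsRegressor(n_neighbors=1)]  # M = 2 generalizers

# Build the level 1 learning set by leave-one-out: for each point i, train
# every level 0 generalizer WITHOUT point i and record their guesses for
# x_i; the target stays y_i.
m = len(y)
Z = np.zeros((m, len(level0)))              # level 1 inputs
for i in range(m):
    keep = np.arange(m) != i
    for j, g in enumerate(level0):
        Z[i, j] = g.fit(X[keep], y[keep]).predict(X[i:i + 1])[0]

level1 = LinearRegression().fit(Z, y)       # level 1 generalizer

# Decision process: a new question is passed through the (re-fitted)
# level 0 generalizers, and their outputs form the level 1 question.
for g in level0:
    g.fit(X, y)
x_new = np.array([[1.0, -2.0, 4.0]])
z_new = np.array([[g.predict(x_new)[0] for g in level0]])
print("stacked guess:", level1.predict(z_new)[0], " true value:", x_new.sum())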
20. Decision process
- A question for decision at level 0 is passed through the M generalizers at level 0 to get a question at level 1. This can be continued to the next higher level and so on.
- The final decision is obtained from the decision of the generalizer at the highest level.
21. Another example
- The level 0 input is 1-dimensional.
- The problem is one of surface-fitting of simple math functions. The level 0 generalizer linearly connects the dots of the learning set to make an input-output function.
- There are only two levels. The level 1 generalizer returns a normalized weighted sum of the outputs of the 3 nearest neighbors of the point under question amongst the learning set. The weighting factor is the reciprocal of the distance between that neighbor and the point under question. (A sketch of this setup follows below.)
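A minimal sketch of this two-level scheme; the particular parent function and the leave-one-out construction of the level 1 learning set (guessing errors, later subtracted from the level 0 guess, following the error-correction view on the later slides) are assumptions filled in for illustration and may differ from the original report's exact protocol.

import numpy as np

rng = np.random.default_rng(0)

def parent(x):
    # A simple math function standing in for the parent surface
    # (the functions used in the original experiment are not given here).
    return np.sin(x) + 0.5 * np.cos(3 * x)

# 100-point learning set and a separate 100-point testing set in [-10, 10].
x_learn = np.sort(rng.uniform(-10, 10, 100))
y_learn = parent(x_learn)
x_test = rng.uniform(-10, 10, 100)

def level0_guess(xq, xs, ys):
    # Level 0 generalizer: linearly connect the dots of the learning set.
    return np.interp(xq, xs, ys)

# Level 1 learning set: leave each point out and record the level 0
# guessing error (guess minus truth) at the left-out input.
errors = np.empty_like(x_learn)
for i in range(len(x_learn)):
    keep = np.arange(len(x_learn)) != i
    errors[i] = level0_guess(x_learn[i], x_learn[keep], y_learn[keep]) - y_learn[i]

def stacked_guess(xq):
    guess0 = level0_guess(xq, x_learn, y_learn)
    # Level 1 generalizer: normalized inverse-distance weighted sum of the
    # errors stored at the 3 nearest learning inputs.
    d = np.abs(x_learn - xq)
    nn = np.argsort(d)[:3]
    w = 1.0 / np.maximum(d[nn], 1e-12)
    return guess0 - np.sum(w * errors[nn]) / np.sum(w)

err0 = np.mean((level0_guess(x_test, x_learn, y_learn) - parent(x_test)) ** 2)
err1 = np.mean([(stacked_guess(x) - parent(x)) ** 2 for x in x_test])
print("level 0 alone MSE:", err0, " stacked MSE:", err1)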
22. Example contd
- A 100-point learning set was chosen randomly, and then a separate 100-point testing set of input values was chosen randomly. Both sets belonged to [-10, 10]. The error obtained on the testing set was recorded for the stacked generalizer and compared to the error of the level 0 generalizer run by itself.
- Average of (squared error of the level 0 generalizer alone) minus (squared error of the stacked generalizer) = 81.49.
- The standard deviation of this error difference was 10.34.
23. Example contd
- [Figure: the level 0 generalizer's guessing curve (black) vs. the parent function (red), with the elements of the level 0 learning set marked; q0 is a level 0 question on the input axis, and the gap between the two curves at q0 is the error on the output axis.]
24. Example contd
- [Figure: the level 0 generalizer's guessing curve (black) fitted with one learning point left out, vs. the parent function (red); left-in and left-out elements of the level 0 learning set are marked. q1, the input component of the left-out point, is a level 1 question, and the guessing error at q1 forms the corresponding level 1 output.]
25. Example contd
- [Figure: the level 1 learning set, plotting guessing error (level 1 output) against level 1 input; each point is an element of the level 1 learning set, and a level 1 question is marked.]
26. Conclusion
- A stacked generalizer is a scheme for feeding information from one generalizer to another before forming the final guess.
- When used with a single generalizer, it is a scheme for estimating and correcting the errors of that generalizer.
- The generalizers at the subsequent levels get to learn the biases of the generalizers at the lower levels and hence can correct for them.
- For many generalization problems stacked generalization can be expected to reduce the generalization error rate.
27. References
- The Elements of Statistical Learning: Data Mining, Inference and Prediction -- Trevor Hastie, Robert Tibshirani and Jerome Friedman.
- Model Search and Inference by Bootstrap Bumping -- Robert Tibshirani and Keith Knight, Technical Report, Nov 1995, University of Toronto.
- Stacked Generalization -- David H. Wolpert, Technical Report, Complex Systems Group, Center for Non-linear Studies.