Transcript and Presenter's Notes

Title: Bumping and Stacking


1
Bumping and Stacking
  • Graduate Seminar on Communication and Signal
    Processing -- Statistical Learning, ENEE 698A
    Oct-29-2003
  • Arunkumar M.

2
Combining models
  • The aim of combining models is to achieve better
    accuracy and robustness.
  • Bumping, bagging, model averaging, stacking, etc.
    can all be viewed as different ways of combining
    models.

3
Bumping
  • Bumping stands for Bootstrap Umbrella of Model
    Parameters.
  • It is a technique for combining
    classifiers/estimators for better performance.
  • It is a convenient method for finding better
    local minima.
  • It also finds use in optimization under constraints.

4
The bumping procedure
  • We have a training sample $Z = \{z_1, z_2, \ldots, z_N\}$.
  • The model for the data depends on a parameter $\theta$.
  • $\hat{\theta} = \arg\min_{\theta} R(\theta, Z)$, where
    $R(\theta, Z)$ is a target criterion.
  • Let us estimate $\theta$ by drawing bootstrap samples
    $Z^{*1}, \ldots, Z^{*B}$ and estimating $\hat{\theta}^{*b}$
    on each sample by minimizing a working criterion
    $\tilde{R}(\theta, Z^{*b})$.

5
The bumping procedure
  • $\hat{\theta}$ is chosen as the value among the
    $\hat{\theta}^{*b}$ producing the smallest value of
    $R(\hat{\theta}^{*b}, Z)$ on the original sample.
  • There are three distinct scenarios for choosing $R$
    and $\tilde{R}$:
  • 1. $R$ is smooth but possesses many local minima.
    Choose $\tilde{R} = R$ and hope that bumping produces
    a better local minimum.
  • 2. $R$ is not smooth and hence difficult to minimize
    numerically. Then minimization of a smoother working
    criterion $\tilde{R}$ is convenient.




6
The bumping procedure
  • 3. Constrained optimization problems have an $R$ that
    is difficult to minimize numerically. $\tilde{R}$ can
    then be chosen to be simply the unconstrained version
    of $R$.
  • Notes (a sketch of the procedure follows below):
  • By convention the original data set is always
    included in the set of bootstrap samples.
  • Suppose our minimization of $R$ finds the global
    minimum. Then bumping also gives that global minimum,
    since the fit to the original data set is among the
    candidates it chooses from.
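
The following is a minimal Python sketch of the bumping procedure; the
helper names and the generic fit/criterion interface are assumptions for
illustration, not part of the original slides.

import numpy as np

def bump(fit, target_criterion, X, y, n_boot=20, rng=None):
    """Bumping: fit a model on each bootstrap sample (plus the original
    data) and return the fit with the smallest target criterion R
    evaluated on the ORIGINAL training set.

    fit(X, y)                  -> fitted model object
    target_criterion(m, X, y)  -> scalar R to be minimized
    """
    rng = np.random.default_rng(rng)
    n = len(y)

    # By convention the original data set is included among the candidates.
    candidates = [fit(X, y)]
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # bootstrap sample (with replacement)
        candidates.append(fit(X[idx], y[idx]))

    # Choose the candidate minimizing R on the original training data.
    scores = [target_criterion(m, X, y) for m in candidates]
    return candidates[int(np.argmin(scores))]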

7
Model Complexity
  • For the comparison of the $R(\hat{\theta}^{*b}, Z)$
    values to make sense, the fitted models should be of
    the same complexity, or $R$ must include a penalty for
    the complexity of the model (see the snippet below).
  • For tree models, the cost-complexity criterion used is
    $R_\alpha(T) = R(T) + \alpha\,|T|$, where $\alpha$ is a
    penalty parameter and $|T|$ is the number of terminal
    nodes.
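
In scikit-learn terms (an illustrative assumption; the slides do not name a
library), candidate trees can be kept at comparable complexity either by
fixing the number of leaves or by cost-complexity pruning with a common
penalty alpha:

from sklearn.tree import DecisionTreeClassifier

# Same complexity for every bootstrap fit: fix the number of terminal nodes ...
tree_fixed_size = DecisionTreeClassifier(max_leaf_nodes=4)

# ... or apply cost-complexity pruning R_alpha(T) = R(T) + alpha * |T|
# with the same alpha for every fit.
tree_pruned = DecisionTreeClassifier(ccp_alpha=0.01)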

8
Example classification problem
  • Consider the classification problem known as the XOR
    problem.
  • There are two classes and two input dimensions, and
    the class label follows an exclusive-or pattern across
    the two dimensions.
  • The greedy CART algorithm is too short-sighted to give
    the correct result, because no single first split
    improves the class purity.
  • By bootstrap sampling, bumping breaks the balance
    between the classes and will by chance produce at
    least one tree whose initial split falls near the
    middle (see the sketch below).
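
A self-contained sketch of the experiment, assuming scikit-learn trees and
an illustrative XOR-style data set (the exact data-generating details are
not given on the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (np.sign(X[:, 0]) != np.sign(X[:, 1])).astype(int)   # XOR-style labels

# Bumping: fit a fixed-size tree on the original data and on 20 bootstrap
# samples, then keep the tree with the lowest error on the ORIGINAL data.
candidates = [DecisionTreeClassifier(max_leaf_nodes=4).fit(X, y)]
for _ in range(20):
    idx = rng.integers(0, len(y), size=len(y))
    candidates.append(DecisionTreeClassifier(max_leaf_nodes=4).fit(X[idx], y[idx]))

errors = [np.mean(t.predict(X) != y) for t in candidates]
best_tree = candidates[int(np.argmin(errors))]
print("training error of bumped tree:", min(errors))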

9
Example classification problem contd
  • [Figure: a regular 4-node tree vs. a bumped 4-node
    tree on the XOR data.]

10
Bumping for resistant fitting
  • Suppose we have data points $(x_i, y_i)$,
    $i = 1, \ldots, N$, where each $x_i$ is a vector.
  • A robust estimate of $\beta$ is obtained by choosing
    $\hat{\beta} = \arg\min_{\beta}\ \operatorname{median}_i
    (y_i - x_i^T\beta)^2$ (Least Median of Squares).
  • Choosing the working criterion
    $\tilde{R}(\beta, Z) = \sum_i (y_i - x_i^T\beta)^2$
    (least squares), we get an approximate, bumped version
    of the LMS estimator (see the sketch below).
  • There is no guarantee that this estimate will minimize
    $R$.
  • But since any outlier in the sample will by chance be
    left out of some bootstrap samples, the estimate from
    bumping will be close to the true minimizer of $R$.
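
A minimal Python sketch of this scheme (the function name is illustrative):
ordinary least squares is the working criterion on each bootstrap sample,
and the candidate with the smallest median squared residual on the original
data is kept.

import numpy as np

def bumped_lms(X, y, n_boot=100, rng=None):
    """Approximate Least Median of Squares via bumping: fit ordinary least
    squares on each bootstrap sample, then keep the coefficient vector with
    the smallest median squared residual on the ORIGINAL data."""
    rng = np.random.default_rng(rng)
    n = len(y)

    def ols(Xb, yb):
        return np.linalg.lstsq(Xb, yb, rcond=None)[0]

    candidates = [ols(X, y)]              # the original data is always a candidate
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        candidates.append(ols(X[idx], y[idx]))

    R = [np.median((y - X @ beta) ** 2) for beta in candidates]
    return candidates[int(np.argmin(R))]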

11
Bumping for resistant fitting contd
  • Each observation appears in roughly 63% (about
    1 - 1/e) of the bootstrap samples, so the number of
    bootstrap samples required before at least one of them
    contains none of k given points is roughly e^k. For
    k = 1, 2, 3, 4, 5, 6, 7 this equals approximately
    3, 7, 20, 55, 148, 403, 1096 (see the arithmetic
    below).
  • Hence, for a reasonable number of bootstrap samples,
    one can only achieve protection against a few
    outliers.
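
A one-line check of these counts (the 1/e omission probability is the
standard bootstrap fact; rounding may differ by one in the last entry):

import math

# A bootstrap sample omits k given points with probability about e^(-k),
# so roughly e^k samples are needed before one omits all k of them.
for k in range(1, 8):
    print(k, round(math.exp(k)))   # prints 3, 7, 20, 55, 148, 403, 1097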

12
Example for resistant fitting contd
  • Generated 20 data sets of size 50 from the model
    $Y = X^T\beta + \varepsilon$, where $X$ is a vector of
    6 standard normal variates and the coefficients and
    error distribution are chosen so that, on average,
    there are 2.5 outliers per sample.
  • MSE on a test sample with no outliers is shown in the
    table; $\sigma^2$ is the variance of the errors in the
    model.

13
Example for resistant fitting contd
  • MSE for the example. Entries are the average
    (standard deviation) over the 20 simulations.

14
Conclusion
  • Bumping often improves the performance of the model
    estimate.
  • It improves robustness against outliers.
  • It can simplify the optimization procedure, especially
    in the case of constrained optimization.
  • It can improve the chances of finding a global minimum
    rather than settling for a local one.

15
Stacking
  • Also known as stacked generalization.
  • It is another way of combining a number of models
    (possibly of different kinds) to achieve a better
    model.

16
Stacked Generalization
  • An estimator or a classifier can be viewed as a
    generalizer which guesses a parent function based
    on a set of its sample mappings, known as the
    learning set.
  • In stacked generalization, generalizers are
    stacked, in the sense that outputs of
    generalizers at one level feed the generalizers
    at another level.

17
Stacked Generalization contd
  • A learning set $L$ of $m$ points:
    $L = \{(x_i, y_i),\ i = 1, \ldots, m\}$, where $x_i$
    is the input and $y_i$ the output of the parent
    function.
18
Stacked Generalization contd
  • $L$ is the learning set for level 0.
  • The set of level-0 guesses, paired with the true
    outputs, forms a learning set for level 1 (its
    construction is described on the next slide).
  • An example:
  • Consider the parent function whose output is the sum
    of the three input components.
  • Level 0 learning set: L = {(0,0,0; 0), (1,0,0; 1),
    (1,2,0; 3), (1,1,1; 3), (1,-2,4; 3)}.
  • Leaving out the first point gives the reduced set
    {(1,0,0; 1), (1,2,0; 3), (1,1,1; 3), (1,-2,4; 3)},
    with (0,0,0) as the left-out input.

19
Stacked Generalization contd
  • And similarly for the other reduced learning sets and
    left-out points.
  • Assume that M = 2, i.e., there are 2 level-0
    generalizers.
  • To get one level-1 learning input, train the
    generalizers on the learning set with the i-th point
    left out and find their outputs for the left-out
    input x_i.
  • These M outputs, coupled with the i-th true output
    y_i, constitute one learning sample of the first
    level.
  • A set of generalizers at level 1 can be trained using
    this learning set. The process can be continued to
    further levels (a sketch of the construction follows
    below).
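
A hedged Python sketch of building the level-1 learning set by
leave-one-out, with M = 2 level-0 generalizers; the specific generalizers
(a 1-nearest-neighbour regressor and linear regression) are illustrative
assumptions, not taken from the slides.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

def level1_learning_set(X, y, level0_factories):
    """For each point i: train every level-0 generalizer on L minus point i,
    ask it about x_i, and pair the M guesses with the true output y_i."""
    m = len(y)
    Z = np.zeros((m, len(level0_factories)))
    for i in range(m):
        keep = np.arange(m) != i                    # leave point i out
        for j, make in enumerate(level0_factories):
            model = make().fit(X[keep], y[keep])
            Z[i, j] = model.predict(X[i:i + 1])[0]  # guess for the left-out input
    return Z, y                                     # level-1 inputs and outputs

# Example: the parent function is the sum of the three input components.
X = np.array([[0, 0, 0], [1, 0, 0], [1, 2, 0], [1, 1, 1], [1, -2, 4]], dtype=float)
y = X.sum(axis=1)
factories = [lambda: KNeighborsRegressor(n_neighbors=1), lambda: LinearRegression()]
Z, y1 = level1_learning_set(X, y, factories)

# A level-1 generalizer is then trained on (Z, y1).
level1_generalizer = LinearRegression().fit(Z, y1)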

20
Decision process
  • A question for decision at level 0 is passed
    through the M generalizers at level 0 to get a
    question at the level 1. This can be continued to
    the next higher level and so on.
  • The final decision is obtained from the decision
    of the generalizer at the highest level.

21
Another example
  • The level 0 input is 1-dimensional.
  • The problem is one of surface-fitting of simple math
    functions. The level 0 generalizer linearly connects
    the dots of the learning set to make an input-output
    function.
  • There are only two levels. The level 1 generalizer
    returns a normalized weighted sum of the outputs
    attached to the 3 nearest neighbours of the point in
    question among the learning set, where the weighting
    factor is the reciprocal of the distance between that
    neighbour and the point in question (see the sketch
    below).
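
A minimal sketch of this two-level setup; the parent function sin(x) and
the sign convention for the error correction are assumptions of the
sketch, since the slides only say "simple math functions".

import numpy as np

def level0_guess(xq, xs, ys):
    """Level-0 generalizer: linearly connect the dots of the learning set."""
    order = np.argsort(xs)
    return np.interp(xq, xs[order], ys[order])

def knn3_weighted(xq, xs, vals):
    """Level-1 generalizer: inverse-distance-weighted average of the values
    attached to the 3 nearest neighbours of xq in the learning set."""
    d = np.abs(xs - xq)
    nn = np.argsort(d)[:3]
    w = 1.0 / (d[nn] + 1e-12)          # reciprocal-distance weights
    return np.sum(w * vals[nn]) / np.sum(w)

rng = np.random.default_rng(0)
parent = np.sin                                    # assumed parent function
xs = rng.uniform(-10, 10, size=100)                # 100-point learning set on [-10, 10]
ys = parent(xs)

# Level-1 learning set: for each left-out point, record the level-0 guessing error.
errors = np.array([level0_guess(xs[i], np.delete(xs, i), np.delete(ys, i)) - ys[i]
                   for i in range(len(xs))])

def stacked_guess(xq):
    # Correct the level-0 guess by the error estimated from the level-1 set.
    return level0_guess(xq, xs, ys) - knn3_weighted(xq, xs, errors)

xq = 2.5
print(level0_guess(xq, xs, ys), stacked_guess(xq), parent(xq))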

22
Example contd
  • A 100-point learning set was chosen randomly, and
    then a separate 100-point testing set of input values
    was chosen randomly. Both sets lay in [-10, 10]. The
    error obtained on the testing set was recorded for the
    stacked generalizer and compared to the error of the
    level 0 generalizer run by itself.
  • Average of (squared error of the level 0 generalizer
    alone) - (squared error of the stacked generalizer)
    = 81.49.
  • The standard deviation of the error was 10.34.

23
Example contd
[Figure: output vs. input. Black line: the level 0 generalizer's guessing
curve; red line: the parent function; dots: elements of the level 0
learning set; q0: a level 0 question; the gap between the two curves at q0
is the error.]
24
Example contd
[Figure: output vs. input. Black line: the level 0 generalizer's guessing
curve; red line: the parent function; left-in elements of the level 0
learning set are marked apart from the left-out element; q1: the input
component of the left-out point, a level 1 question. The guessing error at
q1 forms the corresponding level 1 output.]
25
Example contd
[Figure: guessing error (level 1 output) vs. level 1 input; each point
(level 1 question, guessing error) is an element of the level 1 learning
set.]
26
Conclusion
  • The stacked generalizer is a scheme for feeding
    information from one generalizer to another before
    forming the final guess.
  • When used with a single generalizer, it is a scheme
    for estimating and correcting the errors of that
    generalizer.
  • The generalizers at subsequent levels get to learn the
    biases of the generalizers at the lower levels and
    hence can improve on them.
  • For many generalization problems, stacked
    generalization can be expected to reduce the
    generalization error rate.

27
References
  • The Elements of Statistical Learning: Data Mining,
    Inference, and Prediction. Trevor Hastie, Robert
    Tibshirani and Jerome Friedman.
  • Model Search and Inference by Bootstrap Bumping.
    Robert Tibshirani and Keith Knight. Technical Report,
    University of Toronto, Nov 1995.
  • Stacked Generalization. David H. Wolpert. Technical
    Report, Complex Systems Group, Center for Non-linear
    Studies.