Boosting and Additive Trees (2) - PowerPoint PPT Presentation

About This Presentation
Title:

Boosting and Additive Trees (2)

Description:

Boosting and Additive Trees (2). Yi Zhang, Kevyn Collins-Thompson. Advanced Statistical Seminar 11-745, Oct 29, 2002. Recap: Boosting (1) Background: Ensemble Learning ...

Slides: 27
Provided by: YiZh3
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

Title: Boosting and Additive Trees (2)


1
Boosting and Additive Trees (2)
  • Yi Zhang, Kevyn Collins-Thompson
  • Advanced Statistical Seminar 11-745
  • Oct 29, 2002

2
Recap: Boosting (1)
  • Background: Ensemble Learning
  • Boosting: Definitions, Example
  • AdaBoost
  • Boosting as an Additive Model
  • Boosting: Practical Issues
  • Exponential Loss
  • Other Loss Functions
  • Boosting Trees
  • Boosting as Entropy Projection
  • Data Mining Methods

3
Outline for This Class
  • Find the solution based on numerical optimization
  • Control the model complexity and avoid
    overfitting
  • Right sized trees for boosting
  • Number of iterations
  • Regularization
  • Understand the final model (Interpretation)
  • Single variable
  • Correlation of variables

4
Numerical Optimization
  • Goal: find the f that minimizes the loss function
    over the training data
  • Gradient descent: search the unconstrained
    function space to minimize the loss on the
    training data
  • The loss on the training data converges to zero
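
The update equations for this slide did not survive in the transcript. As a hedged reconstruction in standard notation (the symbols L, g_m and ρ_m are assumptions, not taken from the slide), steepest descent treats the N fitted values f(x_i) as free parameters:

```latex
L(f) = \sum_{i=1}^{N} L\bigl(y_i, f(x_i)\bigr), \qquad
g_{im} = \left[ \frac{\partial L\bigl(y_i, f(x_i)\bigr)}{\partial f(x_i)} \right]_{f = f_{m-1}}

f_m = f_{m-1} - \rho_m\, g_m, \qquad
\rho_m = \arg\min_{\rho}\; L\bigl(f_{m-1} - \rho\, g_m\bigr)
```

Because nothing constrains f between the training points, repeating this update can drive the training loss all the way down, which is the sense of the last bullet.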

5
Gradient Search on Constrained Function Space:
Gradient Tree Boosting
  • Introduce a tree at the mth iteration whose
    predictions t_m are as close as possible to the
    negative gradient
  • Advantage over unconstrained gradient search:
    more robust, less prone to overfitting
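
The idea of fitting a tree to the negative gradient can be sketched in a few lines. This is a minimal illustration, not the presenters' code: it assumes squared-error loss (so the negative gradient is just the residual), a single input feature, and two-leaf trees (stumps). All function names are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a two-leaf regression tree (stump) by least squares on one feature."""
    mean_all = sum(targets) / len(targets)
    best = (float("inf"), xs[0], mean_all, mean_all)  # fallback: no valid split
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        thr = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]  # (threshold, left value, right value)


def stump_predict(stump, x):
    thr, left_value, right_value = stump
    return left_value if x <= thr else right_value


def gradient_boost(xs, ys, n_trees=50, shrinkage=1.0):
    """Gradient tree boosting for squared-error loss with stumps."""
    f0 = sum(ys) / len(ys)  # initial constant fit
    fitted = [f0] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For L = (y - f)^2 / 2 the negative gradient is the residual y - f.
        neg_gradient = [y - f for y, f in zip(ys, fitted)]
        stump = fit_stump(xs, neg_gradient)
        stumps.append(stump)
        fitted = [f + shrinkage * stump_predict(stump, x)
                  for f, x in zip(fitted, xs)]
    return f0, shrinkage, stumps


def boost_predict(model, x):
    f0, shrinkage, stumps = model
    return f0 + shrinkage * sum(stump_predict(s, x) for s in stumps)


def training_mse(model, xs, ys):
    return sum((y - boost_predict(model, x)) ** 2
               for x, y in zip(xs, ys)) / len(ys)
```

For squared error the leaf means are already the optimal leaf values, so no separate line search is needed in this sketch; other losses would need one per leaf.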

6
Algorithm 3: MART (Multiple Additive Regression Trees)

7
View Boosting as a Linear Model
  • Basis expansion
  • Use basis functions T_m (m = 1..M, each T_m is a weak
    learner) to transform the input vector X into T
    space, then use linear models in this new space
  • Special to boosting: the choice of basis function
    T_m depends on T_1, ..., T_{m-1}
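
Written out, the basis-expansion view amounts to a model that is linear in the tree outputs. The coefficient symbol α_m is an assumption chosen for illustration:

```latex
f(x) = \sum_{m=1}^{M} \alpha_m\, T_m(x)
```

Unlike a fixed basis, each T_m here is chosen only after T_1, ..., T_{m-1} have been fitted.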

8
Improve Boosting as a Linear Model
  • Recap: Linear Models in Chapter 3
  • Bias-variance trade-off
  • Subset selection (feature selection, discrete)
  • Coefficient shrinkage (smoothing: ridge, lasso)
  • Using derived input directions (PCA, PLS)
  • Multiple outcome shrinkage and selection
  • Exploit correlations in different outcomes
  • This chapter: Improve Boosting
  • Size of the constituent trees J
  • Number of boosting iterations M (subset
    selection)
  • Regularization (shrinkage)

9
Right Size Tree for Boosting (?)
  • The best for one step is not the best in the long run
  • Using a very large tree (such as C4.5) as the weak
    learner to fit the residual assumes each tree is
    the last one in the expansion. This usually degrades
    performance and increases computation
  • Simple approach: restrict all trees to be the
    same size J
  • J limits the interaction level among input features
    in the tree-based approximation
  • In practice low-order interaction effects tend to
    dominate, and empirically 4 ≤ J ≤ 8 works well (?)
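
The link between J and interaction level can be made concrete with an ANOVA-style expansion of the target function. The symbol η is an assumption, used here only for illustration:

```latex
\eta(X) = \sum_{j} \eta_j(X_j)
        + \sum_{j,k} \eta_{jk}(X_j, X_k)
        + \sum_{j,k,l} \eta_{jkl}(X_j, X_k, X_l) + \cdots
```

A tree with J terminal nodes makes at most J − 1 splits, so it cannot represent interaction terms involving more than J − 1 variables. With J = 2 (stumps) only the main-effect terms in the first sum are available.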

11
Number of Boosting Iterations (subset selection)
  • Boosting will overfit as M → ∞
  • Use validation set
  • Other methods (later)
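
Choosing M with a validation set can be sketched as follows. This is a minimal illustration that assumes the validation loss has already been recorded after each boosting iteration; the function name is hypothetical.

```python
def choose_num_iterations(val_losses):
    """Return the 1-based iteration count M with the lowest validation loss.

    val_losses[m - 1] is assumed to be the validation loss after m iterations.
    Ties go to the smallest M.
    """
    if not val_losses:
        raise ValueError("need at least one validation loss")
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1
```

The training loss keeps falling as M grows, so it cannot be used for this choice; the validation loss typically falls and then rises again once the ensemble starts to overfit.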

12
Shrinkage
  • Scale the contribution of each tree by a factor
    0 < ν < 1 to control the learning rate
  • Both ν and M control prediction risk on the
    training data, and they do not operate
    independently
  • ν ↓ ⇒ M ↑
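
In formula form, shrinkage changes only the update step of the boosting loop. The notation below (leaf regions R_jm and leaf values γ_jm for the mth tree) is an assumption, not taken from the slide:

```latex
f_m(x) = f_{m-1}(x) + \nu \cdot \sum_{j=1}^{J} \gamma_{jm}\, I\bigl(x \in R_{jm}\bigr),
\qquad 0 < \nu < 1
```

Each tree then contributes only a fraction ν of its fitted correction, so reaching the same training risk takes more iterations.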

14
Penalized Regression
  • Ridge regression or Lasso regression
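
The objective for this slide is missing from the transcript. A hedged reconstruction, assuming the trees T_k serve as basis functions with coefficients α_k and penalty weight λ:

```latex
\hat{\alpha}(\lambda) = \arg\min_{\alpha}
  \sum_{i=1}^{N} \Bigl( y_i - \sum_{k} \alpha_k T_k(x_i) \Bigr)^{2}
  + \lambda \cdot J(\alpha),
\qquad
J(\alpha) =
\begin{cases}
  \sum_k \alpha_k^{2} & \text{ridge} \\
  \sum_k |\alpha_k|   & \text{lasso}
\end{cases}
```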

15
Algorithm 4: Forward stagewise linear regression

16
If each lasso coefficient α_k(λ) is monotone in λ, we have Σ_k |α_k|
= ε · M, and the solution of Algorithm 4 is
identical to the result of lasso regression as
described on page 64.
(ε, M) ↔ lasso regression s / t / λ

17
More about Algorithm 4
  • Algorithm 4 ≈ Algorithm 3 + shrinkage
  • L1 norm vs. L2 norm: more details later
  • Chapter 12, after learning SVM
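
The body of Algorithm 4 is not in the transcript. The sketch below follows the usual description of forward stagewise linear regression: start with all coefficients at zero, and at each step move the coefficient of the basis function that best fits the current residual by a small amount ε in the direction of its least-squares sign. The function name and the column-wise data layout are assumptions.

```python
def forward_stagewise(columns, ys, eps=0.01, n_steps=1000):
    """Forward stagewise linear regression.

    columns[k][i] is the value of basis function T_k at observation i.
    Returns the list of coefficients alpha_k after at most n_steps updates.
    """
    alphas = [0.0] * len(columns)
    residuals = list(ys)
    for _ in range(n_steps):
        best_k, best_beta, best_gain = None, 0.0, 0.0
        for k, col in enumerate(columns):
            norm_sq = sum(t * t for t in col)
            if norm_sq == 0.0:
                continue
            inner = sum(t * r for t, r in zip(col, residuals))
            beta = inner / norm_sq          # least-squares coefficient on the residual
            gain = inner * inner / norm_sq  # drop in residual sum of squares
            if gain > best_gain:
                best_k, best_beta, best_gain = k, beta, gain
        if best_k is None:  # residual is orthogonal to every basis function
            break
        step = eps if best_beta > 0 else -eps
        alphas[best_k] += step
        residuals = [r - step * t for r, t in zip(residuals, columns[best_k])]
    return alphas
```

While every update to a given coefficient has the same sign, M steps of size ε yield Σ_k |α_k| = ε · M, which is how the pair (ε, M) plays the role of the lasso bound. Run for long enough, the coefficients oscillate within ε of the least-squares fit rather than converging exactly.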

18
Interpretation: Understanding the Final Model
  • Single decision trees are easy to interpret
  • A linear combination of trees is difficult to
    understand
  • Which features are important?
  • What's the interaction between features?

19
Relative Importance of Individual Variables
  • For a single tree, define the importance of x_l as
  • For additive trees, define the importance of x_l as
  • For K-class classification, treat the problem as K
    two-class classification tasks
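
The formulas that followed the first two bullets were lost in the transcript. The standard textbook definitions, offered here as an assumption about what the slide showed, are:

```latex
I_\ell^{2}(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^{\,2}\; I\bigl(v(t) = \ell\bigr),
\qquad
I_\ell^{2} = \frac{1}{M} \sum_{m=1}^{M} I_\ell^{2}(T_m)
```

The first sum runs over the J − 1 internal nodes of the tree, v(t) is the variable split on at node t, and î_t² is the improvement in squared error produced by that split. The second expression averages the single-tree measure over the M trees of the additive model.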

21
Partial Dependence Plots
  • Visualize the dependence of the approximation f(x)
    on the joint values of important features
  • Usually the size of the feature subsets is small (1-3)
  • Define the average or partial dependence
  • Can be estimated empirically using the training
    data
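
The empirical estimate mentioned in the last bullet can be sketched as follows: fix the features of interest at chosen values, leave every other feature as observed in each training row, and average the model's predictions. This is a minimal illustration; the function name and argument layout are assumptions.

```python
def partial_dependence(f, rows, subset, values):
    """Average f over the training rows with the features in `subset` fixed.

    f      : callable taking a list of feature values
    rows   : training inputs, one list of feature values per observation
    subset : indices of the features of interest (the set S)
    values : values assigned to those features, in the same order as `subset`
    """
    total = 0.0
    for row in rows:
        x = list(row)  # copy so the training data is not modified
        for idx, val in zip(subset, values):
            x[idx] = val
        total += f(x)
    return total / len(rows)
```

Evaluating this on a grid of values for one or two features gives the curve or surface that a partial dependence plot displays. The cost is one pass over the training data per grid point.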

22
10.50 vs. 10.52
  • Same if predictor variables are independent
  • Why use 10.50 instead of 10.52 to measure partial
    dependence?
  • Example 1: f(X) = h1(x_S) + h2(x_C)
  • Example 2: f(X) = h1(x_S) · h2(x_C)
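
A short derivation shows why the two examples favor 10.50. It assumes that 10.50 denotes the marginal average over X_C and 10.52 the conditional expectation given X_S, and that the two examples are the purely additive and purely multiplicative cases (the operators appear to have been lost in the transcript):

```latex
\bar{f}_S(x_S) = E_{X_C}\, f(x_S, X_C) \quad \text{(10.50)}, \qquad
\tilde{f}_S(x_S) = E\bigl[\, f(x_S, X_C) \mid X_S = x_S \,\bigr] \quad \text{(10.52)}

f(X) = h_1(x_S) + h_2(x_C): \quad
  \bar{f}_S(x_S) = h_1(x_S) + E\, h_2(X_C), \qquad
  \tilde{f}_S(x_S) = h_1(x_S) + E\bigl[\, h_2(X_C) \mid X_S = x_S \,\bigr]

f(X) = h_1(x_S) \cdot h_2(x_C): \quad
  \bar{f}_S(x_S) = h_1(x_S) \cdot E\, h_2(X_C), \qquad
  \tilde{f}_S(x_S) = h_1(x_S) \cdot E\bigl[\, h_2(X_C) \mid X_S = x_S \,\bigr]
```

The marginal average recovers h1 up to an additive or multiplicative constant. The conditional version carries a term that varies with x_S whenever X_S and X_C are dependent, so it does not isolate h1.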

25
Conclusion
  • Find the solution based on numerical optimization
  • Control the model complexity and avoid
    overfitting
  • Right sized trees for boosting
  • Number of iterations
  • Regularization
  • Understand the final model (Interpretation)
  • Single variable
  • Correlation of variables
