Title: Boosting and Additive Trees (2)
1 Boosting and Additive Trees (2)
- Yi Zhang, Kevyn Collins-Thompson
- Advanced Statistical Seminar 11-745
- Oct 29, 2002
2 Recap: Boosting (1)
- Background: Ensemble Learning
- Boosting: Definitions, Example
- AdaBoost
- Boosting as an Additive Model
- Boosting: Practical Issues
- Exponential Loss
- Other Loss Functions
- Boosting Trees
- Boosting as Entropy Projection
- Data Mining Methods
3 Outline for This Class
- Find the solution based on numerical optimization
- Control the model complexity and avoid overfitting
- Right-sized trees for boosting
- Number of iterations
- Regularization
- Understand the final model (Interpretation)
- Single variable
- Correlation of variables
4 Numerical Optimization
- Goal: find the f that minimizes the loss function over the training data
- Gradient descent: search in the unconstrained function space to minimize the loss on the training data
- The loss on the training data converges to zero
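In symbols (a sketch of the standard functional-gradient view, following ESL Chapter 10 notation rather than the slide images): treating the fitted values f(x_i) as free parameters, minimize

  L(f) = \sum_{i=1}^{N} L(y_i, f(x_i))

by steepest descent, where the negative gradient components at step m are

  -g_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}},

and the update is f_m = f_{m-1} - \rho_m g_m, with step length \rho_m found by line search.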
5 Gradient Search on a Constrained Function Space: Gradient Tree Boosting
- Introduce a tree at the mth iteration whose predictions t_m are as close as possible to the negative gradient
- Advantage compared with unconstrained gradient search: more robust, less likely to overfit
6 Algorithm 3: MART (Multiple Additive Regression Trees)
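The algorithm itself appeared only as a figure. Below is a minimal Python sketch of gradient tree boosting for squared-error loss in the spirit of MART; the use of scikit-learn's DecisionTreeRegressor as the base learner and the function names are assumptions for illustration, not part of the original slides. For squared error the negative gradient is just the residual, and the leaf means fitted by the regression tree are already the optimal per-region constants.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost(X, y, M=100, J=6, nu=0.1):
    """Sketch of gradient tree boosting (MART-style) for squared-error loss.

    M  : number of boosting iterations
    J  : number of terminal nodes per tree (controls the interaction level)
    nu : shrinkage / learning rate
    """
    f0 = y.mean()                            # f_0: optimal constant for squared error
    f = np.full(len(y), f0)
    trees = []
    for m in range(M):
        residual = y - f                     # negative gradient of 0.5 * (y - f)^2
        tree = DecisionTreeRegressor(max_leaf_nodes=J)
        tree.fit(X, residual)                # tree regions + leaf means approximate -g_m
        f += nu * tree.predict(X)            # shrunken update f_m = f_{m-1} + nu * T_m
        trees.append(tree)
    return f0, trees

def boosted_predict(f0, trees, X, nu=0.1):
    """Evaluate the additive expansion f_0 + nu * sum_m T_m(x)."""
    return f0 + nu * sum(tree.predict(X) for tree in trees)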
7 View Boosting as a Linear Model
- Basis expansion (written out below)
- Use basis functions T_m (m = 1, ..., M; each T_m is a weak learner) to transform the input vector X into T space, then fit a linear model in this new space
- Special to boosting: the choice of basis function T_m depends on T_1, ..., T_{m-1}
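The expansion (a sketch in ESL's notation, not transcribed from the slide):

  f(x) = \sum_{m=1}^{M} \beta_m T_m(x; \gamma_m),

where each T_m is a weak learner (here, a tree with parameters \gamma_m) and boosting fits the pairs (\beta_m, \gamma_m) in a forward-stagewise fashion, which is why T_m depends on T_1, ..., T_{m-1}.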
8 Improve Boosting as a Linear Model
- Recap: linear models in Chapter 3
- Bias-variance trade-off
- Subset selection (feature selection, discrete)
- Coefficient shrinkage (smooth: ridge, lasso)
- Using derived input directions (PCA, PLS)
- Multiple-outcome shrinkage and selection
- Exploit correlations among different outcomes
- This chapter: improve boosting
- Size of the constituent trees, J
- Number of boosting iterations, M (subset selection)
- Regularization (shrinkage)
9 Right-Sized Trees for Boosting (?)
- The best tree for one step is not the best in the long run
- Using a very large tree (such as C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion; this usually degrades performance and increases computation
- Simple approach: restrict all trees to the same size J
- J limits the interaction level of the tree-based approximation among input features (see the ANOVA-style expansion below)
- In practice low-order interaction effects tend to dominate, and empirically 4 ≤ J ≤ 8 works well (?)
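One way to make the interaction-level claim precise (a sketch following the standard ANOVA-type expansion in ESL, not taken from the slides):

  f(X) = \sum_j f_j(X_j) + \sum_{j,k} f_{jk}(X_j, X_k) + \sum_{j,k,l} f_{jkl}(X_j, X_k, X_l) + \cdots

A tree with J terminal nodes can capture interactions of order at most J - 1, so restricting J truncates this expansion at low-order terms.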
11 Number of Boosting Iterations (Subset Selection)
- Boosting will overfit as M → ∞
- Use a validation set (sketched below)
- Other methods (later)
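A minimal sketch of choosing M with a validation set. It assumes scikit-learn's GradientBoostingRegressor and a synthetic dataset from make_friedman1; neither the library nor the data appears in the original slides.

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data; split off a validation set to pick M.
X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit with a deliberately large number of iterations.
gbm = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1, max_leaf_nodes=6)
gbm.fit(X_train, y_train)

# Validation loss after each boosting iteration; choose the M that minimizes it.
val_loss = [mean_squared_error(y_val, pred) for pred in gbm.staged_predict(X_val)]
best_M = int(np.argmin(val_loss)) + 1
print("chosen number of boosting iterations:", best_M)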
12 Shrinkage
- Scale the contribution of each tree by a factor 0 < ν < 1 to control the learning rate
- Both ν and M control prediction risk on the training data, and they operate dependently
- Smaller ν requires larger M (ν ↓ ⇒ M ↑)
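The shrunken update, in its standard form (a sketch; ν is the shrinkage factor above):

  f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J} \gamma_{jm} \, I(x \in R_{jm}), \qquad 0 < \nu < 1,

where the R_{jm} are the regions of the mth tree and the \gamma_{jm} are its terminal-node constants.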
14 Penalized Regression
- Ridge regression or lasso regression (objectives written out below)
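For reference (standard definitions, not transcribed from the slide): with coefficients α on basis functions T_k,

  \hat{\alpha}(\lambda) = \arg\min_{\alpha} \sum_{i=1}^{N} \Big( y_i - \sum_{k} \alpha_k T_k(x_i) \Big)^2 + \lambda \, J(\alpha),

where J(\alpha) = \sum_k \alpha_k^2 gives ridge regression and J(\alpha) = \sum_k |\alpha_k| gives the lasso.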
15 Algorithm 4: Forward Stagewise Linear Regression
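The algorithm itself appeared only as a figure. Below is a minimal Python sketch of forward stagewise linear regression over a fixed dictionary of basis functions; the function name, the precomputed basis matrix T, and the step size eps are placeholders for illustration, not taken from the slides.

import numpy as np

def forward_stagewise(T, y, eps=0.01, M=5000):
    """Sketch of forward stagewise linear regression.

    T   : (N, K) matrix with entries T_k(x_i), the basis functions evaluated on the data
    y   : (N,) response vector
    eps : small step size; M small steps are taken in total
    """
    N, K = T.shape
    alpha = np.zeros(K)
    residual = y.astype(float).copy()         # residual starts at y since alpha = 0
    for _ in range(M):
        # For each k, the best single-coordinate move is beta_k = <T_k, r> / <T_k, T_k>,
        # and its loss reduction is <T_k, r>^2 / <T_k, T_k>; pick the best k.
        num = T.T @ residual
        denom = (T ** 2).sum(axis=0) + 1e-12
        k_star = int(np.argmax(num ** 2 / denom))
        # Move alpha_{k*} by only a small amount eps in the direction of the fit.
        step = eps * np.sign(num[k_star])
        alpha[k_star] += step
        residual -= step * T[:, k_star]
    return alpha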
16 If each |α̂_k(λ)| is monotone in λ, we have Σ_k |α_k| ≈ ε · M, and the solution of Algorithm 4 is identical to the result of lasso regression as described on page 64.
(ε, M) in Algorithm 4 plays the role of the lasso bound t (equivalently λ)
17 More about Algorithm 4
- Algorithm 4 ≈ Algorithm 3 + shrinkage
- L1 norm vs. L2 norm: more details later
- Chapter 12, after learning SVMs
18 Interpretation: Understanding the Final Model
- Single decision trees are easy to interpret
- A linear combination of trees is difficult to understand
- Which features are important?
- What is the interaction between features?
19 Relative Importance of Individual Variables
- For a single tree, define the importance of x_l as shown below
- For an additive tree expansion, average the single-tree importances over the trees
- For K-class classification, treat it as K two-class classification tasks
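The formulas that appeared on the slide as images are, in ESL's notation (standard definitions, reconstructed rather than transcribed): for a single tree T with J - 1 internal nodes,

  \mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^{\,2} \, I(v(t) = \ell),

where \hat{\imath}_t^{\,2} is the squared improvement in the fit from the split at internal node t and v(t) is the variable split on at that node; for the boosted additive expansion,

  \mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m).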
21 Partial Dependence Plots
- Visualize the dependence of the approximation f(x) on the joint values of a subset of important features
- Usually the size of the subset is small (1-3)
- Define the average, or partial, dependence (written out below)
- It can be estimated empirically using the training data
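The definition and its empirical estimate, in standard form (reconstructed, not copied from the slide): for a subset X_S of the predictors with complement X_C,

  f_S(X_S) = E_{X_C} \, f(X_S, X_C),

estimated from the training data by

  \bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, x_{iC}),

where the x_{iC} are the observed values of X_C in the training set.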
22 Equation 10.50 vs. 10.52
- They are the same if the predictor variables are independent
- Why use 10.50 instead of 10.52 to measure partial dependence?
- Example 1: f(X) = h1(X_S) + h2(X_C)
- Example 2: f(X) = h1(X_S) · h2(X_C)
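The two quantities being compared (a sketch of the standard distinction; the equation numbers are the textbook's): 10.50 averages over the marginal distribution of X_C,

  f_S(X_S) = E_{X_C} \, f(X_S, X_C),

whereas 10.52 is the conditional expectation

  \tilde{f}_S(X_S) = E\left[ f(X_S, X_C) \mid X_S \right].

These coincide when X_S and X_C are independent. In Example 1 the partial dependence recovers h1(X_S) up to an additive constant, and in Example 2 up to a multiplicative factor, even when X_S and X_C are dependent; the conditional expectation of 10.52 does not have this property.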
25 Conclusion
- Find the solution based on numerical optimization
- Control the model complexity and avoid overfitting
- Right-sized trees for boosting
- Number of iterations
- Regularization
- Understand the final model (Interpretation)
- Single variable
- Correlation of variables