Title: Boosting and Additive Trees (2)
1 Boosting and Additive Trees (2)
- Yi Zhang, Kevyn Collins-Thompson
- Advanced Statistical Seminar 11-745
- Oct 29, 2002
2 Recap: Boosting (1)
- Background: Ensemble Learning
- Boosting: Definitions, Example
- AdaBoost
- Boosting as an Additive Model
- Boosting: Practical Issues
- Exponential Loss
- Other Loss Functions
- Boosting Trees
- Boosting as Entropy Projection
- Data Mining Methods
3 Outline for This Class
- Find the solution based on numerical optimization
- Control the model complexity and avoid overfitting
- Right-sized trees for boosting
- Number of iterations
- Regularization
- Understand the final model (Interpretation)
- Single variable
- Correlation of variables
4 Numerical Optimization
- Goal: find the f that minimizes the loss function over the training data
- Gradient descent: search in the unconstrained function space to minimize the loss on the training data
- The loss on the training data converges to zero
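In symbols (a sketch of the standard functional-gradient view, following ESL Chapter 10 notation rather than the slide images): treating the fitted values f(x_i) as free parameters, minimize

  L(f) = \sum_{i=1}^{N} L(y_i, f(x_i))

by steepest descent, where the negative gradient components at step m are

  -g_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}},

and the update is f_m = f_{m-1} - \rho_m g_m, with step length \rho_m found by line search.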
5 Gradient Search on a Constrained Function Space: Gradient Tree Boosting
- Introduce a tree at the mth iteration whose predictions t_m are as close as possible to the negative gradient
- Advantage compared with unconstrained gradient search: more robust, less likely to overfit
6 Algorithm 3: MART (Multiple Additive Regression Trees)
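The algorithm itself appeared only as a figure. Below is a minimal Python sketch of gradient tree boosting for squared-error loss in the spirit of MART; the use of scikit-learn's DecisionTreeRegressor as the base learner and the function names are assumptions for illustration, not part of the original slides. For squared error the negative gradient is just the residual, and the leaf means fitted by the regression tree are already the optimal per-region constants.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost(X, y, M=100, J=6, nu=0.1):
    """Sketch of gradient tree boosting (MART-style) for squared-error loss.

    M  : number of boosting iterations
    J  : number of terminal nodes per tree (controls the interaction level)
    nu : shrinkage / learning rate
    """
    f0 = y.mean()                            # f_0: optimal constant for squared error
    f = np.full(len(y), f0)
    trees = []
    for m in range(M):
        residual = y - f                     # negative gradient of 0.5 * (y - f)^2
        tree = DecisionTreeRegressor(max_leaf_nodes=J)
        tree.fit(X, residual)                # tree regions + leaf means approximate -g_m
        f += nu * tree.predict(X)            # shrunken update f_m = f_{m-1} + nu * T_m
        trees.append(tree)
    return f0, trees

def boosted_predict(f0, trees, X, nu=0.1):
    """Evaluate the additive expansion f_0 + nu * sum_m T_m(x)."""
    return f0 + nu * sum(tree.predict(X) for tree in trees)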
7 View Boosting as a Linear Model
- Basis expansion (written out below)
- Use basis functions T_m (m = 1, ..., M; each T_m is a weak learner) to transform the input vector X into T space, then fit a linear model in this new space
- Special to boosting: the choice of basis function T_m depends on T_1, ..., T_{m-1}
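The expansion (a sketch in ESL's notation, not transcribed from the slide):

  f(x) = \sum_{m=1}^{M} \beta_m T_m(x; \gamma_m),

where each T_m is a weak learner (here, a tree with parameters \gamma_m) and boosting fits the pairs (\beta_m, \gamma_m) in a forward-stagewise fashion, which is why T_m depends on T_1, ..., T_{m-1}.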
8 Improve Boosting as a Linear Model
- Recap: linear models in Chapter 3
- Bias-variance trade-off
- Subset selection (feature selection, discrete)
- Coefficient shrinkage (smooth: ridge, lasso)
- Using derived input directions (PCA, PLS)
- Multiple-outcome shrinkage and selection
- Exploit correlations among different outcomes
- This chapter: improve boosting
- Size of the constituent trees, J
- Number of boosting iterations, M (subset selection)
- Regularization (shrinkage)
9 Right-Sized Trees for Boosting (?)
- The best tree for one step is not the best in the long run
- Using a very large tree (such as C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion; this usually degrades performance and increases computation
- Simple approach: restrict all trees to the same size J
- J limits the interaction level of the tree-based approximation among input features (see the ANOVA-style expansion below)
- In practice low-order interaction effects tend to dominate, and empirically 4 ≤ J ≤ 8 works well (?)
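One way to make the interaction-level claim precise (a sketch following the standard ANOVA-type expansion in ESL, not taken from the slides):

  f(X) = \sum_j f_j(X_j) + \sum_{j,k} f_{jk}(X_j, X_k) + \sum_{j,k,l} f_{jkl}(X_j, X_k, X_l) + \cdots

A tree with J terminal nodes can capture interactions of order at most J - 1, so restricting J truncates this expansion at low-order terms.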
11 Number of Boosting Iterations (Subset Selection)
- Boosting will overfit as M → ∞
- Use a validation set (sketched below)
- Other methods (later)
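A minimal sketch of choosing M with a validation set. It assumes scikit-learn's GradientBoostingRegressor and a synthetic dataset from make_friedman1; neither the library nor the data appears in the original slides.

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data; split off a validation set to pick M.
X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit with a deliberately large number of iterations.
gbm = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1, max_leaf_nodes=6)
gbm.fit(X_train, y_train)

# Validation loss after each boosting iteration; choose the M that minimizes it.
val_loss = [mean_squared_error(y_val, pred) for pred in gbm.staged_predict(X_val)]
best_M = int(np.argmin(val_loss)) + 1
print("chosen number of boosting iterations:", best_M)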
12 Shrinkage
- Scale the contribution of each tree by a factor 0 < ν < 1 to control the learning rate
- Both ν and M control prediction risk on the training data, and they operate dependently
- Smaller ν requires larger M (ν ↓ ⇒ M ↑)
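The shrunken update, in its standard form (a sketch; ν is the shrinkage factor above):

  f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J} \gamma_{jm} \, I(x \in R_{jm}), \qquad 0 < \nu < 1,

where the R_{jm} are the regions of the mth tree and the \gamma_{jm} are its terminal-node constants.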
14 Penalized Regression
- Ridge regression or lasso regression (objectives written out below)
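For reference (standard definitions, not transcribed from the slide): with coefficients α on basis functions T_k,

  \hat{\alpha}(\lambda) = \arg\min_{\alpha} \sum_{i=1}^{N} \Big( y_i - \sum_{k} \alpha_k T_k(x_i) \Big)^2 + \lambda \, J(\alpha),

where J(\alpha) = \sum_k \alpha_k^2 gives ridge regression and J(\alpha) = \sum_k |\alpha_k| gives the lasso.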
15 Algorithm 4: Forward Stagewise Linear Regression
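The algorithm itself appeared only as a figure. Below is a minimal Python sketch of forward stagewise linear regression over a fixed dictionary of basis functions; the function name, the precomputed basis matrix T, and the step size eps are placeholders for illustration, not taken from the slides.

import numpy as np

def forward_stagewise(T, y, eps=0.01, M=5000):
    """Sketch of forward stagewise linear regression.

    T   : (N, K) matrix with entries T_k(x_i), the basis functions evaluated on the data
    y   : (N,) response vector
    eps : small step size; M small steps are taken in total
    """
    N, K = T.shape
    alpha = np.zeros(K)
    residual = y.astype(float).copy()         # residual starts at y since alpha = 0
    for _ in range(M):
        # For each k, the best single-coordinate move is beta_k = <T_k, r> / <T_k, T_k>,
        # and its loss reduction is <T_k, r>^2 / <T_k, T_k>; pick the best k.
        num = T.T @ residual
        denom = (T ** 2).sum(axis=0) + 1e-12
        k_star = int(np.argmax(num ** 2 / denom))
        # Move alpha_{k*} by only a small amount eps in the direction of the fit.
        step = eps * np.sign(num[k_star])
        alpha[k_star] += step
        residual -= step * T[:, k_star]
    return alpha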
16 If each |α̂_k(λ)| is monotone in λ, we have Σ_k |α_k| ≈ ε · M, and the solution of Algorithm 4 is identical to the result of lasso regression as described on page 64.
(ε, M) in Algorithm 4 plays the role of the lasso bound t (equivalently λ)
17 More about Algorithm 4
- Algorithm 4 ≈ Algorithm 3 + shrinkage
- L1 norm vs. L2 norm: more details later
- Chapter 12, after learning SVMs
18 Interpretation: Understanding the Final Model
- Single decision trees are easy to interpret
- A linear combination of trees is difficult to understand
- Which features are important?
- What is the interaction between features?
19 Relative Importance of Individual Variables
- For a single tree, define the importance of x_l as shown below
- For an additive tree expansion, average the single-tree importances over the trees
- For K-class classification, treat it as K two-class classification tasks
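The formulas that appeared on the slide as images are, in ESL's notation (standard definitions, reconstructed rather than transcribed): for a single tree T with J - 1 internal nodes,

  \mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^{\,2} \, I(v(t) = \ell),

where \hat{\imath}_t^{\,2} is the squared improvement in the fit from the split at internal node t and v(t) is the variable split on at that node; for the boosted additive expansion,

  \mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m).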
21 Partial Dependence Plots
- Visualize the dependence of the approximation f(x) on the joint values of a subset of important features
- Usually the size of the subset is small (1-3)
- Define the average, or partial, dependence (written out below)
- It can be estimated empirically using the training data
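The definition and its empirical estimate, in standard form (reconstructed, not copied from the slide): for a subset X_S of the predictors with complement X_C,

  f_S(X_S) = E_{X_C} \, f(X_S, X_C),

estimated from the training data by

  \bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, x_{iC}),

where the x_{iC} are the observed values of X_C in the training set.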
22 Equation 10.50 vs. 10.52
- They are the same if the predictor variables are independent
- Why use 10.50 instead of 10.52 to measure partial dependence?
- Example 1: f(X) = h1(X_S) + h2(X_C)
- Example 2: f(X) = h1(X_S) · h2(X_C)
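The two quantities being compared (a sketch of the standard distinction; the equation numbers are the textbook's): 10.50 averages over the marginal distribution of X_C,

  f_S(X_S) = E_{X_C} \, f(X_S, X_C),

whereas 10.52 is the conditional expectation

  \tilde{f}_S(X_S) = E\left[ f(X_S, X_C) \mid X_S \right].

These coincide when X_S and X_C are independent. In Example 1 the partial dependence recovers h1(X_S) up to an additive constant, and in Example 2 up to a multiplicative factor, even when X_S and X_C are dependent; the conditional expectation of 10.52 does not have this property.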
25 Conclusion
- Find the solution based on numerical optimization
- Control the model complexity and avoid overfitting
- Right-sized trees for boosting
- Number of iterations
- Regularization
- Understand the final model (Interpretation)
- Single variable
- Correlation of variables