1
Additive Models and Trees
  • Lecture Notes for CMPUT 466/551
  • Nilanjan Ray

Principal Source: Department of Statistics, CMU
2
Topics to cover
  • GAM: Generalized Additive Models
  • CART: Classification and Regression Trees
  • MARS: Multivariate Adaptive Regression Splines

3
Generalized Additive Models
What is GAM?
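The model itself appeared as an equation image on the original slide; in the standard form of HTF (Ch. 9) it is

  E(Y \mid X_1, X_2, \dots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)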
The functions fj are, in general, smoothing functions such as splines, kernel smoothers, linear functions, and so on. Each function can be different; e.g., f1 can be linear while f2 is a natural spline, etc.
Compare GAM with linear basis expansions (Ch. 5 of HTF): what are the similarities and dissimilarities? Is there any similarity (in principle) with the Naïve Bayes model?
4
Smoothing Functions in GAM
  • Non-parametric functions (linear smoothers)
  • Smoothing splines (basis expansion)
  • Simple k-nearest neighbors (raw moving average)
  • Locally weighted averages using kernel weighting
  • Local linear regression, local polynomial regression
  • Linear functions
  • Functions of more than one variable (interaction terms)
  • Example (a minimal moving-average smoother is sketched below)
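As a concrete illustration of the simplest linear smoother above (the original slide showed a figure here), a minimal sketch of a k-nearest-neighbor running-mean smoother; the function name knn_smooth and the use of NumPy are assumptions for illustration:

import numpy as np

def knn_smooth(x, y, k):
    """Running-mean smoother: estimate f(x0) at each training point x0
    as the average of the y-values of its k nearest neighbors in x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    fitted = np.empty_like(y)
    for i, x0 in enumerate(x):
        # indices of the k training points closest to x0
        nearest = np.argsort(np.abs(x - x0))[:k]
        fitted[i] = y[nearest].mean()
    return fitted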

5
Learning GAM: Backfitting
  • Backfitting algorithm
  • Initialize the intercept and set each fj to zero
  • Cycle over j = 1, 2, ..., p, 1, 2, ..., p, ... (m cycles), refitting each fj to the partial residuals (see the sketch below)
  • Stop when the functions change less than a prespecified threshold
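A minimal sketch of the backfitting loop of HTF Algorithm 9.1, assuming a generic one-dimensional smoother smooth (e.g., a spline or kernel smoother) supplied by the caller; the function and parameter names are illustrative, not from the original slides:

import numpy as np

def backfit(X, y, smooth, n_cycles=20, tol=1e-4):
    """Additive model y ~ alpha + sum_j f_j(X[:, j]) via backfitting.
    smooth(x, r) fits a 1-D smoother of residuals r on feature x and
    returns its fitted values at the training points."""
    n, p = X.shape
    alpha = y.mean()                      # initialize intercept
    f = np.zeros((n, p))                  # fitted f_j at training points
    for _ in range(n_cycles):
        max_change = 0.0
        for j in range(p):
            # partial residuals: remove everything except f_j
            r = y - alpha - f.sum(axis=1) + f[:, j]
            new_fj = smooth(X[:, j], r)
            new_fj -= new_fj.mean()       # center f_j so alpha stays identifiable
            max_change = max(max_change, np.abs(new_fj - f[:, j]).max())
            f[:, j] = new_fj
        if max_change < tol:              # functions changed less than threshold
            break
    return alpha, f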

6
Backfitting: Points to Ponder
Computational advantage? Convergence? How do we choose the fitting functions?
7
Example: Generalized Logistic Regression Model
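The model shown on the original slide is the additive logistic regression model of HTF (Eq. 9.8), which replaces the linear logit with an additive one:

  \log \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = \alpha + f_1(X_1) + \cdots + f_p(X_p)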
8
Additive Logistic Regression: Backfitting
Fitting logistic regression (HTF, p. 99) and fitting additive logistic regression (HTF, p. 262) follow the same iteratively reweighted scheme:

Fitting logistic regression (p. 99)
1. Initialize the parameter estimates.
2. Iterate:
  a. Compute the working responses zi from the current fit.
  b. Compute the weights wi from the current fitted probabilities.
  c. Use weighted least squares to fit a linear model to zi with weights wi, giving new estimates.
3. Continue step 2 until convergence.

Fitting additive logistic regression (p. 262)
1. Initialize the intercept and set each fj to zero.
2. Iterate:
  a. Compute the working responses zi from the current fit.
  b. Compute the weights wi from the current fitted probabilities.
  c. Use the weighted backfitting algorithm to fit an additive model to zi with weights wi, giving new estimates.
3. Continue step 2 until convergence.
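A minimal sketch of the additive (local scoring) version, assuming a weighted variant of the backfitting routine sketched earlier (here called weighted_backfit, an illustrative name); HTF Algorithm 9.2 is simplified to its outer loop:

import numpy as np

def local_scoring(X, y, weighted_backfit, n_iter=10):
    """Additive logistic regression via local scoring (HTF Alg. 9.2, simplified).
    weighted_backfit(X, z, w) fits an additive model to targets z with
    weights w and returns (alpha, fitted), where fitted[i] = sum_j f_j(x_ij)."""
    n = len(y)
    ybar = y.mean()
    alpha = np.log(ybar / (1.0 - ybar))       # starting intercept
    fitted = np.zeros(n)                      # sum_j f_j(x_i), initially zero
    for _ in range(n_iter):
        eta = alpha + fitted                  # current additive predictor
        p = 1.0 / (1.0 + np.exp(-eta))        # current fitted probabilities
        z = eta + (y - p) / (p * (1.0 - p))   # working responses
        w = p * (1.0 - p)                     # working weights
        alpha, fitted = weighted_backfit(X, z, w)
    return alpha, fitted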
9
SPAM Detection via Additive Logistic Regression
  • Input variables (predictors)
  • 48 quantitative variables: the percentage of words in the email that match a given word; examples include business, address, internet, etc.
  • 6 quantitative variables: the percentage of characters in the email that match a given character, such as ch; (semicolon), ch( (parenthesis), etc.
  • The average length of uninterrupted sequences of capital letters
  • The length of the longest uninterrupted sequence of capital letters
  • The sum of the lengths of uninterrupted sequences of capital letters
  • Output variable: spam (1) or email (0)
  • The fj's are taken as cubic smoothing splines

10
SPAM Detection Results
Sensitivity: probability of predicting spam when the true state is spam.
Specificity: probability of predicting email when the true state is email.
11
GAM Summary
  • Useful, flexible extensions of linear models
  • The backfitting algorithm is simple and modular
  • The interpretability of the predictors (input variables) is not obscured
  • Not suitable for very large data-mining applications (why?)

12
CART
  • Overview
  • Principle behind: divide and conquer
  • Partition the feature space into a set of rectangles
  • For simplicity, use recursive binary partitions
  • Fit a simple model (e.g., a constant) in each rectangle
  • Classification and Regression Trees (CART)
  • Regression trees
  • Classification trees
  • Popular in medical applications

13
CART
  • An example (in the regression case)

14
Basic Issues in Tree-based Methods
  • How to grow a tree?
  • How large should we grow the tree?

15
Regression Trees
  • Partition the space into M regions R1, R2, ..., RM and fit a constant in each region
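A minimal reconstruction of the model shown as an equation image on the original slide, in HTF's notation (Eqs. 9.10-9.11):

  f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m), \qquad \hat{c}_m = \operatorname{ave}(y_i \mid x_i \in R_m)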

Note that this is still an additive model
16
Regression Trees: Grow the Tree
  • The best partition: minimize the sum of squared errors
  • Finding the global minimum is computationally infeasible
  • Greedy algorithm: at each level, choose the splitting variable j and split point s as shown below
  • The greedy algorithm makes the tree unstable
  • Errors made at the upper levels are propagated to the lower levels
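The greedy split criterion (an equation image on the original slide) is, as in HTF (Eqs. 9.13-9.14): define the half-planes R1(j, s) = {X | Xj <= s} and R2(j, s) = {X | Xj > s}, and solve

  \min_{j,\, s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 \;+\; \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]

where the inner minima are attained by the mean of the yi in each region.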

17
Regression Trees: How Large Should We Grow the Tree?
  • Trade-off between bias and variance
  • Very large tree: overfitting (low bias, high variance)
  • Small tree (low variance, high bias): might not capture the structure
  • Strategies
  • 1. Split only when the split decreases the error (usually too short-sighted)
  • 2. Cost-complexity pruning (preferred)

18
Regression Trees - Pruning
  • Cost-complexity pruning
  • Pruning: collapsing some internal nodes
  • Cost complexity (see the criterion below)
  • Choose the best alpha by weakest-link pruning (p. 270, HTF)
  • Each time, collapse the internal node that adds the smallest error
  • Choose the best tree from this tree sequence by cross-validation
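The cost-complexity criterion referenced above, in HTF's notation (Eq. 9.16), where |T| is the number of terminal nodes of subtree T, Nm is the number of observations in region Rm, and Qm(T) is the within-node squared error:

  C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|

Larger alpha penalizes larger trees; alpha = 0 gives the full tree.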

19
Classification Trees
  • Classify the observations in node m to the majority class in the node
  • pmk is the proportion of observations of class k in node m
  • Define an impurity measure for a node (written out below):
  • Misclassification error
  • Entropy
  • Gini index
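The three impurity measures, written out as in HTF (Eq. 9.17), with k(m) the majority class in node m:

  \text{Misclassification error: } 1 - \hat{p}_{m\,k(m)}
  \text{Entropy: } -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}
  \text{Gini index: } \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk})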

20
Classification Trees
(Figure: node impurity measures versus class proportion for the 2-class problem)
  • Entropy and Gini are more sensitive to changes in the node probabilities
  • To grow the tree, use entropy or Gini
  • To prune the tree, use the misclassification rate (or any other measure)

21
Tree-based Methods: Discussion
  • Categorical predictors
  • Problem: splitting a node t into tL and tR on a categorical predictor with q possible values allows 2^(q-1) - 1 distinct partitions (e.g., 7 partitions for q = 4)!
  • Remedy: order the categories, e.g., by the proportion of class 1, and then treat the predictor as ordered

22
Tree-based Methods: Discussion
  • Linear combination splits
  • Split the node on a linear combination of the inputs, sum_j aj Xj <= s
  • Can improve predictive power
  • But hurts interpretability
  • Instability of trees
  • Inherited from the hierarchical nature of the procedure
  • Bagging (Section 8.7 of HTF) can reduce the variance

23
Bootstrap Trees
Construct B trees from B bootstrap samples: the "bootstrap trees".
24
Bootstrap Trees
25
Bagging the Bootstrap Trees
The bagged estimate averages the B fits: f_bag(x) = (1/B) sum_b f*b(x), where f*b(x) is computed from the bth bootstrap sample (in this case, a tree).
Bagging reduces the variance of the original tree estimate by aggregation.
26
Bagged Tree Performance
Aggregation by majority vote vs. by averaging class probabilities (see the sketch below)
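A minimal sketch of the two aggregation rules, assuming a list of already-fitted classifiers with scikit-learn-style predict / predict_proba methods and integer class labels 0..K-1 (the helper names are illustrative):

import numpy as np

def bagged_predict_vote(trees, X):
    """Consensus: each bootstrap tree votes for a class; take the majority."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (B, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)

def bagged_predict_average(trees, X):
    """Average the predicted class probabilities, then pick the largest."""
    probs = np.mean([t.predict_proba(X) for t in trees], axis=0)  # (n_samples, K)
    return probs.argmax(axis=1)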
27
MARS
  • In multi-dimensional splines, the number of basis functions grows exponentially: the curse of dimensionality
  • A partial remedy is a greedy forward search algorithm
  • Create a simple basis-construction dictionary
  • Construct basis functions on the fly
  • Choose the best-fitting basis function at each step

28
Basis functions
  • 1-dim linear spline: the reflected pair (x - t)+ and (t - x)+, where t represents the knot
  • Basis collection C: one reflected pair for each input Xj, with knots at every observed value xij
  • |C| = 2 N p
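Written out as in HTF (Eq. 9.18), with (x)+ denoting max(x, 0):

  C = \big\{ (X_j - t)_+, \; (t - X_j)_+ \;:\; t \in \{x_{1j}, x_{2j}, \dots, x_{Nj}\}, \; j = 1, \dots, p \big\}, \qquad |C| = 2Np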

29
The MARS procedure (1st stage)
  • 1. Initialize the basis set M with a constant function
  • 2. Form candidate terms (products of functions already in M with elements of the set C)
  • 3. Add the best-fitting basis pair (the one that decreases the residual error the most) to M
  • 4. Repeat from step 2 (until, e.g., |M| exceeds a preset size)

(Diagram on the original slide: the new set M is formed from the old M and its products with elements of C)
30
The MARS procedure (2nd stage)
  • The final model M typically overfits the data
  • => Need to reduce the model size (number of terms)
  • Backward deletion procedure:
  • 1. Remove the term whose removal causes the smallest increase in residual error
  • 2. Compute the GCV criterion (next slide) for the resulting model
  • 3. Repeat from step 1
  • Choose the model size with the minimum GCV.

31
Generalized Cross Validation (GCV)
  • M(.) measures the effective number of parameters:
  • r = number of linearly independent basis functions
  • K = number of knots selected
  • c = 3
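The criterion itself (an equation image on the original slide) is, as in HTF (Eq. 9.20):

  \mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} \big(y_i - \hat{f}_\lambda(x_i)\big)^2}{\big(1 - M(\lambda)/N\big)^2}, \qquad M(\lambda) = r + cK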

32
Discussion
  • Piecewise-linear reflected basis functions
  • Allow operation on a local region
  • Fitting N reflected basis pairs takes O(N) instead of O(N^2) operations
  • The left part is zero and the right parts differ only by a constant, so the fit can be updated cheaply as the knot moves from one data point to the next

(Figure: reflected basis pairs with knots at successive data points x_{i-1}, x_i, x_{i+1}, x_{i+2})
33
Discussion (continued)
  • Hierarchical model (reduces the search computation)
  • A high-order term can enter only if some of its lower-order "footprints" already exist in the model
  • Restriction: each input appears at most once in a product
  • e.g., a product like (Xj - t1)+ (Xj - t1)+, involving the same input twice, is not considered
  • Set an upper limit on the order of interactions
  • An upper limit of 1 => an additive model
  • MARS for classification
  • Use a multi-response Y (an N x K indicator matrix)
  • A masking problem may occur
  • Better solution: optimal scoring (Section 12.5 of HTF)

34
MARS & CART relationship
  • IF we
  • replace the piecewise-linear basis functions by step functions, and
  • keep only the newly formed product terms in M (so terms correspond to the leaf nodes of a binary tree),
  • THEN
  • the MARS forward procedure is the same as
  • the CART tree-growing procedure