Boosted Lasso - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Boosted Lasso


1
Boosted Lasso
Bin Yu, Statistics Department, University of California, Berkeley
http://www.stat.berkeley.edu/users/binyu
Joint work with Peng Zhao
2
IT Is Providing a Golden Time for Statistics
  • Massive data sets are collected in the IT age and
    need to be stored, transmitted, and analyzed:
  • Functional genomics, Remote sensing, Sensor
    networks,
  • Internet tomography, Neuroscience, Finance, ...
  • Statistics is the science indispensable for
    making sense of these massive data, and
    computationally feasible data reduction/feature
    selection is the key in most, if not all, IT
    statistical problems.

3
How is feature selection done?
  • Feature selection based on subject knowledge
  • Use of the most powerful computers
  • Feature selection through model selection
    criteria (AIC, BIC, MDL, ...)
  • Expensive combinatorial search
  • Today's talk:
  • Boosted Lasso (BLasso) for feature/subset
    selection

4
Gene Subset (Feature) Selection
  • Microarray technologies provide thousands of
    (correlated) gene expressions to choose from,
    say, for prediction of tumor status of a cell

  • p > n
  • Model selection criteria seek concise and
    interpretable models which can also predict well
    on tumor status
  • MML (Minimum Message Length, for classification)
    (Wallace and Boulton, 68)
  • Cp (Mallows, 73)
  • AIC (Akaike, 73, 74)
  • BIC (Schwarz, 78)
  • MDL (Minimum Description Length)
    (Rissanen, 78)

5
Computational hurdle for Model Selection
  • Jornsten and Yu (2003) use MDL for simultaneous
    gene selection and sample classification, with
    good results. They pre-selected about 100 or so
    genes from 6000.
  • There were still 2^100 ≈ 1.3 x 10^30 possible
    subsets (see the quick check below).
  • Combinatorial search over all subsets is too
    expensive.
  • A recent alternative: continuous embedding into a
    convex optimization problem through Boosting -- a
    third generation computational method in
    statistics or machine learning.
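
A quick check of the subset count above (a minimal sketch; the
figure of 100 pre-selected genes is taken from the slide):

  # Each of the 100 pre-selected genes is either in or out of
  # the model, so there are 2**100 candidate subsets.
  n_preselected = 100
  n_subsets = 2 ** n_preselected
  print(format(n_subsets, ".2e"))  # about 1.27e+30, i.e. roughly 1.3 x 10^30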

6
Computation for Statistical Inference
  • First generation computation in statistics
    (before computers): use parametric models with
    closed-form solutions for maximum likelihood
    estimators or Bayes estimators.
  • Second generation computation (with computers):
    design statistically optimal procedures and worry
    about computation later. Call optimization
    routines.
  • Third generation computation: form statistical
    goals with computation in mind and take advantage
    of special features of statistical computation.

7
Boosting: computation and estimation in one
  • Boosting originated from Probably Approximately
    Correct (PAC) learning (Schapire, 1991), but was
    made practical by Freund and Schapire (1996)
    (AdaBoost).
  • It has enjoyed impressive empirical successes.
  • Boosting idea: choose a base procedure, apply it
    sequentially to modified data, and then linearly
    combine the resulting estimators (see the sketch
    below).
  • Cross-validation or a test set is commonly used
    to stop the iterations early.
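
A minimal sketch of this boosting loop, assuming squared loss so
that "modified data" means current residuals; the base-procedure
interface and the validation-set stopping rule are illustrative
assumptions, not from the slides:

  import numpy as np

  def boost(X, y, base_fit, X_val, y_val, max_iter=500):
      """Fit base_fit sequentially to residuals and linearly combine the pieces."""
      pred = np.zeros(len(y))
      pred_val = np.zeros(len(y_val))
      best_val, best_iter = np.inf, 0
      learners = []
      for t in range(max_iter):
          h = base_fit(X, y - pred)       # apply base procedure to modified data
          pred += h(X)                    # add the new piece to the combination
          pred_val += h(X_val)
          learners.append(h)
          val_err = np.mean((y_val - pred_val) ** 2)
          if val_err < best_val:          # track the best validation error so far
              best_val, best_iter = val_err, t + 1
      return learners[:best_iter]         # early stopping: keep the best prefix

Summing h(x) over the returned learners gives the boosted estimator.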

8
Boosting: a gradient view
  • An important step forward for understanding
    boosting was made via the gradient view of
    boosting (Breiman, Mason et al, Friedman et al,
    around 2000).
  • L2 boosting, i.e. boosting with squared loss in
    the regression case, amounts to refitting of
    residuals (twicing, Tukey, 1972).
  • The number of iteration steps is the
    regularization parameter.
  • Minimax optimality of L2 boosting in the 1-dim
    case (Buhlmann and Yu, 03).

9
L2 Boosting (with fixed small steps)
  • Regression set-up with standardized data:
  • Xi a p-dimensional predictor, Yi the response
    variable.
  • We consider the squared error loss and let j
    index the jth predictor, for j = 1, ..., p.
  • L2 Boosting with a fixed step size is Forward
    Stagewise Fitting (Efron et al, 2004),
  • where ε (small) is the step size (see the display
    below).
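
The displayed loss and update were images in the original; a
standard reconstruction, with e_j the j-th coordinate direction
and ε > 0 an assumed symbol for the small step size, is:

  L(\beta) = \sum_{i=1}^{n} \Big( Y_i - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2,
  \qquad
  (\hat{j}_t, \hat{s}_t) = \arg\min_{j,\ s = \pm\varepsilon} L(\hat{\beta}_t + s e_j),
  \qquad
  \hat{\beta}_{t+1} = \hat{\beta}_t + \hat{s}_t e_{\hat{j}_t}.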

10
Lasso (Tibshirani, 1996)
  • Lasso (Least Absolute Shrinkage and Selection
    Operator) minimizes the squared loss plus an L1
    penalty, where λ is a regularization parameter
    (see the display below).
  • Lasso often gives sparse solutions due to the L1
    penalty, so it is an alternative to model or
    subset selection (cf. Donoho et al, 2004).
  • For each fixed λ, it is a quadratic programming
    (QP) problem, but we need to choose a λ based on
    data to use it.
  • Until recently, many QPs were run for different λ
    values and cross-validation was often used to
    choose one λ.
  • So Lasso computation is not cheap via QP and CV.
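
The Lasso display was lost in extraction; its standard form,
consistent with the notation above (λ and β are the usual
symbols, assumed here), is:

  \hat{\beta}(\lambda) = \arg\min_{\beta} \sum_{i=1}^{n} \Big( Y_i - \sum_{j=1}^{p} \beta_j X_{ij} \Big)^2
    + \lambda \sum_{j=1}^{p} |\beta_j| , \qquad \lambda \ge 0 .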

11
Striking Similarity (Efron et al, 2004)
  [Figure: coefficient paths of Lasso and of L2
  Boosting with small steps]

The paths are not always the same.
12
However, L2 Boosting -- Often Too Greedy
[Figure: coefficient paths of L2 Boosting and of Lasso]
13
Lasso trade-off between empirical loss and penalty
  • If we run coordinate gradient descent with a
    fixed step size on the Lasso loss, at each
    iteration t the change splits into an empirical
    loss term and a penalty term (see the display
    below).

Trade-off
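
The formula behind this trade-off was an image; a reconstruction
of the standard decomposition, writing the Lasso loss as
Γ(β; λ) = L(β) + λ‖β‖₁ and taking a coordinate step of size ε
(notation assumed, as above), is:

  \Gamma(\beta + s e_j;\ \lambda) - \Gamma(\beta;\ \lambda)
    = \big[ L(\beta + s e_j) - L(\beta) \big]
    + \lambda \big( \|\beta + s e_j\|_1 - \|\beta\|_1 \big),
  \qquad s = \pm\varepsilon .

The first bracket is the change in empirical loss; the second is
λ times the change in the L1 penalty.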
14
Forward and Backward
  • Two ways one can reduce the Lasso loss:
  • Forward step -- reduce the empirical loss (L2
    boosting).
  • Backward step -- reduce the penalty.
  • Let's start with a forward step to get the first
    estimate, and use it to set an initial λ (see the
    display below).
15
Going Back and Forth: BLasso (Zhao and Yu, 2004)
  • Find the backward direction that leads to the
    minimal empirical loss.
  • Take that direction if it leads to a decrease in
    the Lasso loss.
  • Otherwise force a forward step.
  • Relax λ whenever necessary (see the sketch below).
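
A minimal Python sketch of this back-and-forth idea for the
squared loss (a simplified illustration of the steps listed
above, not the authors' exact algorithm; the initialization of
lambda, the tolerance xi, and the stopping rule are assumptions):

  import numpy as np

  def blasso(X, y, eps=0.01, xi=1e-6, max_iter=10000):
      """Simplified BLasso-style sketch for the squared loss L(b) = ||y - X b||^2."""
      p = X.shape[1]
      E = np.eye(p)                                  # coordinate directions e_j
      L = lambda b: np.sum((y - X @ b) ** 2)         # empirical loss

      # Initial forward step; set lambda from the resulting decrease in loss.
      beta = np.zeros(p)
      loss0 = L(beta)
      f_loss, j, s = min((L(s * eps * E[k]), k, s)
                         for k in range(p) for s in (-1, 1))
      lam = (loss0 - f_loss) / eps
      beta[j] += s * eps

      for _ in range(max_iter):
          # Backward candidate: cheapest shrinkage step among nonzero coordinates.
          nz = np.flatnonzero(beta)
          take_backward = False
          if nz.size:
              b_loss, bj = min((L(beta - np.sign(beta[k]) * eps * E[k]), k) for k in nz)
              lasso_now = L(beta) + lam * np.abs(beta).sum()
              take_backward = b_loss + lam * (np.abs(beta).sum() - eps) < lasso_now - xi
          if take_backward:
              beta[bj] -= np.sign(beta[bj]) * eps    # backward step lowers the Lasso loss
          else:
              # Forward step: the one that most reduces the empirical loss.
              f_loss, fj, fs = min((L(beta + s * eps * E[k]), k, s)
                                   for k in range(p) for s in (-1, 1))
              old_loss = L(beta)
              beta[fj] += fs * eps
              # Relax lambda when the best forward step no longer pays for itself.
              lam = min(lam, (old_loss - f_loss) / eps)
          if lam <= 0:                               # path has reached lambda = 0
              break
      return beta, lam

The intent is that, as eps shrinks, the visited (beta, lambda)
pairs approximate the Lasso regularization path.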

16
BLasso Converges to the Lasso Path as ε Goes to Zero
  [Figure: BLasso path vs. Lasso path]

17
Example: Classification
  1-norm SVM (Zhu et al, 2003); nondifferentiable
  loss function (Zhu et al, 2004)
18
Example: Generalized BLasso, Linear Regression
with an Lγ (bridge) Penalty
Bridge parameter γ
19
LARS, Boosting and BLasso
  • LARS (Least Angle Regression, Efron et al 2004)
  • Very efficient for giving the exact Lasso path
    for the least squares problem when given a small
    number of predictors.
  • Does not deal with other loss functions.
  • Not adequate when given a large or infinite
    number of predictors, e.g. trees, wavelets and
    splines (nonparametric estimation).
  • Not adaptive, i.e. when data change, one needs to
    re-run the algorithm.
  • Boosting (with fixed small step size)
  • Deals with different loss functions.
  • Deals with nonparametric estimation.
  • Often too greedy (always descending), not a good
    approximation to Lasso.
  • BLasso
  • Boosting's pros plus convergence to Lasso.
  • Adaptive learning in the time series context (our
    current research).
  • Connection between statistical estimation and the
    Barrier Method.

20
Barrier Method (one of the Interior Point Methods)
  • Objective: minimize a convex loss under convex
    constraints, i = 1, ..., m.
  • Method: using barrier functions, minimize the
    barrier objective (1) shown below, for each
    μ > 0. As μ goes to 0, this becomes equivalent to
    the constrained optimization problem above.
  • Algorithm:
  • (Inner loop) Minimize (1) starting from the last
    iteration's solution.
  • (Outer loop) μ ← ξμ, where the constant ξ < 1.
  • Iterate till μ is close to 0.
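
The display referred to as (1) was lost; for constraints
g_i(x) ≤ 0, its standard log-barrier form, with μ the barrier
parameter and f the convex loss (symbols assumed), is:

  \min_{x} \; f(x) - \mu \sum_{i=1}^{m} \log\big( -g_i(x) \big), \qquad \mu > 0 .   (1)

As μ → 0, the minimizers of (1) trace the central path toward
the solution of the constrained problem.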

21
Comparing BLasso with the Barrier Method
  • Similarities
  • Both have a tuning parameter, an inner loop, and
    an outer loop (inner for fixed λ, outer for
    updating λ). The inner and outer loops are tied
    together to save on computation.
  • Differences
  • Desirable tuning parameter value:
  • Barrier method: known (0).
  • BLasso: unknown, determined via, say, early
    stopping (CV or gMDL).
  • Inner loop computation cost:
  • Barrier method: expensive (Newton's method).
  • BLasso: inexpensive (approximate gradient using
    differences).

22
Statistical computation is special
  • Even in parametric inference, exact optimization
    is unnecessary:
  • A root-n consistent estimator can be made
    efficient (achieving the Fisher information lower
    bound) by one Newton step on the likelihood
    function, and root-n consistent estimators are
    obtainable with inexpensive methods such as the
    method of moments.
  • In nonparametric inference, exact optimization is
    also not necessary, and many good regularization
    paths exist.
  • In boosting or neural net estimation, early
    stopping is used, and even the objective function
    could be a surrogate (L2 in the classification
    case). In BLasso, 0 is the exact solution for
    tuning parameter λ = ∞, and λ moves so slowly
    that the iterate is always in a neighborhood of
    the optimal solution, where gradient steps
    converge very fast.

23
Summary
  • BLasso has forward (boosting) and backward (new)
    steps.
  • BLasso approximates the Lasso path in all
    situations (convergence proved for continuously
    differentiable loss functions).
  • Generalized BLasso deals with other loss and
    penalty functions.

24
Current Work
  • Methodology
  • Computationally Efficient Early Stopping Rules
    (e.g. gMDL, CV)
  • Grouped Lasso
  • Online Learning through BLasso
  • General Algorithmic Regularization path (e.g.
    Graphical Models)
  • Boosting and Semi-supervised Learning for High
    Dimensional Data
  • Applications
  • Neuroscience
  • Sensor Networks
  • Finance