Title: Boosted Lasso
1 Boosted Lasso
Bin Yu, Statistics Department, University of California, Berkeley
http://www.stat.berkeley.edu/users/binyu
Joint work with Peng Zhao
2 IT Is Providing a Golden Time for Statistics
- Massive data sets are collected in the IT age and need to be stored, transmitted, and analyzed
- Functional genomics, remote sensing, sensor networks, Internet tomography, neuroscience, finance, ...
- Statistics is the science indispensable for making sense of these massive data, and computationally feasible data reduction/feature selection is the key in most, if not all, IT statistical problems.
3 How Is Feature Selection Done?
- Feature selection based on subject knowledge
- Use of the most powerful computers
- Feature selection through model selection criteria (AIC, BIC, MDL, ...)
- Expensive combinatorial search
- Today's talk: Boosted Lasso (BLasso) for feature/subset selection
4 Gene Subset (Feature) Selection
- Microarray technologies provide thousands of (correlated) gene expressions to choose from, say, for prediction of the tumor status of a cell
- p > n
- Model selection criteria seek concise and interpretable models that can also predict tumor status well:
- MML (Minimum Message Length, for classification) (Wallace and Boulton, 1968)
- Cp (Mallows, 1973)
- AIC (Akaike, 1973, 1974)
- BIC (Schwarz, 1978)
- MDL (Minimum Description Length) (Rissanen, 1978)
5 Computational Hurdle for Model Selection
- Jornsten and Yu (2003) used MDL for simultaneous gene selection and sample classification with good results. They pre-selected about 100 or so genes from 6000.
- There were still 2^100 ≈ 1.3 × 10^30 possible subsets.
- Combinatorial search over all subsets is too expensive.
- A recent alternative: continuous embedding into a convex optimization problem through boosting, a third-generation computational method in statistics or machine learning.
6 Computation for Statistical Inference
- First-generation computation in statistics (before computers): use parametric models with closed-form solutions for maximum likelihood estimators or Bayes estimators.
- Second-generation computation (with computers): design statistically optimal procedures and worry about computation later; call optimization routines.
- Third-generation computation: form statistical goals with computation in mind and take advantage of special features of statistical computation.
7 Boosting: Computation and Estimation in One
- Boosting originated from Probably Approximately Correct (PAC) learning (Schapire, 1991), but was made practical by Freund and Schapire (1996) with AdaBoost.
- It has enjoyed impressive empirical successes.
- Boosting idea: choose a base procedure and apply it sequentially to modified data, then linearly combine the resulting estimators.
- Cross-validation or a test set is commonly used to stop the iterations early.
8 Boosting: A Gradient View
- An important step forward in understanding boosting was the gradient view of boosting by Breiman, Mason et al., and Friedman et al., around 2000.
- L2 boosting, i.e. boosting with squared loss in the regression case, is repeated refitting of residuals (twicing; Tukey, 1972).
- The number of iteration steps is the regularization parameter.
- Minimax optimality of L2 boosting in the 1-dimensional case (Bühlmann and Yu, 2003).
9 L2 Boosting (with Fixed Small Steps)
- Regression set-up with standardized data:
- Xi, a p-dimensional predictor
- Yi, the response variable
- We consider the squared loss function L(β) = Σi (Yi − Xi'β)² / 2 and let j index the jth predictor, for j = 1, ..., p.
- L2 boosting with a fixed step size (Forward Stagewise Fitting, Efron et al., 2004): at each iteration, move the coefficient of the predictor that best fits the current residual by ±ε, where ε (small) is the step size.
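In code, the procedure amounts to repeated steepest coordinate descent on the squared loss. The following is a minimal sketch (not from the talk); it assumes a standardized n-by-p design matrix X and a centered response Y, and the function name and defaults are illustrative.

```python
import numpy as np

def l2_boost(X, Y, step=0.01, n_iter=1000):
    """L2 boosting with a fixed small step (forward stagewise fitting):
    at each iteration, find the predictor most correlated with the
    current residual and move its coefficient by +/- step.
    Illustrative sketch, not an official implementation."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = Y.astype(float).copy()
    path = []
    for _ in range(n_iter):
        corr = X.T @ residual             # (negative) gradient of the squared loss
        j = int(np.argmax(np.abs(corr)))  # best-fitting predictor
        s = np.sign(corr[j])              # direction that reduces the loss
        beta[j] += step * s
        residual -= step * s * X[:, j]
        path.append(beta.copy())          # the sequence of iterates is the path
    return np.array(path)
```

The number of iterations plays the role of the regularization parameter: stopping the loop early gives a more heavily regularized fit.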
10 Lasso (Tibshirani, 1996)
- Lasso (Least Absolute Shrinkage and Selection Operator):
  minimize over β:  Σi (Yi − Xi'β)² + λ Σj |βj|,
  where λ ≥ 0 is a regularization parameter.
- Lasso often gives sparse solutions due to the L1 penalty, so it is an alternative to model or subset selection (cf. Donoho et al., 2004).
- For each fixed λ, this is a quadratic programming (QP) problem, but a λ has to be chosen based on the data.
- Until recently, many QPs were run for different λ values, and cross-validation was often used to choose one λ.
- So Lasso computation is not cheap via QP and CV (illustrated below).
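To make the computational point concrete, here is a hedged illustration of the approach just described: solve one Lasso problem per λ on a grid and pick a λ by cross-validation. scikit-learn is used only for convenience (the talk does not refer to any particular software), and the data are simulated.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(100)

# One optimization problem per lambda value on a grid ...
lambdas = np.logspace(-3, 1, 50)
paths = np.array([Lasso(alpha=lam).fit(X, y).coef_ for lam in lambdas])

# ... and cross-validation to choose a single lambda.
cv_fit = LassoCV(alphas=lambdas, cv=5).fit(X, y)
print("lambda chosen by CV:", cv_fit.alpha_)
print("nonzero coefficients:", np.flatnonzero(cv_fit.coef_))
```

Every grid point is a separate optimization, and cross-validation multiplies the work again; this is the cost that BLasso is designed to avoid by tracing the whole path in one pass.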
11 Striking Similarity (Efron et al., 2004)
(Figure: coefficient paths of the Lasso and of L2 boosting with small steps.)
- The paths are not always the same.
12 However, L2 Boosting Is Often Too Greedy
(Figure: coefficient paths of L2 boosting vs. the Lasso.)
13 Lasso: Trade-off Between Empirical Loss and Penalty
- If we run coordinate gradient descent with a fixed step size ε on the Lasso loss, then at each iteration t there is a trade-off: a coordinate step can reduce the empirical loss, the penalty, or both.
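The displayed formulas on this slide do not survive in the text; the following reconstruction is consistent with the Lasso definition above but is not the slide's exact display. Write the Lasso loss as Γ(β; λ) = L(β) + λ‖β‖₁ with L the empirical (squared) loss. A coordinate step of size ε in direction s ∈ {−1, +1} on coordinate j changes it by

```latex
\Delta\Gamma \;=\;
\underbrace{L(\beta + \epsilon\, s\, e_j) - L(\beta)}_{\text{change in empirical loss}}
\;+\;
\lambda\,\underbrace{\bigl(\lvert\beta_j + \epsilon s\rvert - \lvert\beta_j\rvert\bigr)}_{\text{change in penalty}}.
```

When the step does not cross zero, the penalty term changes by +ελ if the step moves βj away from zero and by −ελ if it moves βj toward zero, so a step that helps the empirical loss can hurt the Lasso loss, and vice versa.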
14 Forward and Backward
- Two ways one can reduce the Lasso loss:
- Forward step: reduce the empirical loss (an L2 boosting step)
- Backward step: reduce the penalty
- Let's start with a forward step to get an initial estimate, and use it to set an initial λ.
15 Going Back and Forth: BLasso (Zhao and Yu, 2004)
- Find the backward direction that leads to the minimal empirical loss.
- Take that direction if it leads to a decrease in the Lasso loss.
- Otherwise, force a forward step.
- Relax λ whenever necessary.
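Put together, the back-and-forth scheme looks roughly like the sketch below for the squared loss. It follows the bullet points above and the general recipe of Zhao and Yu (2004) only loosely: the initial λ, the tolerance, and all names are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def blasso(X, Y, step=0.01, n_iter=1000, tol=1e-10):
    """Sketch of BLasso for the squared loss: try a backward step
    (shrink an active coefficient toward zero); accept it only if it
    decreases the Lasso loss; otherwise force a forward (boosting)
    step and relax lambda whenever necessary."""
    n, p = X.shape
    loss = lambda b: 0.5 * np.sum((Y - X @ b) ** 2)
    lasso_loss = lambda b, lam: loss(b) + lam * np.sum(np.abs(b))

    # Initial forward step and initial lambda (one illustrative choice).
    beta = np.zeros(p)
    grads = X.T @ Y
    j = int(np.argmax(np.abs(grads)))
    beta[j] = step * np.sign(grads[j])
    lam = (loss(np.zeros(p)) - loss(beta)) / step

    path, lambdas = [beta.copy()], [lam]
    for _ in range(n_iter):
        took_backward = False
        active = np.flatnonzero(np.abs(beta) > step / 2)
        if active.size:
            # Backward direction with the minimal empirical loss.
            cands = []
            for j in active:
                b = beta.copy()
                b[j] -= step * np.sign(b[j])
                cands.append(b)
            best_b = min(cands, key=loss)
            if lasso_loss(best_b, lam) < lasso_loss(beta, lam) - tol:
                beta = best_b                   # backward step accepted
                took_backward = True
        if not took_backward:
            # Forced forward step: steepest coordinate descent on the loss.
            grads = X.T @ (Y - X @ beta)
            j = int(np.argmax(np.abs(grads)))
            new_beta = beta.copy()
            new_beta[j] += step * np.sign(grads[j])
            # Relax lambda whenever the forward step requires it.
            lam = min(lam, (loss(beta) - loss(new_beta)) / step)
            beta = new_beta
        path.append(beta.copy())
        lambdas.append(lam)
    return np.array(path), np.array(lambdas)
```

One run returns the whole coefficient path together with the decreasing sequence of λ values, which is the point: a single pass traces an approximation to the entire Lasso regularization path instead of solving one problem per λ.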
16 BLasso Converges to the Lasso Path as ε Goes to Zero
17 Example: Classification
- 1-norm SVM (Zhu et al., 2003); a nondifferentiable loss function (Zhu et al., 2004)
18 Example: Generalized BLasso, Linear Regression with an Lγ Penalty
- Bridge parameter γ = 1
19 LARS, Boosting and BLasso
- LARS (Least Angle Regression, Efron et al., 2004)
  - Very efficient: gives the exact Lasso path for the least squares problem when the number of predictors is small.
  - Does not deal with other loss functions.
  - Not adequate for a large or infinite number of predictors, e.g. trees, wavelets and splines (nonparametric estimation).
  - Not adaptive, i.e. when the data change, the algorithm needs to be re-run.
- Boosting (with a fixed small step size)
  - Deals with different loss functions.
  - Deals with nonparametric estimation.
  - Often too greedy (always descends); not a good approximation to the Lasso.
- BLasso
  - Boosting's pros, plus convergence to the Lasso.
  - Adaptive learning in the time series context (our current research).
  - Connection between statistical estimation and the barrier method.
20 Barrier Method (One of the Interior Point Methods)
- Objective: minimize a convex loss f(x) under convex constraints g_i(x) ≤ 0, i = 1, ..., m.
- Method: using barrier functions φ (e.g. the log barrier φ(u) = −log(−u)), minimize
    f(x) + λ Σ_{i=1}^m φ(g_i(x))        (1)
  for each λ > 0. As λ goes to 0, this becomes equivalent to the constrained optimization problem above.
- Algorithm:
  - (Inner loop) Minimize (1), starting from the last iteration's solution.
  - (Outer loop) λ ← ξλ, where the constant ξ < 1.
  - Iterate until λ is close to 0.
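As a concrete toy version of the inner/outer structure (an illustration, not an example from the talk): minimize ‖x − c‖² subject to x_i ≤ 1, with a log barrier and a plain gradient-descent inner loop warm-started at the previous solution. All names and constants are illustrative.

```python
import numpy as np

def barrier_method(c, lam=1.0, xi=0.5, lam_min=1e-8,
                   inner_steps=500, lr=1e-3):
    """Toy log-barrier method: minimize ||x - c||^2 subject to x_i <= 1.
    Inner loop: gradient descent on  f(x) - lam * sum_i log(1 - x_i)
    (the analogue of equation (1) above), warm-started at the previous
    solution.  Outer loop: lam <- xi * lam with xi < 1, until lam ~ 0."""
    x = np.zeros_like(c, dtype=float)          # strictly feasible start
    while lam > lam_min:
        for _ in range(inner_steps):           # inner loop, lam fixed
            grad = 2 * (x - c) + lam / (1.0 - x)
            x_new = x - lr * grad
            if np.all(x_new < 1.0):            # keep iterates strictly feasible
                x = x_new
            else:
                lr *= 0.5                      # crude backtracking
        lam *= xi                              # outer loop shrinks lam
    return x

print(barrier_method(np.array([2.0, 0.5, 3.0])))   # roughly [1.0, 0.5, 1.0]
```

The warm start is what makes the scheme cheap: for a slightly smaller λ, the previous minimizer is already close to the new one, so the inner loop needs only a few corrective steps.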
21 Comparing BLasso with the Barrier Method
- Similarities:
  - Both have a tuning parameter, an inner loop, and an outer loop (inner for fixed λ, outer for updating λ). The inner and outer loops are tied together to save on computation.
- Differences:
  - Desirable tuning parameter value:
    - Barrier method: known (λ = 0).
    - BLasso: unknown; determined via, say, early stopping (CV or gMDL).
  - Inner loop computation cost:
    - Barrier method: expensive (Newton's method).
    - BLasso: inexpensive (approximate gradients using differences).
22 Statistical Computation Is Special
- Even in parametric inference, exact optimization is unnecessary:
  - A root-n consistent estimator can be made efficient (achieving the Fisher information lower bound) by one Newton step on the likelihood function, and root-n consistent estimators are obtainable with inexpensive methods such as the method of moments (a toy illustration follows below).
- In nonparametric inference, exact optimization is also not necessary, and many good regularization paths exist.
- In boosting or neural net estimation, early stopping is used, and even the objective function can be a surrogate (L2 loss in the classification case). In BLasso, 0 is the exact solution for tuning parameter λ = ∞, and λ decreases so slowly that the iterate stays in a neighborhood of the optimal solution, where the gradient method is very fast.
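A toy illustration of the one-step idea (the model is an assumption for the example; the talk does not specify one): in the Cauchy location model, the sample median is a root-n consistent but inefficient starting point, and a single Newton-style step on the log-likelihood, scaled by the Fisher information, yields an asymptotically efficient estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0
x = theta_true + rng.standard_cauchy(2000)

theta0 = np.median(x)                 # root-n consistent, inefficient start

# One Newton/scoring step on the Cauchy location log-likelihood
#   l(theta) = -sum log(1 + (x - theta)^2):
r = x - theta0
score = np.sum(2 * r / (1 + r ** 2))  # l'(theta0)
info = len(x) / 2.0                   # Fisher information: 1/2 per observation
theta1 = theta0 + score / info        # one-step (asymptotically efficient) estimator

print("median:", theta0, "one-step:", theta1)
```

No iterative optimization to convergence is needed; one cheap update already attains the efficiency bound asymptotically, which is the point of the bullet above.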
23 Summary
- BLasso has forward (boosting) and backward (new) steps.
- BLasso approximates the Lasso path in all situations (convergence is proved for continuously differentiable loss functions).
- Generalized BLasso deals with other loss and penalty functions.
24 Current Work
- Methodology
  - Computationally efficient early stopping rules (e.g. gMDL, CV)
  - Grouped Lasso
  - Online learning through BLasso
  - General algorithmic regularization paths (e.g. graphical models)
  - Boosting and semi-supervised learning for high-dimensional data
- Applications
  - Neuroscience
  - Sensor networks
  - Finance