Title: Boosted Lasso
1 Boosted Lasso
Bin Yu, Statistics Department, University of California, Berkeley
http://www.stat.berkeley.edu/users/binyu
Joint work with Peng Zhao
2 IT Is Providing a Golden Time for Statistics
- Massive data sets are collected in the IT age and need to be stored, transmitted, and analyzed
- Functional genomics, remote sensing, sensor networks, Internet tomography, neuroscience, finance, ...
- Statistics is the science indispensable for making sense of these massive data, and computationally feasible data reduction/feature selection is the key in most, if not all, IT statistical problems.
3 How Is Feature Selection Done?
- Feature selection based on subject knowledge
- Use of the most powerful computers
- Feature selection through model selection criteria (AIC, BIC, MDL, ...)
- Expensive combinatorial search
- Today's talk: Boosted Lasso (BLasso) for feature/subset selection
4 Gene Subset (Feature) Selection
- Microarray technologies provide thousands of (correlated) gene expressions to choose from, say, for prediction of the tumor status of a cell
- p > n
- Model selection criteria seek concise and interpretable models that can also predict tumor status well:
- MML (Minimum Message Length, for classification) (Wallace and Boulton, 1968)
- Cp (Mallows, 1973)
- AIC (Akaike, 1973, 1974)
- BIC (Schwarz, 1978)
- MDL (Minimum Description Length) (Rissanen, 1978)
5 Computational Hurdle for Model Selection
- Jornsten and Yu (2003) used MDL for simultaneous gene selection and sample classification with good results. They pre-selected about 100 or so genes from 6000.
- There were still 2^100 ≈ 1.3 × 10^30 possible subsets.
- Combinatorial search over all subsets is too expensive.
- A recent alternative: continuous embedding into a convex optimization problem through boosting, a third-generation computational method in statistics or machine learning.
6 Computation for Statistical Inference
- First-generation computation in statistics (before computers): use parametric models with closed-form solutions for maximum likelihood estimators or Bayes estimators.
- Second-generation computation (with computers): design statistically optimal procedures and worry about computation later; call optimization routines.
- Third-generation computation: form statistical goals with computation in mind and take advantage of special features of statistical computation.
7 Boosting: Computation and Estimation in One
- Boosting originated from Probably Approximately Correct (PAC) learning (Schapire, 1991), but was made practical by Freund and Schapire (1996) with AdaBoost.
- It has enjoyed impressive empirical successes.
- Boosting idea: choose a base procedure and apply it sequentially to modified data, then linearly combine the resulting estimators.
- Cross-validation or a test set is commonly used to stop the iterations early.
8 Boosting: A Gradient View
- An important step forward in understanding boosting was the gradient view of boosting by Breiman, Mason et al., and Friedman et al., around 2000.
- L2 boosting, i.e. boosting with squared loss in the regression case, is repeated refitting of residuals (twicing; Tukey, 1972).
- The number of iteration steps is the regularization parameter.
- Minimax optimality of L2 boosting in the 1-dimensional case (Bühlmann and Yu, 2003).
9 L2 Boosting (with Fixed Small Steps)
- Regression set-up with standardized data:
- Xi, a p-dimensional predictor
- Yi, the response variable
- We consider the squared loss function L(β) = Σi (Yi − Xi'β)² / 2 and let j index the jth predictor, for j = 1, ..., p.
- L2 boosting with a fixed step size (Forward Stagewise Fitting, Efron et al., 2004): at each iteration, move the coefficient of the predictor that best fits the current residual by ±ε, where ε (small) is the step size.
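In code, the procedure amounts to repeated steepest coordinate descent on the squared loss. The following is a minimal sketch (not from the talk); it assumes a standardized n-by-p design matrix X and a centered response Y, and the function name and defaults are illustrative.

```python
import numpy as np

def l2_boost(X, Y, step=0.01, n_iter=1000):
    """L2 boosting with a fixed small step (forward stagewise fitting):
    at each iteration, find the predictor most correlated with the
    current residual and move its coefficient by +/- step.
    Illustrative sketch, not an official implementation."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = Y.astype(float).copy()
    path = []
    for _ in range(n_iter):
        corr = X.T @ residual             # (negative) gradient of the squared loss
        j = int(np.argmax(np.abs(corr)))  # best-fitting predictor
        s = np.sign(corr[j])              # direction that reduces the loss
        beta[j] += step * s
        residual -= step * s * X[:, j]
        path.append(beta.copy())          # the sequence of iterates is the path
    return np.array(path)
```

The number of iterations plays the role of the regularization parameter: stopping the loop early gives a more heavily regularized fit.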
10 Lasso (Tibshirani, 1996)
- Lasso (Least Absolute Shrinkage and Selection Operator):
  minimize over β:  Σi (Yi − Xi'β)² + λ Σj |βj|,
  where λ ≥ 0 is a regularization parameter.
- Lasso often gives sparse solutions due to the L1 penalty, so it is an alternative to model or subset selection (cf. Donoho et al., 2004).
- For each fixed λ, this is a quadratic programming (QP) problem, but a λ has to be chosen based on the data.
- Until recently, many QPs were run for different λ values, and cross-validation was often used to choose one λ.
- So Lasso computation is not cheap via QP and CV (illustrated below).
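To make the computational point concrete, here is a hedged illustration of the approach just described: solve one Lasso problem per λ on a grid and pick a λ by cross-validation. scikit-learn is used only for convenience (the talk does not refer to any particular software), and the data are simulated.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(100)

# One optimization problem per lambda value on a grid ...
lambdas = np.logspace(-3, 1, 50)
paths = np.array([Lasso(alpha=lam).fit(X, y).coef_ for lam in lambdas])

# ... and cross-validation to choose a single lambda.
cv_fit = LassoCV(alphas=lambdas, cv=5).fit(X, y)
print("lambda chosen by CV:", cv_fit.alpha_)
print("nonzero coefficients:", np.flatnonzero(cv_fit.coef_))
```

Every grid point is a separate optimization, and cross-validation multiplies the work again; this is the cost that BLasso is designed to avoid by tracing the whole path in one pass.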
11 Striking Similarity (Efron et al., 2004)
(Figure: coefficient paths of the Lasso and of L2 boosting with small steps.)
- The paths are not always the same.
12 However, L2 Boosting Is Often Too Greedy
(Figure: coefficient paths of L2 boosting vs. the Lasso.)
13 Lasso: Trade-off Between Empirical Loss and Penalty
- If we run coordinate gradient descent with a fixed step size ε on the Lasso loss, then at each iteration t there is a trade-off: a coordinate step can reduce the empirical loss, the penalty, or both.
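The displayed formulas on this slide do not survive in the text; the following reconstruction is consistent with the Lasso definition above but is not the slide's exact display. Write the Lasso loss as Γ(β; λ) = L(β) + λ‖β‖₁ with L the empirical (squared) loss. A coordinate step of size ε in direction s ∈ {−1, +1} on coordinate j changes it by

```latex
\Delta\Gamma \;=\;
\underbrace{L(\beta + \epsilon\, s\, e_j) - L(\beta)}_{\text{change in empirical loss}}
\;+\;
\lambda\,\underbrace{\bigl(\lvert\beta_j + \epsilon s\rvert - \lvert\beta_j\rvert\bigr)}_{\text{change in penalty}}.
```

When the step does not cross zero, the penalty term changes by +ελ if the step moves βj away from zero and by −ελ if it moves βj toward zero, so a step that helps the empirical loss can hurt the Lasso loss, and vice versa.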
14 Forward and Backward
- Two ways one can reduce the Lasso loss:
- Forward step: reduce the empirical loss (an L2 boosting step)
- Backward step: reduce the penalty
- Let's start with a forward step to get an initial estimate, and use it to set an initial λ.
15 Going Back and Forth: BLasso (Zhao and Yu, 2004)
- Find the backward direction that leads to the minimal empirical loss.
- Take that direction if it leads to a decrease in the Lasso loss.
- Otherwise, force a forward step.
- Relax λ whenever necessary.
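Put together, the back-and-forth scheme looks roughly like the sketch below for the squared loss. It follows the bullet points above and the general recipe of Zhao and Yu (2004) only loosely: the initial λ, the tolerance, and all names are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def blasso(X, Y, step=0.01, n_iter=1000, tol=1e-10):
    """Sketch of BLasso for the squared loss: try a backward step
    (shrink an active coefficient toward zero); accept it only if it
    decreases the Lasso loss; otherwise force a forward (boosting)
    step and relax lambda whenever necessary."""
    n, p = X.shape
    loss = lambda b: 0.5 * np.sum((Y - X @ b) ** 2)
    lasso_loss = lambda b, lam: loss(b) + lam * np.sum(np.abs(b))

    # Initial forward step and initial lambda (one illustrative choice).
    beta = np.zeros(p)
    grads = X.T @ Y
    j = int(np.argmax(np.abs(grads)))
    beta[j] = step * np.sign(grads[j])
    lam = (loss(np.zeros(p)) - loss(beta)) / step

    path, lambdas = [beta.copy()], [lam]
    for _ in range(n_iter):
        took_backward = False
        active = np.flatnonzero(np.abs(beta) > step / 2)
        if active.size:
            # Backward direction with the minimal empirical loss.
            cands = []
            for j in active:
                b = beta.copy()
                b[j] -= step * np.sign(b[j])
                cands.append(b)
            best_b = min(cands, key=loss)
            if lasso_loss(best_b, lam) < lasso_loss(beta, lam) - tol:
                beta = best_b                   # backward step accepted
                took_backward = True
        if not took_backward:
            # Forced forward step: steepest coordinate descent on the loss.
            grads = X.T @ (Y - X @ beta)
            j = int(np.argmax(np.abs(grads)))
            new_beta = beta.copy()
            new_beta[j] += step * np.sign(grads[j])
            # Relax lambda whenever the forward step requires it.
            lam = min(lam, (loss(beta) - loss(new_beta)) / step)
            beta = new_beta
        path.append(beta.copy())
        lambdas.append(lam)
    return np.array(path), np.array(lambdas)
```

One run returns the whole coefficient path together with the decreasing sequence of λ values, which is the point: a single pass traces an approximation to the entire Lasso regularization path instead of solving one problem per λ.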
16 BLasso Converges to the Lasso Path as ε Goes to Zero
17 Example: Classification
- 1-norm SVM (Zhu et al., 2003); a nondifferentiable loss function (Zhu et al., 2004)
18 Example: Generalized BLasso, Linear Regression with an Lγ Penalty
- Bridge parameter γ = 1
19 LARS, Boosting and BLasso
- LARS (Least Angle Regression, Efron et al., 2004)
  - Very efficient: gives the exact Lasso path for the least squares problem when the number of predictors is small.
  - Does not deal with other loss functions.
  - Not adequate for a large or infinite number of predictors, e.g. trees, wavelets and splines (nonparametric estimation).
  - Not adaptive, i.e. when the data change, the algorithm needs to be re-run.
- Boosting (with a fixed small step size)
  - Deals with different loss functions.
  - Deals with nonparametric estimation.
  - Often too greedy (always descends); not a good approximation to the Lasso.
- BLasso
  - Boosting's pros, plus convergence to the Lasso.
  - Adaptive learning in the time series context (our current research).
  - Connection between statistical estimation and the barrier method.
20 Barrier Method (One of the Interior Point Methods)
- Objective: minimize a convex loss f(x) under convex constraints g_i(x) ≤ 0, i = 1, ..., m.
- Method: using barrier functions φ (e.g. the log barrier φ(u) = −log(−u)), minimize
    f(x) + λ Σ_{i=1}^m φ(g_i(x))        (1)
  for each λ > 0. As λ goes to 0, this becomes equivalent to the constrained optimization problem above.
- Algorithm:
  - (Inner loop) Minimize (1), starting from the last iteration's solution.
  - (Outer loop) λ ← ξλ, where the constant ξ < 1.
  - Iterate until λ is close to 0.
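As a concrete toy version of the inner/outer structure (an illustration, not an example from the talk): minimize ‖x − c‖² subject to x_i ≤ 1, with a log barrier and a plain gradient-descent inner loop warm-started at the previous solution. All names and constants are illustrative.

```python
import numpy as np

def barrier_method(c, lam=1.0, xi=0.5, lam_min=1e-8,
                   inner_steps=500, lr=1e-3):
    """Toy log-barrier method: minimize ||x - c||^2 subject to x_i <= 1.
    Inner loop: gradient descent on  f(x) - lam * sum_i log(1 - x_i)
    (the analogue of equation (1) above), warm-started at the previous
    solution.  Outer loop: lam <- xi * lam with xi < 1, until lam ~ 0."""
    x = np.zeros_like(c, dtype=float)          # strictly feasible start
    while lam > lam_min:
        for _ in range(inner_steps):           # inner loop, lam fixed
            grad = 2 * (x - c) + lam / (1.0 - x)
            x_new = x - lr * grad
            if np.all(x_new < 1.0):            # keep iterates strictly feasible
                x = x_new
            else:
                lr *= 0.5                      # crude backtracking
        lam *= xi                              # outer loop shrinks lam
    return x

print(barrier_method(np.array([2.0, 0.5, 3.0])))   # roughly [1.0, 0.5, 1.0]
```

The warm start is what makes the scheme cheap: for a slightly smaller λ, the previous minimizer is already close to the new one, so the inner loop needs only a few corrective steps.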
21 Comparing BLasso with the Barrier Method
- Similarities:
  - Both have a tuning parameter, an inner loop, and an outer loop (inner for fixed λ, outer for updating λ). The inner and outer loops are tied together to save on computation.
- Differences:
  - Desirable tuning parameter value:
    - Barrier method: known (λ = 0).
    - BLasso: unknown; determined via, say, early stopping (CV or gMDL).
  - Inner loop computation cost:
    - Barrier method: expensive (Newton's method).
    - BLasso: inexpensive (approximate gradients using differences).
22 Statistical Computation Is Special
- Even in parametric inference, exact optimization is unnecessary:
  - A root-n consistent estimator can be made efficient (achieving the Fisher information lower bound) by one Newton step on the likelihood function, and root-n consistent estimators are obtainable with inexpensive methods such as the method of moments (a toy illustration follows below).
- In nonparametric inference, exact optimization is also not necessary, and many good regularization paths exist.
- In boosting or neural net estimation, early stopping is used, and even the objective function can be a surrogate (L2 loss in the classification case). In BLasso, 0 is the exact solution for tuning parameter λ = ∞, and λ decreases so slowly that the iterate stays in a neighborhood of the optimal solution, where the gradient method is very fast.
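A toy illustration of the one-step idea (the model is an assumption for the example; the talk does not specify one): in the Cauchy location model, the sample median is a root-n consistent but inefficient starting point, and a single Newton-style step on the log-likelihood, scaled by the Fisher information, yields an asymptotically efficient estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0
x = theta_true + rng.standard_cauchy(2000)

theta0 = np.median(x)                 # root-n consistent, inefficient start

# One Newton/scoring step on the Cauchy location log-likelihood
#   l(theta) = -sum log(1 + (x - theta)^2):
r = x - theta0
score = np.sum(2 * r / (1 + r ** 2))  # l'(theta0)
info = len(x) / 2.0                   # Fisher information: 1/2 per observation
theta1 = theta0 + score / info        # one-step (asymptotically efficient) estimator

print("median:", theta0, "one-step:", theta1)
```

No iterative optimization to convergence is needed; one cheap update already attains the efficiency bound asymptotically, which is the point of the bullet above.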
23 Summary
- BLasso has forward (boosting) and backward (new) steps.
- BLasso approximates the Lasso path in all situations (convergence is proved for continuously differentiable loss functions).
- Generalized BLasso deals with other loss and penalty functions.
24 Current Work
- Methodology
  - Computationally efficient early stopping rules (e.g. gMDL, CV)
  - Grouped Lasso
  - Online learning through BLasso
  - General algorithmic regularization paths (e.g. graphical models)
  - Boosting and semi-supervised learning for high-dimensional data
- Applications
  - Neuroscience
  - Sensor networks
  - Finance