1
Regression
Jieping Ye, Department of Computer Science and Engineering, Arizona State University
2
Outline
  • Least squares regression
    • Derivation from minimizing the sum of squares
    • Probabilistic interpretation
  • Ridge regression
  • Lasso

3
Classification
X → Y
  • X: anything
    • continuous (ℝ, ℝ^d, …)
    • discrete ({0,1}, {1,…,k}, …)
    • structured (tree, string, …)
  • Y: discrete
    • {0,1}: binary
    • {1,…,k}: multi-class
    • tree, string, etc.: structured

4
Classification
X: anything
  • continuous (ℝ, ℝ^d, …)
  • discrete ({0,1}, {1,…,k}, …)
  • structured (tree, string, …)

5
Classification
X: anything
  • continuous (ℝ, ℝ^d, …)
  • discrete ({0,1}, {1,…,k}, …)
  • structured (tree, string, …)

6
Regression
X → Y
  • X: anything
    • continuous (ℝ, ℝ^d, …)
    • discrete ({0,1}, {1,…,k}, …)
    • structured (tree, string, …)
  • Y: continuous
    • ℝ, ℝ^d

7
Examples
  • Temperature
  • Power consumption
  • Energy
  • Stock price
  • Location

8
Linear regression
(Figure: temperature measurements plotted against the input variables, with a fitted linear model)
Given examples $(x_i, y_i)$, $i = 1, \ldots, n$; given a new point $x$, predict $y$.
9
Linear regression
(Figure: fitted regression line through the temperature data; each residual is the vertical distance between a point and the line)
Ordinary Least Squares (OLS): the error, or residual, of each example is the difference between the observation $y_i$ and the prediction $\hat{y}_i$.
10
(No Transcript)
11
Minimize the sum squared error
Sum squared error: $E(w) = \sum_{i=1}^{n} (y_i - w^\top x_i)^2$.
Setting the gradient of $E(w)$ to zero gives a linear equation in $w$, i.e., the linear system $X^\top X w = X^\top y$ (a short derivation follows).
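A minimal derivation of that system, writing the data as a matrix $X \in \mathbb{R}^{n \times d}$ with rows $x_i^\top$ and a vector $y \in \mathbb{R}^n$ (notation consistent with the next slide; the slide's own equations were not captured in this transcript):

\begin{aligned}
E(w) &= \|y - Xw\|^2 = (y - Xw)^\top (y - Xw) \\
\nabla_w E(w) &= -2\, X^\top (y - Xw) = 0 \\
\Rightarrow \quad X^\top X w &= X^\top y \quad \text{(the normal equations)}
\end{aligned}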
12
Alternative derivation
Stack the $n$ examples with $d$ features each into $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$; minimizing $\|y - Xw\|^2$ again yields $X^\top X w = X^\top y$.
Solve the system (it's better not to invert the matrix).
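A minimal NumPy sketch of this advice (variable names are my own, not from the slides): np.linalg.solve solves the normal equations directly, and np.linalg.lstsq avoids forming $X^\top X$ altogether, which is better conditioned still.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                   # n examples, d features
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)     # noisy linear data

# Solve X^T X w = X^T y without computing an explicit inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least squares via SVD, avoiding X^T X entirely.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_solve, w_lstsq)                       # both close to w_true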
13
Probabilistic interpretation
(Figure: regression line with the Gaussian noise distribution drawn around the predictions)
Assume $y_i = w^\top x_i + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
Likelihood: $p(y \mid X, w) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid w^\top x_i, \sigma^2)$.
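Taking logs makes the link to least squares explicit (a standard step; the slide's formulas were not captured in this transcript):

$\log p(y \mid X, w) = -\dfrac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 - \dfrac{n}{2} \log(2\pi\sigma^2)$

Maximizing the likelihood over $w$ therefore minimizes the sum of squared errors, so the maximum likelihood estimate coincides with the OLS solution.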
14
Gauss-Markov Theorem
  • The least squares estimates of the parameters w have the smallest variance among all linear unbiased estimates; thus, they have the smallest mean squared error among all linear unbiased estimators.
  • However, there may well exist a biased estimator with smaller mean squared error (the simulation sketched below illustrates this).
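A small simulation of that last point (my own illustration, not from the slides): ridge regression, introduced later in the deck, serves as the biased estimator, and with enough noise it beats OLS in mean squared error on the parameters.

import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, lam = 30, 10, 2.0, 5.0
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))

trials = 2000
mse_ols = mse_ridge = 0.0
for _ in range(trials):
    y = X @ w_true + sigma * rng.normal(size=n)
    w_ols = np.linalg.lstsq(X, y, rcond=None)[0]                    # unbiased
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # biased
    mse_ols += np.sum((w_ols - w_true) ** 2) / trials
    mse_ridge += np.sum((w_ridge - w_true) ** 2) / trials

print(f"OLS   MSE: {mse_ols:.3f}")
print(f"ridge MSE: {mse_ridge:.3f}")   # typically smaller in this regime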

15
(No Transcript)
16
Issues with Least Squares Regression
  • Issues with least squares:
    • Prediction accuracy: low bias, but high variance.
    • Interpretation: the model is difficult to interpret when the number of regressors is large.
  • Consider alternatives such as variable selection and coefficient shrinkage:
    • Minimize the mean squared error by an appropriate trade-off between bias and variance.
    • Find the strongest regressors.

17
Subset Selection
  • Find the subset of size k ≤ p that gives the smallest residual sum of squares (RSS).
  • Choosing k involves a trade-off between bias and variance; k is chosen such that the expected prediction error is minimized.
  • Methods (a sketch of the first one follows this list):
    • Forward stepwise selection starts with the intercept and sequentially adds to the model the predictor that most improves the fit.
    • Backward stepwise selection starts with the full model and sequentially deletes predictors.
    • Hybrid stepwise selection combines both, selecting the best move at each step, i.e., an add or a drop.
    • Best subset selection chooses the best subset of each size k.
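A minimal sketch of forward stepwise selection (my own illustration under simple assumptions: greedy RSS reduction with plain NumPy, intercept handling omitted):

import numpy as np

def forward_stepwise(X, y, k):
    """Greedily add the predictor that most reduces the residual sum of squares."""
    n, p = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ w) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))
y = 3 * X[:, 0] - 2 * X[:, 4] + 0.1 * rng.normal(size=80)
print(forward_stepwise(X, y, 2))   # typically [0, 4]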

18
Ridge Regression
  • Ridge regression shrinks the regression coefficients by imposing a penalty on their size; the ridge coefficients minimize a penalized RSS:
    $\hat{w}^{\mathrm{ridge}} = \arg\min_w \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \sum_{j=1}^{d} w_j^2$.
  • A larger λ means greater shrinkage.
  • Alternatively, minimize the RSS subject to the constraint $\sum_j w_j^2 \le s$.
  • There is a one-to-one correspondence between s and λ.

19
Ridge Regression
  • Equivalently, in closed form (see the sketch after this list): $\hat{w}^{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$.
  • The ridge solutions are not equivariant under scaling of the inputs, so the inputs should be standardized.
  • Ridge is a continuous shrinkage process, as compared to subset selection, where variables are either retained or discarded discretely.
  • The formulation is nonsingular even if $X^\top X$ is not of full rank; this was the main motivation for introducing the approach in statistics.
  • For orthonormal inputs, the ridge estimates are just a scaled version of the least squares estimates, i.e., $\hat{w}^{\mathrm{ridge}} = \hat{w}^{\mathrm{LS}} / (1 + \lambda)$.
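A minimal NumPy sketch of that closed form (standardization and the intercept are omitted for brevity; names are mine, not the slides'):

import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution of (X^T X + lam I) w = X^T y, via solve."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.3 * rng.normal(size=50)
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, np.round(ridge(X, y, lam), 3))   # coefficients shrink toward 0 as lam grows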

20
Lasso: Least Absolute Shrinkage and Selection Operator
  • Similar to ridge regression, but minimize the RSS subject to the constraint $\sum_j |w_j| \le t$.
  • The constraint makes the solutions nonlinear in the $y_i$.
  • Quadratic programming is used to compute the parameters.
  • Making t sufficiently small causes some of the coefficients to become exactly zero; thus, the lasso performs a kind of continuous subset selection (illustrated in the sketch after this list).
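A quick illustration of that zeroing behavior, assuming scikit-learn is available (its Lasso solves the equivalent penalized form with weight alpha rather than the bound t, and uses coordinate descent rather than quadratic programming):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

# A larger penalty (larger alpha, i.e., smaller t) zeros out more coefficients.
for alpha in (0.01, 0.1, 1.0):
    print(alpha, np.round(Lasso(alpha=alpha).fit(X, y).coef_, 2))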

21
Lasso vs Ridge Regression
(The coefficient symbol in the figure refers to w in our discussion.)
22
Lasso vs Ridge Regression
23
Some Results
24
Computation of the Lasso Solutions
  • Quadratic programming.
  • Least Angle Regression (LARS): a simultaneous solution for all values of s (a sketch using this follows the list).
  • Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani, "Least Angle Regression," Annals of Statistics, 2004.
  • A modification of least angle regression can generate the entire path of lasso solutions.
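For instance, scikit-learn exposes this lasso modification of LARS (a sketch assuming scikit-learn; lars_path returns the entire piecewise-linear coefficient path at once):

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -3.0, 0.0, 1.0]) + 0.1 * rng.normal(size=100)

# method="lasso" applies the LARS modification that yields lasso solutions
# for every value of the regularization parameter in one pass.
alphas, active, coefs = lars_path(X, y, method="lasso")
print(coefs.shape)   # (n_features, number of breakpoints along the path)
print(active)        # order in which variables enter the model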

25
Next Class
  • Topic: Gaussian process regression
  • Reading: "Prediction with Gaussian Processes: From Linear Regression to Linear Prediction and Beyond"